Lessons in Ops and SRE

How traveling the Americas shaped my views of Ops, SRE & the World

A few weeks ago I had my first (semi-)public speaking experience. I submitted a talk with the title “Lessons in Ops & SRE from Overlanding the Americas” to TX|Conference with the following pitch:

Little did I know how my experiences of Overlanding the Americas for three years were going to help me establish SRE as a discipline at Ricardo. In this talk, I’ll show you five simple steps that will help you on your journey - be it at work introducing Site Reliability Engineering or when traveling the world.

Of course, the conference was held mostly virtually. It took place on Hopin and, for a few select speakers, at the TX Group HQ. My talk was accepted, unfortunately not on the main track but on the session track. The sessions are like a moderated Google Meet or Zoom call. Each speaker had about 10-15 minutes for their talk, followed by about 5 minutes of Q&A.

The following is a loose transcript of the talk.

Intro

When I was six years old, my parents bought a Land Rover Defender. Ever since then, I have been fascinated by these cars. When I grew older I learned that people travel around the world in their cars. In the age of Instagram, this type of travel is called Overlanding. This talk is about Overlanding the Americas and how Overlanding is not that different from Operations.

Purpose

Once I fulfilled my childhood dream of owning a Land Rover I quickly realized that the car isn’t very practical, especially in Switzerland. Every (legal) road is paved and every corner has been visited. It was clear I wanted to become an “Overlander” and travel the world for an unspecified time.

When I started at Ricardo in February 2020 I quickly realized that the SRE team was lacking purpose. Asking my peers and teammates about the team’s mission I got many different answers. My first task was to understand what we as SRE should be doing and aiming for at Ricardo.

Impediments

Now we know why we are here, so what’s stopping us from achieving our mission?

When Overlanding you can count on the fact that there will be obstacles waiting for you. Be it a fallen tree, a road closure due to a landslide, or a technical defect, you will have to deal with difficulties. In Operations, the biggest struggle is usually all the unplanned work. If things aren’t going well, firefighting is the norm and a lot of effort from individuals is required to keep the lights on. This is not sustainable; people will burn out, innovation will come to a halt, and the situation will only get worse. Your first step must be to identify the bottlenecks and stop the bleeding.

Improvements made anywhere besides the bottleneck are an illusion.

What has helped me in the past is working in very short cycles. Set clear goals and targets for one week, maybe two. Work with the Plan-Do-Check-Act method.

Adapt

A wise person once said: “Life is what happens to us while we are making other plans.” Whatever your plan is, reality will be different. Adapting your plans and goals is inevitable.

When we were in Ecuador we camped at a nice laguna which was a long and bumpy ride from the nearest main road. Google Maps showed a minor road going through the mountains, cutting short the long detour that we would have had to take if we went back the way we came. You would think that after more than a year on the road, we’d know better than to trust Google Maps. After a few kilometers, this minor road turned into a mountain bike track. Because it was all downhill and I wasn’t sure if turning around would be feasible, we decided to push through.

We made it down without wrecking the car. Nevertheless, the Landy sustained some damage and we had to abandon our initial plans of continuing south. After a quick field repair (always carry a hammer!), we turned around and headed back to Quito where we could get spare parts. We could of course have continued, but as with tech debt in software engineering, you never quite know when the hacks will blow up, and once that happens, the damage is much harder to repair.

Just like driving through unknown territory, running an internet service requires you to pay attention to detail. Don’t make far-reaching plans, don’t just do what Google tells you, and keep your house in order.

Expect the best, prepare for the worst.

Destination

680 days after starting in Halifax and zigzagging more than 75'000 km through North and South America, we arrived at the “fin del mundo” in Ushuaia, the southernmost city of the Americas. But our journey was far from over, and neither is yours after the first achievements.

In the beginning, it seemed impossible to ever arrive in Ushuaia in one piece. There are so many borders to cross, so many foreign countries, so many people telling you how dangerous it is, and so many things to see! In the beginning, we didn’t even speak Spanish.

You might feel the same at work, but if you take what comes day by day and don’t lose sight of your goal, you will make it. Once you get the fires under control you can start doing more forward-looking work. Ask yourself which areas will profit the most from improvements.

Community

A journey like that is impossible alone.

One of the most amazing experiences of our trip was realizing that the overwhelming majority of people on this planet will help you when you need it with no questions asked and no expectation of repayment.

Another truth we learned is that societies fall apart once the feeling of belonging goes away. When people get disenfranchised everybody falls back into survival mode and it’s difficult to rebuild what was lost.

One example of this is Medellin. During the narco wars in the 90s, Medellin was one of the most dangerous places in the world. Once the fighting was over, the communities at the fringes of the city, where most of the violence happened, were destroyed. It was clear that it would take a tremendous effort to integrate the comunas back into society and make sure something like this doesn’t happen again.

Thus, the government together with the people started an unprecedented program to rebuild the communities. The comunas are all built on the fringes of the city on steep hills, with no real roads or public services.

  • To make it cheaper and faster for people to reach the city where they worked, outdoor escalators and gondolas were built
  • They built big libraries that were open for all
  • They built parks and sports grounds
  • And they built infrastructure like power and sewage

Building places for people to gather and socialize was vital to the success of the program. Nowadays Medellin is much safer, and in the four weeks we stayed there we did not have a single bad encounter.

Of course, we are very privileged; our problems are nowhere near what these people had to endure, but we can learn from what they did.

If you are at the beginning of your journey from a traditional dev and ops approach, start building your network of enthusiasts. If your company is new to this, start seeking out people who share the same passion:

  • Building a reliable platform
  • A secure platform
  • And a platform that is fun to work with

And again, step by step you can improve the world around you.

Summary

  1. Figure out what the purpose of your team is
  2. Remove the biggest impediments to get out of the firefighting cycle
  3. Adapt your plans, keep your iterations short
  4. Once you’re able to take care of more forward-looking projects, don’t forget about your tech debt
  5. Build a community of people with the same interests and goals

Recording

The recording will be posted here once it’s available.

Slides