I recently celebrated my one-year anniversary as a Site Reliability Engineering (SRE) trainee at SoundCloud. Looking back, I had very little idea of what I was getting into. I had studied politics and done various teaching and translating jobs over the years before deciding to learn programming. I had a coding bootcamp under my belt, but no SRE experience. That said, I was drawn to the idea of troubleshooting complex systems and automating away repetitive tasks. I was also incredibly excited to work at SoundCloud, the world’s largest music and audio platform that lets people discover the greatest selection of music from the most diverse creator community on earth.
SRE is an approach that uses software engineering concepts to solve operations problems. Its principles work to align the goals of development and operations to create reliable and scalable systems.
My team, Production Engineering, is responsible for ensuring the availability of SoundCloud. We make sure that systems are resilient and running smoothly, whether by taking care of it ourselves or enabling other teams to own what they build.
In this post, I want to share my experiences as a new SRE. I’ll reflect on the lessons I’ve learned and challenges I’ve faced over the past year. I hope to give you an insight into the role of an SRE, and perhaps even inspire others to consider a career in this field.
The first thing I noticed was the scope of the work that we do. Everything from Infrastructure as Code to Monitoring and Incident Response training potentially falls under the SRE umbrella. This has been overwhelming at times. However, I remind myself that I’m not going to learn everything in one day, and I take it one step at a time.
On the other hand, this huge scope is also why I love it. There’s definitely no shortage of things to learn. Some things I’ve worked on so far include upgrading Kubernetes and decommissioning part of our infrastructure. I’m currently in the process of migrating a service from our data center to Google Cloud. This is part of the exploratory phase of a bigger migration, and what I learn from it will benefit other teams when the time comes for them to do the same.
I’ve also taken on the role of First Responder. Supporting other teams is a big part of our job, and the First Responder is the dedicated person for this. It means it’s clear who to ask for help, and the rest of our team can work without interruptions. Part of the role is also looking into the alerts our team gets. This can be daunting because it involves getting to know the nitty-gritty details of each specific system. However, it’s no coincidence that this is also when I learn the most.
On the cultural side of things, I lead the SRE Collective. This is a space where we get together to talk about SRE topics and share knowledge. The format is flexible, and leading means anything from facilitating a discussion to inviting someone to give a presentation.
If this year has been about getting a high-level overview, I’m looking forward to diving deeper into some topics in the year to come.
Another thing that stood out to me is the way we view failure at SoundCloud. Failure is inherent in complex systems; what matters is how we deal with it. At SoundCloud, we practice blame-aware postmortems. This means we carry out incident reviews with a focus on learning rather than pointing fingers. When failure is viewed as something normal, not as something to be swept under the rug, we can examine what happened, learn from it, and make our systems stronger.
This outlook has also challenged how I think about cause and effect. Often in life we look for the root cause of something. However, there is rarely (if ever) one root cause, but rather a series of contributing factors. One of the most interesting yet challenging aspects of my traineeship has been understanding how these factors interact to result in the functioning or failure of a system.
Perhaps the most intimidating technology I’ve had to learn this year is Kubernetes. I’m not sure if it was its reputation for being complex, or the fact that it comes with a whole new vocabulary (pods, ReplicaSets, etc.), but I put off learning it for a while before finally diving in. And it’s true: it is complex, and it took some time to get my head around it. However, the process of learning it taught me something that can be applied to any technology.
Basically, it’s essential to understand why that technology exists in the first place. What problem does it solve? In the case of Kubernetes, it solves the problem of having a running application tied to physical infrastructure. This is a problem because if a server crashes, the application goes down with it. While there’s more that can be said about Kubernetes, understanding this made it click for me, and from that point on it became a lot less confusing. I still have a lot more to learn, but knowing its basic purpose helps new concepts fall into place.
There have been so many new concepts to grasp and technologies to learn, and asking questions has been key to improving my understanding. It’s true that reading and googling have their part, but asking questions and chatting with my teammates is when it really comes together.
I’m also lucky enough to be working with some incredibly smart and talented people who are more than happy to share their knowledge with me. It’s an opportunity that’s not to be missed!
As a team, we decided to set aside dedicated time for questions, and we now have a ‘Coffee and Q&A’ session twice a week. It started with me as a trainee in mind, but quickly turned into an informal knowledge-sharing space for the whole team. It’s a time when anyone can bring questions that have come up during the day, or ask things that are outside the scope of their daily tasks. It’s also a great team-bonding moment.
When I look back to when I started, it’s hard for me to believe how far I’ve come. I’ve improved my technical skills, worked on some very exciting projects, and learned a lot about SRE culture along the way. However, all this was only possible because of the support I got from my colleagues. Whether in pair programming sessions or informal chats, the time they have given me and the kindness and enthusiasm they have shown me have been astounding.
I’m also happy to share that I’ve accepted a permanent position on the team, and I’m excited to continue my journey as an SRE at SoundCloud.
Look out for my next post, where I’ll share my experiences of migrating to the cloud.