Building a Healthy On-Call Culture

Paging Doctor Software

In the past, on-call duty was often associated with doctors, but in recent years, it’s become common for software engineers to be asked to be available for support work on short notice. As software has grown to power so much of the world, the need for high availability and rapid incident response has likewise grown.

Another driver of the popularity of on-call work is the ubiquity of smartphones. The days of bulky, unreliable pagers are gone. Now when automated software monitoring detects a system anomaly, an on-call engineer gets an alert in the form of a phone call, a text, or a loud noise from a mobile app.

On-Call Work at SoundCloud

First and foremost, on-call duty for SoundCloud engineers is optional. We believe this is important to our engineering culture for reasons I’ll discuss below. Secondly, on-call duty outside of normal office hours is compensated at an hourly rate, with additional hourly payments when responding to pages.

On-call engineers are organized into rotations. Each rotation consists of a group of engineers representing a team or, more typically, several teams. At any given time, one engineer is on call for the rotation. That engineer is expected to provide first-level support for the systems belonging to all the teams in the rotation.

Additionally, every engineer in the rotation is always on call to provide second-level support for the systems belonging to that engineer’s team. Second-level support is on a best-efforts basis, meaning that while engineers can receive a second-level page at any time, they’re not required to answer if they’re unavailable to help for any reason.
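The two support levels described above can be sketched as a small data model. This is only an illustration under stated assumptions, not SoundCloud's actual tooling: the `Rotation` class, its field names, and the example team, system, and engineer names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rotation:
    """Hypothetical model of one on-call rotation (all names are illustrative)."""

    teams: dict[str, list[str]]   # team name -> engineers from that team in the rotation
    system_owner: dict[str, str]  # system name -> team that owns it
    on_call: str                  # the one engineer currently on first-level duty

    def first_level(self, system: str) -> str:
        """First-level pages for any system in the rotation go to the on-call engineer."""
        if system not in self.system_owner:
            raise KeyError(f"{system} is not covered by this rotation")
        return self.on_call

    def second_level(self, system: str) -> list[str]:
        """Second-level pages go to the owning team's engineers, on a best-effort basis."""
        return list(self.teams[self.system_owner[system]])


rotation = Rotation(
    teams={"search": ["ana", "ben"], "playback": ["cy"]},
    system_owner={"search-api": "search", "player": "playback"},
    on_call="cy",
)
print(rotation.first_level("search-api"))   # cy
print(rotation.second_level("search-api"))  # ['ana', 'ben']
```

In this sketch the on-call engineer from the playback team fields the first page for a search system, and the search team's engineers are the ones who may be paged for second-level help.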

Why On-Call Work Is Good for Engineers

Having a wide range of engineers, and not just DevOps and Site Reliability Engineers, on call has a number of benefits both for the company and for the engineers themselves.

Perhaps most obviously, it lightens the burden on the operational engineers, who often have substantial out-of-hours support commitments as part of their core job descriptions.

It also equips and motivates engineers to build reliable, well-documented systems. Seeing firsthand how things go wrong in production yields insights into how systems can be improved and made more robust.

And finally, supporting both their own and others’ systems is a great learning opportunity for engineers. It provides valuable hands-on experience with infrastructure such as databases, as well as experience diagnosing faults and making operational decisions.

Procedural Best Practices

Every engineering organization is different, but through trial and error we’ve found the following practices work well for SoundCloud.

Different rotations have different shift cadences, but most shifts last only one or two days.

The optimal frequency for being on call is about three days a month. More than that and people risk burning out over time. Less than that and people get rusty and aren’t as effective at dealing with incidents. Since one engineer is on call at a time, covering a month of roughly thirty days at that frequency takes about ten people, so the optimal size for a rotation is between eight and twelve engineers, with ten being just about perfect. In fact, I was once part of a rotation that had a waitlist to join because we collectively agreed not to grow bigger than twelve people.
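The sizing arithmetic behind those numbers can be sketched in a few lines. This is a rough illustration, assuming a 30-day month and one engineer on call per day with shifts shared evenly; the function names are made up for the example.

```python
import math


def engineers_needed(days_per_engineer: float, days_in_month: int = 30) -> int:
    """Rotation size at which each engineer covers about `days_per_engineer` days a month."""
    return math.ceil(days_in_month / days_per_engineer)


def days_on_call_per_month(engineers: int, days_in_month: int = 30) -> float:
    """Average on-call days per engineer when shifts are shared evenly."""
    return days_in_month / engineers


print(engineers_needed(3))  # 10
for size in (8, 10, 12):
    print(size, days_on_call_per_month(size))
# 8 3.75
# 10 3.0
# 12 2.5
```

Under these assumptions, a rotation of eight to twelve engineers keeps everyone between 2.5 and 3.75 days a month, which brackets the three-day target.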

Most rotations have a formal or informal rotation administrator drawn from the engineers in the rotation. The administrator maintains the shift schedule, deals with personnel changes, and performs other ad hoc tasks that ensure the rotation’s health. For example, in many rotations, the administrator organizes a meeting to set the shift schedule for the holiday period. Deciding together how to cover times of low engineer availability has proved the fairest and least stressful way of handling these situations.

A Word about Rotations and Teams

Most SoundCloud on-call rotations started as groups of engineers from related teams in the same area of the engineering organization. But, as is the case with most engineering organizations, SoundCloud’s has evolved over time. Teams have merged or split, new teams have been created, teams have been moved into different divisions, and so on. However, the on-call rotations have generally not evolved at the same pace as the engineering organization. This means the rotation structure has come to bear less and less resemblance to the wider organization structure, to the point where many rotations now represent what seem like random groups of unrelated teams.

On the whole, this has not proved to be a problem. Furthermore, attempts to reorganize the rotations to match the current company structure have usually stalled due to objections from the on-call engineers. These engineers have, in some cases, supported the systems in their rotations for years and acquired deep knowledge of those systems. Change for the sake of matching the current organizational chart represents a significant upheaval for a comparatively modest benefit. (And when the organizational chart changes again, the upheaval repeats itself.)

Cultural Best Practices

We try hard to foster the following norms and attitudes for the benefit of the on-call engineers, and by extension, the company as a whole.

The people who are on call want to be on call. Engineers who’ve freely taken on the obligation (and are being compensated for it) are more motivated when responding to incidents. Having experienced workplaces where on-call work was mandatory or close to mandatory, I can personally attest to the positive atmosphere engendered by a voluntary on-call policy.

Practical matters like shift cadences are decided collectively by the engineers in the rotation. The rotation administrator leads the decision-making process but does not act unilaterally. This is why not all of SoundCloud’s rotations have the same shift patterns, shift handover times, procedures for trading or giving away shifts, and so on. Each rotation is free to do what works best for its members.

On-call engineers often spend some of their normal working day looking into minor operational issues, so that these issues don’t worsen and end up paging someone out of hours.

When responding to incidents, engineers can always ask for help by paging other engineers. This is something we stress multiple times during on-call onboarding. It’s a cornerstone of a healthy, high-trust on-call culture. Nobody enjoys getting a second-level support page in the middle of the night, but responding (if possible) is an important act of empathy for the engineer who needs help. It’s also an investment in training them to handle the situation autonomously in the future.

In addition to feeling free to ask for help, engineers also feel free to hand work over to others after a reasonable length of time. For major or long-running incidents, we encourage engineers to hand over after four hours — or sooner if they become too tired to work effectively.

The most important cultural practice — which influences all the other cultural practices I’ve discussed — is fostering a learning culture rather than a blame culture. This truly can’t be emphasized enough. Mistakes are an inevitable part of incident response. Learning from mistakes builds a stronger, more technically proficient engineering organization. Punishing people for mistakes makes engineers afraid to act in new situations, afraid to ask for help when they need it, and afraid to be transparent. Ultimately, engineers will choose to leave the on-call rotations or even the company if there’s a culture of blame.

When a Major Incident Happens

Responding to a total site outage or other severe incident is stressful for all involved. It’s also a stress test for a company’s on-call culture. It’s more important than ever for engineers to work well together and trust each other. They’ll resolve the incident faster if they feel comfortable admitting what they don’t know, asking for help from others, being honest if they make mistakes, and speaking up if they’re too tired to carry on.

The time to foster these behaviors is before a major incident occurs. Engineers learn them from experience, by responding to smaller incidents and interacting with their colleagues. In this sense, smaller incidents are essentially practice for larger incidents.

Conclusions

Underpinning all the practices and behaviors I’ve described in this article is one thing: respect. Companies that respect their engineers institute policies that enable those engineers to do their best work. Engineers who respect each other help each other succeed, and in turn, help the company succeed. This applies to on-call rotations as well as the wider engineering culture. When mutual respect exists, everyone wins.