Alerting on SLOs like Pros

If there is anything like a silver bullet for creating meaningful and actionable alerts with a high signal-to-noise ratio, it is alerting based on service-level objectives (SLOs). Fulfilling a well-defined SLO is the very definition of meeting your users’ expectations. Conversely, a certain level of service errors is OK as long as you stay within the SLO — in other words, if the SLO grants you an error budget. Burning through this error budget too quickly is the ultimate signal that some rectifying action is needed. The faster the budget is burned, the more urgent it is that engineers get involved.

This post describes how we implemented this concept at SoundCloud, enabling us to fulfill our SLOs without flooding our on-call engineers with an unsustainable number of pages.

Let’s Talk SRE

SLOs, alerting, and error budgets are central concepts of Site Reliability Engineering (SRE), a discipline invented by Google in the early 2000s. Although it was initially proprietary to Google, over the last few years, SRE has become a widely discussed topic and is now applied in various flavors throughout the industry. An important milestone in this development was Google’s publication of the bestselling book Site Reliability Engineering: How Google Runs Production Systems. The book is available online for free. If you run anything even remotely at scale, there is really no excuse not to read this book in its entirety. Among its many important chapters, three are of particular relevance for this blog post:

  • Chapter 4 — Service Level Objectives explains everything you need to know about SLOs (and SLIs and SLAs, in case you’ve ever wondered about the difference).
  • Chapter 6 — Monitoring Distributed Systems lays the foundation of “alerting philosophy” and creates an understanding of when it makes sense to page engineers (and when it does not).
  • Chapter 10 — Practical Alerting mostly describes Borgmon, Google’s traditional internal monitoring system. While Borgmon is proprietary to Google, the author points out that you can try out the described concepts at home (or at work, even) because many current open-source monitoring systems follow similar semantics, with Prometheus receiving a special shoutout. Conveniently, we use Prometheus for the large majority of our monitoring (which is not surprising, as most of Prometheus’ initial development took place here at SoundCloud).

Keep in mind, though, that you are not Google. We all can learn from the giants, but blindly copying them could go horribly wrong. On that note, how different organizations ventured into SRE territory is the main topic of yet another SRE-themed book: Seeking SRE, which was curated and edited by David Blank-Edelman. We at SoundCloud contributed a chapter, too, in which we talked about our somewhat bumpy road to applying SRE principles in an organization that is totally not Google.

The Google SREs themselves also felt the pain of us “mere mortals” trying to figure out how to make use of the lessons from the SRE book in our day-to-day work, so they published a companion to the original SRE book, called The Site Reliability Workbook, which they again made available online for free. There are a bunch of cookbook-style chapters that are directly applicable in practice. One of them, Chapter 5 — Alerting on SLOs, describes perfectly what we had been doing already, albeit in a less sophisticated way. With the inspiration from that chapter, we were able to refine our SLO-based alerting with great success. In fact, the solution we ended up using is so similar to the one described in the chapter that you should just read it together with this blog post to get the most out of it.

Symptoms vs. Causes

But let’s first take a step back and think about alerting on symptoms vs. causes. This is an important topic in Chapter 6 of the original SRE book (where you can read up on all the details if you feel so inclined).

Back in the days when your entire site ran on one LAMP server, there wasn’t much of a need to think about the difference between symptoms and causes. If your one web server went down (cause), your site was down (symptom). Waking somebody up the moment your web server stopped replying to ping probes was the right thing to do.

Nowadays, even mid-size sites run on hundreds, if not thousands of servers, and one or two of them are almost guaranteed to be down at any time. And that’s fine, because the software is designed to tolerate some failures. As such, it wouldn’t make any sense to wake somebody up whenever a single machine goes down.

You can apply the same thought to many other — actual or potential — causes of an outage. Paging alerts, those that wake you up in the night, should be based on symptoms, i.e. something that actively or imminently impacts the user experience. In addition, the alerting condition should also be urgent and actionable. If any of these requirements are not met, the alert should not be a page.

However, there could still be a meaningful alert on a non-paging level. Let’s call these kinds of alerts tickets, as somebody will have to look at them eventually, e.g. a dead server has to be dealt with at a suitable time. A user-visible but not urgent problem can easily be addressed during work hours rather than in the middle of the night, but even during an actual outage, it is helpful to get informed (via some kind of dashboard rather than via your pager) about possible causes of an outage that are detected by the monitoring system.

Ideally, your SLO is a precise definition of an unimpacted user experience. In other words, symptom-based alerts are usually alerts on (active or imminent) violations of your SLO.

Black-Box vs. White-Box Monitoring

With the lessons of symptoms vs. causes in mind, you might think now that black-box monitoring is the obvious choice for detecting symptoms, while white-box monitoring is ideal for causes. That’s a reasonable thought, but in reality, our practice at SoundCloud perfectly matches what Google tells us in Chapter 6 of the SRE book: “We combine heavy use of white-box monitoring with modest but critical uses of black-box monitoring.”

We mostly use black-box probes as a workaround for legacy software that isn’t yet instrumented for white-box monitoring. And we run fairly complex Catchpoint probes to simulate typical usage of our site. Interestingly, these probes are what we use to determine if we have hit our monthly, quarterly, and yearly availability goals, but we don’t use them to page anybody. Why is that so? The lazy answer is: We need white-box monitoring anyway (to investigate causes). Why should we invest in additional black-box monitoring except in the most critical cases? But there are more reasons (quotes are again from the SRE book):

  • Ideally, we want to detect user-impacting outages before they do noticeable harm. However, “for not-yet-occurring but imminent problems, black-box monitoring is fairly useless.”
  • Probes are not real user traffic. For example, the test track that our Catchpoint probes attempt to play might be perfectly available, while, for some complex reasons, the rap battle that has just gone viral is not.
  • Measuring long-tail latency and detecting low error rates take a lot of probes and thus a long time. However, tail latency is quite crucial in distributed systems, and our availability goals are usually measured in many nines, so even a small error rate might violate our SLO. This could be the main reason why we don’t page anybody based on our Catchpoint probes. The probes are too slow to detect low but significant error rates, despite already being limited to probing a very specific business KPI. The latter is a problem on its own, as we want our paging alerts to cover all relevant features of the site. However, the long-term results of the Catchpoint probes are great as a general quality calibration for our other monitoring. If they detect something, we can revisit historical alerting records to see if the near-real-time monitoring caught the issue, too.
  • In a multilayered system, like a microservice architecture, “one person’s symptom is another person’s cause.” A frontend service sees a failing backend service as a cause, while for the backend service, its own failure is a symptom, as it is now violating the (internal service-to-service) SLO. Adopting this view, we can use the white-box monitoring of the instrumented frontend service to track requests issued to the backend service. This is essentially black-box probing of the backend with the complete and real user traffic.

The most rewarding place to apply the pattern described in the last point above is our HAProxy-based terminator layer at the edge of our infrastructure. An earlier Backstage Blog post described it in detail.

Recently, HAProxy announced the introduction of a native Prometheus metrics endpoint, but the versions we have been working with are not yet instrumented for Prometheus. This gap is bridged by the Prometheus HAProxy exporter, a small glue binary that is scraped by the Prometheus server in the usual way; it in turn talks to HAProxy in its own fashion to retrieve the metrics, converts them to the Prometheus format, and returns them as the scrape result. By running such an HAProxy exporter next to each HAProxy instance, we get all the metrics we need. In particular, there is a counter of the HTTP responses served via the various backends, partitioned not only by the backend, but also by the HTTP status code. With Prometheus, we can easily calculate relative rates of 5xx responses over arbitrary time spans.
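
For example, an ad hoc PromQL query for the 5xx ratio of a single backend over the last hour might look like the following sketch (the backend label value is just an example; the recording rules we actually use follow in the next section):

    sum(rate(haproxy_backend_http_responses_total{job="haproxy", backend="api-v4", code="5xx"}[1h]))
  /
    sum(rate(haproxy_backend_http_responses_total{job="haproxy", backend="api-v4"}[1h]))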

Multiwindow, Multi-Burn-Rate Alerts

With the error rate as seen by the HAProxy terminator layer, we can now apply the alerting technique described in Chapter 5 of the SRE workbook. If you haven’t yet read that chapter, now is probably really a good time to do so. It progresses quite nicely, approaching the ultimate solution iteratively and solving one problem after another. Spoiler alert: Here, we will only implement the ultimate solution, the sixth iteration in the chapter, which is called Multiwindow, Multi-Burn-Rate Alerts. This features a set of alerts, each of which has a different alerting threshold that corresponds to different rates of burning through the monthly error budget. Depending on the burn rate, each alert uses a combination of two different time windows over which the error rate is calculated. Fast burning is detected quickly. Slow burning needs a longer time window.

First, we need to actually set the SLO. The chapter from the SRE workbook assumes an error rate of 0.1% (or a 99.9% success rate) as a typical value. That’s also generally true for SoundCloud, so the various numbers calculated in the chapter make sense for our scenario. However, some backends might have a stricter or more relaxed SLO. We still keep the various burn rates and window sizes the same for them, but we allow the target error rate to be configured per backend. This configuration takes the form of Prometheus recording rules, written in PromQL (the Prometheus expression language, which is also used in the SRE workbook). They look like the following:

- name: slos_by_backend
  rules:
  - record: backend:error_slo:percent
    labels:
      backend: "api-v4"
    expr: 0.1
  - record: backend:error_slo:percent
    labels:
      backend: "api-v3"
    expr: 0.2
  # ... Many more backends.

Next, we need the actual error rate averaged over various time windows: 5m, 30m, 1h, 2h, 6h, 1d, 3d. These are also calculated as recording rules, but they are now based on actual live data instead of a constant number. This is the rule for the 5m case:

- name: multiwindow_recording_rules
  rules:
  - record: backend:haproxy_backend_http_errors_per_response:ratio_rate5m
    expr: |2
        sum by (backend)(rate(haproxy_backend_http_responses_total{job="haproxy", code="5xx"}[5m]))
      /
        sum by (backend)(rate(haproxy_backend_http_responses_total{job="haproxy"}[5m]))
  # Other rules accordingly with 5m replaced by 30m, 1h, 2h, 6h, 1d, 3d.
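
For illustration, the 1h variant, which the first alerting rule below builds on, is the same rule with the range selector swapped out:

  - record: backend:haproxy_backend_http_errors_per_response:ratio_rate1h
    expr: |2
        sum by (backend)(rate(haproxy_backend_http_responses_total{job="haproxy", code="5xx"}[1h]))
      /
        sum by (backend)(rate(haproxy_backend_http_responses_total{job="haproxy"}[1h]))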

The rate function with the [5m] range selector calculates per-second rates, averaged over the specified timeframe, from the ever-increasing counters. The sum aggregator in both numerator and denominator separately sums up the rate from all HAProxy instances for each backend. Each backend ends up as a label on the elements in the resulting vector. The denominator includes all responses, while the numerator only selects those with code="5xx" — which results in an error rate.

With all these recording rules in place, we can now assemble the multiwindow, multi-burn-rate alerts. We have chosen to use four different alerts, each combining a long window, a short window, and a burn rate factor, as listed in the following table.

Alert    Long Window    Short Window    for Duration    Burn Rate Factor    Error Budget Consumed
Page     1h             5m              2m              14.4                2%
Page     6h             30m             15m             6                   5%
Ticket   1d             2h              1h              3                   10%
Ticket   3d             6h              1h              1                   10%

Each alert fires if the SLO error rate times the burn rate factor is exceeded when averaged over both the long window and the short window.

Example: The SLO error rate for api-v4 is 0.1% (to achieve “three 9s” of availability). The owners of api-v4 get paged if their backend has returned more than 1.44% 5xx responses over the last 1h and over the last 5m. They also get paged if their backend has returned more than 0.6% 5xx responses over the last 6h and over the last 30m.

The short window ensures a short reset time of the alert, i.e. an alert should stop firing soon after the problem has been solved (which is not only convenient but also allows for paging to happen again if the problem comes back).

Why the different burn rate factors? They enable pages for fast burning of the monthly error budget and tickets for slow burning of the budget. (Slow burning of the error budget must not go undetected and must be addressed eventually, but it’s no reason to wake somebody up in the middle of the night.) The Error Budget Consumed column lists the percentage of the monthly error budget consumed at the time the alert triggers. The math is easiest to check in the last row, with a burn rate factor of 1: If the error rate happens to be exactly at the target of 0.1%, and that happens for 3 days, then 10% of the monthly error budget is burned because 3 days is 10% of a 30-day-long month.
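
The same check works for the other rows: with a 30-day month (720 hours), the error budget consumed at the moment the alert triggers is simply the burn rate factor times the long window, divided by 720h:

  Page   (1h,  factor 14.4):  14.4 × 1h   / 720h = 2%
  Page   (6h,  factor 6):        6 × 6h   / 720h = 5%
  Ticket (1d,  factor 3):        3 × 24h  / 720h = 10%
  Ticket (3d,  factor 1):        1 × 72h  / 720h = 10%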

Note that an alert can fire much more quickly than within the length of the long window. A backend that suddenly starts to return 100% 5xx will cross the 1.44% hourly error rate after only 52s (1.44% of 3,600s).

This leaves us with the for duration column, which is interestingly missing in the SRE workbook. We’ll talk about it in a minute. But let’s first see how the alerting rules actually appear:

- name: multiwindow_alerts
  rules:
  - alert: ErrorBudgetBurn
    expr: |2
        (
          100 * backend:haproxy_backend_http_errors_per_response:ratio_rate1h
        > on (backend)
          14.4 * backend:error_slo:percent
        )
      and
        (
          100 * backend:haproxy_backend_http_errors_per_response:ratio_rate5m
        > on (backend)
          14.4 * backend:error_slo:percent
        )
    for: 2m
    labels:
      system: "{{$labels.backend}}"
      severity: "page"
      long_window: "1h"
    annotations:
      summary: "a backend burns its error budget very fast"
      description: "Backend {{$labels.backend}} has returned {{ $value | printf `%.2f` }}% 5xx over the last hour."
      runbook: "https://runbooks.soundcloud.com/terminator/#errorbudgetburn"
  # Followed by the other three alerting rules.

For the sake of brevity, this only lists the first of the four alerting rules. By replacing the relevant recording rules, the factors, the for duration, and the applicable label values with the values from the table above, you can easily deduce how the other three alerting rules appear.
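
For illustration, a sketch of the second Page alert from the table (6h/30m windows, burn rate factor 6, for duration 15m) could look like the following, assuming the annotations are worded analogously:

  - alert: ErrorBudgetBurn
    expr: |2
        (
          100 * backend:haproxy_backend_http_errors_per_response:ratio_rate6h
        > on (backend)
          6 * backend:error_slo:percent
        )
      and
        (
          100 * backend:haproxy_backend_http_errors_per_response:ratio_rate30m
        > on (backend)
          6 * backend:error_slo:percent
        )
    for: 15m
    labels:
      system: "{{$labels.backend}}"
      severity: "page"
      long_window: "6h"
    annotations:
      summary: "a backend burns its error budget fast"
      description: "Backend {{$labels.backend}} has returned {{ $value | printf `%.2f` }}% 5xx over the last 6 hours."
      runbook: "https://runbooks.soundcloud.com/terminator/#errorbudgetburn"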

The SRE workbook also lists the alerting rules in PromQL, but in a more compact form. Our example here is more verbose because we have added a few things we needed for our particular use case:

  • As explained above, the SLO error rate might be different per backend and the threshold is configured as a recording rule named backend:error_slo:percent. You can find it in the expression above. Note that the > comparison operator is used with a modifier, on (backend), to tell Prometheus which label to use for the label matching.
  • annotations that follow internal conventions are added to facilitate handling the alerts. For example, each alert should contain a link to the relevant runbook in our runbook repository.
  • We also add a system label for alert routing. The Prometheus Alertmanager allows us to configure alert routing based on labels in a very powerful way. Somewhere else in our codebase, we have a registry of which team owns which systems. From there, we autogenerate an Alertmanager config that routes alerts to their owners based on the system label. Everything that can be labeled (for example, pods on Kubernetes) also gets a system label, and that label is propagated into Prometheus. HAProxy only knows the backend, and thus the HAProxy exporter doesn’t expose a system label. However, by convention, we use the system name as the backend name, and thus we can simply set the system label to the value of the backend label. (In harsh reality, the backend label is a multi-component string with other components, too. Prometheus has you covered there as well, with the label_replace function used to cut out the system part. This little complication is left out above to keep things a bit shorter; see the sketch after this list.)
  • Finally, there is the for clause, with the duration from the table above. An alert with a for clause only fires once the alerting condition has been fulfilled continuously for the configured duration. Intriguingly, the SRE workbook explicitly advises against for clauses because they inevitably increase the detection time for the alerting condition. We still decided to use a for clause with a duration that is short compared to expected response times (of the alert itself and of the engineers getting alerted). The for clause is essentially a simple but effective defense against statistical outliers. For example, if a service has just started to receive traffic after a launch or an undrain operation, the error rate over any time window will be dominated by the very short time it has received traffic at all. Without the for clause, even a very short and modest initial error spike (maybe caused by cold caches) will trigger all the alerts.
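
To illustrate the label_replace complication mentioned above, here is a minimal sketch of such a call in isolation, assuming a purely hypothetical backend naming scheme where the system name is everything before the first slash (the real scheme at SoundCloud differs):

  # Hypothetical: derive a system label from a backend label like "api-v4/something".
  label_replace(
    backend:haproxy_backend_http_errors_per_response:ratio_rate1h,
    "system", "$1", "backend", "^([^/]+).*"
  )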

Conclusion

With the alerts described above, we can reliably detect breaches of configurable SLOs with an appropriate detection time (the quicker the error budget is burned, the sooner the alert will fire) and a sufficiently short reset time. The resulting alerts are strictly symptom-based and therefore always relevant. To reach a person able to act on a particular alert, it is routed to the on-call rotation owning the affected backend.