If there is anything like a silver bullet for creating meaningful and actionable alerts with a high signal-to-noise ratio, it is alerting based on service-level objectives (SLOs). Fulfilling a well-defined SLO is the very definition of meeting your users’ expectations. Conversely, a certain level of service errors is OK as long as you stay within the SLO — in other words, if the SLO grants you an error budget. Burning through this error budget too quickly is the ultimate signal that some rectifying action is needed. The faster the budget is burned, the more urgent it is that engineers get involved.
This post describes how we implemented this concept at SoundCloud, enabling us to fulfill our SLOs without flooding our engineers on call with an unsustainable amount of pages.