Backstage Blog

You're browsing posts in the Monitoring category

What Is New with Periskop in 2022

February 1st, 2022 by Marc Tuduri

We made some updates to Periskop, our internal pull-based exception monitoring service. To learn about what is new in 2022, read on.

Periskop: Exception Monitoring Service

March 16th, 2020 by Jorge Creixell

A new approach to exception monitoring, designed for high scalability...

Alerting on SLOs like Pros

June 4th, 2019 by Björn “Beorn” Rabenstein

If there is anything like a silver bullet for creating meaningful and actionable alerts with a high signal-to-noise ratio, it is alerting based on service-level objectives (SLOs). Fulfilling a well-defined SLO is the very definition of meeting your users’ expectations. Conversely, a certain level of service errors is OK as long as you stay within the SLO — in other words, if the SLO grants you an error budget. Burning through this error budget too quickly is the ultimate signal that some rectifying action is needed. The faster the budget is burned, the more urgent it is that engineers get involved.

This post describes how we implemented this concept at SoundCloud, enabling us to fulfill our SLOs without flooding our on-call engineers with an unsustainable number of pages.
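To make the error-budget idea concrete, here is a minimal sketch of the arithmetic. It is not SoundCloud's implementation; the function names, the 99% target, and the 30-day SLO window are assumptions chosen purely for illustration.

```python
import math


def error_budget(slo_target: float) -> float:
    """Fraction of requests that may fail while still meeting the SLO."""
    return 1.0 - slo_target


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio relative to the budget.

    A value of 1 means the budget is spent exactly at the end of the
    SLO window; larger values mean it runs out sooner.
    """
    return error_ratio / error_budget(slo_target)


def hours_until_exhausted(error_ratio: float, slo_target: float,
                          window_hours: float = 30 * 24) -> float:
    """Hours until a full budget is gone if the current error ratio persists.

    The 30-day default window is an assumption, not a value from the post.
    """
    rate = burn_rate(error_ratio, slo_target)
    if rate == 0:
        return math.inf
    return window_hours / rate


if __name__ == "__main__":
    # Hypothetical service: 99% SLO, currently failing 2% of requests.
    print(burn_rate(0.02, 0.99))
    print(hours_until_exhausted(0.02, 0.99))
```

With these assumed numbers, the service burns its budget at roughly twice the sustainable rate, so a full budget would last about half the window. The higher the burn rate, the sooner an engineer needs to get involved.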

Hands-Off Deployment with Canary

August 29th, 2018 by Jorge Creixell and Tobias Schmidt

At SoundCloud, we follow best practices around continuous delivery, that is, deploying small, incremental changes many times a day. To improve the user experience, we’ve been exploring different ways of reducing the impact and the Mean Time to Recovery (MTTR) of faulty deployments. Enter canary releases.

Prometheus has come of age – a reflection on the development of an open-source project

July 19th, 2016 by Björn “Beorn” Rabenstein

On Monday this week, the Prometheus authors released version 1.0.0 of the Prometheus server, the central component of the Prometheus monitoring and alerting system. (Other components will follow suit over the coming months.) This is a major milestone for the project. Read more about it on the Prometheus blog, and check out the announcement by the CNCF, which recently accepted Prometheus as a hosted project.

Prometheus: Monitoring at SoundCloud

January 26th, 2015 by Julius Volz and Björn Rabenstein

In previous blog posts, we discussed how SoundCloud has been moving towards a microservice architecture. Soon we had hundreds of services, with many thousands of instances running and changing at the same time. With our existing monitoring setup, mostly based on StatsD and Graphite, we ran into a number of serious limitations. What we really needed was a system with the following features:

A multi-dimensional data model, so that data can be sliced and diced at will, along dimensions like instance, service, endpoint, and method.
Operational simplicity, so that you can spin up a monitoring server where and when you want, even on your local workstation, without setting up a distributed storage backend or reconfiguring the world.
Scalable data collection and decentralized architecture, so that you can reliably monitor the many instances of your services, and independent teams can set up independent monitoring servers.
Finally, a powerful query language that leverages the data model for meaningful alerting (including easy silencing) and graphing (for dashboards and for ad-hoc exploration).

All of these features existed in various systems. However, we could not identify a system that combined them all until a colleague started an ambitious pet project in 2012 that aimed to do so. Shortly thereafter, we decided to develop it into SoundCloud’s monitoring system: Prometheus was born.