Periskop is an exception monitoring service that we built here at SoundCloud. It was designed with microservice environments in mind, but it’s also useful for long-running services in general. It has been architected for scale and is heavily influenced by what we’ve learned with Prometheus. It uses a pull-based model to efficiently aggregate exceptions in clients, after which point a server component collects exceptions and further aggregates them across instances.
In the past, SoundCloud relied on Airbrake to help troubleshoot and identify problems happening in production. As traffic increased and our services started receiving requests in the tens of thousands per second, small hiccups would send huge bursts of exception reports, in turn depleting our entire monthly budget in just a few minutes, and causing loss of information. Grepping or indexing the logs was not an option given the huge volume generated by our services.
Considering the growing number of services SoundCloud runs, we needed a solution that would discover and scrape services with minimal manual effort.
SoundCloud employees can use up to 20 percent of their time to experiment with new ideas (known here as self-allocated time). This was a perfect opportunity for a group of engineers to get together and build something fun and useful to improve tooling at SoundCloud and, hopefully, for the entire open source community.
Periskop is composed of both a client library component and a server component.
The client library provides an exception collector. Every time an exception is handled, it can be added to the collector. Unhandled exceptions can be added to the collector by a top-level exception handler. Once an exception is added:
Collected exception aggregates are exported via an HTTP endpoint using a well-known schema.
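To make the client side more concrete, here is a minimal sketch of an in-process collector in Python. It is not the API of the actual Periskop client libraries: the class names, the grouping rule (exception class plus a hash of the code locations in the traceback), the cap of 10 retained occurrences, and the JSON field names are all assumptions made for illustration.

```python
import hashlib
import json
import traceback
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Aggregate:
    """All occurrences of one kind of exception seen by this process."""
    key: str
    class_name: str
    total_count: int = 0
    # Keep only the most recent occurrences so memory stays bounded
    # no matter how many exceptions are reported (cap is an assumption).
    latest: deque = field(default_factory=lambda: deque(maxlen=10))


class ExceptionCollector:
    """Hypothetical collector: aggregates exceptions in memory until scraped."""

    def __init__(self):
        self._aggregates = {}

    def report(self, exc):
        """Add a handled exception to the in-memory aggregates."""
        frames = traceback.extract_tb(exc.__traceback__)
        # Assumed grouping rule: same class raised through the same
        # functions counts as the same kind of exception.
        location = "|".join(f"{f.filename}:{f.name}" for f in frames)
        digest = hashlib.sha1(location.encode()).hexdigest()[:8]
        class_name = type(exc).__name__
        key = f"{class_name}@{digest}"
        agg = self._aggregates.setdefault(key, Aggregate(key, class_name))
        agg.total_count += 1
        agg.latest.append({
            "message": str(exc),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def export(self):
        """Render the aggregates as JSON text for an HTTP handler to return."""
        return json.dumps({
            "aggregated_errors": [
                {
                    "aggregation_key": a.key,
                    "class": a.class_name,
                    "total_count": a.total_count,
                    "latest_errors": list(a.latest),
                }
                for a in self._aggregates.values()
            ]
        })
```

In a service, `report` would be called from `except` blocks and from a top-level exception handler, and `export` would back a route in whatever HTTP framework the service already uses. A burst of a million identical exceptions costs one counter increment each and a fixed amount of memory, which is the property that makes the pull model cheap on the client.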
The server component uses service discovery to find all instances of a service. It then scrapes and further aggregates exported exceptions across instances. A simple UI allows navigating and inspecting exceptions as they occur.
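The cross-instance aggregation step can be sketched roughly as follows. This is an illustrative Python function, not Periskop's server code: it assumes each instance exports JSON with an `aggregated_errors` list whose entries carry a hypothetical `aggregation_key`, a cumulative `total_count`, and a `latest_errors` list with ISO 8601 timestamps. Service discovery and the HTTP fetch are left out, so the input is simply the latest payload scraped from each instance.

```python
import json
from collections import defaultdict


def merge_scrapes(scrapes, keep_latest=10):
    """Combine per-instance exports into one view per aggregation key.

    `scrapes` maps an instance address to the JSON text most recently
    scraped from that instance (field names are assumptions).
    """
    merged = defaultdict(
        lambda: {"total_count": 0, "instances": [], "latest_errors": []}
    )
    for instance, payload in scrapes.items():
        for agg in json.loads(payload)["aggregated_errors"]:
            entry = merged[agg["aggregation_key"]]
            # Each instance reports a cumulative count, so summing the
            # latest scrape of every instance gives the service-wide total.
            entry["total_count"] += agg["total_count"]
            entry["instances"].append(instance)
            entry["latest_errors"].extend(agg["latest_errors"])
    for entry in merged.values():
        # ISO 8601 timestamps in the same timezone sort chronologically
        # as plain strings; newest first, then trim to a bounded sample.
        entry["latest_errors"].sort(key=lambda e: e["timestamp"], reverse=True)
        del entry["latest_errors"][keep_latest:]
    return dict(merged)
```

Because the clients have already collapsed repeated exceptions into counters, the work done here grows with the number of distinct exception kinds and instances, not with the raw number of exceptions thrown.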
Every design decision comes with a set of tradeoffs that are important to be aware of. In the case of Periskop, we opted for a pull model for collecting exceptions as opposed to the usual push model of similar tools like Sentry. This is advantageous because the pull model scales well with the number of exceptions and instances, and it provides interesting capabilities with little effort, specifically:
This model also has some disadvantages when compared to the push model, namely:
Periskop is in its early stages and lacks some capabilities. Some of the ideas for improving and extending it that are on the roadmap include:
In addition, high-level libraries for widely used web frameworks and client libraries for more programming languages need to be developed for Periskop to realize its full potential.
Periskop is open source and we are happy to accept external contributions! If you find this project useful, we would love to hear from you. Please drop us a line at periskop-maintainers@soundcloud.com.