Periskop: Exception Monitoring Service

March 16th, 2020 by Jorge Creixell

Periskop is an exception monitoring service that we built here at SoundCloud. It was designed with microservice environments in mind, but it’s also useful for long-running services in general. It has been architected for scale and is heavily influenced by what we’ve learned with Prometheus. It uses a pull-based model to efficiently aggregate exceptions in clients, after which point a server component collects exceptions and further aggregates them across instances.

Why Yet Another Exception Aggregation System?

In the past, SoundCloud relied on Airbrake to help identify and troubleshoot problems in production. As traffic increased and our services started receiving tens of thousands of requests per second, small hiccups would send huge bursts of exception reports, depleting our entire monthly budget in just a few minutes and causing loss of information. Grepping or indexing the logs was not an option given the huge volume generated by our services.

Considering the growing number of services SoundCloud runs, we needed a solution that would discover and scrape services with minimal manual effort.

Since SoundCloud employees can use up to 20 percent of their time to experiment with new ideas (known here as self-allocated time), this was a perfect opportunity for a group of engineers to get together and build something fun and useful to improve tooling at SoundCloud and, hopefully, for the entire open source community.

How Does It Work?

Periskop is composed of both a client library component and a server component.

The client library provides an exception collector. Every time an exception is handled, it can be added to the collector. Unhandled exceptions can be added to the collector by a top-level exception handler. Once an exception is added:

The collector builds a unique key with the exception’s message and a hash of the stack trace. This key represents a concrete exception type.
The collector aggregates the exception using the unique key and stores the aggregate using an in-memory data structure.
The concrete exception occurrence is added to the aggregate’s list of recent exceptions for inspection. A queue data structure is used to implement a moving window of the most recent exceptions with a configurable size.
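
To make these steps concrete, here is a minimal sketch of such a collector in Python. It is not the actual Periskop client library: the class names, the `message@hash` key format, the use of SHA-1, and the default window size are all assumptions made for illustration.

```python
import hashlib
import traceback
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ExceptionAggregate:
    key: str
    total_count: int = 0
    recent: deque = field(default_factory=deque)  # window of recent occurrences


class ExceptionCollector:
    """Hypothetical collector; names and key format are illustrative assumptions."""

    def __init__(self, max_recent=10):
        self._max_recent = max_recent
        self._aggregates = {}  # key -> ExceptionAggregate

    @staticmethod
    def _key(exc):
        # Hash the code locations in the stack trace so that the same failure
        # path always maps to the same key.
        frames = traceback.extract_tb(exc.__traceback__)
        trace = "|".join(f"{f.filename}:{f.lineno}:{f.name}" for f in frames)
        digest = hashlib.sha1(trace.encode("utf-8")).hexdigest()[:8]
        return f"{exc}@{digest}"

    def report(self, exc):
        key = self._key(exc)
        aggregate = self._aggregates.get(key)
        if aggregate is None:
            aggregate = ExceptionAggregate(key, recent=deque(maxlen=self._max_recent))
            self._aggregates[key] = aggregate
        aggregate.total_count += 1
        # Appending to a full deque silently drops the oldest occurrence.
        aggregate.recent.append(
            "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
        )
        return aggregate

    def aggregates(self):
        return dict(self._aggregates)
```

Calling `collector.report(err)` inside an `except` block is enough to record an occurrence. Because the window is a bounded queue, reporting the same exception many times grows only the counter, not the memory held by the aggregate.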

Collected exception aggregates are exported via an HTTP endpoint using a well-known schema.
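
As a rough illustration of the export step, the sketch below serves aggregates as JSON using only the Python standard library. The `/exceptions` path and the field names (`aggregates`, `key`, `total_count`, `recent`) are made up for this example and should not be read as the actual Periskop schema.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_payload(aggregates):
    """Serialize {key: {"total_count": int, "recent": [str]}} to JSON.

    The schema here is a hypothetical stand-in, not the real one.
    """
    return json.dumps({
        "aggregates": [
            {"key": key, "total_count": agg["total_count"], "recent": list(agg["recent"])}
            for key, agg in aggregates.items()
        ]
    })


def make_handler(get_aggregates, path="/exceptions"):
    """Build a request handler that exports whatever get_aggregates() returns."""

    class ExportHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != path:
                self.send_error(404)
                return
            body = render_payload(get_aggregates()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet

    return ExportHandler


def serve(get_aggregates, port=8080):
    # Blocks forever; a real service would run this alongside its own server.
    HTTPServer(("", port), make_handler(get_aggregates)).serve_forever()
```

The handler only reads state that the collector already holds in memory, so answering a scrape does not depend on how many exceptions occurred since the last one.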

The server component uses service discovery to find all instances of a service. It then scrapes and further aggregates exported exceptions across instances. A simple UI allows navigating and inspecting exceptions as they occur.
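
The cross-instance aggregation can be sketched as follows. This is a simplified illustration rather than the server's actual code: service discovery is reduced to a plain list of instance addresses, `fetch` is a hypothetical function that returns the counts exported by one instance, and the sketch assumes that clients export cumulative counts.

```python
from collections import defaultdict


class ServiceAggregator:
    """Hypothetical server-side aggregate for one service."""

    def __init__(self):
        self._latest = {}  # instance address -> {exception key: total_count}

    def record_scrape(self, instance, counts):
        # Assuming client counts are cumulative, a newer scrape replaces the
        # previous one instead of being added to it.
        self._latest[instance] = dict(counts)

    def totals(self):
        # Sum the latest known count of every key across all instances.
        merged = defaultdict(int)
        for counts in self._latest.values():
            for key, count in counts.items():
                merged[key] += count
        return dict(merged)


def scrape_service(aggregator, instances, fetch):
    """Scrape every discovered instance once and fold the results together."""
    for instance in instances:
        try:
            aggregator.record_scrape(instance, fetch(instance))
        except OSError:
            continue  # unreachable instance: keep its previous counts
```

Keeping only the latest scrape per instance means the server's memory grows with the number of instances and distinct exception keys, not with the number of scrapes.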

Design Tradeoffs

Every design decision comes with a set of tradeoffs that are important to be aware of. In the case of Periskop, we opted for a pull model for collecting exceptions as opposed to the usual push model of similar tools like Sentry. This is advantageous because the pull model scales well with the number of exceptions and instances, and it provides interesting capabilities with little effort, specifically:

Memory used by the client library is bounded by the number of different types of exceptions occurring instead of the total number of exceptions overall. Memory on the server side is similarly bounded, as it uses the same aggregation logic.
If the server component needs to scrape a large number of service instances, it might end up skipping scrape cycles, thereby reducing the freshness of the data, but it does not require additional resources. In practice, this means that Periskop can handle large numbers of services and instances with few resources.
Creating hierarchies of Periskop instances (known as federation) becomes trivial. A primary Periskop instance could aggregate exceptions collected by secondary, data center-specific instances, thus providing a global view of the entire service stack across data centers.
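
One way to picture the skipped scrape cycles mentioned in the second point is a fixed-interval scheduler that never queues up missed work. The function below is a hypothetical sketch of that idea, not the scheduling logic Periskop actually uses; the name `next_scrape_time` and its parameters are assumptions.

```python
def next_scrape_time(scheduled, finished, interval):
    """Return the first tick strictly after `finished`.

    Ticks fall at scheduled + n * interval. If a scrape overruns one or more
    ticks, those ticks are skipped rather than queued, so a slow scrape costs
    data freshness but never piles up extra work.
    """
    elapsed_ticks = (finished - scheduled) // interval
    return scheduled + (elapsed_ticks + 1) * interval
```

For example, with a 10-second interval, a scrape scheduled at second 0 that finishes at second 25 is next run at second 30, skipping the ticks at seconds 10 and 20.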

This model also has some disadvantages when compared to the push model, namely:

If a fatal exception occurs and the process dies, the exception won’t be collected. However, this can be mitigated by the reporting capabilities of orchestration services like Kubernetes or other forms of logging.
The pull model is not well suited for short-lived processes like scheduled jobs. This could be solved by the use of a push-based event gateway, although we haven’t yet had the need at SoundCloud, as it is usually more convenient to use logs to inspect failed jobs.

What Is Coming Next

Periskop is in its early stages and lacks some capabilities. Ideas on the roadmap for improving and extending it include:

Server-side persistence of exception occurrences over time
UI filtering and sorting
Federation (hierarchical Periskop servers)
Pluggable service discovery mechanisms
Connectors to other error monitoring services like Sentry

In addition, high-level libraries for widely used web frameworks and client libraries for more programming languages need to be developed for Periskop to realize its full potential.

Periskop is open source and we are happy to accept external contributions! If you find this project useful, we would love to hear from you. Please drop us a line at periskop-maintainers@soundcloud.com.