Building and operating services distributed across a network is hard. Failures are inevitable. The way forward is to make resiliency a key part of design decisions.
This post talks about two key aspects of resiliency when doing RPC at scale: the circuit breaker pattern, and its power when combined with client-side load balancing.
A basic definition of the Circuit Breaker pattern and its applicability are explained in Martin Fowler’s post on the topic.
SoundCloud’s original implementation meant a client’s view of a remote service transitioned between three pre-defined states:
With this implementation, a single failure became the deciding factor for opening a circuit, blocking all requests to the remote service for a period. This meant a degraded end-user experience and having to manually restart client systems to reset the state.
As you start operating systems under high load, the following questions arise:
Because most services at SoundCloud use JSON over HTTP for inter-service communication, resiliency at this layer is critical to our uptime and to dealing with all possible classes of failure.
Earlier this year, we started exploring solutions aimed at improving resilience for our HTTP clients.
We found that most of the gaps detailed above could be addressed through a combination of modules included in the Finagle stack for building clients.
This was good news, but we still needed custom tweaks. The most important of these was to identify and compose multiple available modules, to help us define policies for how clients behave in failure scenarios and meet our expectations for high availability and a great user experience.
Specifically, we could map each question to a module in Finagle that could potentially address the gaps:
The problem: an unbounded duration, instead of a moving window, means clients infer inaccurate information about the health of remote services. A failure accrual module in Finagle allows clients to define a policy for success in terms of:
This captures all dimensions by which we want to define success thresholds.
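To make the idea of a success policy over a moving window concrete, here is a minimal, language-neutral sketch in Python. It is not Finagle's API: the class name `SuccessRateWindow` and the parameters `window_seconds`, `min_success_rate` and `min_requests` are assumptions chosen for illustration, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from collections import deque


class SuccessRateWindow:
    """Hypothetical policy: healthy while the success rate over the last
    `window_seconds` stays at or above `min_success_rate`."""

    def __init__(self, window_seconds, min_success_rate, min_requests):
        self.window_seconds = window_seconds
        self.min_success_rate = min_success_rate
        self.min_requests = min_requests
        self._events = deque()  # (timestamp, succeeded) pairs, oldest first

    def record(self, now, succeeded):
        """Store one call outcome and drop outcomes that left the window."""
        self._events.append((now, succeeded))
        self._evict(now)

    def _evict(self, now):
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def is_healthy(self, now):
        """Judge health only from outcomes still inside the window."""
        self._evict(now)
        total = len(self._events)
        if total < self.min_requests:
            return True  # too little recent data to open the circuit
        successes = sum(1 for _, ok in self._events if ok)
        return successes / total >= self.min_success_rate


# Usage: 8 successes followed by 2 failures gives a success rate of 0.8,
# which is below the required 0.9.
window = SuccessRateWindow(window_seconds=30, min_success_rate=0.9, min_requests=5)
for t in range(10):
    window.record(now=t, succeeded=(t < 8))
print(window.is_healthy(now=10))
```

Because old outcomes are evicted, a burst of failures stops influencing the verdict once it ages out of the window, which is the property an unbounded duration lacks.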
The problem: a client’s aggregated view of a remote service means a single rogue instance can trick you into believing that the whole service is unhealthy.
Finagle monitors the health of services per instance, and applies the success policy defined by clients to decide whether it can allow requests through to a given instance.
A few unhealthy nodes can no longer render your service completely unavailable. Big win!
In conjunction with client-side load balancing, this creates a dynamic view of a service for clients.
Clients can now continue fetching data, with requests smartly routed only to service instances marked healthy, while unhealthy instances are temporarily ignored and allowed to recover.
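The interplay between per-instance circuit breaking and client-side load balancing can be sketched roughly as follows. This is an illustration under stated assumptions, not Finagle's implementation: the classes `InstanceBreaker` and `HealthAwareBalancer`, the fixed backoff, the addresses, and the round-robin strategy are all hypothetical and chosen for simplicity.

```python
class InstanceBreaker:
    """Hypothetical per-instance state: the instance is skipped until
    `retry_at`, then allowed to receive traffic again."""

    def __init__(self, address):
        self.address = address
        self.retry_at = 0.0  # available once now >= retry_at

    def is_available(self, now):
        return now >= self.retry_at

    def mark_unhealthy(self, now, backoff_seconds):
        self.retry_at = now + backoff_seconds


class HealthAwareBalancer:
    """Round-robin over instances, skipping those marked unhealthy."""

    def __init__(self, addresses):
        self.instances = [InstanceBreaker(a) for a in addresses]
        self._next = 0

    def pick(self, now):
        n = len(self.instances)
        for offset in range(n):
            candidate = self.instances[(self._next + offset) % n]
            if candidate.is_available(now):
                self._next = (self._next + offset + 1) % n
                return candidate
        return None  # every instance is currently marked unhealthy


# Usage: the second of three instances is marked unhealthy for 5 seconds.
balancer = HealthAwareBalancer(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
balancer.instances[1].mark_unhealthy(now=0.0, backoff_seconds=5.0)
picked = [balancer.pick(now=1.0).address for _ in range(4)]
print(picked)  # traffic alternates between the two healthy instances
```

The point of the sketch is that the health verdict lives with each instance rather than with the service as a whole, so one bad node only removes itself from rotation.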
The problem: network-layer information alone is often not good enough to make informed decisions about the health of remote services.
The ability for clients to define what counts as a failure, using information available in application-layer protocols, is powerful. This gives fine-grained control, and is an improvement over relying on standard network-layer signals to classify remote invocations as success or failure.
Clients should be able to classify failures from services based on protocol-specific data (e.g., an HTTP 5xx response code).
This provides much-needed application-specific context to circuit breakers and results in accurate decisions regarding the health of individual instances of a service.
Support for response classification was introduced in Finagle earlier this year, just in time for us to put it to good use.
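As a rough illustration of the difference between network-layer and application-layer classification, consider the sketch below. It does not use Finagle's response classification API; the `Outcome` type and both classifier functions are hypothetical names introduced here, and treating HTTP 5xx as a failure follows the example given above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    """Hypothetical result of one remote call: an HTTP status code if a
    response arrived, or the transport error if it did not."""
    status: Optional[int] = None
    error: Optional[Exception] = None


def classify_network_only(outcome):
    """Network-layer view: any call without a transport error is a success."""
    return outcome.error is None


def classify_http(outcome):
    """Application-layer view: transport errors, missing responses and
    HTTP 5xx responses all count as failures."""
    if outcome.error is not None or outcome.status is None:
        return False
    return not (500 <= outcome.status <= 599)


# Usage: the 503 response looks fine at the network layer but is a failure
# once the HTTP status code is taken into account.
outcomes = [
    Outcome(status=200),
    Outcome(status=404),
    Outcome(status=503),
    Outcome(error=ConnectionError("connection refused")),
]
print([classify_network_only(o) for o in outcomes])
print([classify_http(o) for o in outcomes])
```

Note that a 404 is deliberately left as a success here: it says something about the request, not about the health of the instance that answered it.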
It is important to acknowledge that even the task of composing appropriate modules and configuring thresholds takes a fair bit of learning until you get it right.
We switched to a moving window defined by duration instead of by number of requests, as we realised it was hard to define request-count thresholds that worked equally well for high-traffic and low-traffic services.
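A small piece of arithmetic shows why a request-count window is hard to tune across traffic levels. The function name, the window size of 100 requests and the traffic figures below are made up for illustration only.

```python
def seconds_covered_by_count_window(window_requests, requests_per_second):
    """Wall-clock time spanned by the most recent `window_requests` calls,
    assuming a steady request rate."""
    return window_requests / requests_per_second


# The same 100-request window covers very different time spans depending
# on how busy the client is.
for rps in (1000, 10, 1):
    span = seconds_covered_by_count_window(100, rps)
    print(f"{rps} req/s: last 100 requests span {span} seconds")
```

For a busy client the window reflects only the last fraction of a second, while for a quiet client it still contains outcomes from well over a minute ago. A duration-based window keeps the time horizon constant regardless of traffic.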
Similar implementations exist for services that talk to database systems, cache nodes, etc.
Given that we were already on the Finagle stack, the capabilities listed were compelling. A combination of Ribbon and Hystrix (Java-based libraries from Netflix, aimed at fault tolerance) makes for a decent alternative.
The above points aim to provide a high level summary of capabilities to look for when choosing to implement or use a framework aimed at improving resiliency.
We hope to share our experiences on these aspects, as we move forward, and scale out further.