Since we started breaking our monolith and introduced a microservices architecture we rely a lot on synchronous request-response style communication. In this blog post we’ll go over our current status and some of the lessons we learned.
Finagle and jvmkit
Most of our microservices are built on top of Finagle and are written in Scala. Finagle is a highly extensible and protocol agnostic RPC framework developed by Twitter.
Finagle clients and servers use non blocking I/O on top of Netty. Non blocking I/O improves our scalability since we have a lot of scatter-gather style microservices or microservices that are I/O bound on external resources like databases.
Finagle is also highly configurable and extensible which means we can tailor it to work well with our infrastructure.
To make it easy for developers to build microservices that play nicely with the SoundCloud infrastructure and conventions we have an internal library called jvmkit which is maintained by our Core Services team. Some of the functionality jvmkit provides:
- Wrappers around Finagle HTTP, memcached and MySQL with reasonable default configurations and integration with Prometheus
- DNS based service discovery implementation
- HTTP request routing on top of Finagle
- JSON parsing support for request / response entities
- Logging support according to our standards
- API support for rollout our feature toggle system
Teams are responsible from development till running their microservices in production so they can decide not to use jvmkit as long as they stick to our standard protocol. In practice however most of our microservices do use jvmkit. By using jvmkit teams have more time to focus on delivering business features.
HTTP/1.1 + JSON
Almost all our services use HTTP/1.1 + JSON for communication.
The advantage of HTTP + JSON is that it is supported everywhere so we don’t lock ourselves into a specific technology or programming language.
There is very good tooling support for HTTP + JSON which makes it easy to debug and interact with our services.
The HTTP method indicates useful properties of requests. It indicates if a request is idempotent and safe (read-only). These properties tell us if requests can be safely retried or cached. We found this useful when replaying requests to test new revisions of a service. Indication of these properties are missing from other protocols we looked into.
Of course HTTP + JSON results in more overhead compared to other more optimized protocols. We have some services using Finagle ThriftMux and have thought about making ThriftMux the recommended protocol but we stepped away from that idea. ThriftMux is not ubiquitous and would lock us into Finagle if we want to get all the benefits.
Client side service discovery and load balancing
When we started using microservices we used separate load balancer processes as entry points for our services. With the increase in traffic we ran into failing load balancer processes which caused several outages. Our load balancers were single points of failure.
These outages made us move to client side service discovery and load balancing. We wrote a Java service discovery implementation that queries DNS SRV records to get the instances for a given service. Writing it in Java means it can be integrated in any JVM based language and framework that supports Service Discovery. We obviously integrated it in Finagle by using the Finagle Resolver api. Having support in Finagle means we have client side service discovery and load balancing support for HTTP, ThriftMux, MySQL and memcached.
Resilience against failures
When investigating the root cause of some of our outages we realized that we had to invest more in improving our resilience against failures. It quickly showed how important it is to use a mature library or system like Finagle which is already equipped with important features like:
- Timeouts: every request across the network needs realistic timeouts. This protects you from issues with the systems you depend on.
- Transport and application level Circuit Breakers: supports ignoring hosts that become unavailable or unhealthy with automatic recovery when they become healthy again. It protects you from issues with systems you depend on and allows them to recover.
- Conditional retries: retrying can be effective but you shouldn’t retry unconditionally as this can result in retry storms that can overload the systems you depend on. Retry budgets allow you to configure a percentage of requests that can be retried on top of a minimum number within a time frame.
- Limiting concurrent requests: If you know the capacity your service can handle you can protect it from overload by limiting the maximum number of concurrent requests and waiters (queue size).
Using and configuring these features have allowed us to better deal with failures which are inevitable in a distributed system like a microservices architecture.
We also introduced automated integration tests for jvmkit which show and document how our systems will react to failures. These tests also notify us about potential behavioural changes when upgrading Finagle.
No more client libraries
When we started implementing microservices teams who implemented a service typically also provided a client library that could be used by other teams to access their service. Those client libraries were implemented in the same programming language as the service and were hiding network communication and serializing / deserializing of requests and responses behind an easy to use api. The idea was that the code to access a service only needed to be written once and could be re-used by many.
After a while we noticed the client libraries approach also brings a lot of disadvantages:
- A client library brings in dependencies. It might bring in a HTTP library, JSON parsing library etc. This might result in conflicts with the libraries already used by your service. If you depend on more than 1 client library this problem only increases resulting in unnecessary dependencies and conflicts between teams.
- A client library might become bloated with custom logic. Using the client library might be the only way to successfully use the service and prevents it from being used by applications using other technologies. This imposes the use of a programming language and technology on other teams.
- A client library can have default error handling logic which is probably not a good fit for all use cases. Users of a service should be aware of the error scenario’s that can happen and act upon them in the best way for their use case.
- A client library will likely parse all returned fields of a response even if the client is only interested in a subset. This not only might have performance implications but also violates the tolerant reader principle.
We think these disadvantages far outweigh the benefits. The use of client libraries results in increased coupling and dependencies. Instead of publishing clients we encourage teams to publish their API’s. Consumers write their own clients in a way they find most appropriate using their technology of choice.
We hope to introduce Consumer Driven Contract tests soon. These tests will catch breaking changes between services and consumers during build time. Some teams are currently working on proof of concepts and are building out and validating the process.
A logical step for us would be to move from HTTP/1.1 to HTTP/2 as our default protocol. This would improve latency while sticking with a widely supported protocol without technology or programming language lock-in.
We might also replace JSON with a more efficient serialization protocol like protocol buffers. This would further improve performance and should result in less handwritten serialization and deserialization code but this still needs more experimentation and investigation.