Since we started breaking up our monolith and moving to a microservices architecture, we have come to rely heavily on synchronous request-response style communication. In this blog post we'll go over our current status and some of the lessons we've learned.
Most of our microservices are built on top of Finagle and are written in Scala. Finagle is a highly extensible, protocol-agnostic RPC framework developed by Twitter.
Finagle clients and servers use non-blocking I/O on top of Netty. Non-blocking I/O improves our scalability, since many of our microservices are scatter-gather style or are I/O bound on external resources like databases.
Finagle is also highly configurable and extensible, which means we can tailor it to work well with our infrastructure.
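For readers unfamiliar with Finagle, here is a minimal sketch of an HTTP server and client; the port and payload are made up for illustration and this is not code from our services:

```scala
import com.twitter.finagle.{Http, Service}
import com.twitter.finagle.http.{Request, Response, Status}
import com.twitter.util.{Await, Future}

object HelloService extends App {
  // A Finagle Service is a function Request => Future[Response];
  // the Future-based API keeps I/O non-blocking end to end.
  val service = new Service[Request, Response] {
    def apply(req: Request): Future[Response] = {
      val rep = Response(Status.Ok)
      rep.contentString = """{"message":"hello"}"""
      Future.value(rep)
    }
  }

  val server = Http.serve(":8080", service) // port chosen for illustration

  // A client is a Service too, so calling it is just function application.
  val client = Http.client.newService("localhost:8080")
  client(Request("/")).onSuccess(rep => println(rep.contentString))

  Await.ready(server)
}
```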
To make it easy for developers to build microservices that play nicely with SoundCloud's infrastructure and conventions, we have an internal library called jvmkit, maintained by our Core Services team. Among other things, jvmkit bundles the service discovery integration and resilience defaults described below.
Teams are responsible for their microservices from development all the way to running them in production, so they can decide not to use jvmkit as long as they stick to our standard protocol. In practice, however, most of our microservices do use jvmkit: by building on it, teams have more time to focus on delivering business features.
Almost all our services use HTTP/1.1 + JSON for communication.
The advantage of HTTP + JSON is that it is supported everywhere, so we don't lock ourselves into a specific technology or programming language.
There is very good tooling support for HTTP + JSON, which makes it easy to debug and interact with our services.
The HTTP method indicates useful properties of a request: whether it is idempotent and whether it is safe (read-only). These properties tell us whether requests can be safely retried or cached, which we found useful when replaying requests to test new revisions of a service. Indications of these properties are missing from the other protocols we looked into.
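As a sketch (not jvmkit code), the method semantics from RFC 7231 translate directly into retry and cache policies:

```scala
// Safe methods are read-only; idempotent methods can be repeated without
// changing the outcome. Both sets follow RFC 7231. Treating "safe" as
// "cacheable" is a simple heuristic, not the full HTTP caching model.
object HttpMethodSemantics {
  val safe: Set[String]       = Set("GET", "HEAD", "OPTIONS")
  val idempotent: Set[String] = safe ++ Set("PUT", "DELETE")

  def retryable(method: String): Boolean = idempotent.contains(method.toUpperCase)
  def cacheable(method: String): Boolean = safe.contains(method.toUpperCase)
}
```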
Of course, HTTP + JSON carries more overhead than other, more optimized protocols. We have some services using Finagle ThriftMux, and we considered making ThriftMux the recommended protocol, but we stepped away from that idea: ThriftMux is not ubiquitous, and we would have to lock ourselves into Finagle to get all of its benefits.
When we started using microservices, we used separate load balancer processes as entry points for our services. As traffic increased, we ran into failing load balancer processes, which caused several outages: our load balancers were single points of failure.
These outages made us move to client-side service discovery and load balancing. We wrote a Java service discovery implementation that queries DNS SRV records to get the instances of a given service. Writing it in Java means it can be integrated into any JVM-based language or framework that supports service discovery. We naturally integrated it into Finagle through the Finagle Resolver API. Having support in Finagle means we get client-side service discovery and load balancing for HTTP, ThriftMux, MySQL, and memcached.
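Our implementation itself is internal, but the core of the idea fits in a few lines: plain JNDI can resolve SRV records without extra dependencies. The service name below is hypothetical, and this sketch skips the caching and re-resolution a real client needs:

```scala
import java.util.Hashtable
import javax.naming.directory.InitialDirContext

final case class Instance(host: String, port: Int)

object SrvLookup {
  def lookup(name: String): Seq[Instance] = {
    val env = new Hashtable[String, String]()
    env.put("java.naming.factory.initial", "com.sun.jndi.dns.DnsContextFactory")
    val attrs = new InitialDirContext(env).getAttributes(name, Array("SRV"))
    val srv = attrs.get("SRV")
    (0 until srv.size()).map { i =>
      // Each SRV record is "<priority> <weight> <port> <target>".
      val Array(_, _, port, target) = srv.get(i).toString.split(" ")
      Instance(target.stripSuffix("."), port.toInt)
    }
  }
}

// e.g. SrvLookup.lookup("_tracks._tcp.services.example.com") // hypothetical name
```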
When investigating the root cause of some of our outages, we realized we had to invest more in improving our resilience against failures. It quickly became clear how important it is to use a mature library or system like Finagle, which already comes equipped with important features such as timeouts, retries, failure accrual (circuit breaking), and load balancing.
Using and configuring these features has allowed us to deal better with failures, which are inevitable in a distributed system like a microservices architecture.
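To give an idea of what that configuration looks like, here is a hedged sketch of a Finagle HTTP client with a request timeout and a retry budget; the label, address, and values are illustrative, not our production settings:

```scala
import com.twitter.conversions.DurationOps._
import com.twitter.finagle.Http
import com.twitter.finagle.service.RetryBudget

// Values here are illustrative, not our production defaults.
val client = Http.client
  .withLabel("tracks")                  // hypothetical downstream service
  .withRequestTimeout(200.milliseconds) // per-request timeout
  .withRetryBudget(RetryBudget(ttl = 10.seconds, minRetriesPerSec = 5, percentCanRetry = 0.1))
  .newService("localhost:8080")
```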
We also introduced automated integration tests for jvmkit that show and document how our systems react to failures. These tests also alert us to potential behavioural changes when we upgrade Finagle.
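As an illustration of the kind of behaviour these tests pin down, here is a sketch (using ScalaTest, with made-up values, not an actual jvmkit test) that asserts a client fails fast when a downstream server stalls:

```scala
import com.twitter.conversions.DurationOps._
import com.twitter.finagle.{Http, IndividualRequestTimeoutException, Service}
import com.twitter.finagle.http.{Request, Response}
import com.twitter.util.{Await, Future}
import org.scalatest.funsuite.AnyFunSuite

class FailureModeTest extends AnyFunSuite {
  test("a stalled downstream triggers the request timeout") {
    // A server that never responds, simulating a stalled dependency.
    val stalled = Http.serve("localhost:0", new Service[Request, Response] {
      def apply(req: Request): Future[Response] = Future.never
    })
    val port = stalled.boundAddress.asInstanceOf[java.net.InetSocketAddress].getPort

    val client = Http.client
      .withRequestTimeout(100.milliseconds) // illustrative value
      .newService(s"localhost:$port")

    intercept[IndividualRequestTimeoutException] {
      Await.result(client(Request("/")), 2.seconds)
    }

    Await.all(client.close(), stalled.close())
  }
}
```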
When we started implementing microservices, the team that implemented a service typically also provided a client library that other teams could use to access it. Those client libraries were implemented in the same programming language as the service and hid the network communication and the serialization and deserialization of requests and responses behind an easy-to-use API. The idea was that the code to access a service only needed to be written once and could be reused by many.
After a while, we noticed that the client library approach also brings a lot of disadvantages: it increases coupling between services, adds shared dependencies to every consumer, and ties consumers to the implementation language of the service. We think these disadvantages far outweigh the benefits. Instead of publishing clients, we encourage teams to publish their APIs; consumers write their own clients in whatever way they find most appropriate, using their technology of choice.
We hope to introduce Consumer-Driven Contract tests soon. These tests will catch breaking changes between services and their consumers at build time. Some teams are currently working on proofs of concept, building out and validating the process.
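The idea itself is framework-neutral, so a rough sketch (not one of those proofs of concept, and with a hypothetical endpoint) looks like this: the consumer publishes the interactions it depends on, and the provider's build replays them against the real service.

```scala
import com.twitter.conversions.DurationOps._
import com.twitter.finagle.Service
import com.twitter.finagle.http.{Method, Request, Response}
import com.twitter.util.Await

// The consumer publishes the interactions it relies on...
final case class Interaction(method: Method, path: String, expectedStatus: Int)

val consumerContract = Seq(
  Interaction(Method.Get, "/tracks/123", expectedStatus = 200) // hypothetical endpoint
)

// ...and the provider's build replays them against the service:
def verify(service: Service[Request, Response], contract: Seq[Interaction]): Unit =
  contract.foreach { i =>
    val rep = Await.result(service(Request(i.method, i.path)), 2.seconds)
    assert(rep.statusCode == i.expectedStatus, s"${i.method} ${i.path} broke the contract")
  }
```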
A logical next step for us would be to move from HTTP/1.1 to HTTP/2 as our default protocol. This would improve latency while sticking with a widely supported protocol that avoids technology or programming language lock-in.
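Finagle already ships HTTP/2 support, so we expect this to be largely a configuration change. In recent Finagle releases the toggle looks roughly like the following, though the exact API has moved between versions:

```scala
import com.twitter.finagle.Http

// Hedged sketch: withHttp2 exists in recent Finagle releases, but the exact
// configuration API has changed over time; check the version you're on.
val h2Client = Http.client.withHttp2.newService("localhost:8080")
val h2Server = Http.server.withHttp2 // then .serve(...) as usual
```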
We might also replace JSON with a more efficient serialization protocol such as Protocol Buffers. This would further improve performance and should result in less handwritten serialization and deserialization code, but it still needs more experimentation and investigation.