In a previous series of blog posts, we covered our decision to move away from a monolithic architecture and replace it with microservices that interact synchronously with each other over HTTP and asynchronously using events. In this post, we review our progress toward this goal, and talk about the conditions and strategy required to decommission our monolith.
Let's review briefly. We are presented with a problem: we are unable to confidently make changes to the monolithic system powering SoundCloud, and the growth of its database schema exceeds our ability to support it. We therefore need to decommission the monolith, and we decided to migrate to a microservices architecture. Our plan for this migration has been:
- Invest in internal tools and libraries to make the creation of microservices easier.
- Expect that new features can be built outside of the monolith, using only the internal API it provides to access its data, or the events it emits on state changes.
- When making significant changes to a feature or entity which exists inside the monolith, consider implementing these in a new service which is outside the monolith. We call these extraction projects.
This strategy has given us some great results. We have drastically reduced the time and number of decisions required to create a new service. These services benefit from world-class telemetry capabilities provided by Prometheus; confidence in deployments thanks to Docker, Kubernetes and the work we do to improve testing of service interactions; and an efficient and powerful network stack in Finagle. Work done in any of these areas has a powerful effect on the productivity of all engineers using these tools and libraries. If an engineer working in the core group makes a change which increases the productivity of many other engineers, the net effect can be equal to hiring an additional engineer. As we grow, these small effects result in dramatic improvements visible across the entire engineering organization.
As a way of decommissioning our monolith, though, this strategy has not served us well. We have noticed that:
- An extraction project, if its goal is to remove all code related to an entity from the monolithic codebase, is a multiple-month undertaking with dependencies on all teams who use that entity.
- Deploying a new service which serves all needs around an entity requires understanding — and, in cases where database joins are performed, changing — the code related to that entity in the monolithic codebase.
- Investing heavily in the skills and tools needed for our new microservices architecture makes it harder to find engineers who are willing and able to work on the monolithic codebase. Many engineers feel intimidated by it and do not feel a sense of ownership for it.
That being said, we have been able to extract some services. Most often, this has happened in teams where engineers have experience with the monolith. In these cases, the engineers were able to make the case for an extraction project: the feature they were beginning to work on required integration with the monolith, and could only be expected to work for as long as the monolith continued to work. With database growth rates as they are, and diminishing knowledge of the codebase, this was a short enough period that the decision to do an extraction project was clear.
In many other cases, though, developers were not able to make this case, or were not even aware that they should, because they felt so removed from the monolith. In these cases, we have even seen services bypass the monolith and access its database directly, because the developers working on a new feature lacked the knowledge of how to modify its code, and were far quicker on our newer, better-supported Scala stack. When we come to work on an extraction later, services accessing the database directly present a problem for the project. They require a special negotiation about how to integrate, rather than the standard "from this date on, please use the Foo service for this endpoint, rather than the monolith." Our investment in making services easier to build has created the conditions for behaviors which make it harder to decommission our monolith.
How, then, should we revise our strategy, and decommission the monolith faster? We are exploring a new approach:
- Recognize that, for an extraction project whose definition of done is "remove all code related to an entity from the monolith," the part that will take the longest and be the most complicated is to change all call sites related to this entity. This is especially difficult in cases where the calling code relies on an endpoint which joins two entities, and the future architecture will see these entities belonging to different services. To accelerate this effort, we must distribute the work.
- Change the definition of done for an extraction project to "a service is available which supports every access pattern for this entity that we intend to support in future, even though some downstream services still integrate with this entity through the monolith." Since this means an extraction will not be complete until downstream services make changes, the points below ensure progress toward the eventual goal of complete extraction from the monolith.
- Use monitoring to measure progress in migrating calling code.
- Produce guidelines which explain how and why to compose responses from multiple backing services. This guidance should include what to do about the resulting consistency issues ("the tracks service has returned me a track belonging to user 123, but the users service tells me that user doesn’t exist").
- Produce clear guidance around the ways in which services of different kinds can integrate with an entity. For example, a service should never access another service's database directly for online use cases. Hold engineers to account for ensuring their systems follow this guidance.
- Prioritize extraction projects based on the size and rate of growth of the affected entities in the monolith's database, since this is the most pressing risk to the health of that codebase.
- Make use of the monolith experts we have to guide and accelerate the creation of services which will replace parts of it.
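To make the point about composing responses more concrete, here is a minimal sketch in Python of a caller that joins a track with its uploader from two separate backing services, rather than relying on a joined endpoint in the monolith. Everything named here (`Track`, `User`, `TrackView`, `compose_track_view`, and the in-memory `TRACKS` and `USERS` stand-ins) is hypothetical and not taken from our codebase. The policy it shows, returning the track with no uploader when the users service does not know the user, is only one possible answer to the consistency issue described above; hiding the track entirely would be another.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Track:
    id: int
    title: str
    user_id: int


@dataclass(frozen=True)
class User:
    id: int
    username: str


@dataclass(frozen=True)
class TrackView:
    track: Track
    uploader: Optional[User]  # None when the users service has no such user


def compose_track_view(
    track_id: int,
    fetch_track: Callable[[int], Optional[Track]],
    fetch_user: Callable[[int], Optional[User]],
) -> Optional[TrackView]:
    """Join a track with its uploader in the caller, not in a database."""
    track = fetch_track(track_id)
    if track is None:
        return None  # the track itself is unknown: nothing to compose
    # The two services can disagree: the track may reference a user that
    # the users service does not (or no longer) know about. Assumed policy:
    # return a degraded view instead of failing the whole request.
    uploader = fetch_user(track.user_id)
    return TrackView(track=track, uploader=uploader)


# In-memory stand-ins for the two backing services (hypothetical data).
TRACKS = {1: Track(1, "Intro", 123), 2: Track(2, "Outro", 456)}
USERS = {456: User(456, "alice")}

orphaned = compose_track_view(1, TRACKS.get, USERS.get)   # user 123 is missing
complete = compose_track_view(2, TRACKS.get, USERS.get)   # both services agree
missing = compose_track_view(99, TRACKS.get, USERS.get)   # no such track
```

In a real service the two fetch functions would be network calls, so the same decision has to be made for timeouts and errors as well as for missing users. Passing them in as parameters keeps the composition logic testable without either backing service running.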
We hope that this approach will get us to the goal of decommissioning our monolith faster, and continue to make use of the investment we have made in our microservices ecosystem, while dealing with the forces which led it to work against this goal.
Are you working to decommission a monolith and move to microservices? We'd love to hear from you about ideas or approaches which have worked for you.