In a previous series of blog posts, we covered our decision to move away from a monolithic architecture and replace it with microservices that interact with each other synchronously over HTTP and asynchronously using events. In this post, we review our progress toward this goal and talk about the conditions and strategy required to decommission our monolith.
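To make those two interaction styles concrete, here is a minimal sketch in Scala, using Finagle for the HTTP side (more on Finagle below). The service name, endpoint, event type, and EventPublisher trait are all hypothetical, chosen only to illustrate the shapes of the two calls.

```scala
import com.twitter.finagle.{Http, Service}
import com.twitter.finagle.http.{Method, Request, Response}
import com.twitter.util.Future

object InteractionStyles {
  // Synchronous: one microservice calls another over HTTP and waits on the response.
  val tracks: Service[Request, Response] = Http.newService("tracks-service:8080")

  def trackMetadata(trackId: Long): Future[Response] =
    tracks(Request(Method.Get, s"/tracks/$trackId"))

  // Asynchronous: a service publishes an event and moves on; consumers react later.
  case class TrackUploaded(trackId: Long, userId: Long)

  trait EventPublisher {
    def publish(topic: String, event: TrackUploaded): Future[Unit]
  }

  def onUpload(publisher: EventPublisher, trackId: Long, userId: Long): Future[Unit] =
    publisher.publish("track-events", TrackUploaded(trackId, userId))
}
```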
Let’s review briefly. We are presented with a problem: we are unable to confidently make changes to the monolithic system powering SoundCloud, and the growth of its database schema exceeds our ability to support it. We therefore need to decommission the monolith, and we decided to migrate to a microservices architecture. Our plan for migrating away from the monolith to an architecture based on microservices has been:
This strategy has given us some great results. We have drastically reduced the time and the number of decisions required to create a new service. These services benefit from world-class telemetry capabilities provided by Prometheus; confidence in deployments thanks to Docker, Kubernetes, and the work we do to improve testing of service interactions; and an efficient and powerful network stack in Finagle. Work done in any of these areas has a powerful effect on the productivity of every engineer using these tools and libraries. If an engineer working in the core group makes a change that increases the productivity of many other engineers, the net effect can be equal to hiring an additional engineer. As we grow, these small effects add up to dramatic improvements visible across the entire engineering organization.
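As an illustration of what that telemetry support looks like at the code level, here is a rough sketch of a Finagle filter that counts requests using the standard Prometheus Java client. The metric and label names are made up for this example, and this is not our internal instrumentation library, only the general shape of what the stack makes cheap to do.

```scala
import com.twitter.finagle.{Service, SimpleFilter}
import com.twitter.finagle.http.{Request, Response}
import com.twitter.util.Future
import io.prometheus.client.Counter

object Telemetry {
  // A request counter, labeled by path and status, for Prometheus to scrape.
  val requests: Counter = Counter.build()
    .name("http_requests_total")
    .help("HTTP requests served, by path and status code.")
    .labelNames("path", "status")
    .register()

  // A filter that instruments any HTTP service it wraps.
  class RequestCounting extends SimpleFilter[Request, Response] {
    def apply(req: Request, service: Service[Request, Response]): Future[Response] =
      service(req).onSuccess { rep =>
        requests.labels(req.path, rep.status.code.toString).inc()
      }
  }

  // Usage: an instrumented service is just the original service behind the filter.
  // val instrumented: Service[Request, Response] = new RequestCounting().andThen(myService)
}
```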
As a strategy for decommissioning our monolith, though, this approach has not served us well. We have noticed that:
That being said, we have been able to extract some services. Most often, this has happened in teams where engineers have experience with the monolith. In these cases, the engineers were able to make the case for an extraction project: the feature they were beginning to work on required integration with the monolith, and could only be expected to work for as long as the monolith continued to work. With database growth rates as they are, and diminishing knowledge of the codebase, this was a short enough period that the decision to do an extraction project was clear.
In many other cases, though, developers were not able to make this case, or were not even aware that they should, because they felt so removed from the monolith. In these cases, we have even seen services bypass the monolith and access its database directly, because the developers working on a new feature lacked the knowledge needed to modify its code and were far quicker on our newer, better-supported Scala stack. When we later come to work on an extraction, services that access the database directly present a problem for the project: they require a special negotiation about how to integrate, rather than the standard “from this date on, please use the Foo service for this endpoint, rather than the monolith.” Our investment in making services easier to build has created the conditions for behaviors that make it harder to decommission our monolith.
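To show why this matters for extraction, here is a rough sketch contrasting the two integration styles. Every name in it (the database URL, the tracks table, the Foo service and its endpoint) is hypothetical; the point is only the kind of dependency each style creates.

```scala
import java.sql.DriverManager
import com.twitter.finagle.{Http, Service}
import com.twitter.finagle.http.{Method, Request, Response}
import com.twitter.util.Future

object FooIntegration {
  // Bypassing the monolith: reading its tables directly couples this feature to a
  // schema that an extraction project will later need to change or retire.
  def titleFromMonolithDb(trackId: Long): Option[String] = {
    val conn = DriverManager.getConnection("jdbc:mysql://monolith-db/mainapp")
    try {
      val stmt = conn.prepareStatement("SELECT title FROM tracks WHERE id = ?")
      stmt.setLong(1, trackId)
      val rs = stmt.executeQuery()
      if (rs.next()) Some(rs.getString("title")) else None
    } finally conn.close()
  }

  // Integrating over HTTP: when ownership moves to the Foo service, callers only
  // repoint a hostname rather than renegotiating access to private tables.
  val foo: Service[Request, Response] = Http.newService("foo-service:8080")

  def titleFromFooService(trackId: Long): Future[Response] =
    foo(Request(Method.Get, s"/tracks/$trackId"))
}
```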
How, then, should we revise our strategy, and decommission the monolith faster? We are exploring a new approach:
We hope that this approach will get us to the goal of decommissioning our monolith faster and continue to make use of the investment we have made in our microservices ecosystem, while addressing the forces that led that investment to work against this goal.
Are you working to decommission a monolith and move to microservices? We’d love to hear from you about ideas or approaches which have worked for you.