This is the story of how we used the Strangler pattern to migrate our public API from a monolithic codebase to a fully fledged BFF over the course of eight years. It also discusses some of the trials and tribulations we encountered along the way.
SoundCloud started as a single Ruby on Rails application more than 14 years ago. Back then, this single application served the website and the public API. While going through multiple growth phases — both in terms of user traffic and the size of the engineering team — we made the choice to more formally adopt a type of service architecture commonly referred to as microservices. This move promised to unblock engineering teams and widen technology choices, in turn allowing us to pick appropriate tools for handling our scaling challenges.
After the Cambrian explosion of languages, frameworks, and approaches, the engineering organization started to consolidate, and Scala became the default language for delivering our microservice architecture. That’s because it’s underpinned by Twitter’s Finagle as the RPC framework for interservice communication, and we knew we wanted to use Finagle.
In the new reality of a microservices architecture, where some new features now existed outside of the Rails application, and some services supplemented the existing features of the Rails application, we needed to decide how to maintain the public API going forward. It was crucial to ensure that important features — like serving content — continued to work for existing integrations, even though their implementations had changed and now spanned multiple services.
We tried various approaches and learned that our Rails application didn’t perform well when interacting with multiple microservices to serve user traffic. As a result, in 2014, we made the decision to not integrate it with other services to serve the public API, but to instead build a Scala service using Finagle that would internally proxy requests to the existing public API. This new service would intercept and augment the public API responses by calling additional services when necessary, (somewhat loosely) following the Strangler pattern.
Typically, the goal of the Strangler pattern is to incrementally replace the functionality of one system with the functionality of a new, more desirable system or systems, one piece at a time. It’s commonly used when migrating from a monolithic codebase to a microservice architecture. Although this end goal informed the original decision to adopt the Strangler pattern, our choice to use it was more motivated by an immediate need rather than planning for a future free of the public API monolith.
As a result, the Strangler was left, along with the monolith, largely unmaintained while feature development on our internal APIs continued at pace. To facilitate the continued development of our internal APIs, it became necessary to duplicate code paths for accessing core entities, e.g. tracks, playlists, users, etc. This meant one code path for internal clients and one for the public API. In addition to the obvious downside of this duplication, inconsistencies between the two APIs also emerged. A lack of maintenance also meant knowledge loss, security issues from exposing the monolith with deprecated Rails versions via transparent proxying from the Strangler, and scope creep due to feature teams often needing to touch the Strangler and/or the monolith without much prior knowledge.
As the business matured further and new investments in the public API were planned, the case to address the current situation became compelling, and the work was scoped to complete the migration of the Strangler to a fully-fledged BFF. In January 2020, after a six-year period in which modest progress was made, the work began in earnest.
Importantly, two preliminary steps helped us reduce the scope of the work.
Without these preliminary measures, completing this project may have been unrealistic.
Endpoints could be ported with varying degrees of difficulty. In some cases, there were existing reference implementations in other BFFs, e.g. for web or mobile, that could be used as a guide. In others, a complete rewrite was needed, and often it was necessary to add missing functionality to downstream microservices.
Due to the lack of experience within the company with the public API codebase, it was actually necessary to first investigate and document the full list of endpoints that the public API exposed. This involved:
Also, some things that come for free (or “magic”) in Ruby were things we needed to implement ourselves — for example, multipart request parameter parsing. Furthermore, Rails doesn’t need to be explicit about all route and Content-Type
combinations it supports. This sometimes led to unpleasant surprises during porting as it became clear that entire chunks of functionality remained to be implemented.
To build up confidence in ported endpoints, the process typically goes like this:
Of course, this is only really possible for non-mutating methods, i.e. GET
or HEAD
requests. Otherwise, the code might end up creating two entries in the database for a single request. To deal with mutating methods, i.e. PUT
or POST
requests, it was sometimes necessary to perform extensive manual regression testing. Not everything went as smoothly as we would’ve liked, and the use of rollout flags also proved useful for quickly disabling the ported code where problems did occur.
Decisions need to be made based on data, and telemetry was key in informing such decisions. For example, undocumented endpoints that receive minimal usage can sometimes be deprecated and later removed. The less code to port and maintain, the better.
Adopting the Strangler pattern comes with significant risks. As mentioned, there was initially a long fallow period where not much work was done to port the public API endpoints to the Strangler. During this period, some knowledge about the project was lost and the Strangler in fact added some complexity for teams developing new features. If you decide to adopt the Strangler pattern, make sure to have a plan to complete the migration before the knowledge is lost and it becomes a daunting task with increased risk.
As with any kind of rewrite, even an incremental one, bugs can occur. The nature of the public API — being accessible to anyone who wishes to use it — meant that sometimes things broke, and sometimes for seemingly innocuous changes. In some instances, the tenet of Hyrum’s Law came into play, where breakages occurred due to third-party integrations relying on undocumented aspects of the APIs.
It’s worth considering whether such disruptions to your business are worth the ultimate benefits of the work.
The work wasn’t without its challenges, and anyone embarking on such a project should be aware of the risks and benefits involved. However, now that all endpoints have finally been ported, there are some notable benefits: the Strangler is now a fully fledged BFF; the entire codebase of the public API has been deleted; and we have a codebase that most engineers can contribute to (Scala service), that doesn’t negatively impact project scope, that fits with our microservice architecture, and that helps ensure data consistency and security.