The End of the Public API Strangler

March 14th, 2022 by Dónal O'Brien and Jorge Creixell

This is the story of how we used the Strangler pattern to migrate our public API from a monolithic codebase to a fully fledged BFF over the course of eight years. It also discusses some of the trials and tribulations we encountered along the way.

History

SoundCloud started as a single Ruby on Rails application more than 14 years ago. Back then, this single application served the website and the public API. While going through multiple growth phases — both in terms of user traffic and the size of the engineering team — we made the choice to more formally adopt a type of service architecture commonly referred to as microservices. This move promised to unblock engineering teams and widen technology choices, in turn allowing us to pick appropriate tools for handling our scaling challenges.

After the Cambrian explosion of languages, frameworks, and approaches, the engineering organization started to consolidate, and Scala became the default language for delivering our microservice architecture, underpinned by Twitter’s Finagle as the RPC framework for interservice communication.

Motivation

In the new reality of a microservices architecture, where some new features now existed outside of the Rails application, and some services supplemented the existing features of the Rails application, we needed to decide how to maintain the public API going forward. It was crucial to ensure that important features — like serving content — continued to work for existing integrations, even though their implementations had changed and now spanned multiple services.

We tried various approaches and learned that our Rails application didn’t perform well when interacting with multiple microservices to serve user traffic. As a result, in 2014, we decided not to integrate it with other services to serve the public API, but instead to build a Scala service using Finagle that would internally proxy requests to the existing public API. This new service would intercept and augment the public API responses by calling additional services when necessary, following the Strangler pattern somewhat loosely.
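The intercept-and-augment idea can be sketched in a few lines. This is a minimal illustration in Python rather than the Scala and Finagle code the service was actually built with; `make_strangler`, `legacy_api`, `add_play_count`, and the response shape are all hypothetical names invented for the example.

```python
from typing import Any, Callable, Dict

Response = Dict[str, Any]


def make_strangler(
    legacy_api: Callable[[str], Response],
    augmenters: Dict[str, Callable[[Response], Response]],
) -> Callable[[str], Response]:
    """Build a handler that proxies to the legacy API and augments some paths."""

    def handle(path: str) -> Response:
        response = legacy_api(path)      # every request is proxied to the monolith
        augment = augmenters.get(path)   # hypothetical per-path augmentation hook
        return augment(response) if augment else response

    return handle


# Hypothetical stand-ins for the monolith and for a call to a newer service.
def legacy_api(path: str) -> Response:
    return {"path": path, "title": "My track"}


def add_play_count(response: Response) -> Response:
    return {**response, "play_count": 42}


handler = make_strangler(legacy_api, {"/tracks/1": add_play_count})
```

Paths without an augmenter pass through untouched, which is what makes the proxy transparent to existing integrations.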

Typically, the goal of the Strangler pattern is to incrementally replace the functionality of one system with the functionality of a new, more desirable system or systems, one piece at a time. It’s commonly used when migrating from a monolithic codebase to a microservice architecture. Although this end goal informed the original decision to adopt the Strangler pattern, our choice to use it was motivated more by an immediate need than by a plan for a future free of the public API monolith.

As a result, the Strangler was left, along with the monolith, largely unmaintained while feature development on our internal APIs continued at pace. To facilitate the continued development of our internal APIs, it became necessary to duplicate code paths for accessing core entities, e.g. tracks, playlists, users, etc. This meant one code path for internal clients and one for the public API. In addition to the obvious downside of this duplication, inconsistencies between the two APIs also emerged. A lack of maintenance also meant knowledge loss, security issues from exposing the monolith with deprecated Rails versions via transparent proxying from the Strangler, and scope creep due to feature teams often needing to touch the Strangler and/or the monolith without much prior knowledge.

As the business matured further and new investments in the public API were planned, the case for addressing the situation became compelling, and the work was scoped to complete the migration of the Strangler to a fully fledged BFF. In January 2020, after a six-year period in which modest progress was made, the work began in earnest.

Porting from the Public API to the BFF

Importantly, two preliminary steps helped us reduce the scope of the work.

Some of the official apps were still making some direct calls to the public API. These were migrated to use the official BFFs, which enabled us to shrink the API surface, and hence the scope, significantly.
Much of the common functionality in the BFFs was consolidated in Value-Added Services, further reducing the scope of the work.

Without these preliminary measures, completing this project might have been unrealistic.

Endpoints could be ported with varying degrees of difficulty. In some cases, there were existing reference implementations in other BFFs, e.g. for web or mobile, that could be used as a guide. In others, a complete rewrite was needed, and often it was necessary to add missing functionality to downstream microservices.

Challenges

Due to the lack of experience within the company with the public API codebase, it was actually necessary to first investigate and document the full list of endpoints that the public API exposed. This involved:

Adding telemetry to understand which endpoints were still in use.
Explicitly declaring all known public API routes in the Strangler codebase.
Adding a fallback to call the public API for any undeclared routes.
Removing the fallback once we were confident we had identified all routes.
Removing routes that weren’t in use and weren’t documented on the developer portal.
Creating a Jira ticket for each endpoint to be ported, i.e. reimplemented in the Strangler by calling existing microservices instead of the public API.
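The route discovery steps above can be sketched as a route table that counts hits and falls back to the legacy API for anything undeclared. This is a hedged illustration in Python, not the actual Strangler code; `RouteTable`, its methods, and the in-memory counters standing in for telemetry are assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Route = Tuple[str, str]  # (HTTP method, path)


class RouteTable:
    """Explicitly declared routes, plus an optional fallback to the legacy API."""

    def __init__(self, fallback: Callable[[str, str], str], fallback_enabled: bool = True):
        self._routes: Dict[Route, Callable[[], str]] = {}
        self._fallback = fallback
        self.fallback_enabled = fallback_enabled
        self.hits: Counter = Counter()             # usage of declared routes
        self.undeclared_hits: Counter = Counter()  # routes we did not know about

    def declare(self, method: str, path: str, handler: Callable[[], str]) -> None:
        self._routes[(method, path)] = handler

    def dispatch(self, method: str, path: str) -> Tuple[int, str]:
        route = (method, path)
        handler = self._routes.get(route)
        if handler is not None:
            self.hits[route] += 1
            return 200, handler()
        # Undeclared route: record it so it can be declared (or dropped) later.
        self.undeclared_hits[route] += 1
        if self.fallback_enabled:
            return 200, self._fallback(method, path)
        return 404, "not found"

    def unused_routes(self) -> List[Route]:
        """Declared routes that have received no traffic so far."""
        return [route for route in self._routes if self.hits[route] == 0]
```

Once `undeclared_hits` stays empty for long enough, `fallback_enabled` can be switched off, and `unused_routes()` gives a starting list of candidates for removal.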

Also, some things that come for free (or “magic”) in Ruby were things we needed to implement ourselves — for example, multipart request parameter parsing. Furthermore, Rails doesn’t need to be explicit about all route and Content-Type combinations it supports. This sometimes led to unpleasant surprises during porting as it became clear that entire chunks of functionality remained to be implemented.
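To make the point about explicitness concrete, here is a minimal Python sketch of a service that has to list every route and Content-Type combination it accepts. The `SUPPORTED` table, its entries, and `check_request` are hypothetical and are not taken from the real API.

```python
from typing import Dict, Set, Tuple

# Hypothetical declaration of which media types each route accepts.
SUPPORTED: Dict[Tuple[str, str], Set[str]] = {
    ("POST", "/tracks"): {"multipart/form-data", "application/x-www-form-urlencoded"},
    ("PUT", "/me"): {"application/json"},
}


def check_request(method: str, path: str, content_type: str) -> int:
    """Return the HTTP status an explicit router would give this request."""
    allowed = SUPPORTED.get((method, path))
    if allowed is None:
        return 404  # route was never declared
    # Drop parameters such as "; boundary=..." before comparing media types.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return 200 if media_type in allowed else 415  # 415 Unsupported Media Type
```

Any combination missing from such a table surfaces as a failure in the ported service, even though the Rails application accepted it implicitly.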

Response Comparisons

To build up confidence in ported endpoints, the process typically went like this:

The ported implementation gets deployed alongside the old code that proxies to the public API.
Incoming requests execute both codepaths — the old (using the proxy) and the new code.
The response of the proxy’s call to the public API gets returned to the caller.
At the same time, the responses of the proxy and the new code are compared for consistency.
If the responses of the old and new code don’t match, a telemetry event is triggered and the difference is logged for inspection by the developer.
The developer may then need to make some changes to the ported implementation until they’re confident that the new code matches the original in terms of functionality.
At this point, the proxy can be removed and the ported response gets returned.
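The comparison steps above can be sketched as follows. This is a simplified Python illustration, not the real implementation: `compare_and_serve` and the `on_mismatch` callback (standing in for the telemetry event and log line) are hypothetical, and the two code paths run one after the other here rather than at the same time.

```python
from typing import Any, Callable


def compare_and_serve(
    request: Any,
    legacy: Callable[[Any], Any],
    ported: Callable[[Any], Any],
    on_mismatch: Callable[[Any, Any, Any], None],
) -> Any:
    """Run both code paths, report differences, and always return the legacy response."""
    old = legacy(request)
    try:
        new = ported(request)
        if new != old:
            on_mismatch(request, old, new)
    except Exception as exc:
        # A crash in the new code must never affect the caller.
        on_mismatch(request, old, exc)
    return old
```

Because the caller only ever sees the legacy response, a wrong or failing ported implementation costs nothing more than a logged difference.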

Of course, this is only really possible for non-mutating methods, i.e. GET or HEAD requests. Otherwise, the code might end up creating two entries in the database for a single request. To deal with mutating methods, e.g. PUT or POST requests, it was sometimes necessary to perform extensive manual regression testing. Not everything went as smoothly as we would’ve liked, and rollout flags proved useful for quickly disabling the ported code where problems did occur.
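One way to picture how method safety and rollout flags interact is a small decision function. This is a hypothetical Python sketch: `plan`, the per-endpoint `flags` dictionary, and the three mode names are assumptions for illustration, not the flag system actually used.

```python
from typing import Dict

SAFE_METHODS = frozenset({"GET", "HEAD"})


def plan(method: str, endpoint: str, flags: Dict[str, bool]) -> str:
    """Decide how to serve a request: 'ported', 'compare', or 'legacy'."""
    if flags.get(endpoint, False):
        return "ported"    # flag on: serve the new code only
    if method.upper() in SAFE_METHODS:
        return "compare"   # safe to run both paths and return the legacy response
    return "legacy"        # mutating request: never execute it twice
```

Switching an endpoint's flag back to `False` returns it to the legacy path straight away, which is the quick off switch described above.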

Learnings

Decisions need to be made based on data, and telemetry was key in informing such decisions. For example, undocumented endpoints that receive minimal usage can sometimes be deprecated and later removed. The less code to port and maintain, the better.

Adopting the Strangler pattern comes with significant risks. As mentioned, there was initially a long fallow period where not much work was done to port the public API endpoints to the Strangler. During this period, some knowledge about the project was lost and the Strangler in fact added some complexity for teams developing new features. If you decide to adopt the Strangler pattern, make sure to have a plan to complete the migration before the knowledge is lost and it becomes a daunting task with increased risk.

As with any kind of rewrite, even an incremental one, bugs can occur. The nature of the public API — being accessible to anyone who wishes to use it — meant that sometimes things broke, and sometimes for seemingly innocuous changes. In some instances, the tenet of Hyrum’s Law came into play, where breakages occurred due to third-party integrations relying on undocumented aspects of the APIs.

It’s worth considering whether such disruptions to your business are worth the ultimate benefits of the work.

Conclusion

The work wasn’t without its challenges, and anyone embarking on such a project should be aware of the risks and benefits involved. However, now that all endpoints have finally been ported, there are some notable benefits: the Strangler is now a fully fledged BFF; the entire codebase of the public API has been deleted; and we have a codebase that most engineers can contribute to (Scala service), that doesn’t negatively impact project scope, that fits with our microservice architecture, and that helps ensure data consistency and security.