SoundCloud Echo: Next-Level Humane Registry with Backstage

September 19th, 2022 by Julio Zynger

SoundCloud adopted Backstage

We’ve seen considerable adoption of our internal humane registry, Services Directory, since we sanitized ownership definition and introduced an easier-to-use user interface (UI) to our development infrastructure. Namely, managing backend feature toggles (or as we call them, rollouts) and visualizing interservice dependencies have been helpful in bringing engineering managers, product managers, and engineers closer, not only in their day-to-day work but also when communicating with other teams in the organization.

Dependency diagram

With higher engagement, we also got feedback and tons of ideas for extending Services Directory and making it even more useful.

When talking to engineers at other tech companies, and also internally to our recent hires, we repeatedly heard that “organizing the software inventory” was a problem companies had solved in different shapes and forms, just like we did. At approximately the same time we were investing in Services Directory, the industry saw the emergence of open source alternatives for building developer portals. Our team analyzed options, weighed the cost of maintaining our own Services Directory, and finally decided to migrate (and contribute) to externally developed software.

In this post, we’ll describe how we took Services Directory to the next level and migrated our internal system to the open source Backstage developer portal technology, and how we’re now providing a lot more capabilities and empowering our engineering team, in turn speeding up software delivery and engineering effectiveness.

Adoption Path

Like most pieces of work that influence other teams’ workflows and critical-path tooling, we wrote an RFC document to collect thoughts and establish a migration path out of Services Directory and into Backstage.

One Pager

The key idea and deliverable was to provide a narrow but high-value use case, and only later expand to new features and start evangelizing the platform. As mentioned in our last post on the topic, we wanted Backstage to replace Services Directory in its core value proposition:

The registry must enable finding out which systems a team owns and who the owners of given systems are.

Naming Is Hard

As we set out with a plan to build a prototype for what Backstage could look like at SoundCloud, we also had to make a naming decision. Although Backstage is a compelling name, SoundCloud — being an audio platform — already had at least two other things with that name (including this very blog). We didn’t want to make things even more confusing, so after long deliberation, we settled on SoundCloud Echo for the codename of our Backstage instance. Not only does “echo” have a relation to sound in general, but it’s also one of the very first commands engineers write on their path to software development:

$ echo 'Hello World!'

Bringing In Some Glue

We went through a few steps to get to what we considered our adoption milestone.

Getting started with Backstage meant making our platform team acquainted with how the platform looks and works, along with becoming familiar with its community. With access to the maintainers and other project contributors, we also identified opportunities to learn the long-term project roadmap and unblock ourselves if we ran into questions. We were quickly able to identify which bits would smoothly connect to our existing infrastructure and which bits would require some “glue.”

Database

The first challenge was a fundamental one: the database.

As described in a previous post, the golden paths at SoundCloud evolved around MySQL as our choice for relational database management systems (RDBMS). We have experts within the company that have plenty of knowledge on how to provision, query, and maintain MySQL clusters. On top of that, we’ve built plenty of supporting infrastructure for this technology.

What we found, though, is that Backstage can work out of the box with PostgreSQL. Even though there’s interest from the open source community in providing Backstage with support for running on MySQL, it remains unavailable. We ran an internal proof of concept to estimate the work it would take for us to implement that support, but we deemed going forward too risky, especially given that we would’ve had to play catch-up with the upstream project.

The alternative was to rely on Cloud resources. Instead of provisioning a PostgreSQL cluster ourselves, we set up a Postgres database node in Google’s Cloud SQL, which our Backstage instance reaches through a proxy deployed as a sidecar container in the same Kubernetes pod.

Cloud SQL proxy

The Cloud SQL Auth proxy works by having a local client running in the local environment. Our instance communicates with the proxy through the standard database protocol. The Cloud SQL Auth proxy uses a secure tunnel to communicate with its companion process running on the server.

This way, our app can use a PostgreSQL database, and we get the benefits of managed resources in the Cloud, such as automatic backups, data replication, and telemetry. Most importantly, engineers operating the project face a gentler learning curve with the PostgreSQL setup. If Backstage supports MySQL as a database engine in the future, we can then decide whether to move over.
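As a rough sketch of what the sidecar setup means for the application: because the proxy listens inside the same pod, the app can treat the database as a local PostgreSQL server. The function name, the environment variable names, the default port, and the default database name below are all assumptions for illustration, not our actual configuration.

```typescript
// Hypothetical connection settings; the field names follow common
// Node.js PostgreSQL client conventions and are assumptions here.
interface PgConnection {
  host: string;
  port: number;
  user: string;
  password: string;
  database: string;
}

// Builds connection settings that target the sidecar proxy on localhost
// instead of addressing the Cloud SQL instance directly.
function proxyConnection(env: Record<string, string | undefined>): PgConnection {
  const required = (key: string): string => {
    const value = env[key];
    if (value === undefined || value === "") {
      throw new Error(`Missing required setting: ${key}`);
    }
    return value;
  };
  return {
    host: "127.0.0.1", // the proxy runs in the same pod (assumed address)
    port: Number(env["POSTGRES_PORT"] ?? "5432"), // assumed default port
    user: required("POSTGRES_USER"),
    password: required("POSTGRES_PASSWORD"),
    database: env["POSTGRES_DB"] ?? "echo", // assumed database name
  };
}
```

The application never needs to know the Cloud SQL instance address or manage certificates itself; that concern stays with the proxy container.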

Catalog Descriptors

So we have storage for the catalog, but how do we get the data in?

As we saw in our previous post, Services Directory already relied on a background worker job indexing “manifest files,” which were essentially JSON files with a predefined schema. We were positioned especially well for the migration to Backstage because we had done the work to standardize and sanitize all these files across SoundCloud repositories, which meant we could make some assumptions about their shape when ingesting them into our new PostgreSQL database.

Backstage’s catalog feature is built around the concept of metadata YAML files, so our job here was to write a parser that could translate our JSON manifests into Backstage’s entity metadata objects by hooking into the entity processing loop.
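To give a feel for that translation, here is a minimal sketch of mapping one manifest to a catalog entity. The `ServiceManifest` shape is hypothetical (the real manifest schema isn’t shown in this post), and the entity type covers only a small subset of a Backstage Component; wiring the function into the processing loop is left out.

```typescript
// Hypothetical in-house manifest shape; the real schema is not shown here.
interface ServiceManifest {
  name: string;
  description?: string;
  team: string;
  tier?: "production" | "experimental";
}

// Minimal subset of a Backstage Component entity.
interface ComponentEntity {
  apiVersion: "backstage.io/v1alpha1";
  kind: "Component";
  metadata: { name: string; description?: string };
  spec: { type: string; lifecycle: string; owner: string };
}

// Translates a parsed JSON manifest into a catalog entity.
function manifestToEntity(manifest: ServiceManifest): ComponentEntity {
  // Entity names allow a limited character set, so normalize defensively.
  const name = manifest.name
    .toLowerCase()
    .replace(/[^a-z0-9._-]+/g, "-")
    .replace(/^-+|-+$/g, "");
  if (name === "") {
    throw new Error("Manifest name is empty after normalization");
  }
  const metadata: ComponentEntity["metadata"] = { name };
  if (manifest.description !== undefined) {
    metadata.description = manifest.description;
  }
  return {
    apiVersion: "backstage.io/v1alpha1",
    kind: "Component",
    metadata,
    spec: {
      type: "service", // assumption: every manifest describes a service
      lifecycle: manifest.tier ?? "production", // assumed default
      owner: manifest.team,
    },
  };
}
```

Because the manifests had already been standardized, a translation like this can stay small: it only has to handle one known input shape.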

It was quite pleasing to see how quickly we got to a working prototype of Backstage that indexes most of SoundCloud’s software components.

This was also an enabler for our platform team: With real-life data coming in, we could speed up our exploration and experimentation and identify special-case scenarios even before we rolled Backstage out to the rest of our engineering organization. Just like with human relationships, with developer tooling, sometimes the first impression is the only impression, and we aimed to push that bar very high to drive adoption among teams.

Months later, when we had already established Backstage as a replacement for Services Directory, we ended up substituting our manifest files with Backstage catalog descriptor files. In addition to making the custom-made parser obsolete, using the format from the open source project meant we got access to plenty of new features while still reducing our internal maintenance burden: No custom internal format means no learning curve and no work to keep all files following the schemas.

That change was transparent to our engineers: Our platform team ran a bulk change that automated the translation from JSON to YAML and pushed these files to all repositories in the organization. That’s yet another lesson for successful adoption of a tool on the engineering toolbelt: Users love new features, especially when they have to do no work to get them. By hooking into the entity processing pipeline, we were able to stitch in several quality-of-life improvements for catalog users. Soon enough, we had most of our APIs cataloged, we had public repositories, and we could also support indexing of monorepos.

Backend Integrations

With the catalog up and running, we moved our attention to external integrations. Services Directory catalog’s main supporting feature was to provide a friendlier UI for engineers to interact with our backend feature toggles.

We took that as an opportunity to get acquainted with the different system interactions within Backstage: how we could instruct Echo’s backend to communicate with an already-existing API, and how easy it would be to design and code a frontend plugin for that integration.

Example backend integration

The fact that Backstage uses a well-thought-out and documented design system sped up our efforts considerably. On top of that, having a React codebase meant we could lean on both the existing plethora of open source UI libraries and components, and the extensive knowledge of our internal team of web engineers.

For the first time, “word was out,” and we started seeing excitement within the organization for what potential Echo/Backstage would unlock.

Maintenance, Ownership, Updates

Naturally, at SoundCloud, each team is able to operate on its own and make the best decisions for themselves. Our platform teams provide guardrails and support common use cases that we consider our golden paths, but we don’t want to mandate technology choices for everyone else. Backstage fits perfectly into that philosophy, and its plugin mechanism serves as a way for our engineers to build value together.

On top of that, we recognize how some teams are way better suited than our platform team to drive the vision for their own software ownership. In that sense, the plugin for interacting with our build system is provided by the team that owns the build systems, the plugin that shows active or past A/B tests is owned by the “experimentations” team, etc.

Echo Plugin model

Our platform team is constrained in people, time, and resources, so the co-ownership model works very well for us. With more team autonomy, we’ve seen fast adoption, growth, and also interest in contributing to Echo across the engineering organization. Meanwhile, our platform team continues to be responsible for the infrastructure of, running upgrades on, and keeping a vision of the longer-term plans for Backstage and Echo at SoundCloud — along with its limitations and opportunities.

Powering Up Engineers

Echo gets more powerful every time we deploy an upgrade to the underlying Backstage infrastructure, but over time, we’ve collected and implemented several plugins of our own that had a positive impact on our engineering team’s day to day — whether that’s by bridging communication or helping service operations. Here are a few examples of “power-ups” we’ve built to aid our engineering organization.

Entity Augmentations

Unlike most entities in the catalog, Groups and Users are sourced from external “entity providers.” In our current implementation, that means GitHub.

The entity provider modeling can differ from the Backstage modeling, so we allow entity augmentation by merging the source-of-truth data (GitHub) with manually curated fields from the YAML entity envelope. In most cases, these will be annotations:

apiVersion: soundcloud.com/v1alpha1
kind: GroupAugmentation
metadata:
  name: beep-team
  annotations:
    soundcloud.com/team-readme: confluence:139924197
    soundcloud.com/team-alerts-email: team-email-for-alerts@sc.com

During the next catalog-refresh cycle, Echo will fetch available augmentations and merge them with existing entries in its catalog. The final shapes of the merged entities become available through Echo’s UI and its API.
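The merge step could look roughly like the sketch below. The types are simplified to the fields the merge touches, the function name is hypothetical, and the precedence rule (curated values win on key conflicts) is an assumption rather than a description of Echo’s actual behavior.

```typescript
// Minimal entity shape for this sketch: only the fields the merge touches.
interface EntityLike {
  kind: string;
  metadata: { name: string; annotations?: Record<string, string> };
}

// Hypothetical augmentation shape mirroring the YAML example above.
interface GroupAugmentation {
  kind: "GroupAugmentation";
  metadata: { name: string; annotations?: Record<string, string> };
}

// Returns a new Group entity with the augmentation's annotations merged in.
// Assumption: manually curated values win when both sides define a key.
function applyAugmentation(
  group: EntityLike,
  augmentation: GroupAugmentation,
): EntityLike {
  const isTarget =
    group.kind === "Group" &&
    group.metadata.name === augmentation.metadata.name;
  if (!isTarget) {
    return group; // not the target of this augmentation; leave it untouched
  }
  return {
    ...group,
    metadata: {
      ...group.metadata,
      annotations: {
        ...(group.metadata.annotations ?? {}),
        ...(augmentation.metadata.annotations ?? {}),
      },
    },
  };
}
```

Returning a new object instead of mutating the input keeps the provider’s data intact, so the source of truth and the curated overlay remain distinguishable.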

This feature lets Echo’s entities reference external systems, which opens the door to further integrations.

In the example above, we point to a page in our Confluence instance to render a “team README” with basic information about the team, contacts, ways of working, etc. The interesting aspect is that the content doesn’t need to be moved or replicated within Echo; the data source continues to be Confluence, where our non-technical audience is comfortable, but Echo, home of engineering, can read, display, and include information in its search index.

Echo Team Readme plugin

Another existing integration is with Prometheus’ Alertmanager. Since Echo exposes its catalog through a queryable API, annotations in the entity augmentation can inform our Alertmanager deployment about notification receivers — these can be email (as exemplified above), Slack, PagerDuty, etc.

The benefit of using entity augmentations is that of data centralization: When, unavoidably, a company reorganization happens and a team gets created or renamed, the team manager updates a single file and our automation kicks in for the dependent systems. In the Alertmanager use case, this means no manual interaction to configure receivers. Similar workflows can be established for automating the creation of pull requests in selected repositories, or even infrastructure provisioning.
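For the Alertmanager use case, the automation can be pictured as a function from catalog groups to receiver definitions. This is a sketch under assumptions: the function and type names are hypothetical, only email receivers are covered, and how the result is rendered into Alertmanager’s configuration file and reloaded is out of scope.

```typescript
// Annotation key taken from the augmentation example above.
const ALERTS_EMAIL = "soundcloud.com/team-alerts-email";

// Only the fields this sketch reads from a catalog Group.
interface GroupEntity {
  metadata: { name: string; annotations?: Record<string, string> };
}

// Subset of an Alertmanager receiver definition (email only).
interface Receiver {
  name: string;
  email_configs: { to: string }[];
}

// Derives one email receiver per group that declares an alerts address.
function receiversFromGroups(groups: GroupEntity[]): Receiver[] {
  const receivers: Receiver[] = [];
  for (const group of groups) {
    const to = group.metadata.annotations?.[ALERTS_EMAIL];
    if (to === undefined || to === "") {
      continue; // skip groups without an alerts address
    }
    receivers.push({ name: group.metadata.name, email_configs: [{ to }] });
  }
  return receivers;
}
```

When a team is renamed, the receiver name follows from the catalog on the next run, which is the “single file” update described above.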

Daily Ritual Support

As the engineering home page, Echo also supports recurring tasks individual contributors run through. Within the teams, that can be checking open pull requests across all of their ownership domain, or even selecting a team member at random to take notes in a retrospective meeting. These capabilities can be especially useful when we consider the interactions that happen across teams as well.

Randomly picked team member

As an example project workflow, an engineer can visit action item lists from previous incident review meetings and check whether they’re assigned to anyone, write and publish an RFC outlining a possible architectural change as a solution, and finally announce the system migration to the rest of the engineering organization so dependent teams can prepare accordingly.

Announcements inbox

All of these are supported by Echo in one way or another, and the long-term vision is to continue to identify intra- and inter-team interactions to standardize solutions to speed up the engineering practice end to end.

Domain-Specific Technical Health

Especially important as we consolidate our golden paths and internal tech radar, Echo also helps increase visibility of our team’s ongoing operations. The data set is an aggregation of sources like Prometheus, Zoekt, GitHub, CodeScene, and others. Through the modeling of the catalog, we then further slice our software components across our business domains and establish a maturity grade based on what we consider best practices or technical debt. We’ve also been exploring building a “migration tracker” similar to what Spotify describes in this post.

From the perspective of a single service, these insights can be useful for individual contributors. Now, take that to the organizational level: Through historical data and Echo’s visualizations aggregating services, teams, and domains, the dataset is also informative to engineering managers preparing for their next planning meeting, and to leaders looking to build an ROI argument for selecting a domain in which to work on accumulated technical debt.

Down the Line

We’ve seen more engagement within teams and toward our platform team as more engineers got acquainted with Echo and brought it into their daily workflows. Often, conversations bubble around feature requests and plugin ideas.

These proposals feed back into our own internal roadmap and help us define KPIs to measure the success of the platform. From a product perspective, we’ve been tracking page views and DAU/MAU to establish where to invest or what to defund.
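As an illustration of the DAU/MAU ratio, the sketch below computes it from raw page-view events. The event shape and function name are hypothetical, and the 30-day UTC window is one common convention, not necessarily the definition our analytics use.

```typescript
// Hypothetical page-view event; the field names are assumptions.
interface PageView {
  userId: string;
  timestamp: Date;
}

// Computes DAU/MAU: distinct users active on `day`, divided by distinct
// users active in the 30 days ending on (and including) `day`, in UTC.
function dauOverMau(views: PageView[], day: Date): number {
  const dayMs = 24 * 60 * 60 * 1000;
  const dayStart = Date.UTC(
    day.getUTCFullYear(),
    day.getUTCMonth(),
    day.getUTCDate(),
  );
  const dayEnd = dayStart + dayMs;
  const windowStart = dayEnd - 30 * dayMs;

  const daily = new Set<string>();
  const monthly = new Set<string>();
  for (const view of views) {
    const t = view.timestamp.getTime();
    if (t >= windowStart && t < dayEnd) {
      monthly.add(view.userId);
      if (t >= dayStart) {
        daily.add(view.userId);
      }
    }
  }
  // With no activity in the window, report 0 instead of dividing by zero.
  return monthly.size === 0 ? 0 : daily.size / monthly.size;
}
```

A ratio close to 1 suggests people return almost every day; a low ratio suggests occasional visits, which can hint at features worth investing in or defunding.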

For the future, we envision being able to bring usage information back to the teams themselves: Imagine, for example, being able to inform teams of opportunities to provide more extensive documentation, or even when knowledge siloing exists.

Other opportunities lie in getting Echo closer to where our users are: For example, a lot of daily work happens on Slack. Individuals asking questions on how services work, checking in on dependent teams, incident management and response: all of that, and more, happens in Slack. Instead of engineers coming to Echo, why not make Echo conversational?

Slack integration

Similar potential exists in making information contextual by other means: GitHub comments, Jira tickets, Confluence pages, you name it. The fact that information is available is the important part, and where it lives shouldn’t matter. Part of our platform team’s mission is to make that information accessible, discoverable, and easy to digest, whether that’s for a human reader or to build automation upon.

Learning as We Go

It’s been two years, and we’re still gaining confidence and working on bringing traction to Backstage at SoundCloud. We’ve gone through some operational hoops to move from Services Directory into SoundCloud Echo, and we’ve objectively provided a better development experience to our engineers, while also opening doors to plenty of potential the platform provides.

Advocacy and Education

As envisioned, relying on an open source solution brought us a lot of value “for free,” but we still had to invest a considerable amount of work to make it match our team’s needs and wishes. Developer productivity is a never-ending road, and although we see lots of potential for the future, SoundCloud (and every Backstage adopter) makes an operational funding decision when picking Backstage to power its developer portal. Nowadays, Echo is a central piece of our infrastructure, and its maintenance is a responsibility of the platform team.

Onboarding the engineering group to Backstage is more than telling engineers about it. The investment from the business is quite demanding, and to make it “the homepage of engineering” also means cultivating it as part of the culture for engineering stakeholders (EMs, PMs, Directors). Training and advocating is a work-heavy task and demands organizational efforts, especially when it’s cross-team and cross-discipline and involves non-technical folks. A big part of the work relies on “vision selling,” and that might include reshaping some of the existing culture, introducing practices like docs-as-code, or asking teams to change their ways of working, for instance, to keep their ownership listing clear and sanitized, or to consult some Backstage-provided dashboard during standup meetings.

Where to find documentation — flowchart

As exemplified earlier, in many cases we choose not to change workflows, but rather to build system integrations and provide new/alternative UIs, so individuals continue writing and consuming documents where they’re comfortable. This can be counterproductive: Tenured folks stick to “the old ways,” while new team members adopt Backstage’s way, and due to information fragmentation, the platform team starts seeing a wave of “what option should I use?” questions. It’s the responsibility of the platform team to work with the rest of engineering to decide the best moment to decommission features — even if sometimes that means deciding to roll back on a Backstage feature.

Similarly, pushing the Backstage catalog model onto teams, albeit simple, still demands educational efforts. Pair that with our previously existing catalog model, and some bridging work had to happen. Components, systems, domains; split and combined monorepos; first-party and third-party dependency listing — all of these are catalog-modeling questions whose answer is “it depends,” and for the most part, whose intricacies users shouldn’t care about.

After two years, we’re still in the early days of the full Backstage model adoption, and even then, we might decide not to pursue it completely or to introduce our own constructs. Likewise, we haven’t yet adopted the full spectrum of Backstage core features: Software Templates, although super powerful, are still something we haven’t invested in. Given the maintenance cost, we’re not yet ready to support them, nor are we prepared for the cognitive load they might introduce to the rest of our engineering team.

Contributors

As we evaluated, experimented, learned, and adopted Backstage internally, we also found opportunities to play a part in the wider community. That means engaging with the maintainers and adopters at the Discord chatroom, presenting at the monthly Backstage meetups, and contributing code.

SoundCloud contributed open-source plugins

The plugins SoundCloud open sourced revolve around areas like continuous delivery, observability, and technical health assessment, and there are more to come. Combined, they see more than 1,000 downloads per week.