Release Quality and Mobile Trains

Once every two weeks, we prepare new versions of our mobile apps to be published to the app stores. Being confident about releasing software at that scale — with as many features and code contributions as we have and while targeting a wide range of devices like we do at SoundCloud — is no easy task. So, over the last few years, we have introduced many tools and practices in our release process to aid us.

In this blog post, I’ll cover some of the techniques we use to guarantee we’re always releasing quality Android applications at SoundCloud.

All Aboard!

Maintaining high quality is about more than just hitting deadlines; it’s a continuous job. As such, we’re constantly keeping an eye on our code and asset changes to make sure we get the best out of our engineering efforts. In addition to human code reviews, every changeset in our repositories’ pull requests goes through an automated pipeline of tests. This pipeline uses industry-standard tools for static analysis as well as unit, integration, and UI testing.

An automatically generated GitHub comment with statistics about the code changes

We’ve also built tools that highlight other potential issues early on during the code review cycle. By integrating them with other pieces of our day-to-day routine, such as GitHub (above) and Slack (below), we ensure our developers have a more transparent understanding of their applied diffs.

A Slack message sent to the author of the pull request after tests are run

Even so, it’s still a challenge to foresee how an application at our scale will behave in the wild, given the variety of devices, locales, and use cases we support. For that reason, we are also big believers in having a remote feature-toggling system. It gives us a way of preventing issues from spreading once they are detected, and we also use these toggles to enable early internal testing of new features via our alpha and beta builds and to decouple app releases from feature releases.
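To make the idea concrete, here is a minimal sketch of how a remote toggle lookup could work. It is not our actual implementation: the `FeatureToggles` class, the flag names, and the payload shape are all hypothetical, and a real system would fetch the remote values over the network and cache them.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FeatureToggles:
    """Resolves flags from remote overrides, falling back to built-in defaults."""

    defaults: Dict[str, bool]                      # values shipped inside the app
    remote: Dict[str, bool] = field(default_factory=dict)

    def update_remote(self, payload: Dict[str, bool]) -> None:
        # Keep only flags the app already knows about; unknown keys are ignored.
        self.remote = {k: bool(v) for k, v in payload.items() if k in self.defaults}

    def is_enabled(self, name: str) -> bool:
        # Unknown flag names raise KeyError instead of silently returning False.
        default = self.defaults[name]
        return self.remote.get(name, default)


# Hypothetical flags: one feature still hidden, one already live.
toggles = FeatureToggles(defaults={"new_player_ui": False, "offline_sync": True})

# Enable a feature for testers without shipping a new build.
toggles.update_remote({"new_player_ui": True})

# Later, switch off a misbehaving feature that is already in production.
toggles.update_remote({"new_player_ui": True, "offline_sync": False})
```

Because the defaults live in the app, a device that never receives the remote payload still behaves predictably, which is what lets a feature ship in a release while remaining switched off.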

By using continuous integration tools, we ensure that our master branch is always deployable. This is accomplished by scheduling time-based jobs (nightly, weekly, or bi-weekly) and by running triggered jobs (for example, when a pull request is merged into our master or release branches). With these CI jobs, we can check that our obfuscated and shrunk release builds are all set after tools like ProGuard or R8 have run, pull in the latest localized translations for all the languages we support, and automate other repetitive tasks.

GitHub comment pointing to an outdated pull request

Release Train Model

We have a two-week interval between releases because we follow the Release Train model. Much like a train that departs a station on a fixed schedule, each release has a set day for what we refer to as Code Freeze. On that day, two members of the engineering collective (selected in advance from a rotation of all members) “cut the master branch” into the release branch. We refer to them as release captains, but people familiar with release trains call them Release Train Engineers (RTEs).

In practice, this means that once master gets cut, no new code gets into the changeset that the release captains are responsible for publishing (with the exception of emergency hotfixes, which are then merged back to the master branch).

Following a release train not only limits what each pair of release captains is responsible for, but also gives everyone a sense of predictability over when changes will be rolled out to the public.
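The schedule itself is simple enough to sketch in a few lines. The example below is only an illustration under assumed inputs: the roster names, the first freeze date, and the helper names `code_freezes` and `captain_pairs` are all made up, and the real rotation may well be chosen differently.

```python
from datetime import date, timedelta
from itertools import islice


def code_freezes(first_freeze, interval_days=14):
    """Yield code freeze dates indefinitely, one per release train."""
    current = first_freeze
    while True:
        yield current
        current += timedelta(days=interval_days)


def captain_pairs(engineers):
    """Yield pairs of release captains by walking the roster in a cycle."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers to form a pair")
    n = len(engineers)
    i = 0
    while True:
        yield (engineers[i % n], engineers[(i + 1) % n])
        i += 2


# Hypothetical roster and starting date.
roster = ["Ana", "Ben", "Chloe", "Dev", "Eli"]
trains = zip(code_freezes(date(2024, 1, 8)), captain_pairs(roster))
schedule = list(islice(trains, 4))

for freeze, (first, second) in schedule:
    print(freeze.isoformat(), first, second)
```

Since both the dates and the pairs are derived from fixed inputs, anyone in the company can work out which train a change will ride on and who will be steering it.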

The dates for each code freeze are displayed for the entire company in a shared calendar

As a result, our design and data analysis teams can estimate when their help will be needed for defining and delivering specifications for feature development; product managers know when their experiments and product launches will go live; and the backend and frontend engineering teams are able to plan their work to aim for certain code freeze dates.

Code freeze is just one of the steps described in a detailed checklist for the captains (generated via our CI jobs). Additional steps include naming the new build for internal reference (we follow alphabetical order and always pick from a pop culture character set — we’re currently using Marvel superheroes as inspiration!) and updating documents, links, and Slack channels for the visibility of the outgoing train.
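As a small illustration of the alphabetical naming step, the sketch below picks the next codename from a pool. The `next_codename` helper and the pool of names are hypothetical; they are not the names or the tooling our checklist actually uses.

```python
import string


def next_codename(previous, candidates):
    """Return a candidate whose initial is the letter after the previous codename's."""
    letters = string.ascii_uppercase
    position = letters.index(previous[0].upper())
    next_letter = letters[(position + 1) % len(letters)]  # wraps from Z back to A
    for name in sorted(candidates):
        if name.upper().startswith(next_letter):
            return name
    raise LookupError("no candidate starts with " + next_letter)


# Illustrative pool only.
pool = ["Daredevil", "Black Panther", "Captain Marvel", "Ant-Man"]
print(next_codename("Ant-Man", pool))  # Black Panther
```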

The release captains are then responsible for performing a set of manual regression tests that cover the main use cases of the application and some of the scenarios we could not automate (yet). Once all is verified and the captains feel confident with their release, they are free to start the staged rollout process by deploying the build to the beta channel and subsequently rolling out to percentages of the production user base.

Continuous integration pipeline with staged rollout steps

The Train Has Departed the Station

During the two-week period of staged rollout, it’s the release captains’ job to monitor data, metrics, and dashboards for user reports, reviews, and crashes that might have ended up in production. After evaluating the impact of each of these issues, it is their call whether to request new changes to be applied in their train, which is done by pinging the involved engineers and working toward fixes.
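The decision stays with the captains, but the logic behind a staged rollout can be sketched as a small helper. Everything here is an assumption made for illustration: the stage percentages, the crash-free threshold, and the `next_rollout_step` function are not taken from our pipeline.

```python
# Hypothetical rollout stages, as percentages of the production user base.
ROLLOUT_STAGES = [1, 5, 20, 50, 100]

# Hypothetical quality bar: share of sessions that must finish without a crash.
MIN_CRASH_FREE_RATE = 0.995


def next_rollout_step(current_percent, crash_free_rate):
    """Suggest an action and target percentage for the current train.

    Returns ("halt", current) if the quality bar is missed,
    ("done", current) if the rollout is already complete,
    or ("advance", next_stage) otherwise.
    """
    if crash_free_rate < MIN_CRASH_FREE_RATE:
        return ("halt", current_percent)
    later_stages = [p for p in ROLLOUT_STAGES if p > current_percent]
    if not later_stages:
        return ("done", current_percent)
    return ("advance", later_stages[0])


print(next_rollout_step(5, 0.999))    # ('advance', 20)
print(next_rollout_step(20, 0.990))   # ('halt', 20)
print(next_rollout_step(100, 0.999))  # ('done', 100)
```

A helper like this can only suggest a step. Weighing user reports and reviews against the numbers is still a judgment call for the people on the train.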

We aim to always be attentive to feedback, but the sooner issues get reported, the sooner we can act to fix them. On that note, I’d like to encourage you to join the beta to get early access to new exciting features, in addition to supporting us by being part of this direct channel of communication.

Just before the end of their two-week “shift,” the captains might decide to hold a “post-mortem meeting,” in which incidents are discussed and their root causes understood and documented so that the entire platform collective can come up with preventive actions to reduce the likelihood of recurrence. These meetings are meant to be blameless: the sole goal is to improve systems and processes and to spread knowledge across the entire team.

End of the Line

Overall, the quality of what we publish to our users is a product of many factors. Not only do we need, use, and build tooling and automation to facilitate our daily work, but more importantly, at SoundCloud, we are always attentive to supporting the processes the team puts in place to build a quality-driven culture.