SoundCloud for Developers

Discover, connect and build

Backstage Blog RSS

  • June 12th, 2014 Scala Finagle Ruby Architecture Building Products at SoundCloud—Part II: Breaking the Monolith By Phil Calçado

    In the previous post, we talked about how we enabled our teams to build microservices in Scala, Clojure, and JRuby without coupling them with our legacy monolithic Rails system. After the architecture changes were made, our teams were free to build their new features and enhancements in a much more flexible environment. An important question remained, though: how do we extract the features from the monolithic Rails application called Mothership?

    Splitting a legacy application is never easy, but luckily there are plenty of industry and academic publications to help you out.

    The first step in any activity like this is to identify and apply the criteria that define the units to be extracted. At SoundCloud, we decided to use the concept of a Bounded Context, from the work of Eric Evans and Martin Fowler. An obvious example of a Bounded Context in our domain was user-to-user messages. This was a well-contained feature set, highly cohesive, and not too coupled with the rest of the domain, as it only needs to hold references to users.

    After we identified the Bounded Context, the next task was to find a way to extract it. Unfortunately, Rails’ ActiveRecord framework often leads to a very coupled design. The code dealing with such messages was as follows:

      def index
        if (InboxItem === item)
          respond mailbox_items_in_collection.index.paginate(:page => params[:page])
        else
          respond mailbox_items_in_collection.paginate(
            :joins => "INNER JOIN messages ON #{safe_collection}_items.message_id = messages.id",
            :page  => params[:page],
            :order => 'messages.created_at DESC')
        end
      end

    Because we wanted to extract the messages’ Bounded Context into a separate microservice, we needed the code above to be more flexible. The first step we took was to refactor this code into what Michael Feathers describes as a seam:

    A seam is a place where you can alter behavior in your program without editing in that place.

    So we changed our code a little bit:

      def index
        conversations = cursor_for do |cursor|
          conversations_service.conversations_for(
            current_user, cursor[:offset], cursor[:limit])
        end
        respond collection_for(conversations, :conversations)
      end

    The first version of the conversations_service#conversations_for method was not that different from the previous code; it performed the exact same ActiveRecord calls.
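    A minimal sketch of what that first version could have looked like behind the seam (the model and association names here are illustrative, not the actual Mothership schema):

      class ConversationsService
        def conversations_for(user, offset = 0, limit = 50)
          # Same ActiveRecord-style query as before, now hidden behind the seam
          # so the controller no longer knows where conversations come from.
          user.mailbox_items
              .joins("INNER JOIN messages ON mailbox_items.message_id = messages.id")
              .order("messages.created_at DESC")
              .offset(offset)
              .limit(limit)
        end
      end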

    We were ready to extract this logic into a microservice without having to refactor lots of controllers and other Presentation Layer code. We first replaced the implementation of conversations_service#conversations_for with a call to the service:

      def conversations_for(user, offset = 0, limit = 50)
        response = @http_client.do_get(service_path(user), pagination(offset, limit))
        parse_response(user, response)
      end

    We avoided big-bang refactorings as much as we could, and this required us to have the microservices working together with the old Mothership code for as long as it took to completely extract the logic into the new microservice.

    As described before, we did not want to use the Mothership’s database as the integration point for microservices. That database is an Application Database, and making it an Integration Database would cause problems because we would have to synchronize any change in the database across many different microservices that would now be coupled to it.

    Although we never planned to use the database as the integration point between systems, we did have the new microservices accessing the Mothership’s database during the transition period.

    This brought up two important issues. The first was that, during the whole transition period, the new microservices could not change the relational model in MySQL or, even worse, use a different storage engine. For extreme cases, like user-to-user messages, where a thread-based model was replaced by a chat-like one, we had cronjobs keep the different databases synchronized.

    The other issue was related to the Semantic Events system described in Part I. Our architecture and infrastructure were designed around events being emitted by the system where the state change happened, and that has to be a single system. Because we could not have both the Mothership and the new microservice emitting events, we had to implement only the read-path endpoints until we were ready to make the full switch from the Mothership to the new microservice. This was less problematic than we first thought, but it did impact product prioritization, because the features we could deliver were constrained by this strategy.

    By applying these principles we were able to extract most services from the Mothership. Currently we have only the most coupled part of our domain there, and products like the new user-to-user messaging system were built completely decoupled from the monolith.

    In the next part, we will look at how we use Scala and Finagle to build our microservices.

  • June 11th, 2014 Scala Finagle Ruby Architecture Building Products at SoundCloud—Part I: Dealing with the Monolith By Phil Calçado

    Most of SoundCloud's products are written in Scala, Clojure, or JRuby. This wasn't always the case. Like other start-ups, SoundCloud was created as a single, monolithic Ruby on Rails application running on the MRI, Ruby's official interpreter, and backed by memcached and MySQL.

    We affectionately call this system Mothership. Its architecture was a good solution for a new product used by hundreds of thousands of artists to share their work, collaborate on tracks, and be discovered by the industry.

    The Rails codebase contained both our Public API, used by thousands of third-party applications, and the user-facing web application. With the launch of the Next SoundCloud in 2012, our interface to the world became mostly the Public API: we built all of our client applications on top of the same API that partners and developers used.

    Diagram 1

    These days, we have about 12 hours of music and sound uploaded every minute, and hundreds of millions of people use the platform every day. SoundCloud combines the scaling challenges of a very large social network with those of a media distribution powerhouse.

    To scale our Rails application to this level, we developed, contributed to, and published several components and tools to help run database migrations at scale, be smarter about how Rails accesses databases, process a huge number of messages, and more. In the end we have decided to fundamentally change the way we build products, as we felt we were always patching the system and not resolving the fundamental scalability problem.

    The first change was in our architecture. We decided to move towards what is now known as a microservices architecture. In this style, engineers separate domain logic into very small components. These components expose a well-defined API and implement a Bounded Context, including its persistence layer and any other infrastructure needs.

    Big-bang refactoring has bitten us in the past, so the team decided that the best approach to the architecture change was not to split the Mothership immediately, but rather to stop adding anything new to it. All of our new features were built as microservices, and whenever a larger refactoring of a feature in the Mothership was required, we extracted the code as part of that effort.

    This started out very well, but soon enough we detected a problem. Because so much of our logic was still in the Rails monolith, pretty much all of our microservices had to talk to it somehow.

    One option around this problem was to have the microservices access the Mothership database directly. This is a very common approach in some corporate settings, but because this database is a Public, but not Published, Interface, it usually leads to many problems whenever we need to change the structure of shared tables.

    Instead, we went for the only Published Interface we had: the Public API. Our internal microservices would behave exactly like the applications developed by third-party organizations that integrate with the SoundCloud platform.

    Diagram 2

    Soon enough, we realized that there was a big problem with this model: our microservices needed to react to user activity. The push-notification system, for example, needed to know whenever a track received a new comment so that it could inform the artist about it. At our scale, polling was not an option. We needed to create a better model.

    We were already using AMQP in general and RabbitMQ in particular: in a Rails application, you often need a way to dispatch slow jobs to a worker process to avoid hogging the concurrency-weak Ruby interpreter. Sebastian Ohm and Tomás Senart presented the details of how we use AMQP, but over several iterations we developed a model called Semantic Events, where changes in the domain objects result in a message being dispatched to a broker and consumed by whichever microservice finds the message interesting.
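    As an illustration, publishing such an event from Ruby with the Bunny AMQP client might look like the following sketch (the exchange name, routing key, and payload are made up for the example):

      require "bunny"
      require "json"

      # Connect to the broker and declare a topic exchange for domain events.
      connection = Bunny.new
      connection.start
      channel = connection.create_channel
      events = channel.topic("semantic.events", durable: true)

      # A state change in a domain object becomes a message on the broker; any
      # microservice interested in comments can bind a queue to this routing key.
      events.publish(
        JSON.generate(track_id: 42, comment_id: 4711, user_id: 7),
        routing_key: "comment.created")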

    Diagram 3

    This architecture enabled Event Sourcing, which is how many of our microservices deal with shared data, but it did not remove the need to query the Public API; for example, you might need all fans of an artist and their email addresses to notify them about a new track.

    While most of the data was available through the Public API, we were constrained by the same rules we enforced on third-party applications. It was not possible, for example, for a microservice to notify users about activity on private tracks as users could only access public information.

    We explored several possible solutions to the problem. One of the most popular alternatives was to extract all of the ActiveRecord models from the Mothership into a Ruby gem, effectively making the Rails model classes a Published Interface and a shared component. There were several important issues with this approach, including the overhead of versioning the component across so many microservices and the fact that it had become clear microservices would be implemented in languages other than Ruby. We therefore had to think about a different solution.

    In the end, the team decided to use Rails engines (or plugins, depending on the framework version) to create an Internal API that is available only within our private network. To control what can be accessed internally, we use OAuth 2.0 when an application is acting on behalf of a user, with different authorisation scopes depending on which microservice needs the data.

    Diagram 4
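    A minimal sketch of how such a scope check could look in a Rails controller (the controller, helper, and scope name are hypothetical, for illustration only):

      class Internal::ConversationsController < ApplicationController
        # Reject any request whose OAuth 2.0 token lacks the internal scope.
        before_action :require_internal_scope

        private

        def require_internal_scope
          # current_access_token is a hypothetical helper that resolves the
          # bearer token; the scope string is likewise made up for the example.
          scopes = current_access_token ? current_access_token.scopes : []
          head :forbidden unless scopes.include?("internal:conversations")
        end
      end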

    Although we are constantly removing features from the Mothership, having both a push and a pull interface to the old system ensures that we do not couple our new microservices to the old architecture. The microservice architecture has proven crucial to developing production-ready features with much shorter feedback cycles. External-facing examples are visual sounds and the new stats system.

  • May 9th, 2014 Announcements Go Open Source Roshi: a CRDT system for timestamped events By Peter Bourgon

    Let's talk about the stream.

    The SoundCloud stream represents stuff that's relevant to you primarily via your social graph, arranged in time order, newest-first. The atom of that data model, an event, is a simple enough thing.

    • Timestamp
    • User who did the thing
    • Identifier of the thing that was done

    For example: A-Trak reposts a Skrillex track. The event records when the repost happened, A-Trak as the user who did the thing, and the track as the thing that was done.

    If you followed A-Trak, you'd want to see that repost event in your stream. Easy. The difficult thing about time-ordered events is scale, and there are basically two strategies for building a large-scale time-ordered event system.
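    That atom, written out as a minimal Ruby value object (the field values here are purely illustrative):

      # The three fields from the list above: when, who, and what.
      Event = Struct.new(:timestamp, :user_id, :thing_id)

      repost = Event.new(1_391_264_122, "user:a-trak", "track:4711")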

    Data models

    Fan out on write means everybody gets an inbox.

    Fan out on write

    That's how it works today: we use Cassandra, and give each user a row in a column family. When A-Trak reposts Skrillex, we fan-out that event to all of A-Trak's followers, and make a bunch of inserts. Reads are fast, which is great. But writes carry perverse incentives: the more followers you have, the longer it takes to persist all of your updates. Storage requirements are also quadratic against user growth and follower count (i.e. affiliation density). And mutations, e.g. changes in the social graph, become costly or unfeasible to implement at the data layer. It works, but it's unwieldy in a lot of dimensions.
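    A sketch of that write path (the storage calls and helper names are placeholders, not our actual Cassandra schema):

      # Fan out on write: one insert into every follower's inbox row.
      def fan_out(event)
        follower_ids_of(event.user_id).each do |follower_id|
          # One row per user; events are columns ordered by timestamp, so a
          # read is a single row slice -- fast reads, expensive writes.
          inbox.insert(row: follower_id, column: event.timestamp, value: event.thing_id)
        end
      end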

    At some point, those caveats and restrictions started affecting our ability to iterate on the stream. To keep up with product ideas, we needed to address the infrastructure. And rather than tackling each problem in isolation, we thought about changing the model.

    The alternative is fan in on read.

    Fan in on read

    When A-Trak reposts Skrillex, it's a single append to A-Trak's outbox. When users view their streams, the system will read the most recent events from the outboxes of everyone they follow, and perform a merge. Writes are fast, storage is minimal, and since streams are generated at read time, they naturally represent the present reality. (It also opens up a lot of possibilities for elegant implementations of product features and experiments.)
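    The corresponding read path, sketched with the same placeholder helpers:

      # Fan in on read: a single append on write, a k-way merge on read.
      def stream_for(user_id, limit = 50)
        followee_ids_of(user_id)
          .map { |id| outbox.recent_events(id, limit) } # newest events per outbox
          .flatten
          .sort_by { |event| -event.timestamp }         # newest-first
          .first(limit)                                 # cut to one page
      end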

    Of course, reads are difficult. If you follow thousands of users, making thousands of simultaneous reads, time-sorting, merging, and cutting within a typical request-response deadline isn't trivial. As far as we know, nobody operating at our scale builds timelines via fan-in-on-read. And we presume that's due at least in part to the challenges of reads.

    Yet we saw potential here. Storage reduction was actually huge: we projected a complete fan-in-on-read data size for all users on the order of a hundred gigabytes. At that size, it's feasible to keep the data set in memory, distributed among commodity servers. The problem then becomes coördination: how do you reliably and correctly populate that data system (writes), and materialize views from up to thousands of sources by hard deadlines (reads)?

    Enter the CRDT

    If you're into so-called AP data systems, you've probably run into the term CRDT recently. CRDTs are conflict-free replicated data types: data structures for distributed systems. The tl;dr on CRDTs is that by constraining your operations to only those which are associative, commutative, and idempotent, you sidestep a lot of the complexity in distributed programming. (See: ACID 2.0 and/or CALM theorem.) That, in turn, makes it straightforward to guarantee eventual consistency in the face of failure.
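    To make those constraints concrete, here is a toy LWW-element-set in Ruby, a sketch of the general idea rather than Roshi's implementation:

      # An element is in the set if its newest add is newer than its newest remove.
      class LWWSet
        def initialize
          @adds = Hash.new(0)    # element => highest add timestamp seen
          @removes = Hash.new(0) # element => highest remove timestamp seen
        end

        # Keeping only the max timestamp makes add/remove commutative,
        # associative, and idempotent: replicas can apply operations in any
        # order, any number of times, and still converge.
        def add(element, ts)
          @adds[element] = [@adds[element], ts].max
        end

        def remove(element, ts)
          @removes[element] = [@removes[element], ts].max
        end

        def member?(element)
          @adds[element] > @removes[element]
        end
      end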

    With a bit of thinking, we were able to map a fan-in-on-read stream product to a data model that could be implemented with a specific type of CRDT. We were then able to focus on performance, optimizing our reads without becoming overwhelmed by incidental complexity imposed by the consistency model.


    The result of our work is Roshi, a distributed storage system for time-series events. It implements what we believe is a novel CRDT set type, closely resembling a LWW-element-set with inline garbage collection. At its core, it uses the Redis ZSET sorted set to store state, and orchestrates self-repairing reads and writes on top, in a stateless operational layer. We spent a long while optimizing the read path to support our latency and QPS requirements, and we're confident that Roshi will accommodate our exponential growth for years. It took about six developer months to build, and we're in the process of rolling it out now.
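    The mapping onto sorted sets is direct; with the redis-rb client, an outbox write and read could look like this sketch (the key names are made up, and the real system layers replication and self-repairing reads on top):

      require "redis"

      redis = Redis.new

      # Append an event: the ZSET member is the event id, the score its timestamp.
      def append_event(redis, user_id, event_id, timestamp)
        redis.zadd("outbox:#{user_id}", timestamp, event_id)
      end

      # Read the newest events: highest scores (timestamps) first.
      def recent_events(redis, user_id, limit = 50)
        redis.zrevrange("outbox:#{user_id}", 0, limit - 1)
      end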

    Roshi is fully open-source, and all the gory technical details are in the repository, so please do check it out. I hope it's easy to grok: at the time of writing, it's 5000 lines of Go, of which 2300 are tests. And we intend to keep the codebase lean, explicitly not adding features that are outside of the tightly defined problem domain.

    Open-sourcing our work naturally serves the immediate goal of providing usable software to the community. We hope that Roshi may be a good fit for problems in your organizations, and we look forward to collaborating with anyone who's interested in contributing. Open-sourcing also serves another, perhaps more interesting goal, which is advancing a broader discussion about software development. The obvious reaction to Roshi is to ask why we didn't implement it with an existing, proven data system like Cassandra. But we too often underestimate the costs of doing that: costs like mapping your domain to the generic language of the system, learning the subtleties of the implementation, operating it at scale, and dealing with bugs that your likely novel use cases may reveal. There are even second-degree costs: when software engineering is reduced to plumbing together generic systems, software engineers lose their sense of ownership, which is the foundation of craftsmanship and software quality.

    Given a well-defined problem, a specific solution may be far less costly than a generic version: there's a smaller domain translation, a much smaller surface area, and less operational friction. We hope that Roshi stands in evidence for the case that the practice of software engineering can be a more thoughtful and crafted process. Software that is "invented here" can, in the right circumstances, deliver outstanding business value.

    Roshi was a team effort. I'm deeply indebted to the amazing work of Tomás Senart, Björn Rabenstein, and Johan Uhle, without whom Roshi would have never been possible.

  • May 1st, 2014 Announcements JavaScript SDKs Introducing JavaScript SDK version 2 By Erik Michaels-Ober

    SoundCloud is pleased to introduce a new major version of the SoundCloud JavaScript SDK. In version 2, we've rewritten much of the internal code, resulting in better performance for your JavaScript applications and support for more streaming standards, such as HTTP Live Streaming.

    You can test the new version by pointing your JavaScript applications to

    We've also created a guide to help you upgrade from version 1 to version 2.

    JavaScript SDK version 1 is now deprecated and will be permanently replaced by version 2 on July 1, 2014.

    On June 17, 2014, we will temporarily replace version 1 with version 2 between 10:00 and 11:00 UTC. We will do this again on June 24, 2014, between 18:00 and 19:00 UTC. These two upgrade tests will give you an opportunity to understand the impact of this change on your applications. To ensure a seamless transition for your users, we strongly encourage you to upgrade and perform internal tests in advance of these dates.

    To receive notices before, during, and after these tests, follow @SoundCloudDev on Twitter.

    If you have any questions about this upgrade, please feel free to email

  • April 27th, 2014 Contests Search and Discovery Irrational Fun: Find Yourself at Berlin Buzzwords By Erik Michaels-Ober

    We were counting down the days until Berlin Buzzwords on May 25, when we realised that it would be great if you came too! With that in mind, we've created a contest. One lucky winner will receive a free ticket to Berlin Buzzwords, including travel expenses and accommodation. Here are the details about how to apply.

    The ratio of a circle's circumference to its diameter, represented by the Greek letter π, is an irrational number—it never terminates or repeats. Your goal is to find the SoundCloud logo in π.

    We have provided a 14 pixel by 6 pixel, greyscale reference image:

    Here is the same image at 60X magnification:

    Each of the 10 shades of grey in this image can be mapped to a number:

    RGB Hex   Number
      ffffff    0
      f0f0f0    1
      ebebeb    2
      d0d0d0    3
      c1c1c1    4
      a8a8a8    5
      878787    6
      535353    7
      333333    8
      000000    9
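    In Ruby, that mapping and the image-to-digit conversion might look like this sketch (the pixel input format is an assumption):

      # The table above as a hash, plus a helper that flattens a greyscale
      # image (given as rows of hex color strings) into a digit string.
      SHADE_TO_DIGIT = {
        "ffffff" => 0, "f0f0f0" => 1, "ebebeb" => 2, "d0d0d0" => 3, "c1c1c1" => 4,
        "a8a8a8" => 5, "878787" => 6, "535353" => 7, "333333" => 8, "000000" => 9,
      }.freeze

      def bitmap_for(pixel_rows)
        pixel_rows.map { |row| row.map { |hex| SHADE_TO_DIGIT.fetch(hex) }.join }.join
      end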

    Applying this mapping to the reference image produces the following 84-digit bitmap:

    0 0 0 0 0 3 3 9 9 9 6 0 0 0
    0 0 0 3 4 9 5 9 9 9 9 4 0 0
    0 1 2 7 5 9 5 9 9 9 9 7 1 0
    3 9 5 9 5 9 5 9 9 9 9 9 9 3
    6 9 5 9 5 9 5 9 9 9 9 9 9 6
    1 8 5 9 5 9 5 9 9 9 9 9 8 1

    Your challenge is to write an algorithm that finds the 10 sequences that most closely approximate the reference image. Each result should include the sequence and its position (after the decimal point) in π.

    Here is an example result set:

    Rank  Sequence (84 digits)                                                                   Offset
    1     082201638940102393659252475011295776958920282336494898427768768699465405437965994582  297,640,119
    2     310015049916890341096198545241549627773525291969856827158758552799587406476458977970  792,987,187
    3     212021479960265330123798820231693768599316147634474729776147987653958935291919768971  972,165,010
    4     000053743536020032496985553181128909983810344939134894807349584729687746183109884672  981,165,566
    5     142204297650171312983445842322141909755200787408739757838589593329762648444919386594  789,652,974
    6     300011495970664010077917573663456957498854662995598898697947677549686339433357728071  197,342,990
    7     313560984264300011495970664010077917573663456957498854662995598898697947677549686339  197,342,978
    8     870402479996214234001557832923050859979903649788689695954439755933903629798966788984  75,975,342
    9     208503759370245135490877462200175839969750453766432680245845143285985661373828688970  343,577,393
    10    300544295771128716836756973814286978997516282647269986574856578306678421894769876141  950,462,734

    Entries will be judged against the following criteria:

    • Code quality
    • Runtime performance
    • Visual closeness to the reference image (subjective)

    You should run your algorithm against the following data set (approximately 1 GB), which contains the first 1,000,000,000 (one billion) digits of π.
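    As a naive sketch of one possible approach (a starting point only, with no attempt at the runtime performance the judges will look for), you could slide an 84-digit window over the digits and score each window by its per-digit distance to the reference bitmap:

      # Reference bitmap: the six rows above, concatenated top to bottom.
      TARGET = "00000339996000" "00034959999400" "01275959999710" \
               "39595959999993" "69595959999996" "18595959999981"

      # Score = sum of absolute per-digit differences; lower is closer.
      def best_matches(digits, top = 10)
        best = []
        (0..digits.length - TARGET.length).each do |offset|
          window = digits[offset, TARGET.length]
          score = window.bytes.zip(TARGET.bytes).inject(0) { |sum, (w, t)| sum + (w - t).abs }
          best << [score, offset, window]
          best = best.sort_by(&:first).first(top) # keep only the current top N
        end
        best
      end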

    Please send your submission, including a link to the source-code repository, to on or before May 5, 2014 23:59 UTC. Your repository should contain a README that includes instructions about how to set up and run your code. Entries are subject to the terms and conditions.


    Congratulations to Tomasz Pewiński, who submitted the winning entry.

    We’d also like to acknowledge the excellent entries by Dan Oved and Martin Kühl. Thanks to everyone else who participated in the contest. We hope it was fun!