SoundCloud for Developers

Discover, connect and build


Backstage Blog

You're browsing posts of the category Data

  • July 10th, 2018 · Architecture, Data · Keeping Counts In Sync · By Lorand Kasler

    Track play counts are essential for providing a good creator experience on the SoundCloud platform. They not only help creators keep track of their most popular songs, but they also give creators a better understanding of their fanbase and global impact. This post is a continuation of an earlier post that discussed what we do at SoundCloud to ensure creators get their play stats (along with their other stats), both reliably and in real time.

    Read more...

  • January 24th, 2018 · Search, Data Science, Machine Learning, Analytics, Data · PageRank in Spark · By Josh Devins

    SoundCloud consists of hundreds of millions of tracks, people, albums, and playlists, and navigating this vast collection of music and personalities poses a large challenge, particularly with so many covers, remixes, and original works all in one place.

    Read more...

  • October 4th, 2017 · Data Science, Machine Learning, Analytics, Data · SoundCloud's Data Science Process · By Josh Devins

    Here at SoundCloud, we’ve been working on helping our Data Scientists be more effective, happy, and productive. We revamped our organizational structure, clearly defined the role of a Data Scientist and a Data Engineer, introduced working groups to solve common problems (like this), and positioned ourselves to do incredible work! Most recently, we started thinking about the work that a Data Scientist does, and how best to describe and share the process that we use to work on a business problem. Based on the experiences of our Data Scientists, we distilled a set of steps, tips and general guidance representing the best practices that we collectively know of and agree to as a community of practitioners.

    Read more...

  • June 20th, 2017 · Architecture, Data · A Better Model of Data Ownership · By Joe Kearney

    Once upon a time, we had a single monolith of software, one mothership running everything. At SoundCloud, the proliferation of microservices came from moving functionality out of the mothership. There are plenty of benefits to splitting up features in this way. We want the same benefits for our data as well, by defining ownership of datasets and ensuring that the right teams own the right datasets.

    Read more...

  • July 3rd, 2014 · Data · Real-Time Counts with Stitch · By Emily Green

    Here at SoundCloud, in order to provide counts and a time series of counts in real time, we created something called Stitch.

    Stitch was initially developed to provide timelines and counts for our stats pages, which are where users can see which of their tracks are played and when.

    [Image: SoundCloud stats screenshot]

    Stitch is a wrapper around a Cassandra database. It has a web application that provides read access to the counts through an HTTP API. The counts are written to Cassandra in two distinct ways, and it’s possible to use either one or both of them:

    Real Time: For real-time updates, Stitch has a processor application that handles a stream of events coming from a broker and increments the appropriate counts in Cassandra.

    Batch: The batch part is a MapReduce job running on Hadoop that reads the event logs, calculates the overall totals, and then bulk loads the results into Cassandra. Both write paths are sketched below.
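
    To make the two write paths concrete, here is a small, self-contained Python sketch. It is my own illustration rather than Stitch code: plain in-memory dictionaries stand in for Cassandra, the broker, and the event logs.

    ```python
    from collections import Counter

    # In-memory stand-ins for the Cassandra tables the two paths write to.
    realtime_counts = Counter()   # incremented per event, immediately
    batch_counts = {}             # recomputed totals, bulk loaded later

    def process_event(track_id):
        """Real-time path: handle one play event from the broker and increment its count."""
        realtime_counts[track_id] += 1

    def run_batch_job(event_log):
        """Batch path: read the event logs, compute overall totals, and bulk load them."""
        batch_counts.clear()
        batch_counts.update(Counter(event_log))

    # A day's play events, as they would arrive from the broker and land in the logs.
    events = ["track-1", "track-2", "track-1"]
    for e in events:
        process_event(e)
    run_batch_job(events)
    print(realtime_counts, batch_counts)
    ```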

    The Problem

    The difficulty with real-time counts is that incrementing is a non-idempotent operation: if you apply the same increment twice, you get a different value than if you had applied it only once. Consequently, if an incident affects our data pipeline and the counts are wrong, we can’t fix it by simply re-feeding the day’s events through the processors; if we did, we would risk double counting.
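
    To spell out the difference, here is a tiny Python illustration (not Stitch code): replaying an increment changes the stored value, while replaying a write of a recomputed total does not.

    ```python
    counter = 0

    def increment(delta):
        """Non-idempotent: applying the same increment twice changes the result."""
        global counter
        counter += delta

    def set_total(total):
        """Idempotent: applying the same write twice leaves the same value."""
        global counter
        counter = total

    increment(5)
    increment(5)      # replaying the increment: counter is now 10, not 5
    print(counter)

    set_total(5)
    set_total(5)      # replaying the set: counter is 5 either way
    print(counter)
    ```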

    Our First Solution

    Initially, Stitch only supported real-time updates and addressed this problem with a MapReduce job, The Restorator, which performed the following actions (sketched in code below, together with the locking):

    1. Calculated the expected totals.
    2. Queried Cassandra to get the values it had for each counter.
    3. Calculated the increments needed to apply to fix the counters.
    4. Applied the increments.

    Meanwhile, to stop the sand shifting under its feet, The Restorator needed to coordinate a locking system between itself and the real-time processors. This was so that the processors didn’t try to simultaneously apply increments to the same counter, which would result in a race condition. To deal with this, The Restorator used ZooKeeper.
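
    Putting the four steps and the locking together, here is a rough Python sketch of the idea. It is my own illustration rather than the Restorator's actual code: plain dictionaries stand in for Cassandra, and the kazoo ZooKeeper client stands in for whatever coordination recipe Stitch used, with a hypothetical lock path per counter.

    ```python
    from collections import Counter

    from kazoo.client import KazooClient

    # In-memory stand-in for Cassandra; the real job reads and writes Cassandra.
    stored_counts = Counter({"track-1": 7, "track-2": 3})   # possibly wrong values
    event_log = ["track-1"] * 5 + ["track-2"] * 3           # the day's raw play events

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # 1. Calculate the expected totals from the event logs.
    expected = Counter(event_log)

    for track_id, expected_total in expected.items():
        # Hold a per-counter ZooKeeper lock so the real-time processors cannot
        # apply increments to the same counter while it is being fixed.
        with zk.Lock("/stitch/locks/" + track_id, "restorator"):
            # 2. Query the store for the value it currently has for this counter.
            current = stored_counts[track_id]
            # 3. Calculate the increment needed to fix the counter.
            delta = expected_total - current
            # 4. Apply the increment.
            if delta:
                stored_counts[track_id] += delta

    zk.stop()
    print(stored_counts)  # now matches the expected totals
    ```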

    As you can probably tell, this setup was quite complex, and it often took a long time to run. But despite this, it worked.

    Our Second Solution

    Luckily, a new use case emerged: a team wanted to run Stitch purely in batch. This is when we added the batch layer, and we used it as an opportunity to revisit the way Stitch was dealing with the non-idempotent increments problem. We evolved to a Lambda Architecture-style approach, combining a fast real-time layer, which gives a possibly inaccurate but immediate count, with a slow batch layer, which gives an accurate but delayed count. The two sets of counts are kept separately and updated independently, possibly even living on different database clusters, and it is up to the reading web application to return the correct version when queried. At its most naive, it returns the batch counts instead of the real-time counts whenever the batch counts exist.
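
    As an illustration of that naive read path, here is a short Python sketch (hypothetical names, not the actual Stitch HTTP API): the reading application consults the two independently maintained stores and prefers the batch value whenever one exists.

    ```python
    # Stand-ins for the two independently updated sets of counts.
    batch_counts = {"track-1": 1200}                    # accurate but delayed
    realtime_counts = {"track-1": 1287, "track-2": 3}   # immediate, possibly off

    def read_count(track_id):
        """Naive Lambda-style read: prefer the batch count whenever it exists."""
        if track_id in batch_counts:
            return batch_counts[track_id]
        return realtime_counts.get(track_id, 0)

    print(read_count("track-1"))  # 1200 - the batch value wins once it has been loaded
    print(read_count("track-2"))  # 3    - only the real-time layer has seen this track
    ```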

    Conclusion

    To find out how Stitch has evolved over the years, you can read this updated post, Keeping Counts In Sync.

    [Image: Stitch diagram]

    Thanks go to Kim Altintop and Omid Aladini, who created Stitch, and John Glover, who continues to work on it with me.

    If this sounds like the sort of thing you’d like to work on too, check out our jobs page.