Backstage Blog: posts in the category Data

July 10th, 2018 Architecture Data Keeping Counts In Sync By Lorand Kasler

Track play counts are essential for providing a good creator experience on the SoundCloud platform. They not only help creators keep track of their most popular songs, but they also give creators a better understanding of their fanbase and global impact. This post is a continuation of an earlier post that discussed what we do at SoundCloud to ensure creators get their play stats (along with their other stats), both reliably and in real time.
January 24th, 2018 Search Data Science Machine Learning Analytics Data PageRank in Spark By Josh Devins
SoundCloud consists of hundreds of millions of tracks, people, albums, and playlists, and navigating this vast collection of music and personalities poses a large challenge, particularly with so many covers, remixes, and original works all in one place.
For search on SoundCloud, one of the ways we approach this problem is by using our own version of the PageRank algorithm, which we affectionately refer to as DiscoRank (Get it? Disco as in discovery and Saturday Night Fever?!).
The job of PageRank is to help rank search results from a query like finding all Go+ tracks called “royals.” At first glance, this task might seem trivial. The first result is, and should indeed be, Lorde’s original song, “Royals.” However, there are plenty of covers and remixes of this track, which leaves us with questions like: Which ones should we show at the top and in which order? What about other tracks in our catalog that have the word “royals” in them? Where should they be in our search results list?
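DiscoRank itself is SoundCloud-specific, but the underlying PageRank computation can be sketched in a few lines of power iteration. The graph, damping factor, and iteration count below are illustrative assumptions, not SoundCloud's actual values:

```python
# Minimal PageRank power iteration over an adjacency list.
# Graph, damping factor, and iteration count are illustrative only.
def pagerank(links, damping=0.85, iterations=50):
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node, outgoing in links.items():
            if not outgoing:  # dangling node: spread its rank evenly
                for n in nodes:
                    new_rank[n] += damping * rank[node] / len(nodes)
            else:
                for target in outgoing:
                    new_rank[target] += damping * rank[node] / len(outgoing)
        rank = new_rank
    return rank

# A toy "royals" neighborhood: both covers link (e.g. via attribution)
# to the original, so the original should rank highest.
graph = {
    "original": [],
    "cover_a": ["original"],
    "cover_b": ["original", "cover_a"],
}
ranks = pagerank(graph)
```

In this toy graph, the track that accumulates the most inbound links ends up on top, which is exactly the property we want when ordering covers and remixes beneath an original.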
October 4th, 2017 Data Science Machine Learning Analytics Data SoundCloud's Data Science Process By Josh Devins
Here at SoundCloud, we’ve been working on helping our Data Scientists be more effective, happy, and productive. We revamped our organizational structure, clearly defined the role of a Data Scientist and a Data Engineer, introduced working groups to solve common problems (like this), and positioned ourselves to do incredible work! Most recently, we started thinking about the work that a Data Scientist does, and how best to describe and share the process that we use to work on a business problem. Based on the experiences of our Data Scientists, we distilled a set of steps, tips and general guidance representing the best practices that we collectively know of and agree to as a community of practitioners.
June 20th, 2017 Architecture Data A Better Model of Data Ownership By Joe Kearney
Once upon a time, we had a single monolith of software, one mothership running everything. At SoundCloud, the proliferation of microservices came from moving functionality out of the mothership. There are plenty of benefits to splitting up features in this way. We want the same benefits for our data as well, by defining ownership of datasets and ensuring that the right teams own the right datasets.
July 3rd, 2014 Data Real-Time Counts with Stitch By Emily Green
Here at SoundCloud, in order to provide counts and a time series of counts in real time, we created something called Stitch.
Stitch was initially developed to provide timelines and counts for our stats pages, which are where users can see which of their tracks are played and when.
Stitch is a wrapper around a Cassandra database. It has a web application that provides read access to the counts through an HTTP API. The counts are written to Cassandra in two distinct ways, and it’s possible to use either one or both of them:
- Real time: For real-time updates, Stitch has a processor application that handles a stream of events coming from a broker and increments the appropriate counts in Cassandra.
- Batch: The batch part is a MapReduce job running on Hadoop that reads the event logs, calculates the overall totals, and then bulk loads them into Cassandra.
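In outline, the real-time processor is a loop that consumes broker events and issues counter increments. The event shape below and the in-memory table standing in for Cassandra are assumptions for illustration; a real deployment would issue CQL counter updates instead:

```python
from collections import Counter

# In-memory stand-in for Cassandra's counter table (illustrative only).
counts = Counter()

def process_event(event):
    """Increment the play count for the track named in one broker event."""
    if event.get("type") == "play":
        counts[(event["track_id"], event["day"])] += 1

# A hypothetical slice of the event stream.
stream = [
    {"type": "play", "track_id": 42, "day": "2014-07-03"},
    {"type": "play", "track_id": 42, "day": "2014-07-03"},
    {"type": "like", "track_id": 42, "day": "2014-07-03"},  # ignored
]
for event in stream:
    process_event(event)

# Note: re-feeding the same stream would double every count --
# the non-idempotence problem discussed below.
```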
The difficulty with real-time counts is that incrementing is a non-idempotent operation: applying the same increment twice gives a different value than applying it once. So if an incident affects our data pipeline and the counts are wrong, we can't fix it by simply re-feeding the day's events through the processors; if we did, we would risk double counting.
Our First Solution
Initially, Stitch only supported real-time updates and addressed this problem with a MapReduce job, the Restorator, which performed the following actions:
- Calculated the expected totals.
- Queried Cassandra to get the values it had for each counter.
- Calculated the increments needed to fix the counters.
- Applied the increments.
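Concretely, the Restorator's steps amount to computing a delta per counter and applying it as an increment, never as an absolute write. This is a hedged sketch with plain dicts standing in for the event logs and Cassandra:

```python
def restore(expected_totals, cassandra_counts, apply_increment):
    """Fix drifted counters by applying deltas, never absolute writes.

    expected_totals: counter -> total recomputed from the event logs
    cassandra_counts: counter -> value currently stored
    apply_increment: callback that issues one counter increment
    """
    for counter, expected in expected_totals.items():
        stored = cassandra_counts.get(counter, 0)
        delta = expected - stored
        if delta != 0:
            apply_increment(counter, delta)

# Example: one counter is short by 3, the other is already correct.
store = {"track:42": 7, "track:99": 5}
restore({"track:42": 10, "track:99": 5},
        dict(store),
        lambda c, d: store.__setitem__(c, store[c] + d))
```

Applying deltas rather than overwriting is what makes the fix safe to express as counter increments, the only write operation the real-time path uses.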
Meanwhile, to stop the sand shifting under its feet, the Restorator needed to coordinate a locking system between itself and the real-time processors, so that the processors didn't try to simultaneously apply increments to the same counter, which would result in a race condition. To deal with this, the Restorator used ZooKeeper.
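The coordination can be pictured as taking a per-counter lock before incrementing. Here `threading.Lock` stands in for the ZooKeeper lock; that substitution, and all names below, are assumptions of this sketch, not the actual Stitch code:

```python
import threading
from collections import Counter

counts = Counter()
# One lock per counter; in Stitch, ZooKeeper played this role
# across processes rather than across threads.
locks = {"track:42": threading.Lock()}

def increment(counter, delta):
    # Restorator and real-time processors take the same lock,
    # so their increments to one counter never interleave.
    with locks[counter]:
        counts[counter] += delta

threads = [threading.Thread(target=increment, args=("track:42", 1))
           for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```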
As you can probably tell, this setup was quite complex, and it often took a long time to run. But despite this, it worked.
Our Second Solution
Luckily, a new use case emerged: a team wanted to run Stitch purely in batch. This is when we added the batch layer, and we used it as an opportunity to revisit the way Stitch dealt with the non-idempotent increments problem. We evolved toward a Lambda Architecture-style approach, combining a fast real-time layer, which gives a possibly inaccurate but immediate count, with a slow batch layer, which gives an accurate but delayed count. The two sets of counts are kept separately and updated independently, possibly even living on different database clusters, and it is up to the reading web application to return the correct version when queried. At its most naive, it returns the batch counts instead of the real-time counts whenever batch counts exist.
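The naive read-side rule can be sketched as: prefer the batch count when one exists, otherwise fall back to real time. Function and key names here are illustrative, not Stitch's actual API:

```python
def read_count(counter, batch_counts, realtime_counts):
    """Naive Lambda-style read: the accurate batch value wins when
    present; otherwise serve the immediate, possibly inaccurate,
    real-time value."""
    if counter in batch_counts:
        return batch_counts[counter]
    return realtime_counts.get(counter, 0)

batch = {"track:42": 100}                    # bulk-loaded by the Hadoop job
realtime = {"track:42": 103, "track:99": 7}  # incremented by the stream

a = read_count("track:42", batch, realtime)  # batch value wins
b = read_count("track:99", batch, realtime)  # no batch count yet
```

Because the batch layer rebuilds totals from the raw event logs, any double counting in the real-time layer is bounded: it survives only until the next batch load overrides it.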
To find out how Stitch has evolved over the years, you can read this updated post, Keeping Counts In Sync.
If this sounds like the sort of thing you’d like to work on too, check out our jobs page.