Backstage Blog

You're browsing posts in the category Data

Keeping Counts In Sync

May 11th, 2018 by Lorand Kasler

Track play counts are essential for providing a good creator experience on the SoundCloud platform. They not only help creators keep track of their most popular songs, but they also give creators a better understanding of their fanbase and global impact. This post is a continuation of an earlier post that discussed what we do at SoundCloud to ensure creators get their play stats (along with their other stats) both reliably and in real time.

PageRank in Spark

January 24th, 2018 by Josh Devins

SoundCloud consists of hundreds of millions of tracks, people, albums, and playlists, and navigating this vast collection of music and personalities poses a large challenge, particularly with so many covers, remixes, and original works all in one place.

SoundCloud's Data Science Process

October 4th, 2017 by Josh Devins

Here at SoundCloud, we’ve been working on helping our Data Scientists be more effective, happy, and productive. We revamped our organizational structure, clearly defined the roles of Data Scientist and Data Engineer, introduced working groups to solve common problems (like this one), and positioned ourselves to do incredible work! Most recently, we started thinking about the work that a Data Scientist does, and how best to describe and share the process that we use to work on a business problem. Based on the experiences of our Data Scientists, we distilled a set of steps, tips, and general guidance representing the best practices that we collectively know of and agree on as a community of practitioners.

A Better Model of Data Ownership

June 20th, 2017 by Joe Kearney

Once upon a time, we had a single monolith of software, one mothership running everything. At SoundCloud, the proliferation of microservices came from moving functionality out of the mothership. There are plenty of benefits to splitting up features in this way. We want the same benefits for our data as well, by defining ownership of datasets and ensuring that the right teams own the right datasets.

Real-Time Counts with Stitch

July 3rd, 2014 by Emily Green

Here at SoundCloud, in order to provide counts and a time series of counts in real time, we created something called Stitch.

Stitch was initially developed to provide timelines and counts for our stats pages, where users can see which of their tracks were played and when.

[Image: SoundCloud Stats screenshot]

Stitch is a wrapper around a Cassandra database. It has a web application that provides read access to the counts…

MySQL for Statistics – Old Faithful

July 5th, 2011 by Sean Treadway

MySQL turns out to be a good Swiss Army knife for persistence, if used wisely. Understanding the disk access patterns driven by your storage engine is key. Choosing a read- or write-optimized disk layout will get you very far. We chose a read-optimized disk layout using InnoDB and MySQL for statistics.

While our wheels were spinning trying to find out why our statistics storage patterns were causing MongoDB to thrash our disks, we started looking for an emergency alternative with the technology that we…
