SoundCloud for Developers

Discover, connect and build

Backstage Blog July 3rd, 2014 Data Real-time counts with Stitch By Emily Green

We made Stitch to provide counts and time-series of counts in real-time.

Stitch was initially developed to do the timelines and counts for our stats pages. This is where users can see which of their tracks were played and when.

SoundCloud Stats Screenshot

Stitch is a wrapper around a Cassandra database. It has a web application that provides read-access to the counts through an HTTP API. The counts are written to Cassandra in two distinct ways, and it's possible to use either or both of them:

For real-time updates, Stitch has a processor application that handles a stream of events coming from a broker and increments the appropriate counts in Cassandra.
The batch part is a MapReduce job running on Hadoop that reads event logs, calculates the overall totals, and bulk loads this into Cassandra.

The problem

The difficulty with real-time counts is that incrementing is a non-idempotent operation, which means that if you apply the same increment twice you get a different value to when you apply it once. If an incident affects our data pipeline, and the counts are wrong, we cannot fix by simply re-feeding the day's events through the processors; we would risk double counting.

Our first solution

Initially, Stitch only supported real-time updates and addressed this problem with a MapReduce job named The Restorator that performed the following actions:

  1. Calculated the expected totals
  2. Queried Cassandra to get the values it had for each counter
  3. Calculated the increments needed to apply to fix the counters
  4. Applied the increments

Meanwhile, to stop the sand shifting under its feet, The Restorator needed to coordinate a locking system between itself and the real-time processors, so that the processors did not try to simultaneously apply increments to the same counter, resulting in a race-condition. It used ZooKeeper for this.

As you can probably tell, this was quite complex, and it could take a long time to run. But despite this, it did indeed work.

Our second solution

We got a new use-case; a team wanted to run Stitch purely in batch. This is when we added the batch layer and took the opportunity to revisit the way Stitch was dealing with the non-idempotent increments problem. We evolved to a Lambda Architecture style approach, where we combine a fast real-time layer for a possibly inaccurate but immediate count, with a batch slow layer for an accurate but delayed count. The two sets of counts are kept separately and updated independently, possibly even living on different database clusters. It is up to the reading web application to return the right version when queried. At its naïvest, it returns the batch counts instead of the real-time counts whenever they exist.

Stitch Diagram

Thanks go to Kim Altintop and Omid Aladini who created Stitch, and John Glover who continues to work on it with me.

If this sounds like the sort of thing you'd like to work on too, check out our jobs page.