Here at SoundCloud, in order to provide counts and a time series of counts in real time, we created something called Stitch.
Stitch was initially developed to provide timelines and counts for our stats pages, which are where users can see which of their tracks are played and when.
Stitch is a wrapper around a Cassandra database. It has a web application that provides read access to the counts through an HTTP API. The counts are written to Cassandra in two distinct ways, and it’s possible to use either one or both of them:
- Real Time: For real-time updates, Stitch has a processor application that handles a stream of events coming from a broker and increments the appropriate counts in Cassandra.
- Batch: The batch part is a MapReduce job running on Hadoop that reads event logs, calculates the overall totals, and then bulk loads them into Cassandra.
The difficulty with real-time counts is that incrementing is a non-idempotent operation, which means that if you apply the same increment twice, you get a different value than if you had applied it only once. As a result, if an incident affects our data pipeline and the counts are wrong, we can’t fix them by simply re-feeding the day’s events through the processors; if we did, we would risk double counting.
Our First Solution
Initially, Stitch only supported real-time updates and addressed this problem with a MapReduce job, The Restorator, which performed the following actions:
- Calculated the expected totals.
- Queried Cassandra to get the values it had for each counter.
- Calculated the increments needed to apply to fix the counters.
- Applied the increments.
Meanwhile, to stop the sand shifting under its feet, The Restorator needed to coordinate a locking system between itself and the real-time processors. This was so that the processors didn’t try to simultaneously apply increments to the same counter, which would result in a race condition. To deal with this, The Restorator used ZooKeeper.
As you can probably tell, this setup was quite complex, and it often took a long time to run. But despite this, it worked.
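To make the double-counting problem and the repair steps concrete, here is a minimal in-memory sketch. It is not The Restorator's actual code: plain dictionaries stand in for Cassandra counters, the function names (`apply_increments`, `expected_totals`, `repair_increments`) are hypothetical, and the ZooKeeper locking against the real-time processors is left out entirely.

```python
from collections import Counter


def apply_increments(store, increments):
    """Add each delta to the stored counter. Not idempotent: applying twice doubles the effect."""
    for key, delta in increments.items():
        store[key] = store.get(key, 0) + delta


def expected_totals(events):
    """Step 1: recompute the totals from the event log (one event = one play)."""
    return Counter(events)


def repair_increments(expected, stored):
    """Steps 2 and 3: compare expected totals with stored values and return the deltas needed."""
    deltas = {}
    for key in set(expected) | set(stored):
        delta = expected.get(key, 0) - stored.get(key, 0)
        if delta != 0:
            deltas[key] = delta
    return deltas


# The day's event log: two plays of track-a, one play of track-b.
events = ["track-a", "track-a", "track-b"]

# Hypothetical incident: one play of track-a never reached the counters.
after_incident = {"track-a": 1, "track-b": 1}

# Naive fix: re-feed every event. Plays that were already counted are counted again.
naive = dict(after_incident)
apply_increments(naive, Counter(events))

# Restorator-style fix: apply only the difference between expected and stored values.
repaired = dict(after_incident)
deltas = repair_increments(expected_totals(events), repaired)
apply_increments(repaired, deltas)  # Step 4

print(naive)     # {'track-a': 3, 'track-b': 2}
print(repaired)  # {'track-a': 2, 'track-b': 1}
```

In this sketch, nothing stops another writer from incrementing a counter between the read and the write, which is the race condition the locking was there to prevent.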
Our Second Solution
Luckily, a new use case emerged: a team wanted to run Stitch purely in batch. This is when we added the batch layer, and we used this as an opportunity to revisit the way Stitch was dealing with the non-idempotent increments problem. We evolved to a Lambda Architecture-style approach, where we combined a fast real-time layer for a possibly inaccurate but immediate count with a slow batch layer for an accurate but delayed count. The two sets of counts are kept separately and updated independently, possibly even living on different database clusters, and it is up to the reading web application to return the correct version when queried. At its most naive, it returns the batch counts instead of the real-time counts whenever they exist.
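The naive read rule described above can be sketched in a few lines. This is an illustration under assumptions, not Stitch's actual web application: the two dictionaries stand in for the separately stored batch and real-time counts, and `read_count` is a hypothetical name.

```python
def read_count(key, batch_counts, realtime_counts):
    """Return the batch count when one exists; otherwise fall back to the real-time count."""
    if key in batch_counts:
        return batch_counts[key]
    return realtime_counts.get(key, 0)


# The batch layer has processed track-a but has not yet seen track-b.
batch_counts = {"track-a": 10}
realtime_counts = {"track-a": 12, "track-b": 3}

print(read_count("track-a", batch_counts, realtime_counts))  # 10 (batch wins)
print(read_count("track-b", batch_counts, realtime_counts))  # 3  (real-time fallback)
print(read_count("track-c", batch_counts, realtime_counts))  # 0  (unknown counter)
```

Because the two sets of counts are never merged at write time, a batch run can simply overwrite its own totals, so re-running it does not double count. The trade-off in this naive form is that increments arriving after the last batch run stay hidden for any counter that already has a batch value.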
To find out how Stitch has evolved over the years, you can read this updated post, Keeping Counts In Sync.
If this sounds like the sort of thing you’d like to work on too, check out our jobs page.