We made Stitch to provide counts and time-series of counts in real-time.
Stitch was initially developed to do the timelines and counts for our stats pages. This is where users can see which of their tracks were played and when.
Stitch is a wrapper around a Cassandra database. It has a web application that provides read-access to the counts through an HTTP API. The counts are written to Cassandra in two distinct ways, and it's possible to use either or both of them:
- For real-time updates, Stitch has a processor application that handles a stream of events coming from a broker and increments the appropriate counts in Cassandra.
- The batch part is a MapReduce job running on Hadoop that reads event logs, calculates the overall totals, and bulk loads this into Cassandra.
- Calculated the expected totals
- Queried Cassandra to get the values it had for each counter
- Calculated the increments needed to apply to fix the counters
- Applied the increments
The difficulty with real-time counts is that incrementing is a non-idempotent operation, which means that if you apply the same increment twice you get a different value to when you apply it once. If an incident affects our data pipeline, and the counts are wrong, we cannot fix by simply re-feeding the day's events through the processors; we would risk double counting.
Our first solution
Initially, Stitch only supported real-time updates and addressed this problem
with a MapReduce job named
The Restorator that performed the following actions:
Meanwhile, to stop the sand shifting under its feet,
The Restorator needed to
coordinate a locking system between itself and the real-time processors, so
that the processors did not try to simultaneously apply increments to the same
counter, resulting in a race-condition. It used ZooKeeper for this.
As you can probably tell, this was quite complex, and it could take a long time to run. But despite this, it did indeed work.
Our second solution
We got a new use-case; a team wanted to run Stitch purely in batch. This is when we added the batch layer and took the opportunity to revisit the way Stitch was dealing with the non-idempotent increments problem. We evolved to a Lambda Architecture style approach, where we combine a fast real-time layer for a possibly inaccurate but immediate count, with a batch slow layer for an accurate but delayed count. The two sets of counts are kept separately and updated independently, possibly even living on different database clusters. It is up to the reading web application to return the right version when queried. At its naïvest, it returns the batch counts instead of the real-time counts whenever they exist.
If this sounds like the sort of thing you'd like to work on too, check out our jobs page.