Web Scale Statistics – Failing with MongoDB

As SoundCloud grows rapidly, our early systems need an overhaul. Our scaling strategy has been pragmatic: design for 10x our current usage. Our original statistics system, found under https://soundcloud.com/you/stats, was built when we had 100k users and has lived long past its expiration date.

Background

About a year ago we set out to redesign the statistics pages to support 500 playbacks a second. We knew that this would be a write-heavy workload and that, to sustain bursts of writes, we'd need decent partitioning. Coming off a successful experience moving the Dashboard feature to Cassandra 0.6, we started prototyping a design that could be easily partitioned.

The write side of this story went very well: Cassandra could keep up with everything we threw at it. However, naively pulling out all the aggregates we were collecting took hundreds of queries to the cluster. Cassandra didn't have atomic counters at the time, so we had a lot of individual counts that needed to be summed on the client. (This is changing with the much-anticipated upcoming 0.8 release!)
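To make the read-side pain concrete, here is a minimal sketch of that client-side summation. It uses the current Python cassandra-driver and CQL purely for illustration (Cassandra 0.6 predated both), and the keyspace, table, and host names are hypothetical:

```python
# Illustration only: without server-side counters, every total we displayed
# meant fetching many per-bucket rows and adding them up in the client.
# The keyspace/table (stats.plays_by_hour) and host are hypothetical.
from cassandra.cluster import Cluster

session = Cluster(["cassandra-host"]).connect("stats")

def total_plays(track_id, hour_buckets):
    """One query per time bucket -- hundreds of round trips per stats page."""
    total = 0
    for bucket in hour_buckets:
        rows = session.execute(
            "SELECT plays FROM plays_by_hour WHERE track_id = %s AND bucket = %s",
            (track_id, bucket),
        )
        row = rows.one()
        total += row.plays if row else 0
    return total
```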

In a one-night experiment, we re-implemented the Cassandra-based prototype to be backed by MongoDB. Not only could this quick prototype consume events as fast as Cassandra, but MongoDB also had some server-side features we could use to simplify a few of our stats queries: atomic in-place insert/updates (upserts) to use fewer documents, and secondary indexes to build the time series. Plus it was web scale.
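As a rough sketch of what made MongoDB attractive here, this is the kind of atomic upsert and secondary index we mean, written against today's PyMongo API; the collection layout and field names are illustrative, not our actual schema:

```python
# Sketch of an upserted hourly aggregate: one document per (track, hour),
# with counters bumped atomically in place via $inc. Names are illustrative.
from datetime import datetime, timezone
from pymongo import MongoClient, ASCENDING

coll = MongoClient("mongodb://localhost:27017")["stats"]["aggregates"]

# A secondary index over (track_id, bucket) lets us read a time series back
# out in order without scanning the whole collection.
coll.create_index([("track_id", ASCENDING), ("bucket", ASCENDING)])

def record_play(track_id, country, played_at):
    bucket = played_at.replace(minute=0, second=0, microsecond=0)
    coll.update_one(
        {"track_id": track_id, "granularity": "hour", "bucket": bucket},
        {"$inc": {"plays": 1, "by_country.%s" % country: 1}},
        upsert=True,  # the first play in this hour creates the document
    )

record_play(42, "DE", datetime.now(timezone.utc))
```

The upsert means the first event in a bucket creates the document and every later event only touches counters in place, which is what let us get away with fewer documents than in the Cassandra prototype.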

To answer the partitioning problem for MongoDB, we decided to lean on the automatic sharding router, which was still under development at the time. It was built to automate rebalancing of data between replication pairs and keep a cluster nice and healthy without much data administration.
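For reference, wiring up that kind of sharded collection looks roughly like the following against a present-day mongos, using today's admin commands (the exact interface of the pre-release router we ran differed); the database, collection, and shard key shown are only examples:

```python
# Rough sketch: point a client at the mongos router and shard the
# stats collection. Database, collection, and shard key are examples only.
from pymongo import MongoClient

router = MongoClient("mongodb://mongos-host:27017")  # the sharding router

router.admin.command("enableSharding", "stats")
router.admin.command(
    "shardCollection",
    "stats.aggregates",
    key={"track_id": 1, "bucket": 1},  # chunks split and migrate on this key
)
```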

Away we went, implementing a correct, distributed, key-value-oriented, MongoDB-based statistics backend. We even deployed it onto a cluster of Amazon EC2 instances and hooked it up to the website to start tracking our statistics alongside the existing solution.

What we observed was disconcerting. On a machine with 36GB of RAM and a working set under 100GB, the system performed better than needed; a single node could process thousands of plays per second. Once the working set approached 300GB, throughput dropped to between 100 and 200 plays per second. The disk utilization of one of the shards in the cluster stayed at 100%, we were seeing an IO service latency of around 5ms, and MongoDB latency was spiking up to 15 seconds. It all pointed to us becoming bound on disk seeks.

Seeking the answer

Being bound on IO was something we had anticipated, since we never planned to keep the entire working set resident, but not this badly. Whatever MongoDB's sharding was doing, it was causing a single node in the cluster to become the bottleneck. Could it be those poor disk heads bouncing back and forth to support our write load?

For any workload, a disk seek is the worst thing one could be spending time on. In “Numbers Everyone Should Know” it’s obvious we can do better:

| Operation | Time |
| --- | --- |
| execute typical instruction | 1/1,000,000,000 sec = 1 nanosec |
| fetch from L1 cache memory | 0.5 nanosec |
| branch misprediction | 5 nanosec |
| fetch from L2 cache memory | 7 nanosec |
| mutex lock/unlock | 25 nanosec |
| fetch from main memory | 100 nanosec |
| send 2K bytes over 1Gbps network | 20,000 nanosec |
| read 1MB sequentially from memory | 250,000 nanosec |
| fetch from new disk location (seek) | 8,000,000 nanosec |
| read 1MB sequentially from disk | 20,000,000 nanosec |
| send packet US to Europe and back | 150 milliseconds = 150,000,000 nanosec |
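Plugging our observed numbers into that table shows the ceiling we were hitting. A back-of-the-envelope sketch, assuming every aggregate document touched by a play is a cold page once the working set outgrows RAM:

```python
# Back-of-envelope, not a measurement: how many plays per second can a
# single seek-bound disk sustain once most pages are cold?
io_service_time_s = 0.005     # ~5 ms per random IO, as observed on the hot shard
random_ios_per_second = 1 / io_service_time_s   # ~200 random IOs/s
docs_touched_per_play = 4     # year, month, day, hour aggregates (at minimum)

plays_per_second = random_ios_per_second / docs_touched_per_play
print(plays_per_second)       # ~50 plays/s per disk
```

A seek-bound disk tops out at a couple of hundred random IOs per second, so once each play has to fault in several cold pages, per-node throughput lands in the same ballpark as the 100 to 200 plays per second we were seeing.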

The stress to get the stats released was mounting. Our existing solution was unusable for almost all of our active users, and it was causing a ripple effect in the databases that degraded site performance whenever anyone visited their stats page. We had already overshot our release goal by two months.

We struggled hard to identify the root cause of the hotspot in the MongoDB cluster. If we could just distribute the seek-bound workload, we could at least release this iteration of the stats while we dug into why we were seek-bound in the first place.

To identify the root cause, we tried confirming various theories about the workload we were introducing and the behavior of MongoDB. Each experiment was time-consuming, but we built up an understanding of how the system was (mis)behaving.

Death by design

The design of this system was fairly simple: create a document for each time aggregate that we needed to display in a time series, using varying time spans to reduce the number of aggregates we would need to fetch and sum in the client. We used year, month, day, and hour aggregation documents for each dimension of data we wished to track: total play counts, plays by user, by country, and by source URL. These aggregates were sent to a mongos node that would then shard the documents and distribute them to one of 4 backing replication pairs.
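To make the write amplification visible, here is a hypothetical sketch of that fan-out: each play turns into one upsert per combination of time span and dimension before it ever reaches the shards. Field names and the exact bucket formats are illustrative:

```python
# Illustrative only: the set of aggregate buckets one play event increments.
# Four time spans times four dimensions -> 16 upserts for a single play.
from datetime import datetime, timezone

TIME_SPANS = {
    "year":  lambda t: t.strftime("%Y"),
    "month": lambda t: t.strftime("%Y-%m"),
    "day":   lambda t: t.strftime("%Y-%m-%d"),
    "hour":  lambda t: t.strftime("%Y-%m-%dT%H"),
}

def aggregate_keys(track_id, user_id, country, source_url, played_at):
    """Yield the (dimension, span, bucket) keys a single play increments."""
    dimensions = {
        "total":   track_id,
        "user":    (track_id, user_id),
        "country": (track_id, country),
        "source":  (track_id, source_url),
    }
    for dim, key in dimensions.items():
        for span, fmt in TIME_SPANS.items():
            yield {"dimension": dim, "key": key,
                   "span": span, "bucket": fmt(played_at)}

play = dict(track_id=42, user_id=7, country="DE",
            source_url="https://example.com/embed",
            played_at=datetime.now(timezone.utc))
print(sum(1 for _ in aggregate_keys(**play)))  # 16
```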

Our last theory was that when a track is first recorded, the 4 aggregate documents (year, month, day, hour) are written to the end of each collection. All 4 documents would end up on the right side of the tree, and on the right side of the disk, since our understanding of the MongoDB disk layout was that it is essentially a memory-mapped B-tree of the actual BSON documents. As time passes, the longer-duration documents, year and month, would shift relatively to the left of the disk storage, while the newer hour and day documents would end up on the right. Since MongoDB's persistence was essentially a bunch of memory-mapped files, this could explain all the page faults from loading cold pages in the middle/left of the tree when attempting to update every aggregation bucket.

We are still uncertain about how the documents were laid out in memory and on disk, how that layout grew, how it was mapped to disk, and how the indexes resolved to documents. If the on-disk layout was by the random ID, then we were screwed on both writes and reads, because both would be random. If it was by time, we were screwed on reads, as we'd get sparse reads over the time series. If it followed our bucket organization, we were screwed in the future, as our buckets' locality would only get sparser over time.

Coming to terms

IO doing it wrong

Not reasoning hard about this system's IO patterns was an obvious oversight, and one we caught very late. Even if we could partition the writes, we had designed a system that would require close to 100% in-memory residence to stay performant over time. That is a very expensive proposition for mostly stale and very large data.

With those lessons learned, we faced up to reality and made the difficult decision to cancel the project, as it could not meet our long-term goals under a real workload.

Yet the stress is still there: SoundCloud statistics must work for all users, especially the people with millions of plays.

Up soon – what we made instead (for now).
