The Journey of Corpus

It seems like a simple enough concept: You take data from how your users interact with your product, and you use it to make business and product decisions. Informed decisions will certainly be better than guesses, so in the end, this should pay off.

But that’s only how it looks to anyone who hasn’t seen the dirty underbelly of data — from how broken tracking can lead to incorrect conclusions, to having lots and lots and lots of data but no documentation around what any of it means. If you’ve worked a day in your life as an analyst, data scientist, data engineer, or in any other data-heavy role, you know how bad things can get.

In mid-2019, we at SoundCloud were in a position many companies have probably found themselves in: As the world’s largest open audio platform, we had a wealth of data, and our stakeholders wanted to make use of it. But access to that data was hard, since our data warehouse was the accumulation of six years of decisions that made sense at the time but had naturally grown outdated. As a result, we simply weren’t able to support the business and product teams in the data-driven decisions they wanted to make.

More concretely, a single (small) team owned more than 100 ETLs (extract, transform, load processes) on Amazon Redshift — most with little to no documentation, and with inconsistencies between them (they were, after all, built at different points in time, by different people, for different purposes). However, we had signed a contract with Google to switch over to BigQuery (BQ) as our new data warehousing platform. This was our opportunity to start fresh and build something that would both meet SoundCloud’s needs at the time and keep scaling as those needs evolved.

Corpus Is Born

So we set out on our journey to create what would become the Corpus BQ project: a centralized single source of truth for SoundCloud’s most relevant data. To do this, we created the Data Corpus team, whose mission statement is to “Implement data governance on the datasets that are most critical for product and business decision making.”

And just what exactly is data governance? According to Wikipedia, data governance is “a data management concept concerning the capability that enables an organization to ensure that high data quality exists throughout the complete lifecycle of the data, and data controls are implemented that support business objectives.” This was what we needed to become a more data-driven company.

Corpus as a team sits somewhere between what most people think of as Data Science and Data Engineering. At SoundCloud, there’s a dedicated team of data engineers known as Data Platform, which is responsible for maintaining all the infrastructure around data. We like to think of it as “Data Platform engineers the infrastructure, we engineer the data.” Our job titles are still Data Scientist or Data Engineer, depending on each person’s background and strengths, but we’re closer to what some companies now call analytics engineers.

Corpus Principles

After defining a team mission, we needed to decide how we were going to implement data governance. We came up with this list of six principles, which will be covered in detail in the rest of this post:

  1. Quality
  2. Compliance
  3. Timeliness
  4. Usability
  5. Efficiency
  6. Maintainability

This list is based on an ISO standard that we adjusted to our needs by removing characteristics that were outside our responsibilities and adding “maintainability,” for reasons we’ll explain below.

Quality

Quality is defined as “providing a consistent and accurate source of truth.” The bottom line is: Our stakeholders need to be able to trust our data, or they won’t use it. Exactly how we achieve this could be the topic of its own blog post, since it involves several different strategies. Here are a few examples:

  • Whenever we perform a change on an existing ETL or table, we create a new version of it and compare it against the currently deployed version. We then use an automated tool we built to validate that the two versions are exactly the same, with the exception of the changes we introduced.
  • All our ETLs have data quality checks that run whenever new data is added to a table. These range from cardinality checks (e.g. “Does that table that’s supposed to be one row per user really have one row per user?”) to checking the input versus the output data (e.g. “Did all the clicks in the input make it into the output?”).
  • We have an outlier detection algorithm (based on Tukey’s interquartile range method) that runs daily and warns us of any unexpected data observations. We then investigate these outliers and notify stakeholders if there’s anything of note. (Both the cardinality check and the outlier detection are sketched after this list.)
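
To make the last two kinds of checks more concrete, here is a minimal sketch of what they can look like in BigQuery Standard SQL. The table and column names (corpus.users_daily, corpus.daily_plays, and so on) are hypothetical, and our real checks run automatically as part of the ETLs whenever new data lands, but the queries show the general shape.

```sql
-- Cardinality check: the table is supposed to have one row per user,
-- so this query must return zero rows for the check to pass
-- (hypothetical table and column names).
SELECT user_id, COUNT(*) AS row_count
FROM corpus.users_daily
GROUP BY user_id
HAVING COUNT(*) > 1;

-- Outlier detection with Tukey's interquartile range method: flag days whose
-- play counts fall outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
WITH bounds AS (
  SELECT
    APPROX_QUANTILES(play_count, 4)[OFFSET(1)] AS q1,
    APPROX_QUANTILES(play_count, 4)[OFFSET(3)] AS q3
  FROM corpus.daily_plays
)
SELECT d.event_date, d.play_count
FROM corpus.daily_plays AS d
CROSS JOIN bounds AS b
WHERE d.play_count < b.q1 - 1.5 * (b.q3 - b.q1)
   OR d.play_count > b.q3 + 1.5 * (b.q3 - b.q1);
```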

Compliance

We make sure we comply with the data regulations that apply to us, such as GDPR. This means, for instance, that we don’t provide any personally identifiable information (PII) in any of our resources. If there’s a need for this sort of data, there are established internal procedures to go through that don’t involve our team.

Timeliness

Internally, we define timeliness as having our daily updates finished by 10AM CET. In actuality, most days, they’re finished before 8AM CET. This is important because it ensures the data is available by the time our stakeholders start working.

We ensure we meet this deadline by working on both the efficiency of our ETLs and the timeliness of their dependencies. When analyzing our overall timeliness, we map out the landing times of both our ETLs and their source data. We then work on improving our code to make it run faster, and we work with the teams that provide the source data to ensure there are no bottlenecks.

One example of an easy win: A table we needed had its completion unnecessarily coupled to that of another table, one we didn’t actually need and which took much longer to finish. We worked with the owners of these tables to decouple them and removed four hours from our pipeline delivery time in one fell swoop.

When we fail to meet our 10AM deadline, which naturally happens from time to time, we get notified automatically, and whoever is on call investigates and works on solving whatever the issue is. If the issue isn’t quickly solvable, the matter is escalated to an incident and becomes an engineering-wide priority to resolve.

Usability

We define this principle as “providing easily accessible and consumable data.” One of the main problems with our old setup wasn’t that we didn’t have enough data; it was that knowing when to use what, and how, was very hard. As such, with Corpus, we set out to build (as much as possible) a self-service platform providing resources that all our stakeholders can use, as long as they have basic SQL skills. There will always be a need for certain analyses to be done by experts (data scientists/data analysts), but this shouldn’t be the case for basic data needs.

We ensure usability by focusing on the following:

  • Descriptions — This might seem too simple to have a real impact, but it’s one of the features our stakeholders most appreciate. There’s no need for them to ping around between Slack channels asking what the meaning of a certain field is — it’s all explained in the same place they get our data from, in the BQ UI. We even add at least one example query.
  • Design documents — These focus on documenting more advanced use cases (more complex example queries), data anomalies (there are always some), and context around certain design decisions. This is what our stakeholders dig into if their questions cannot be answered directly by the descriptions.
  • Ease of use of resources — We don’t want to build something that’s technically perfect but hard to use. Sometimes we sacrifice storage or efficiency to make sure a table is still queryable without expert SQL skills. Other times, when the price of usability is too high, we build a table with a non-trivial schema (because it saves a lot of storage, for instance), but then we build a view on top of it with a simpler schema and only show that view to stakeholders, giving us the best of both worlds (see the sketch after this list).
  • Providing training — We provide (and record) company-wide training both on SQL and on how to use the resources we provide.
  • Accessibility — The fact that we use a cloud solution (BigQuery) that’s ready to use out of the box means no one actually needs to install anything or ask for access to use any of our resources; they just need to access the BQ console by authenticating through their @soundcloud.com email address. Note that our data is available across the company because it doesn’t have any PII, as explained in the Compliance section above.
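
To illustrate the “Ease of use of resources” point above, here is a rough sketch, with hypothetical table and field names, of how a storage-friendly nested table can sit behind a simpler, flat view that is the only thing stakeholders see.

```sql
-- Hypothetical nested table: one row per user and day, with metrics stored as
-- a repeated field. This keeps storage compact, but querying it requires
-- UNNEST, which not every stakeholder is comfortable with.
CREATE TABLE IF NOT EXISTS corpus.user_metrics_nested (
  user_id INT64,
  event_date DATE,
  metrics ARRAY<STRUCT<name STRING, value FLOAT64>>
);

-- The view stakeholders actually query: one row per user, day, and metric,
-- with no nesting in sight.
CREATE OR REPLACE VIEW corpus.user_metrics AS
SELECT
  user_id,
  event_date,
  metric.name AS metric_name,
  metric.value AS metric_value
FROM corpus.user_metrics_nested,
  UNNEST(metrics) AS metric;
```

A nice side effect of this layout is that adding a new metric just means adding new elements to the metrics array rather than new columns, which ties into the Maintainability principle below.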

Efficiency

As hinted at by some of the other principles, we want to optimize execution time and storage, as these affect both performance and costs. For this, we try to make the most of BQ features such as sketches, nested data, and user-defined functions (UDFs).
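
As one illustrative example, with hypothetical table and column names, BigQuery’s HLL_COUNT functions let us store a small HyperLogLog++ sketch per day and still answer approximate distinct-count questions over arbitrary date ranges without rescanning the raw events.

```sql
-- Build one small HyperLogLog++ sketch of distinct listeners per day
-- (hypothetical source table).
CREATE OR REPLACE TABLE corpus.daily_listener_sketches AS
SELECT
  event_date,
  HLL_COUNT.INIT(user_id) AS listeners_sketch
FROM corpus.plays
GROUP BY event_date;

-- Approximate distinct listeners over any date range by merging the daily
-- sketches, without touching the raw play events again.
SELECT HLL_COUNT.MERGE(listeners_sketch) AS approx_distinct_listeners
FROM corpus.daily_listener_sketches
WHERE event_date BETWEEN '2021-01-01' AND '2021-01-31';
```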

Getting proficient with BQ took us some time, since it works quite differently from Redshift, which we were used to. A few things helped with this learning curve:

  • We were able to use BQ with the old tables for a while before we had to start building the new versions, which let us get our feet wet before making the big design decisions. It also showed us what did and didn’t work about our old designs on the new platform.
  • We hired a consultant who was already an expert on BQ to help us with the migration. Their suggestions opened our eyes to features we didn’t know existed, such as sketches, which made a big difference once we started using them.
  • We acknowledge that BigQuery is in constant development and improvement, so we keep a close eye on new releases and even work with Google from time to time to try to identify the features that would be the most helpful to us.

Maintainability

Knowing the pain of maintaining something we didn’t build and had no documentation on (who doesn’t?!), we wanted to make sure we were building code that was easy to read, extend, and maintain, so we decided to make this into its own principle.

We accomplish this by focusing on three concepts:

  • The “80 percent principle” — We aim to create tables that respond to 80 percent of stakeholder requests, not 100 percent. For 100 percent, one will often get into too many edge cases and end up with an over-complicated setup for something that’s rarely used. As such, we don’t try to answer every single data question out there — only 80 percent of them. This principle, maybe more than anything else, has become a team staple that we turn to often in discussions.
  • Shared set of best practices — We maintain a set of data best practices covering the different tools we use and the use cases we have. This keeps a lot of commonality between different team members’ work, which helps us maintain broad ownership of everything we do. We also share these practices with other teams.
  • Scalability — We’re well aware that products change, and we’ve seen SoundCloud evolve a lot over the years: Features are launched or deprecated, stakeholders change, and so do business priorities. We know our data needs to evolve over time as well. As such, we try to build our resources in a way that’s easy to extend (e.g. adding a new metric doesn’t require altering a table’s schema). We also have well-established versioning and release processes, both of which allow us to evolve the data we provide without ever breaking service to our stakeholders (this could be a blog post of its own too); a rough sketch of the idea follows this list.
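
The nested metrics layout sketched in the Usability section is one half of this: a new metric becomes a new value in the metrics array rather than a new column. The other half is versioning. The example below uses hypothetical names rather than our actual release tooling, but it shows the general idea of keeping stakeholders on a stable view while versioned tables change underneath it.

```sql
-- Each release is materialized as its own versioned table, and a stable view
-- points at whichever version is current. Stakeholders only ever query the
-- view, so swapping versions never breaks them (hypothetical names).
CREATE OR REPLACE VIEW corpus.users AS
SELECT * FROM corpus.users_v3;

-- Once a new version has been built and validated against the old one,
-- rolling it out is a single view swap:
-- CREATE OR REPLACE VIEW corpus.users AS
-- SELECT * FROM corpus.users_v4;
```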

One aspect of our ETLs that might surprise people is the fact that they’re all in BQ SQL. In the past, we owned ETLs in PostgreSQL and Spark/Scala, some with Python mixed in, and even one in Haskell. However, in BQ, every data pipeline we’ve encountered (so far) is actually simpler and more efficient to build using plain old SQL. We’re open to exploring other options — BQ can support many different setups. We just haven’t found a need for this, and we also greatly value the consistency and readability benefits we get from using a language that most of the company is familiar with.

The Finished Product (So Far)

Putting all of these principles into practice, we were able to build a BQ project that takes up 90 percent less storage space than the raw data it reads and that empowers anyone in the company to answer most (we are, after all, aiming for 80 percent) of their data questions.

If you’re wondering what the rest of our stack looks like, it doesn’t stop at BQ:

  • For scheduling our ETLs, we use Apache Airflow, a versatile scheduler that makes sure our ETLs (along with their checks) run every day — and we’re notified if something goes wrong.
  • For dependency management, we use an internal tool (built by our Data Platform team) called Datapub.
  • For the BQ project infrastructure, we use Terraform to manage access rights, dataset and table/view/UDF creation/deletion, and schemas.
  • For powering the automated data validation tool mentioned in the Quality section above, we use pytest.
  • For version control, the whole company relies on GitHub.

These are of course just our main tools; others will come up from time to time.

Final Notes on the Journey

It’s only fair to point out that we didn’t have such a clearly defined mission and principles when we were in the midst of putting the first version of the Corpus together and trying to get the team up and running. We started smaller and simpler, but we always had certain principles in mind (like the “80 percent principle”), which made a big difference at times when we faced tough decisions on which direction to take.

This is to say: Coming up with such a clearly defined scope and set of principles when you start a project like this might seem like a daunting task, but you still shouldn’t skip the step of thinking about these things. The temptation to just get the work started and figure it all out as you go along might be high. For us, though, making sure everyone had clear foundations to fall back on paid off in the long run. And when we didn’t have a pre-existing principle to rely on, we discussed which way to go and extended our principles so the question was covered the next time it came up.

These principles also served many times as a first line of defense when outside teams or stakeholders tried to change our scope. Often, once we explained the reasoning behind why we should or shouldn’t do certain things, they’d agree with us.

However, we are of course open to changing these principles if there’s a strong enough reason to do so. One big lesson we learned from this migration (a reminder, really) was that just because a decision made sense at a certain point in time doesn’t mean it still makes sense five years later. We try to never lose sight of the fact that this current setup can easily become as outdated as the one we had pre-migration and, hence, we keep an open mind about possible changes.

Another aspect of this journey that might be easy to forget is that, when we started the migration, we planned for a first implementation followed by a refactoring. Instead of waiting for the perfect conditions for us to be able to do exactly what we wanted, we got started with what we had. This allowed us to gain crucial experience in tools that were new to us (BQ and Airflow) and immensely improve from the first iteration of Corpus to the second. As a matter of fact, the road to a Data Corpus that we could proudly say abided by all of these principles was actually quite long:

  • Early 2018 — There was a first attempt at migrating using a lift-and-shift approach; it was later aborted for a variety of reasons, but it provided many valuable lessons.
  • Mid 2019 — We started planning a new migration from scratch.
  • December 2019 — The first intermediate version of Corpus was released, still using the old data warehouse as a source in some places.
  • April 2020 — The release of Corpus Alpha, the first iteration that was truly independent from our old data warehouse.
  • June 2020 — The release of Corpus Beta, where we did a major refactoring of the code, unified table schemas, and saw big gains on storage and execution time.
  • July 2020 — The release of Corpus 1.0, the first version of Corpus that abided by all the principles described in this post.

We recently released Corpus 2.2 in April 2021, with Corpus 2.0 and 2.1 coming before that, and we’re now starting the work on Corpus 3.0. At this point, our release work is mostly focused on extending Corpus’ coverage by adding tables that cover new entities or metrics, which are driven by the natural evolution of the product itself. We also fix the few bugs that get discovered from time to time, which is an inevitable reality of building anything. Corpus really is a continuously evolving project, which is what makes working on this team interesting and fun!

If all of this sounds like an exciting challenge to you, then apply for the Data Corpus team!

Acknowledgements

None of this would’ve been possible without the work of many SoundClouders outside the Data Corpus team, namely:

  • The old Data Science Analytics team, which later got separated into the Data Science and Business Research teams. Many members of these teams contributed to the first version of Corpus when we weren’t yet a dedicated team. Nowadays, they are our stakeholders and help us prioritize what to work on next.
  • Data Platform team — In addition to providing much helpful guidance and support during the migration, this team is continuously ensuring we have the necessary infrastructure to do our work.
  • Content Authorization and Payments teams — Producers of data sources we consume who worked closely with us to meet our requirements.