It seems like a simple enough concept: You take data from how your users interact with your product, and you use it to make business and product decisions. Informed decisions will certainly be better than guesses, so in the end, this should pay off.
But that’s how it looks to anyone who hasn’t seen the dirty underbelly of data — from how broken tracking can lead to incorrect conclusions, to having lots and lots and lots of data but no documentation around what any of it means. If you’ve worked a day in your life as an analyst, data scientist, data engineer, or in any other data-heavy role, you know how bad things can get.
In mid-2019, we at SoundCloud were in a position many companies have probably found themselves in. As the world’s largest open audio platform, we had a wealth of data, and our stakeholders wanted to make use of it. But access to that data was hard: Our data warehouse was the accumulation of six years of decisions that made sense at the time but naturally grew outdated over the years. As a result, we just weren’t able to support the business and product teams in the data-driven decisions they wanted to make.
More concretely, a single (small) team owned more than 100 ETLs (extract, transform, load processes) on Amazon Redshift — most with little to no documentation and some inconsistencies between them (they were, after all, built at different points in time, by different people, for different purposes). Fortunately, we had signed a contract with Google to switch over to BigQuery (BQ) as our new data warehousing platform. This was our opportunity to start fresh and build something that would fulfill SoundCloud’s needs at the time and keep scaling with them.
So we set out on our journey to create what would become the Corpus BQ project: a centralized single source of truth for SoundCloud’s most relevant data. To do this, we created the Data Corpus team, whose mission statement is to “Implement data governance on the datasets that are most critical for product and business decision making.”
And just what exactly is data governance? According to Wikipedia, data governance is “a data management concept concerning the capability that enables an organization to ensure that high data quality exists throughout the complete lifecycle of the data, and data controls are implemented that support business objectives.” This was what we needed to become a more data-driven company.
Corpus as a team sits somewhere in the middle of what most people think of as Data Science and Data Engineering. At SoundCloud, there’s a dedicated team of data engineers known as Data Platform, which is responsible for maintaining all the infrastructure around data. We like to think of it as “Data Platform engineers the infrastructure, we engineer the data.” Our job titles are still Data Scientist or Data Engineer, depending on each person’s background and strengths, but we’re closer to what some companies now call analytics engineers.
After defining a team mission, we needed to decide how we were going to implement data governance. We came up with this list of six principles, which will be covered in detail in the rest of this post:
This list is based on an ISO standard that we adjusted to our needs by removing characteristics that were outside our responsibilities and adding “maintainability,” for reasons we’ll explain below.
Quality is defined as “providing a consistent and accurate source of truth.” The bottom line is: Our stakeholders need to be able to trust our data, or they won’t use it. Exactly how we achieve this could be the topic of its own blog post, since it involves several different strategies. Here are a couple of examples:
As required by law, we make sure we comply with the appropriate regulations for data, such as GDPR. This means, for instance, that we don’t provide any personally identifiable information (PII) in any of our resources. If there’s a need for this sort of data, there are established internal procedures to go through that don’t involve our team.
Internally, we define timeliness as having our daily updates finished by 10AM CET. In practice, most days they’re finished before 8AM CET. This is important because it ensures the data is available by the time our stakeholders start working.
We ensure we meet this deadline by working on both the efficiency of our ETLs and the timeliness of their dependencies. When analyzing our overall timeliness, we map out the landing times of both our ETLs and their source data. We then work on improving our code to make it run faster, and we work with the teams that provide the source data to ensure there are no bottlenecks.
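To make the landing-time mapping concrete, here’s a rough sketch of the kind of check this could involve, using BQ’s `__TABLES__` metadata view. The project, dataset, and table names are made up for illustration — this is not our actual monitoring code:

```sql
-- Hypothetical example: when did each source table last land,
-- and did anything miss the 10AM CET deadline?
SELECT
  table_id,
  TIMESTAMP_MILLIS(last_modified_time) AS landed_at,
  -- Flag tables last modified after today's 10AM deadline (CET/CEST)
  TIMESTAMP_MILLIS(last_modified_time)
    > TIMESTAMP(CONCAT(CAST(CURRENT_DATE('Europe/Berlin') AS STRING), ' 10:00:00'),
                'Europe/Berlin') AS missed_deadline
FROM `my-project.source_dataset.__TABLES__`
WHERE table_id IN ('plays', 'users', 'tracks')
ORDER BY landed_at DESC;
```

A query like this makes it easy to spot which upstream dependency is the bottleneck on a slow day.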
One example of an easy win: One table we needed had its completion unnecessarily coupled with the completion of another table we didn’t actually need, and which took much longer to complete than the first table. We worked with the owners of these tables to decouple them, and we removed four hours from our pipeline delivery time in one fell swoop.
When we fail to meet our 10AM deadline, which naturally happens from time to time, we get notified automatically, and whoever is on call investigates and works on solving whatever the issue is. If the issue isn’t quickly solvable, the matter is escalated to an incident and becomes an engineering-wide priority to resolve.
We define this principle as “providing easily accessible and consumable data.” One of the main problems with our old setup wasn’t that we didn’t have enough data; it was that it was very hard to know what to use when, and how. As such, with Corpus, we set out to build (as much as possible) a self-service platform to provide resources that can be used by all our stakeholders, as long as they have basic SQL skills. There will always be a need for certain analyses to be done by experts (data scientists/data analysts), but this shouldn’t be the case for basic data needs.
We ensure usability by focusing on the following:
As hinted at by some of the other characteristics, we want to make sure we optimize execution time and storage, as these relate to both performance and costs. For this, we try to make the most of BQ’s features like sketches, nesting, and UDFs.
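To give a flavor of what using those features together can look like, here’s an illustrative query combining a temporary UDF, HLL sketches, and nested rows. The table and column names are invented for this example:

```sql
-- A temp UDF keeps repeated cleanup logic readable.
CREATE TEMP FUNCTION normalize_country(code STRING) AS (
  UPPER(TRIM(code))
);

WITH per_track AS (
  SELECT
    normalize_country(country) AS country,
    track_id,
    COUNT(*) AS plays,
    -- A sketch is far smaller than the raw list of user IDs.
    HLL_COUNT.INIT(user_id) AS listeners_sketch
  FROM `my-project.raw.plays`
  GROUP BY country, track_id
)
SELECT
  country,
  -- Sketches merge, so country-level uniques come straight
  -- from the track-level sketches without rescanning raw data.
  HLL_COUNT.MERGE(listeners_sketch) AS unique_listeners,
  -- Nesting: one row per country, with per-track detail inside it,
  -- instead of duplicating country-level columns on every track row.
  ARRAY_AGG(STRUCT(track_id, plays) ORDER BY plays DESC LIMIT 100) AS top_tracks
FROM per_track
GROUP BY country;
```

Sketches in particular are a big storage win: approximate distinct counts can be pre-aggregated and re-merged along any dimension later, at the cost of a small, bounded error.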
Getting proficient with BQ was something that took us some time, since it works quite differently from what we were used to before (Redshift). A few things helped with this learning curve:
Knowing the pain of maintaining something we didn’t build and had no documentation on (who doesn’t?!), we wanted to make sure we were building code that was easy to read, extend, and maintain, so we decided to make this its own principle.
We accomplish this by focusing on three concepts:
One aspect of our ETLs that might surprise people is the fact that they’re all in BQ SQL. In the past, we owned ETLs in PostgreSQL and Spark/Scala, some with Python mixed in, and even one in Haskell. However, in BQ, every data pipeline we’ve encountered (so far) is actually simpler and more efficient to build using plain old SQL. We’re open to exploring other options — BQ can support many different setups. We just haven’t found a need for this, and we also greatly value the consistency and readability benefits we get from using a language that most of the company is familiar with.
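As a sketch of what “plain old SQL” can mean for a daily pipeline step, here’s a hypothetical idempotent upsert — re-running it for the same day updates existing rows instead of duplicating them. All names here are invented, not our actual Corpus tables:

```sql
-- Recompute yesterday's per-track aggregates and upsert them.
DECLARE run_date DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);

MERGE `my-project.corpus.daily_track_stats` AS target
USING (
  SELECT
    run_date AS stats_date,
    track_id,
    COUNT(*) AS plays
  FROM `my-project.raw.plays`
  WHERE DATE(played_at) = run_date
  GROUP BY track_id
) AS source
ON target.stats_date = source.stats_date
   AND target.track_id = source.track_id
WHEN MATCHED THEN
  UPDATE SET plays = source.plays
WHEN NOT MATCHED THEN
  INSERT (stats_date, track_id, plays)
  VALUES (source.stats_date, source.track_id, source.plays);
```

Because the whole step is a single SQL statement, anyone in the company who can read SQL can read the pipeline.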
Putting all of these principles into practice, we were able to build a BQ project that takes up 90 percent less storage space than the raw data it reads. In turn, that empowers anyone in the company to answer most (we are, after all, aiming for 80 percent) of their data questions.
If you’re wondering what the rest of our stack looks like, it doesn’t stop at BQ:
These are of course just our main tools; others will come up from time to time.
It’s only fair to point out that we didn’t have such a clearly defined mission and principles when we were in the midst of putting the first version of the Corpus together and trying to get the team up and running. We started smaller and simpler, but we always had certain principles in mind (like the “80 percent principle”), which made a big difference at times when we faced tough decisions on which direction to take.
All of this is to say: Coming up with such a clearly defined scope and principles might seem daunting when you start a project like this, but you still shouldn’t skip the step of thinking these things through. The temptation to just get the work started and figure it all out as you go along might be high. For us, though, making sure everyone had clear foundations to fall back on paid off in the long run. When we didn’t have a pre-existing principle to rely on, we discussed which way to go and extended our principles to make sure the case was covered the next time it came up.
These principles also served many times as a first line of defense when outside teams or stakeholders tried to change our scope. Often, once we explained the reasoning behind why we should or shouldn’t do certain things, they’d agree with us.
However, we are of course open to changing these principles if there’s a strong enough reason to do so. One big lesson we learned from this migration (a reminder, really) was that just because a decision made sense at a certain point in time doesn’t mean it still makes sense five years later. We try to never lose sight of the fact that this current setup can easily become as outdated as the one we had pre-migration and, hence, we keep an open mind about possible changes.
Another aspect of this journey that might be easy to forget is that, when we started the migration, we planned for a first implementation followed by a refactoring. Instead of waiting for the perfect conditions to do exactly what we wanted, we got started with what we had. This allowed us to gain crucial experience in tools that were new to us (BQ and Airflow) and improve immensely from the first iteration of Corpus to the second. In fact, the road to a Data Corpus that we could proudly say abided by all of these principles was quite long:
We recently released Corpus 2.2 in April 2021, with Corpus 2.0 and 2.1 coming before that, and we’re now starting the work on Corpus 3.0. At this point, our release work is mostly focused on extending Corpus’ coverage by adding tables that cover new entities or metrics, which are driven by the natural evolution of the product itself. We also fix the few bugs that get discovered from time to time, which is an inevitable reality of building anything. Corpus really is a continuously evolving project, which is what makes working on this team interesting and fun!
If all of this sounds like an exciting challenge to you, then apply for the Data Corpus team!
None of this would’ve been possible without the work of many SoundClouders outside the Data Corpus team, namely: