The topic of datasets is not new, and they have been successfully used at SoundCloud for analytics purposes for a long time. In this blog post, however, we’ll delve into how the culture of building and providing datasets is being adopted in more areas of engineering beyond the domain of analytics.
When we talk about datasets, we’re talking about published datasets. In this case, a team owning the data in question would also own the published dataset and be responsible for delivering it in a particular shape, as agreed upon with the internal consumers, i.e. the end users of the dataset. As a result, the dataset encapsulates any domain intricacies that should not leak into the consumer’s domain.
This post shares our experience of adopting “A Better Model of Data Ownership,” which we blogged about more than two years ago. It discusses our learnings and digs deeper into the specifics and technical details.
In particular, we’ll focus on exploring three questions:
The process of data management in a data privacy-compliant organization is beyond the scope of this blog post. Whenever the access to or use of personally identifiable information is not justified by the business case, further treatment of data is required, but this is not discussed here.
Before going any further, here are a couple of examples of datasets we will refer to in the following sections. We deal with artists, tracks, and plays here at SoundCloud, so we’ll draw examples from that domain.
Track Metadata Dataset (holds information about a given track)
| track_id | title | artist | valid_from | valid_to | is_current |
|---|---|---|---|---|---|
| 1 | Longer, Better, Stronger Passwords | Adept Surfer | 2019-02-05 | 2060-01-01 | true |
| 2 | Think Before You Click | Cautious Mouse | 2019-02-05 | 2019-05-01 | false |
| 2 | Think Before You Click (On My Heart) | Cautious Mouse | 2019-05-01 | 2060-01-01 | true |
Track Rightsholders Dataset (records the label or individual who owns rights to a track in a given territory)
| track_id | territory | rightsholder_id | valid_from | valid_to | is_current |
|---|---|---|---|---|---|
| 1 | DE | 182 | 2019-02-05 | 2060-01-01 | true |
| 1 | FR | 182 | 2019-02-05 | 2060-01-01 | true |
| 2 | DE | 42 | 2019-02-05 | 2060-01-01 | true |
The particular representation shown above is called a slowly changing dimension table, since it captures the historical changes of an entity, e.g. a track, over time. For many datasets, SoundCloud has settled on the following shape:
- A primary key identifying the entity, made up of one or more columns (e.g. `track_id`, or the pair `track_id`, `territory`). However, there may be multiple entries for the same entity as it changes over time.
- `valid_from: datetime` — the start of the validity period for the given row (technically also part of the primary key).
- `valid_to: datetime` — the end of the validity period for the given row (or a time in the far future).
- `is_current: boolean` — `true` if the given row is the valid one at the time of dataset creation, and `false` otherwise.

The dataset may or may not contain gaps in the time dimension for a single entity. This depends on whether or not it makes sense in the domain.
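To make the validity columns concrete, here is a minimal Python sketch of how a consumer can look up the version of an entity that was valid at a given point in time, using a couple of rows from the Track Metadata example above. The helper `state_as_of` is purely illustrative and not part of any SoundCloud tooling; at real scale, this is a warehouse query rather than an in-memory loop.

```python
from datetime import date

# A few rows of the Track Metadata dataset above, as plain dicts (illustrative only).
track_metadata = [
    {"track_id": 2, "title": "Think Before You Click", "artist": "Cautious Mouse",
     "valid_from": date(2019, 2, 5), "valid_to": date(2019, 5, 1), "is_current": False},
    {"track_id": 2, "title": "Think Before You Click (On My Heart)", "artist": "Cautious Mouse",
     "valid_from": date(2019, 5, 1), "valid_to": date(2060, 1, 1), "is_current": True},
]

def state_as_of(rows, track_id, as_of):
    """Return the version of a track that was valid on a given date, or None."""
    for row in rows:
        if row["track_id"] == track_id and row["valid_from"] <= as_of < row["valid_to"]:
            return row
    return None  # a gap in the time dimension, or the track did not exist yet

print(state_as_of(track_metadata, 2, date(2019, 3, 1))["title"])
# Think Before You Click
```

Note that the intervals here are treated as half-open: a row is valid from `valid_from` inclusive up to, but not including, `valid_to`, which is why consecutive versions of track 2 share the boundary date.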
One word on representing entities that have been deleted: deleted records appear as rows with the latest `valid_to` in the past and `is_current = false`. In that sense, these rows look similar to old versions of an entity that simply has no `is_current` row. This makes it harder to retrieve the latest version of all entities, regardless of whether or not they have been deleted. If retrieval of deleted entries is required, an `is_deleted` column can also be introduced in the dataset.
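As a rough illustration of why such a column helps, without it a consumer has to scan every version of every entity just to tell deleted tracks apart from live ones. A sketch (again, our own helper, not actual tooling):

```python
from collections import defaultdict

def deleted_track_ids(rows):
    """Tracks with no current row at all, i.e. tracks that have been deleted."""
    has_current_row = defaultdict(bool)
    for row in rows:
        has_current_row[row["track_id"]] |= row["is_current"]
    return {track_id for track_id, current in has_current_row.items() if not current}
```

With an explicit `is_deleted` flag on the rows, the same question becomes a simple filter.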
Naturally, not all datasets need to have or can have a time dimension. A plays dataset is an example of a simple fact table:
Plays Dataset
| timestamp | listener_id | track_id | territory |
|---|---|---|---|
| 2019-02-15 11:20:00 UTC | 1349 | 1 | DE |
| 2019-02-15 11:30:00 UTC | 1349 | 1 | DE |
| 2019-02-15 12:05:00 UTC | 1066 | 2 | FR |
Now that we’ve seen what the datasets actually look like, let’s consider the various ways in which the company exchanged and consumed data before adopting the idea of published datasets.
The shift toward using datasets at SoundCloud arrived together with the shift toward batch-oriented processing where it makes sense. This means replacing online services with batch jobs for tasks such as: royalties calculation, report generation for labels and artists, processing content delivered by labels, some aspects of search and recommender systems, etc. All of these examples rely on data owned by other teams in the company, which brings up questions of contracts, SLAs, domain encapsulation, and team interdependency.
This post will now highlight the experience of the Royalties & Reporting team here at SoundCloud, a team that has always been a heavy consumer of data and has undergone a shift toward the use of published datasets. The responsibilities of the team include calculating royalties and building reports for artists and labels.
An example of a report is a CSV file that shows the breakdown of tracks, track metadata, play counts, and associated royalty earnings. In relational database terms, you could think of it as a join between a plays table, a track metadata table, a track rightsholder table, and a royalties table. Usually, these reports are generated on a schedule (as opposed to on demand), so they fit the batch-processing model quite well.
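In Python terms, the core of such a report could look roughly like the sketch below. The field names follow the example datasets from the introduction, the flat `royalty_per_play` rate is a deliberate oversimplification of how royalties are actually calculated, and the real jobs run on a distributed batch-processing framework rather than in memory.

```python
from collections import Counter

def report_rows(plays, track_metadata, rightsholders, royalty_per_play):
    """Play counts and earnings per (track, territory), enriched with the
    currently valid track metadata and rightsholder.
    Assumes every played track has a current metadata row and rightsholder."""
    play_counts = Counter((p["track_id"], p["territory"]) for p in plays)
    for (track_id, territory), count in sorted(play_counts.items()):
        meta = next(m for m in track_metadata
                    if m["track_id"] == track_id and m["is_current"])
        holder = next(r for r in rightsholders
                      if r["track_id"] == track_id
                      and r["territory"] == territory and r["is_current"])
        yield {"track_id": track_id, "title": meta["title"], "artist": meta["artist"],
               "territory": territory, "rightsholder_id": holder["rightsholder_id"],
               "play_count": count, "earnings": count * royalty_per_play}
```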
With reference to the following figure, consider the various ways in which these reporting jobs used to consume data in the past. We will discuss the pros and cons of consuming events, snapshotting databases, and consuming live service APIs, all of which we have tried but eventually moved away from.
Events are great. They are a solid approach for storing the changes of state of a system without prescribing how such events should later be consumed or interpreted. Many SoundCloud systems already publish events.
Events can be consumed in batch jobs directly, and were, in fact, the principal way of consuming data in our reporting jobs until very recently. But this approach carries a handful of tradeoffs, which are outlined below.
Events seemed very promising at first. However, as time went on, we realized they were difficult to work with and the jobs did not scale well after a year or two. The limited ad hoc querying abilities and increasingly slow jobs made development and operations frustrating and costly.
We did this. We performed full dumps of the database table backing a live service we didn’t own.
It’s quite clear how snapshotting databases breaks many principles of encapsulation. Additionally, we were spending a lot of time building and maintaining all the datasets we needed ourselves. They were not the responsibility of the data owner, and they were not available to other consumers.
To be clear, snapshotting is a viable option for cheaply and easily building datasets. However, it’s a completely different story when the data owner, and not the consumer, is doing it with full knowledge and understanding of their databases.
Yes, we did this too. We would make API calls to a live service from inside a batch job.
Published datasets combine some of the positive aspects of the approaches above while eliminating practically all of the negative ones. To reiterate, published datasets are produced by the data owner following an agreement with the consumer or consumers. We’ll discuss the pros and cons of these below.
There are two primary ways in which we build our datasets: from events, and from database snapshots. As discussed in the previous section, we have experience consuming both events and snapshots in our batch jobs. What we wanted to do differently was to (a) establish a coherent way of building these datasets, (b) create standard tooling to cover the most common cases, and (c) make it the responsibility of the data owner to provide a dataset that was clean, consistent, and timely.
The two methods of building datasets (from events and from snapshots) are not as dissimilar as they may seem. In fact, they share many of the same steps, which becomes apparent when looking at the figure below. Most datasets we produce at SoundCloud take the shape of dimension tables, and they capture the historical changes of an entity over time, as shown in the introduction section. We will assume we’re building such a dataset in the discussion below.
In the case of events, the workflow is a single step, outlined below.
Note: If you have a very small number of events, you may choose to roll them up from the beginning of time every time you produce the dataset. This gives you the most flexibility and ensures the most recent code is used for rollup.
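As an illustration, a minimal version of that rollup step could look like the sketch below, assuming each event carries the full new state of a track. The `occurred_at` field name is ours, and the real rollup runs as a distributed batch job rather than in memory.

```python
from datetime import date
from itertools import groupby

FAR_FUTURE = date(2060, 1, 1)  # the sentinel used in the example tables above

def roll_up(events):
    """Roll per-track change events up into dimension rows with validity intervals."""
    rows = []
    events = sorted(events, key=lambda e: (e["track_id"], e["occurred_at"]))
    for track_id, track_events in groupby(events, key=lambda e: e["track_id"]):
        track_events = list(track_events)
        # each event opens a validity interval that the next event closes
        for event, next_event in zip(track_events, track_events[1:] + [None]):
            rows.append({
                "track_id": track_id,
                "title": event["title"],
                "artist": event["artist"],
                "valid_from": event["occurred_at"],
                "valid_to": next_event["occurred_at"] if next_event else FAR_FUTURE,
                "is_current": next_event is None,
            })
    return rows
```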
The workflow to produce a release candidate is covered below.
This multistep procedure allows you to decouple your published dataset from the source table. In the past, we’ve had problems caused by coupling schemas in the database with schemas in the data warehouse, because changing the source becomes impossible without coordinating changes far downstream.
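As a deliberately simplified illustration of the snapshot path, the merge step could look roughly like the sketch below: the snapshot is assumed to have already been mapped from the source schema into the published one, and is then folded into the previous release of the dataset. The function and field names are ours, not SoundCloud tooling, and a real implementation runs as a batch job.

```python
from datetime import date

FAR_FUTURE = date(2060, 1, 1)

def merge_snapshot(previous_rows, snapshot, snapshot_date):
    """Fold today's snapshot (track_id -> attributes already mapped to the
    published schema) into the previous release of the dataset."""
    merged, seen = [], set()
    for row in previous_rows:
        if not row["is_current"]:
            merged.append(row)          # history is immutable
            continue
        seen.add(row["track_id"])
        new_attrs = snapshot.get(row["track_id"])
        if new_attrs == {"title": row["title"], "artist": row["artist"]}:
            merged.append(row)          # unchanged: the current row stays open
        else:
            # changed or deleted: close the old row ...
            merged.append({**row, "valid_to": snapshot_date, "is_current": False})
            # ... and open a new current row if the track still exists
            if new_attrs is not None:
                merged.append({"track_id": row["track_id"], **new_attrs,
                               "valid_from": snapshot_date,
                               "valid_to": FAR_FUTURE, "is_current": True})
    for track_id, attrs in snapshot.items():
        if track_id not in seen:        # first time we see this track
            merged.append({"track_id": track_id, **attrs,
                           "valid_from": snapshot_date,
                           "valid_to": FAR_FUTURE, "is_current": True})
    return merged
```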
Whether the dataset is produced from snapshots or from events, the shape of the release candidate is the same, so the validation and publishing/promoting steps are the same.
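For the dimension-table shape, validation could assert invariants along the lines of the hypothetical sketch below: well-formed, non-overlapping validity intervals and at most one current row per key.

```python
def validate(rows, key_columns=("track_id",)):
    """Invariants a release candidate should satisfy before being promoted."""
    versions_by_key = {}
    for row in rows:
        versions_by_key.setdefault(tuple(row[c] for c in key_columns), []).append(row)
    for key, versions in versions_by_key.items():
        versions.sort(key=lambda r: r["valid_from"])
        assert sum(r["is_current"] for r in versions) <= 1, f"{key}: multiple current rows"
        for earlier, later in zip(versions, versions[1:]):
            assert earlier["valid_to"] <= later["valid_from"], f"{key}: overlapping intervals"
        for r in versions:
            assert r["valid_from"] < r["valid_to"], f"{key}: empty or inverted interval"
    return rows
```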
At SoundCloud, we have standard tooling for building datasets from snapshots of MySQL databases. This greatly simplifies the creation of new datasets, and as a side effect, it promotes a consistent and common shape of datasets.
Now that we’ve looked at the two ways of building a dataset, let’s look at the considerations that the dataset producer and consumers need to agree on, as these inform the approach taken.
For a dataset with a time dimension, what is the time resolution required? Hourly, every six hours, daily? Highest precision possible?
Finally, for some datasets, it may be important to distinguish between two types of changes to an entity:
As more datasets were built at SoundCloud, common patterns led to standard semantics and standard tooling for the most common cases. Even in those cases where standard tooling did not apply, there was already an established pattern for what the dataset could look like and how it could be published.
For heavy consumers of datasets, that also meant that the burden of building and maintaining the datasets was now shared with those who knew most about the data, albeit not always those who had the most experience with building the datasets. But even that was changing quickly. The additional effort and the initial learning curve on the part of data owners led to the organization-wide benefits of shared, well-specified, and well-maintained datasets.