How (Not) to Build Datasets and Consume Data at Your Company

March 3rd, 2020 by George Roldugin

The topic of datasets is not new, and they have been successfully used at SoundCloud for analytics purposes for a long time. In this blog post, however, we’ll delve into how the culture of building and providing datasets is being adopted in more areas of engineering beyond the domain of analytics.

When we talk about datasets, we’re talking about published datasets. In this case, a team owning the data in question would also own the published dataset and be responsible for delivering it in a particular shape, as agreed upon with the internal consumers, i.e. the end users of the dataset. As a result, the dataset encapsulates any domain intricacies that should not leak into the consumer’s domain.

This post shares our experience of adopting “A Better Model of Data Ownership,” which we blogged about more than two years ago. It discusses our learnings and digs deeper into the specifics and technical details.

In particular, we’ll focus on exploring three questions:

How not to consume data
How to consume data and who should provide it
How to build datasets and what to consider before you start

The process of data management in a data privacy-compliant organization is beyond the scope of this blog post. Whenever the access to or use of personally identifiable information is not justified by the business case, further treatment of data is required, but this is not discussed here.

Examples of Datasets at SoundCloud

Before going any further, here are a couple of examples of datasets we will refer to in the following sections. We deal with artists, tracks, and plays here at SoundCloud, so we’ll draw examples from that domain

Track Metadata Dataset (holds information about a given track)

`track_id`	`title`	`artist`	`valid_from`	`valid_to`	`is_current`
1	Longer, Better, Stronger Passwords	Adept Surfer	2019-02-05	2060-01-01	`true`
2	Think Before You Click	Cautious Mouse	2019-02-05	2019-05-01	`false`
2	Think Before You Click (On My Heart)	Cautious Mouse	2019-05-01	2060-01-01	`true`

Track Rightsholders Dataset (records the label or individual who owns rights to a track in a given territory)

`track_id`	`territory`	`rightsholder_id`	`valid_from`	`valid_to`	`is_current`
1	DE	182	2019-02-05	2060-01-01	`true`
1	FR	182	2019-02-05	2060-01-01	`true`
2	DE	42	2019-02-05	2060-01-01	`true`

The particular representation shown above is called a slowly changing dimension table, since it captures the historical changes of an entity, e.g. a track, over time. For many datasets, SoundCloud has settled on the following shape:

Primary keys — one or more columns that uniquely identify an entity (e.g, track_id, territory). However, there may be multiple entries for the same entity as it changes over time.
valid_from: datetime — the start of the validity period for the given row (technically also part of the primary key)
valid_to: datetime — the end of the validity period for the given row (or a time in the far future)
is_current: boolean — true if the given row is the valid one at the time of dataset creation, and false otherwise.

The dataset may or may not contain gaps in the time dimension for a single entity. This depends on whether or not it makes sense in the domain.

One word on representing entities that have been deleted: Deleted records appear as rows, with the latest valid_to in the past and is_current = false. In that sense, these rows look similar to old versions of an entity with no is_current row. This makes it harder to retrieve all entities, regardless of whether or not they have been deleted. If retrieval of deleted entries is required, it should also be possible to introduce an is_deleted column in the dataset.

Naturally, not all datasets need to have or can have a time dimension. A plays dataset is an example of a simple fact table:

Plays Dataset

`timestamp`	`listener_id`	`track_id`	`territory`
2019-02-15 11:20:00 UTC	1349	1	DE
2019-02-15 11:30:00 UTC	1349	1	DE
2019-02-15 12:05:00 UTC	1066	2	FR

Now that we’ve seen what the datasets actually look like, we will now consider all the various ways in which the company exchanged and consumed data before adopting the idea of published datasets.

How Not to Consume Data

The shift toward using datasets at SoundCloud arrived together with the shift toward batch-oriented processing where it makes sense. This means replacing online services with batch jobs for tasks such as: royalties calculation, report generation for labels and artists, processing content delivered by labels, some aspects of search and recommender systems, etc. All of these examples rely on data owned by other teams in the company, which brings up questions of contracts, SLAs, domain encapsulation, and team interdependency.

Data Consumption: A Story of One Team

This post will now highlight the experience of the Royalties & Reporting team here at SoundCloud, a team that has always been a heavy consumer of data and has undergone a shift toward the use of published datasets. The responsibilities of the team include calculating royalties and building reports for artists and labels.

An example of a report is a CSV file that shows the breakdown of tracks, track metadata, play counts, and associated royalty earnings. In relational database terms, you could think of it as a join between a plays table, a track metadata table, a track rightsholder table, and a royalties table. Usually, these reports are generated on a schedule (as opposed to on demand), so they fit the batch-processing model quite well.

With reference to the following figure, consider the various ways in which these reporting jobs used to consume data in the past. We will discuss the pros and cons of consuming events, snapshotting databases, and consuming live service APIs, all of which we have tried but eventually moved away from.

Undesirable Scenarios

Consuming Events

Events are great. They are a solid approach for storing the changes of state of a system without imposing a way of how such events should later be consumed or interpreted. Many SoundCloud systems already publish events.

Events can be consumed in batch jobs directly, and were, in fact, the principle way of consuming data in our reporting jobs until very recently. But this approach carries with it a handful of tradeoffs, which are outlined below.

Pros

Flexibility: Events allow for ultimate flexibility, as they capture the most amount of detail and the entire history. This can also be a bad thing, as the details may expose domain complexity.
Excellent time resolution: Each event is timestamped.
Auditing: Investigating and debugging historical changes is free, and nothing is ever deleted. As such, it is possible to recreate system state at any point in time.

Cons

Expensive infrastructure: Each service must support event infrastructure, and this cost is not justified for all uses.
Difficult to consume: Consumers must read all events from the beginning of time. It is impossible to get the current state without reading everything. This does not scale well.
Heavyweight: At SoundCloud, events are encoded as Google Protobuf messages and stored in Hadoop Sequence Files. This makes them opaque, with no ability to selectively consume individual fields without decoding complete events. This limitation is implementation-specific.
Exposes domain complexity: Events tend to include all of the information available for interpretation later on. This interpretation is better handled by the data owner than by every consumer.
Potentially lower quality data: This occurs in cases where the data owner does not consume their own events and instead relies on their databases and APIs. The same may be true of datasets.

Events seemed very promising at first. However, as time went on, we realized they were difficult to work with and the jobs did not scale well after a year or two. The limited ad hoc querying abilities and increasingly slow jobs made development and operations frustrating and costly.

Snapshotting Databases

We did this. We performed full dumps of the database table backing the live service we don’t own.

Pros

Accurate: The data owner and data consumer consume the same data.
Easy to consume: Structured data that can be consumed cheaply and easily.
Easy to develop with: Can be queried in an ad hoc manner for development and debugging purposes.

Cons

Bad coupling: The database model becomes an implicit contract and the consumer needs to be consulted regarding any changes.
No SLA: No explicit contract with the data owner around the quality and availability of data.
Bad domain encapsulation: This exposes internal domain complexity that may be misinterpreted by the consumer. It bypasses any domain processing that happens in the service.
No time dimension: There is no visibility of changes of data over time or deleted entities.

It’s quite clear how snapshotting databases breaks many principles of encapsulation. Additionally, we were spending a lot of time building and maintaining all the datasets we needed ourselves. They were not the responsibility of the data owner, and they were not available to other consumers.

To be clear, snapshotting is a viable option for cheaply and easily building datasets. However, it’s a completely different story when the data owner, and not the consumer, is doing it with full knowledge and understanding of their databases.

Consuming Live Services through APIs

Yes, we did this too. We would make API calls to a live service from inside a batch job.

Pros

Accuracy: This was the most accurate and up-to-date data.

Cons

Bad fit for batch-processing paradigm: Depending on the API design, it would not normally be possible to work with the data in bulk with all the flexibility you’d expect.
Poor performance: It does not scale well and has the possibility of introducing unnecessary strain on the consumed API and its database.
Hard to develop against: Ad hoc querying is easy, but it may not be possible to do in bulk.
No time dimension: There is no visibility of changes of data over time or deleted entities unless the service keeps all the history.

Enter Published Datasets

Good Scenarios

Published datasets combine some of the positive aspects of the above approaches, but they practically eliminate all the negative ones. To reiterate, published datasets are produced by the data owner following an agreement with the consumer or consumers. We’ll discuss the pros and cons of these below.

Pros

Domain encapsulation: The dataset producer can preprocess the data to hide domain complexities.
Contract and SLA: Follows a well-defined contract between the producer and the consumer of what the shape, correctness, and availability should be.
May be easy to build: Internally built tooling can help build datasets for many use cases with low effort.
Easy to consume: Structured data that can be cheaply and easily consumed.
Easy to develop with: Can be queried in an ad hoc manner for development and debugging purposes.
Clear ownership: The producer assumes the responsibility for maintaining the dataset and fixing issues.
Decoupling: The producer is free to evolve the schema of their databases and events, so long as they maintain the dataset interface the same, when it makes sense.

Cons

Potentially lower quality data: The producer doesn’t usually consume their own datasets — data problems are still possible.
Takes effort to build: Datasets go beyond what can be built using standard tooling (more on that in the next section), and they may still take effort to build. In the cases where a need for a dataset is blocking the consumer but the producer has other competing priorities, the consumer may take on the responsibility of building the dataset for the producer to own and maintain.

How We Build Datasets

There are two primary ways in which we build our datasets: from events, and from database snapshots. As discussed in the previous section, we have experience consuming both events and snapshots in our batch jobs. What we wanted to do differently was to (a) establish a coherent way of building these datasets, (b) create standard tooling to cover the most common cases, and (c) make it the responsibility of the data owner to provide a dataset that was clean, consistent, and timely.

The two methods of building datasets (from events and from snapshots) are not as dissimilar as they may seem. In fact, they share many of the same steps, which becomes apparent when looking at the figure below. Most datasets we produce at SoundCloud take the shape of dimension tables, and they capture the historical changes of an entity over time, as shown in the introduction section. We will assume we’re building such a dataset in the discussion below.

Building Datasets

Building a Dataset from Events

In case of events, the workflow is a single step, outlined below.

Roll up events on top of the previous dataset: This step reads the previous version of the published dataset, as well as the events emitted since the last dataset was published. It applies the new events on top of the old dataset, performing any necessary transformations, and writes a new dataset. This is a release candidate for the published dataset.

Note: If you have a very small number of events, you may choose to roll them up from the beginning of time every time you produce the dataset. This gives you the most flexibility and ensures the most recent code is used for rollup.

Building a Dataset from Database Snapshots

The workflow to produce a release candidate is covered below.

Dump a table to a snapshot: The tool reads from the database and writes this snapshot to a temporary table. This is a direct copy of the table, with no further processing. This is not the published dataset.
Transform the snapshot (optional): This is where you’d implement steps such as filtering/removing or renaming columns, or transforming values in the dataset. This is also not the published dataset.
Roll up snapshot on top of the previous dataset (optional): This step reads the transformed snapshot and the previous version of the published dataset and writes a new dataset. This step is required when building slowly changing dimension tables or when the entities deleted from the database must be preserved in the dataset. This is a release candidate for the published dataset.

This multistep procedure allows you to decouple your published dataset from the source table. In the past, we’ve had problems caused by coupling schemas in the database with schemas in the data warehouse, because changing the source becomes impossible without coordinating changes far downstream.

Whether the dataset is produced from snapshots or from events, the shape of the release candidate is the same, so the validation and publishing/promoting steps are the same.

Validate dataset: This is a very important step that’s often overlooked. Testing the code, e.g. through unit testing, lets us gain confidence in the logic of the various steps in dataset production. However, it is usually done using a contrived set of toy data. But when dealing with large amounts of data and moving parts, it is important to also test the data that is actually produced:
Does it contain the expected values in each field?
Does it satisfy the dimension table rules (see the Examples of Datasets at SoundCloud section above)?
Is the dataset a truthful representation of the upstream data? Even a simple comparison of the number of rows in the input and the output can go a long way.
Publish dataset: Usually the dataset is published to the “latest location” (thereby overwriting the previous version), as well as to a location corresponding to the time of publishing (for safekeeping).

At SoundCloud, we have standard tooling for building datasets from snapshots of MySQL databases. This greatly simplifies the creation of new datasets, and as a side effect, it promotes a consistent and common shape of datasets.

Considerations for Building a Dataset

Now that we’ve looked at the two ways of building a dataset, let’s look at the considerations that the dataset producer and consumers need to agree on that would inform the approach taken.

Do you need to record the previous versions of an entity?
Do you need information about deleted entities?
Can entities be deleted and reinstated? How would you like to handle gaps?
How often do you need the dataset updated with the newest data?
Is there a specific time by which an updated version of the dataset needs to be available?
For a dataset with a time dimension, what is the time resolution required? Hourly, every six hours, daily? Highest precision possible?
- Generally, the datasets based on snapshots suffer from the loss of precision.
- In some cases, it may be possible to keep the complete history of changes with exact timestamps in the database, though it makes the service more complex.
- Events can provide high time accuracy, though there may still be a time difference between the event timestamp and the time that the change became effective in the other systems.
For a dataset with a time dimension, how far back is the historical information required? Where do you get this historical data from (e.g. from events, from a different dataset, or inferred from the current value)?

Finally, for some datasets, it may be important to distinguish between two types of changes to an entity:

It may be a retrospective fix or a correction, in which case the change should be considered effective before it is made in the service/database; or
It may be a real brand-new change, effective from the time the change is made.

Conclusion

As more datasets were built at SoundCloud, common patterns led to standard semantics and standard tooling for the most common cases. Even in those cases where standard tooling did not apply, there was already an established pattern for what the dataset could look like and how it could be published.

For heavy consumers of datasets, that also meant that the burden of building and maintaining the datasets was now shared with those who knew most about the data, albeit not always those who had the most experience with building the datasets. But even that was changing quickly. The additional effort and the initial learning curve on the part of data owners led to the organization-wide benefits of shared, well-specified, and well-maintained datasets.