“To me, legacy code is simply code without tests.” — Michael Feathers
If untested code is legacy code, why aren’t we testing data pipelines or ETLs (extract, transform, load)? In particular, data pipelines built in SQL are rarely tested. However, as software engineers, we know all our code should be tested. So in this post, I’ll describe how we started testing SQL data pipelines at SoundCloud.
When I talked to our data scientists and data engineers, I heard some of them say, “Oh, we do have tests! After creating a dataset, and ideally before using the data, we run anomaly detection on it, or we check that the dataset size hasn’t changed by more than 10 percent compared to yesterday.” That’s not what I would call a test, though; I would call it a validation.
Validations are important and useful, but they’re not what I want to talk about here. Validations increase confidence in the data, while tests increase confidence in the code used to produce the data. And SQL is code. Ideally, validations run regularly at the end of each ETL that produces the data, while tests run as part of a continuous integration pipeline that publishes the code used to run the ETL.
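To make the distinction concrete, here is a minimal sketch of each. The function names, the query, and the 10 percent threshold are illustrative assumptions, not our production code:

```python
from google.cloud import bigquery


def run_query(sql):
    # Execute SQL against BigQuery and return the rows as plain tuples.
    return [tuple(row.values()) for row in bigquery.Client().query(sql).result()]


def validate_row_count(today_count, yesterday_count):
    # Validation: runs in production after the ETL and looks at the *data*,
    # e.g. the dataset size may not change by more than 10 percent day over day.
    return abs(today_count - yesterday_count) <= 0.10 * yesterday_count


def test_normalizes_territory_code():
    # Test: runs in CI and checks the *code* (the SQL) against a fixed input.
    sql = "SELECT UPPER(territory_code) AS territory_code FROM (SELECT 'us' AS territory_code)"
    assert run_query(sql) == [("US",)]
```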
Some of the advantages of having tests, and not only validations, are:

- Bugs in the transformation logic surface during development, not after bad data has already been produced (and possibly published).
- Changes to business logic are covered by regression tests, which makes refactorings and backfills far less scary.
- The feedback cycle is faster: a test on a small, fixed input doesn’t need a full ETL run on production-sized data.
My team, the Content Rights Team, used to be an almost pure backend team. We handle translating the music industry’s concepts into authorization logic for tracks on our apps, which can be complicated enough. However, since the shift toward data-producing teams owning datasets — which took place about three years ago — we’ve been responsible for providing published datasets with a clearly defined interface to consuming teams like the Insights and Reporting Team, content operations teams, and data scientists.
We’ve been using technology and best practices close to what we’re used to for live backend services in our datasets, including:

- Spark for the batch jobs themselves
- example-based tests for the transformation logic
- a continuous integration pipeline for building and publishing the code
However, Spark has its drawbacks. Even though the framework advertises its speed as “lightning-fast,” it’s still slow for the size of some of our datasets. This affects not only performance in production — which we could often but not always live with — but also the feedback cycle in development and the speed of backfills if business logic has to be changed retrospectively for months or even years of data.
Of course, we educated ourselves, optimized our code and configuration, and threw resources at the problem, but this cost time and money. For some of the datasets, we instead filter and only process the data most critical to the business (e.g. only export data for selected territories), or we use more complicated logic so that we need to process less data (e.g. rolling up incrementally or not writing the rows with the most frequent value).
Other teams were fighting the same problems, and the Insights and Reporting Team was the first to try moving to Google BigQuery. Inspired by their initial successes, they gradually left Spark behind and moved all of their batch jobs to SQL queries in BigQuery. But along with Spark, they also left tests and monitoring behind. In their case, good automated validations, business people verifying their results, and an advanced development environment kept confidence in their datasets high. And with BigQuery, they could backfill much more quickly when there was a bug.
In the meantime, the Data Platform Team had also introduced some monitoring for the timeliness and size of datasets. But still, SoundCloud didn’t have a single (fully) tested batch job written in SQL against BigQuery, and it also lacked best practices on how to test SQL queries. I searched some corners of the internet I knew of for examples of what other people and companies were doing, but I didn’t find a lot (I am sure there must be some out there; if you’ve encountered or written good examples, I’m interested in learning about them).
It struck me as a cultural problem: Testing didn’t seem to be a standard for production-ready data pipelines, and SQL didn’t seem to be considered code.
After I demoed our latest dataset, built in Spark, and mentioned in passing my frustration with both Spark and the lack of SQL testing (best) practices, Björn Pollex from Insights and Reporting — the team that was already using BigQuery for its datasets — approached me, and we started a collaboration to spike a fully tested dataset. His motivation was to add tests to his team’s untested ETLs, while mine was to be able to move our datasets without losing the tests.
We used our “self-allocated time” (SAT, 20 percent of engineers’ work time, usually Fridays), which is one of my favorite perks of working at SoundCloud, to collaborate on this project. As our pilot, we chose the last transformation job of our track authorization dataset (called the “projector”), together with its validation step, which was also written in Spark. It was a good candidate for several reasons: we already had example-based test cases for this job in Spark; BigQuery was where the data was consumed anyway; the track authorization dataset is one of the datasets for which we don’t expose all data for performance reasons, so we had a reason to move it; and by migrating an existing dataset, we made sure we’d be able to compare the results. It was also small enough to tackle in our SAT, but complex enough to need tests.
Not all of the challenges were technical.
Organizationally, we had to add our tests to a continuous integration pipeline owned by another team and used throughout the company. Fortunately, the owners appreciated the initiative and helped us.
I’ve already touched on the cultural point that testing SQL is not common and not many examples exist. We shared our proof of concept project at an internal Tech Open House and hope to contribute a tiny bit to a cultural shift through this blog post.
The technical challenges weren’t necessarily hard; there were just several of them, and we had to do something about each one. I’m not claiming that the solutions we came up with in this first iteration are perfect or even good — but they’re a starting point.
For example, some of our queries depend on small amounts of fixed business data, such as per-territory launch dates, which we keep directly in the query as a CTE:

```sql
WITH launch_dates AS (
  SELECT territory_code, TIMESTAMP(launched) AS launched
  FROM (SELECT
          'US' AS territory_code,
          DATE('2000-01-01') AS launched
        UNION ALL
        SELECT
          'DE' AS territory_code,
          DATE('2050-01-01') AS launched) [...]
```
In the tests, we use Python dataclasses for the data schema, with `as_sql` and `from_row` methods for better readability, e.g.:

```python
from dataclasses import dataclass


@dataclass(eq=True, frozen=True)
class TrackPolicyInfo:
    track_urn: str
    content_policy: str

    @classmethod
    def from_row(cls, row):
        # Build an instance from a BigQuery result row.
        return cls(track_urn=row.track_urn, content_policy=row.content_policy)

    @property
    def as_sql(self):
        # Render this instance as an inline SQL relation.
        return f"""
        (
          SELECT
            '{self.track_urn}' AS track_urn,
            '{self.content_policy}' AS content_policy
        )
        """
```
Our queries contain Airflow template variables (e.g. `{{ execution_date }}` or `{{ ds_nodash }}`) so that they’re runnable in Airflow. As we run the tests independently of Airflow, this part of the queries is unfortunately not yet tested. As a substitute, the team that maintains our Airflow now supports running Airflow locally, so we can at least manually check the queries after rendering (i.e. expansion of the template).

The same templating turned out to be useful for testing, too: by parameterizing the input tables, the CTE to deduplicate, and the final SELECT, with defaults that reproduce the production query, we can exercise individual CTEs in isolation:

```sql
WITH raw_rollup AS (SELECT * FROM {{ params.rollup_data }}{{ ds_nodash }}),

-- replace monetizing policies in non-monetizing territories and split intervals
[...]
policies AS (...),

-- now deduplicate / merge consecutive intervals with same values
windowed_policies AS (
  SELECT [...] FROM {{ params.policies_to_deduplicate|default('policies') }}),
[...]
deduplicated_policies AS (...)

SELECT * FROM {{ params.selected_cte|default('deduplicated_policies') }}
```
Now we can test the CTE `policies` by setting `params={'rollup_data': example.input_query, 'selected_cte': 'policies'}`, and `windowed_policies` by setting `params={'rollup_data': empty_rollup_data, 'policies_to_deduplicate': example.input_query}`, with the right selection of `example` and `empty_rollup_data`.
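Putting it all together, a test renders the templated query and runs the result against BigQuery. Below is a simplified sketch of such a test: PROJECTOR_QUERY and the expected rows are illustrative, plain jinja2 stands in for Airflow’s template rendering, and to_input_query is the helper sketched above.

```python
import jinja2
from google.cloud import bigquery


def run_cte(sql_template, params):
    # Render the Airflow-style template outside of Airflow. ds_nodash is
    # blanked because the inline input relation replaces the dated table.
    sql = jinja2.Template(sql_template).render(params=params, ds_nodash="")
    rows = bigquery.Client().query(sql).result()
    return {TrackPolicyInfo.from_row(row) for row in rows}


def test_policies_keeps_clean_input_unchanged():
    # PROJECTOR_QUERY stands in for the templated SQL shown above.
    examples = [
        TrackPolicyInfo(track_urn="soundcloud:tracks:1", content_policy="allow"),
        TrackPolicyInfo(track_urn="soundcloud:tracks:2", content_policy="monetize"),
    ]
    actual = run_cte(
        PROJECTOR_QUERY,
        params={"rollup_data": to_input_query(examples),
                "selected_cte": "policies"},
    )
    # Illustrative assumption: rows that need no policy replacement come
    # out of the policies CTE unchanged.
    assert actual == set(examples)
```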
Although this approach requires some fiddling — e.g. mocking the input of a single CTE in order to test it in isolation — and can certainly be improved upon, it was great to develop an SQL query using TDD, to have regression tests, and to gain confidence through evidence. When I finally deleted the old Spark code, it was a net delete of almost 1,700 lines of code; the two resulting SQL queries have 155 and 81 lines of SQL, respectively; and the new tests have about 1,231 lines of Python code. A substantial part of this is boilerplate that could be extracted to a library.
I’m looking forward to getting rid of the limitations in size and development speed that Spark imposed on us, and I’m excited to see how people inside and outside of our company are going to evolve testing of SQL, especially in BigQuery.