Testing SQL for BigQuery

“To me, legacy code is simply code without tests.” — Michael Feathers

If untested code is legacy code, why aren’t we testing data pipelines or ETLs (extract, transform, load)? In particular, data pipelines built in SQL are rarely tested. However, as software engineers, we know all our code should be tested. So in this post, I’ll describe how we started testing SQL data pipelines at SoundCloud.

When I talked to our data scientists and data engineers, I heard some of them say, “Oh, we do have tests! After creating a dataset, and ideally before using the data, we run anomaly detection on it, or we check that the dataset size hasn’t changed by more than 10 percent compared to yesterday, etc.” That’s not what I would call a test, though; I would call that a validation.

Validations are important and useful, but they’re not what I want to talk about here. Validations increase confidence in data, while tests increase confidence in the code used to produce that data. And SQL is code. Ideally, validations run regularly at the end of each ETL run that produces the data, while tests run as part of a continuous integration pipeline to publish the code that will be used to run the ETL.
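To make the distinction concrete, a validation like the size check quoted above might look like the following sketch. The table name, the ds column, and the 10 percent threshold are made up for illustration.

from google.cloud import bigquery

# A validation runs after the ETL and checks the data, not the code.
# Table name, column, and threshold are hypothetical.
def validate_dataset_size(client: bigquery.Client) -> None:
    row = next(iter(client.query("""
        SELECT
          COUNTIF(ds = CURRENT_DATE())                           AS today,
          COUNTIF(ds = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AS yesterday
        FROM `my_project.my_dataset.track_policies`
    """).result()))
    if abs(row.today - row.yesterday) > 0.1 * row.yesterday:
        raise ValueError(
            f"Dataset size changed by more than 10 percent: "
            f"{row.yesterday} -> {row.today}")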

Some of the advantages of having tests and not only validations are:

  • Some bugs can’t be detected using validations alone. For example, if your query transforms some input data and then aggregates it, you may not be able to detect bugs in the transformation purely by looking at the aggregated query result. Hence you need to test the transformation code directly.
  • Depending on how long processing all the data takes, tests provide a quicker feedback loop in development than validations do.
  • Validations are code too, which means they also need tests.
  • Improved development experience through quick test-driven development (TDD) feedback loops.

The Backstory

My team, the Content Rights Team, used to be an almost pure backend team. We handle translating the music industry’s concepts into authorization logic for tracks on our apps, which can be complicated enough. However, since the shift toward data-producing teams owning datasets — which took place about three years ago — we’ve been responsible for providing published datasets with a clearly defined interface to consuming teams like the Insights and Reporting Team, content operations teams, and data scientists.

For our datasets, we’ve been using technology and best practices close to what we’re used to for live backend services, including:

  • Spark to conveniently code in the same language as our backend service, Scala
  • Prometheus to monitor, using custom metrics pushed through the Pushgateway (these metrics give us insight through alerting and visualization)
  • Tests built like the tests for our live backend services

However, Spark has its drawbacks. Even though the framework advertises its speed as “lightning-fast,” it’s still slow for the size of some of our datasets. This affects not only performance in production — which we could often but not always live with — but also the feedback cycle in development and the speed of backfills if business logic has to be changed retrospectively for months or even years of data.

Of course, we educated ourselves, optimized our code and configuration, and threw resources at the problem, but this cost time and money. For some of the datasets, we instead filter and only process the data most critical to the business (e.g. only export data for selected territories), or we use more complicated logic so that we need to process less data (e.g. rolling up incrementally or not writing the rows with the most frequent value).

Other teams were fighting the same problems, too, and the Insights and Reporting Team tried moving to Google BigQuery first. Inspired by their initial successes, they gradually left Spark behind and moved all of their batch jobs to SQL queries in BigQuery. But with Spark, they also left tests and monitoring behind. In their case, they had good automated validations, business people verifying their results, and an advanced development environment to increase the confidence in their datasets. Through BigQuery, they also had the possibility to backfill much more quickly when there was a bug.

In the meantime, the Data Platform Team had also introduced some monitoring for the timeliness and size of datasets. But still, SoundCloud didn’t have a single (fully) tested batch job written in SQL against BigQuery, and it also lacked best practices on how to test SQL queries. I searched some corners of the internet I knew of for examples of what other people and companies were doing, but I didn’t find a lot (I am sure there must be some out there; if you’ve encountered or written good examples, I’m interested in learning about them).

It struck me as a cultural problem: Testing didn’t seem to be a standard for production-ready data pipelines, and SQL didn’t seem to be considered code.

The Project

After I demoed our latest dataset we had built in Spark and mentioned my frustration about both Spark and the lack of SQL testing (best) practices in passing, Björn Pollex from Insights and Reporting — the team that was already using BigQuery for its datasets — approached me, and we started a collaboration to spike a fully tested dataset. His motivation was to add tests to his team’s untested ETLs, while mine was to possibly move our datasets without losing the tests.

We used our “self-allocated time” (SAT, 20 percent of engineers’ work time, usually Fridays), which is one of my favorite perks of working at SoundCloud, to collaborate on this project. As the dataset to migrate, we chose the last transformation job of our track authorization dataset (called the “projector”), together with its validation step, which was also written in Spark. We already had example-based test cases for this job in Spark; it was consumed from BigQuery anyway; the track authorization dataset is one of the datasets for which we don’t expose all data for performance reasons, so we had a reason to move it; and by migrating an existing dataset, we made sure we’d be able to compare the results. Also, it was small enough to tackle in our SAT, but complex enough to need tests.

Challenges and Solutions

Not all of the challenges were technical.

Organizationally, we had to add our tests to a continuous integration pipeline owned by another team and used throughout the company. Fortunately, the owners appreciated the initiative and helped us.

I’ve already touched on the cultural point that testing SQL is not common and not many examples exist. We shared our proof-of-concept project at an internal Tech Open House and hope to contribute a tiny bit to a cultural shift through this blog post.

The technical challenges weren’t necessarily hard; there were just several, and we had to do something about them. I don’t claim whatsoever that the solutions we came up with in this first iteration are perfect or even good — but they’re a starting point.

  • The tests had to be run in BigQuery, for which there is no containerized environment available (unlike e.g. MySQL, which can be tested against Docker images). We at least mitigated security concerns by not giving the test account access to any tables. The next point will show how we could do this.
  • Test execution performance had been the biggest blocker of previous attempts to add tests against BigQuery. BigQuery is performant for, well, big queries, but it has a high constant overhead, especially for loading data into and out of tables. Even a single query takes two to three seconds, which is still a lot, but not as much as loading a table first. Therefore, our tests don’t access any tables at all, and we inject all test data as literals in our queries. An example would be:
WITH launch_dates AS (
  SELECT territory_code, TIMESTAMP(launched) AS launched
  FROM (SELECT
          'US'               AS territory_code,
          DATE('2000-01-01') AS launched
        UNION ALL
        SELECT
          'DE'               AS territory_code,
          DATE('2050-01-01') AS launched) [...]
  • SQL itself does not have language support or frameworks for testing. OK, one could argue that SQL’s ability to throw exceptions is all one needs, but that’s pretty basic. So we decided to use Python and pytest. Extra benefits we get from this are that it’s the same language as our scheduler (Airflow), it has good failure reporting, and it has easy execution of e.g. a single test file and dependency injection through fixtures.
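For illustration, a minimal test in this style could look like the following sketch (simplified, not our production code):

import pytest
from google.cloud import bigquery

@pytest.fixture(scope="session")
def client():
    # One client for the whole test session; the account behind it has no
    # access to any tables, as described above.
    return bigquery.Client()

def test_launch_dates_become_timestamps(client):
    # All input data is injected as literals; no tables are read.
    query = """
        WITH launch_dates AS (
          SELECT territory_code, TIMESTAMP(launched) AS launched
          FROM (SELECT 'US' AS territory_code, DATE('2000-01-01') AS launched)
        )
        SELECT * FROM launch_dates
    """
    rows = list(client.query(query).result())
    assert len(rows) == 1
    assert rows[0].territory_code == 'US'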
  • Still, SQL has no language support for queries to be parametrized in the table. This is necessary for switching between the literals as tables in tests and the actual table name in production. We use the templating engine Jinja, which is conveniently also used in Airflow.
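For example, with jinja2, switching between a literal subquery in tests and the real table name in production could look roughly like this (the template and names are illustrative):

from jinja2 import Template

template = Template("SELECT * FROM {{ params.rollup_data }}{{ ds_nodash }}")

# Production: render the real table prefix and date suffix, as Airflow would.
production_sql = template.render(
    params={'rollup_data': 'my_project.my_dataset.rollup_'},
    ds_nodash='20200101')

# Test: inject a literal subquery instead of a table, and no date suffix.
test_sql = template.render(
    params={'rollup_data': "(SELECT 'US' AS territory_code)"},
    ds_nodash='')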
  • Already for Spark, it’s a challenge to express test data and assertions in a simple-to-understand way — tests are for reading. We tried our best, using Python for abstraction, descriptive names for the tests, and extracting common concerns (e.g. connecting to BigQuery and rendering templates) into pytest fixtures. We created dataclasses for the data schema with as_sql and from_row methods for better readability, e.g.:
@dataclass(eq=True, frozen=True)
class TrackPolicyInfo:
    track_urn: str
    content_policy: str

    @classmethod
    def from_row(cls, row):
        return cls(track_urn=row.track_urn, content_policy=row.content_policy)

    @property
    def as_sql(self):
        return f"""
          (
            SELECT
            '{self.track_urn}'      AS track_urn,
            '{self.content_policy}' AS content_policy
          )
            """
  • The production queries use Airflow templating (e.g. {{ execution_date }} or {{ ds_nodash }} ) to be runnable in Airflow. As we run the tests independently from Airflow, this part of the queries is unfortunately not yet tested. As a substitute, the team that maintains our Airflow now supports running Airflow locally, so we can at least manually check the queries after rendering (i.e. expansion of the template).
  • SQL is not as decomposable as we’re used to for other high-level programming languages. Thus writing unit tests isn’t trivial. We’ve approached this by using common table expressions (CTEs aka “WITH clauses”) extensively and adding templating with defaults to be able to only test single CTEs. Example:
WITH raw_rollup AS (SELECT * FROM {{ params.rollup_data }}{{ ds_nodash }}),

  -- replace monetizing policies in non-monetizing territories and split intervals
  [...]
  policies AS (...),

  -- now deduplicate / merge consecutive intervals with same values
  windowed_policies AS (
    SELECT [...] FROM {{ params.policies_to_deduplicate|default('policies') }}),
  [...]
  deduplicated_policies AS (...)
  SELECT * FROM {{ params.selected_cte|default('deduplicated_policies') }}

Now we can test the CTE

  • policies by setting params={'rollup_data': example.input_query, 'selected_cte': 'policies'} and
  • windowed_policies by setting params={'rollup_data': empty_rollup_data, 'policies_to_deduplicate': example.input_query} with the right selection of example and empty_rollup_data.
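Putting the pieces together, a test for the policies CTE might look roughly like this. Here, render is a hypothetical pytest fixture that loads the query file and renders it with Jinja, and the expected rows are purely illustrative:

def test_policies_cte(client, render):
    # 'render' (sketch) loads projector.sql and renders it with Jinja.
    sql = render(
        'projector.sql',
        params={
            'rollup_data': as_input_query([
                TrackPolicyInfo(track_urn='soundcloud:tracks:1',
                                content_policy='monetize'),
            ]),
            'selected_cte': 'policies',
        },
        ds_nodash='')
    actual = {TrackPolicyInfo.from_row(row) for row in client.query(sql).result()}
    # The expected rows encode the business rules of the CTE; these values
    # are illustrative only.
    assert actual == {
        TrackPolicyInfo(track_urn='soundcloud:tracks:1',
                        content_policy='monetize'),
    }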

Conclusion

Although this approach requires some fiddling — e.g. mocking the input when testing a single CTE — and can certainly be improved upon, it was great to develop an SQL query using TDD, to have regression tests, and to gain confidence through evidence. When I finally deleted the old Spark code, it was a net deletion of almost 1,700 lines of code; the two resulting SQL queries have 155 and 81 lines of SQL, respectively; and the new tests comprise 1,231 lines of Python code. A substantial part of that is boilerplate that could be extracted into a library.

I’m looking forward to getting rid of the limitations in size and development speed that Spark imposed on us, and I’m excited to see how people inside and outside of our company are going to evolve testing of SQL, especially in BigQuery.