PageRank in Spark

Jan 24th, 2018by Josh Devins3 minutes read

Jan 24th, 20183 minutes read

SoundCloud consists of hundreds of millions of tracks, people, albums, and playlists, and navigating this vast collection of music and personalities poses a large challenge, particularly with so many covers, remixes, and original works all in one place.

For search on SoundCloud, one of the ways we approach this problem is by using our own version of the PageRank algorithm, which we affectionately refer to as DiscoRank (Get it? Disco as in discovery and Saturday Night Fever?!).

The job of PageRank is to help rank search results from a query like finding all Go+ tracks called “royals.” At first glance, this task might seem trivial. The first result is, and should indeed be, Lorde’s original song, “Royals.” However, there are plenty of covers and remixes of this track, which leaves us with questions like: Which ones should we show at the top and in which order? What about other tracks in our catalog that have the word “royals” in them? Where should they be in our search results list?

Our use of PageRank started in the summer of 2012 when we introduced SoundCloud Next (now our current website). In the years since, we have grown exponentially, and our implementation of PageRank has had to scale with that enormous growth in several ways — from an in-memory Java implementation, to being distributed in Hadoop using the vanilla MapReduce API, and more recently, by using Scalding. In the summer of 2017, we made the switch to our latest version of PageRank, which runs in Spark, and now we’re open sourcing our implementation as a reusable library!

But before implementing yet another version ourselves, we first tried out the built-in PageRank from Spark. Unfortunately, we quickly realized it didn’t support the graph we were using, as we have nodes with no out-edges, some nodes with no in-edges, and edge weights that are not uniform.

After this first try, we decided to implement PageRank ourselves and tried a variety of the Spark APIs, including GraphX. Although our GraphX implementation was working in small-scale test cases, when we ran at full-scale, we encountered several bugs that are currently unresolved. Based on this experience, we decided that a simple approach using the RDD API would suit us best. We spent about an equal amount of time on implementation and on performance evaluation and tuning, but in the end, Spark was a perfect fit for this job, and we were able to utilize many of the features Spark is known for, such as in-memory data storage, easy parallel computation, and the implementation of iterative learning algorithms.

As a result of this work, we have made improvements to the performance, scalability, and general flexibility of our PageRank implementation. And because it’s able to perform one iteration of PageRank in approximately three to five minutes on a graph of more than 700 million nodes and 15 billion edges, this new implementation has enabled us to update ranks in search more frequently throughout the day, in addition to preparing us for continued growth over the coming years.

With open sourcing our implementation of PageRank, we hope you can benefit from the work we’ve put into this for your own use cases, from search to recommendations. Go check out the project, try it out, open issues, and get involved!