SoundCloud for Developers

Discover, connect and build

Backstage Blog

You're browsing posts in the category Search and Discovery

  • April 27th, 2014 | Contests, Search and Discovery | Irrational Fun: Find Yourself at Berlin Buzzwords | By Erik Michaels-Ober

    We were counting down the days until Berlin Buzzwords on May 25, when we realised that it would be great if you came too! With that in mind, we've created a contest. One lucky winner will receive a free ticket to Berlin Buzzwords, including travel expenses and accommodation. Here are the details about how to apply.


    The ratio of a circle's circumference to its diameter, represented by the Greek letter π, is an irrational number: its decimal expansion never terminates or repeats. Your goal is to find the SoundCloud logo in π.

    We have provided a 14-pixel-by-6-pixel greyscale reference image, reproduced below as a digit bitmap.

    Each of the 10 shades of grey in this image can be mapped to a number:

    RGB Hex   Number
    ffffff    0
    f0f0f0    1
    ebebeb    2
    d0d0d0    3
    c1c1c1    4
    a8a8a8    5
    878787    6
    535353    7
    333333    8
    000000    9

    Applying this mapping to the reference image produces the following 84-digit bitmap:

    0 0 0 0 0 3 3 9 9 9 6 0 0 0
    0 0 0 3 4 9 5 9 9 9 9 4 0 0
    0 1 2 7 5 9 5 9 9 9 9 7 1 0
    3 9 5 9 5 9 5 9 9 9 9 9 9 3
    6 9 5 9 5 9 5 9 9 9 9 9 9 6
    1 8 5 9 5 9 5 9 9 9 9 9 8 1
    

    Your challenge is to write an algorithm that finds the 10 sequences that most closely approximate the reference image. Each result should include the sequence and its position (after the decimal point) in π.
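    One straightforward (if unoptimized) approach is a sliding window: flatten the reference bitmap into its 84 digits, slide an 84-digit window over the digits of π, score each window by the sum of absolute per-digit differences, and keep the 10 best. The sketch below assumes the digits are available as one long string with the leading "3." stripped; it is illustrative only, and a serious entry would need incremental scoring or native code to get through a billion digits in reasonable time.

    ```python
    import heapq

    # The 14x6 reference bitmap from above, flattened row by row into 84 digits.
    REFERENCE = [int(d) for d in (
        "00000339996000"
        "00034959999400"
        "01275959999710"
        "39595959999993"
        "69595959999996"
        "18595959999981"
    )]

    def best_matches(pi_digits, k=10):
        """Return the k closest 84-digit windows as (distance, offset, window),
        where offset counts digits after the decimal point (0-based)."""
        n = len(REFERENCE)
        heap = []  # stores (-distance, offset, window); heap[0] is the worst kept match
        for offset in range(len(pi_digits) - n + 1):
            window = pi_digits[offset:offset + n]
            distance = sum(abs(int(c) - r) for c, r in zip(window, REFERENCE))
            entry = (-distance, offset, window)
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                heapq.heapreplace(heap, entry)
        return sorted((-d, off, win) for d, off, win in heap)

    # Usage (file name is hypothetical):
    #   digits = open("pi-1e9-digits.txt").read().strip()
    #   for distance, offset, window in best_matches(digits):
    #       print(distance, offset, window)
    ```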

    Here is an example result set:

    Rank  Sequence (84 digits, shown as two 42-digit halves)  Offset
    1 082201638940102393659252475011295776958920 282336494898427768768699465405437965994582 297,640,119
    2 310015049916890341096198545241549627773525 291969856827158758552799587406476458977970 792,987,187
    3 212021479960265330123798820231693768599316 147634474729776147987653958935291919768971 972,165,010
    4 000053743536020032496985553181128909983810 344939134894807349584729687746183109884672 981,165,566
    5 142204297650171312983445842322141909755200 787408739757838589593329762648444919386594 789,652,974
    6 300011495970664010077917573663456957498854 662995598898697947677549686339433357728071 197,342,990
    7 313560984264300011495970664010077917573663 456957498854662995598898697947677549686339 197,342,978
    8 870402479996214234001557832923050859979903 649788689695954439755933903629798966788984 75,975,342
    9 208503759370245135490877462200175839969750 453766432680245845143285985661373828688970 343,577,393
    10 300544295771128716836756973814286978997516 282647269986574856578306678421894769876141 950,462,734

    Entries will be judged against the following criteria:

    • Code quality
    • Runtime performance
    • Visual closeness to the reference image (subjective)


    You should run your algorithm against the following data set (approximately 1 GB), which contains the first 1,000,000,000 (one billion) digits of π.

    Please send your submission, including a link to the source-code repository, to e@soundcloud.com on or before May 5, 2014, 23:59 UTC. Your repository should contain a README with instructions on how to set up and run your code. Entries are subject to the terms and conditions.


    UPDATE

    Congratulations to Tomasz Pewiński, who submitted the winning entry.

    We’d also like to acknowledge the excellent entries by Dan Oved and Martin Kühl. Thanks to everyone else who participated in the contest. We hope it was fun!

  • December 4th, 2012 | Search and Discovery, Architecture | Architecture behind our new Search and Explore experience | By Petar Djekic

    Search is front-and-center in the new SoundCloud, key to the consumer experience. We’ve made the search box one of the first things you see, and beefed it up with suggestions that allow you to jump directly to people, sounds, groups, and sets of interest. We’ve also added a brand-new Explore section that guides you through the huge and dynamic landscape of sounds on SoundCloud. Behind the scenes, we’ve completely overhauled our search infrastructure, which helps us provide more relevant results, scale with ease, and experiment quickly with new features and models.

    In the beginning

    In SoundCloud’s startup days, when we were growing our product at super speed, we spent just two days implementing a straightforward search feature, integrating it directly into our main app. We used Apache Solr, as was the fashion at the time, and built a master-slave cluster with the common semantics: writes to the master, reads from the slaves. Besides a separate component for processing deletes, our indexing logic was pull-based: the master Solr instance used a Solr DataImportHandler to poll our database for changes, and the slaves polled the master.

    At first, this worked well. But as time went on, our data grew, our entities became more complex, and our business rules multiplied. We began to see problems.

    The problems

    • The index was only getting updated on the read slaves about every fifteen minutes. Being real-time is crucial in the world of sound, as it’s important that our users’ sounds are searchable immediately after upload.
    • A complete re-index of the master node from the database took up to 24 hours, which turned a simple schema change for a new feature or bugfix into a Sisyphean task.
    • Worse, during those complete-reindex periods, we couldn’t do incremental indexing because of limitations in the DataImportHandler. That meant the search index was essentially frozen in time: not a great user experience.
    • Because the search infrastructure was tightly coupled to the main application, low-level Solr and Lucene knowledge leaked everywhere. If we wanted to make a slight change, such as to a filter, we had to touch 25 places in the main application's codebase.
    • Because the application interacted directly with the master, if that single point of failure went down, we lost information about updates, and our only reliable recovery was a complete reindex, which took 24 hours.
    • Since Solr uses segment or bulk replication, our network was flooded when the slaves pulled updates from the master, which caused additional operational issues.
    • Solr felt like it was built in a different age, for a different use-case. When we had performance problems, it was opaque and difficult to diagnose.

    In early 2012, we knew we couldn’t go forward with many new features and enhancements on the existing infrastructure. We formed a new team and decided we should rebuild the infrastructure from scratch, learning from past lessons. In that sense, the Solr setup was an asset: it could continue to serve site traffic, while we were free to build and experiment with a parallel universe of search. Green-field development, unburdened by legacy constraints: every engineer’s dream.

    The goals

    We had a few explicit goals with the new infrastructure.

    • Velocity: we want high velocity when working on schema-affecting bugs and features, so a complete reindex should take on the order of an hour.
    • Reliability: we want to grow to the next order of magnitude and beyond, so scaling the infrastructure, both horizontally and vertically, should require minimum effort.
    • Maintainability: we don’t want to leak implementation details beyond our borders, so we should strictly control access through APIs that we design and control.

    After a survey of the state of the art in search, we decided to abandon Solr in favor of ElasticSearch, which would theoretically address our reliability concerns in their entirety. That left our velocity and maintainability concerns, which came down to how we integrated with the rest of the SoundCloud universe. To the whiteboard!

    The plan

    On the read side, everything was pretty straightforward. We’d already built a simple API service between Solr and the main application, but had hit some roadblocks when integrating it. We pivoted and rewrote the backend of that API to make ElasticSearch queries, instead: an advantage of a lightweight service-oriented architecture is that refactoring services in this way is fast and satisfying.
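    As a rough sketch of what that read path does after the rewrite, here is the shape of a query the API backend might send to ElasticSearch. The index name, field, and single match query are illustrative assumptions, not the actual SoundCloud schema:

    ```python
    import requests

    def search_sounds(query, es="http://localhost:9200", index="sounds", size=10):
        """Forward a user query to ElasticSearch and return the matching documents.
        (Hypothetical index name and mapping; real queries are far richer.)"""
        body = {"query": {"match": {"title": query}}, "size": size}
        resp = requests.post("{0}/{1}/_search".format(es, index), json=body)
        resp.raise_for_status()
        return [hit["_source"] for hit in resp.json()["hits"]["hits"]]
    ```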

    On the write side, conceptually at least, we had a straightforward task. Search is relatively decoupled from other pieces of business and application logic. To maintain a consistent index, we only need an authoritative source of events, i.e., entity creates, updates, and deletes, and some method of enriching those events with their searchable metadata. We had that in the form of our message broker. Every event that search cared about was being broadcast on a well-defined set of exchanges: all we had to do was listen in.

    But depending on broker events to materialize a search index is dangerous. We would be susceptible to schema changes: the producers of those events could change the message content, and thereby break search without knowing, potentially in very subtle ways. The simplest, most robust solution was to ignore the content of the messages and perform our own hydration. That carries a runtime cost, but it would let us exercise better control over our domain of responsibility. We believed the benefits would outweigh the costs.
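    A sketch of that decision in code, assuming an AMQP broker (RabbitMQ via the pika client), a hypothetical sound-events exchange, and hypothetical load_entity / index_document helpers: the message body is trusted only for the entity's type, id, and timestamp, and everything searchable is re-read from the database.

    ```python
    import json
    import pika

    def load_entity(entity_type, entity_id):
        """Hypothetical hydration step: re-read the entity and its searchable
        metadata from the database (our ground truth). Returns None if deleted."""
        raise NotImplementedError

    def index_document(entity_type, entity_id, doc, event_timestamp):
        """Hypothetical indexing step: write the document to ElasticSearch, using
        the event timestamp as an external version (see the next sketch)."""
        raise NotImplementedError

    def on_message(channel, method, properties, body):
        event = json.loads(body)                       # trust only type, id, timestamp
        doc = load_entity(event["type"], event["id"])  # hydrate from the database
        if doc is None:
            pass  # entity no longer exists; a real consumer would issue a versioned delete
        else:
            index_document(event["type"], event["id"], doc, event["timestamp"])
        channel.basic_ack(delivery_tag=method.delivery_tag)

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="search-indexer", durable=True)
    channel.queue_bind(queue="search-indexer", exchange="sound-events", routing_key="#")
    channel.basic_consume(queue="search-indexer", on_message_callback=on_message)
    channel.start_consuming()
    ```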

    Using the broker as our event source, and the database as our ground truth, we sketched an ingestion pipeline for the common case. Soon, we realized we had to deal with the problem of synchronization: what should we do if an event hadn’t yet been propagated to the database slave we were querying? It turned out that if we used the timestamp of the event as the “external” version of the document in ElasticSearch, we could lean on ElasticSearch’s consistency guarantees to detect and resolve synchronization problems.
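    In ElasticSearch terms, that means indexing each document with version_type=external and the event timestamp as the version, so a stale or out-of-order event can never overwrite newer data. A minimal sketch (using the modern _doc endpoint and a hypothetical index name):

    ```python
    import requests

    def index_with_event_timestamp(doc_id, doc, event_ts_millis,
                                   es="http://localhost:9200", index="sounds"):
        """Index a document using the event timestamp as an external version.
        ElasticSearch applies the write only if this version is higher than the
        stored one; otherwise it answers 409 Conflict, which we can safely ignore."""
        resp = requests.put(
            "{0}/{1}/_doc/{2}".format(es, index, doc_id),
            params={"version": event_ts_millis, "version_type": "external"},
            json=doc,
        )
        return resp.status_code  # 200/201 on success, 409 if a newer version exists
    ```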

    Once it hit the whiteboard, it became clear we could re-use the common-case pipeline for the bulk-index workflow; we just had to synthesize creation events for every entity in the database. But could we make it fast enough? It seemed at least possible.
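    Feeding the bulk path is then just a matter of walking the database and emitting synthetic create events into the same pipeline. A sketch, with a hypothetical db handle and table layout:

    ```python
    def synthesize_create_events(db, entity_type, batch_size=1000):
        """Yield a synthetic 'create' event for every row of an entity table, in
        id order, so a full reindex can reuse the live ingestion pipeline.
        (The db.fetch interface and table layout are illustrative.)"""
        last_id = 0
        while True:
            rows = db.fetch(
                "SELECT id, updated_at FROM {0} WHERE id > %s ORDER BY id LIMIT %s".format(entity_type),
                (last_id, batch_size),
            )
            if not rows:
                return
            for row in rows:
                yield {"type": entity_type, "id": row["id"],
                       "action": "create", "timestamp": row["updated_at"]}
            last_id = rows[-1]["id"]
    ```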

    Implementation and optimization began. Over a couple of weeks, we iterated over several designs for the indexing service. We eked out a bit more performance in each round, and in the end, we got where we wanted: indexing our entire corpus against a live cluster took around 2 hours, we had no coupling in the application, and ElasticSearch itself gave us a path for scaling.

    The benefits

    We rolled out the new infrastructure in a dark launch on our beta site. Positive feedback on the time-to-searchability was immediate: newly-posted sounds were discoverable in about 3 seconds (post some sounds and try it out for yourself!). But the true tests came when we started implementing features that had been sitting in our backlog. We would start support for a new feature in the morning, make schema changes into a new index over lunch, and run A/B tests in the afternoon. If all lights were green, we could make an immediate live swap. And if we had any problems, we could roll back to the previous index just as fast.
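    The post doesn't spell out how the live swap works, but a common ElasticSearch pattern for it is to query through an index alias and repoint that alias atomically; rolling back is the same call with the arguments reversed. A sketch with hypothetical index names:

    ```python
    import requests

    def swap_alias(alias, old_index, new_index, es="http://localhost:9200"):
        """Atomically repoint `alias` from `old_index` to `new_index` so readers
        switch indices without downtime."""
        actions = {"actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]}
        return requests.post("{0}/_aliases".format(es), json=actions).status_code

    # e.g. swap_alias("sounds", "sounds_v1", "sounds_v2") to go live,
    #      swap_alias("sounds", "sounds_v2", "sounds_v1") to roll back.
    ```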

    There’s no onerous, manual work involved in bootstrapping or maintaining nodes: everything is set up with ElasticSearch’s REST API. Our index definitions are specified in flat JSON files. And we have a single dashboard with all the metrics we need to know how search is performing, both in terms of load and search quality.
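    For example, creating a new index from one of those flat JSON definition files is a single REST call (the index name and file path below are illustrative):

    ```python
    import json
    import requests

    def create_index_from_file(name, path, es="http://localhost:9200"):
        """Create an index whose settings and mappings live in a flat JSON file;
        no manual node setup is needed beyond the REST API."""
        with open(path) as f:
            definition = json.load(f)
        return requests.put("{0}/{1}".format(es, name), json=definition).status_code

    # e.g. create_index_from_file("sounds_v2", "indices/sounds.json")
    ```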

    In short, the new infrastructure has been an unqualified success. Our velocity is through the roof, we have a clear path for orders-of-magnitude growth, and we have a stable platform for a long time to come.

    Enter DiscoRank

    It’s not the number of degrees of separation between you and John Travolta: DiscoRank, short for Discovery Rank, is our modification of the PageRank algorithm to provide more relevant search results. We match your search against our corpus and use DiscoRank to rank the results.

    The DiscoRank algorithm involves creating a graph of searchable entities, in which each activity on SoundCloud is an edge connecting two nodes. It then iteratively computes the ranking of each node using the weighted graph of entities and activities. Our graph has millions and millions of nodes and far more edges, so we made a number of optimizations to adapt PageRank to our use case: keeping the graph in memory and recalculating the DiscoRank quickly, using results from previous runs of the algorithm as priors. We keep versioned copies of the DiscoRank so that we can swap between them when testing things out.
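    The exact graph, weights, and optimizations behind DiscoRank aren't public, but the core iteration is plain weighted PageRank, and warm-starting from the previous run's ranks is what makes frequent recalculation cheap. A toy sketch:

    ```python
    def pagerank(edges, num_nodes, damping=0.85, iterations=20, prior=None):
        """Weighted PageRank by power iteration over an edge list of
        (source, destination, weight) triples. Passing the previous run's ranks
        as `prior` warm-starts the iteration so it converges in fewer rounds."""
        rank = list(prior) if prior else [1.0 / num_nodes] * num_nodes
        out_weight = [0.0] * num_nodes
        for src, _, weight in edges:
            out_weight[src] += weight
        for _ in range(iterations):
            nxt = [(1.0 - damping) / num_nodes] * num_nodes
            for src, dst, weight in edges:
                if out_weight[src] > 0.0:
                    nxt[dst] += damping * rank[src] * (weight / out_weight[src])
            # Spread the rank of dangling nodes (no outgoing edges) uniformly.
            dangling = sum(rank[i] for i in range(num_nodes) if out_weight[i] == 0.0)
            if dangling:
                share = damping * dangling / num_nodes
                nxt = [r + share for r in nxt]
            rank = nxt
        return rank
    ```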

    How do we know we’re doing better?

    Evaluating the relevance of search results is a challenge. You need to know what people are looking for (which is not always apparent from their query), and you need a sample that’s representative enough to judge improvement. Before we started work on the new infrastructure, we did research and user testing to better understand how and why people were using search on SoundCloud.

    Our evaluation baseline is a set of manually-tagged queries. To gather the first iteration of that data, we built an internal evaluation framework for SoundCloud employees: thumbs up, thumbs down on results. In addition, we have metrics like click positions, click engagement, precision, and recall, all constantly updating in our live dashboard. This allows us to compare ranking algorithms and other changes to the way we construct and execute queries. And we have mountains of click log data that we need to analyse.
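    As one small example of the metrics behind that dashboard (the post doesn't define them precisely), precision at k over the manually judged queries looks like this:

    ```python
    def precision_at_k(ranked_ids, relevant_ids, k=10):
        """Fraction of the top-k returned results that were judged relevant
        (thumbs up) for a query."""
        top = ranked_ids[:k]
        if not top:
            return 0.0
        return sum(1 for doc_id in top if doc_id in relevant_ids) / float(len(top))

    # e.g. precision_at_k(["a", "b", "c"], {"a", "c"}, k=3) == 2/3
    ```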

    We have some good improvements so far, but there’s still a lot of tuning that can be done.

    Search as navigation

    The new Suggest feature lets you jump straight to the sound, profile, group, or set you’re looking for. We’ll jog your memory if you don’t remember how something is spelled, or exactly what it’s called. Since we know the types of the results returned, we can send you straight to the source. This ability to make assumptions and customizations comes from knowing the structure and semantics of our data, and it gives us a huge advantage in applications like this.

    Architecture-wise, we decided to separate the suggest infrastructure from the main search cluster and build a custom engine based on Apache Lucene’s Finite State Transducers. As ever, delivering results fast is crucial: suggestions need to show up right after each keystroke. It turned out we were competing with the speed of light for this particular use case.
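    Lucene's FSTs are essentially a very compact, very fast map from prefixes to scored completions. The toy below only illustrates the prefix-completion idea with a sorted term list (it is not an FST, and nothing like the production engine), but it shows why a dedicated structure beats running full search queries for suggestions:

    ```python
    import bisect

    class PrefixSuggester:
        """Toy prefix suggester: binary-search a sorted (term, score) list for the
        query prefix and return the highest-scored completions."""

        def __init__(self, term_scores):
            self.terms = sorted(term_scores)  # sorted by term

        def suggest(self, prefix, limit=5):
            start = bisect.bisect_left(self.terms, (prefix,))
            matches = []
            for term, score in self.terms[start:]:
                if not term.startswith(prefix):
                    break
                matches.append((score, term))
            return [term for _, term in sorted(matches, reverse=True)[:limit]]

    # e.g. PrefixSuggester([("soundcloud", 100), ("sound of silence", 80),
    #                       ("sounds good", 60)]).suggest("sound")
    ```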

    The fact that ElasticSearch doesn’t come with a suggest engine turned out to be a non-issue; in fact, it pushed us to build this feature in isolation. This separation proved to be a wise decision, since the update rate, request patterns, and customizations are totally different, and would have made a built-in solution hard to maintain.

    Search as a crystal ball

    Like Suggest, Explore is based on our new search infrastructure, but with a twist: our main goal for Explore is to showcase the sounds in our community at this very moment. This led us to a time-sensitive ranking that puts more emphasis on the newness of a sound.
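    The post doesn't give the formula, but one simple way to express that kind of time-sensitive ranking is an exponential recency decay on top of a relevance score; the half-life below is purely illustrative:

    ```python
    import math
    import time

    def explore_score(relevance, posted_at_epoch, half_life_hours=24.0):
        """Hypothetical Explore-style score: decay a sound's relevance by its age
        so that newer sounds rank higher, halving the boost every half_life_hours."""
        age_hours = max(0.0, (time.time() - posted_at_epoch) / 3600.0)
        return relevance * math.exp(-math.log(2.0) * age_hours / half_life_hours)
    ```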

    So far, so good

    We hope you enjoy using the new search features of SoundCloud as much as we enjoyed building them. Staying focused on a re-architecture is painful when you see your existing infrastructure failing more and more each day, but we haven’t looked back, especially since our new infrastructure allows us to deliver a better user experience much faster. We can now roll out new functionality with ease, and the new Suggest and Explore, along with an improved Search itself, are just the first results.

    Keep an eye out for more improvements to come!

    Apache Lucene and Solr are trademarks of the Apache Software Foundation