SoundCloud for Developers

Discover, connect and build

Backstage Blog

  • April 25th, 2013 Events Berlin Geekettes All-Women Hackathon Roundup By Amelie Anglade, Tiffany Conroy, Emily Green, Duana Stanley

    Late last year, six women crowded into a Mitte cafe booth and listened to Berlin Geekettes founder Jess Erickson share her idea: Berlin’s first all-women hackathon. With SoundCloud’s Amelie Anglade then newly appointed as the Berlin Geekettes Tech Ambassador, we agreed it was a great idea to produce the hackathon as a partnership between the Geekettes and the women developers of SoundCloud.

    Duana discussing APIs

    Fast forward to the first weekend of March, when the vision became reality: after 24 hours of hacking, 80 women demoed 29 projects across a range of platforms, from a belt transformed into a game controller to an app aimed at helping toddlers learn to do everyday tasks.

    Read more...

  • April 16th, 2013 Announcements Removing 'hotness' parameter By Paul Osman

    The /tracks endpoint has traditionally accepted an order parameter for ordering results by either creation date or 'hotness'. The method for calculating a track's 'hotness' has never been clearly explained, but generally speaking it is based on the number of likes and listens a track receives.

    Recently we started to experience problems with the query that returns tracks ordered by hotness. In recent weeks, these problems began to affect API users and even cause outages.

    We have decided that the best way forward is to remove this parameter. Starting soon, GET requests to the /tracks endpoint will ignore the order parameter and default to ordering by creation date.

    In the future, we look forward to releasing support for more stable and idiomatic search and order parameters. In the meantime, you can still approximate the result sets previously returned by order=hotness by manually sorting the returned tracks on a combination of favoritings_count and playback_count.
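    As a rough sketch of that workaround (the client_id, query, and weighting below are purely illustrative, and the real 'hotness' formula was never published), you could fetch a page of tracks and re-sort it client-side:

    ```python
    import requests

    CLIENT_ID = "YOUR_CLIENT_ID"  # assumption: a registered API client ID

    # Fetch a page of tracks; the API now ignores order=hotness and returns
    # results ordered by creation date.
    resp = requests.get(
        "https://api.soundcloud.com/tracks.json",
        params={"client_id": CLIENT_ID, "q": "ambient", "limit": 50},
    )
    tracks = resp.json()

    # Approximate the old 'hotness' ordering by sorting on a combination of
    # favoritings_count and playback_count. The 10:1 weighting is arbitrary.
    def hotness(track):
        return 10 * track.get("favoritings_count", 0) + track.get("playback_count", 0)

    for track in sorted(tracks, key=hotness, reverse=True):
        print(track["title"], hotness(track))
    ```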

  • January 23rd, 2013 Events Hacks SoundCloud at PennApps By Paul Osman

    Last weekend, we sponsored and attended our first PennApps, the world's largest student-run hackathon, held at the University of Pennsylvania in Philadelphia. Nearly 500 students participated from a variety of universities across the US and elsewhere. Students were given 36 hours to form teams, hack on projects, and then show them off to the judges. The results were astounding: in total, over 100 hacks were submitted. Here are some of my personal favourites.

    Giving my API Demo

    Read more...

  • December 4th, 2012 Search and Discovery Architecture Architecture behind our new Search and Explore experience By Petar Djekic

    Search is front-and-center in the new SoundCloud, key to the consumer experience. We’ve made the search box one of the first things you see, and beefed it up with suggestions that let you jump directly to people, sounds, groups, and sets of interest. We’ve added a brand-new Explore section that guides you through the huge and dynamic landscape of sounds on SoundCloud. And we’ve completely overhauled our search infrastructure, which helps us provide more relevant results, scale with ease, and experiment quickly with new features and models.

    In the beginning

    In SoundCloud’s startup days, when we were growing our product at super speed, we spent just two days implementing a straightforward search feature, integrating it directly into our main app. We used Apache Solr, as was the fashion at the time, and built a master-slave cluster with the common semantics: writes to the master, reads from the slaves. Besides a separate component for processing deletes, our indexing logic was pull-based: the master Solr instance used a Solr DataImportHandler to poll our database for changes, and the slaves polled the master.

    At first, this worked well. But as time went on, our data grew, our entities became more complex, and our business rules multiplied. We began to see problems.

    The problems

    • The index was only getting updated on the read slaves about every fifteen minutes. Being real-time is crucial in the world of sound, as it’s important that our users’ sounds are searchable immediately after upload.
    • A complete re-index of the master node from the database took up to 24 hours, which made making a simple schema change for a new feature or bugfix a Sisyphean task.
    • Worse, during those complete-reindex periods, we couldn’t do incremental indexing because of limitations in the DataImportHandler. That meant the search index was essentially frozen in time: not a great user experience.
    • Because the search infrastructure was tightly coupled to the main application, low-level Solr and Lucene knowledge leaked everywhere. If we wanted to make even a slight change, such as to a filter, we had to touch 25 places in the main application’s codebase.
    • Because the application interacted directly with the master, if that single point of failure failed, we lost information about updates, and our only reliable recovery was a complete reindex, which took 24 hours.
    • Since Solr uses segment or bulk replication, our network was flooded when the slaves pulled updates from the master, which caused additional operational issues.
    • Solr felt like it was built in a different age, for a different use-case. When we had performance problems, it was opaque and difficult to diagnose.

    In early 2012, we knew we couldn’t go forward with many new features and enhancements on the existing infrastructure. We formed a new team and decided we should rebuild the infrastructure from scratch, learning from past lessons. In that sense, the Solr setup was an asset: it could continue to serve site traffic, while we were free to build and experiment with a parallel universe of search. Green-field development, unburdened by legacy constraints: every engineer’s dream.

    The goals

    We had a few explicit goals with the new infrastructure.

    • Velocity: we want high velocity when working on schema-affecting bugs and features, so a complete reindex should take on the order of an hour.
    • Reliability: we want to grow to the next order of magnitude and beyond, so scaling the infrastructure—horizontally and vertically—should require minimum effort.
    • Maintainability: we don’t want to leak implementation details beyond our borders, so we should strictly control access through APIs that we design and control.

    After a survey of the state of the art in search, we decided to abandon Solr in favor of ElasticSearch, which would theoretically address our reliability concerns in their entirety. That left the velocity and maintainability concerns to be addressed in our integration with the rest of the SoundCloud universe. To the whiteboard!

    The plan

    On the read side, everything was pretty straightforward. We’d already built a simple API service between Solr and the main application, but had hit some roadblocks when integrating it. We pivoted and rewrote the backend of that API to make ElasticSearch queries, instead: an advantage of a lightweight service-oriented architecture is that refactoring services in this way is fast and satisfying.

    On the write side—conceptually, at least—we had a straightforward task. Search is relatively decoupled from other pieces of business and application logic. To maintain a consistent index, we only need an authoritative source of events (entity creates, updates, and deletes) and some method of enriching those events with their searchable metadata. We had that in the form of our message broker. Every event that search cared about was being broadcast on a well-defined set of exchanges: all we had to do was listen in.

    But depending on broker events to materialize a search index is dangerous. We would be susceptible to schema changes: the producers of those events could change the message content, and thereby break search without knowing, potentially in very subtle ways. The simplest, most robust solution was to ignore the content of the messages, and perform our own hydration. That carries a runtime cost, but it would let us better exercise control over our domain of responsibility. We believed the benefits would outweigh the costs.

    Using the broker as our event source, and the database as our ground truth, we sketched an ingestion pipeline for the common case. Soon, we realized we had to deal with the problem of synchronization: what should we do if an event hadn’t yet been propagated to the database slave we were querying? It turned out that if we used the timestamp of the event as the “external” version of the document in ElasticSearch, we could lean on ElasticSearch’s consistency guarantees to detect, and resolve, problems of sync.

    Once it hit the whiteboard, it became clear we could re-use the common-case pipeline for the bulk-index workflow; we just had to synthesize creation events for every entity in the database. But could we make it fast enough? It seemed at least possible. The final box-diagram looked like this:

    Implementation and optimization began. Over a couple of weeks, we iterated over several designs for the indexing service. We eked out a bit more performance in each round, and in the end, we got where we wanted: indexing our entire corpus against a live cluster took around 2 hours, we had no coupling in the application, and ElasticSearch itself gave us a path for scaling.

    The benefits

    We rolled out the new infrastructure in a dark launch on our beta site. Positive feedback on the time-to-searchability was immediate: newly-posted sounds were discoverable in about 3 seconds (post some sounds and try it out for yourself!). But the true tests came when we started implementing features that had been sitting in our backlog. We would start support for a new feature in the morning, make schema changes into a new index over lunch, and run A/B tests in the afternoon. If all lights were green, we could make an immediate live swap. And if we had any problems, we could roll back to the previous index just as fast.

    There’s no onerous, manual work involved in bootstrapping or maintaining nodes: everything is set up with ElasticSearch’s REST API. Our index definitions are specified in flat JSON files. And we have a single dashboard with all the metrics we need to know how search is performing, both in terms of load and search quality.
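    A minimal sketch of that setup flow, assuming the elasticsearch Python client; the index names, alias, and JSON file below are illustrative:

    ```python
    import json
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])  # illustrative cluster address

    # Create a new index from a flat JSON definition (settings + mappings).
    with open("tracks_v2.json") as f:
        definition = json.load(f)
    es.indices.create(index="tracks_v2", body=definition)

    # ... run the bulk-index pipeline against tracks_v2, A/B test it ...

    # Swap the alias that readers query in one atomic step. Rolling back is
    # the same operation with the index names reversed.
    es.indices.update_aliases(body={
        "actions": [
            {"remove": {"index": "tracks_v1", "alias": "tracks"}},
            {"add": {"index": "tracks_v2", "alias": "tracks"}},
        ]
    })
    ```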

    In short, the new infrastructure has been an unqualified success. Our velocity is through the roof, we have a clear path for orders-of-magnitude growth, and we have a stable platform for a long time to come.

    Enter DiscoRank

    It’s not the number of degrees of separation between you and John Travolta: DiscoRank, short for Discovery Rank, is our modification of the PageRank algorithm to provide more relevant search results. We match your search against our corpus and use DiscoRank to rank the results.

    The DiscoRank algorithm involves creating a graph of searchable entities, in which each activity on SoundCloud is an edge connecting two nodes. It then iteratively computes the ranking of each node using this weighted graph of entities and activities. Our graph has millions and millions of nodes and many more edges, so we made a number of optimizations to adapt PageRank to our use case: keeping the graph in memory, and recalculating the DiscoRank quickly by using results from previous runs of the algorithm as priors. We keep versioned copies of the DiscoRank so that we can swap between them when testing things out.
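    The exact graph construction and edge weights are specific to SoundCloud activity data (and not described here), but the core computation is standard PageRank. A toy power-iteration sketch, with a previous run's scores reused as the prior:

    ```python
    import numpy as np

    def pagerank(adjacency, damping=0.85, prior=None, tol=1e-8, max_iter=100):
        """Toy power-iteration PageRank on a dense adjacency matrix.

        adjacency[i, j] is the weight of the edge from node i to node j
        (e.g. an activity connecting two entities). `prior` warm-starts the
        iteration from a previous run so it converges in fewer steps.
        """
        n = adjacency.shape[0]
        # Row-normalise outgoing weights; dangling nodes spread rank uniformly.
        out = adjacency.sum(axis=1, keepdims=True)
        transition = np.where(out > 0, adjacency / np.where(out == 0, 1, out), 1.0 / n)

        rank = np.full(n, 1.0 / n) if prior is None else prior / prior.sum()
        for _ in range(max_iter):
            new_rank = (1 - damping) / n + damping * (transition.T @ rank)
            if np.abs(new_rank - rank).sum() < tol:
                break
            rank = new_rank
        return rank
    ```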

    How do we know we’re doing better?

    Evaluating the relevance of search results is a challenge. You need to know what people are looking for (which is not always apparent from their query), and you need a sample that is representative enough to judge improvement. Before we started work on the new infrastructure, we did research and user testing to better understand how and why people were using search on SoundCloud.

    Our evaluation baseline is a set of manually-tagged queries. To gather the first iteration of that data, we built an internal evaluation framework for SoundCloud employees: thumbs up or thumbs down on each result. In addition, we have metrics like click positions, click engagement, precision, and recall, all constantly updating in our live dashboard. This allows us to compare ranking algorithms and other changes to the way we construct and execute queries. And we have mountains of click log data that we still need to analyse.
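    As an illustrative sketch of one such metric (the data shapes here are assumptions, not our actual pipeline), precision at k over the manually-tagged query set might be computed like this:

    ```python
    def precision_at_k(ranked_ids, relevant_ids, k=10):
        """Fraction of the top-k results judged relevant for one query."""
        top_k = ranked_ids[:k]
        return sum(1 for doc_id in top_k if doc_id in relevant_ids) / float(k)

    def mean_precision_at_k(run, judgements, k=10):
        """Average precision@k across all manually-tagged queries.

        `run` maps a query to the ranked result ids returned by one ranking
        algorithm; `judgements` maps a query to the set of ids tagged as
        relevant (e.g. via an internal thumbs-up/thumbs-down tool).
        """
        scores = [precision_at_k(run[q], judgements[q], k)
                  for q in judgements if q in run]
        return sum(scores) / len(scores) if scores else 0.0
    ```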

    We have some good improvements so far, but there’s still a lot of tuning that can be done.

    Search as navigation

    The new Suggest feature lets you jump straight to the sound, profile, group, or set you’re looking for. We’ll trigger your memory if you don’t remember how something is spelled, or exactly what it’s called. Since we know the types of the results returned, we can send you straight to the source. This ability to make assumptions and customizations is a consequence of knowing the structure and semantics of our data, and gives a huge advantage in applications like this.

    Architecture-wise, we decided to separate the suggest infrastructure from the main search cluster, and build a custom engine based on Apache Lucene’s Finite State Transducers. As ever, delivering results fast is crucial. Suggestions need to show up right after the keystroke. It turned out we are competing with the speed of light for this particular use-case.

    The fact that ElasticSearch doesn’t come with a suggest engine turned out to be a non-issue; it simply forced us to build this feature in isolation. That separation proved to be a wise decision, since the update rate, request patterns, and customizations for suggestions are totally different from those of search, and would have made a built-in solution hard to maintain.

    Search as a crystal ball

    Like Suggest, Explore is based on our new search infrastructure, but with a twist: our main goal for Explore is to showcase the sounds in our community at this very moment. This led to a time-sensitive ranking that puts more emphasis on the newness of a sound.
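    Purely as an illustration of time-sensitive ranking (not our actual Explore scoring), one common approach is to decay a sound's base relevance score exponentially with its age:

    ```python
    import math
    import time

    def explore_score(base_score, created_at, half_life_hours=24.0):
        """Illustrative recency-weighted score: the base relevance/popularity
        score is halved every `half_life_hours` after the sound was posted."""
        age_hours = (time.time() - created_at) / 3600.0
        return base_score * math.exp(-math.log(2) * age_hours / half_life_hours)
    ```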

    So far, so good

    We hope you enjoy using the new search features of SoundCloud as much as we enjoyed building them. Staying focused on a re-architecture is painful when you see your existing infrastructure failing more and more each day, but we haven’t looked back—especially since our new infrastructure allows us to deliver a better user experience much faster. We can now roll out new functionality with ease, and the new Suggest and Explore, along with an improved Search itself, are just the first results.

    Keep an eye out for more improvements to come!

    Apache Lucene and Solr are trademarks of the Apache Software Foundation

  • November 2nd, 2012 Guides Coding for GOOD - SoundCloud API By Paul Osman

    The web has come a long way. We have APIs that allow developers to create amazing applications quickly, and browser technologies have advanced to the point where JavaScript, HTML, and CSS are essential tools for building apps. Despite this, there's still a skills gap, and companies are having an increasingly hard time hiring for technical positions.

    This is the main reason the folks at GOOD teamed up with the Apollo Group to launch Coding for GOOD, a series of coding lessons culminating in a final project. Participants with the top three submissions will be flown to Los Angeles to participate in a hackathon, and the best of the three hacks may lead to a job offer at GOOD.

    We were excited when GOOD approached us about contributing a lesson on the SoundCloud API, so I sat down with them in their New York office and did a quick walkthrough of using the JavaScript SDK to create a simple SoundCloud app. Check it out, and if you know anyone new to using web APIs, have them take a look too.