Backstage Blog
September 27th, 2017
Project Management, Engineering Management

Deliver software faster by managing work in progress, not by adding overtime

By Matt Weiden

Product development flow (flow) is the rate at which our products are developed, from idea to deployment. Good flow means that products should pass through the development cycle quickly and continuously.

Problem Statement

SoundCloud faces challenges with its product development flow. SoundCloud’s CTO, Artem Fishman, summed up these challenges succinctly during the May 2017 Engineering Town Hall.

“We have a ton of [good] product pressure… We work really hard, but are fundamentally slow to deliver… This resonated everywhere. It takes a very long time to ship something here.”

His conclusion was reached by interviewing over a hundred employees from the Technology Organization. It is clear that employees understand these challenges on a qualitative and intuitive level.

These are classic symptoms of an overloaded product development queue.

The task for project managers was then to develop quantitative metrics for understanding our product development flow, along with guidelines for fixing the flow problems that the company already understood on a qualitative and intuitive level.

Post Structure

This post describes recent efforts to measure and improve software product development flow at SoundCloud. It demonstrates that we can deliver software faster by managing Work in Progress (WIP) and product quality rather than pushing employees to work overtime.

The second section describes the discovery phase in which we define and measure some variables for improving flow. Project managers used tools from the Project Development KPIs Project (PDKP) to understand the problems in our workflow. Understanding that data, they derived the strategy below for improving SoundCloud’s product development flow.

  1. The company should commit to fewer products at a time
  2. Teams should work on fewer product features at a time
  3. Engineers should work on fewer tasks at a time
  4. Engineering should reduce its code inventory
  5. Engineering should monitor and reduce the number of bugs over time

The third section describes our continuous learning process and our first set of guidelines for improving flow.

The fourth and last section describes preliminary results of our efforts.

Defining success and variables

This section identifies lead and cycle time as metrics of success in product development and motivates their use. It identifies variables that have an impact on lead and cycle time.

Lead and cycle time

To improve business outcomes, businesses define metrics for the desired outcome, then iterate towards that outcome by observing whether the actions they take improve or worsen those metrics. Members of the Process Optimization Group at SoundCloud use lead and cycle time as two important measures of our success in product development.

Lead time is the time between the initiation and completion of a product’s development. At SoundCloud, this is the time between when a product owner receives a work request and when the work is completed.

Cycle time is the time between when the production of a product is started and when it is completed. At SoundCloud, this is the time between when an engineering team receives a feature request and when the software is deployed to users.
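As a rough illustration of the two definitions, the sketch below computes both durations from three timestamps. The `WorkItem` class and its field names are hypothetical and are not taken from PDKP or Jira. It also assumes that "work is completed" and "deployed to users" fall on the same date, which may not hold for every team.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class WorkItem:
    # Hypothetical fields for illustration; not the PDKP or Jira schema.
    requested_at: datetime  # product owner receives the work request
    started_at: datetime    # engineering team receives the feature request
    deployed_at: datetime   # software is deployed to users


def lead_time_days(item: WorkItem) -> float:
    """Days from the original work request to deployment."""
    return (item.deployed_at - item.requested_at).total_seconds() / 86400


def cycle_time_days(item: WorkItem) -> float:
    """Days from the start of engineering work to deployment."""
    return (item.deployed_at - item.started_at).total_seconds() / 86400


item = WorkItem(
    requested_at=datetime(2017, 3, 1),
    started_at=datetime(2017, 3, 15),
    deployed_at=datetime(2017, 3, 29),
)
print(lead_time_days(item))   # 28.0
print(cycle_time_days(item))  # 14.0
```

Because cycle time starts later than lead time and ends at the same point, it can never exceed lead time for the same item.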

Keeping lead and cycle times low has many benefits, four of which are noted here. The first benefit is that we ship products to users faster, before our competitors. Being first increases the chance that we capture market share.

The second benefit is the reduction of backlog size. Product managers are incentivized not to commit to projects too far in advance. Since requirements for new products change rapidly, a minimal backlog keeps plans responsive and means teams build current product ideas instead of outdated ones.

The third benefit concerns the size of products shipped. Focusing on minimizing lead time incentivizes teams to deliver new products as MVPs and smaller changes to existing products. This more granular approach decreases the probability of large failures and over-investment in projects that will not succeed.

The fourth and most important benefit is that SoundCloud can gain information on its product development faster. By shipping faster, we get market information before our competitors. We get feedback on a product earlier in its development cycle. We can make small adjustments in strategies towards our desired outcome rather than large corrections after over-investing in a product that had limited success.

How can SoundCloud improve lead and cycle time? There are several important variables we can leverage.

Variables affecting lead and cycle time

This section presents the definitions of variables affecting lead and cycle time as well as an explanation for the effect of each.

The ratio of WIP to contributors

The ratio of Work in Progress (WIP) to contributors is a measure of how much contributors in a team are multitasking on average. PDKP estimates this by dividing the number of WIP issues in a Jira project by the number of engineers active on that project.
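The estimate described above is a simple division. A minimal sketch follows, with a hypothetical function name and made-up counts. How PDKP decides which issues count as WIP and which engineers count as active is not specified here.

```python
def wip_per_contributor(wip_issues: int, active_engineers: int) -> float:
    """Average number of in-progress issues per active engineer."""
    if active_engineers <= 0:
        raise ValueError("a project needs at least one active engineer")
    return wip_issues / active_engineers


# Made-up example: 10 issues in progress, 5 active engineers.
ratio = wip_per_contributor(10, 5)
print(ratio)  # 2.0
```

A result above 1.0 means that, on average, each engineer has more than one issue in progress at once.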

Keeping the ratio of WIP to contributors slightly below 1.0 can decrease cycle time in two ways. First, if engineers develop one product at a time, products will be ready sooner than if they multitask. Say an engineer has two tasks, each of which will take two days. They can either work on both in parallel or one after the other. The schedules of these two options are shown below.

Scheduling two tasks for one engineer

In Schedule 1, the engineer works on both tasks in parallel: task 1 is complete at the end of day 3 and task 2 is complete at the end of day 4. The more the engineer interleaves the work, the longer it takes to complete both tasks. In the extreme case, where the engineer multitasks within each day, neither task might be completed until the end of day 4. In Schedule 2, the engineer works on the first task and then the second: the first is completed at the end of day 2 and the second at the end of day 4. By working on one thing at a time, the engineer has completed a task earlier without adding work!
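The two schedules can be reproduced with a small simulation. This is a sketch under simplifying assumptions: work is split into whole days, the interleaved schedule alternates tasks one day at a time, and switching between tasks costs nothing. The function names are made up for this example.

```python
def completion_days_interleaved(durations):
    """Alternate between unfinished tasks, one day each, round-robin.

    Returns the day on which each task finishes.
    """
    remaining = list(durations)
    finished = [0] * len(durations)
    day = 0
    while any(r > 0 for r in remaining):
        for i in range(len(remaining)):
            if remaining[i] > 0:
                day += 1           # spend one full day on task i
                remaining[i] -= 1
                if remaining[i] == 0:
                    finished[i] = day
    return finished


def completion_days_sequential(durations):
    """Finish each task completely before starting the next."""
    finished = []
    day = 0
    for duration in durations:
        day += duration
        finished.append(day)
    return finished


tasks = [2, 2]  # two tasks of two days each
print(completion_days_interleaved(tasks))  # [3, 4]
print(completion_days_sequential(tasks))   # [2, 4]
```

Both schedules finish all the work on day 4, but the sequential one delivers the first task a day earlier. Any real cost of context switching would only widen the gap.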

The second reason to keep the ratio of WIP to contributors slightly below 1.0 concerns the ability of individual team members to change development roles quickly. To understand why, see the idealized model of a team’s workflow below.

Software engineering is an invisible “U-shaped flow cell.”

Engineers need to quickly transition between planning, developing, reviewing, and fixing bugs. They fulfill multiple roles in the software development process. This collaboration pattern is similar to U-shaped flow cells in Lean manufacturing.

Processing tasks in U-shaped cells can be very efficient, but not if the queue of tasks is overloaded. If more tasks are in progress than there are engineers to complete them, items will form queues waiting for attention. For example, say an engineer developing code is overloaded with new feature requests. A queue of feature requests will form. Further, it is likely that they will not be able to review code completed by their teammates, contributing to the formation of a second queue in the review stage. Either cycle time will rise or quality will suffer. Engineers need to switch roles to where queues are forming to push tasks through the queue.

Analogously, overloaded teams with a high WIP to contributors ratio will be less capable of dealing with incoming bug fixes and ad-hoc requests.

The number of products in WIP

“Products in WIP” are products in progress. PDKP estimates the number of products in WIP by counting the number of issues in a Jira project that represent a product. For most teams, these are ‘Epic’ or ‘Story’ issues. Products can be either for internal or external users.

Decreasing the number of products in WIP can decrease lead and cycle time. To illustrate, imagine we have two tasks each of which takes four days and can be shared between two engineers with no overhead. There are two schedules for this work shown below: either each engineer takes one task or they pair on the first, then the second.

Scheduling two tasks for two engineers

In Schedule 1, each engineer takes a single task and both products ship in four days. In Schedule 2, the first task is ready in two days and the second in four days. We’ve shipped the first product two days earlier without adding any work. Pairing adds some overhead, but getting information from shipping a product sooner is often worth it. These same principles apply to the planning phases of product development.
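The same comparison can be written out for two engineers. The sketch below keeps the example's assumption that a task can be shared between engineers with no overhead, so a task's duration is divided evenly by the number of people working on it. The function names are made up for this example.

```python
def one_task_per_engineer(task_days):
    """Each engineer works alone on one task; all tasks start on day 0."""
    return [float(days) for days in task_days]


def whole_team_per_task(task_days, engineers):
    """The whole team works on one task at a time, in order."""
    finished = []
    elapsed = 0.0
    for days in task_days:
        elapsed += days / engineers  # assumes perfectly divisible work
        finished.append(elapsed)
    return finished


tasks = [4, 4]  # two tasks of four days each
print(one_task_per_engineer(tasks))   # [4.0, 4.0]
print(whole_team_per_task(tasks, 2))  # [2.0, 4.0]
```

The last product ships on day 4 either way, while the average completion day drops from 4 to 3 when the team focuses on one product at a time.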

Inventory

In business, inventory is the components of a product that have not been assembled or manifested into a product ready for sale. In engineering, our inventory is anything we work on that results in a feature that is yet undelivered including undeployed code, RFCs, and unfinished Jira issues. PDKP estimates inventory with the number of commits or number of additions and deletions in pull requests in repositories that have been updated in the last three months.
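A minimal sketch of one such estimate follows: counting the commits in open pull requests and averaging their age. The input format, a list of commit dates per open pull request, is a hypothetical stand-in and not how PDKP reads data from GitHub.

```python
from datetime import date


def inventory_summary(open_prs, today):
    """Return (commit count, average commit age in days) for open PRs.

    open_prs is a list of pull requests, each given as a list of the
    dates of its commits (a hypothetical format for illustration).
    """
    commit_dates = [d for pr in open_prs for d in pr]
    if not commit_dates:
        return 0, 0.0
    ages = [(today - d).days for d in commit_dates]
    return len(commit_dates), sum(ages) / len(ages)


open_prs = [
    [date(2017, 6, 1), date(2017, 6, 3)],  # PR with two commits
    [date(2017, 6, 20)],                   # PR with one commit
]
count, average_age = inventory_summary(open_prs, today=date(2017, 6, 30))
print(count, average_age)  # 3 22.0
```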

A growing amount of code inventory is a symptom of bad flow and has a negative effect on cycle time. The more inventory there is, the more development time has accumulated in work not being seen by our users. Generally, inventory depreciates in value as it ages.

Ideally, we want to achieve what is called a “one piece flow,” a process where all code inventory is actively being worked on. This means that there should be at most one product in progress per team and one task in progress per engineer.

The challenge in accomplishing one piece flow in software development is that code inventory is invisible. In a factory that produces physical goods, inventory is obvious: it stacks up on the factory floor. In a software company, it’s not so obvious.

Software inventory isn’t visible in the same way manufacturing inventory is.

For this reason, we need to take measures to monitor code inventory and make sure it gets shipped in a timely manner.

The ratio of features to engineers

The ratio of features to engineers measures how many features each team member must maintain on average. We estimate it by dividing the number of configuration files indicating an individual feature in a team’s active repositories by the number of engineers active on a team’s Jira board.

Higher ratios correlate with higher rates of bug reports and maintenance overhead. If a team has an abnormally high ratio, the company should consider providing them with more headcount or redistributing their responsibilities to get a more even distribution.

Bugs

Our exposure to bugs is estimated by the number of issues of issue type 'Bug' in a Jira project.

Persistent bugs have a negative impact on lead and cycle time. Users and Community Support re-report the same bugs which then have to be reinvestigated before confirming they are duplicates. If engineers build on top of buggy code, the resultant implementation can be flawed, at worst resulting in wasted code that cannot stay in production. Most importantly, bugs make it harder for engineers to learn what is going wrong and to prevent future problems.

Lower bound on time to backlog completion

Multiplying a team’s cycle time by the number of products remaining in their backlog estimates a lower bound on how long it would take a team to complete their backlog.
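Written out, the estimate is a single multiplication. The function name and the numbers below are made up for illustration.

```python
def backlog_lower_bound_days(cycle_time_days, products_in_backlog):
    """Cycle time multiplied by the number of products left in the backlog."""
    return cycle_time_days * products_in_backlog


# Made-up example: a 30-day cycle time and 12 products in the backlog.
print(backlog_lower_bound_days(30, 12))  # 360
```

With these numbers, the backlog already represents about a year of work.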

Having a large backlog has three negative effects. First, it increases our lead time. Planning too far in advance creates a queue of work that teams cannot work through quickly and takes development time away from line managers and engineers. Second, it decreases our flexibility in changing plans in the future. Once something is on the backlog, teams feel committed to delivering it. Third, it decreases motivation.

What did we do about it?

This section describes SoundCloud’s continuous learning and improvement cycle, in which project managers collect flow metrics, analyze them, and turn that analysis into guidelines for improving product development flow.

For the first set of guidelines, project managers used principles from manufacturing and queueing theory to encourage teams to work on fewer products at a time and engineers to work on fewer tasks at a time.

Importantly, the guidelines don’t suggest that teams work harder. They are intended to increase focus, not the total number of hours worked.

Metrics collection

Starting in March 2017, PDKP has collected lead time, cycle time, WIP, bugs, contributor, and code inventory metrics from SoundCloud’s Jira and GitHub organizations.

To date, 18 engineering teams have configured the system to collect lead and cycle time metrics. Inventory metrics are collected for all development in SoundCloud’s GitHub organization.

All metrics are aggregated and presented in the PDKP dashboard and summary dashboard.

Guidelines and motivating analysis

1. The company should commit to fewer products at a time

SoundCloud can reduce cycle time and increase its flexibility in planning new products by reducing the number of products it commits to. Teams had high lower bounds on time to backlog completion, suggesting that overcommitment was a problem.

Of the 18 teams in the PDKP metrics collection, the median lower bound on time to backlog completion is 766 days, or about 2.1 years. While this number could represent misuse of Jira, it suggested an opportunity to shift more time from planning future products to focusing on more immediate products.

2. Teams should work on fewer product features at a time

Provided the work can be parallelized, teams should work on one product at a time in order to maximize flow. The number of products teams had in progress was too high.

Across the 18 teams sampled, PDKP reported that the median number of products in progress at once was four. An in-person survey of teams in the Creators Organization reported a similar number.

It is unlikely that there is enough downtime in the development of products that teams need to have four in progress at any given time to be productive.

3. Engineers should work on fewer tasks at a time

In theory, engineers should work on one task at a time until it is completed to maximize flow. The ratio of tasks in WIP to engineers was too high.

Across the 18 teams sampled, PDKP reported the average ratio of WIP to engineers was consistently over 2.0.

Of course, it is not always practical to work on one thing at a time while developing software. For example, teams with long-running batch jobs will have enough downtime while waiting for a job to finish to do meaningful work on other tasks. However, as with teams that have too many products, it is unlikely that the majority of teams with high ratios in the sample have this specific problem.

Importantly, high engineer utilization is not a goal. One important result of queuing theory is that not all nodes in the queue have to be fully utilized to maximize the throughput of the queue overall. In fact, it’s good if our engineers have a little idle time.
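One standard way to see why a little idle time helps is the textbook M/M/1 queue, in which the mean time a task spends in the system is the mean service time divided by one minus the utilization. This model is an assumption chosen for illustration. The analysis in this post does not say which queueing model was used, and real engineering work is far less regular than the model assumes.

```python
def mean_time_in_system(service_time, utilization):
    """Mean time in an M/M/1 queue: service_time / (1 - utilization)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time / (1.0 - utilization)


# A task that needs 2 days of work, at increasing levels of utilization.
for utilization in (0.5, 0.8, 0.9, 0.95):
    days = mean_time_in_system(2.0, utilization)
    print(f"{utilization:.0%} busy: about {days:.0f} days per task")
```

Under this model, going from 50% to 95% utilization takes the same two-day task from about 4 days to about 40 days in the system, which is why maximizing how busy each engineer is tends to hurt cycle time.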

4. Engineering should reduce its code inventory

SoundCloud has a high amount of code inventory. Reducing it is a big opportunity for us to improve these metrics.

Across SoundCloud’s GitHub repositories updated in the three months before the first metrics collection, there were between 1,200 and 1,500 commits in open pull requests, representing 200,000 to 300,000 lines of code. The average age of this inventory was 65 days. On average, teams had 22 commits in open pull requests. That’s a lot.

Teams at SoundCloud have had success in increasing flow by decreasing inventory. The Content ID Team reduced its inventory over a period of two months, bringing the daily average number of commits in open PRs from 46 to fewer than 5 and the average age of those commits from 14 days to less than 1 day. During this time, the team’s cycle time dropped from 12 days to 3 days.

Given this longitudinal experimental design, it’s only possible to show correlation—not causation—between reducing inventory and cycle time. However, the correlation is positive and suggests further effort on this front could help.

Importantly, code review should not be rushed. Rather, it is better to review code carefully and thoroughly as soon as it reaches the review stage, or to use a continuous integration review process where merging code into the master branch is not blocked by code review.

5. Engineering should monitor and reduce the number of bugs over time

Monitoring and reducing the number of bugs in a team’s system has a positive impact on lead and cycle time. SoundCloud is currently implementing a standardized bug reporting process. This is another opportunity to speed up both development and the learning feedback cycle.

Guidelines for management

The number of products product management commits to at once caused teams to overload their product development queues. Engineering managers then overloaded engineers by accepting multiple products into WIP at once.

Overloaded product queues propagate from product managers to project managers to engineers.

Note that the propagation of queue overload can cause both of the scheduling problems described earlier: the single-engineer one and the multiple-engineer one.

Product and engineering management must lead and support changes to the way we work in order to improve flow, especially since some of these practices have become habitual in the company. Accordingly, the following guidelines were given to product managers.

1. The company should commit to fewer products at a time - Ideally, teams and product leadership should agree on a limit for the number of products that can be in progress for each team.

2. Teams should work on fewer product features at a time - Product managers should prioritize one product at a time, one feature at a time, and encourage the teams to deliver likewise if possible.

The following guidelines were given to engineering managers.

2. Teams should work on fewer product features at a time - Engineering managers should be conscious of pulling disparate product features onto the board at the same time. They should encourage the team to focus on a limited number of features at once, ideally one. They should encourage dividing work for a product feature into subtasks that can be shared in the team to enable that focus.

3. Engineers should work on fewer tasks at a time - Engineering managers should educate their teams about flow management and encourage them to focus on a task until it’s done, if possible.

4. Engineering should reduce its code inventory - Engineering managers should monitor their team’s code inventory and encourage them to keep it low.

5. Engineering should monitor and reduce the number of bugs over time - Engineering managers should monitor their team’s backlog of bugs and schedule them to be fixed.

Guidelines for teams

Teams were given the following guidelines.

3. Engineers should work on fewer tasks at a time - Engineers should learn the basics of flow management and try to focus on one feature at a time if possible.

4. Engineering should reduce its code inventory - Engineers should review code promptly once it hits the review stage and alert others when work is blocked.

5. Engineering should monitor and reduce the number of bugs over time - Engineers should fix bugs!

A tool for improving flow

Simple, visual controls are provided as part of PDKP to help project managers improve their product development flow. These are presented on the PDKP Summary Dashboard, shown below.

The PDKP Summary Dashboard.

The metrics below are displayed on the board.

  • Lead time
  • Cycle time
  • Ratio of features to engineers

A second set of metrics, shown below, is paired with colors to suggest specific actions. Green means the team’s performance against the metric is good, yellow means performance against the metric is likely OK, red means the metric should be brought down to promote flow.

  • The estimated lower bound of days until backlog completion
  • The number of unresolved bugs in the team’s backlog
  • The number of products in WIP
  • The ratio of WIP to engineers
  • The number of commits in open PRs
  • The average age of code in open PRs
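The color coding can be thought of as a pair of thresholds per metric, where lower values are better. The sketch below shows the idea. The function name and the threshold values are invented for this example and are not the thresholds PDKP uses.

```python
def action_color(value, green_max, yellow_max):
    """Map a lower-is-better metric to an action color."""
    if value <= green_max:
        return "green"   # performance against the metric is good
    if value <= yellow_max:
        return "yellow"  # performance is likely OK
    return "red"         # the metric should be brought down


# Invented thresholds for the ratio of WIP to engineers.
print(action_color(0.9, green_max=1.0, yellow_max=1.5))  # green
print(action_color(1.2, green_max=1.0, yellow_max=1.5))  # yellow
print(action_color(2.0, green_max=1.0, yellow_max=1.5))  # red
```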

The motivations for optimizing these metrics were discussed in a previous section.

The actions suggested by the action colors on the Summary Dashboard will not always be appropriate in your day-to-day work. Sometimes they will be wrong. However, simply asking questions about your workflow when a metric lights up red will likely improve your product development flow.

Importantly, actions never suggest that a team isn't working hard enough. They are intended to increase focus, not the total number of hours worked.

Preliminary results

Since the addition of the last teams to the PDKP project on June 23rd, 2017, SoundCloud has seen a reported 39% decrease in the average lead time from a mean of 106.4 days to 64.7 days and a 52% decrease in average cycle time from a mean of 77.8 days to 37.2 days.
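The reported percentages follow directly from the before and after means. A quick check, using a helper function named for this example only:

```python
def percent_decrease(before, after):
    """Relative decrease from before to after, as a percentage."""
    return (before - after) / before * 100


lead = percent_decrease(106.4, 64.7)
cycle = percent_decrease(77.8, 37.2)
print(round(lead))   # 39
print(round(cycle))  # 52
```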

Aggregate lead time metrics since June 23rd, 2017.

Aggregate cycle time metrics since June 23rd, 2017.

This decrease is correlated with product and engineering leadership successfully decreasing the average number of products SoundCloud’s teams have in WIP from an average of 7.4 products per team to 3.1 products per team.

Aggregate metrics of the number of products in WIP decreasing since June 23rd, 2017.

More recently, engineering managers and engineering teams successfully reduced our code inventory from a rolling average of about 1,200 commits to about 500 commits. It is likely that we have not yet seen the full benefits of this reduction reflected in our lead and cycle times.

A recent decrease in code inventory.

It is possible that much of the improvement in our measured lead and cycle times is due to cleanup and better use of Jira. Tickets in Jira now better represent what is actually happening. If this is true, at least we are getting a better understanding of what the true numbers are.

Future improvements

SoundCloud still has a great deal of room left to improve its product development flow. Specifically, the number of products teams work on at a time, the number of issues engineers work on at a time, and the amount of code inventory all need to be further reduced.

The number of products teams are working on at a time has successfully been reduced, but should be reduced further. Teams still work on an average of 4.5 products at a time. While it does not always make sense for a whole team to work on one product simultaneously, an average that is only slightly below the average number of engineers per team (about five) suggests there is room for improvement.

The number of tasks that engineers have in WIP has not improved and should be reduced. Since June 23rd, 2017, it has remained at an average of about 2.0 per engineer. Bringing this number closer to 1.0 would likely decrease our lead and cycle times.

SoundCloud’s amount of code inventory also remains a problem. Further reducing this number represents another big opportunity to reduce our lead and cycle times.

This said, we’ve improved our product development flow a great deal. We're excited to learn more about our development cycle and to make future improvements.