One challenge engineering teams often face is dealing with work that doesn’t revolve around developing new features but that still requires the team’s attention and time. The Content Engineering Team here at SoundCloud is no exception, so we iterated on a process for handling unplanned and support tasks, with the goal of fewer interruptions and more time for implementing planned features.
We define unplanned and support tasks in the following way:
- Unplanned tasks — work that unexpectedly appears and needs immediate action. An example of this is a severe bug that impacts our users.
- Support tasks — work that doesn’t need immediate action and can be planned but still needs engineering support. Reasons for these tasks can be lack of automation, lack of visibility into the systems we build, work caused by infrastructure changes, or non-critical bugs we introduced ourselves.
There are several reasons why we need an approach to dealing with unplanned and support tasks. First, these tasks take up time we would normally spend on delivering features, so naturally we want to minimize the number of them. Second, we want to make this kind of work visible; it is often invisible outside of the team, which can give the impression the team is slow and dysfunctional. Finally, these unexpected and often urgent tasks cause extra stress and context switching for the team, and support tasks are often not very interesting because of their manual and repetitive nature.
The Request Process
Initially, my team had a very open way of dealing with requests from stakeholders that resulted in unplanned and support tasks: People who had requests could leave them in our Slack channel, and a team member in a single point of contact (SPOC) role was supposed to address them.
Some team members liked taking on this role because they liked helping out our stakeholders. Others didn’t like it as much because they didn’t want to be interrupted, and as a result, were slower in responding to requests.
Even though this setup meant a designated person was responsible for addressing incoming requests at any given time, all the other team members still saw the requests. As a result, they were often distracted, and some people started discussing the requests even when they were not urgent. It got to the point where the distractions and the unclear priority of the requests became a recurring topic in our team retrospectives.
This led us to try another approach. We got rid of the SPOC role and instead tried asking stakeholders to reach out using our team Slack channel or via direct messages, but only for urgent requests. We defined an urgent request as something that impacts our users, disrupts a key feature, or blocks employees from doing their work. For all other requests, we specified an email address that reached the team Engineering Manager (EM) and Product Manager (PM).
We had no difficulties convincing stakeholders to try this approach, and as a result, the number of requests in Slack dropped almost immediately.
This is the support request process we still use today.
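The routing rule described above can be written down as a small decision function. This is only an illustrative sketch: the `Request` fields, the `Channel` names, and the `route` helper are hypothetical and are not part of any tooling the team describes; only the three urgency criteria and the two channels come from the process itself.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    # The two channels named in the process description.
    SLACK = "team Slack channel or direct message"
    EMAIL = "email to the EM and PM"


@dataclass(frozen=True)
class Request:
    # Hypothetical shape of an incoming stakeholder request.
    summary: str
    impacts_users: bool = False
    disrupts_key_feature: bool = False
    blocks_employees: bool = False


def is_urgent(request: Request) -> bool:
    """A request is urgent if any one of the three criteria holds."""
    return (
        request.impacts_users
        or request.disrupts_key_feature
        or request.blocks_employees
    )


def route(request: Request) -> Channel:
    """Urgent requests go to Slack; everything else goes to email."""
    return Channel.SLACK if is_urgent(request) else Channel.EMAIL
```

For example, `route(Request("Uploads are failing", impacts_users=True))` yields `Channel.SLACK`, while a request with no criterion set is routed to `Channel.EMAIL`.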
Planning the Work
We have two-week iterations in our team, and if a non-urgent support request comes up, we add it to the next iteration. This next-iteration rule is important because we want to avoid building up a backlog of issues that never get worked on.
Initially, we went over the support task tickets during our planning session so that everyone was aware of them. Then we let the team self-organize with the goal of finishing the tasks by the end of the iteration. This resulted in a situation where it was often the same team members tackling the support tickets while other team members focused on feature work the entire length of the iteration. However, this wasn’t necessarily fair, so after the issue came up in one-on-one meetings and retrospectives, we addressed it and found a new solution.
Our new way of handling these tasks was to assign support tickets to team members in a round-robin fashion, thereby ensuring everyone got a fair share of them. Dividing the work more equally improved team morale and also increased knowledge sharing.
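As a minimal sketch of what round-robin assignment can look like, the function below walks through the team list in order and wraps around when it reaches the end. The ticket keys, member names, and the `start` offset (used here to continue the rotation from one iteration to the next) are assumptions made for illustration; the process above only says that tickets are assigned round-robin.

```python
from typing import Dict, List, Sequence, Tuple


def assign_round_robin(
    tickets: Sequence[str],
    members: Sequence[str],
    start: int = 0,
) -> Tuple[Dict[str, str], int]:
    """Assign each ticket to the next member in turn.

    Returns the ticket-to-member mapping and the index of the member
    who should receive the first ticket of the following iteration.
    """
    if not members:
        raise ValueError("at least one team member is required")

    assignments: Dict[str, str] = {}
    for offset, ticket in enumerate(tickets):
        assignments[ticket] = members[(start + offset) % len(members)]

    next_start = (start + len(tickets)) % len(members)
    return assignments, next_start


# Hypothetical data for illustration only.
team: List[str] = ["ana", "ben", "chris"]
support_tickets: List[str] = ["T-1", "T-2", "T-3", "T-4", "T-5"]
assignments, next_start = assign_round_robin(support_tickets, team)
```

Passing `next_start` as the `start` argument in the following iteration keeps the rotation going, so the same person does not always receive the first ticket.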
Another variant we started trying out recently is having one day in the iteration when the team works together to finish the support tickets. This has the additional advantage that more pairing takes place, which is better for knowledge sharing. It also turns the support work into a shared team effort.
Tracking the Work
As part of the overall process, we also think it’s important to track the work done on these tasks. We do this by creating a JIRA issue for every request that comes in, even when the request can be answered by the Engineering Manager or Product Manager. Once an issue is created, we also share it with the stakeholder who initiated the work so they can follow up on the progress.
Every JIRA issue created for a request is labeled. Urgent requests get the “unplanned” label, and non-urgent requests get the “support” label. Then, once we are done working on an issue, we log the estimated amount of time we spent on it using JIRA’s Log Work functionality.
Measure, Learn, and Act
Keeping track of the work in this way allows us to measure it, learn from it, and act on it.
More specifically, about once every two months, we query all the finished unplanned and support issues of the past iterations in JIRA. We then group them in categories and add up the time spent on them. An example of this from earlier this year is shown below.
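Independently of the team's own figures, the grouping and summing step can be sketched in a few lines. Everything in this sketch is hypothetical: the `FinishedIssue` shape, the issue keys, the category names, and the hours are made up for illustration, and the category is assumed to be assigned by hand during the review rather than read from JIRA.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class FinishedIssue:
    # Hypothetical export of a finished issue.
    key: str
    label: str           # "unplanned" or "support"
    category: str        # assumed to be assigned manually during the review
    hours_logged: float  # the estimate recorded via Log Work


def hours_by_category(issues: Iterable[FinishedIssue]) -> Dict[str, float]:
    """Sum logged hours per category, largest total first."""
    totals: Dict[str, float] = defaultdict(float)
    for issue in issues:
        totals[issue.category] += issue.hours_logged
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Made-up sample data, not real measurements.
sample = [
    FinishedIssue("T-1", "support", "missing automation", 6.0),
    FinishedIssue("T-2", "unplanned", "bug recovery", 10.0),
    FinishedIssue("T-3", "support", "missing automation", 3.5),
    FinishedIssue("T-4", "support", "infrastructure changes", 4.0),
]
totals = hours_by_category(sample)
```

Sorting by total puts the most expensive category first, which is the natural starting point when deciding what to act on.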
Based on the categories and the time spent, we come up with actions. Some examples of actions we’ve come up with so far include:
- We found out that we spent quite a lot of time supporting our operations team because we were missing some automation to perform a specific task. This led us to prioritize the work required to automate the missing step, which not only got rid of the repetitive and boring tasks for our engineers, but also removed the dependency and waiting time for our operations team.
- We spent a considerable amount of time fixing our deployment pipelines because of infrastructure changes that were outside of our control (e.g. updated base Docker images). This led us to inform the infrastructure team and ask them to consider whether the benefits of their changes outweighed the additional workload for the engineering teams. We also improved our deployment pipelines so that they are triggered automatically when a Docker image they depend on is updated. By doing this, we get notified of breaking changes immediately instead of being caught by surprise when we make a code change.
- When we introduced a bug in one of our systems, recovering from it took a lot of time. We found that this was because the system generated derived data (intermediate state), and because we couldn’t reprocess all the input data, all the derived data had to be patched manually. This motivated us to come up with a plan to improve the system design.
- We found out that some of the bugs we introduced were caused by not having good enough validation criteria for our stories/epics. This resulted in us becoming better at defining validation criteria together with our PM.
Measuring, learning, and coming up with action points is important, because having tangible numbers about the time spent on unplanned and support work makes it easier to prioritize work for fixing structural issues. This is good for both the team and for SoundCloud, as engineers are interrupted less and can spend more time on impactful work.
If you are interested in learning more about our processes, you should also read the blog post Deliver Software Faster by Managing Work in Progress, Not by Adding Overtime.