Almost every company accumulates tech debt as time goes on. Tight deadlines, changing requirements, scaling issues, poor or short-sighted system designs, knowledge silos, inconsistent coding practices, turnover of key staff — these things all happen and can contribute to tech debt. So what can be done about it once it’s there?
My SoundCloud team, the Content Team, looks after many legacy systems, along with several large systems that are difficult to understand and work with. Topics such as “Tech debt constantly increases development time” and “We have too much tech debt” come up frequently at our team retrospectives.
We have just completed a project that was almost exclusively about tech debt reduction. This post is a series of lessons learned about how to engage in this process (and when to decide not to).
A system with too much tech debt eventually becomes unmaintainable. Changes either take too long, are too difficult to test, or carry an unacceptably high risk of unintended consequences. The only way to extend the system is to add more tech debt. New feature development stops because the value gained is not proportional to the effort required.
Bad experiences with the system breed reluctance to work with the system at all, institutional knowledge of it gradually erodes, upgrades are postponed or simply don’t happen, and eventually someone puts a comment in the system’s README file that says something like “Never change this system again.”
Even for systems that haven’t reached the maintainability tipping point, tech debt can cause constant issues and increase development time, developer stress, and potential bugs.
A classic example of this problem is a test suite with one or more intermittently failing tests. Each individual case of the suite failing but then succeeding on rerun is minor, but the cumulative effect over weeks or months is far from minor. The time lost to reruns is an obvious consequence, but the erosion of confidence in the test suite can be even more damaging. True test failures will be hard to distinguish from the intermittent, “expected” failures, and the temptation to deploy with failing tests, or disable the test suite entirely, will grow.
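To make the failure mode concrete, here is a minimal sketch in Python (the function and dates are invented for illustration): a test that implicitly depends on the wall clock passes on most days and then fails every weekend, while the deterministic version pins the clock so a rerun always agrees with the first run.

```python
from datetime import datetime, timezone


def is_weekend(now=None):
    """Return True if the given (or current) UTC time falls on a weekend."""
    now = now or datetime.now(timezone.utc)
    return now.weekday() >= 5  # Monday == 0 ... Sunday == 6


# Flaky: the outcome depends on when the suite happens to run, so this
# passes Monday through Friday and then fails every Saturday and Sunday.
#
#     def test_not_weekend_flaky():
#         assert not is_weekend()
#
# Deterministic: the clock is pinned, so the result never depends on
# when the test runs.
def test_friday_is_not_weekend():
    assert not is_weekend(datetime(2024, 3, 1, tzinfo=timezone.utc))  # a Friday


def test_saturday_is_weekend():
    assert is_weekend(datetime(2024, 3, 2, tzinfo=timezone.utc))  # a Saturday
```

Hidden dependencies like this (wall-clock time, shared databases, network calls, test ordering) are the usual sources of intermittent failures, and making the dependency explicit is usually the cheapest fix.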
We’ve seen above why tech debt is undesirable, but deciding what to do about it is not always easy. Here are a few possibilities.
Much has been said about “the allure of the full rewrite.” It’s tempting because it solves every problem the current system has — but at the expense of introducing every problem the new system will have. Note that the two lists of problems do not have to be mutually exclusive!
Starting over may well be the only option for systems that have passed the tech debt tipping point or for systems that are no longer fit for purpose because of fundamental design or scaling problems.
Starting over is also a large, all-or-nothing commitment to launching a new system. It will bring no value at all if the project fails, stalls, or is canceled.
Deletion is a common option of last resort for unreliable tests. Tests that don’t reliably report on correctness, or that don’t run in a reasonable amount of time, are not useful and may in fact be a hindrance. If rewriting the tests is not possible, not having the tests at all may well be preferable.
Needless to say, not having tests is generally not a good state of affairs to be in either.
A depressingly common approach is to do nothing and ignore the problem. This can stem from inexperience, a business strategy that demands new feature development at all costs, or simple wishful thinking. Regardless of the reason, persistent inaction tends to lead to the tech debt tipping point.
It’s also important to accept that some tech debt is here to stay. No system is perfect, and attempts at perfection eventually reach a point where massive effort is necessary for very little gain. The goal should be a system that’s “good enough.”
The definition of “good enough” will vary from system to system (see below), but tech debt that prevents the system from fulfilling this definition is important tech debt that needs paying down.
Before deciding, it’s helpful to ask yourself a few questions:

- Is this tech debt causing serious, ongoing problems, or is it merely an annoyance?
- Is it pushing the system toward the tipping point, or blocking work you actually need to do?
- Would paying it down cost more effort than it saves?
One of my team’s services had a long history of tech debt-related problems. When we got a request for several new features, we believed adding these features would push the system perilously close to the tech debt tipping point. The system was designed in a way that put heavy, ever-increasing load on its database. We knew this load would eventually become unmanageable and the proposed new features would only hasten this process. The system needed either a substantial design change or a wholesale replacement. Our ideas for design changes were risky and complex, and we felt more confident of success with a wholesale replacement system. In other words, we felt radical action was justified in this case.
The replacement system is a batch pipeline that we built as a new component in a related system I’ll call Gulper. While the new component was essentially a greenfield project, it still had to integrate with other parts of Gulper. Gulper has not hit the tech debt tipping point, but as one of the team’s oldest and most complex systems, it is tech debt heavy and often hard to work with. In addition to adding the new component, we wanted to pay down some of Gulper’s tech debt in other areas, but we needed to balance that against our project deadlines.
Gulper’s existing acceptance tests proved to be a particular pain point. They often failed spuriously, they were written in a different language than the main code base, they used external resources in a way that allowed only one instance of the tests to run at a time, and they ran incredibly slowly. These tests run before every Gulper deploy. At times during our project, there were three people changing and deploying Gulper multiple times a day, and the acceptance tests were costing us as much as ten or fifteen staff hours a week. The idea of deleting these tests outright did cross our minds at one point, but instead we committed to heavily refactoring them, using a contract-driven approach to decouple test execution from shared external resources and mitigate spurious failures. Given the huge amounts of staff time the tests’ tech debt was costing, it was clear they were nowhere near “good enough,” so spending the staff time to make them “good enough” made perfect sense.
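The contract-driven idea can be sketched as follows (in Python, with invented names; this illustrates the technique, not Gulper’s actual code). Instead of every test talking to one shared external service, each test runs against an in-process stub that replays a recorded contract, so tests no longer contend for the shared resource and its spurious failures disappear. In a real setup the contract would be generated and verified against the provider, for example with a tool like Pact; here it is inlined for brevity.

```python
# Hypothetical contract recorded from the real service: for each request,
# the response the provider has agreed to return.
CONTRACT = {
    "GET /tracks/42": {"status": 200, "body": {"id": 42, "title": "Demo"}},
}


class StubTrackService:
    """In-process stand-in for the shared external service.

    Each test gets its own instance, so tests no longer compete for one
    shared resource and can run in parallel.
    """

    def __init__(self, contract):
        self.contract = contract

    def get(self, path):
        response = self.contract.get(f"GET {path}")
        if response is None:
            raise AssertionError(f"no contract recorded for GET {path}")
        return response


def fetch_track_title(service, track_id):
    """Code under test: talks to whichever service it is handed."""
    response = service.get(f"/tracks/{track_id}")
    if response["status"] != 200:
        raise RuntimeError(f"unexpected status {response['status']}")
    return response["body"]["title"]


def test_fetch_track_title():
    service = StubTrackService(CONTRACT)
    assert fetch_track_title(service, 42) == "Demo"
```

The trade-off is that the stub is only as trustworthy as the contract, which is why contract-driven setups periodically verify the recorded responses against the real provider.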
In contrast, we encountered a number of issues with Gulper’s data model that we decided to live with. Refactoring these would have been a major undertaking, and we didn’t believe the time spent would produce sufficient benefit, especially when weighed against our other priorities.
In a nutshell, this project involved some radical action in the form of a full rewrite of one system, some less radical action in the form of a heavy test suite refactor in another system, and some calculated inaction in the form of choosing not to refactor the data model issues.
Identifying important tech debt is not always easy — it takes practice and can be heavily subjective and situation-dependent as well — but paying down important tech debt now may very well keep your systems from hitting the tech debt tipping point later. Conversely, spending too much time paying down unimportant tech debt impedes your ability to deliver features in a timely manner and can be immensely frustrating. As with so many things in life, the key is finding balance.