Dealing with technical debt
Software is complex, with many layers of business logic, libraries, utilities, and state management. With multiple engineers working in any given system over the years, the complexity keeps growing while the shared knowledge and understanding of it keep shrinking. It's only a matter of time before someone skips a unit test, installs an open-source dependency to save time, or writes sloppy code to quickly test a feature. All of this lowers the software's quality, increases its complexity, and introduces more risk. Most of us call that technical debt.
Tech debt is a risk or complexity that degrades the quality and stability of the system. It's a cost we're paying for borrowing future productivity by taking shortcuts to ship features faster now. Ward Cunningham introduced the analogy to financial debt in the 1990s. Also, Martin Fowler has a great take on the topic.
The code is never perfect, no matter how good your test coverage is. As soon as you ship something to production, you accumulate technical debt. According to the software entropy principle, the complexity of a system increases with every update. This complexity is overhead that software engineers have to support. The system's quality goes down over time unless it is rigorously maintained (Lehman's law of software evolution #7).
It's never just the code you produced; it is one of the layers in the architecture. Most systems rely on open source software, external libraries, frameworks, and internal packages other teams have written. In other words, you borrow the code from elsewhere. When you borrow something, technically, you have debt.
Technical debt is the price we pay for carrying the overhead of extra complexity. Another way to look at it is as the risk of something unexpected happening: more bugs or a crash with significant business impact, slower delivery of new features, lost customer trust, or a security breach.
The smaller the debt, the less complicated, more flexible, and more lightweight the system is. The fewer applications and dependencies you have, the less debt you'll carry in the long run. The more hacks you introduce as shortcuts for immediate productivity, the more debt you accumulate. For example, skipping unit tests feels faster at first, but you pay for it later with more manual testing, more production bugs to fix, and less readable code. Some are willing to pay the interest, and that's OK if it's a deliberate decision.
Taking shortcuts is borrowing your future productivity. Sometimes it makes sense to ship a feature faster, so you consciously introduce risk by taking a shortcut. But like any debt, you have to pay back the principal, or the interest keeps growing. The risk grows the longer you wait, and the longer you wait, the more new tech debt you're likely to introduce. In the final stage, the maintenance cost gets so high that you have to do a full re-write, which is always a challenging conversation with business stakeholders because it means all new features are on hold.
Full re-writes or migrations are part of software evolution. I have participated in a few significant migrations in my career, and these projects are always bigger than they seem at first. More complexity and hidden features keep surfacing the deeper you get. Even though a greenfield project may sound like fun because you get to use the latest and greatest tech, keep in mind that the pressure to get it done is stressful. Still, in some cases, a full re-write makes the most sense as a way to address technical debt. I would argue that complete re-writes are inevitable. At some point, it becomes more efficient to decouple one or a few components and migrate the system in pieces. That's one of the benefits of microservices.
How do you deal with technical debt, if it's unavoidable?
Minimize technical debt
Avoid unnecessary dependencies where possible. Like everything else, the decision to use open source is a trade-off. By using someone else's software, you save engineering time and accelerate time to market. But the cost is bloated code, because the library probably covers more use cases than you need, and maintenance that is outside of your control. That becomes a problem when there are known security vulnerabilities with no timeline for a fix because the maintainers have disappeared.
Of course, we cannot completely avoid external dependencies. Otherwise, you'd write in Assembly. Most open-source software brings more benefits than risk. It handles edge cases, comes with bug fixes and test coverage, and more. But it can also make us lazy. It's easier to find a library on GitHub than to write a small utility ourselves. For example, consider writing five lines of code or using native JS methods instead of importing lodash.
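To make the lodash point concrete, here is a minimal sketch of how three commonly imported helpers could be written with native methods. The function names mirror lodash's uniq, chunk, and groupBy, but these are hypothetical stand-ins that cover only the simple cases, not drop-in replacements for the library.

```typescript
// Hypothetical stand-ins for three lodash helpers, simple cases only.

// Roughly _.uniq: keeps the first occurrence of each value.
function uniq<T>(items: T[]): T[] {
  return Array.from(new Set(items));
}

// Roughly _.chunk: splits an array into groups of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) return [];
  const result: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    result.push(items.slice(i, i + size));
  }
  return result;
}

// Roughly _.groupBy with a key function: buckets items by a string key.
function groupBy<T>(items: T[], keyOf: (item: T) => string): Record<string, T[]> {
  const groups: Record<string, T[]> = {};
  for (const item of items) {
    const key = keyOf(item);
    const bucket = groups[key] ?? [];
    bucket.push(item);
    groups[key] = bucket;
  }
  return groups;
}
```

If you find yourself needing the edge-case handling the library provides, that's a sign the dependency is earning its place.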
New technology is cool, but don't be quick to jump on the bandwagon just because it's trending on Twitter. The longer a technology has been around, the more likely it is to stay around. Old technology pays our bills. While we all like to talk about Rust, Go, and Kubernetes, most of the world runs on Java 8. Use a few old, proven, boring technologies to solve your problems. Apply existing technology to as many problems as possible before introducing something new. New stuff comes with the cost of maintenance, support, and a learning curve.
Address technical debt
We already know it's impossible to avoid tech debt altogether. It's a good idea to have a plan to manage it, so it doesn't spiral out of control.
Make it part of the process.
Every sprint should include technical stories. If your engineering team has a seat at the table during planning, as it should, they will advocate for tech debt work to be prioritized. In my experience, the most common split is 10-20% of capacity for tech debt and the rest for feature work. Some teams dedicate 1 or 2 sprints at the end of each quarter to purely technical work. Another option is to use on-call engineering time to fix issues proactively. How you structure it matters less than making sure the tech debt work happens regularly.
For any work to happen, including tech debt, it has to be clearly defined. Treat this work as a product backlog, or make it part of the backlog. It should be on equal footing with feature work. Once you have an effort vs. impact analysis, it should be easy to prioritize one piece of tech debt work against other tech work. It's still not easy to choose tech debt over feature work, though. The latter brings instant gratification and feels more meaningful to engineers.
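An effort vs. impact analysis can be as simple as scoring each item and sorting by the ratio. The sketch below is one way to do it, not a prescribed method: the DebtItem shape, the 1 to 5 scales, the tie-breaking rule, and the sample backlog are all assumptions made for illustration.

```typescript
// Hypothetical backlog item; the 1-5 scales are an assumption.
interface DebtItem {
  title: string;
  impact: number; // 1 (low) to 5 (high): expected gain in stability or velocity
  effort: number; // 1 (small) to 5 (large): rough engineering cost
}

// Rank by impact-to-effort ratio, highest first; ties go to the higher impact.
// Works on a copy so the original backlog order is left untouched.
function prioritize(items: DebtItem[]): DebtItem[] {
  return items.slice().sort((a, b) => {
    const diff = b.impact / b.effort - a.impact / a.effort;
    return diff !== 0 ? diff : b.impact - a.impact;
  });
}

// Made-up sample data.
const backlog: DebtItem[] = [
  { title: "Upgrade unsupported framework version", impact: 5, effort: 5 },
  { title: "Add tests around checkout module", impact: 4, effort: 2 },
  { title: "Remove unused dependency", impact: 2, effort: 1 },
];

for (const item of prioritize(backlog)) {
  console.log(`${(item.impact / item.effort).toFixed(1)}  ${item.title}`);
}
```

The scores themselves are judgment calls. The value is in making the trade-off explicit enough to discuss during planning.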
Use business requirements as an opportunity.
We discussed this topic on our podcast with Frank Lacalamita (episode 2). Using a business or product initiative is the most efficient way to drive any transformation, get buy-in, and motivate the team. Instead of fighting the uphill battle of convincing everyone around you that it's time to address tech debt, use the product or business roadmap as a driver for refactoring and for creating more technical opportunities for business growth and innovation.
Treat bigger tech debt items as projects. Keep a list of projects prioritized according to your effort vs. impact analysis. Each project should have an RFC explaining the problem, solution, business impact, and risks, among other things. It's important to prepare before quarterly or annual planning starts. There is usually room for negotiation to include some technical work. When this opportunity comes, you'd better have everything you need to back up your asks.
Set your priorities.
With any debt, you want to start paying down the principal first to minimize the interest. Find a high-impact area where the investment will improve the foundation you'll build later enhancements on. For example, before adding a random unit test, take the time to understand what the current test coverage is and which components are breaking or changing the most.
You're only paying interest, experiencing the pain of having technical debt, when you have to touch the code.
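One way to find where the interest is highest is to combine how often a component changes or breaks with how little of it is tested. The sketch below is a simple heuristic under stated assumptions: the ComponentStats fields, the scoring formula, and the sample numbers are all hypothetical, and in practice the inputs might come from version-control history, a coverage report, and a bug tracker.

```typescript
// Hypothetical per-component stats; field names are assumptions.
interface ComponentStats {
  name: string;
  changesLastQuarter: number; // how often the code was touched
  bugsLastQuarter: number; // production defects traced to it
  coverage: number; // fraction of lines covered by tests, 0 to 1
}

// Assumed heuristic: activity (changes plus bugs) weighted by the untested share.
function testingPriority(c: ComponentStats): number {
  return (c.changesLastQuarter + c.bugsLastQuarter) * (1 - c.coverage);
}

// Highest priority first; sorts a copy and leaves the input as-is.
function rankForTesting(components: ComponentStats[]): ComponentStats[] {
  return components.slice().sort((a, b) => testingPriority(b) - testingPriority(a));
}

// Made-up sample data.
const components: ComponentStats[] = [
  { name: "auth", changesLastQuarter: 15, bugsLastQuarter: 1, coverage: 0.9 },
  { name: "legacy-reports", changesLastQuarter: 2, bugsLastQuarter: 0, coverage: 0.1 },
  { name: "checkout", changesLastQuarter: 40, bugsLastQuarter: 6, coverage: 0.5 },
];

for (const c of rankForTesting(components)) {
  console.log(`${c.name}: ${testingPriority(c).toFixed(1)}`);
}
```

Note how a poorly covered component that nobody touches scores low: untested code that never changes costs little interest.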
Educate the stakeholders.
Don't assume everyone understands what technical debt is and why it's a risk. As an engineering manager, you have to educate your product managers and business managers on the current system's issues and limitations, and on their impact on velocity and stability.
The fact that PHP is not cool and you want to migrate everything to Rust is not a good reason for executives to put everything on hold for eight months. Instead, educate them on the importance of paying down tech debt and on all the opportunities well-maintained code creates, from attracting better talent to better velocity and productivity to faster applications. It's like changing the oil in a car: do it regularly, or risk the engine overheating while you're driving on a highway.
Avoid using tech jargon and acronyms. You have to use business language and reasoning. Remember, the software is a means to an end. It's enabling the business to exist, but it's not the business.
It's not about shiny new toys, but about what contributes to the bottom line, directly or indirectly. Some benefits are hard to quantify, like hiring or the team's velocity, but they are still good reasons for investment. By definition, refactoring changes the code in ways that are invisible to the end user, which makes some non-technical stakeholders more reluctant to commit.