Why Software Bugs Are Like Mini Outages

If this past year has shown us anything, it’s the importance of resilience. Businesses of all sorts have had to find creative ways to get through a very tough time. And one of those ways is through technology. Companies that never planned to be technology-driven are now having meetings on Zoom, managing a remote workforce, and adopting new software.

As the CEO of a software company, I understand the importance of delivering value to these businesses. One problem every company wants to avoid is a major outage. We see the headlines every day: Robinhood enacted trading restrictions because $GME stock was soaring and its platform wasn’t prepared; Slack crashed on the Monday when everyone seemed to be getting back to work in the new year.

And these are expensive problems. According to Gartner, the average cost of IT downtime is $5,600 per minute. Of course, every business is different, and a business such as Amazon may lose millions in an hour, while a small startup’s outage may go totally unnoticed.

But I want to push back on the idea that these outages affect larger companies more than smaller ones. Yes, the big outages are the ones that get bad press. And yes, large companies may lose more money in raw dollars. But what if we look at the percentage of revenue lost to an outage? For many startups, the lifeline is a couple of big, happy customers. If even one of them churns, that could drastically impact the startup’s ability to survive to the next quarter.
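
To make the raw-dollars versus percentage-of-revenue comparison concrete, here is a minimal sketch in Python. Every figure in it is hypothetical and chosen only for illustration; none comes from a real company or from the Gartner estimate above. The function name outage_impact is likewise just an illustrative choice.

```python
# Illustrative only: all revenue figures below are hypothetical.

def outage_impact(annual_revenue: float, revenue_lost: float) -> float:
    """Return the revenue lost to an outage as a percentage of annual revenue."""
    if annual_revenue <= 0:
        raise ValueError("annual_revenue must be positive")
    return 100.0 * revenue_lost / annual_revenue

# Hypothetical large enterprise: $10B a year, loses $2M during a one-hour outage.
enterprise_pct = outage_impact(10_000_000_000, 2_000_000)

# Hypothetical startup: $1M a year, loses a $250K-a-year customer who churns
# after an outage.
startup_pct = outage_impact(1_000_000, 250_000)

print(f"Enterprise: {enterprise_pct:.2f}% of annual revenue")
print(f"Startup:    {startup_pct:.2f}% of annual revenue")
```

Under these assumed numbers, the enterprise loses far more in raw dollars but only a sliver of its annual revenue, while the startup loses a quarter of its revenue from a single churned customer.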

I also want to push back on the notion that resilience is just about whether a system is up or down. In the modern world of DevOps, it’s the job of software developers to ship reliable code just as much as it is the responsibility of SREs to care about system health. Because the fact of the matter is, even if a system is technically up, a bug in that system will cause friction for the customer it affects. That bug is preventing them from purchasing an item, clicking on a button, or completing whatever other task they are trying to accomplish.

Businesses should think of these software bugs as mini outages, and the costs add up. A recent study published by the Cambridge Judge Business School found that developers waste 620 million hours a year debugging software failures, which ends up costing companies approximately $61 billion annually. The report also revealed that software engineers spend on average 13 hours to repair a single software failure.

So it’s not just about the revenue you are losing when your website isn’t functioning; it’s also the revenue you lose when frustrated customers visit competitor websites and the engineering resources spent finding and fixing bugs. Proofs of concept (POCs) are failing, Net Promoter Scores (NPS) are declining, and these problems are only getting more expensive and complex with the rise of cloud and microservices architectures.

According to a recent whitepaper from analyst firm IDC that our company sponsored, “Distributed architecture provides scalability and simplified development with the use of microservices, containers and the like. However, the very benefit for which this architecture was designed is the one which denotes an inherent struggle to understand and troubleshoot it — as code continuously spreads and shifts over multiple repositories.”

To keep your business from ending up in that situation, here are a few key ideas to implement:

• Build modern methodologies and processes into your R&D team’s workflow. By drawing on new resources and taking a systematic approach to coding problems, developers can improve both their understanding of debugging and their software as a whole.

• Maintain awareness when using third-party code. Applications are often composed of a significant amount of third-party code, which introduces an inherent risk factor. Even packages that have been downloaded millions of times can be buggy, and even if they’re not buggy, the documentation for their APIs can be outdated, leading your developers to use them incorrectly.

• Implement modern production debuggers. By adopting next-generation debugging tools that can get live data from running code, developers can resolve issues faster and more efficiently. That can significantly reduce outages and downtime, leading to much happier customers.
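
The advice on third-party code can be made concrete with a small set of checks that record what your team assumes a dependency does and compare that with its actual behavior. The sketch below is only an illustration, not a prescribed method: Python’s built-in round() stands in for a third-party function, and the names to_whole_units and check_rounding_contract are hypothetical.

```python
from typing import List

# Python's built-in round() stands in for a third-party function here.
# The wrapper and checker names are hypothetical, chosen for illustration.

def to_whole_units(value: float) -> int:
    """Thin wrapper that keeps the dependency call in one place."""
    return round(value)

def check_rounding_contract() -> List[str]:
    """Compare what the team assumes with what the dependency actually does."""
    # Assumption under test: "halves always round up".
    assumptions = {0.5: 1, 1.5: 2, 2.5: 3}
    surprises = []
    for value, expected in assumptions.items():
        actual = to_whole_units(value)
        if actual != expected:
            surprises.append(
                f"round({value}) returned {actual}, expected {expected}"
            )
    return surprises

surprises = check_rounding_contract()
for line in surprises:
    print(line)
```

Python rounds halves to the nearest even number, so the assumption fails for 0.5 and 2.5. Running checks like these whenever a dependency is added or upgraded surfaces that kind of mismatch before a customer does.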

The rise of DevOps means that reliability is no longer solely the responsibility of ITOps and SREs. The software developers writing the code today are just as responsible for providing robust services and great customer experiences. To address this, organizations need to shift resilience left and adopt modern tooling to ensure rapid debugging and optimal performance.

This article was originally published on Forbes