What’s the ROI of reliability?

Reliability doesn’t happen by itself. Making a system reliable and resilient enough that your customers can count on it takes a combination of time, effort, and resources that could be used elsewhere, such as shipping new features. It’s also not optional.

In an era where downtime costs an average of $14,056/min (or $843,360/hr), outages have a material impact on businesses. Unfortunately, most systems are sprawling and complex enough that even small amounts of downtime can add up quickly. In fact, even with many systems at multiple-nines availability, a typical Global 2000 business still experiences an average of 456 hours of application or infrastructure-related downtime every year.

Every company needs to invest in the reliability of its systems. On the surface, the ROI of this investment seems like a straightforward calculation. After all, going from 98% to 99% availability reduces downtime from ~173 hours/year to ~86 hours/year, a 50% decrease that, at the average cost above, saves roughly $72 million for the year.

But as anyone with experience in budgeting will tell you, it’s a little more complicated than that in practice.

Let’s take a look at how to calculate the ROI of reliability efforts, including a deeper dive into computing the amount your company gains from reliability.

How to calculate ROI

At its heart, an ROI calculation is pretty simple: divide the expected net benefits of a program (Amount Gained minus Amount Spent) by the Amount Spent and multiply by 100. Better ROI is achieved by increasing the Amount Gained, lowering the Amount Spent, or both.

With something like retail sales, the Amount Gained is usually tied to gross revenue for a given product or effort. But ROI is different when it comes to efforts that reduce existing losses, such as security or reliability programs. In these cases, the Amount Gained is based on the losses your efforts reduce.

Amount Spent is a combination of the salaries, tools, and resources required to make that change. This number can be pretty straightforward to compute, but make sure to limit the scope to your specific program or initiative. For example, if you need to procure additional observability instrumentation and a Fault Injection tool for a Chaos Engineering initiative, then only include the costs for the additional observability agents and tool, not the entire observability or tooling budget for your company.

Once you have an Amount Gained and Amount Spent, combine them in the formula above to get your ROI percentage. To expand on the above example, it might cost an enterprise company $25M to make the 98% to 99% improvement that yielded $72M through increased availability. So we take [($72M - $25M)/$25M] x 100 and get an ROI of 188%. Simple, right?
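The formula above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive calculator: the function name `roi_percent` and its parameter names are assumptions made for this example, and the inputs are the $72M and $25M figures from the paragraph above, expressed in millions.

```python
def roi_percent(amount_gained: float, amount_spent: float) -> float:
    """Return ROI as a percentage: (gained - spent) / spent * 100."""
    if amount_spent <= 0:
        raise ValueError("amount_spent must be positive")
    return (amount_gained - amount_spent) / amount_spent * 100


# Figures from the example above, in millions of dollars.
example_roi = roi_percent(amount_gained=72.0, amount_spent=25.0)
print(f"ROI: {example_roi:.0f}%")  # ROI: 188%
```

Keeping both inputs in the same unit (here, millions of dollars) is what matters; the result is a unitless percentage.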

The snag for most teams comes in computing an accurate Amount Gained. The difficulty lies in having to prove a negative: how much did you gain from preventing an outage that could’ve happened? But by looking at historical or industry data, you can show comparisons of before and after to help build your case.

Calculating a rough Amount Gained estimate

General averages work well as a rough estimate, but for a more detailed ROI, you’re going to want to dig more specifically into how downtime costs vary by company size and industry.

Source: IT outages: 2024 costs and containment

These broader numbers can also help you narrow your estimate. Using numbers similar to those above, if you’re at a smaller company with 98% availability, you could estimate $37.5M in downtime costs from 173 hours of downtime. Going from 98% to 99% would halve that downtime and could save your company $18.7M as Amount Gained. For greater accuracy, you could further adjust the estimate by your industry. For example, for a small retail company, it might be better to use $20M as an estimate, while a communications or media company could estimate down to $15M.
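As a rough sketch of the arithmetic behind this kind of estimate, the Python below scales an annual downtime cost by the change in availability. It assumes downtime cost grows linearly with hours of downtime, which is a simplification, and the function name `downtime_savings` and its parameters are hypothetical names chosen for illustration.

```python
def downtime_savings(annual_downtime_cost: float,
                     current_availability: float,
                     target_availability: float) -> float:
    """Estimate yearly savings from an availability improvement.

    Assumes cost is proportional to downtime (a simplifying assumption).
    Availabilities are fractions, e.g. 0.98 for 98%.
    """
    if not 0 <= current_availability < 1:
        raise ValueError("current_availability must be in [0, 1)")
    if not current_availability <= target_availability <= 1:
        raise ValueError("target_availability must be in [current, 1]")
    current_downtime = 1 - current_availability
    target_downtime = 1 - target_availability
    # Fraction of today's downtime that remains after the improvement.
    remaining_fraction = target_downtime / current_downtime
    return annual_downtime_cost * (1 - remaining_fraction)


# Smaller-company example from the text, in millions of dollars.
savings = downtime_savings(37.5, current_availability=0.98,
                           target_availability=0.99)
print(f"Estimated Amount Gained: ${savings:.2f}M")
```

With the $37.5M starting point, this should come out near $18.75M, in line with the roughly $18.7M figure in the text. To apply an industry adjustment, swap in a different starting cost.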

A rough estimate like this is useful for activities like evaluating the cost of tools or broad-scale budgeting. To continue the example above, suppose you have a yearly budget of $5M for your reliability program. Capturing the full $18.7M works out to an ROI of roughly 274%, or [($18.7M - $5M)/$5M] x 100. Even if you only get close to that amount of improved availability, say about $15.75M in savings, you can still expect a roughly 215% ROI on your efforts.

Calculating a more precise Amount Gained

Of course, that’s really only good for a rough estimate and initial exploration. At some point, you’re probably going to be asked for a more precise ROI computation. For that, you’re going to want to look at historical data specific to your company. Fortunately, most companies have this data available, so it’s just a matter of compiling it.

Downtime costs

Start by looking at the wide variety of costs associated with downtime. As seen above, these include costs like:

Lost revenue
Contractual or legal fees
Damage control or brand impact
Staffing costs and lost productivity
Resource costs

Next, look at previous high-impact outages at your company. At most companies, someone has already computed the business impact of such an outage in terms of direct costs, like lost revenue or resources for activities such as restoring from backups. Taking that number as a baseline, start layering in the number of hours engineers spent addressing the outage. Be sure to include all activities directly tied to the outage, including incident response or war room hours and time spent in post-mortems.

Brand impact is usually a little harder to quantify, but you can look at any costs for communications or marketing efforts to repair the damage. If you’re a public company, then you can directly look at the impact on the stock price.

Major incidents can also have lasting impacts, taking, on average, 60 days for brand recovery, 75 days for revenue recovery, and 79 days for stock recovery. Be sure to include those longer-term effects in your calculation, if possible.

At the end of this, you’ll have the cost of a single P0 or P1 high-impact outage. You’ll also have an idea of the amount of downtime it caused, which can help with a cost/hour calculation.

Use this to create your Amount Gained variable.
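One way to organize that compilation is sketched below in Python. Everything here is illustrative: the `OutageCost` class, its field names, and the dollar figures are hypothetical placeholders, not real data or a prescribed model. It simply adds up the cost categories listed earlier and divides by the hours of downtime.

```python
from dataclasses import dataclass


@dataclass
class OutageCost:
    """Cost components for one outage (hypothetical field names)."""
    lost_revenue: float
    contractual_or_legal: float
    brand_and_comms: float
    resource_costs: float
    engineer_hours: float        # response, war room, post-mortem time
    hourly_engineer_rate: float
    downtime_hours: float

    def staffing_cost(self) -> float:
        return self.engineer_hours * self.hourly_engineer_rate

    def total(self) -> float:
        return (self.lost_revenue + self.contractual_or_legal
                + self.brand_and_comms + self.resource_costs
                + self.staffing_cost())

    def cost_per_hour(self) -> float:
        if self.downtime_hours <= 0:
            raise ValueError("downtime_hours must be positive")
        return self.total() / self.downtime_hours


# Placeholder numbers for illustration only.
outage = OutageCost(lost_revenue=400_000, contractual_or_legal=50_000,
                    brand_and_comms=30_000, resource_costs=20_000,
                    engineer_hours=200, hourly_engineer_rate=100,
                    downtime_hours=4)
print(f"Total cost: ${outage.total():,.0f}")            # Total cost: $520,000
print(f"Cost per hour: ${outage.cost_per_hour():,.0f}")  # Cost per hour: $130,000
```

Multiplying the resulting cost per hour by the hours of downtime you expect to avoid gives one possible Amount Gained figure to feed into the ROI formula.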

"This is how you help close the gaps from what you don't know. And if we fix two or three P1 issues, then we have more than paid for this tool [Gremlin]."
—Director of Quality Assurance, Fortune 500 travel company

Go-to-market and efficiency gains

There are additional gains from a reliability program that are hard to quantify, such as accelerating your time to market. New launches, features, migrations, or transformations introduce new failure points that are unknown and untested. Resilience programs help ensure these initiatives are launched on time, on budget, and with fewer defects, which can save millions and lead to millions in additional revenue.

By using the right tools, like Gremlin, reliability programs improve your reliability posture before, during, and after launch. With resilience testing, you can ensure new environments are production-ready before you launch them. You can also keep moving quickly with frequent deployments while having confidence in your reliability posture by running tests before code goes live or shortly afterward, when it’s easier to roll back.

At the same time, reliability programs allow you to detect and resolve issues proactively, before they cause outages. While this directly decreases the amount of downtime and outages, it also means your engineers can mitigate the risks on their schedule, creating less stress, improving efficiency, and ultimately freeing up cycles to work on higher-priority projects.

When your engineers are more productive and efficient, this creates a direct ROI from the increased deployment of revenue-generating services. At the same time, less burnout will increase developer retention, and more productivity will decrease the need for future hires, which creates an ROI from lowered IT costs.

While both of these benefits can be harder to quantify, remember that they are part of the entire Amount Gained picture. If the numbers are close or leadership is on the fence, then these additional benefits can help your organization make the decision to invest in reliability.

"Our cloud transformation required a new approach to reliability. With Gremlin, we incorporated reliability testing into our SDLC process, helping us validate code for reliability before going live."
—Head of Site Reliability & Quality Engineering, Top 5 Canadian Bank

Prove your ROI with Gremlin and Reliability Management

A big part of proving your ROI is being able to track your reliability posture with standardized metrics and testing. Gremlin’s Reliability Management solution was designed to help make this easy. With Reliability Management, you can create a standardized set of reliability tests based on your system and reliability standards. These can be automated to run regularly, with the results graphed and tracked using reliability scores.

By using Reliability Management, you’ll have a clear record of how your efforts have made your systems more reliable, along with an exact list of failures your efforts uncovered and resolved before they caused costly outages.

Want help building your business case? Gremlin has worked with companies building world-class reliability programs since 2016, and we’re always willing to share best practices and learnings that have helped other teams implement and grow their programs. Reach out and connect with a reliability specialist to start building yours.
