
Infographic: Resilience and reliability in the cloud
We all know reliability is essential, but it can still be tough to get the budgetary sign-off for a dedicated reliability effort. This is especially hard when your organization has already invested in the cloud and observability in an effort to improve resilience, performance, and uptime.
Unfortunately, when it comes to modern software systems, it’s not a matter of if something is going to fail, but a matter of when it’s going to fail.
The question you need to ask yourself is whether your system is resilient enough to bounce back from a failure, and whether that failure is going to happen on your schedule. This is why AWS considers reliability a central pillar of its Well-Architected Framework and why resilience testing using tools like Gremlin is helping companies uncover failures before they cause customer-impacting outages.
And with downtime getting more and more expensive, it’s more important than ever to put time and effort into proactive reliability work.
To help, we partnered with AWS to gather data about outages and downtime from companies like Splunk, New Relic, and Cockroach Labs. This data shows the impact of outages, the most common causes of outages, and the results companies get from investing in resilience.


Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30-day trial.