Observability and incident response need resilience testing

There’s a reason why observability and incident response practices have become standard across modern software development. Anyone wanting to minimize downtime and deliver reliable, available applications needs to have fully instrumented systems and playbooks so they can respond quickly and effectively to outages or incidents.

But there’s another piece to the reliability puzzle: resilience testing.

Modern architectures are complex and ephemeral, and all those moving, disparate parts create more potential points of failure that could lead to outages. Resilience testing proactively checks your systems to reveal reliability risks so you can address them before they lead to incidents.

Working together with your observability and incident response practices, resilience testing reduces the number and severity of incidents, lowering your MTTR and making your systems more reliable and available.

Resilience testing is the missing key to reliability

Resilience testing adds proactive detection and prediction to your reliability practices. With it, you can test how your system responds when issues pop up, such as a spike in traffic or an Availability Zone going down. Instead of hoping that your system will respond correctly, you’re able to see exactly what will happen and address any issues before they cause outages.

The three pieces fit together symbiotically:

Observability provides the metrics that determine the results of resilience tests and alerts you when there’s an incident.
Incident response addresses outages found by observability and informs future resilience tests to make sure incidents don’t recur.
Resilience testing locates blind spots and fine-tunes alerts for observability, and verifies the repairs made during incident response.

Working together, these practices allow your teams to get visibility into the performance of their deployments, proactively detect risks, and quickly resolve any failures that slip through the cracks.

How resilience testing and observability fit together

Resilience testing builds on your existing observability work, using its metrics (or metrics pulled from your Elastic Load Balancer) to determine whether a test succeeds or fails.

Resilience tests use fault injection to artificially (and safely!) simulate a problem in your system. As the test runs, observability metrics are monitored to track the results. If the system responds in an unexpected or problematic way, the test is halted and the system is rolled back to its previous state.
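The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Gremlin’s implementation: the names `run_resilience_test`, `inject_fault`, `rollback`, and `max_error_rate` are hypothetical, and the metric is assumed to arrive as a simple sequence of error-rate samples.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ExperimentResult:
    passed: bool          # True if the metric stayed within bounds
    samples_checked: int  # how many metric samples were read
    halted_early: bool    # True if the halt condition stopped the test


def run_resilience_test(
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
    error_rates: Iterable[float],
    max_error_rate: float,
) -> ExperimentResult:
    """Inject a fault, watch one metric, and halt if it crosses a threshold.

    All parameters are assumptions made for this sketch.
    """
    inject_fault()
    checked = 0
    halted = False
    try:
        for rate in error_rates:
            checked += 1
            if rate > max_error_rate:
                halted = True  # unexpected response: stop the test
                break
    finally:
        # Always restore the previous state, whether the test
        # finished normally, halted, or raised an exception.
        rollback()
    return ExperimentResult(
        passed=not halted, samples_checked=checked, halted_early=halted
    )
```

Putting the rollback in a `finally` block means the fault is removed even if reading the metric raises an exception, which is the property that keeps a test like this safe to run.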

This integration can also be used to strengthen your observability practice. Using a sequence of resilience tests, you can fine-tune your alerts, cutting down on the noise so your engineers only get paged when there’s an incident that needs their attention.
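One way to picture alert tuning is to replay the metrics recorded during a sequence of tests against a candidate alert rule. The sketch below is illustrative only: `Scenario`, `would_page`, and `audit_alert` are hypothetical names, and the rule (page after a number of consecutive samples above a latency threshold) is an assumed example rather than a recommendation.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    latency_ms: list[float]  # samples recorded while the test ran
    should_page: bool        # does the team agree this warrants a page?


def would_page(samples: list[float], threshold_ms: float, min_consecutive: int) -> bool:
    """Return True if `min_consecutive` samples in a row exceed the threshold."""
    streak = 0
    for sample in samples:
        streak = streak + 1 if sample > threshold_ms else 0
        if streak >= min_consecutive:
            return True
    return False


def audit_alert(
    scenarios: list[Scenario], threshold_ms: float, min_consecutive: int
) -> dict[str, list[str]]:
    """List scenarios where the rule pages needlessly or fails to page."""
    noisy, missed = [], []
    for s in scenarios:
        paged = would_page(s.latency_ms, threshold_ms, min_consecutive)
        if paged and not s.should_page:
            noisy.append(s.name)
        elif not paged and s.should_page:
            missed.append(s.name)
    return {"noisy": noisy, "missed": missed}
```

A rule is well tuned for the tested failure modes when both lists come back empty. A non-empty `noisy` list points at pages nobody needed, and a non-empty `missed` list points at incidents that would have gone unnoticed.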

You can also verify that everything is instrumented correctly and that metrics are appearing where and when you need them. After all, it’s much better to find out metrics aren’t being reported during a controlled test than in the middle of a major outage, right?
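An instrumentation check of this kind can be as simple as confirming that every metric you expect produced at least one datapoint while the test was running. The following is a rough sketch under assumed data shapes: `find_missing_metrics` is a hypothetical helper, and datapoints are assumed to be `(timestamp, value)` pairs keyed by metric name.

```python
def find_missing_metrics(
    expected: list[str],
    datapoints: dict[str, list[tuple[float, float]]],
    start: float,
    end: float,
) -> list[str]:
    """Return the expected metrics with no datapoint inside [start, end]."""
    missing = []
    for name in expected:
        in_window = [
            ts for ts, _value in datapoints.get(name, []) if start <= ts <= end
        ]
        if not in_window:
            missing.append(name)
    return missing
```

Any name in the returned list is a blind spot: either the metric is not being emitted at all, or it did not report during the window when the fault was active.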

How resilience testing and incident response fit together

After an incident is over and the outage dust has settled, the postmortem will undoubtedly include the question, “How can we be sure this outage won’t happen again?” This is where resilience testing comes in. By creating and regularly running a test for a specific outage condition, you can verify your continued resilience to that problem and prove it to the rest of the company.

Usually, this is done by adding a custom test to a suite of resilience tests. In Gremlin, these test suites are used to collect standardized resilience tests that should be automatically and regularly run against systems to gauge their reliability against known failure modes. A core group test suite would include common outage causes such as unavailable dependencies, expired TLS certificates, traffic spikes, resource surges, and more. When you have an incident, you can create an additional test that replicates the outage-causing fault, then add it to that core group.
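As a rough picture of that workflow, the sketch below models a suite as a plain data structure. It does not use Gremlin’s API: `ResilienceTest`, `Suite`, and `core_suite` are hypothetical names, the fault identifiers are placeholders, and only the four core outage causes come from the list above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ResilienceTest:
    name: str
    fault: str            # placeholder identifier for the fault to inject
    source: str = "core"  # "core", or the incident that motivated the test


@dataclass
class Suite:
    name: str
    tests: list[ResilienceTest] = field(default_factory=list)

    def add(self, test: ResilienceTest) -> None:
        """Add a test, rejecting duplicate names so reports stay unambiguous."""
        if any(existing.name == test.name for existing in self.tests):
            raise ValueError(f"test already in suite: {test.name}")
        self.tests.append(test)


def core_suite() -> Suite:
    """Build a suite covering the common outage causes named in the text."""
    suite = Suite(name="core")
    suite.add(ResilienceTest("unavailable dependency", fault="blackhole"))
    suite.add(ResilienceTest("expired TLS certificate", fault="certificate-expiry"))
    suite.add(ResilienceTest("traffic spike", fault="load"))
    suite.add(ResilienceTest("resource surge", fault="cpu-memory-pressure"))
    return suite
```

After an incident, the follow-up is a single call such as `suite.add(ResilienceTest("cache node loss", fault="shutdown", source="example-incident"))`, where the test name, fault, and source are made up for illustration. From then on, the outage condition is exercised every time the suite runs.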

Resilience testing also gives you a chance to rehearse your incident response playbook in a controlled environment. Instead of waiting for an outage to occur, you can run a test, then have your teams respond to the uncovered failure modes per your playbook processes, without the stress of an actual outage.

Close the reliability gap with resilience testing

Observability, incident response, and resilience testing all need to work together to meet modern reliability requirements. When these practices are integrated correctly, you’ll be able to accurately predict how your system will respond to known failure modes, get alerted when it runs into unknown failure modes, respond quickly to restore service, then include those now-known failure modes in your resilience tests.

As a result, your systems will become more and more reliable, delivering the availability and uptime needed to meet your company’s goals and your customers’ demands.
