Your First Chaos Experiment

Kolton Andrus
CTO
Last Updated: May 31, 2017

OK, so you've decided that Chaos Engineering sounds like a good idea. How do you get started? We get that question a lot, and we wanted to outline some tips for implementing these practices in your environment.

Chaos

A quick aside. Chaos is a cool name, but it is a misnomer for the best way to approach failure testing. Sometimes a design decision like enabling Chaos Monkey in a new environment can be a great way to enforce realistic constraints on the teams operating there. Applying a random strategy to an existing environment, however, can be a bit daunting. We think the best way to get started is a thoughtful, planned experiment to validate expected behavior.

Planning your First Experiment

One of the most powerful questions in Chaos Engineering is "What could go wrong?" By asking this question about our services and environments, we can review potential weaknesses and discuss expected outcomes. Similar to a risk assessment, this informs priorities about which scenarios are more likely (or more frightening) and should be tested first. By sitting down as a team and whiteboarding your service(s), dependencies (both internal and external), and data stores, you can form a picture of what could go wrong. When in doubt, injecting a failure or a delay into each of your dependencies is a great place to start.
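
To make "injecting a failure or a delay" concrete, here is a minimal Python sketch of wrapping a dependency call so that it can be slowed down or made to fail. Everything in it (with_fault, InjectedFailure, lookup_price, and the parameter names) is hypothetical and made up for illustration; it is not Gremlin's API, only one simple way to try the idea in a test environment.

```python
import random
import time


class InjectedFailure(Exception):
    """Raised in place of a real dependency error during an experiment."""


def with_fault(call, delay_s=0.0, failure_rate=0.0,
               rng=random.random, sleep=time.sleep):
    """Wrap a dependency call so it can be delayed or made to fail.

    Hypothetical helper for illustration only.
    delay_s      -- seconds of latency added before every call
    failure_rate -- fraction of calls (0.0 to 1.0) that raise InjectedFailure
    """
    def wrapped(*args, **kwargs):
        if delay_s > 0:
            sleep(delay_s)  # simulate a slow dependency
        if rng() < failure_rate:
            raise InjectedFailure("simulated dependency failure")
        return call(*args, **kwargs)
    return wrapped


def lookup_price(sku):
    # Stand-in for a real call to a downstream service.
    return {"sku": sku, "price_cents": 1299}


# Two experiment variants: a slow dependency and a dependency that always fails.
slow_lookup = with_fault(lookup_price, delay_s=0.05)
failing_lookup = with_fault(lookup_price, failure_rate=1.0)
```

Wrapping at the call site keeps the blast radius small: only the code path under test sees the fault, and removing the wrapper removes the fault.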

Creating a Hypothesis

You've got an idea of what can go wrong. You've chosen a scenario -- the exact failure to simulate -- to inject. What happens next? This is an excellent thought exercise to work through as a team. By discussing the scenario, you can hypothesize about the expected outcome when running it live. What will be the impact on customers, on our service, or on our dependencies?
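
One way to keep the team honest is to write the hypothesis down before the experiment runs, in a form that can be checked afterwards. The sketch below is only an illustration: the Hypothesis class, its fields, and the example numbers are all assumptions, not part of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A written-down expectation for one scenario (hypothetical structure)."""
    scenario: str        # the exact failure being injected
    metric: str          # what we will measure
    max_expected: float  # the worst value we expect to see

    def holds(self, observed: float) -> bool:
        # The hypothesis is verified if the observed value stays within bounds.
        return observed <= self.max_expected


# Illustrative values only; pick numbers that match your own service.
h = Hypothesis(
    scenario="300 ms delay on calls to the cache",
    metric="p99 checkout latency (ms)",
    max_expected=800.0,
)
```

After the run, h.holds(observed_value) tells you whether the system behaved as the team predicted or whether you have found something to fix.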

Measuring the Impact

To understand how your system behaves under duress, you need to measure it. It's good to have a key performance metric that correlates to customer success (such as orders per minute, or stream starts per second). As a rule of thumb, if you ever see an impact on these metrics, you want to halt the experiment immediately. Next, measure the failure itself so that you can verify (or disprove) your hypothesis. This could be the impact on latency, requests per second, or system resources. Lastly, you want to survey your dashboards and alarms for unintended side effects.
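
The "halt immediately" rule of thumb can be sketched as a simple guard that compares the key metric against its pre-experiment baseline. The function names and the 5% tolerance below are assumptions chosen for illustration; the right threshold depends on how noisy your metric normally is.

```python
def should_halt(baseline, current, tolerance=0.05):
    """Return True when the KPI has dropped more than `tolerance` below baseline.

    Hypothetical helper: `baseline` and `current` are readings of a metric
    such as orders per minute, and `tolerance` is a fraction (0.05 = 5%).
    """
    if baseline <= 0:
        return True  # no trustworthy baseline, so do not continue
    drop = (baseline - current) / baseline
    return drop > tolerance


def first_halt_index(samples, baseline, tolerance=0.05):
    """Walk through KPI samples in order; return the index at which to halt, or None."""
    for i, value in enumerate(samples):
        if should_halt(baseline, value, tolerance):
            return i
    return None


# Example: the third reading is 6% below a baseline of 1000 orders per minute.
halt_at = first_halt_index([1000, 995, 940, 1000], baseline=1000)
```

In a real experiment the samples would come from your monitoring system, and reaching a halt point would trigger the rollback plan.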

Have a Rollback Plan

Always have a plan in case things go wrong. Know going in that sometimes even the backup plan can fail. Talk through the ways in which you're going to revert the impact. If you're running commands by hand, be careful not to break SSH or control plane access to your instances. One of the core aspects of Gremlin is safety. All of our attacks can be reverted, allowing you to safely abort if things go wrong.
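
If you are scripting an experiment yourself, one way to make sure the revert step always runs is to tie it to the injection step. The sketch below uses Python's contextlib for that. The experiment, add_latency, and remove_latency names are hypothetical, and the dictionary stands in for whatever real state your injection changes; this is not how Gremlin implements its own reverts.

```python
from contextlib import contextmanager


@contextmanager
def experiment(inject, revert):
    """Run `inject`, hand control to the caller, then always run `revert`.

    `revert` should be safe to call even if `inject` only partly succeeded.
    """
    try:
        inject()
        yield
    finally:
        revert()  # runs on normal exit, on abort, and on unexpected errors


# Stand-in for real system state, for illustration only.
state = {"latency_ms": 0}


def add_latency():
    state["latency_ms"] = 300


def remove_latency():
    state["latency_ms"] = 0


try:
    with experiment(add_latency, remove_latency):
        observed_during = state["latency_ms"]
        # Simulate aborting because a key metric dropped.
        raise RuntimeError("KPI dropped, aborting")
except RuntimeError:
    pass
```

Even though the experiment was aborted with an error, the added latency is removed. Since the backup plan itself can fail, it is still worth rehearsing a manual revert as well.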

Go fix it!

After running your first experiment, there should be one of two outcomes: either you've verified that your system is resilient to the failure you introduced, or you've found a problem you need to fix. Both of these are good outcomes. On one hand, you've increased your confidence in the system and its behavior; on the other, you've found a problem before it caused an outage.

Have Fun

Chaos Engineering is a tool to make your job easier. By proactively testing and validating your system's failure modes you will reduce your operational burden, increase your availability, and sleep better at night.
