How the Gremlin agent fails safely

Testing shouldn’t feel risky.

While it might sound counterintuitive, certain types of testing can actually increase risks to your systems. Load testing, for example, is a great way to see how your systems behave under pressure, but it can also cause those same systems to fail if they aren’t equipped to handle the load.

For some types of testing, this is necessary, as is the case with reliability testing and Chaos Engineering. To fix reliability issues, you first need to expose them, which sometimes means pushing systems to their breaking point. Nonetheless, there are safe ways to do reliability testing, and we’ve designed Gremlin to be as safe as possible when running these types of tests.

In this post, we’ll examine how the Gremlin agent implements fail-safe testing and what it means for you and your systems.


Why are fail-safe testing tools important?

Testing tools are like any other kind of software: they can fail, and those failures can have unexpected or unintended consequences. For small-scale tests, like unit tests, a test suite failure usually means a test won’t run. But for larger-scale tests, like integration testing, a test suite failure could leave your system in an undesirable state.

For example, imagine you want to simulate packet loss on a remote server. On Linux, you can use the tc (Traffic Control) command to change traffic rules in real time. You want to see how your service handles a poor network connection, so you run a test to drop 90% of all outbound packets using the command tc qdisc add dev eth0 root netem loss 90%. But because you’re logged into the server remotely, the rule also drops traffic between you and the server. Now you can no longer connect to the server, and you have no way of reverting the rule you just applied without taking a trip to the data center.

Events like this are why fail-safes are important. Fail-safes ensure that system-changing tests can be reverted, even if you lose connection to the systems you’re testing.
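One way to avoid the lockout in the tc example is to arm the revert before applying the change, so the cleanup runs even if your session drops. The following is a minimal Python sketch of that idea, not a description of how Gremlin implements it. The apply_with_auto_revert helper and its injectable run parameter are hypothetical names used for illustration; the apply command mirrors the one above, and the revert removes the root qdisc with tc qdisc del.

```python
import subprocess
import threading
from typing import Callable, List

# Commands from the packet-loss example; both require root privileges.
APPLY_CMD = ["tc", "qdisc", "add", "dev", "eth0", "root", "netem", "loss", "90%"]
REVERT_CMD = ["tc", "qdisc", "del", "dev", "eth0", "root"]


def run_command(cmd: List[str]) -> None:
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(cmd, check=True)


def apply_with_auto_revert(
    apply_cmd: List[str],
    revert_cmd: List[str],
    revert_after_s: float,
    run: Callable[[List[str]], None] = run_command,  # hypothetical hook for testing
) -> threading.Timer:
    """Schedule the revert first, then apply the change.

    The timer thread is non-daemon, so the interpreter stays alive
    until the revert has run.
    """
    timer = threading.Timer(revert_after_s, run, args=(revert_cmd,))
    timer.start()
    try:
        run(apply_cmd)
    except Exception:
        # Nothing was applied, so there is nothing to revert.
        timer.cancel()
        raise
    return timer
```

Calling apply_with_auto_revert(APPLY_CMD, REVERT_CMD, 60) would apply the rule and remove it 60 seconds later. For the revert to survive a dropped SSH session, the script itself would need to run detached from that session (for example, under nohup or as a service), which this sketch assumes rather than handles.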


How the Gremlin agent fails safely

From the very beginning, Gremlin was designed to fail safely. It’s built into our architecture: all Gremlin agents (the software responsible for running experiments on your systems) must periodically check in with the Gremlin Control Plane at api.gremlin.com. If an agent fails to check in within a certain time, it stops any actively running experiments and returns the system to normal. This is known as a dead man’s switch, and while Gremlin’s implementation isn’t as grim as the name implies, this switch is key to building trust and safety during testing.

Here’s how it works:

When you deploy the Gremlin agent to a host, container, or application, it registers itself with Gremlin’s Control Plane. The agent periodically sends a heartbeat to the Control Plane to indicate it’s active and ready to run experiments. Because the agent initiates the heartbeat and not the Control Plane, it only requires outbound Internet access over port 443. This eliminates the need for you to open any inbound ports in your firewall.
When you start an experiment, the agent pulls the experiment details from the Control Plane and prepares to run it. While an experiment runs, the agent shortens its heartbeat interval to 5 seconds and reports its progress through each stage of the experiment lifecycle.
If the agent loses connection to the Control Plane or gets an unexpected response (e.g., an HTTP 500 error), it changes the experiment state to LostCommunication and triggers its dead man’s switch, stopping the running experiment. The agent caches the experiment state, log data, and other details, then sends them to the Control Plane when it reconnects so you can use them for troubleshooting.
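The steps above can be sketched as a small state machine. This is an illustrative Python model, not the agent's actual code: the ExperimentGuard class, the send_heartbeat and rollback callables, and the timeout_s grace period are all assumptions made for the example. Only the 5-second interval and the LostCommunication state name come from the description above.

```python
import time
from typing import Callable

HEARTBEAT_INTERVAL_S = 5  # interval during an experiment, per the text above


class ExperimentGuard:
    """Hypothetical dead man's switch: roll back when heartbeats stop succeeding."""

    def __init__(
        self,
        send_heartbeat: Callable[[], int],  # returns an HTTP status, raises OSError on failure
        rollback: Callable[[], None],       # reverts the experiment's impact
        timeout_s: float,                   # assumed grace period; the real value is not stated
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.send_heartbeat = send_heartbeat
        self.rollback = rollback
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_ok = clock()
        self.state = "Running"

    def tick(self) -> str:
        """Send one heartbeat and trigger the switch if the timeout has passed."""
        if self.state != "Running":
            return self.state
        try:
            ok = 200 <= self.send_heartbeat() < 300
        except OSError:
            ok = False  # treat connection errors like an unexpected response
        now = self.clock()
        if ok:
            self.last_ok = now
        elif now - self.last_ok >= self.timeout_s:
            self.state = "LostCommunication"
            self.rollback()
        return self.state
```

A caller would invoke tick() every HEARTBEAT_INTERVAL_S seconds for the duration of the experiment. Because the check runs locally on the agent side, the rollback does not depend on the network being reachable.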

Note

The agent uses very little bandwidth or CPU time when running the heartbeat. When idle, the agent transmits about 36KB of data over a 5-minute period, or 0.12KB/s. During an experiment, this increases slightly to about 0.75KB/s. For details, see Managing the Gremlin Agent.

The end result is that your systems return to normal before you even realize there’s a problem. Even if you drop all network traffic to the host, add several seconds of latency, or increase RAM consumption to 100%, the agent ensures that any impacts to your systems are quickly reverted. Once your systems are back to normal, you can troubleshoot why the agent lost connection by reviewing the agent's logs or the results screen in the Gremlin web app.

The only exceptions are the shutdown and process killer experiments, which change the system's state in ways the agent can't revert on its own. In these cases, you may need to manually restart the affected systems or processes. To mitigate this risk, configure your cloud environment to automatically restart instances that shut down, add redundant hosts, or deploy a watchdog process to detect and restart terminated processes.
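As a rough illustration of the watchdog idea, the Python sketch below starts a command and restarts it whenever it exits abnormally. The supervise function and its parameters are hypothetical; in production you would more likely rely on your init system or orchestrator's restart policy than on a hand-written loop.

```python
import subprocess
import time
from typing import List


def supervise(cmd: List[str], max_restarts: int, backoff_s: float = 1.0) -> int:
    """Run cmd and restart it when it exits with a non-zero or signal status.

    Stops when the process exits cleanly (status 0) or after max_restarts
    restarts. Returns the total number of times the process was started.
    """
    starts = 0
    while True:
        proc = subprocess.Popen(cmd)
        starts += 1
        code = proc.wait()  # negative values mean the process was killed by a signal
        if code == 0 or starts > max_restarts:
            return starts
        time.sleep(backoff_s)  # brief pause to avoid a tight restart loop
```

A process killer experiment would show up here as a negative exit status, which the loop treats like any other abnormal exit and answers with a restart.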


Other ways the Gremlin agent maintains safety

The dead man’s switch isn’t the only safety mechanism we included in the agent. Because the agent must communicate with the Control Plane, it’s vital that there’s an open network connection between them. At the same time, Gremlin lets you run experiments targeting all network traffic, even traffic to api.gremlin.com. So why doesn’t the dead man’s switch trigger every time you run a network experiment?

By default, Gremlin whitelists communications to api.gremlin.com and all DNS traffic. In practice, this means every network experiment is configured by default to let the system resolve and connect to the Gremlin Control Plane. This is also why Gremlin has a dedicated DNS experiment: it has a built-in exception for the Control Plane to ensure you can always resolve it, even if you block all DNS providers.
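Conceptually, the default exemption works like a filter that decides which traffic an experiment may touch. The Python sketch below models that decision only; it is not how the agent is implemented, and the Packet type, should_impact function, and hostname-based matching are assumptions for illustration. DNS traffic is identified here by its standard port, 53.

```python
from dataclasses import dataclass
from typing import FrozenSet

DNS_PORT = 53
DEFAULT_EXEMPT_HOSTS: FrozenSet[str] = frozenset({"api.gremlin.com"})


@dataclass(frozen=True)
class Packet:
    """Hypothetical, simplified view of an outbound packet."""
    dest_host: str
    dest_port: int


def should_impact(packet: Packet, exempt_hosts: FrozenSet[str] = DEFAULT_EXEMPT_HOSTS) -> bool:
    """Return True if a network experiment may affect this packet."""
    if packet.dest_port == DNS_PORT:
        return False  # DNS stays reachable so the Control Plane can be resolved
    if packet.dest_host in exempt_hosts:
        return False  # Control Plane traffic is never dropped or delayed
    return True
```

With these defaults, a packet to api.gremlin.com on port 443 passes through untouched, while a packet to any other host on port 443 is eligible for the experiment's impact.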

Additionally, the Gremlin agent has a built-in command-line interface (CLI). One function the CLI provides is stopping and rolling back running experiments. To roll back an experiment, simply open a terminal window on the system where the experiment is running and run the following command:

gremlin rollback

If the target is a container, use gremlin rollback-container instead.


Other ways to maintain safety during testing

The agent isn’t the only way Gremlin keeps your systems safe during testing. You also have two other tools: Health Checks and the halt button.


Health Checks

A Health Check automatically monitors the state of your systems before, during, and after a Scenario or reliability test to ensure they still function as expected. Health Checks are typically connected to monitors or alerts in your observability tool, although they can send simple REST API requests to an endpoint such as a website.

When you run a Scenario or test with a Health Check, Gremlin automatically runs the Health Check in the background every 10 seconds. Each run sends a request to the observability tool or REST API endpoint defined in the Health Check and compares the response against a set of criteria you can configure. If the response fails to meet the criteria, Gremlin stops the test and signals the agent(s) to abort and return your systems to normal, creating a fully automated layer of safety.
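The polling behavior described above can be sketched as a simple loop. This Python example is a simplified model, not Gremlin's implementation: HealthResponse, Criteria, and run_with_health_check are hypothetical names, and the status-code and latency criteria are just examples of what a check might evaluate. Only the 10-second interval comes from the text, and the sketch covers only the checks made while a test is running.

```python
import time
from dataclasses import dataclass
from typing import Callable

CHECK_INTERVAL_S = 10  # Health Check interval, per the text above


@dataclass(frozen=True)
class HealthResponse:
    status_code: int
    latency_ms: float


@dataclass(frozen=True)
class Criteria:
    """Example pass criteria; real Health Checks are configured in Gremlin."""
    expected_status: int
    max_latency_ms: float

    def passes(self, response: HealthResponse) -> bool:
        return (
            response.status_code == self.expected_status
            and response.latency_ms <= self.max_latency_ms
        )


def run_with_health_check(
    fetch: Callable[[], HealthResponse],  # queries the monitor or endpoint
    criteria: Criteria,
    halt: Callable[[], None],             # stops the test and reverts its impact
    is_running: Callable[[], bool],       # True while the test is still in progress
    interval_s: float = CHECK_INTERVAL_S,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the test ends. Return False if a failed check halted it."""
    while is_running():
        if not criteria.passes(fetch()):
            halt()
            return False
        sleep(interval_s)
    return True
```

Passing fetch, halt, and sleep in as callables keeps the loop independent of any particular observability tool and makes it easy to exercise without waiting for real intervals.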


Halt button

Gremlin also lets you stop any actively running tests using a “Halt” button. The halt button is, for lack of a better term, a kill switch for experiments. You’ll see multiple halt buttons throughout the Gremlin web app for stopping individual experiments, as well as a button for stopping all experiments across your team.

In the Gremlin web app, the Now Running page lists every running experiment in Gremlin. Next to each experiment is a bright red Halt button. When clicked, this button stops the relevant activity and sends a signal to the Gremlin agent(s) orchestrating the experiment on your system(s).

Sometimes, halting one experiment isn’t enough. For example, maybe you scheduled a dozen Scenarios to run all at the same time, but due to an unexpected problem, you need to stop them from running. Halting each individual Scenario would be time-consuming, but Gremlin has another way of handling this. If you’ve looked in the top-right corner of the Gremlin web app, you’ve probably seen the bright red Halt All button. This is an emergency kill switch that stops all actively running tests across your Gremlin team. This includes experiments, Scenarios, reliability tests, and Failure Flags. We recommend using this as a last resort since it will affect your entire team.

Screenshot of a running reliability test in Gremlin. Individual tests have their own Halt button, and there is a team-wide Halt All button at the top of the page.


Gremlin makes reliability testing safer

Our focus on safe testing is one of the many reasons we’re the preferred reliability testing solution for enterprises worldwide. If you’d like to learn more about Gremlin’s safety and security practices, you can always contact us or request a demo.

