Reliability Tests

Supported platforms:

N/A

Each reliability test checks a specific behavior of your service, such as CPU and memory autoscaling, zone and host redundancy, or tolerance of dependency failures. While a test is running, Gremlin continuously monitors your service's state using its Health Checks. If any Health Check becomes unhealthy during a test, Gremlin halts the test immediately and marks it as failed. Otherwise, the test is marked as passed.

A Test Suite is a collection of reliability tests. Each Gremlin team has one Test Suite assigned to it, and that Test Suite defines the reliability tests available to services owned by that team. To learn more, see Test Suites.

Built-in Reliability Tests

These reliability tests were created by Gremlin and are automatically available to use in a Test Suite. These tests are organized into categories: Scalability, Redundancy, and Dependencies.

Scalability

CPU: Tests that your service scales as expected when CPU capacity is limited. Gremlin will consume CPU in 3 stages (50%, 75%, 90%). Estimated test length: 20 minutes.
Memory: Tests that your service scales as expected when memory is limited. Gremlin will increase the memory utilization of your system in 3 stages (50%, 75%, 90%). Estimated test length: 20 minutes.
Disk I/O: Tests that your service scales as expected when disk I/O is limited. Gremlin will carry out many read and write operations in the target’s /var/tmp directory. Estimated test length: 20 minutes.

Note

The Disk I/O test measures disk throughput, not capacity. Only ~4 KB of data is written to disk during the test.

Redundancy

Host: Tests resilience to host failures by immediately shutting down a randomly selected host or container. Estimated test length: 5 minutes.
Zone: Tests your service's availability when a randomly selected zone is unreachable from the other zones. This test requires the Gremlin zone tag, which the Gremlin agent detects automatically (see the documentation on how tags work in Gremlin). Estimated test length: 10 minutes.
DNS: Tests your service’s availability when a randomly selected DNS service becomes unreachable. Estimated test length: 10 minutes.

Note

The collect_dns option is required for DNS tests (enabled by default).

Dependencies

Failure: Drops all network traffic to a specific dependency. Estimated test length: 10 minutes.
Latency: Delays all network traffic to this dependency by 100ms. Estimated test length: 10 minutes.
Certificate Expiry: Opens a secure connection to your dependency, retrieves the certificate chain, and validates that no certificates expire in the next 30 days. If there is no secure connection available, and therefore no certificates, this test will pass. Estimated test length: 6 minutes.
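
The Certificate Expiry check described above can be sketched roughly as follows. This is a minimal illustration, not Gremlin's implementation: the function name certs_valid_for is hypothetical, and it assumes the expiry dates of the certificate chain have already been retrieved as OpenSSL-style "notAfter" strings. An empty list passes, mirroring the rule that a dependency with no certificates passes the test.

```python
import ssl
from datetime import datetime, timedelta, timezone


def certs_valid_for(not_after_values, days=30, now=None):
    """Return True if no certificate expires within `days` days.

    not_after_values: iterable of OpenSSL-style "notAfter" strings,
    for example "Jan  5 09:34:43 2030 GMT". An empty iterable passes.
    (Hypothetical helper, for illustration only.)
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now + timedelta(days=days)
    for value in not_after_values:
        # ssl.cert_time_to_seconds converts the certificate date to epoch seconds.
        expires = datetime.fromtimestamp(
            ssl.cert_time_to_seconds(value), tz=timezone.utc
        )
        if expires <= cutoff:
            return False
    return True
```

Passing a fixed `now` makes the check reproducible, which is convenient when you want to verify the 30-day boundary in your own tooling.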

Running reliability tests

To run a reliability test, first click on the service you wish to test to open the Service Details page. From there, find the test that you wish to run and click Run. A modal window will appear asking you to confirm. To run the test, click Run again. The test will start and Gremlin will display details about the test along with its current status. On this screen, you can monitor the progress of the test and drill down into its execution details. If the test fails, you can see the cause of the failure. If the failure was caused by a Health Check, you can see which of the Health Checks triggered the failure.

Screenshot of a completed CPU reliability test

To run the full suite of tests, click the Run All button at the top of the service overview page, then click Run All Tests to confirm. Gremlin will run each test sequentially. The page will automatically refresh to show the current running test and the results of completed tests. Gremlin also sends an email to the Service Owner with the completed test results.

Run All skips tests on dependencies that have been marked as Single Points of Failure. See the dependency documentation to learn more.

Note

After the final stage in each reliability test, there is a 5-minute cooldown period. This is so Gremlin can monitor the state of your service after the test and ensure no failures occur as it returns to its normal operation.

Running dependency tests

When you define your service in Gremlin and select its process name, Gremlin uses network traffic data to identify network resources that your service communicates with. It then lists these resources in the Dependencies section. For each dependency, Gremlin automatically creates three tests:

The Failure Test drops all network traffic to the dependency.
The Latency Test delays all network traffic to this dependency by 100ms.
The Certificate Expiry Test opens a secure connection to your dependency, retrieves the certificate chain, and validates that no certificates expire in the next 30 days. If there is no secure connection available, and therefore no certificates, this test will pass.

You can run these tests for each dependency and they will contribute to the service's reliability score.

A list of dependencies for a service created in the Gremlin web app.

Running a reliability test on a dependency works the same way as running a reliability test on a service. Simply click the Run button and click Run again to confirm.

Note on dependency testing

When running a dependency test, Gremlin doesn't actually impact the dependency. Instead, it impacts the service's network connection to the dependency. For example, if you have a web server connected to a dependency over port 3306 and run a latency experiment on the dependency, Gremlin will introduce that latency on port 3306 on the service. Then, it monitors the service's Health Checks to ensure the service still functions.

Reliability tests with inverted Health Checks

By default, Gremlin marks a reliability test as “failed” when a Health Check fires. However, you can invert this so that when a Health Check fires, the test passes. This is useful for validating that your observability alerts are configured correctly.

To avoid confusion, Gremlin notes these tests with a different test result icon.
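
The difference between default and inverted behavior can be summarized in a short Python sketch. The function evaluate_result is hypothetical and only illustrates the logic described above. It also assumes that an inverted test fails when no Health Check fires, which the text implies but does not state outright.

```python
def evaluate_result(health_check_fired: bool, inverted: bool = False) -> str:
    """Return "passed" or "failed" for a finished reliability test.

    Hypothetical helper, for illustration only.
    """
    if inverted:
        # Inverted: a firing Health Check is the expected outcome.
        # Assumption: if nothing fires, the test is treated as failed.
        return "passed" if health_check_fired else "failed"
    # Default: any firing Health Check fails the test.
    return "failed" if health_check_fired else "passed"
```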

Configuring zone tests

The zone redundancy test works by dropping network traffic to and from IP addresses corresponding to the target zone. By default, Gremlin automatically detects zones from the host(s) on which the Gremlin agent is running. You can associate additional IP addresses with each zone—or create new zones—by creating a zone definition. This lets you increase the scope of individual zones.

When running a zone test that targets a zone with a zone definition, Gremlin appends the IP address(es) in the zone definition to the IP address(es) it automatically detected for that zone.

Note

Zone definitions are additive, meaning they will be used in addition to the auto-detected IP addresses. These settings apply Company-wide.
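
To illustrate what "additive" means here, the following Python sketch combines auto-detected addresses with those from a zone definition. It is an assumption-laden illustration, not Gremlin's code: the function name effective_zone_networks and the sample addresses are hypothetical.

```python
import ipaddress


def effective_zone_networks(detected, defined):
    """Combine auto-detected and zone-definition CIDR blocks.

    Returns the detected networks first, followed by any networks from
    the zone definition that are not already present.
    (Hypothetical helper, for illustration only.)
    """
    seen = set()
    result = []
    for cidr in list(detected) + list(defined):
        net = ipaddress.ip_network(cidr, strict=False)
        if net not in seen:
            seen.add(net)
            result.append(net)
    return result


# Example: one detected host address plus a zone definition with a /24 block.
targets = effective_zone_networks(["10.0.1.5/32"], ["10.0.2.0/24"])
print([str(n) for n in targets])
```

With an empty zone definition, the result is just the auto-detected addresses, which matches the behavior described when a definition is deleted.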

To create a zone definition:

Log in to the Gremlin web app and access your Company Options.
Scroll down to Zone Definitions and click + Add.
Select a zone from the drop-down (or enter a name for a new zone). Note that zone names can only contain alphanumeric characters, dashes (-), underscores (_), and periods (.).
Enter the IP address(es) corresponding to this zone, formatted as CIDR blocks.
Enable the “Include all Gremlin agents that match this zone across the company” option to include the IP addresses of Gremlin agents running in this zone.
Click Save.
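
If you prepare zone definitions in bulk before entering them, a small pre-check like the Python sketch below can catch mistakes early. It encodes only the two rules stated in the steps above (allowed zone-name characters and CIDR formatting). The function validate_zone_definition is hypothetical, and Gremlin's own validation may differ, for example in how it treats addresses without a prefix length.

```python
import ipaddress
import re

# Alphanumeric characters, dashes, underscores, and periods only.
ZONE_NAME_RE = re.compile(r"[A-Za-z0-9._-]+")


def validate_zone_definition(name, cidrs):
    """Return a list of problems; an empty list means the input looks valid.

    Hypothetical helper, for illustration only.
    """
    problems = []
    if not ZONE_NAME_RE.fullmatch(name):
        problems.append(f"invalid zone name: {name!r}")
    for cidr in cidrs:
        try:
            # strict=False tolerates host bits being set, e.g. 10.0.1.5/24.
            ipaddress.ip_network(cidr, strict=False)
        except ValueError:
            problems.append(f"invalid CIDR block: {cidr!r}")
    return problems
```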

Click Edit to change an existing zone definition, or click Delete to remove it. Deleting a zone definition removes only the IP addresses that the definition added, not the IP addresses that Gremlin detected automatically.

Note

For AWS users, we recommend configuring zones using the zone ID (e.g. usw1-az1) instead of the zone name (e.g. us-west-1a). This is because the zone name may vary between AWS accounts, which could result in unexpected behavior when testing.

Autoscheduling Reliability Tests

A consistent testing schedule is key to improving a Service's Reliability Score. You can schedule Reliability Tests to run automatically during a weekly testing window. Gremlin will run as many eligible Reliability Tests as possible during the specified window and track the scores over time so you can see how they improve with regular testing.

You can set up a schedule to run:

All Reliability Tests
Any Reliability Tests that have passed at least once
Any Reliability Tests that have been run at least once

Autoscheduling is optional. You can run the Reliability Tests manually if you do not wish to use autoscheduling. As with Run All, autoscheduled tests do not run on dependencies that are marked as Single Points of Failure. See the dependency documentation to learn more.

To schedule Reliability Tests for a Service:

On the Service page, click Autoschedule (or Settings and then Scheduling).
Select which tests you want to schedule:
1. All Reliability Tests
2. Any Reliability Tests that have passed at least once
3. Any Reliability Tests that have been run at least once
Under Test Window, specify the parameters for the test window:
1. Day
2. Start hour
3. Length of window (must be at least 2 hours)
Click Save.
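
The test window parameters above can be modeled as a small value type, shown here as a Python sketch. WeeklyWindow and its field names are hypothetical and not part of any Gremlin API. The only constraint taken from the text is the two-hour minimum length; the 0 to 23 hour range is an assumption.

```python
from dataclasses import dataclass

DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday")


@dataclass(frozen=True)
class WeeklyWindow:
    """A weekly testing window (hypothetical model, for illustration only)."""
    day: str
    start_hour: int    # assumed to be 0-23
    length_hours: int  # must be at least 2

    def __post_init__(self):
        if self.day not in DAYS:
            raise ValueError(f"unknown day: {self.day!r}")
        if not 0 <= self.start_hour <= 23:
            raise ValueError("start hour must be between 0 and 23")
        if self.length_hours < 2:
            raise ValueError("window must be at least 2 hours long")
```

Validating at construction time means an invalid window, such as a one-hour window, is rejected before it can be saved anywhere.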

Privileges required

Privilege	Description
RELIABILITY_MANAGEMENT_RUN	Allows running of an RM test for a Team
