Interpreting your reliability test results

Gremlin’s default suite of reliability tests analyzes critical functions of modern services: scalability, redundancy, and resilience to dependency failures. Services that pass this suite of tests can be trusted to remain available during unexpected incidents. But what happens when a service fails a test? How do you take failed test results and turn them into actionable insights?

This post answers that question. We’ll walk through all seven tests in the Gremlin Recommended Test Suite and explain what each one tests, what a failure means, and what actions you can take to turn that failure into a pass.

The Gremlin Recommended Test Suite

If you aren’t familiar with Gremlin Reliability Management (RM), here’s how it works: you define your services (these can be applications deployed to hosts, containers, or Kubernetes), add a Health Check to monitor critical metrics about the service, then run reliability tests on your service. While a test is running, Gremlin uses the service’s Health Check(s) to determine whether it’s in a healthy state. If not, Gremlin stops the test and marks it as a failure. Otherwise, you pass.
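
The pass/fail rule described above can be sketched in a few lines of Python. This is only an illustration of the logic, not Gremlin’s implementation: run_reliability_test, apply_stage, and is_healthy are hypothetical names, and a plain callable stands in for the Health Check.

```python
from typing import Callable, Iterable


def run_reliability_test(
    stages: Iterable[int],
    apply_stage: Callable[[int], None],
    is_healthy: Callable[[], bool],
) -> bool:
    """Return True (pass) if the health check stays healthy through every stage.

    All names here are hypothetical; this mirrors the pass/fail rule only.
    """
    for stage in stages:
        apply_stage(stage)       # e.g. inject load or a fault at this intensity
        if not is_healthy():     # stand-in for polling a Health Check
            return False         # halt immediately and record a failure
    return True


if __name__ == "__main__":
    applied = []
    # Pretend the service becomes unhealthy once load reaches 90%.
    passed = run_reliability_test(
        stages=[50, 75, 90],
        apply_stage=applied.append,
        is_healthy=lambda: applied[-1] < 90,
    )
    print("passed" if passed else "failed", "after stages", applied)
```

The important design point is that the health check is evaluated while the fault is active, so a test that degrades the service stops early instead of running to completion.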

We built the Gremlin Recommended Test Suite to test modern cloud systems against reliability best practices. This test suite contains seven tests in total. These are:

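Scalability tests: CPU scalability, memory scalability

Redundancy tests: host redundancy, zone redundancy

Dependency tests: dependency failure, dependency latency, certificate expiry

The full table is available in the Gremlin documentation: https://www.gremlin.com/docs/reliability-management-test-suites#gremlin-recommended-tests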

Scalability tests

Scalability tests measure your service’s ability to add or remove capacity in response to changing demands. These tests stress your service’s hardware to simulate a sudden increase in user traffic. Gremlin monitors your service to ensure it can handle the extra load without becoming slow or unavailable. These tests also let you test autoscaling systems, especially on cloud computing services like Amazon EC2 or Azure Virtual Machines.

CPU scalability

CPU usage is often the first metric engineers use to determine a system’s load. High CPU usage can result in poor performance as processes compete for CPU time. This test determines whether your service remains responsive under heavy load by generating CPU load in three stages: 50%, 75%, and 90% utilization. If your Health Check(s) remain healthy throughout the test, you pass.

A failure might indicate that your service isn’t scaling correctly or quickly enough. Start by checking the service’s autoscaling rules. If you don’t have a rule defined for scaling based on CPU usage, consider adding one. Our blog on scaling your systems based on CPU utilization covers this in detail. When you’ve tweaked your rules, re-run this test to verify that you can scale quickly and add enough capacity. In addition, when the test finishes, check your service to ensure it scales back down. This is especially important when running in the cloud, where resources cost money, even if unused.
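
To see why a CPU-based rule adds capacity, it can help to work through the arithmetic. The sketch below uses the proportional formula documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current × currentMetric / targetMetric), and ignores its tolerance band. Other autoscalers, such as EC2 target tracking, are similar in spirit but are not guaranteed to use exactly this calculation. The function name and the min/max bounds are illustrative assumptions.

```python
import math


def desired_replicas(current: int, cpu_percent: float, target_percent: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Proportional scaling: grow or shrink so average CPU approaches the target."""
    if current < 1 or target_percent <= 0:
        raise ValueError("current must be >= 1 and target_percent must be > 0")
    raw = math.ceil(current * cpu_percent / target_percent)
    # Clamp to the configured bounds (hypothetical defaults).
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    print(desired_replicas(4, cpu_percent=90, target_percent=60))  # scale out
    print(desired_replicas(4, cpu_percent=30, target_percent=60))  # scale back in
```

With four instances at 90% CPU and a 60% target, the rule asks for six instances. Once load drops to 30%, the same rule asks for two, which is the scale-down behavior worth confirming after the test.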

Memory scalability

Running out of memory is a critical reliability risk. Unlike high CPU usage, low memory doesn’t just slow down a system. New processes will fail to start, processes trying to allocate memory may crash, and the operating system may even start killing memory-hungry processes. Some deployment platforms, like Kubernetes, will refuse to schedule a Pod if it requests more memory than any node has available, leaving it in a Pending state.

There are a few ways to increase memory scalability. In a cloud environment, you can provision hosts with larger memory capacities or create an autoscaling rule to scale on high memory usage, similar to scaling on CPU usage. You can also use swap (or paging) files, which use the host’s persistent storage as makeshift RAM. Swap files are significantly slower than RAM but can provide a buffer on systems with limited memory. This buffer can offer enough time for additional instances to spin up.
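
A memory-based scaling rule needs a usage figure to act on. As a rough sketch, assuming a Linux host that exposes /proc/meminfo, the snippet below parses the MemTotal, MemAvailable, SwapTotal, and SwapFree fields and reports memory and swap usage. The sample values, the function names, and the 80% threshold are illustrative assumptions, not Gremlin or cloud-provider defaults.

```python
# Sample in /proc/meminfo format; on a real Linux host, read the file instead.
SAMPLE = """\
MemTotal:        8000000 kB
MemAvailable:    2000000 kB
SwapTotal:       1000000 kB
SwapFree:         750000 kB
"""


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and rest.split():
            values[name.strip()] = int(rest.split()[0])
    return values


def usage_percent(info: dict[str, int]) -> tuple[float, float]:
    """Return (memory used %, swap used %) from parsed meminfo values."""
    mem = 100.0 * (info["MemTotal"] - info["MemAvailable"]) / info["MemTotal"]
    swap_total = info.get("SwapTotal", 0)
    swap_used = swap_total - info.get("SwapFree", 0)
    swap = 100.0 * swap_used / swap_total if swap_total else 0.0
    return mem, swap


if __name__ == "__main__":
    mem, swap = usage_percent(parse_meminfo(SAMPLE))
    print(f"memory {mem:.1f}% used, swap {swap:.1f}% used")
    if mem > 80:  # hypothetical scale-out threshold
        print("memory pressure: consider scaling out")
```

Rising swap usage alongside high memory usage is a hint that the host is already leaning on the slower buffer, so it is worth watching both numbers.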

See our blog post for more tips on effectively managing memory-intensive workloads in the cloud.

Redundancy tests

Redundancy tests determine your service’s ability to remain operational during a host or zone failure.

Host redundancy

Host redundancy involves deploying one or more replicas of the host that your service is running on. Generally speaking, you can replicate a service onto two hosts and use a load balancer to distribute traffic across both hosts. If both hosts run the same version of the service and can access the same data, your service will remain available even if one host fails. Customers won’t notice any impact since the load balancer can redirect traffic on the fly.

Cloud platforms make it relatively easy to configure host redundancy. For example, Amazon EC2 lets you set up Auto Scaling groups (ASGs). While the name implies scaling, ASGs also provide redundancy by maintaining a minimum number of instances. These instances can share the same templates and file storage, making them identical. Just deploy an elastic load balancer in front of your ASG, and you’ll have automatic redundancy.
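
To make the failover behavior concrete, here is a toy round-robin balancer that skips hosts marked unhealthy. It is a minimal sketch of the idea, not how any particular cloud load balancer is implemented; the class name and host names are hypothetical.

```python
class RoundRobinBalancer:
    """Toy load balancer: rotates across hosts and skips unhealthy ones."""

    def __init__(self, hosts: list[str]) -> None:
        if not hosts:
            raise ValueError("at least one host is required")
        self.hosts = list(hosts)
        self.healthy = set(hosts)
        self._next = 0

    def mark_down(self, host: str) -> None:
        self.healthy.discard(host)

    def mark_up(self, host: str) -> None:
        if host in self.hosts:
            self.healthy.add(host)

    def pick(self) -> str:
        """Return the next healthy host, or raise if none remain."""
        for _ in range(len(self.hosts)):
            host = self.hosts[self._next]
            self._next = (self._next + 1) % len(self.hosts)
            if host in self.healthy:
                return host
        raise RuntimeError("no healthy hosts available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["host-a", "host-b"])
    print([lb.pick() for _ in range(3)])
    lb.mark_down("host-a")               # simulate a host failure
    print([lb.pick() for _ in range(3)])
```

With two hosts, losing one still leaves a target for every request. With a single host, the same failure raises an error, which is the situation a host redundancy test is meant to expose.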

Zone redundancy

Zone redundancy is slightly trickier than host redundancy due to the isolation gap between availability zones (AZs). Creating zone-redundant services depends on your cloud provider and deployment method. For Amazon EC2, an instance’s zone is determined by its subnet: one subnet may be located in use1-az1, while another is in use1-az2. Like host redundancy, you can create an ASG and an elastic load balancer, but your load balancer must be AZ redundant. We explain how to do this in our blog: How to build zone-redundant cloud instances and Kubernetes clusters.
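
One quick sanity check before running a zone redundancy test is to ask whether enough instances would remain if any single zone disappeared. The sketch below assumes you already have a mapping of instance IDs to zone IDs (for example, from your cloud provider’s inventory). The function name, the instance IDs, and the min_healthy parameter are hypothetical.

```python
from collections import Counter


def survives_zone_loss(instance_zones: dict[str, str], min_healthy: int = 1) -> bool:
    """True if at least `min_healthy` instances remain after losing any one zone."""
    per_zone = Counter(instance_zones.values())
    if not per_zone:
        return False                      # no instances at all
    total = sum(per_zone.values())
    # Check the worst case for every zone: remove all of its instances.
    return all(total - count >= min_healthy for count in per_zone.values())


if __name__ == "__main__":
    single_zone = {"i-1": "use1-az1", "i-2": "use1-az1"}
    two_zones = {"i-1": "use1-az1", "i-2": "use1-az2"}
    print(survives_zone_loss(single_zone))  # both instances share a zone
    print(survives_zone_loss(two_zones))    # one instance survives either outage
```

Note that two hosts in the same zone satisfy host redundancy but not zone redundancy, which is why the two tests are separate.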

Dependency tests

The more complex a service is, the more likely it is to rely on dependencies: external services that provide additional functionality. When a dependency is critical to your service’s operations, it becomes a single point of failure (SPoF). Gremlin’s dependency tests are designed to root out these SPoFs and ensure your service remains operational when your dependencies aren’t.

Dependency failure

The first question to answer is: what happens when my service can’t access a dependency? The dependency failure test answers this by simulating a network outage between your service and the dependency. All network traffic between your service and the dependency is dropped, creating a “black hole” in your dependency graph. If your service becomes unresponsive or fails during this test, the dependency is a single point of failure and a significant reliability risk.

Unfortunately, you can’t just spin up a replica of a dependency, since dependencies are often managed by other teams or organizations. Instead, focus on your service’s connection to the dependency. For example, find the places in your service’s code where it makes calls to the dependency and wrap them in try-catch blocks to handle failed calls safely. You could also add a retry mechanism to work around temporary outages. In both cases, provide feedback to the user indicating that this part of the service is unavailable.
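
As a minimal sketch of the two techniques above, the helper below wraps a dependency call in a try/except block, retries with exponential backoff, and returns a fallback value instead of raising once the attempts are exhausted. The function name, the default attempt count, and the delays are illustrative assumptions. The `sleep` parameter exists only so the delay can be swapped out in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    call: Callable[[], T],
    fallback: T,
    attempts: int = 3,
    base_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `call` up to `attempts` times, then return `fallback` instead of raising."""
    for attempt in range(attempts):
        try:
            return call()
        except (ConnectionError, TimeoutError):
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff between tries
    return fallback


if __name__ == "__main__":
    def fetch_recommendations() -> list[str]:
        # Hypothetical dependency call that is currently unreachable.
        raise ConnectionError("dependency unreachable")

    items = call_with_retry(fetch_recommendations, fallback=[], base_delay=0.01)
    print(items or "Recommendations are temporarily unavailable.")
```

Returning a fallback keeps the rest of the page or response working, and the final message is the user-facing feedback that this part of the service is unavailable.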

Dependency latency

Detecting failed dependencies is easy, but detecting slow dependencies can be much more difficult. When a dependency is slow to respond, it degrades the entire user experience, but it may not be slow enough to register as unavailable. For instance, imagine you set a 3-second timeout for requests to a dependency. In other words, if the dependency takes longer than three seconds to respond, your service will consider it unavailable. If a request takes 2.5 seconds to complete, your service will run normally, but your users will get impatient. This is especially true for tightly coupled services like databases, where the slightest bit of latency can have a compounding effect.

To mitigate this, find ways to keep your service responsive even when it can’t reach its dependencies. This includes using caches, adding circuit breakers to proxy requests to the dependency, and making requests asynchronous so they don’t block the main application from running. We offer more solutions in our blog: How to make your services resilient to slow dependencies.
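
A circuit breaker is one way to stop waiting on a dependency that keeps timing out. The sketch below opens the circuit after a number of consecutive failures, returns a fallback (such as a cached value) while open, and allows a trial call after a cooldown. It is a simplified illustration rather than a production library: the class name, the thresholds, and the injectable `clock` are assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures, retry after a cooldown."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T], fallback: T) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback          # open: fail fast without calling the dependency
            self.opened_at = None        # cooldown elapsed: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback
        self.failures = 0                # a success closes the circuit
        return result


if __name__ == "__main__":
    breaker = CircuitBreaker(max_failures=2, reset_after=5.0)

    def slow_dependency() -> str:
        raise TimeoutError("dependency took too long")  # hypothetical slow call

    for _ in range(4):
        print(breaker.call(slow_dependency, fallback="cached response"))
```

While the circuit is open, callers get the fallback immediately instead of waiting for another timeout, which keeps a slow dependency from compounding into a slow service.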

Certificate expiry

TLS underpins nearly all encrypted network communication, including communication between services, and it depends on valid certificates. The certificate expiry test checks the dependency’s entire TLS certificate chain and fails if any certificate in the chain expires within the next 30 days. While a failure doesn’t mean your users will notice an immediate impact, your service may fail to connect to the dependency once its certificate expires.

To prevent this, regularly renew and replace your certificates, preferably using an automated solution like Certbot.
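
If you want a quick manual check between renewals, Python’s standard ssl module can read a server certificate’s expiry date. The sketch below applies the same 30-day window to a notAfter timestamp. It inspects only the leaf certificate returned by the server, not the whole chain, so it is a narrower check than the Gremlin test. The function names are hypothetical.

```python
import socket
import ssl
import time
from typing import Optional


def expires_within(not_after: str, days: int = 30, now: Optional[float] = None) -> bool:
    """True if a certificate's notAfter timestamp falls within `days` of `now`."""
    expiry = ssl.cert_time_to_seconds(not_after)  # e.g. "Jan  5 09:34:43 2018 GMT"
    current = time.time() if now is None else now
    return expiry - current <= days * 86400


def leaf_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch the notAfter field of the server's leaf certificate (not the full chain)."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


if __name__ == "__main__":
    # Offline demonstration with a fixed timestamp and a fixed "current" time.
    not_after = "Jan  5 09:34:43 2018 GMT"
    ten_days_before = ssl.cert_time_to_seconds(not_after) - 10 * 86400
    print(expires_within(not_after, days=30, now=ten_days_before))
```

To check a live endpoint, pass the result of leaf_not_after("your-dependency.example.com") to expires_within. The hostname here is a placeholder, and the handshake itself will fail if the certificate has already expired.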

Keeping your services free of reliability risks

Running your first reliability test suite is an excellent start toward improving reliability, but a one-time test isn’t enough. Services change over time as developers push new code, engineers spin up new infrastructure, and platform teams tweak the environment. Schedule your reliability tests at least weekly to ensure your services stay reliable and to uncover any new reliability risks or regressions. Gremlin also makes it easy to track changes in reliability over time so you can demonstrate the benefits of your reliability work.

If you’re new to Gremlin, you can run your first reliability test suite in just a few minutes by signing up for a free 30-day trial.
