You’re well on your way to becoming more reliable. You’ve added your services, found and fixed some Detected Risks, and run your first set of reliability tests. However, some of your tests came back as “Failed.”

Not to worry: this isn’t a reflection of you or your engineering skills but rather an opportunity to learn more about how your systems work and, more importantly, how to make them more resilient.

In this blog, we’ll review Gremlin’s built-in reliability tests: what each test does, what a failure means, and how to address those failures to make your systems more resilient.

Note
While this blog focuses on AWS, these principles apply to other cloud platforms, including Azure and GCP.

The Gremlin Well-Architected Cloud Test Suite

A Test Suite is a collection of reliability tests that Gremlin runs on each service in your team. The Test Suite also determines how your reliability score is calculated, based on how many of its tests pass. By default, teams are assigned the “Gremlin Recommended Test Suite,” but this blog will focus on the Well-Architected Cloud Test Suite (you can learn more in our documentation).

First, let’s look at each test in this suite. There are nine tests split evenly across three categories:

Scalability

  • CPU tests that your service scales as expected when CPU capacity is limited.
  • Memory tests that your service scales as expected when memory is limited.
  • Disk I/O tests that your service scales as expected when disk I/O (i.e., disk throughput) is limited.

Redundancy

  • Host tests resilience to host failures by immediately shutting down a randomly selected host or container.
  • Zone tests your service’s availability when a randomly selected zone becomes unreachable.
  • DNS tests your service’s availability when a randomly selected DNS server becomes unreachable.

Dependencies

  • Failure drops all network traffic to a dependency.
  • Latency delays all network traffic to a dependency by 100ms.
  • Certificate Expiry retrieves a dependency’s TLS certificate chain and validates that none of the certificates expire within the next 30 days. If you don’t have TLS enabled, the test will pass.

While running a test, Gremlin uses Health Checks to determine whether your service is healthy. Health Checks use your existing observability tool’s metrics and alerts to check service health (or you can simply point it at a URL). The thresholds that constitute “pass” or “fail” depend entirely on how you configure the Health Check.

CPU and memory scalability

The CPU and memory scalability tests work the same way but stress different resources: CPU and RAM, respectively. Each consumes its target resource in three stages over a period of 15 minutes: 50%, 75%, then 90%. If these tests fail, it means your service can’t scale up capacity fast enough to accommodate increasing demand.

What does this mean in production? Poor CPU scalability means processes will run slower as they compete for CPU time. Poor memory scalability means the operating system’s out-of-memory (OOM) killer process might terminate processes to make room. This can lead to crashes, data loss, and general instability.

The easiest way to address these failures is by configuring autoscaling rules. If your service runs on a managed cloud like AWS, autoscaling is essentially built in. We cover the how in detail in our blogs: How to scale your systems based on CPU utilization and How to validate memory-intensive workloads scale in the cloud. In short, AWS lets you set up Auto Scaling groups (ASGs), which add or remove EC2 instances from a cluster depending on the cluster’s overall resource usage. You can set a minimum number of instances, all launched from the same template, and set thresholds for scaling based on total CPU or RAM consumption. If your cluster reaches or exceeds a threshold, the ASG will automatically provision an additional instance, add it to the cluster, and route traffic to it once it’s healthy.
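As a rough sketch of what this looks like with boto3 (the ASG name and target value below are placeholders, not recommendations), a target tracking policy keeps the group’s average CPU near a threshold:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical ASG name. Target tracking keeps average CPU near 70%:
# when the group's average CPU exceeds the target, the ASG adds
# instances; when it falls well below, the ASG removes them.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-service-asg",  # placeholder
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 70.0,
    },
)
```

One caveat: memory isn’t one of EC2’s predefined scaling metrics, so memory-based scaling requires publishing memory usage as a custom CloudWatch metric (for example, via the CloudWatch agent) and referencing it with a CustomizedMetricSpecification instead.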

Disk I/O scalability

The disk I/O test checks your service’s responsiveness when disk I/O is limited. It does this by performing a large number of read and write operations in the service’s /var/tmp directory for 20 minutes.

This test is designed for cloud environments where the platform limits disk throughput. For example, Amazon Elastic Block Store (EBS) volumes track throughput using IOPS (input/output operations per second). IOPS varies depending on the number and size of disk write operations from your application: for general-purpose volumes such as gp2 and gp3, the maximum is 16,000 operations 16 KiB in size. For io1 and io2, this is 64,000 and 256,000 16 KiB operations, respectively (details are available here). If a process uses up most or all of this bandwidth, it can make other disk-heavy processes run more slowly.

One way to address this is by switching volume types or provisioning more IOPS: SSDs support significantly higher IOPS than HDDs but are generally more expensive. You could also assign dedicated EBS volumes to disk-heavy processes or move those processes onto different hosts.
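For instance, EBS lets you modify a volume in place, so you can move a gp2 volume to gp3 and provision more IOPS and throughput without detaching it. A minimal boto3 sketch, where the volume ID and values are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical volume ID. Moves a gp2 volume to gp3 and provisions
# higher IOPS and throughput for a disk-heavy workload.
ec2.modify_volume(
    VolumeId="vol-0123456789abcdef0",  # placeholder
    VolumeType="gp3",
    Iops=6000,        # gp3 supports 3,000-16,000 IOPS
    Throughput=500,   # gp3 throughput in MiB/s (125-1,000)
)
```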

Host and zone redundancy

The host redundancy test will sound familiar to anyone who’s heard of Chaos Monkey: it shuts down a random host (or container) your service is running on. The zone redundancy test works similarly, but instead of shutting down hosts, it blocks network traffic to a randomly selected availability zone. Both tests are meant to ensure that your service is replicated across multiple failure domains.

In general, host redundancy is easier to accomplish since it’s built into many of the tools cloud platforms provide. In fact, you might be using it already: if you’re using EC2 Auto Scaling groups for scalability, you can also use them for redundancy. Just specify the minimum number of instance replicas you want, and if one instance fails (e.g., as a result of a shutdown), the ASG will automatically provision and deploy a replacement.

ASGs can also provide zone redundancy. An EC2 instance’s subnet determines the availability zone it’s deployed to. As long as you configure your ASG with subnets in multiple AZs, it will distribute your instances evenly across those zones to minimize single points of failure.
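Here’s a hedged boto3 sketch of a multi-AZ group (the group name, launch template, and subnet IDs are placeholders); listing subnets from three different zones is what spreads the instances out:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical names and subnet IDs. Each subnet lives in a different
# AZ, so the ASG spreads instances across all three zones.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="my-service-asg",  # placeholder
    LaunchTemplate={
        "LaunchTemplateName": "my-service-template",  # placeholder
        "Version": "$Latest",
    },
    MinSize=3,  # at least one instance per zone
    MaxSize=9,
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
)
```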

DNS redundancy

As the old saying goes: “It’s always DNS.” While it’s important that your customers can reach your services, it’s equally important that your services can reach each other. DNS allows services to reference each other by name rather than IP address, and any interruptions in DNS availability can prevent this from happening, effectively creating a network black hole.

On cloud platforms, most DNS resolution is handled by the platform provider. AWS runs a Route 53 Resolver in each availability zone, but if you’re concerned about Route 53 becoming unavailable (unlikely, considering its SLA is 100% uptime, with a 100% service credit if availability drops below 99.95%), you can run a secondary DNS resolver in your VPCs.
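If you do run a secondary resolver, your services need client-side fallback logic to use it. A rough sketch using the third-party dnspython package, where the second address is a placeholder for whatever secondary resolver you run:

```python
import dns.exception
import dns.resolver  # third-party "dnspython" package

def resolve_with_fallback(hostname: str) -> list[str]:
    """Try the VPC's Route 53 Resolver first, then a fallback server."""
    # 169.254.169.253 is the Route 53 Resolver's link-local address
    # inside a VPC; 10.0.0.2 is a placeholder for your secondary DNS.
    for nameserver in ["169.254.169.253", "10.0.0.2"]:
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [nameserver]
        resolver.lifetime = 2.0  # fail over quickly instead of hanging
        try:
            answer = resolver.resolve(hostname, "A")
            return [record.address for record in answer]
        except (dns.resolver.NoNameservers, dns.exception.Timeout):
            continue  # this server is unreachable; try the next one
    raise RuntimeError(f"All DNS servers failed for {hostname}")
```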

Dependency failures

Dependencies are services that your service relies on but aren’t part of the service itself. They’re usually managed by other teams or organizations, leaving you with little control over how they operate.

Gremlin highly encourages testing dependencies because they’re often single points of failure. You can learn a few techniques for building resilience against dependency failures in our blog: How to build reliable services with unreliable dependencies. In short, you can:

  1. Send requests asynchronously so your service doesn’t have to wait for responses before continuing.
  2. Configure redundant services for a dependency. For example, if your service uses an LLM, you could use Amazon Q as your primary provider and OpenAI or another vendor as a fallback (see the sketch after this list).
  3. Cache and queue requests using services like Amazon SQS, Apache Kafka, or RabbitMQ.
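As an illustration of the second technique, a thin wrapper can try the primary provider and quietly fail over to the backup. The endpoints below are placeholders, not real vendor APIs:

```python
import urllib.error
import urllib.request

# Placeholder endpoints; substitute your primary and fallback providers.
PRIMARY = "https://llm-primary.example.com/v1/complete"
FALLBACK = "https://llm-fallback.example.com/v1/complete"

def ask_llm(prompt: str) -> str:
    """Try the primary provider first; fall back if it fails or times out."""
    for endpoint in (PRIMARY, FALLBACK):
        try:
            request = urllib.request.Request(endpoint, data=prompt.encode())
            with urllib.request.urlopen(request, timeout=5) as response:
                return response.read().decode()
        except (urllib.error.URLError, TimeoutError):
            continue  # provider unavailable or slow; try the next one
    raise RuntimeError("All providers failed")
```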

Dependency latency

Slow dependencies can be even more detrimental to performance than failed dependencies. You can detect a failed dependency immediately, but you could wait several seconds for a slow dependency to respond.

We also have a blog on dealing with slow dependencies. Our suggestions include:

  1. Using multi-threading and constructs like Promises in Node.js to communicate with dependencies asynchronously.
  2. Adding an intermediate caching layer between your service and the dependency.
  3. Using the circuit breaker pattern, which stops sending requests to a dependency after detecting repeated failures or slow responses, giving it time to recover.
  4. Using exponential backoff to retry failed requests at progressively longer intervals (sketched after this list).
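As a minimal sketch of that last technique, here’s a generic retry helper with exponential backoff and jitter (the retry count and delays are arbitrary starting points, not recommendations):

```python
import random
import time

def call_with_backoff(request_fn, max_attempts: int = 5):
    """Retry request_fn with exponentially increasing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the failure
            # 1s, 2s, 4s, 8s... plus jitter so retries don't synchronize
            delay = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
```

You’d wrap any flaky call with it, e.g. call_with_backoff(lambda: fetch_inventory()), where fetch_inventory stands in for whatever request you’re protecting.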

Certificate Expiry

Most modern encryption of data in transit, including traffic between services, is built on Transport Layer Security (TLS) certificates. All TLS certificates have an expiration date, after which they’re no longer recognized as valid. This is a security measure: if a malicious party somehow obtained your service’s TLS certificate and private key, expiration limits how long they could impersonate your service.

Gremlin’s certificate expiry test checks every TLS certificate in a dependency’s certificate chain to see if it expires within the next 30 days. Ideally, you’d use an automated certificate management tool like AWS Certificate Manager (ACM) or an ACME (Automatic Certificate Management Environment) client like Certbot to rotate your certificates automatically. In cases where you can’t use automatic tooling, ACM can send notifications of expiring certificates as early as 45 days in advance.
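If you want a quick spot check outside of Gremlin, a few lines of Python can report how close a server’s certificate is to expiring. Note that this sketch only inspects the leaf certificate, not the full chain that Gremlin’s test validates:

```python
import socket
import ssl
import time

def days_until_expiry(hostname: str, port: int = 443) -> int:
    """Return the number of days until hostname's leaf certificate expires."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    # 'notAfter' is a string like 'Jun  1 12:00:00 2026 GMT'
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return int((expires_at - time.time()) // 86400)

if days_until_expiry("example.com") < 30:
    print("Certificate expires within 30 days -- renew soon!")
```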

For certificates further up the chain—especially root certificates—renewals are most likely out of your control. Fortunately, expired intermediate and root certificates are rare, though they do happen, especially if your users are using out-of-date devices or software to access your service.

Next steps and staying ahead of regressions

Great work making your service more resilient, but where do you go from here?

You already know how to run a test suite and take action on the results. The next step is to automate testing to catch regressions. After logging into the Gremlin web app and navigating to your service, click Settings, then Scheduling. You can auto-schedule reliability tests to run weekly during a customizable testing window. After each test run, Gremlin will track and report on the results, giving you immediate feedback on your reliability.

Enabling reliability test autoscheduling on a service in the Gremlin web app.

If you want to run additional tests beyond what the Well-Architected Cloud Test Suite offers, you can create your own! Test Suites are built on Scenarios, which you can fully customize. You can create a new Test Suite from scratch, but we recommend cloning an existing suite and tailoring it for your team from there.

Keep in mind that when you switch from one Test Suite to another, your reliability scores will be set back to zero. This doesn’t mean your scores were deleted, though! If you switch back to your previous Test Suite, those old scores will be restored, although the test results might show as “expired” if they were last run over a week ago. You can learn more about how the reliability score is calculated in our docs.

While you’re on your reliability journey, always remember that failing a reliability test isn’t a bad thing. It’s an opportunity to learn and improve. Schedule tests, check the results, make improvements, and repeat. Try to get your reliability score as high as possible across your services, and before you know it, you’ll have highly available and fault-tolerant services.

Andre Newman
Sr. Reliability Specialist
Start your free trial

Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30-day trial.
