What is the Well-Architected Cloud Test Suite?

When it comes to reliability, cloud providers use a Shared Responsibility Model. In essence, they’ll keep the infrastructure reliable, while you’re responsible for architecting reliability into your systems. To help make this easier, they’ve published a variety of best practice guides, such as the AWS Well-Architected Framework. These lengthy documents are filled with recommendations to help you architect a more secure, more reliable system.

But how do you know that your systems are performing the way you planned? That’s where reliability tests come in. In fact, the Well-Architected Framework includes a recommendation to test your reliability regularly.

Gremlin built the Well-Architected Cloud Test Suite to make this easier. Designed around cloud reliability principles and best practices, it gives you a testing foundation that covers the most common reliability failures out of the box. And with Gremlin’s Intelligent Health Checks, or integrations with most observability tools, you can understand how resilient your systems are in minutes.

Let’s dig into what the test suite includes.

What are test suites?

Test suites are the backbone of standardized resilience testing.

When using a test suite, you’re engaging in validation testing, which is used to verify the resilience of your system against known failure modes.

As an example, let’s look at autoscaling. When there’s a sudden surge in traffic, a cloud-based system should spin up additional resources to handle the increased demand. If it doesn’t, the system can’t keep up with demand and the result is an outage. Thus, autoscaling is a known potential point of failure in a cloud system.

Using a test suite, you can automate tests that increase demand on your system to make sure that autoscaling works correctly. If it doesn’t, the test will halt and roll your system back to its previous state. But now you’ll know that there’s a problem with autoscaling, one that you can fix before it leads to an actual outage instead of just a simulated one.

Pre-built test suites, like the Gremlin Recommended Test Suite or the Well-Architected Cloud Test Suite, are a great place to start your testing. As you become more familiar with your system and how it responds, you can customize the test suites by adjusting test parameters, adding or removing tests to better fit your architecture, or even creating your own test suites from scratch.

Tests in the Well-Architected Cloud Test Suite

There are several core reliability tests that you should look at running for every service. These are contained in the Gremlin Recommended Test Suite, covering your service’s scalability, redundancy, and dependency response. The Well-Architected Cloud Test Suite starts with these core tests, and layers on additional Disk I/O and DNS tests.

Once your service is connected to the Gremlin platform, you’ll be able to immediately run these tests on your services. The Well-Architected Cloud Test Suite includes nine tests that break down into these groups.

Scalability tests

Your cloud deployments need to be resilient to sudden increases in demand for resources. These three tests will verify that your services are resilient to sudden resource spikes.

CPU Scalability

Ensure that your service scales as expected when CPU is limited.

Consume CPU in 3 stages (50%, 75%, 90%)
Est. test length: 15 minutes
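
The staged shape of a test like this can be sketched in a few lines. This is an illustrative model only, not Gremlin's implementation: `run_staged_test` and the `apply_load`, `is_healthy`, and `rollback` hooks are hypothetical names introduced for the example.

```python
from typing import Callable, Sequence


def run_staged_test(
    stages: Sequence[int],
    apply_load: Callable[[int], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Apply each load stage in order and halt on the first failed health check.

    Returns True if every stage passed, False if the test was halted.
    The rollback hook always runs, whether the test passed or not.
    """
    try:
        for percent in stages:
            apply_load(percent)      # hypothetical hook: push utilization to `percent`
            if not is_healthy():     # hypothetical hook: query a health check
                return False
        return True
    finally:
        rollback()                   # return the system to its previous state


if __name__ == "__main__":
    applied = []
    # Simulated service that stops being healthy once load reaches 90%.
    passed = run_staged_test(
        stages=[50, 75, 90],
        apply_load=applied.append,
        is_healthy=lambda: applied[-1] < 90,
        rollback=lambda: applied.append(0),
    )
    print(passed, applied)  # False [50, 75, 90, 0]
```

In this simulation the first two stages pass, the 90% stage fails the health check, and the rollback hook still runs, which mirrors the halt-and-roll-back behavior described earlier.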

Memory Scalability

Ensure that your service scales as expected when memory is limited.

Increase memory utilization in 3 stages (50%, 75%, 90%)
Est. test length: 15 minutes

Disk I/O Scalability

Ensure your service withstands high read/write disk activity.

Increase disk I/O utilization in 3 stages (50%, 75%, 90%)
Est. test length: 15 minutes

Relevant AWS Well-Architected Framework sections

WAF REL05-BP07: Implement emergency levers
WAF REL07-BP01: Use automation when obtaining or scaling resources
WAF REL07-BP02: Obtain resources upon detection of impairment to a workload
WAF REL07-BP03: Obtain resources upon detection that more resources are needed for a workload

Redundancy tests

While cloud providers work hard to keep systems up, there will still be times when hosts, zones, or regions are unreachable. These tests shut down network access to verify that your deployment has the redundancy needed to stay up when a host, zone, or DNS service isn’t reachable.

Host redundancy

Verify that your service can withstand the loss of a host.

Immediately shut down a randomly selected host or container.
Est. test length: 5 minutes

Zone redundancy

Ensure your service can withstand the loss of an availability zone. The Gremlin zone tag is required for this test.

Immediately make a randomly selected zone unreachable from other zones.
Est. test length: 5 minutes

DNS redundancy

Ensure your service can fail over to a secondary DNS service, then fail back to the primary DNS service.

Simulate a DNS outage by making your primary DNS service unavailable.
Est. test length: 2 minutes

Relevant AWS Well-Architected Framework sections

WAF REL02-BP01: Use highly available network connectivity for your workload public endpoints
WAF REL08-BP03: Integrate resiliency testing as part of your deployment
WAF REL10-BP01: Deploy the workload to multiple locations

Dependency tests

Service-based architectures usually come with a complex arrangement of interlocked dependencies. These tests verify that your system responds correctly when dependencies are slow or unavailable.

Dependency Failure

Verify that your service handles the loss of a dependency.

Drop all network traffic to a dependency.
Est. test length: 5 minutes

Dependency Latency

Ensure that your service handles latency with a dependency correctly.

Delay all network traffic to this dependency by 100ms.
Est. test length: 5 minutes
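
One common way to handle a slow dependency is a client-side timeout with a fallback, in the spirit of the "set client timeouts" guidance listed below. The sketch here is a generic pattern, not something the test suite prescribes. `call_with_timeout`, the 50 ms budget, and the "cached" fallback value are assumptions chosen for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_timeout(fn: Callable[[], T], timeout_s: float, fallback: T) -> T:
    """Run `fn` in a worker thread; return `fallback` if it exceeds `timeout_s`."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block on the slow call; the worker finishes in the background.
        pool.shutdown(wait=False)


def slow_dependency() -> str:
    time.sleep(0.1)  # simulate 100 ms of added latency
    return "fresh"


if __name__ == "__main__":
    # With a 50 ms budget the call should fall back; with a 1 s budget it should not.
    print(call_with_timeout(slow_dependency, timeout_s=0.05, fallback="cached"))
    print(call_with_timeout(slow_dependency, timeout_s=1.0, fallback="cached"))
```

A thread-based timeout stops the caller from waiting but does not cancel the underlying call, so real clients usually also set timeouts on the network library itself.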

Certificate Expiry

Expired certificates on dependencies are a common outage cause. Use this test to ensure no certificates expire in the next 30 days.

Open a secure connection to your dependency, retrieve the certificate chain, and validate that no certificates expire in the next 30 days.
- Note: If there is no secure connection available, and therefore no certificates, this test will pass.
Est. test length: 2 minutes
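
A similar check can be approximated with Python's standard `ssl` module. This is a minimal sketch under stated assumptions, not how Gremlin performs the test: `days_until_expiry` and `leaf_cert_expires_soon` are hypothetical helper names, and only the leaf certificate is inspected, whereas the test described above validates the whole chain.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days between `now` (epoch seconds) and a certificate's notAfter timestamp.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def leaf_cert_expires_soon(host: str, port: int = 443, threshold_days: int = 30) -> bool:
    """Connect over TLS and report whether the leaf certificate expires
    within `threshold_days`. Assumption: only the leaf is checked here."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return days_until_expiry(cert["notAfter"]) < threshold_days
```

Calling `leaf_cert_expires_soon("example.com")` would report whether that host's certificate is inside the 30-day window. With the default context, a certificate that has already expired causes the handshake itself to raise an error, so a caller would need to treat that exception as a failure too.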

Relevant AWS Well-Architected Framework sections

WAF REL05-BP04: Fail fast and limit queues
WAF REL05-BP05: Set client timeouts
WAF REL13-BP04: Manage configuration drift at the DR site or Region

Detected load balancer risks

Some reliability risks can be detected by reading the configuration and status of your Elastic Load Balancer (ELB). Rather than requiring a test, these risks are automatically detected and surfaced in Gremlin. The Well-Architected Cloud Test Suite includes three additional risks for cloud architectures based on load balancer configurations.

No load balancer availability zone redundancy
Verifies that your load balancer is set to run in multiple availability zones to avoid single points of failure.
Cross-zone load balancing disabled
Verifies whether cross-zone load balancing is enabled to better handle the loss of one or more instances.
Deletion protection disabled
Checks whether load balancer deletion protection is enabled to prevent accidental deletions.
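
Because these risks come from configuration rather than from a running test, the detection logic amounts to a few checks over load balancer settings. The sketch below assumes a simplified, hypothetical settings dictionary. The keys `availability_zones`, `cross_zone_enabled`, and `deletion_protection_enabled` are invented for illustration and do not reflect the shape of any AWS or Gremlin API.

```python
from typing import Any, List, Mapping


def detect_lb_risks(lb: Mapping[str, Any]) -> List[str]:
    """Return the names of the risks present in a load balancer's settings.

    `lb` is a hypothetical, simplified settings dictionary; missing keys
    are treated as the risky default.
    """
    risks: List[str] = []
    # Fewer than two distinct zones means a single point of failure.
    if len(set(lb.get("availability_zones", []))) < 2:
        risks.append("No load balancer availability zone redundancy")
    if not lb.get("cross_zone_enabled", False):
        risks.append("Cross-zone load balancing disabled")
    if not lb.get("deletion_protection_enabled", False):
        risks.append("Deletion protection disabled")
    return risks


if __name__ == "__main__":
    example = {
        "availability_zones": ["us-east-1a"],
        "cross_zone_enabled": True,
        "deletion_protection_enabled": False,
    }
    for risk in detect_lb_risks(example):
        print(risk)
```

For the example settings, the function reports the zone redundancy and deletion protection risks, since only one zone is configured and deletion protection is off.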

Next steps: Intelligent Health Checks get you testing quickly

Health Checks are an essential part of running tests: they monitor your service’s performance and determine whether a test passed or failed. Gremlin’s Intelligent Health Checks make it easy to get your test suite up and running quickly. On AWS, Intelligent Health Checks can be automatically pulled from your Elastic Load Balancer with a single click. This means you can run the Well-Architected Cloud Test Suite in both instrumented and non-instrumented environments, such as pre-production or staging environments.

For other infrastructures, you can use integrations to pull metrics directly from your observability tool.

Sign up for a free 30-day Gremlin trial to start running tests—or request a demo to see it in action and get your questions answered.
