What is the Well-Architected Cloud Test Suite?
When it comes to reliability, cloud providers use a Shared Responsibility Model. In essence, they’ll keep the infrastructure reliable, while you’re responsible for architecting reliability into your systems. To help make this easier, they’ve published a variety of best practice guides, such as the AWS Well-Architected Framework. These lengthy documents are filled with recommendations to help you architect a more secure, more reliable system.
But how do you know that your systems are performing the way you planned? That’s where reliability tests come in. In fact, the Well-Architected Framework includes a recommendation to test your reliability regularly.
Gremlin built the Well-Architected Cloud Test Suite to make this easier. Designed around cloud reliability principles and best practices, it gives you a testing foundation that covers the most common reliability failures out of the box. And with Gremlin’s Intelligent Health Checks, or integrations with most observability tools, you can understand how resilient your systems are in minutes.
Let’s dig into what the test suite includes.
What are test suites?
Test suites are the backbone of standardized resilience testing.
When using a test suite, you’re engaging in validation testing, which is used to verify the resilience of your system against known failure modes.
As an example, let’s look at autoscaling. When there’s a sudden surge in traffic, a cloud-based system should spin up additional resources to handle the increased demand. If it doesn’t, the system can’t keep up with demand and an outage follows. Thus, autoscaling is a known potential point of failure in a cloud system.
Using a test suite, you can automate tests that increase demand on your system to make sure that autoscaling works correctly. If it doesn’t, the test will halt and roll the system back to its previous state. But now you’ll know that there’s a problem with autoscaling, one that you can fix before it leads to an actual outage instead of just a simulated one.
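To make the halt-and-roll-back idea concrete, here is a minimal Python sketch. It is not Gremlin's implementation: `run_staged_test` and the `apply_load`, `remove_load`, and `is_healthy` hooks are hypothetical names, and the load levels in the demo are placeholders. The point is only the shape of the loop: raise load in stages, check health after each stage, and always remove the injected load when the test ends.

```python
from typing import Callable, Sequence


def run_staged_test(
    levels: Sequence[int],
    apply_load: Callable[[int], None],
    remove_load: Callable[[], None],
    is_healthy: Callable[[], bool],
) -> dict:
    """Raise load stage by stage and halt at the first failed health check.

    All four arguments are hypothetical hooks supplied by the caller.
    Returns a small record describing how far the test got.
    """
    completed = []
    try:
        for level in levels:
            apply_load(level)
            if not is_healthy():
                # The service did not cope with this level: stop here.
                return {"passed": False, "failed_at": level, "completed": completed}
            completed.append(level)
        return {"passed": True, "failed_at": None, "completed": completed}
    finally:
        # Roll back: the injected load is removed whether the test passed or not.
        remove_load()


if __name__ == "__main__":
    state = {"load": 0}
    result = run_staged_test(
        levels=[50, 75, 90],
        apply_load=lambda pct: state.update(load=pct),
        remove_load=lambda: state.update(load=0),
        is_healthy=lambda: state["load"] < 90,  # pretend scaling fails at 90%
    )
    print(result, state)
```

Putting the cleanup in a `finally` block means the rollback also runs if a hook raises an exception, which is usually what you want from a fault-injection harness.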
Pre-built test suites, like the Gremlin Recommended Test Suite or the Well-Architected Cloud Test Suite, are a great place to start your testing. As you become more familiar with your system and the responses, you can customize the test suites by adjusting test parameters, adding/removing tests to better fit your architecture, or even creating your own test suites from scratch.
Tests in the Well-Architected Cloud Test Suite
There are several core reliability tests that you should look at running for every service. These are contained in the Gremlin Recommended Test Suite, covering your service’s scalability, redundancy, and dependency response. The Well-Architected Cloud Test Suite starts with these core tests, and layers on additional Disk I/O and DNS tests.
Once your service is connected to the Gremlin platform, you’ll be able to immediately run these tests on your services. The Well-Architected Cloud Test Suite includes nine tests that break down into these groups.
Scalability tests
Your cloud deployments need to be resilient to sudden increases in demand for resources. These three tests will verify that your services are resilient to sudden resource spikes.
CPU Scalability
Ensure that your service scales as expected when CPU is limited.
- Consume CPU in 3 stages (50%, 75%, 90%)
- Est. test length: 15 minutes
Memory Scalability
Ensure that your service scales as expected when memory is limited.
- Increase memory utilization in 3 stages (50%, 75%, 90%)
- Est. test length: 15 minutes
Disk I/O Scalability
Ensure your service withstands high read/write disk activity.
- Increase disk I/O in 3 stages (50%, 75%, 90%)
- Est. test length: 15 minutes
Relevant AWS Well-Architected Framework sections
- WAF REL05-BP07: Implement emergency levers
- WAF REL07-BP01: Use automation when obtaining or scaling resources
- WAF REL07-BP02: Obtain resources upon detection of impairment to a workload
- WAF REL07-BP03: Obtain resources upon detection that more resources are needed for a workload
Redundancy tests
While cloud providers work hard to keep systems up, there will still be times when hosts, zones, or regions are unreachable. These tests shut down network access to verify that your deployment has the redundancy needed to stay up when a host, zone, or DNS service isn’t reachable.
Host redundancy
Verify that your service can withstand the loss of a host.
- Immediately shut down a randomly selected host or container.
- Est. test length: 5 minutes
Zone redundancy
Ensure your service can withstand the loss of an availability zone. The Gremlin zone tag is required for this test.
- Immediately make a randomly selected zone unreachable from other zones.
- Est. test length: 5 minutes
DNS redundancy
Ensure your service can fail over to a secondary DNS, then back to the primary DNS.
- Simulate a DNS outage by making your primary DNS service unavailable.
- Est. test length: 2 minutes
Relevant AWS Well-Architected Framework sections
- WAF REL02-BP01: Use highly available network connectivity for your workload public endpoints
- WAF REL08-BP03: Integrate resiliency testing as part of your deployment
- WAF REL10-BP01: Deploy the workload to multiple locations
Dependency tests
Service-based architectures usually come with a complex arrangement of interlocked dependencies. These tests verify that your system responds correctly when dependencies are slow or unavailable.
Dependency Failure
Verify that your service handles the loss of a dependency.
- Drop all network traffic to a dependency.
- Est. test length: 5 minutes
Dependency Latency
Ensure that your service handles latency with a dependency correctly.
- Delay all network traffic to this dependency by 100ms.
- Est. test length: 5 minutes
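One common way for a service to handle a slow dependency is a client-side timeout with a fallback value. The sketch below is an illustration of that pattern under stated assumptions, not part of the test suite: `call_with_timeout` and `slow_dependency` are hypothetical names, and the timeout values are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_timeout(fn: Callable[[], T], timeout_s: float, fallback: T) -> T:
    """Run fn in a worker thread and return fallback if it exceeds timeout_s."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not make the caller wait for the slow call to finish.
        pool.shutdown(wait=False)


def slow_dependency() -> str:
    # Stands in for a dependency slowed down by injected latency.
    time.sleep(0.2)
    return "fresh"


if __name__ == "__main__":
    print(call_with_timeout(slow_dependency, timeout_s=0.05, fallback="cached"))
    print(call_with_timeout(slow_dependency, timeout_s=1.0, fallback="cached"))
```

Note that the worker thread keeps running after the timeout; a real client would more likely rely on the timeout options of its HTTP or database library so the underlying request is actually abandoned.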
Certificate Expiry
Expired certificates on dependencies are a common outage cause. Use this test to ensure no certificates expire in the next 30 days.
- Open a secure connection to your dependency, retrieve the certificate chain, and validate that no certificates expire in the next 30 days.
- Note: If there is no secure connection available, and therefore no certificates, this test will pass.
- Est. test length: 2 minutes
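The check described above can be approximated with Python's standard `ssl` module. This is a simplified sketch, not how Gremlin performs the test: the function names are hypothetical, it inspects only the leaf certificate rather than the full chain, and it raises an error instead of passing when no secure connection is available.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days between `now` (epoch seconds) and a certificate's notAfter value.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def leaf_cert_expires_soon(host: str, port: int = 443, threshold_days: int = 30) -> bool:
    """Connect over TLS and report whether the leaf certificate expires
    within threshold_days. Only the leaf certificate is examined."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return days_until_expiry(cert["notAfter"]) < threshold_days
```

Because the default context verifies the certificate, an already expired certificate surfaces as a handshake error rather than a `True` result; a fuller version would catch that case and report it as a failure.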
Relevant AWS Well-Architected Framework sections
- WAF REL05-BP04: Fail fast and limit queues
- WAF REL05-BP05: Set client timeouts
- WAF REL13-BP04: Manage configuration drift at the DR site or Region
Detected load balancer risks
Some reliability risks can be detected by reading the configuration and status of your elastic load balancer (ELB). Instead of running a test, these are automatically detected and surfaced in Gremlin. The Well-Architected Cloud Test Suite includes three additional risks for cloud architectures based on load balancer configurations.
- No load balancer availability zone redundancy
Verifies that your load balancer is set to run in multiple availability zones to avoid single points of failure.
- Cross-zone load balancing disabled
Verifies whether cross-zone load balancing is enabled to better handle the loss of one or more instances.
- Deletion protection disabled
Checks to see whether load balancer delete protection is enabled to prevent accidental deletions.
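Configuration-based detection of this kind can be pictured as a few checks over a load balancer description. The Python sketch below is illustrative only: `detect_load_balancer_risks` and the dictionary keys are hypothetical and simplified, not a real AWS API response or Gremlin's detection logic.

```python
from typing import Dict, List


def detect_load_balancer_risks(lb: Dict) -> List[str]:
    """Return the names of the risks that apply to a load balancer description.

    `lb` is a hypothetical, simplified dict with these keys:
      availability_zones: list of zone names the load balancer runs in
      cross_zone_enabled: bool
      deletion_protection_enabled: bool
    Missing keys are treated as the risky setting.
    """
    risks = []
    if len(set(lb.get("availability_zones", []))) < 2:
        risks.append("No load balancer availability zone redundancy")
    if not lb.get("cross_zone_enabled", False):
        risks.append("Cross-zone load balancing disabled")
    if not lb.get("deletion_protection_enabled", False):
        risks.append("Deletion protection disabled")
    return risks


if __name__ == "__main__":
    example = {
        "availability_zones": ["zone-a"],
        "cross_zone_enabled": True,
        "deletion_protection_enabled": False,
    }
    for risk in detect_load_balancer_risks(example):
        print(risk)
```

Because these checks only read configuration, they can run continuously without injecting any fault into the system.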
Next steps: Intelligent Health Checks get you testing quickly
Health Checks are an essential part of running tests: they monitor your service’s performance and determine whether a test passed or failed. Gremlin’s Intelligent Health Checks make it easy to get your test suite up and running quickly. On AWS, Intelligent Health Checks can be automatically pulled from your Elastic Load Balancer with a single click. This means you can run the Well-Architected Cloud Test Suite in both instrumented and non-instrumented environments, such as pre-production or staging environments.
For other infrastructures, you can use integrations to pull metrics directly from your observability tool.
Sign up for a free 30-day Gremlin trial to start running tests—or request a demo to see it in action and get your questions answered.