Cloud providers like AWS excel at creating reliable platforms for developers to build on. But while the platforms may be rock-solid, this doesn’t guarantee your applications will be too. It’s the provider’s job to offer stable infrastructure, but you’re still on the hook for making your workloads resilient, recoverable, and fault-tolerant.

There’s only one problem: cloud platforms are essentially black boxes. With no insight into how the platform is built, how it works, or how it handles failures, how can you design highly available workloads?

We’ll answer this question by examining some of the most popular AWS services, such as EC2, EKS, and ECS. We’ll explore how workloads running on them can fail, the tools that AWS provides to help you mitigate failures, and how you can uncover potential failure modes before they impact your systems and customers.

The Shared Responsibility Model for Reliability

No single organization runs an entire cloud platform. Yes, the provider manages the infrastructure, but they don’t manage the applications running on it. That responsibility falls to you, the customer. This is why major cloud providers adopt a shared responsibility model that divides responsibilities between themselves and their customers.

Under a shared responsibility model, the cloud provider maintains the platform's reliability by building and operating the infrastructure. This lets you focus on building and operating your workloads, but it also means that making those workloads resilient and fault-tolerant is your job.

The Shared Responsibility Model of AWS. https://aws.amazon.com/blogs/industries/applying-the-aws-shared-responsibility-model-to-your-gxp-solution/

How do you manage reliability under a shared responsibility model?

In addition to providing best practices guides like the AWS Well-Architected Framework, AWS offers tools, options, and services to help you improve reliability.

For example, look at Amazon Elastic Compute Cloud (EC2). When you deploy an application onto an EC2 instance, your application is only as reliable as that single instance. One bad software update, accidental misconfiguration, availability zone outage, or similar disruption will take your application offline. In short, the instance becomes a single point of failure. Even if you perform perfect zero-downtime maintenance, Amazon only guarantees 99.5% uptime for individual instances, which averages out to roughly 7 minutes of downtime each day.
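To make the uptime figure concrete, here is a minimal sketch of the arithmetic behind a downtime budget. The function name and the 30-day month are illustrative assumptions; an actual SLA defines its own measurement period and exclusions.

```python
def allowed_downtime_minutes(uptime_percent: float, period_minutes: float) -> float:
    """Return the downtime budget, in minutes, implied by an uptime percentage."""
    return period_minutes * (1 - uptime_percent / 100)


MINUTES_PER_DAY = 24 * 60
MINUTES_PER_30_DAY_MONTH = 30 * MINUTES_PER_DAY  # assumption: a 30-day month

daily = allowed_downtime_minutes(99.5, MINUTES_PER_DAY)
monthly = allowed_downtime_minutes(99.5, MINUTES_PER_30_DAY_MONTH)

print(f"{daily:.1f} minutes per day")                # 7.2 minutes per day
print(f"{monthly / 60:.1f} hours per 30-day month")  # 3.6 hours per 30-day month
```

Note that the daily figure is an average. Nothing stops the whole monthly budget from being spent in a single outage lasting several hours.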

You can mitigate this risk using EC2 Auto Scaling groups (ASGs). An ASG is a self-maintaining group of EC2 instances, typically placed behind an Elastic Load Balancer (ELB). If one instance fails its health checks, the ELB stops routing traffic to it and sends requests to the remaining healthy instances while the ASG launches a replacement. Once the new instance comes online and passes its health checks, the ELB starts routing traffic to it, restoring the ASG to its desired number of instances.
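As a rough sketch of what such a group might look like in code, the helper below builds the keyword arguments for boto3’s `create_auto_scaling_group` call. The helper name, the default sizes, the grace period, and every ID in the example are hypothetical placeholders; the parameter names are those of the Auto Scaling API.

```python
from typing import Any


def build_asg_request(
    name: str,
    launch_template_id: str,
    subnet_ids: list[str],
    target_group_arns: list[str],
    min_size: int = 2,
    max_size: int = 6,
) -> dict[str, Any]:
    """Build keyword arguments for the Auto Scaling create_auto_scaling_group call."""
    if min_size < 2:
        raise ValueError("min_size below 2 leaves a single point of failure")
    if len(set(subnet_ids)) < 2:
        raise ValueError("provide at least two subnets, ideally in different availability zones")
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {"LaunchTemplateId": launch_template_id, "Version": "$Latest"},
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": min_size,
        # The API expects subnet IDs as one comma-separated string.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "TargetGroupARNs": target_group_arns,
        # Use load balancer health checks in addition to EC2 status checks.
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": 120,  # seconds; an assumed value, tune per workload
    }


# Placeholder identifiers for illustration only.
request = build_asg_request(
    name="web-asg",
    launch_template_id="lt-0123456789abcdef0",
    subnet_ids=["subnet-aaa", "subnet-bbb"],
    target_group_arns=["arn:aws:elasticloadbalancing:region:account:targetgroup/web/abc"],
)

# With AWS credentials configured, the request could be sent with:
#   import boto3
#   boto3.client("autoscaling").create_auto_scaling_group(**request)
```

Building the request separately from sending it keeps the redundancy checks testable without touching an AWS account.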

This is just one example of using a service’s own features to make it more resilient. Other services have similar mechanisms with slightly differing implementations. In Lambda, for example, instances of your application are replicated automatically based on demand until they meet a concurrency quota. In EC2-backed Elastic Kubernetes Service (EKS), scaling is built on EC2’s auto-scaling groups (although you still need to configure Kubernetes itself to replicate and scale your pods). For each service your team uses, learn what reliability risks you’re responsible for and what solutions are available.

What reliability risks should you be concerned with on AWS?

We’ll look at three key causes of failures in the cloud: limited redundancy, limited scalability, and accidental resource deletion.

Redundancy risks

Redundancy means you can lose a resource (such as a host, cluster, or availability zone) and your service(s) will remain operational. For some managed services, like DynamoDB, which replicates data across multiple availability zones within a region, redundancy is enabled by default. For other services, setting it up usually involves extra steps.

Let’s revisit EKS as an example. Kubernetes is built for redundancy:

  • You can create highly available clusters by replicating control plane nodes (the nodes that manage the Kubernetes cluster).
  • You can create as many replicas of a container as you want as long as your cluster has enough capacity.
  • You can replicate clusters across availability zones, regions, or cloud platforms.

However, none of these are enabled out-of-the-box. To implement redundancy in these ways on AWS, you’d need to:

  1. Set up the Cluster Autoscaler to automatically provision capacity as needed (or use EKS Auto Mode). Managed Kubernetes services like EKS will often manage control plane nodes for you or abstract them away entirely.
  2. Edit your Deployment’s replicas field, use horizontal pod autoscaling to set a minimum number of replicated Pods for Kubernetes to maintain, and/or use node affinity rules to specify how Pods should be scheduled onto nodes.
  3. Enable EKS Zonal Shift or similar features, or deploy a multi-region cluster.
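The replica-related steps above can be checked mechanically. The sketch below inspects a Deployment manifest (loaded as a Python dictionary) for two common redundancy gaps. The function name and the two-replica minimum are assumptions, and it looks for topology spread constraints, which are one of several scheduling mechanisms alongside the affinity rules mentioned above.

```python
from typing import Any

# Well-known node label that Kubernetes uses for availability zones.
ZONE_KEY = "topology.kubernetes.io/zone"


def redundancy_gaps(deployment: dict[str, Any]) -> list[str]:
    """Return a list of redundancy problems found in a Deployment manifest."""
    spec = deployment.get("spec", {})
    pod_spec = spec.get("template", {}).get("spec", {})
    gaps = []

    # Kubernetes defaults to a single replica when the field is omitted.
    if spec.get("replicas", 1) < 2:
        gaps.append("fewer than 2 replicas")

    constraints = pod_spec.get("topologySpreadConstraints", [])
    if not any(c.get("topologyKey") == ZONE_KEY for c in constraints):
        gaps.append("no zone spread constraint")

    return gaps


single = {"apiVersion": "apps/v1", "kind": "Deployment", "spec": {"template": {"spec": {}}}}

spread = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "spec": {
        "replicas": 3,
        "template": {
            "spec": {
                "topologySpreadConstraints": [
                    {
                        "maxSkew": 1,
                        "topologyKey": ZONE_KEY,
                        "whenUnsatisfiable": "DoNotSchedule",
                        "labelSelector": {"matchLabels": {"app": "web"}},
                    }
                ]
            }
        },
    },
}

print(redundancy_gaps(single))  # ['fewer than 2 replicas', 'no zone spread constraint']
print(redundancy_gaps(spread))  # []
```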

Note
For more on Kubernetes reliability, see our collection of Kubernetes-related blogs, tutorials, and videos.

Scalability risks

Scalable systems respond to changes in demand by increasing or removing capacity. Cloud platforms make scaling relatively easy—often as simple as checking a box or increasing a number—but this doesn’t always solve the problem.

One challenge is that scaling takes time. In EC2, creating an instance means allocating hardware, copying an operating system image (an Amazon Machine Image, or AMI), running setup scripts, and connecting it to various networks. While AWS is extremely fast, it can still take several minutes for a new instance to come online. In the meantime, you still need to serve your customers and handle increasing traffic.

One way to address this problem is to add a buffer to your scaling thresholds. For example, if your Kubernetes cluster normally scales up at 80% capacity, consider setting the trigger slightly lower, at 75% or 70%. When usage eventually increases to 80%, you’ll have already provisioned and deployed the extra capacity.
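One way to size that buffer is to estimate how much usage can grow while new capacity is being provisioned. This is a minimal sketch under a simplifying assumption of linear growth; the function name and the example growth rate and provisioning time are illustrative, not measured values.

```python
def scale_up_trigger(
    limit_percent: float,
    growth_percent_per_minute: float,
    provisioning_minutes: float,
) -> float:
    """Return the utilization at which to start scaling so capacity arrives before the limit."""
    # Usage expected to accumulate while the new capacity is still coming online.
    buffer = growth_percent_per_minute * provisioning_minutes
    return max(0.0, limit_percent - buffer)


# If usage climbs about 2 points per minute and new capacity takes 5 minutes to arrive,
# scaling needs to start 10 points early.
trigger = scale_up_trigger(limit_percent=80.0, growth_percent_per_minute=2.0, provisioning_minutes=5.0)
print(trigger)  # 70.0
```

Measuring your own provisioning time and peak growth rate gives a better buffer than picking a round number.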

The same goes for scaling down when demand drops. Cloud resources cost money, and unused resources are an unnecessary expense, which is why you should downscale once usage falls below a certain threshold. However, demand can change rapidly, and it’s possible that your systems will scale down just as usage starts to spike again. For this reason, it’s good to have a buffer for downscaling as well as upscaling. For example, instead of configuring your service to downscale as soon as CPU usage drops below 50%, consider adding a condition that requires usage to remain below 50% for a certain amount of time (e.g. one minute). This helps avoid constant upscaling and downscaling during dynamic periods while reducing operational costs. We cover this in more detail in our blog on cost-saving strategies.
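The "remain below the threshold for a while" condition can be sketched as a check over recent usage samples. The function name, the 15-second sampling interval, and the sample values are assumptions for illustration; managed autoscalers expose their own cooldown or stabilization settings for the same purpose.

```python
def should_scale_down(
    samples: list[float],
    threshold: float = 50.0,
    required_consecutive: int = 4,
) -> bool:
    """Return True only if the most recent samples are all below the threshold."""
    if len(samples) < required_consecutive:
        return False  # not enough history to justify removing capacity
    recent = samples[-required_consecutive:]
    return all(sample < threshold for sample in recent)


# With one sample every 15 seconds, 4 consecutive samples cover one minute.
steady_low = [62.0, 48.0, 45.0, 47.0, 44.0]
brief_dip = [48.0, 45.0, 55.0, 44.0, 43.0]

print(should_scale_down(steady_low))  # True
print(should_scale_down(brief_dip))   # False
```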

Accidental resource deletion risks

Accidentally deleting critical resources is more common than you’d think, especially in a complex environment like the cloud. Given enough time and a large enough team, even well-organized deployments will end up with dozens of undocumented instances, services, and other resources. When it’s time to clean up the environment, how do you know which resources can be discarded and which are critical?

Infrastructure as Code (IaC) tools like AWS CloudFormation and Terraform can reduce this risk. Since the environment is defined in code files, you have a reviewable record of what should exist, and any managed resource that you remove from the template is deleted the next time the changes are applied. However, this doesn’t prevent engineers from deploying ad-hoc changes or “temporary” systems that inadvertently become critical infrastructure. Additionally, an engineer could make a change to a code file without realizing the impact it will have. If they remove an elastic load balancer (ELB) from a CloudFormation template and apply the changes, AWS will happily remove the ELB regardless of its criticality. This leaves an open question: what can you do to prevent key resources from being deleted?

Fortunately, AWS also has the option to mark specific resources as critical. ELBs in particular have a deletion_protection.enabled attribute that prevents the resource from being deleted regardless of the user’s privileges or the method used. Once enabled, the only way to delete a resource with this flag is by manually disabling the flag first, adding just enough friction to prevent accidents. Other services have similar features: EC2, for example, has a DisableApiTermination attribute to prevent instances from being terminated and a DisableApiStop attribute to prevent them from being stopped. For other services, you'll want to check the AWS documentation for the relevant attribute(s).
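As a hedged sketch of how these flags might be set programmatically, the helpers below build the arguments for two boto3 calls: `modify_load_balancer_attributes` on the `elbv2` client (where the attribute key is `deletion_protection.enabled`) and `modify_instance_attribute` on the `ec2` client (using `DisableApiTermination`). The helper names, the ARN, and the instance ID are hypothetical placeholders.

```python
from typing import Any


def elb_deletion_protection_request(load_balancer_arn: str) -> dict[str, Any]:
    """Arguments for elbv2 modify_load_balancer_attributes that turn on deletion protection."""
    return {
        "LoadBalancerArn": load_balancer_arn,
        # Load balancer attribute values are passed as strings.
        "Attributes": [{"Key": "deletion_protection.enabled", "Value": "true"}],
    }


def ec2_termination_protection_request(instance_id: str) -> dict[str, Any]:
    """Arguments for ec2 modify_instance_attribute that turn on termination protection."""
    return {"InstanceId": instance_id, "DisableApiTermination": {"Value": True}}


# Placeholder identifiers for illustration only.
elb_request = elb_deletion_protection_request(
    "arn:aws:elasticloadbalancing:region:account:loadbalancer/app/web/abc"
)
ec2_request = ec2_termination_protection_request("i-0123456789abcdef0")

# With AWS credentials configured, the requests could be sent with:
#   import boto3
#   boto3.client("elbv2").modify_load_balancer_attributes(**elb_request)
#   boto3.client("ec2").modify_instance_attribute(**ec2_request)
```

If you manage these resources through an IaC tool, setting the equivalent property in the template is usually preferable, since a one-off API change can be reverted by the next deployment.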

How do you test for and mitigate cloud reliability risks?

While each cloud service has unique failure modes, most cases can be covered by a single standardized suite of tests. The challenge is finding the right tool to test for these failure modes.

Some providers have built reliability testing directly into their platforms. AWS has Fault Injection Service (FIS), which can inject faults directly into the AWS API. For example, FIS can pause replication between DynamoDB tables. However, injecting faults is only one of the many features a reliability testing solution needs to provide. For a reliability testing solution to work well, it also needs to:

  • Apply a standard set of experiments across services and/or teams regardless of infrastructure (e.g. EC2 instances, ECS containers, or EKS clusters).
  • Continually monitor your environment for emergent reliability risks.
  • Integrate with your existing tools, including observability tools like CloudWatch, Datadog, or New Relic.

Gremlin provides a formalized and standardized approach to reliability testing. The first step is finding Detected Risks, which are high-priority reliability concerns in your configuration. Gremlin automatically looks for several AWS-specific risks, including:

  • Services running in a single availability zone (AZ).
  • Load balancers with cross-zone load balancing disabled.
  • Load balancers with deletion protection disabled.

For EKS clusters, Gremlin also detects failed Pods, un-schedulable Pods, Pods stuck in a crash loop, Deployments with missing CPU or memory limits, and more. These risks are constantly re-evaluated, helping you stay ahead of regressions.

In addition to Detected Risks, you can actively test your systems using fault injection. This tests how your cloud resources respond to failure conditions such as high latency, dependency failures, and saturated resources. Gremlin comes with the Well-Architected Cloud Test Suite, which is a collection of pre-built, ready-to-run tests that cover the most common AWS failure modes. Whether your team runs EC2 instances, EKS clusters, or ECS containers, you can use the Well-Architected Cloud Test Suite to run the same tests across your environment and get a single standardized score.

Screenshot of the Gremlin web app showing the Well-Architected Cloud test suite.

Gremlin also natively integrates with several observability tools—including CloudWatch—for monitoring your services. For AWS services, Gremlin can automatically create Intelligent Health Checks. These are automated checks that Gremlin performs during tests to ensure a service is healthy. Intelligent Health Checks track each service’s golden signals—latency, error rate, and request rate—and set reasonable thresholds for failure without you having to configure anything.

Start improving reliability on AWS

If you want to start testing the reliability of your AWS workloads, you can try Gremlin for free for 30 days. You get full access to the platform, including Detected Risks and the Well-Architected Cloud Test Suite. Detected Risks give you near-instantaneous insights into your reliability risks, and the test suite provides guidance on what risks to test for next.

If you’d like to see what Gremlin looks like before signing up for a trial, check out our AWS product tour and see how easy it is to onboard your AWS workloads into Gremlin.

Andre Newman
Sr. Reliability Specialist