The way the story goes is that in the old days, Amazon used to cut power to data centers to see whether their services were actually redundant across data centers, and that they only abandoned the practice when EC2 customers started to complain (no matter how many times they were warned their instances might disappear without notice). The story may be apocryphal, but you don’t need a power outage to lose a data center. Something as simple as a low level of packet loss (say 2-4%) in an environment with large fan-out and retries can fill up your switch buffers, overload your network, and take down the entire data center.

Having made software redundant across data centers, engineers quickly found a variety of outages they could now survive: a bad config pushed to a specific region, an issue with a zone-specific dependency, exhausted resources or quotas. Being able to pull your software out of a data center has a lot of additional benefits. If there’s a problem local to a specific region, you can fail out of that region for immediate remediation. Over time, this idea of being redundant across different data centers (or zones, as many cloud offerings call them) has become a best practice across the industry. But how do you know you’re actually redundant?

Too often, operators have spent the time and effort to replicate their service across different zones or regions, only to have one of those data centers go down and suddenly find out that they are not resilient. This can happen because a critical dependency is only available in a single zone, because the fleet is not evenly balanced, or because the service cannot tolerate a split brain. The lesson is straightforward: being present in multiple regions is not the same thing as being resilient to zonal failure.

Like a lot of things in software engineering, just because we intend things to work a certain way doesn’t mean they will. Fortunately, we have a useful tool for closing the gap between believing you’re resilient and knowing you’re resilient: we can test it.

Provisioning for Redundancy

Consider a service deployed across two zones running at 75% capacity utilization. We’d like to think this is redundant, but what happens when we lose one of those zones entirely? All of the traffic is routed to the remaining zone, which has only half the allocated hosts but must now handle 100% of the requests, putting it at 150% capacity utilization.
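To make the arithmetic concrete, here’s a quick back-of-the-envelope check in plain Python, using the two-zone, 75% numbers from above:

```python
# What happens to each surviving zone's utilization when one of N zones is lost,
# assuming traffic and hosts were evenly spread across zones?
def surviving_zone_utilization(num_zones: int, current_utilization: float) -> float:
    return current_utilization * num_zones / (num_zones - 1)

print(surviving_zone_utilization(2, 0.75))  # 1.5   -> 150% utilization: under water
print(surviving_zone_utilization(3, 0.75))  # 1.125 -> still over capacity
```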

You might think reactive autoscaling will help with this problem, but consider all the other services at your company (or at other companies, if you’ve contracted with a cloud provider) that also lost that zone and are also looking to replace their capacity. All of a sudden, this service is struggling to scale up.

Conventional wisdom here is to run excess capacity (over-provisioning) so that the loss of a single zone does not put your service under water (e.g., if you’re in 3 zones, run 150% of the machines you need, keeping them at 67% utilization for whatever your limiting resource is). The formula is: target utilization = (number of zones - 1) / (number of zones). This guarantees you can continue to serve the same rate of traffic even after losing a single zone. Astute observers will note that the over-provisioning overhead, and with it your capital expenditure, shrinks the more zones you’re redundant across (a significant concern for most organizations).
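As a sketch, here’s that formula in a few lines of Python, along with the over-provisioning factor it implies for a few zone counts:

```python
# Target per-zone utilization that still survives the loss of one zone,
# and the corresponding over-provisioning factor for the fleet.
def target_utilization(num_zones: int) -> float:
    return (num_zones - 1) / num_zones

def overprovision_factor(num_zones: int) -> float:
    return 1 / target_utilization(num_zones)

for zones in (2, 3, 4, 6):
    print(f"{zones} zones: run at {target_utilization(zones):.0%} utilization, "
          f"provision {overprovision_factor(zones):.2f}x the baseline fleet")
# 2 zones: 50% / 2.00x, 3 zones: 67% / 1.50x, 4 zones: 75% / 1.33x, 6 zones: 83% / 1.20x
```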

Validating Zone Redundancy

How do we validate this? First, we apply load (artificial or otherwise) to the service under test and wait for any scaling to complete so that we’re at our target utilization. Then we block network traffic from the service to all machines in the “impacted” zone. As we’ll see later, it’s important to block all traffic from the service under test to that zone, even to IPs or machines that aren’t part of the test.
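How you implement the block depends on your tooling. As one illustrative sketch, on Linux hosts you could drop outbound packets to the impacted zone’s address ranges with iptables; the CIDR blocks below are placeholders, not real zone ranges, and the commands need root:

```python
# Sketch: drop all outbound traffic from this host to the impacted zone's
# CIDR blocks. Substitute the address ranges your provider publishes for
# the zone you're failing; run as root (or via sudo).
import subprocess

IMPACTED_ZONE_CIDRS = [
    "10.0.64.0/18",    # placeholder range for the "impacted" zone
    "10.0.128.0/18",   # placeholder range
]

def block_zone(cidrs):
    """Insert DROP rules so nothing on this host can reach the impacted zone."""
    for cidr in cidrs:
        subprocess.run(["iptables", "-I", "OUTPUT", "-d", cidr, "-j", "DROP"], check=True)

def unblock_zone(cidrs):
    """Remove the rules when the experiment ends."""
    for cidr in cidrs:
        subprocess.run(["iptables", "-D", "OUTPUT", "-d", cidr, "-j", "DROP"], check=True)
```

Blocking by address range rather than by a list of known hosts is also what keeps the experiment honest when new instances appear in the impacted zone mid-test, a failure mode we’ll come back to below.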

What we expect to see is utilization in the un-impacted zones increase while overall error rates remain minimal (or at least do not increase from before the experiment). What you may find in your own testing (as operators frequently do) is that your scaling curve with respect to utilization is non-linear.
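One way to turn that expectation into a pass/fail check is to compare the service’s error rate before and during the experiment. The sketch below assumes a Prometheus server and a conventionally labelled request counter; the endpoint, metric names, job label, and threshold are placeholders for whatever your own monitoring exposes:

```python
# Sketch: compare the service's error rate before and during the zone failure test.
# Assumes a Prometheus HTTP API; metric names and labels are illustrative.
import requests

PROMETHEUS = "http://prometheus.internal:9090/api/v1/query"  # placeholder endpoint
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{job="checkout",status=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total{job="checkout"}[5m]))'
)

def error_rate() -> float:
    result = requests.get(PROMETHEUS, params={"query": ERROR_RATE_QUERY}).json()
    samples = result["data"]["result"]
    return float(samples[0]["value"][1]) if samples else 0.0

baseline = error_rate()       # capture before blocking the zone
# ... block the impacted zone and wait for traffic to shift ...
during = error_rate()
assert during <= max(baseline, 0.001), "error rate rose during the zone failure test"
```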

As experienced operators will note, as more load is pushed from the impacted to the un-impacted machines, request latency increases, which lowers your throughput. With new requests arriving while existing requests take longer, the effect on hardware utilization is super-linear (sometimes geometric) until autoscaling catches up. This is a good test of how your autoscaling system responds under such conditions: it may over-scale, or you may find it isn’t responsive enough and you need to increase your resident fleet size.

It’s important to apply load, before starting the experiment, in a way that allows you to evaluate the availability of the service under test. Without it you may not properly exercise dependencies or scaling logic, which quickly leads to a service that passes tests without load but performs dramatically differently under real-world conditions and is no longer resilient.

In cases where it’s not reasonable to use customer traffic, make sure your artificial load exercises the full breadth of your service while keeping the blend of traffic representative of what you see in production. This is especially true for services with a heterogeneous blend of traffic, where different requests have dramatically different performance. You may also have dramatically different blends at certain times or for certain events, and you may want to test multiple times to exercise them all.
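As a minimal sketch of what that might look like, here’s a toy load generator that samples request types according to weights derived from production traffic. The endpoints, weights, and target URL are illustrative, and a real test would use a purpose-built load tool, but the weighting idea carries over:

```python
# Toy load generator that keeps the traffic blend representative of production.
# Endpoints, weights, and the target URL are placeholders.
import random
import time
import requests

TARGET = "https://staging.example.com"   # hypothetical environment under test
TRAFFIC_BLEND = {
    "/search":     0.60,   # fraction of production traffic by request type
    "/product/42": 0.30,
    "/checkout":   0.10,   # rare but expensive requests still need coverage
}

def send_representative_load(requests_per_second: float, duration_s: int) -> None:
    endpoints = list(TRAFFIC_BLEND)
    weights = list(TRAFFIC_BLEND.values())
    deadline = time.time() + duration_s
    while time.time() < deadline:
        path = random.choices(endpoints, weights=weights, k=1)[0]
        try:
            requests.get(TARGET + path, timeout=5)
        except requests.RequestException:
            pass  # errors are measured on the service side; keep the load steady
        time.sleep(1.0 / requests_per_second)

send_representative_load(requests_per_second=20, duration_s=600)
```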

Configuring for Zonal Failure

As you might imagine, staying well balanced across zones is a key component of being prepared for zonal failure. Unfortunately, this is a resilience mechanism that can also work against you during an outage. An outage that impacts your service in a single zone without taking that zone down entirely can cause auto-balancing mechanisms to keep trying to allocate hosts in the impacted zone before adding capacity in the remaining zones, limiting your ability to respond.

This subtle misconfiguration is the default behavior in some cloud providers. It’s important to test how your service scales while it’s impacted, typically by increasing load to force the scaling mechanism (or by manually setting lower utilization targets and observing how it responds). Success here means the impacted service is able to bring up new instances in the un-impacted zones.
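That success criterion can be checked directly. The sketch below assumes an AWS Auto Scaling group and boto3, with an illustrative group name and zone; it counts in-service instances per zone before and after scaling is triggered, and flags the run if no new capacity came up outside the impacted zone. (On AWS specifically, the ASG’s AZRebalance process is what tries to keep zones balanced, and it can be suspended during an incident.)

```python
# Sketch: verify that scaling during the experiment lands in un-impacted zones.
# Assumes AWS and boto3; the group name and impacted zone are illustrative.
from collections import Counter
import boto3

ASG_NAME = "checkout-service-asg"   # hypothetical Auto Scaling group
IMPACTED_ZONE = "us-east-1a"        # zone being failed in the test

def in_service_instances_by_zone() -> Counter:
    client = boto3.client("autoscaling")
    group = client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME]
    )["AutoScalingGroups"][0]
    return Counter(
        inst["AvailabilityZone"]
        for inst in group["Instances"]
        if inst["LifecycleState"] == "InService"
    )

before = in_service_instances_by_zone()
# ... raise load or lower the utilization target to trigger scaling ...
after = in_service_instances_by_zone()

growth_elsewhere = sum(after[z] - before[z] for z in after if z != IMPACTED_ZONE)
assert growth_elsewhere > 0, "no new capacity came up outside the impacted zone"
```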

Testing Best Practices

Another issue operators who’ve only recently added zonal redundancy run into is having their service be redundant while still sending cross-zone traffic to their dependencies. It typically manifests like this: imagine a service spread across 4 zones, one of which experiences an outage. If the service is scaled correctly, you should see minimal errors. However, if the instances in the remaining zones are talking cross-zone to a dependency whose instances in the impacted zone have failed, the service will see significant errors on roughly 25% of requests, because a quarter of the traffic served by instances in the remaining 3 zones is still trying to reach the critical dependency in the impacted zone. This is part of why it’s so important to block all traffic from your service to the impacted zone, not just traffic between hosts belonging to your service.
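A quick simulation makes the math tangible (the zone names are illustrative):

```python
# Simulate the 4-zone example: the service itself survives the zone loss, but
# its dependency calls are spread across all 4 zones, including the dead one.
import random

ZONES = ["zone-a", "zone-b", "zone-c", "zone-d"]
IMPACTED = "zone-a"

errors = 0
trials = 100_000
for _ in range(trials):
    dependency_zone = random.choice(ZONES)   # cross-zone load balancing to the dependency
    if dependency_zone == IMPACTED:
        errors += 1                          # the dependency in the dead zone never answers

print(f"error rate: {errors / trials:.1%}")  # ~25%, despite the service being "redundant"
```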

It’s also possible to have the opposite problem. Say we run these tests but only block traffic to the hosts that exist in the impacted zone when the test starts. As regular scaling occurs, new dependency hosts will appear in the impacted zone that aren’t included in the test’s block list. This invalidates the test results and, if not caught, can lead to false positives (thinking you’re resilient when you’re not).

Know You’re Resilient to Zonal Failures

Zone redundancy is a powerful tool for making a service resilient, but there is a significant gap between putting instances in multiple zones and actually being resilient to zonal failures. Reliability Engineering gives us the tools we need to not just believe we’re zone redundant, but to prove it and know for sure.

Sam Rossoff
Principal Software Engineer