How a major retailer tested critical serverless systems with Failure Flags

Not too long ago, a customer came to us with a high-value use case. The customer, a major apparel company with retail and e-commerce applications, needed to prove that a critical service in their payment applications could fail over correctly between regions in case of an outage.

But there was one snag: the service was built using AWS Lambda.

Because of Lambda’s serverless model, infrastructure-focused tests would have trouble replicating the failure conditions needed to test the failover.

Fortunately, we had a solution. Using Failure Flags, they were able to get set up and accurately run the failover test in less than 30 minutes.

Let’s look at how it all went down!

Customer: What happens when a Lambda region fails?

Any experiment starts with a question you need answered. In this case, the customer wanted to know what would happen if a region of their AWS Lambda-based application became unavailable. If they’d set everything up correctly, the application would fail over between regions and keep running, avoiding an outage.

But that’s just scratching the surface. To ensure everything was running as expected, they needed to assess not only whether the region failover worked but also how it worked. This included getting data like how quickly it failed over, how and when alerts triggered, and more.

The only way to know this for sure is to simulate the failure of a region, record how the system responded, and compare the results against expectations.
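
As a rough illustration of what comparing results against expectations can look like, the sketch below computes how long a failover took from a series of health-probe samples. The `Probe` record, the region names, the sample data, and the 60-second target are all hypothetical and are not taken from the customer's setup.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Probe:
    """One synthetic health check (hypothetical shape, for illustration)."""
    t: float      # seconds since the experiment started
    region: str   # region that answered the request, "" if none did
    ok: bool      # whether the request succeeded


def failover_seconds(probes: List[Probe], primary: str) -> Optional[float]:
    """Time from the first failed probe to the first later success served
    by a region other than `primary`. None if either never happened."""
    first_failure = next((p.t for p in probes if not p.ok), None)
    if first_failure is None:
        return None
    recovered = next(
        (p.t for p in probes
         if p.ok and p.region != primary and p.t >= first_failure),
        None,
    )
    return None if recovered is None else recovered - first_failure


# Made-up samples: the primary region stops answering at t=10.
samples = [
    Probe(0, "us-east-1", True),
    Probe(10, "", False),
    Probe(20, "", False),
    Probe(30, "us-west-2", True),
]
elapsed = failover_seconds(samples, primary="us-east-1")
TARGET_SECONDS = 60  # assumed objective, not a figure from the article
print(f"failover took {elapsed}s, within target: {elapsed <= TARGET_SECONDS}")
```

The same probe log could also be checked against alert timestamps to see how and when alerts triggered relative to the first failure.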

The elephant in the room: you can’t just make AWS Lambda fail

The whole point of using serverless systems like AWS Lambda is that everything is abstracted away and managed for you. On the plus side, this means you don’t have to worry about server or infrastructure resource allocation. But you also don’t have granular control of the underlying resources and instances.

When a service is built directly on servers, such as Amazon EC2 instances, testing region failover is fairly straightforward. Using Fault Injection, you simulate a server or group of servers being unavailable, for example with a Blackhole experiment.

But with a managed service, you don’t get access to the underlying infrastructure, which means the only way to test failure at the infrastructure level would be to inject faults directly into the servers that run Lambda.

AWS, rightly so, isn’t going to allow that. And that’s where Failure Flags comes in.

Failure Flags is specifically designed for application-level testing on serverless, container, and similarly managed environments. It can limit failures to one specific application, allowing you to test what happens if a region is unavailable without impacting AWS Lambda.
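
To make the idea of scoping concrete, here is a minimal sketch of label-based fault matching combined with a simple region failover loop. It is not the Gremlin SDK: `Experiment`, `matches`, `call_region`, and `call_with_failover` are hypothetical names, and the service and region labels are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Experiment:
    """A fault scoped by labels (hypothetical, not Gremlin's data model)."""
    selector: Dict[str, str]


def matches(exp: Experiment, labels: Dict[str, str]) -> bool:
    # The fault applies only when every selector label matches the caller.
    return all(labels.get(k) == v for k, v in exp.selector.items())


def call_region(service: str, region: str, active: List[Experiment]) -> str:
    labels = {"service": service, "region": region}
    if any(matches(e, labels) for e in active):
        # Simulated regional outage, raised inside the application itself.
        raise ConnectionError(f"{service} unavailable in {region}")
    return f"{service} served from {region}"


def call_with_failover(service: str, regions: List[str],
                       active: List[Experiment]) -> str:
    # Try regions in order and fall through to the next one on failure.
    for region in regions:
        try:
            return call_region(service, region, active)
        except ConnectionError:
            continue
    raise ConnectionError(f"{service} unavailable in all regions")


active = [Experiment({"service": "payments", "region": "us-east-1"})]
regions = ["us-east-1", "us-west-2"]
payments_result = call_with_failover("payments", regions, active)
catalog_result = call_with_failover("catalog", regions, active)
print(payments_result)  # the targeted service fails over
print(catalog_result)   # an untargeted service is unaffected
```

Because the fault is matched on labels inside the application, only the targeted service sees the simulated outage; nothing about the platform itself is disturbed.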

Setting up and testing in 30 minutes with Failure Flags

Like feature flags, Failure Flags lets you perform experiments on specific parts of your services and applications with minimal impact to your application code and no performance impact when disabled. On AWS Lambda, it’s deployed by using a Lambda Extension.

The extension can be added to any Lambda Function without impacting function, availability, or performance. You configure it by using environment variables or configuration files, and it will only run a test when it receives instructions from the Gremlin Control Plane.
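
The "no impact when disabled" behavior can be pictured with a small sketch. Everything here is hypothetical: `DEMO_FLAGS_ENABLED` is an invented environment variable, `fetch_instructions` stands in for however the extension receives instructions from the control plane, and none of it reflects Gremlin's actual implementation or API.

```python
import os
import time
from typing import Callable, Dict, Mapping, Optional

# Stand-in for control-plane instructions: a dict of fault settings or None.
Instructions = Optional[Dict[str, float]]


def invoke_flag(name: str,
                fetch_instructions: Callable[[str], Instructions],
                env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if a fault was injected, False if the call was a no-op."""
    env = os.environ if env is None else env
    if env.get("DEMO_FLAGS_ENABLED") != "true":
        return False  # disabled: do nothing at all
    instructions = fetch_instructions(name)
    if instructions is None:
        return False  # enabled, but no experiment is running
    time.sleep(instructions.get("latency_seconds", 0.0))
    if instructions.get("fail", 0.0):
        raise RuntimeError(f"injected failure at flag '{name}'")
    return True


def handler(event, fetch_instructions, env=None):
    # A Lambda-style handler with one flag placed before the real work.
    invoke_flag("process-payment", fetch_instructions, env)
    return {"statusCode": 200}


def no_experiment(name: str) -> Instructions:
    return None


def outage(name: str) -> Instructions:
    return {"fail": 1.0}


# With the flag disabled, even an active "outage" leaves the handler alone.
print(handler({}, outage, env={}))
```

The design point is that the flag is a cheap early return unless it is both enabled and has received instructions, which is why it can sit in production code paths.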

Using this approach, the team of engineers was able to set up Failure Flags on their Lambda application, configure a region failover testing scenario, and then run the test. All in under 30 minutes. In fact, the experiment itself only took about 5 minutes.

The result? Proof that the failover worked correctly. And those results could be shared with leadership and the wider team to show compliance with reliability standards.

But it also revealed more than that. While the failover worked correctly, there was some anomalous data that pointed to other issues. So not only were they able to prove the resilience of a critical service (and avoid the risk of millions in lost sales), but analysis of the results also revealed further opportunities to improve performance.

This test was only the beginning

The speed, accuracy, and results of the test helped the customer improve the performance and resilience of their systems. But they’re just getting started. After this quick experience, they decided to roll out Gremlin and reliability tests across all of their deployment environments, including Lambda, EKS, ECS, and more.

Ready to see how easy it can be to improve resilience? Check out our interactive Failure Flags demo to see how to use Gremlin Failure Flags to run Chaos Engineering experiments on a serverless application.
