Test serverless and application-level reliability with Failure Flags

It’s been a year and a half since Failure Flags was released. Since then, customers have used Failure Flags to run thousands of tests on applications running on serverless platforms, containers, and service meshes. (Check out this blog post to see how a major retailer set up and tested a critical service on AWS Lambda in less than 30 minutes.)

We’ve also been hard at work improving Failure Flags’ capabilities and ease of use, which is why it’s time to officially announce that it’s out of Beta!

Let’s take a brief look at Failure Flags, how it works, and some of the significant improvements from the last year.

Resilience tests for serverless, containers, Kubernetes, and service mesh

Failure Flags lets you run tests at the application level, which is essential for managed services where the infrastructure layer is abstracted away from your control. It does this by using three components: the Gremlin SaaS API, the Failure Flags Sidecar or Lambda Extension, and one of the SDKs integrated into your application code.

This combination allows you to run tests on applications running on serverless platforms, containers, Kubernetes, and service meshes.

The interaction between these three pieces is essential for safety: it ensures your application isn’t impacted except during a test. Once installed, the sidecar/extension and SDK sit passively unless an experiment is running in Gremlin, so they can safely remain in any application.
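
The passive-by-default behavior can be pictured with a minimal Python sketch. This is not the Gremlin SDK: `Experiment`, `matching_experiments`, and the label names are hypothetical, and a real SDK would learn about experiments through the sidecar or extension rather than from a local list.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """Hypothetical stand-in for an experiment that is currently running."""
    flag_name: str
    labels: dict = field(default_factory=dict)  # attributes a call must match


def matching_experiments(active, flag_name, labels):
    """Return the active experiments that target this flag and these labels."""
    return [
        exp for exp in active
        if exp.flag_name == flag_name
        and all(labels.get(key) == value for key, value in exp.labels.items())
    ]


# With no experiment running, nothing matches and the code path is untouched.
print(matching_experiments([], "http-ingress", {"method": "POST"}))  # prints []
```

Because an empty result means "do nothing", the check can stay in the application between tests.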

Create application errors, latency, and data issues with Failure Flags

Failure Flags gives you the capability to create latency, cause specific error codes, and, in Node.js, modify data such as variables. Application errors caused by these issues represent the bulk of the problems that teams deal with day-to-day, including:

Incorrect or corrupt data
Customer-specific failures
Lock-contention on hot data
Breaking API changes
Unexpected API responses
Partial service failures
Message double-delivery or ordering issues

Beyond specific errors, you can use Failure Flags to test how your application interacts with other parts of your system, allowing you to verify key system parts like observability and alert configuration or automated recovery systems.
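
As a rough illustration of what latency and error effects mean at the application level, here is a small Python sketch. The effect keys (`latency_ms`, `error`), the `InjectedFault` exception, and the `get_order` handler are all assumptions made for this example, not the format or API that Failure Flags uses.

```python
import time


class InjectedFault(Exception):
    """Error raised on purpose while a (hypothetical) experiment is active."""


def apply_effect(effect, sleep=time.sleep):
    """Apply one illustrative effect dict to the current call.

    Assumed keys (not Gremlin's schema): "latency_ms" and "error".
    """
    latency_ms = effect.get("latency_ms", 0)
    if latency_ms > 0:
        sleep(latency_ms / 1000.0)  # delay the caller before continuing
    if "error" in effect:
        raise InjectedFault(effect["error"])  # surface a specific failure


def get_order(order_id, effects=()):
    """Toy handler: any effects run first, then the normal code path."""
    for effect in effects:
        apply_effect(effect)
    return {"order_id": order_id, "status": "ok"}
```

Calling `get_order(7)` with no effects returns the normal response, while passing `effects=[{"error": "503 from inventory"}]` raises `InjectedFault`. That makes it possible to check whether callers, alerts, and recovery logic react the way you expect.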

Run Failure Flags in Node.js, Python, Java, Go—and .NET

You can install Failure Flags using an SDK. These are designed to be fail-safe if the agent is misconfigured, can't communicate with your application, or can't communicate with the Gremlin API. That means you can leave the SDK in your application without worrying about it impacting anything outside of your experiment’s parameters.
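
One way to picture the fail-safe design is a lookup that treats every failure as "no experiment". The Python sketch below is an assumption about the general pattern, not the SDK’s actual code; `fetch_experiments_safely` and `unreachable_sidecar` are hypothetical names.

```python
import logging

logger = logging.getLogger("failure_flags_sketch")


def fetch_experiments_safely(fetch, flag_name):
    """Call `fetch` and fall back to "no experiments" on any failure."""
    try:
        return list(fetch(flag_name))
    except Exception:
        # Misconfiguration, timeouts, and unreachable services all land here.
        logger.debug("experiment lookup failed; continuing normally", exc_info=True)
        return []


def unreachable_sidecar(flag_name):
    """Simulates a sidecar that cannot be reached."""
    raise ConnectionError("sidecar not reachable")


print(fetch_experiments_safely(unreachable_sidecar, "http-ingress"))  # prints []
```

The application only ever sees an empty list in the failure case, so a broken or missing sidecar degrades to normal behavior rather than to an outage.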

Failure Flags has SDKs for the most common serverless and managed container languages, including Node.js, Python, Java, and Go. We’re pleased to announce that the .NET SDK is now available too!

Set up Failure Flags experiments in the Gremlin UI

When Failure Flags first launched in Beta, experiments had to be manually set up using JSON, but that changed in the second half of 2024. Now, you can select your Failure Flag, attributes, services, and effects using drop-down boxes. There’s still a JSON tab if you would prefer to create experiments that way, and any changes you make in one are reflected in the other.
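
To give a feel for how drop-down selections and a JSON document can be two views of the same experiment, here is a purely illustrative Python sketch. Every field name below is hypothetical and chosen only to mirror the flag, attribute, service, and effect choices mentioned above; consult the Gremlin documentation for the real schema.

```python
import json

# Hypothetical experiment definition; field names are illustrative only.
experiment = {
    "name": "checkout-latency",
    "failure_flag": "http-ingress",
    "service_selector": {"service": "checkout"},
    "attribute_selector": {"method": ["POST"]},
    "effect": {"latency_ms": 500},
}

# Serialize to JSON (the "JSON tab" view) and parse it back (the "form" view).
as_json = json.dumps(experiment, indent=2, sort_keys=True)
round_tripped = json.loads(as_json)

# Both views describe the same experiment, so nothing is lost in either direction.
assert round_tripped == experiment
print(as_json)
```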

Failure Flags is GA…and we’re just getting started

With all the customer-led improvements and optimizations over the last year, we’re pleased to make Failure Flags Generally Available for all customers.

And we won’t stop here! We’re currently hard at work on even more Failure Flags improvements to help make it easier for you to not only run application-level resilience tests, but also to standardize those tests so you can scale your efforts across your organization.

Want to see what the fuss is about? Check out the interactive Failure Flags walkthrough below, or contact us to set up a demo!
