What is Failure Flags? Build testable, reliable software—without touching infrastructure
Building provably reliable systems means building testable systems. Testing for failure conditions is the only way to reliably root out issues before they impact customers.
However, most current Chaos Engineering and resilience testing is focused on the underlying infrastructure. This helps identify potentially catastrophic failures, but misses the more frequent failures that still significantly impact customer experience. Further, infrastructure-focused testing is often not feasible in serverless environments, or on teams that don’t manage the underlying infrastructure.
Failure Flags is Gremlin’s approach to solving that issue. In short, it makes it possible to test your software’s failure modes without compromising that software or its security, and without outsized efforts to build specialized testing environments.
The Failure Flags SDK makes software testable
Failure Flags is a software-level library to help make applications and data testable. Just as Feature Flags enable software teams to safely roll out new features at specific points of their software, Failure Flags enable those same teams to safely inject fault conditions into their applications to understand how they perform under nonideal conditions. The result? Confident deployments that can only come from teams who are building and testing software with failure modes in mind.
Like a Feature Flag, a Failure Flag is a named point in your software; it marks a spot where a failure might occur. Add Failure Flags to your code using the Failure Flags SDK (JavaScript, Java, Python, and Go), configure a container sidecar (on ECS or Kubernetes) or Lambda extension, and you’re ready to go.
When you’re ready to start testing, you or your CI/CD system can define and run experiments using the Gremlin web app or APIs. The details of each experiment, like the amount of latency to add, the exceptions to throw, or other test configuration, are specified as part of the experiment, not the Failure Flag itself. Once you’ve added Failure Flags to your code, you can run any number or type of experiments.
Unlike with Feature Flags, you don’t need to write the code simulating adverse behavior yourself, but you can extend the failure types and even define your own if you need something special. Because the fault details live in the experiment rather than in your code, you can create more powerful and expressive experiments with Failure Flags than with any other fault-injection system available today.
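To make the split between a flag and an experiment concrete, here is a minimal sketch in Python. It does not use the Gremlin SDK: `Experiment`, `ACTIVE_EXPERIMENTS`, `failure_flag`, and `fetch_profile` are hypothetical stand-ins invented for illustration. The point is only that the flag is a bare name, while the latency and exception details come from whatever experiment happens to be active.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Experiment:
    """Hypothetical experiment: names the flag it targets and the fault to apply."""
    flag_name: str
    latency_ms: int = 0
    exception: Optional[Exception] = None


# Hypothetical in-memory stand-in for experiments defined outside the application.
ACTIVE_EXPERIMENTS = []


def failure_flag(name: str) -> None:
    """A named point in the code. It carries no fault details of its own."""
    for experiment in ACTIVE_EXPERIMENTS:
        if experiment.flag_name != name:
            continue
        if experiment.latency_ms:
            time.sleep(experiment.latency_ms / 1000)
        if experiment.exception is not None:
            raise experiment.exception


def fetch_profile(user_id: str) -> dict:
    # The flag only marks the spot; with no matching experiment it does nothing.
    failure_flag("profile-service")
    return {"id": user_id, "name": "example"}
```

With this shape, changing what a test does (more latency, a different exception) means changing the experiment, not redeploying the application.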
Instrumenting and using Failure Flags
Failure Flags is all about making your actual application’s failure modes testable, but it isn't particularly opinionated on how or when the tests are run.
It’s best to add Failure Flags around calls to your network (or other I/O) dependencies and to name each Failure Flag after the dependency it gates. Beyond those points, we also recommend adding them to the beginning of your request handlers. To minimize code changes, it can also help to add Failure Flags to any common libraries you use to abstract calls to your dependencies.
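One way to follow that placement advice with little code impact is a decorator in a shared library. The sketch below is an illustration under stated assumptions, not the SDK’s API: `invoke_flag` is a hypothetical hook that only records calls here (in a real service it would call your SDK’s flag-invocation function), and `gated`, `load_stock`, and `handle_request` are made-up names.

```python
import functools
from typing import Callable

# Records every flag invocation so the example can be inspected without an SDK.
INVOCATIONS = []


def invoke_flag(name: str, labels: dict) -> None:
    """Hypothetical hook standing in for a real flag-invocation call."""
    INVOCATIONS.append((name, labels))


def gated(dependency: str) -> Callable:
    """Invoke a flag named after the dependency before each call to the wrapped function."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            invoke_flag(dependency, {"function": fn.__name__})
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@gated("inventory-db")  # flag named after the dependency it gates
def load_stock(sku: str) -> dict:
    return {"sku": sku, "count": 3}


@gated("http-ingress")  # flag at the beginning of a request handler
def handle_request(sku: str) -> dict:
    return load_stock(sku)
```

Calling `handle_request("a")` invokes the `http-ingress` flag first and the `inventory-db` flag second, so an experiment could target either point independently.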
With your code instrumented, you or your CI/CD system can define and run experiments using the Gremlin dashboard or APIs.
Identifying failures with Failure Flags
With Failure Flags you can target specific applications by name, or deployment features like cloud or region. You can target Failure Flags specific to an application or across applications. And you can target Failure Flags based on runtime context. This means that Failure Flags can do what no other fault injection platform can: let you fine-tune the scale and impact of experiments across your entire infrastructure.
They can help you experiment with:
- Dependency loss or latency
- Database table locking issues
- Hot partitions
- Malformed messages and dead letter queue configurations
- API gateways or reverse proxies
- Request or response corruption, bad response codes
- Isolated impact to specific, well-known users
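Targeting on runtime context can be pictured as matching an experiment’s selector against labels supplied at the flag. The following is a minimal sketch of that idea, assuming a selector is a mapping from label name to a list of accepted values; the `matches` function and the label names are hypothetical and do not describe how Gremlin evaluates selectors.

```python
def matches(selector: dict, labels: dict) -> bool:
    """Return True when every selector key appears in labels with an accepted value."""
    return all(labels.get(key) in accepted for key, accepted in selector.items())


# Hypothetical experiment selector: one region, one well-known test user.
selector = {"region": ["us-west-2"], "customer_id": ["test-user-42"]}

# Hypothetical labels gathered from runtime context at the flag.
test_user_request = {"region": "us-west-2", "customer_id": "test-user-42", "method": "GET"}
other_user_request = {"region": "us-west-2", "customer_id": "customer-7", "method": "GET"}

print(matches(selector, test_user_request))   # True: every selector key matches
print(matches(selector, other_user_request))  # False: customer_id is not accepted
```

Narrowing the selector is what keeps the impact isolated: in this sketch, only requests from the named test user in the named region would see the fault.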
Failure Flags is safe and portable
Failure Flags is designed for safety and has negligible impact when experiments are not running or when a configuration issue prevents communication with Gremlin. Once you add Failure Flags to your code, you can leave them in even when deploying to environments where you never want to run experiments. To be sure they never affect your application in such an environment, simply omit part of the configuration, omit the sidecar from your deployment, or even block the Gremlin API at the firewall level.
Further, Failure Flags are fail-safe: any disruption in service results in Failure Flags stopping any running experiments. We also know how important it is to maintain code portability. The same features and design decisions that make Failure Flags safe to leave in your code mean that, if you choose to leave Gremlin, you don’t need to remove them from your software.
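The fail-safe behavior described here can be sketched as a flag that does nothing unless it is explicitly enabled, and that treats any error reaching its control plane as "no experiment". This is an illustration only: the environment variable `EXAMPLE_FLAGS_ENABLED`, the `fetch_experiments` lookup, and the `failure_flag` function are hypothetical names, not Gremlin’s configuration or API.

```python
import os


def fetch_experiments(name: str) -> list:
    """Hypothetical lookup; it always fails here to simulate an unreachable sidecar."""
    raise ConnectionError("sidecar unreachable")


def failure_flag(name: str, fetch=fetch_experiments) -> bool:
    """Apply any active faults for this flag. Returns True only if a fault was applied."""
    # Not enabled (hypothetical variable): behave as a no-op.
    if os.environ.get("EXAMPLE_FLAGS_ENABLED", "").lower() != "true":
        return False
    try:
        experiments = fetch(name)
    except Exception:
        # Any failure to reach the control plane means no fault is injected.
        return False
    for apply_fault in experiments:
        apply_fault()
    return bool(experiments)
```

In this sketch, the application’s own code path is unchanged in every failure case: a missing setting or an unreachable sidecar both result in the flag returning without effect.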
Extending the most comprehensive fault injection platform
You can integrate the Failure Flags SDK into your code on any platform, but you can only run experiments on platforms that support container sidecars (like Kubernetes, Docker, Nomad, or Cloud Foundry) or AWS Lambda (with Lambda Extensions). Running your application without the sidecar will have no impact on your application, but you will not be able to run experiments.
Of course, Failure Flags is one piece of Gremlin’s broader fault injection and reliability management platform, with support for infrastructure-based agents to make your platforms and infrastructure testable. Gremlin supports all public cloud environments, including AWS, Azure, and GCP. It runs on Linux, Windows, Kubernetes and other containerized environments, AWS Lambda and other serverless platforms, and, yes, bare metal, too. It integrates with the CI/CD, observability, and performance testing tools you already use so you can incorporate it with your current tooling and workflows.
To learn more about Failure Flags, explore the Failure Flags docs or schedule a demo. You can also try it for yourself by signing up for a free trial and following our tutorial for AWS Lambda.