Failure Flags
Gremlin Failure Flags lets you run Chaos Engineering experiments and reliability tests on serverless workloads, containers, and similar managed environments. Just like feature flags, Failure Flags let you perform experiments on specific parts of your services and applications with minimal changes to your application code and no performance impact when disabled. Failure Flags are safe to deploy in your application and default to disabled when you have no actively running experiments.
Use-Cases
Failure Flags is an application-level fault injection tool. Its use cases cover simulating or realizing failures that either have an impact at the application level or target application data. These typically represent the bulk of the issues teams see day to day, such as:
- Incorrect or corrupt data
- Customer-specific failures
- Lock-contention on hot data
- Breaking API changes
- Unexpected API responses
- Partial service failures
- Message double-delivery or ordering issues
Beyond testing for these issues, Failure Flags can help you:
- Test observability and alarm configuration
- Exercise automated recovery systems
- Isolate experiments in any environment to well-known users or customers
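Isolating an experiment to well-known users or customers generally comes down to attaching labels at the call site and matching them against the scope of the experiment. The following is a minimal sketch of that idea only. The `Experiment` class, the `matches` function, the flag name, and the selector format are all hypothetical and do not represent the Gremlin SDK or its experiment schema.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """Hypothetical experiment: a flag name plus the label values it targets."""
    flag_name: str
    # Maps a label key to the list of values that should be affected.
    selector: dict = field(default_factory=dict)


def matches(experiment, flag_name, labels):
    """Return True when this call site falls inside the experiment's scope."""
    if experiment.flag_name != flag_name:
        return False
    # Every selector key must be present on the call site with an allowed value.
    return all(
        labels.get(key) in allowed
        for key, allowed in experiment.selector.items()
    )


# Only requests labeled with the designated test customer are in scope.
experiment = Experiment("checkout-db-read", {"customer_id": ["test-customer-1"]})
in_scope = matches(experiment, "checkout-db-read", {"customer_id": "test-customer-1"})
out_of_scope = matches(experiment, "checkout-db-read", {"customer_id": "customer-42"})
```

In this sketch a request for any other customer, or a call site with a different flag name, is never in scope, which is what keeps the experiment away from real traffic.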
Architecture and Performance Impact
Failure Flags integrates with your applications, so it is critical that you can be confident that adopting it will not adversely affect the availability or performance of those applications outside of experiment parameters. Failure Flags, like other Gremlin products, is designed to fail safely.
Failure Flags is made up of three major components: the Gremlin SaaS API, the Failure Flags Sidecar or Lambda Extension, and one of the SDKs. No impact to your applications is possible unless all three are configured correctly at runtime. Working backwards from your application:
- The SDK must be integrated with your application and explicitly enabled via environment variable.
- The sidecar or extension must be deployed with your application and use a common localhost interface.
- The sidecar or extension must be enabled and provided with current credentials to the Gremlin API via environment variables or other configuration options.
- The sidecar or extension must have a stable network route to the Gremlin API and be provided with configuration required to traverse corporate proxies.
- Your company Gremlin account must have Failure Flags enabled.
- Your team must have created and run an experiment.
Any misconfiguration, omitted configuration, or service outage can only prevent experimentation, which minimizes any adverse impact on your applications. Further, the various Failure Flags SDKs are published under the Apache-2.0 license. You're encouraged to audit those libraries as you see fit. Adopting Failure Flags will in no way lock your applications in to Gremlin.
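The fail-safe chain described above can be pictured as a series of guards, each of which returns "do nothing" when its condition is not met. The sketch below illustrates that pattern under stated assumptions and is not the actual SDK implementation: `invoke_failure_flag`, `fetch_experiment`, and the environment variable name `EXAMPLE_FAILURE_FLAGS_ENABLED` are placeholders invented for this example.

```python
import os

# Placeholder name for this sketch; consult your SDK's documentation for the real variable.
ENABLED_VAR = "EXAMPLE_FAILURE_FLAGS_ENABLED"


def invoke_failure_flag(name, labels, fetch_experiment, environ=None):
    """Apply a fault only if every precondition holds; otherwise do nothing.

    fetch_experiment stands in for the localhost call to the sidecar or
    extension. It returns a callable effect, or None when no experiment
    is running for this flag.
    """
    environ = os.environ if environ is None else environ

    # 1. The SDK must be explicitly enabled via environment variable.
    if environ.get(ENABLED_VAR, "").lower() not in ("1", "true"):
        return False

    # 2. The sidecar must be reachable; any error is treated as "no experiment".
    try:
        effect = fetch_experiment(name, labels)
    except Exception:
        return False

    # 3. An experiment must actually be running for this flag.
    if effect is None:
        return False

    effect()
    return True
```

Every early return leaves the application on its normal code path, so in this sketch a missing variable, an unreachable sidecar, or the absence of an experiment all have the same outcome: no fault is injected.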
Takeaways
- It is safe to add Failure Flags to your code and leave them there
- It is easy to prevent experimentation in any environment
- The SDKs are licensed under Apache-2.0
- Adding Failure Flags will not create lock-in
Supported Platforms
Failure Flags can run on any platform or environment that supports multiple processes with shared localhost. These include most if not all Kubernetes platforms, AWS Lambda, AWS ECS, virtual machines, container platforms with shared network namespaces, and many others (like your laptop). Gremlin currently provides support for:
- AWS Lambda
- AWS ECS
- Kubernetes
Gremlin does provide executables and a variety of packages that can be used on other platforms, but we cannot provide support for those at this time.
Supported Languages and Frameworks
The Failure Flags SDKs are language-specific and released under the Apache-2.0 license. These include support for:
- JavaScript / TypeScript / NodeJS
- Python
- Go
- Java
Each of these is a minimal SDK, and they support similar features and semantics where possible.
Preparing and Next Steps
Before you’ll be able to use Failure Flags you’ll need to gather some information and do a little pre-work:
- Identify the Application you will Instrument: Consider the common use-cases listed above and decide which of your applications you'll get started with.
- Firewalls and Routes: Make sure that the network your chosen application is deployed into has a route to beta.gremlin.com and api.gremlin.com.
- Proxy Configuration: If that network uses an outbound HTTP or HTTPS proxy, you'll need to gather its URL, any credentials, and any certificate material it uses. That certificate material should be PEM encoded.
- New Library Dependencies: You will add a library dependency to your project. If your organization uses an internal package or library cache, make sure that it includes the Failure Flags SDK for your application.
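Some of this pre-work can be checked with a small script run from inside the target network. The helpers below are an illustrative sketch, not a Gremlin tool: `can_connect` and `looks_like_pem` are hypothetical names, port 443 is assumed because the endpoints are expected to be served over HTTPS, and the PEM check only looks for the textual markers rather than validating the certificate.

```python
import socket


def can_connect(host, port=443, timeout=3.0):
    """Return True if a direct TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def looks_like_pem(text):
    """Cheap textual check for PEM certificate markers (not a validation)."""
    return (
        "-----BEGIN CERTIFICATE-----" in text
        and "-----END CERTIFICATE-----" in text
    )
```

Calling `can_connect("api.gremlin.com")` and `can_connect("beta.gremlin.com")` from the application's network gives a quick signal about direct routes. It does not go through an outbound proxy, so a `False` result on a proxied network does not by itself mean the route is missing.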
See the following pages to get started: