Announcing the availability of Gremlin using AWS CloudFormation Public Registry
We’re excited to announce that Gremlin is available on AWS CloudFormation Public Registry. CloudFormation Public Registry is a new searchable collection of extensions that lets customers easily discover, provision, and manage resource types (provisioning logic) and modules published by AWS Partner Network (APN) Partners and the developer community. We’ve collaborated with CloudFormation Public Registry to enable you to easily deploy Gremlin and run Chaos Engineering experiments on your AWS deployments.
Why use Gremlin with CloudFormation?
AWS CloudFormation provides an easy way to model a collection of related AWS and third-party resources, provision them quickly and consistently, and manage them throughout their life cycles, by treating infrastructure as code. Engineers can now deploy and update Gremlin resources in a simple, declarative style that abstracts away the complexity of specific resource APIs.
Gremlin, the world’s first managed enterprise Chaos Engineering solution, helps you validate the reliability of your cloud infrastructure by running thoughtful experiments designed to test for failure modes. Gremlin makes it easy to run targeted experiments on your AWS workloads, including:
- Testing the configuration of auto scaling groups (ASGs) by simulating heavy traffic.
- Validating region failover and disaster recovery by simulating Availability Zone or region outages.
- Validating CloudWatch configurations and alerts.
- Ensuring that containerized workloads, Kubernetes resources, and distributed services can automatically recover from failure.
With this co-launch between Gremlin and CloudFormation Public Registry, you can now easily discover and use our published resource types instead of having to build and maintain them yourself.
You can deploy the Gremlin agent across your entire AWS Organization—or for a specific set of accounts within an OU—in a single operation by using CloudFormation’s StackSets with service-managed permissions. You can then use Gremlin’s Chaos Engineering platform to run chaos experiments, discover services, and validate the resilience of your cloud workloads.
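As a rough illustration of the StackSets approach, here is a minimal Python sketch that builds the two requests involved: one to create a stack set with service-managed permissions, and one to deploy stack instances to a set of OUs. It assumes a boto3 CloudFormation client (for example, boto3.client("cloudformation")) is passed in as cfn. The function names, stack set name, OU IDs, and regions are hypothetical placeholders, not values from Gremlin or AWS.

```python
from typing import Any


def stack_set_requests(
    name: str,
    template_body: str,
    ou_ids: list[str],
    regions: list[str],
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Build keyword arguments for create_stack_set and create_stack_instances."""
    create_stack_set = {
        "StackSetName": name,
        "TemplateBody": template_body,
        # With service-managed permissions, StackSets relies on AWS
        # Organizations instead of hand-built roles in every account.
        "PermissionModel": "SERVICE_MANAGED",
        # Automatically deploy to accounts that join the targeted OUs later.
        "AutoDeployment": {"Enabled": True, "RetainStacksOnAccountRemoval": False},
    }
    create_stack_instances = {
        "StackSetName": name,
        # Target specific OUs, or pass the organization root ID to cover
        # every account in the organization.
        "DeploymentTargets": {"OrganizationalUnitIds": ou_ids},
        "Regions": regions,
    }
    return create_stack_set, create_stack_instances


def deploy_stack_set(
    cfn: Any,
    name: str,
    template_body: str,
    ou_ids: list[str],
    regions: list[str],
) -> str:
    """Create the stack set, start the deployment, and return its operation ID."""
    set_args, instance_args = stack_set_requests(name, template_body, ou_ids, regions)
    cfn.create_stack_set(**set_args)
    response = cfn.create_stack_instances(**instance_args)
    return response["OperationId"]
```

Keeping the request-building step separate from the API calls makes it easy to review exactly which accounts and regions a deployment will touch before anything is created.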
In addition, you can use CloudFormation features such as Drift Detection, which identifies resources in your stack that have drifted from their expected template configuration and reports detailed drift status for each. This helps you confirm that your Gremlin agents still match your defined configuration.
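Drift detection can also be started from a script. The sketch below is one possible approach, assuming a boto3 CloudFormation client is passed in as cfn: it starts a drift detection run and polls until the run finishes. The function name, polling interval, and retry limit are illustrative choices. Whether a given third-party resource type reports drift depends on that extension, so treat the result as a starting point.

```python
import time
from typing import Any


def detect_drift(
    cfn: Any,
    stack_name: str,
    poll_seconds: float = 5.0,
    max_polls: int = 60,
) -> str:
    """Run drift detection on a stack and return its drift status.

    The returned value is the stack-level status reported by
    CloudFormation, such as "IN_SYNC" or "DRIFTED".
    """
    detection_id = cfn.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]

    for _ in range(max_polls):
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] == "DETECTION_COMPLETE":
            return status["StackDriftStatus"]
        if status["DetectionStatus"] == "DETECTION_FAILED":
            reason = status.get("DetectionStatusReason", "drift detection failed")
            raise RuntimeError(reason)
        # Still DETECTION_IN_PROGRESS: wait before asking again.
        time.sleep(poll_seconds)

    raise TimeoutError(f"drift detection for {stack_name} did not finish in time")
```

A result of "DRIFTED" means at least one resource in the stack no longer matches the template, which is a cue to inspect the per-resource drift details in the console.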
How to deploy Gremlin using CloudFormation
Before getting started, you will need to create an IAM role to grant CloudFormation access to the Kubernetes API. A template is available here. This will output an ARN, which you will need for the following steps.
Next, enable the AWSQS::EKS::Cluster extension. Navigate to the CloudFormation registry, select public extensions, then search for “AWSQS::EKS::Cluster”. Click activate, and when prompted for an execution role ARN, use the ARN created for your IAM role. Now you can use CloudFormation to deploy your cluster. More information is available here.
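If you prefer to script the activation instead of using the console, the following Python sketch shows one way it might look. It assumes a boto3 CloudFormation client is passed in as cfn and that the extension appears under the third-party category of the public registry. The function name is hypothetical, and only the first page of search results is checked.

```python
from typing import Any


def activate_public_extension(cfn: Any, type_name: str, execution_role_arn: str) -> str:
    """Find a public third-party resource type by name and activate it.

    Returns the ARN of the activated extension in this account and region.
    """
    response = cfn.list_types(
        Visibility="PUBLIC",
        Type="RESOURCE",
        Filters={"Category": "THIRD_PARTY", "TypeNamePrefix": type_name},
    )
    # The prefix filter can return longer names too, so require an exact match.
    matches = [
        summary
        for summary in response["TypeSummaries"]
        if summary["TypeName"] == type_name
    ]
    if not matches:
        raise LookupError(f"no public extension named {type_name} was found")

    activated = cfn.activate_type(
        PublicTypeArn=matches[0]["TypeArn"],
        # The role CloudFormation assumes when it runs the extension's handlers.
        ExecutionRoleArn=execution_role_arn,
    )
    return activated["Arn"]
```

For example, calling activate_public_extension(cfn, "AWSQS::EKS::Cluster", role_arn) with the ARN of the IAM role created earlier would mirror the console steps described above.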
Now that your EKS cluster is up and running, the next step is to activate the Gremlin extension. Navigate to the CloudFormation registry. Under “Publisher,” switch to “Third party” and search for “Gremlin”.
Use the default options, but for the execution role ARN, enter the ARN that was generated for your IAM role. Then, press “Activate extension”.
The next step is to create a YAML template to deploy the Gremlin agent. The extension deploys Gremlin using the Gremlin Helm chart. This example uses secret-based authentication, which requires your Gremlin team ID, team secret, and a name for the EKS cluster (to identify it in the Gremlin web app). To learn more about configuring the Helm chart, including how to use certificate-based authentication, read our chart documentation.
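To give a sense of the template's shape, here is a hedged Python sketch that builds an equivalent template as JSON, which CloudFormation accepts alongside YAML. Several details are assumptions rather than confirmed values: the resource type name Gremlin::Agent::Helm, the property names (ClusterID, Chart, Repository, Namespace, Values), and the Helm value keys. Check them against the activated extension's schema and the chart documentation before use. The parameter names are hypothetical.

```python
import json
from typing import Any


def gremlin_agent_template() -> dict[str, Any]:
    """Build a CloudFormation template that installs the Gremlin Helm chart."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            "EksClusterId": {"Type": "String"},
            "GremlinTeamId": {"Type": "String"},
            # NoEcho keeps the secret out of console and API output.
            "GremlinTeamSecret": {"Type": "String", "NoEcho": True},
            # Name used to identify the cluster in the Gremlin web app.
            "GremlinClusterName": {"Type": "String"},
        },
        "Resources": {
            "GremlinAgent": {
                # ASSUMPTION: type and property names are not confirmed here.
                "Type": "Gremlin::Agent::Helm",
                "Properties": {
                    "ClusterID": {"Ref": "EksClusterId"},
                    "Chart": "gremlin/gremlin",
                    "Repository": "https://helm.gremlin.com",
                    "Namespace": "gremlin",
                    # ASSUMPTION: Helm value keys for secret-based auth.
                    "Values": {
                        "gremlin.secret.managed": "true",
                        "gremlin.secret.type": "secret",
                        "gremlin.secret.teamID": {"Ref": "GremlinTeamId"},
                        "gremlin.secret.teamSecret": {"Ref": "GremlinTeamSecret"},
                        "gremlin.secret.clusterID": {"Ref": "GremlinClusterName"},
                    },
                },
            }
        },
    }


def render_template() -> str:
    """Serialize the template to a JSON string suitable for upload."""
    return json.dumps(gremlin_agent_template(), indent=2)
```

Passing the team ID and secret as stack parameters, rather than hard-coding them, keeps credentials out of the template file itself.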
In the AWS CloudFormation console, create a new stack using this template, and enter a name for the stack. Create the stack and monitor the Events tab. Once the stack is deployed, you will see an event with the status CREATE_COMPLETE.
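The same step can be scripted. This minimal sketch assumes a boto3 CloudFormation client is passed in as cfn and that template_body holds your template as a string. It creates the stack, blocks until creation finishes, and returns the final status. The function name is hypothetical, and the parameters dictionary should use whatever parameter names your own template declares.

```python
from typing import Any


def create_stack_and_wait(
    cfn: Any,
    stack_name: str,
    template_body: str,
    parameters: dict[str, str],
) -> str:
    """Create a stack, wait for creation to finish, and return its status."""
    cfn.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        # CloudFormation expects parameters as a list of key/value records.
        Parameters=[
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in parameters.items()
        ],
    )
    # The waiter polls the stack and raises an error if creation fails
    # or rolls back instead of reaching CREATE_COMPLETE.
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)

    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    return stack["StackStatus"]
```

If the waiter raises an error, the Events tab in the console (or the describe_stack_events API call) shows which resource failed and why.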
You can verify that Gremlin was successfully deployed by logging into the Gremlin web app, clicking Clients, and finding your EKS cluster nodes in the list. You can now run chaos experiments on your EKS cluster!
Learn more
Reliability is paramount when running workloads in the cloud. Even in a fully managed cloud environment, there’s still the potential for a wide range of failure modes that can cause outages. These outages can cost customer trust, revenue, and valuable engineering time spent on troubleshooting and incident response. Reliability is so important that it’s one of the pillars of the AWS Well-Architected Framework.
Using Gremlin and CloudFormation Public Registry, you can easily validate the resilience of your AWS deployments against a variety of failure modes including node failures, sudden traffic surges, third-party dependency failures, and more. For more information, visit our GitHub repository.
Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30-day trial.
To learn more about how to proactively scan and test for AWS reliability risks and automate reliability management, download a copy of our comprehensive guide.