Gremlin's 2024 year-end Release Roundup
It’s been a busy year at Gremlin! We released two new experiments, added an entirely new onboarding process and features for AWS users, added a brand new Test Suite and Detected Risks, and made many UI improvements to our web app. We beefed up our agents with more enterprise capabilities, including support for large Kubernetes clusters and systems with over 64 CPUs, improved experiment behaviors, improved dependency detection, and per-team Private Network Integrations.
Keep reading for a complete look at everything Gremlin’s released in 2024!
Brand new experiments
Simulate massively parallel workloads with the new Process Exhaustion experiment
The Process Exhaustion experiment simulates running processes on a system to consume process IDs (PIDs). This lets you test your systems’ ability to handle massively concurrent workloads, such as container orchestration tools and large-scale web and proxy servers. You can use this to determine:
- How many processes your systems can handle before becoming unstable
- How your services respond when the host runs out of PIDs
- Whether your PID limits are being enforced across your services
You can find this new experiment under the State category of experiments for Windows and Linux, and it’s ready to be incorporated into your Scenarios and Reliability Management Test Suites. Check out our blog post for more details.
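Before running the experiment, it can help to know how much PID headroom a host has. The following is a minimal sketch for Linux hosts, not part of Gremlin: the function names `pid_headroom` and `read_linux_pid_usage` are our own, and the script only assumes the standard `/proc/sys/kernel/pid_max` file and the per-process `task` directories under `/proc`.

```python
from pathlib import Path


def pid_headroom(pid_max: int, pids_in_use: int) -> float:
    """Return the fraction of the PID space that is still free (0.0 to 1.0)."""
    if pid_max <= 0:
        raise ValueError("pid_max must be positive")
    free = max(pid_max - pids_in_use, 0)
    return free / pid_max


def read_linux_pid_usage(proc: Path = Path("/proc")) -> tuple[int, int]:
    """Read the kernel-wide PID limit and count the IDs currently in use (Linux only)."""
    pid_max = int((proc / "sys/kernel/pid_max").read_text())
    in_use = 0
    for entry in proc.iterdir():
        # Numeric directories are processes; each entry in task/ is a thread,
        # and threads draw their IDs from the same PID space.
        if entry.name.isdigit():
            try:
                in_use += sum(1 for _ in (entry / "task").iterdir())
            except OSError:
                pass  # the process exited while we were scanning
    return pid_max, in_use
```

On a Linux host, `pid_headroom(*read_linux_pid_usage())` gives a rough baseline to compare against while a Process Exhaustion experiment runs. Container-level limits (for example, cgroup `pids.max`) are separate from this kernel-wide value and are not covered here.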
Build reliable GPU workloads
Organizations have been investing substantially in GPU workloads, with the industry expected to more than quadruple to $274B by 2029. These same workloads can have massive impacts if they fail, such as generative AI systems becoming unresponsive, streaming events losing signal, and expensive simulations having to recalculate.
Gremlin’s new GPU experiment lets you test your GPU-based workloads and discover failure modes before they impact your users. You can stress your GPU by consuming compute capacity on hosts, containers, and Kubernetes resources. For more thorough stress testing, you can use Scenarios to run this experiment in parallel with others, such as system-level CPU, memory, or network experiments.
“With the rise of Large Language Models (LLMs)—Megascale, LLaMa, Gemini, GPT4—ML training shifted the scale of a single training job from tens to tens of thousands…At such scale, failures are not a matter of if, but a matter of when.” - Fundamental AI Research (FAIR) Team, Meta
Learn more in our documentation.
All new AWS workflow
Effortlessly onboard your AWS-based services
Gremlin can now automatically discover services running on AWS! Gremlin can already leverage your CloudWatch metrics for Health Checks, and we’ve expanded on this integration to include services.
Gremlin uses your Elastic Load Balancers (ELBs) to detect service routes and can automatically translate these into Gremlin services for running reliability tests. You don’t need to manually define the services yourself or add annotations to your manifests when using EKS. Just deploy the Gremlin agent, grant Gremlin IAM access, and choose which services you want to test in just a few clicks.
Check out our new quick-start guide!
Accurately monitor your services without setting up an observability tool
Along with our new onboarding flow, we also introduced Intelligent Health Checks, a way for Gremlin to create and configure Health Checks for you automatically. Health Checks are how Gremlin tracks the health of your services before, during, and after reliability testing. Normally, Health Checks require you to have a pre-existing monitoring or observability tool set up. With Intelligent Health Checks, all you need to do is click on a box, and we’ll create these checks for you.
Once you create a service from an ELB, just go to its settings and click the check box to enable Intelligent Health Checks. Gremlin will find three of the service’s metrics—throughput, latency, and error rate—and monitor these metrics to understand your service’s baseline performance. When you run a reliability test, Gremlin continuously compares each metric’s current levels against its baseline to determine whether the service is healthy. If they’re significantly different, Gremlin halts the test and returns your service to normal operation.
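To make the baseline comparison concrete, here is a small illustrative sketch. It is not Gremlin's implementation: this post doesn't say how "significantly different" is decided, so the `tolerance` value, the absolute error-rate floor, and the `Metrics` and `is_healthy` names are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    throughput: float  # requests per second
    latency_ms: float  # response latency in milliseconds
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


def is_healthy(baseline: Metrics, current: Metrics, tolerance: float = 0.5) -> bool:
    """Return True if no metric has degraded by more than `tolerance` relative to baseline."""
    throughput_ok = current.throughput >= baseline.throughput * (1 - tolerance)
    latency_ok = current.latency_ms <= baseline.latency_ms * (1 + tolerance)
    # A relative comparison is noisy when the baseline error rate is near zero,
    # so this sketch also allows a small absolute floor (assumed: 1%).
    error_ok = current.error_rate <= max(baseline.error_rate * (1 + tolerance), 0.01)
    return throughput_ok and latency_ok and error_ok
```

With a baseline of `Metrics(200, 120, 0.002)`, a reading of `Metrics(190, 400, 0.004)` fails the latency check, which is the point at which a test would be halted in this model.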
Uncover more reliability risks
As part of our work on making AWS more seamlessly integrated into Gremlin, we’ve added three new AWS-specific Detected Risks to help ensure that your AWS workloads are redundant and accident-resistant:
- Availability zone redundancy checks if an Application, Network, or Gateway load balancer is mapped to multiple availability zones.
- Cross-zone load balancing checks whether you have cross-zone load balancing enabled on this service, improving your application’s ability to handle the loss of one or more instances.
- Deletion protection checks that your load balancer has the “deletion protection” flag enabled to avoid accidental deletion.
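If you want to spot-check the same three conditions yourself, the sketch below shows one way to evaluate them. It is not how Gremlin implements Detected Risks. It assumes the input dictionaries are shaped like the Elastic Load Balancing v2 `DescribeLoadBalancers` and `DescribeLoadBalancerAttributes` API responses, and the `detected_risks` function name and result keys are our own.

```python
def detected_risks(load_balancer: dict, attributes: list[dict]) -> dict[str, bool]:
    """Map each check name to True if the load balancer passes it.

    `load_balancer` is shaped like one entry of "LoadBalancers" in a
    DescribeLoadBalancers response; `attributes` is shaped like the
    "Attributes" list in a DescribeLoadBalancerAttributes response.
    """
    attrs = {a["Key"]: a["Value"] for a in attributes}
    zones = {az["ZoneName"] for az in load_balancer.get("AvailabilityZones", [])}
    return {
        "availability_zone_redundancy": len(zones) > 1,
        # Attribute values are returned as the strings "true" and "false".
        "cross_zone_load_balancing": attrs.get("load_balancing.cross_zone.enabled") == "true",
        "deletion_protection": attrs.get("deletion_protection.enabled") == "true",
    }
```

A `False` value in the result corresponds to a risk that would be worth investigating.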
Ensure your services are cloud-optimized
Cloud providers often publish guidance on building applications that fully utilize the platform’s features. These “Well-Architected Frameworks” offer recommendations but sometimes lack concrete steps for engineers to follow.
To help with this, we created a brand new test suite: the Well-Architected Cloud Test Suite. This suite lets you test your services against cloud reliability principles and best practices, and track how well they adhere to them. It also adds two new reliability tests: Disk I/O, which tests whether your services can tolerate a drop in input/output operations per second (IOPS); and DNS, which tests whether your services can successfully fail over to a secondary DNS provider.
Keep your communications private with AWS PrivateLink via Marketplace
Although Gremlin is a SaaS solution, we offer ways to connect to our service that don’t require transmitting data over the public Internet. AWS PrivateLink is one such solution. For AWS customers, AWS PrivateLink lets you connect directly to Gremlin’s VPC through AWS’ network without having to route over the Internet. It’s just one more way we prioritize security.
Enabling AWS PrivateLink is done on a per-account basis. Contact your Gremlin rep for more information.
Streamline deploying Gremlin to AWS with AWS Key Management Service
Gremlin now natively integrates with AWS Key Management Service (KMS), making deploying Gremlin to your AWS environment easier and more secure. When deploying the agent, you can replace your normal configuration values (team_id, team_certificate, etc.) with the Amazon Resource Name (ARN) of the KMS secret you wish to use. When the Gremlin agent starts, it will retrieve the values from KMS, letting you deploy Gremlin securely without storing or distributing plaintext passwords or certificates.
Learn how to do it in our tutorial.
Improved support for serverless and containerized workloads
Make your service mesh applications more reliable
Serverless developers rejoice—you can now run Gremlin experiments on service mesh applications!
The Gremlin Service Mesh Extension lets you run experiments on Istio services. You can simulate poor network conditions and latency, outages, dependency failures, and more. And because this feature is built on top of Failure Flags, you have fine-grained control over your testing parameters using selectors and attributes.
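To illustrate what selector-based targeting means, here is a conceptual sketch. It does not use the Failure Flags SDK or the Service Mesh Extension, and the experiment shape, the `latency_ms` effect key, and the function names are hypothetical. It only shows the general idea: an effect applies to a request when the flag name matches and every selector key matches one of the request's attributes.

```python
from typing import Optional


def selector_matches(selector: dict[str, list[str]], labels: dict[str, str]) -> bool:
    """Return True if every selector key is present in labels with an allowed value."""
    return all(labels.get(key) in allowed for key, allowed in selector.items())


def effect_for_call(experiment: dict, flag_name: str, labels: dict[str, str]) -> Optional[dict]:
    """Return the experiment's effect if it targets this call, otherwise None."""
    if experiment["flag"] != flag_name:
        return None
    if not selector_matches(experiment["selector"], labels):
        return None
    return experiment["effect"]


# Hypothetical experiment: add latency to POST requests on /checkout only.
EXPERIMENT = {
    "flag": "http-ingress",
    "selector": {"method": ["POST"], "path": ["/checkout"]},
    "effect": {"latency_ms": 2000},
}
```

In this model, a `GET` request to `/checkout` is left untouched, while a `POST` to the same path receives the latency effect.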
To learn more, see our documentation on deploying Failure Flags on Istio.
Onboard your Kubernetes clusters faster with Argo Rollout support and auto-generated Helm commands
Kubernetes has always been a core focus for Gremlin, and now, we’re making it even easier to onboard new clusters.
Gremlin’s Getting Started page now has an auto-generated Helm command, pre-populated with your team ID and certificates. All you need to do is download the values.yaml file, copy the Helm command, and run it. We also provide a standard manifest file for non-Helm users.
For teams running Argo on Kubernetes, Gremlin will now detect and list Argo Rollouts separately from other Kubernetes resource types.
Manage testing more effectively
Create custom roles to meet your organization’s requirements
Gremlin now supports fully customizable role-based access controls (RBAC). RBAC lets you specify which actions your users can perform. You can assign users to one or more roles, with each role enabling a different set of privileges. Privileges correspond to actions in Gremlin, such as running a CPU experiment or managing user accounts. Gremlin also provides a set of standard roles out-of-the-box.
In addition to controlling which privileges users have, you can set default privileges for new team members. When new users join a Gremlin team, Gremlin automatically assigns them a default role. For example, you can create a role with minimal privileges and make it the default, then assign more permissive roles to individual users as needed.
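The role model described above can be sketched in a few lines. This is only an illustration of the concept, not Gremlin's data model or API: the `Role` and `Team` classes and the privilege strings are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Role:
    name: str
    privileges: frozenset[str]


@dataclass
class Team:
    default_role: Role
    assignments: dict[str, list[Role]] = field(default_factory=dict)

    def add_user(self, user: str) -> None:
        """New members start with only the team's default role."""
        self.assignments.setdefault(user, [self.default_role])

    def grant(self, user: str, role: Role) -> None:
        """Assign an additional role to an existing member."""
        self.assignments[user].append(role)

    def can(self, user: str, privilege: str) -> bool:
        """A user holds a privilege if any of their roles includes it."""
        return any(privilege in role.privileges for role in self.assignments.get(user, []))


# Hypothetical roles and privilege names.
VIEWER = Role("viewer", frozenset({"reports.read"}))
EXPERIMENTER = Role("experimenter", frozenset({"reports.read", "experiments.run"}))
```

Making a minimal role like `VIEWER` the default means a new user can't run experiments until someone explicitly grants a more permissive role.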
Read our blog to learn how customizable RBAC works in Gremlin, or check out the documentation for instructions on setting it up.
Discover and track dependencies more accurately
Gremlin has long been able to find your services' critical dependencies automatically, and now we’ve improved our discovery methods. Gremlin now detects DNS calls made by your service to other services and uses this information to identify dependencies. This DNS-based method is faster and more accurate and lets Gremlin track dependencies even if their IP address changes. We discuss it in detail in our blog, How dependency discovery works.
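To see why keying on DNS names survives IP changes, consider this small sketch. It is a simplified model rather than Gremlin's discovery code, and the `DependencyTracker` class and its methods are hypothetical.

```python
from collections import defaultdict
from typing import Optional


class DependencyTracker:
    """Tracks dependencies by hostname, remembering every IP each name resolved to."""

    def __init__(self) -> None:
        self._addresses: dict[str, set[str]] = defaultdict(set)

    def record_dns_answer(self, hostname: str, ip: str) -> None:
        # Normalize case and the optional trailing dot so that
        # "DB.internal." and "db.internal" count as one dependency.
        self._addresses[hostname.lower().rstrip(".")].add(ip)

    def dependencies(self) -> list[str]:
        """Return the known dependency hostnames in sorted order."""
        return sorted(self._addresses)

    def dependency_for_ip(self, ip: str) -> Optional[str]:
        """Return the hostname an IP was last seen under, if any."""
        for hostname, ips in self._addresses.items():
            if ip in ips:
                return hostname
        return None
```

If `db.internal` resolves to `10.0.0.5` today and `10.0.0.9` tomorrow, the tracker still reports a single dependency, whereas a purely IP-based list would show two.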
Prevent testing during critical time blocks with restricted time windows
Sometimes there isn’t a good time to run reliability tests, such as during code merges, scheduled deployments, or peak traffic times. Gremlin now has a native way to prevent users from running experiments, Scenarios, or reliability tests at these times: restricted time windows. Restricted time windows let you set blocks of time, at either the team or company level, during which tests won’t run. Tests that are already running are halted, and scheduled tests will not start during this time. You can specify a weekday, start time, and duration. Check out the docs to learn more.
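A window defined by a weekday, start time, and duration can be modeled as shown below. This is a sketch of the idea rather than Gremlin's scheduler: the `RestrictedWindow` class and `can_start_test` function are hypothetical, and the code assumes naive `datetime` values that all share one timezone.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass(frozen=True)
class RestrictedWindow:
    weekday: int  # 0 = Monday ... 6 = Sunday, matching datetime.weekday()
    start: time
    duration: timedelta

    def contains(self, moment: datetime) -> bool:
        """Return True if `moment` falls inside the most recent occurrence of this window."""
        # Find the latest opening of the window at or before `moment`, so that
        # windows running past midnight (or past Sunday) are handled correctly.
        days_since = (moment.weekday() - self.weekday) % 7
        window_start = datetime.combine(moment.date() - timedelta(days=days_since), self.start)
        if window_start > moment:
            window_start -= timedelta(days=7)
        return window_start <= moment < window_start + self.duration


def can_start_test(windows: list[RestrictedWindow], moment: datetime) -> bool:
    """A test may start only if no restricted window covers `moment`."""
    return not any(window.contains(moment) for window in windows)
```

For example, a window with `weekday=4`, `start=time(16, 0)`, and `duration=timedelta(hours=4)` blocks tests every Friday from 16:00 until 20:00.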
Improved auditing tools in the Gremlin API
A comprehensive audit trail is important for any software tool, especially one that tests your systems. Gremlin now provides two REST API endpoints for retrieving log data about who logged into your Gremlin organization and which experiments/Scenarios were run. Both of these logs are available under the /reports/security endpoint. You can learn more in our REST API documentation.
A cleaner, more streamlined web app
We squashed some bugs and improved the user experience in our web interface. This includes:
- Adding help text to the Test Suite creation wizard warning users that reliability scores will be reset when a new Test Suite is applied.
- Making test results clickable on the service overview page. Clicking on a test result will bring you to the most recent test run.
- Displaying more information about LostCommunication agent errors.
- Reducing the delay when creating multiple test suites.
- Improving the way columns are rendered in reports.
- And much more!
Easier Failure Flags experiment creation
We’ve made the Failure Flags experiment creation interface easier to use! Now you can select your Failure Flag, attributes, services, and effects using drop-down boxes. If you want to continue using JSON, click the JSON tab to edit it directly. Gremlin will automatically update your JSON to match the contents of the drop-downs and vice versa.
Agent improvements
We’ve greatly improved our Linux, Windows, Failure Flags, and private network integration (PNI) agents.
Better support for enterprise deployments
Gremlin scales with you no matter how large or complex your environment is. We’ve made many performance and stability improvements to our agents to support even the biggest enterprise deployments.
For Windows users, the Gremlin agent now supports systems with more than 64 processors (v1.20.1).
The Linux agent now supports kernels 4.6 and earlier (v2.52.2). We’ve also added stricter dependency checks, and the installation will fail if the necessary permissions are unavailable (v2.52.3). You can learn more about the permissions the agent requires on our security page.
For Kubernetes, we’ve improved the performance of our Chao DaemonSet for large clusters (v0.10.0). We’ve also made it possible to specify which namespaces you want Gremlin to monitor when using our Helm chart (v0.18.1).
Per-team Private Network Integrations
Gremlin’s Private Network Integration (PNI) agent lets you connect Gremlin to services hosted on your private network without exposing them to the public Internet. The agent proxies Health Checks, webhooks, and other requests originating from the Gremlin control plane. You can now scope PNI agents to individual teams instead of your entire company. This lets teams use PNIs across multiple networks, even if those networks are isolated.
Improved dependency detection
Gremlin will more accurately detect your service’s dependencies. In addition to using the service’s DNS records, the Gremlin agent will detect whether the service has an active connection to the dependency. This will improve the relevance of dependencies that Gremlin shows in your service’s dependency list.
Click here to learn how to manage your service’s dependencies in Gremlin.
Improved disk experiment compatibility
When running a disk experiment, the Gremlin agent will no longer mark newly created files as “hidden.” This allows other applications, like monitoring services and observability tools, to correctly detect and measure changes in disk usage. Gremlin will still delete the files when the experiment ends.
Click here to learn more about the disk experiment.
New container drivers for better performance and support
In addition to the normal dependency and library updates, we’ve added new container drivers for Docker, containerd, and CRI-O. These new libraries remove our runC dependency, significantly reducing CPU and I/O usage.
The agent is also smarter about detecting pre-existing network ingress rules that conflict with the Blackhole experiment, which can happen with integrations like Cilium or Kata. We’ve also made several improvements to experiment rollback logic, such as improved logging and error reporting and better handling for external network devices taken offline while an experiment is running.
Improved experiment behavior
We’ve made some under-the-hood improvements to our agents as well. First, we made the I/O experiment more impactful by having it bypass the page cache and read directly from the disk. This more accurately reflects real-world disk reading behavior. We’ve also made it easier to run container experiments on host-based agents by adding the SYS_ADMIN, SYS_RESOURCE, and CAP_SYS_CHROOT capabilities by default. Lastly, we’ve enhanced logging and error messages to simplify troubleshooting issues such as container drivers failing to load, failing to parse configuration and certificate files, and Windows validation errors.
Try it for yourself
If you have a Gremlin account, all the features highlighted in this post are available. Just make sure to update your agents to the latest versions.
New to Gremlin? Sign up for a free 30-day trial to see how Gremlin helps uncover reliability risks and improve your reliability posture.