Release Roundup Sept 2023: Measurably improve reliability
It’s been another busy few months here at Gremlin. Overall, our team has been working on feature improvements to enable teams to measurably improve the reliability of their systems, whether that’s through broadening platform support so you can run Gremlin in more places, making it easier than ever to identify reliability risks, or improving reporting so you can manage reliability programs effectively at enterprise scale. Here’s a summary of what’s new.
New features and UI updates
Find hidden reliability risks without fault injection
The headline feature this month is the introduction of Detected Risks. This new capability automatically detects high-priority reliability concerns in a Kubernetes environment—without running any reliability tests or chaos experiments. You can look forward to dozens more risks being added by the end of this year.
Run experiments on serverless workloads
We also launched the beta release of Failure Flags, Gremlin's new framework for running Chaos Engineering experiments on fully managed platforms such as AWS Lambda functions, serverless workloads, and containers. Teams can now run chaos experiments where access to the underlying infrastructure is limited, or simulate failures at the application layer that aren’t possible at the infrastructure layer. It also means Gremlin can now run across your entire stack—even if it’s managed for you.
Get a clearer view of reliability with better reporting
Also this month, we improved Company Summary reports (previously called the Dashboard). You can now see summary reports of both your Detected Risk reports and Reliability Score reports, so you can get a sense of your reliability posture in one place. As part of this change, plan usage details have been moved to Company Settings.
Additional improvements
In other news, we’ve made a number of general improvements:
- Gremlin now supports delegation of Namespaces to a Team for both manual and automatic service creation. Teams can more confidently run experiments without accidentally impacting other teams' resources.
- We’ve added service annotations, which let you automatically register your Kubernetes services in Gremlin by adding a simple annotation. This speeds up service creation significantly: any service with an annotation simply appears in the Gremlin Service Catalog, ready for you to manage and test.
- We’ve added web app support for managing multiple services simultaneously. This lets you add Health Checks to multiple services with a single click and start testing within seconds. The Service Catalog has been reworked to reflect this change.
- Scenarios can now be deleted in addition to being archived, so you only see your most relevant Scenarios.
Agent updates
Better performance for Linux agents
We’ve made two significant improvements to the Linux agent, both of which reduce network overhead and improve overall performance.
First, Gremlin now uploads discovered process data at a slower rate, reducing network overhead.
Second, <span class="code-class-custom">gremlind</span> now batches process data over 15-minute intervals, deduplicating all network and process data detected during each interval. Previously, <span class="code-class-custom">gremlind</span> emitted snapshots of process and socket data to Gremlin's control plane every two minutes.
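To illustrate why batching with deduplication reduces network overhead, here is a minimal sketch. It is not how gremlind is implemented: the record shape (a process name paired with a remote address) and the function name are assumptions made purely for illustration.

```python
from typing import Iterable

# Hypothetical record shape: (process_name, remote_address).
# This is NOT gremlind's real data format.
Observation = tuple[str, str]


def batch_and_dedupe(snapshots: Iterable[Iterable[Observation]]) -> list[Observation]:
    """Merge every snapshot taken during one interval into a single payload,
    keeping each distinct observation once, in first-seen order."""
    seen: set[Observation] = set()
    payload: list[Observation] = []
    for snapshot in snapshots:
        for obs in snapshot:
            if obs not in seen:
                seen.add(obs)
                payload.append(obs)
    return payload


# Three snapshots collected within one interval, with heavy overlap.
snapshots = [
    [("nginx", "10.0.0.5:443"), ("sshd", "10.0.0.9:22")],
    [("nginx", "10.0.0.5:443"), ("sshd", "10.0.0.9:22")],
    [("nginx", "10.0.0.5:443"), ("redis", "10.0.0.7:6379")],
]

payload = batch_and_dedupe(snapshots)
# Number of records that would be sent if each snapshot were uploaded as-is.
records_sent_individually = sum(len(s) for s in snapshots)
```

In this toy input, uploading each snapshot separately would send six records, while the batched payload carries only the three distinct ones. Long-lived processes and connections repeat across snapshots, which is where the savings come from.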
Enabling Detected Risks
As noted above, Gremlin can now detect specific reliability risks without fault injection. To support this functionality, the Chao Kubernetes agent now sends the <span class="code-class-custom">imageID</span> of each container, which enables Gremlin to identify services running multiple container versions simultaneously, a common reliability risk. You can learn more about Detected Risks here.
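The idea behind this check can be sketched in a few lines. This is an illustration only, not Gremlin's detection logic: the input shape (a list of dictionaries with `service` and `imageID` keys) and the function name are assumptions.

```python
from collections import defaultdict


def services_with_mixed_images(containers):
    """Return {service: sorted image IDs} for every service whose
    containers report more than one distinct image ID."""
    images = defaultdict(set)
    for container in containers:
        images[container["service"]].add(container["imageID"])
    return {svc: sorted(ids) for svc, ids in images.items() if len(ids) > 1}


# Hypothetical inventory: "checkout" runs two different images at once.
containers = [
    {"service": "checkout", "imageID": "sha256:aaa"},
    {"service": "checkout", "imageID": "sha256:bbb"},
    {"service": "cart", "imageID": "sha256:ccc"},
    {"service": "cart", "imageID": "sha256:ccc"},
]

mixed = services_with_mixed_images(containers)
```

Here only `checkout` is flagged, because its containers report two distinct image IDs while both `cart` containers report the same one.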
Security improvements
We continue to build out enterprise-grade security capabilities trusted by some of the world’s largest and most regulated companies, and this month we’ve made two updates.
First, when installed directly on the host and launched with systemd, the Gremlin agent now runs with ambient capabilities (capabilities(7)) rather than file capabilities. Ambient capabilities are granted to the process at launch and preserved when it executes unprivileged programs, so the agent keeps the permissions it needs without capability bits being attached to the binaries on disk.
Second, when installed directly on the host, the suid bit is no longer set for the installed binaries <span class="code-class-custom">/usr/bin/gremlin</span> and <span class="code-class-custom">/usr/sbin/gremlind</span>. The suid bit allows any user to run a binary as if it were being run by the file's owner, so removing it improves security. Additionally, these binaries are no longer owned by the Gremlin Linux user, but instead by root.
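If you want to confirm this state on a host, a small check along these lines may help. It is a sketch that uses only the Python standard library; the function names are hypothetical and it is not a Gremlin-provided tool.

```python
import os
import stat


def is_hardened(mode: int, uid: int) -> bool:
    """True when the setuid bit is clear and the owner is root (uid 0)."""
    return not (mode & stat.S_ISUID) and uid == 0


def check_binary(path: str) -> bool:
    """Apply is_hardened to the file at `path` using its stat() result."""
    st = os.stat(path)
    return is_hardened(st.st_mode, st.st_uid)
```

On a host with the agent installed, you could call `check_binary("/usr/bin/gremlin")` and `check_binary("/usr/sbin/gremlind")`. Keeping the bit test in a separate pure function makes the logic easy to verify without touching the filesystem.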
Certificate Expiry test improvements
Certificate Expiry experiments run against CIDR values (e.g., 10.0.0.0/24) now make several attempts to find an active IP address in use by the target system. The certificate on that address is then evaluated for expiration, all within the duration specified by the <span class="code-class-custom">--length</span> argument.
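The general pattern, trying addresses in a CIDR block until one responds or a time budget runs out, can be sketched as follows. This is not Gremlin's implementation: the scan order, the per-address timeout, the default port, and every function name here are assumptions made for illustration.

```python
import ipaddress
import socket
import ssl
import time
from typing import Callable, Optional


def tls_probe(address: str, port: int = 443, timeout: float = 1.0) -> bool:
    """Return True if a TLS handshake succeeds and the peer presents a certificate."""
    ctx = ssl.create_default_context()
    # Verification is disabled on purpose: an expired certificate should still count.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((address, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock) as tls:
                return tls.getpeercert(binary_form=True) is not None
    except OSError:  # covers timeouts, refused connections, and ssl.SSLError
        return False


def find_active_host(
    cidr: str,
    length_seconds: float,
    probe: Callable[[str], bool] = tls_probe,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Try each host address in `cidr` until one responds to `probe`
    or the time budget of `length_seconds` is used up."""
    deadline = clock() + length_seconds
    for host in ipaddress.ip_network(cidr).hosts():
        if clock() >= deadline:
            break
        if probe(str(host)):
            return str(host)
    return None
```

Passing the probe as a parameter keeps the search loop testable without network access: for example, `find_active_host("10.0.0.0/24", 5.0, probe=lambda a: a == "10.0.0.3")` exercises the loop with a stand-in probe.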
Improved labeling
With Helm, you can now add labels to the deployed Gremlin Pods using the <span class="code-class-custom">chao.podLabels</span> and <span class="code-class-custom">gremlin.podLabels</span> parameters. Labels make it easier to filter, sort, or select pods for tests and experimentation in Gremlin. See the Chart documentation for details.
Try it for yourself
If you already have a Gremlin account, everything noted here is available to you now, as long as you have the latest agent installed.
If not, sign up for a free trial to start understanding and improving your reliability posture in minutes.