Release Roundup November 2024: Reliability in the serverless and AI era

2024 is coming to a close, and while many teams are slowing down in preparation for the holidays, we’ve been cooking up tons of new features. We’ve extended our platform support to the Istio service mesh, added a brand new experiment type for testing artificial intelligence (AI) and large language model (LLM) workloads, and made it easier to onboard Kubernetes clusters. We’ve also made our Linux and Windows agents more robust and performant.

See the full details in this extra-long release roundup!

New Features

Make your service mesh applications more reliable

Serverless developers rejoice—you can now run Gremlin experiments on service mesh applications!

The Gremlin Service Mesh Extension lets you run experiments on Istio services. You can simulate poor network conditions and latency, outages, dependency failures, and more. And because this feature is built on top of Failure Flags, you have fine-grained control over your testing parameters using selectors and attributes.
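
To make selectors and attributes more concrete, here is a minimal Python sketch of how an experiment's selector might be compared against the labels attached to a request. This is an illustration only, not Gremlin's SDK or implementation: the function name selector_matches, the label names, and the matching rule (every selector key must be present with one of the allowed values) are all assumptions chosen for the example.

```python
from typing import Mapping, Sequence


def selector_matches(selector: Mapping[str, Sequence[str]],
                     labels: Mapping[str, str]) -> bool:
    """Hypothetical matching rule: every selector key must appear in the
    labels, and its value must be one of the values the selector allows."""
    return all(labels.get(key) in allowed for key, allowed in selector.items())


# Hypothetical labels (attributes) describing one request in a mesh service.
request_labels = {"service": "checkout", "method": "POST", "region": "us-west-2"}

# Hypothetical selectors: one targets POST requests to checkout, one targets GETs.
post_only = {"service": ["checkout"], "method": ["POST"]}
get_only = {"method": ["GET"]}

print(selector_matches(post_only, request_labels))  # True
print(selector_matches(get_only, request_labels))   # False
```

Under this assumed rule, a narrower selector (more keys, fewer allowed values) limits an experiment to a smaller slice of traffic, which is the kind of fine-grained control described above.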

To learn more, see our documentation on deploying Failure Flags on Istio.

Build reliable GPU workloads

Organizations are investing heavily in GPU workloads, and the industry is expected to more than quadruple to $274B by 2029. When these workloads fail, the impact can be severe: generative AI systems become unresponsive, streaming platforms lose signal, and expensive simulations or model training runs have to be recomputed.

Gremlin’s new GPU experiment lets you test your GPU-based workloads and discover failure modes before they impact your users. You can stress your GPU by consuming compute capacity on hosts, containers, and Kubernetes resources. For more thorough stress testing, you can use Scenarios to run this experiment in parallel with others, such as system-level CPU, memory, or network experiments.

“With the rise of Large Language Models (LLMs)—Megascale, LLaMa, Gemini, GPT4—ML training shifted the scale of a single training job from tens to tens of thousands…At such scale, failures are not a matter of if, but a matter of when.”

Fundamental AI Research (FAIR) Team, Meta

The GPU experiment is available now for all Gremlin users. Check out the announcement blog, or read our documentation to learn more.

Keep your communications private with AWS PrivateLink via Marketplace

Although Gremlin is a SaaS solution, we offer ways to connect to our service that don’t require transmitting data over the public Internet. AWS PrivateLink is one such solution. For AWS customers, AWS PrivateLink lets you connect directly to Gremlin’s VPC through AWS’ network without having to route over the Internet. It’s just one more way we prioritize security.

AWS PrivateLink is enabled on a per-account basis. Contact your Gremlin rep for more information.

Onboard your Kubernetes clusters faster with Argo Rollout support and auto-generated Helm commands

Kubernetes has always been a core focus at Gremlin, and now, we’re making it even easier to onboard new clusters.

Gremlin’s Getting Started page now has an auto-generated Helm command, pre-populated with your team ID and certificates. All you need to do is download the values.yaml file, copy the Helm command, and run it. We also provide a standard manifest file for non-Helm users.

For teams running Argo on Kubernetes, Gremlin will now detect and list Argo Rollouts separately from other Kubernetes resource types.

Agent updates: better support for enterprise deployments

Gremlin scales with you no matter how large or complex your environment is. We’ve made many performance and stability improvements to our agents to support even the biggest enterprise deployments.

For Windows users, the Gremlin agent now supports systems with more than 64 processors (v1.20.1).

The Linux agent now supports kernels 4.6 and earlier (v2.52.2). We’ve also added stricter dependency checks, and the installation will fail if the necessary permissions are unavailable (v2.52.3). You can learn more about the permissions the agent requires on our security page.

For Kubernetes, we’ve improved the performance of our Chao DaemonSet for large clusters (v0.10.0). We’ve also made it possible to specify which namespaces you want Gremlin to monitor when using our Helm chart (v0.18.1).

Try it yourself

If you already have a Gremlin account, everything listed here is available to you as long as you have the latest agent installed.

If you’re new to Gremlin, sign up for a free trial and see how easy it is to improve reliability.
