Reliable AI models, simulations, and more with Gremlin's GPU experiment
Artificial Intelligence (AI) has become one of the biggest tech trends in years. From generating full movies to updating its own code, AI is performing tasks that were once science fiction.
But when you peek under the hood, AI is just math running on a fleet of servers and a sea of graphics processing units (GPUs). Like all modern applications, it relies on a complex network of hardware and software, all of which can fail in different ways. The question is: how do AI applications respond to these failures? If the GPUs powering an AI model’s neural network fail, can the model recover? Or are there impacts to its performance and predictability?
Answering these questions is why we created our latest experiment: the GPU Gremlin. Now, you can stress test your GPUs by consuming their computing capacity. We’ve made this experiment available across all supported platforms, including Kubernetes. Keep reading to learn more.
How does the GPU experiment work?
The GPU experiment works by stressing your graphics processing unit (GPU) for the duration of the experiment. It consumes as much of the GPU's compute capacity as possible to simulate a highly intensive workload and push the GPU to its limit. It accomplishes this using OpenCL, a framework for writing programs that run across different platforms, including GPUs.
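To make the mechanism concrete, here is a minimal sketch of the general technique: a deliberately wasteful OpenCL kernel launched in a loop from Python with pyopencl. This is only an illustration of the approach, not Gremlin's implementation; the kernel body, buffer size, and 60-second duration are arbitrary choices.

```python
import time

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# A kernel that does pointless trigonometry to keep the GPU's cores busy.
program = cl.Program(ctx, """
__kernel void burn(__global float *data) {
    int gid = get_global_id(0);
    float x = data[gid];
    for (int i = 0; i < 100000; i++) {
        x = sin(x) * cos(x) + 1.0f;
    }
    data[gid] = x;
}
""").build()

data = np.random.rand(1 << 20).astype(np.float32)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf=data)

# Re-launch the kernel until the time budget runs out.
deadline = time.time() + 60
while time.time() < deadline:
    program.burn(queue, data.shape, None, buf)
    queue.finish()
```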
How can you use GPU experiments to test reliability?
The GPU experiment is useful for simulating GPU-intensive workloads, testing observability, triggering auto-scaling mechanisms, and more.
AI was a key motivation behind the development of the GPU Gremlin: 72% of organizations surveyed by McKinsey & Company were using AI at the start of 2024, and that number is likely to keep growing throughout 2025. AI relies heavily on GPUs, since they’re extremely efficient at parallel processing. With expectations around AI reaching historic highs, the reliability of these systems is more important than ever. But what are some of the ways these highly complex systems can fail?
Testing AI large language model performance and availability
AI is driven by models, which are mathematical constructs that process enormous amounts of data to synthesize information in a human-like way. The most well-known models are large language models (LLMs)—the kind used in services like ChatGPT—but there are others. For example, DALL-E uses a model called CLIP, or Contrastive Language-Image Pre-training, to generate images from plain text.
No matter what type of model you use, they all have one thing in common: they need to crunch enormous amounts of data, and GPUs are the most effective hardware for the job. But what happens when those GPUs are busy responding to users or training other models? What happens when you don’t have enough video memory to hold the full model? How does limited capacity affect your ability to schedule tasks, and can your infrastructure scale to meet changing demand?
The GPU experiment can test all of these situations. If you have an LLM deployed, run a GPU experiment alongside it to simulate heavy loads or additional workloads. While the experiment is running, monitor the performance, throughput, and availability of your LLM to determine what (if any) impact there is. Like Gremlin's other resource experiments, this can help with capacity planning and scaling.
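One lightweight way to quantify that impact is to measure request latency against your model's endpoint before and during the experiment. The sketch below assumes a hypothetical HTTP inference endpoint (INFERENCE_URL) and a fixed prompt; in practice you'd point it at whatever interface your LLM exposes, or lean on the metrics your observability tooling already collects.

```python
import statistics
import time

import requests

INFERENCE_URL = "http://llm.internal:8080/generate"  # hypothetical endpoint
PROMPT = {"prompt": "Summarize the plot of Hamlet in one sentence.", "max_tokens": 64}

def measure_latencies(samples=20):
    """Send a fixed prompt repeatedly and record per-request latency in seconds."""
    latencies = []
    for _ in range(samples):
        start = time.monotonic()
        requests.post(INFERENCE_URL, json=PROMPT, timeout=30)
        latencies.append(time.monotonic() - start)
    return latencies

# Run once as a baseline, then again while the GPU experiment is active,
# and compare the median and worst-case latencies.
baseline = measure_latencies()
print(f"median={statistics.median(baseline):.2f}s max={max(baseline):.2f}s")
```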
Validating scalability
Autoscaling based on GPU usage isn’t quite as common as autoscaling on CPU, but major cloud providers still support it.
For example, systems with NVIDIA GPUs can use a utility like NVIDIA System Management Interface (nvidia-smi) to gather metrics on GPU utilization, memory consumption, power consumption, and temperature. If your systems are running on AWS, you can push this data to Amazon CloudWatch and use it as a trigger for dynamic autoscaling.
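As a rough sketch of that pipeline, the snippet below scrapes nvidia-smi's CSV output and publishes it as custom CloudWatch metrics with boto3. The Custom/GPU namespace and metric names are placeholders; adjust them to match your own monitoring conventions.

```python
import subprocess

import boto3

cloudwatch = boto3.client("cloudwatch")

def sample_gpu():
    """Read utilization, memory, and temperature for the first GPU via nvidia-smi."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=utilization.gpu,memory.used,temperature.gpu",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    # nvidia-smi prints one line per GPU; this sketch only reports the first one.
    util, mem_used, temp = (float(v) for v in output.strip().splitlines()[0].split(", "))
    return util, mem_used, temp

util, mem_used, temp = sample_gpu()
cloudwatch.put_metric_data(
    Namespace="Custom/GPU",  # placeholder namespace
    MetricData=[
        {"MetricName": "GPUUtilization", "Value": util, "Unit": "Percent"},
        {"MetricName": "GPUMemoryUsed", "Value": mem_used, "Unit": "Megabytes"},
        {"MetricName": "GPUTemperature", "Value": temp, "Unit": "None"},
    ],
)
```

Run this on a schedule (for example, via cron) so CloudWatch has a steady stream of data points to evaluate against your scaling policy.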
Once you have your metrics and scaling policy configured, run a simple GPU experiment. Make sure it runs long enough to meet the requirements of your scaling policy while accounting for delays in metrics collection. For instance, if your auto-scaling group is set to scale after 1 minute of 80% load, run your experiment for 2 minutes or longer.
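For reference, a simple-scaling setup matching that example (scale out after one minute at or above 80% GPU utilization) could look something like the sketch below, again using boto3. The Auto Scaling group name, policy name, and thresholds are assumptions; tune the period and evaluation count to match how often you publish the metric.

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Simple scaling policy: add one instance when the alarm fires.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="gpu-workers",   # assumed ASG name
    PolicyName="scale-out-on-gpu-load",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,
)

# Alarm when average GPU utilization is at or above 80% for one 60-second period.
cloudwatch.put_metric_alarm(
    AlarmName="gpu-utilization-high",
    Namespace="Custom/GPU",               # matches the custom metric pushed above
    MetricName="GPUUtilization",
    Statistic="Average",
    Period=60,
    EvaluationPeriods=1,
    Threshold=80.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```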
In addition to scaling up, this is a great opportunity to validate that your systems can also scale down. After the experiment ends and GPU usage returns to normal, ensure your auto-scaling group terminates and removes the now-unnecessary hosts. This keeps costs and overhead down.
Simulating LLM deployments
LLMs are extremely resource-intensive, with even small models requiring tens of gigabytes (GB) of memory. For example, a 13-billion-parameter LLaMA model at 16-bit precision needs at least 26 GB of memory just to hold its weights. GPT-3, which has 175 billion parameters, would need at least 350 GB. And GPT-4 is reportedly even larger, at around 1.8 trillion parameters. Even with modern enterprise GPUs exceeding 48 GB of memory, you’d need to fully dedicate at least 8 of them to run a single GPT-3-scale model.
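The back-of-the-envelope math is simply parameter count times bytes per parameter. Here is a small sketch of that calculation (weights only; activations, KV cache, and framework overhead add more on top):

```python
import math

BYTES_PER_PARAM = 2   # 16-bit precision
GPU_MEMORY_GB = 48    # a typical high-end enterprise GPU

def weights_memory_gb(params_billion):
    """Memory needed just to hold the model weights, in GB."""
    return params_billion * 1e9 * BYTES_PER_PARAM / 1e9

for name, params_billion in [("LLaMA 13B", 13), ("GPT-3 175B", 175)]:
    needed = weights_memory_gb(params_billion)
    gpus = math.ceil(needed / GPU_MEMORY_GB)
    print(f"{name}: ~{needed:.0f} GB of weights -> at least {gpus} x {GPU_MEMORY_GB} GB GPUs")
```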
While you’re probably not running models on the scale of GPT-3, you can still simulate their impact. Using Gremlin, you can easily distribute the experiment across your fleet and adjust the number of impacted nodes to simulate models of different sizes and complexities.
For larger models or deployments, you can also use Gremlin’s advanced targeting options. For example, imagine you want to simulate a GPT-3-size LLM on a fleet of 8 servers. You can configure a GPU experiment to run on 8 specific servers or have Gremlin select 8 random servers from your entire fleet (e.g., using agent tags and impact limits).
Preparing for noisy neighbors
When you run GPU workloads on shared infrastructure, you’re competing with other workloads for resources. These other workloads can have very different requirements than yours and may have spikes in demand that affect your own workloads. These are called noisy neighbors.
If you want to preemptively test the impact of a noisy neighbor, run a GPU experiment alongside your applications and observe their performance. If there’s no change or the change is minimal, you can deploy confidently to the cloud. If not, consider provisioning dedicated GPU resources or increasing the size of your fleet.
Validating fault tolerance
Reliability becomes an even greater concern for workloads that span multiple GPUs and/or multiple GPU-equipped servers. Not only do you have to worry about a GPU failing, but you also have servers, networks, task orchestrators, and dependencies, all of which can fail. Training GPT-4 reportedly cost OpenAI $63 million: at that scale, failures carry significant costs.
Here, you can improve your testing by combining experiments. Gremlin’s blackhole experiment can simulate failures between two servers by dropping network traffic. Using Scenarios, you can run a blackhole experiment to take part of your fleet offline while simultaneously running a GPU experiment. This shows how well your infrastructure can reroute and rebalance workloads around failed instances and maintain resiliency in a distributed environment.
How do you run a GPU experiment?
To run a GPU experiment:
- Log into the Gremlin web app.
- Create a new experiment and select the host(s), container(s), or Kubernetes resource(s) you want to target.
- Under Choose a Gremlin, expand the Resource category, then select GPU.
- Set the Length of the experiment. By default, it will run for one minute.
- Click Run Experiment to start the experiment.
Start building reliable GPU-powered workloads today
The GPU experiment is now available for all Gremlin users! If you’d like to learn more, check out our documentation. And if you haven’t tried Gremlin yet, you can get a 30-day free trial with full access to the Gremlin platform.