The amount of traffic handled by AI systems is hard to overstate. Over half of all organizations in India, the UAE, Singapore, and China use AI, and traffic from generative AI sources has jumped by 1,200% since July 2024. While overall demand for AI-powered workloads is steadily increasing, traffic to individual AI providers is far less predictable. User demand spikes and wanes unexpectedly, but as with any service, users expect you to be available and responsive at all times.

The challenge is that AI is much more difficult to scale than traditional workloads. A single large language model (LLM) like DeepSeek R1 can range from under 1 GB to half a terabyte, depending on the number of parameters. Running an LLM cluster means transferring several gigabytes across the network and into memory for each new host or model you deploy, and that’s before you can handle your first request.

How do you ensure your systems can scale when AI is so demanding? In this blog, we'll explore strategies for scaling AI workloads, from choosing the right metrics to validating your autoscaling rules with simulated load.

How companies scale AI systems today

Leaders in generative AI build their infrastructure on top of scalable systems. OpenAI, one of the biggest names in the industry, runs its workloads on Kubernetes. As of 2024, the company runs its infrastructure on Azure, using Terraform to manage deployments and Kafka for message streaming. OpenAI also uses a less common scheduling algorithm called coscheduling, which schedules multiple pods as a single unit (similar to how Kubernetes deploys multiple containers in a single pod).

Anthropic also runs its flagship Claude model on Kubernetes, via Google Kubernetes Engine (GKE) and previously Amazon Elastic Kubernetes Service (EKS). Their EKS deployment used spot instances (EC2 instances built from spare capacity that can be reclaimed at any time) and Karpenter in place of the standard Cluster Autoscaler. Using spot instances instead of standard on-demand instances cut Anthropic's cloud bill by 40%, and since their workloads stored data in S3, they didn't have to worry about transferring data off of terminating nodes.

Best practices for scaling AI workloads

Scaling any large, distributed workload can be broken down into three steps:

  1. Determine which metrics to scale on.
  2. Set thresholds for scaling on those metrics.
  3. Proactively test those thresholds by simulating load.

We’ll focus on inference servers, since they tend to be the most important user-facing part of AI systems. An inference server receives user requests, hands them to the AI model, and returns the model’s response to the user, which means its performance and availability are critical.
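
To make that flow concrete, here's a minimal sketch of the round trip from the client's perspective. The endpoint URL and request schema below are placeholders, not any specific server's API; substitute your own inference server's interface.

```python
import requests

# Hypothetical inference endpoint -- substitute your own server's URL and schema.
INFERENCE_URL = "http://inference.example.internal:8000/v1/generate"

payload = {
    "prompt": "Summarize the benefits of autoscaling in one sentence.",
    "max_tokens": 64,
}

# The inference server queues this request, batches it with other waiting
# requests, runs the model, and returns the generated text.
response = requests.post(INFERENCE_URL, json=payload, timeout=30)
response.raise_for_status()
print(response.json())
```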

Determining which metrics to scale AI workloads on

Platforms like Kubernetes can automatically monitor and scale up systems, but it’s up to you to choose which metrics to use and what thresholds to set. With AI deployments, it might seem like an obvious choice to scale based on GPU usage. After all, AI models heavily rely on GPU performance, so you should be able to set a threshold on GPU capacity just like CPU or RAM capacity, right?

Unfortunately, scaling on GPU usage is an anti-pattern. There isn’t a clear relationship between GPU utilization and AI model throughput, and because AI models keep their GPUs busy by design, utilization will sit above almost any threshold you set, which tells you little about the experience your users are actually getting.

Instead, focus on metrics that track the user experience. For AI applications, this often comes down to:

  1. Queue size, which tracks the number of requests waiting in the inference server queue before they’re added to the current batch. In other words, this is how many user queries are waiting to be processed.
  2. Batch size, which is the number of requests processed at once. Batches are pulled from the queue, so larger batches mean less frequent pulls and therefore higher latency.
  3. Time-per-token, which is the average time needed to generate each output token. This effectively equates to the latency users experience.

Google found that an increase in queue size leads to a corresponding increase in latency, making it an excellent metric for autoscaling. Batch size works too, but it’s a better fit if you already have a specific latency target. In Google’s testing, setting the queue size target to 25 kept the mean time-per-token under 0.4 seconds despite traffic spikes, while setting the batch size target to 50 brought the mean time-per-token under 0.3 seconds.
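
To see why queue size tracks latency so closely, consider a toy simulation (not Google's methodology, and all of the rates below are made up): once requests arrive faster than the server can drain them in batches, queue depth and per-request wait time grow together.

```python
import random

# Toy simulation: requests arrive at a fixed rate, the server pulls one batch
# from the queue every BATCH_INTERVAL seconds, and we track how queue depth
# and per-request wait time move together. All rates are assumptions.
ARRIVALS_PER_INTERVAL = 40   # assumed incoming requests per interval
BATCH_SIZE = 32              # assumed max requests processed per batch
BATCH_INTERVAL = 1.0         # assumed seconds between batch pulls

queue = []                   # arrival timestamps of waiting requests
clock = 0.0

for step in range(1, 31):
    clock += BATCH_INTERVAL
    # New requests arrive at some point during this interval.
    queue.extend(clock - random.random() * BATCH_INTERVAL
                 for _ in range(ARRIVALS_PER_INTERVAL))
    # The server pulls the oldest requests off the front of the queue.
    batch, queue = queue[:BATCH_SIZE], queue[BATCH_SIZE:]
    waits = [clock - arrived for arrived in batch]
    avg_wait = sum(waits) / len(waits) if waits else 0.0
    if step % 5 == 0:
        print(f"t={clock:4.0f}s  queue size={len(queue):4d}  "
              f"avg wait={avg_wait:5.2f}s")
```

Run it and you'll see the average wait climb step after step as the backlog grows, which is exactly the kind of signal you want an autoscaler to act on.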

Configure your systems to scale based on your metrics

Once you’ve selected your metrics, you need a way to get them from your inference server to your orchestration platform, whether that’s Kubernetes, EKS, GKE, or a similar platform. The exact process depends on your inference server, orchestrator, and cloud platform.

As an example, NVIDIA’s Triton Inference Server exposes a Prometheus metric called nv_inference_queue_duration_us, which tracks the total amount of time requests spend waiting in the queue (in microseconds, as the _us suffix indicates). Assuming you’ve deployed Prometheus to your Kubernetes cluster, along with an adapter such as the Prometheus Adapter to surface the metric through the custom metrics API, you can configure a Horizontal Pod Autoscaler (HPA) to scale out when this metric exceeds your threshold.
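
Here's a minimal sketch of what that HPA could look like, defined in Python and applied with the Kubernetes client. The namespace, Deployment name, target value, and the avg_time_queue_us metric name (a per-pod metric you'd derive from nv_inference_queue_duration_us in your adapter's rules) are all assumptions to adapt to your environment.

```python
from kubernetes import client, config, utils

# A sketch, not a drop-in config: assumes the Prometheus Adapter already
# exposes a per-pod custom metric (here called "avg_time_queue_us", derived
# from nv_inference_queue_duration_us) through the custom metrics API.
# Namespace, Deployment name, and thresholds below are placeholders.
hpa_manifest = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "triton-queue-hpa", "namespace": "inference"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "triton-inference-server",
        },
        "minReplicas": 2,
        "maxReplicas": 10,
        "metrics": [{
            "type": "Pods",
            "pods": {
                "metric": {"name": "avg_time_queue_us"},
                # Scale out when average queued time per Pod exceeds
                # 200 ms (200,000 microseconds).
                "target": {"type": "AverageValue", "averageValue": "200000"},
            },
        }],
    },
}

config.load_kube_config()
utils.create_from_dict(client.ApiClient(), hpa_manifest)
```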

The HPA only increases the number of Pods—what about deploying new nodes? When you deploy Kubernetes to a cloud platform like EKS or GKE, the platform continuously monitors the status of your Pods. If a Pod is in the Pending state because there aren’t enough resources to run it, the Cluster Autoscaler will detect this and start provisioning a new node (assuming you haven’t yet reached your max cluster size). When demand decreases, the queue duration will also decrease, and your deployment will scale back down.
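
If you want to confirm that chain of events yourself, a quick check for Pending Pods shows whether the HPA has created Pods the cluster can't yet place, which is exactly what the Cluster Autoscaler reacts to. This sketch assumes the same hypothetical inference namespace used above.

```python
from kubernetes import client, config

# List Pods stuck in the Pending phase -- the same Pods the Cluster
# Autoscaler watches when deciding whether to provision a new node.
config.load_kube_config()
v1 = client.CoreV1Api()

pending = v1.list_namespaced_pod(
    namespace="inference",                 # placeholder namespace
    field_selector="status.phase=Pending",
)

for pod in pending.items:
    # For unschedulable Pods, the conditions usually explain why
    # (for example, no node has enough GPU capacity).
    conditions = pod.status.conditions or []
    messages = [c.message for c in conditions if c.message]
    print(pod.metadata.name, "-", messages[0] if messages else "Pending")
```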

Validate your scalability by simulating demand

Now that you’ve configured your autoscaling rules, you should test them to verify that they work as expected. In this case, the risks are:

  • Failing to scale, which can increase the queue size and latency for users.
  • Scaling too aggressively and overprovisioning resources.
  • Failing to scale back down when demand decreases, resulting in a large cloud bill.

One way to test this is to simply spam requests at your inference server until the queue size grows. A less brute-force method is to first consume hardware resources so the inference server runs slower than usual, then send requests until the queue backs up.

As mentioned above, AI models rely heavily on GPUs. While the metrics we listed aren’t based solely on GPU usage, GPU usage still factors into them, so we can artificially drive them up by consuming GPU capacity. Gremlin recently released an experiment specifically for stress testing GPU compute resources over a period of time (e.g. 5 minutes). You can run this experiment to reduce GPU availability across your cluster, then use a standard load testing tool like Apache JMeter or hey to send requests to the inference server. As more requests enter the queue, the queue duration will increase and eventually trigger the pod autoscaler. If the autoscaler can’t deploy new pods due to limited resources, the Cluster Autoscaler will provision a new node.
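
If you'd rather script the load yourself instead of reaching for JMeter or hey, a few lines of Python do the same job. The endpoint and payload below are the same placeholders as earlier; adjust the request count and concurrency to match the capacity you're testing.

```python
import concurrent.futures
import requests

# Flood the inference endpoint with concurrent requests while the GPU
# experiment runs, so the queue backs up and the autoscaler has something
# to react to. URL and payload are placeholders for your own server.
INFERENCE_URL = "http://inference.example.internal:8000/v1/generate"
PAYLOAD = {"prompt": "Write a haiku about autoscaling.", "max_tokens": 64}
TOTAL_REQUESTS = 2000
CONCURRENCY = 50

def send_request(_):
    try:
        resp = requests.post(INFERENCE_URL, json=PAYLOAD, timeout=60)
        return resp.status_code
    except requests.RequestException:
        return "error"

with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(send_request, range(TOTAL_REQUESTS)))

# Summarize response codes so you can spot timeouts or errors at a glance.
print({code: results.count(code) for code in set(results)})
```

While this runs, watch the queue duration metric in Prometheus and the Pod count of your inference Deployment: both should climb under load, and the Pod count should fall again once the load stops.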

If your test fails—for example, the HPA doesn’t deploy a new pod—congratulations, you discovered a reliability risk before deploying to production! Make adjustments to your autoscaling rules, deploy the changes, and then re-run the test until it behaves as expected. Repeat this process until you’re confident that your deployment can scale and that scaling lowers the queue duration.

Other ways to make your AI workloads more resilient

Want to learn other ways to improve the reliability of your AI-powered workloads? Check out our blog on How to make your AI-as-a-Service more resilient.

Andre Newman
Sr. Reliability Specialist
Start your free trial

Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30 day trial.
