When you think about “AI reliability,” what comes to mind?

If you’re like most people, you’re probably thinking of the accuracy of generative AI models like ChatGPT, Stable Diffusion, and Sora. While this is certainly important, there’s an even more fundamental type of reliability: the reliability of the infrastructure that your AI models and applications run on.

AI infrastructure is complex, distributed, and automated, making it highly susceptible to failure. As the AI market grows and more competitors emerge, companies providing AI-powered services must demonstrate their ability to serve customer requests with as little downtime or disruption as possible.

In this blog, we’ll explore how AI systems can fail, how these failures impact the user experience, and how you can prevent common failure modes using reliability testing.

The reliability risks in AI systems

Before discussing reliability best practices, let’s cover the reliability risks inherent in AI systems.

These can vary based on several factors, including the type of AI workload, the size and complexity of the models deployed, and the computing capacity of the systems hosting them. But generally, AI architectures have three key qualities:

  • They’re distributed, which means they run on many hosts across a data center, availability zone, or region.
  • They’re networked, which means they transmit models, user requests, and other data over networks.
  • They’re designed to scale, which means they increase and decrease computing capacity in response to changing user demand or model sizes.

Fortunately, we have a wealth of best practices around each of these. Let’s look at each one individually.

Distributed AI models mean more points of failure

In distributed environments, workloads run across multiple hosts simultaneously. AI models in particular can be massively distributed, spanning anywhere from a single host to tens of thousands of hosts. In an environment this large and complex, failures aren’t a matter of if, but when. Your infrastructure must be ready to detect and replace failed components while re-routing traffic in the meantime.

You can mitigate this risk by using an orchestration tool to manage your AI workloads and infrastructure. For example, Ray is a commonly used open-source project for running machine-learning applications. While it natively supports clustering, you can run it on Kubernetes via KubeRay. This lets you leverage Kubernetes features like horizontal Pod autoscaling, taints and tolerations, and resource requests. Kubernetes can also monitor the status and health of your workloads, automatically replace failed containers, balance resource usage across the cluster, and even scale up or down in response to demand.
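To make this concrete, here’s a minimal sketch of how work might be fanned out across a Ray cluster (assuming the Ray Python package; the inference function and data are hypothetical placeholders). On a KubeRay-managed cluster, Kubernetes replaces failed Pods underneath Ray while Ray reschedules the affected tasks.

```python
import ray

# Starts a local Ray instance for demonstration. On a KubeRay-managed cluster,
# you'd typically connect with ray.init(address="auto") or submit a RayJob.
ray.init()

@ray.remote(num_cpus=1)
def run_inference(batch):
    # Placeholder for real model inference on a single batch of inputs.
    return [len(item) for item in batch]

# Fan the work out across the cluster; Ray schedules each task on a healthy worker.
batches = [["hello", "world"], ["distributed", "ai"]]
futures = [run_inference.remote(batch) for batch in batches]
print(ray.get(futures))
```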

Adding a tool like KubeRay might seem like introducing yet another point of failure, but it opens up an entire ecosystem of reliability tooling and best practices. We’ve written plenty of guides on making Kubernetes more resilient, and by running AI models on Kubernetes with KubeRay, you can use these best practices to harden your cluster and ensure your workloads are resilient.

Tip
To reduce cloud costs, use spot instances instead of dedicated instances. Spot instances can be reclaimed at any time if your cloud provider needs the capacity, but they’re often cheaper, and providers give advance notice before shutting them down. This gives your environment time to provision a replacement node.
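As a rough illustration, here’s a sketch of watching for a spot interruption notice on AWS EC2 (the metadata endpoint is AWS-specific and assumes IMDSv1 is enabled; other providers expose similar signals). A real setup would drain workloads and cordon the node rather than just logging.

```python
import time
import urllib.request

# AWS publishes a spot interruption notice at this instance metadata path
# roughly two minutes before reclaiming the instance.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending() -> bool:
    try:
        with urllib.request.urlopen(SPOT_ACTION_URL, timeout=1) as resp:
            return resp.status == 200  # A 200 response means a notice was issued.
    except OSError:
        return False  # 404 or unreachable: no interruption is scheduled.

while True:
    if interruption_pending():
        print("Spot interruption notice received: drain this node and stop accepting work.")
        break
    time.sleep(5)
```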

Network instability can significantly slow down response times

Distributed applications require high-speed networks, and AI is no exception. Up to 33% of the time needed to complete an AI/ML task is spent waiting for network availability, and AI network traffic is expected to double every two years. There are several ways networks can fail:

  • Outages caused by failed routers, switches, or network interface cards.
  • Latency due to high traffic, limited network capacity, or over-taxed network devices.
  • Connection errors resulting from invalid or expired TLS certificates.

One way to mitigate network outages is to use a service mesh like Istio, which can detect and automatically retry failed connections, add circuit breakers, and even run basic Chaos Engineering experiments. If your AI service runs on a cloud platform like AWS or GCP, the provider often includes guidance on making your private network (a virtual private cloud, or VPC, on AWS) more resilient.
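A service mesh applies retries and circuit breaking through configuration, but the underlying idea is simple. Here’s a client-side sketch of retrying a flaky call with exponential backoff and jitter (the endpoint is a hypothetical example, and this is an analogy for what the mesh does automatically, not Istio’s actual mechanism):

```python
import random
import time
import urllib.request
import urllib.error

def call_with_retries(url: str, max_attempts: int = 3, base_delay: float = 0.5) -> bytes:
    """Retry a request with exponential backoff and jitter, similar in spirit
    to what a service mesh like Istio does automatically via configuration."""
    for attempt in range(1, max_attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                return resp.read()
        except urllib.error.URLError:
            if attempt == max_attempts:
                raise  # Out of retries; let the caller (or a circuit breaker) handle it.
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            time.sleep(delay)

# Hypothetical internal endpoint for an inference service:
# payload = call_with_retries("http://inference.internal:8080/healthz")
```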

Service meshes and VPCs help you route around internal network failures, but what about external ones? How do you prepare for problems where users can’t access your services? This is where API gateways help: they route requests to one or more backend services based on configuration rules. Most of these rules are based on the contents of the request (e.g., if a user sends a request to a specific URL, forward it to a specific service), but API gateways can also route based on network conditions.

For instance, if a host fails to respond to a health check, the API gateway will mark it as unavailable and reroute the request to a different host. API gateways and load balancers can also track response times for individual hosts, and if a service has one fast host and one slower host, the gateway will route the request to the faster host. This also helps balance requests across your network so no one host gets overloaded.
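To illustrate the idea, here’s a toy sketch of latency-aware host selection, roughly what a gateway or load balancer does internally with its own health checks and response-time tracking (the host names and timings are hypothetical):

```python
import statistics

# Rolling response times (in seconds) observed per backend host.
response_times = {
    "backend-a.internal": [0.12, 0.10, 0.11],
    "backend-b.internal": [0.45, 0.52, 0.48],
}
healthy_hosts = {"backend-a.internal", "backend-b.internal"}

def pick_host() -> str:
    """Choose the healthy host with the lowest average response time."""
    candidates = {h: statistics.mean(t) for h, t in response_times.items() if h in healthy_hosts}
    return min(candidates, key=candidates.get)

def record(host: str, elapsed: float, keep_last: int = 20) -> None:
    """Track the latest response time, keeping a small rolling window per host."""
    response_times[host] = (response_times[host] + [elapsed])[-keep_last:]

print(pick_host())  # -> backend-a.internal, the faster host
```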

AI models need to scale quickly and reliably

AI models are getting bigger, which means they need more powerful hardware. In just the past 15 years, the number of parameters (the variables a model uses in decision-making) has grown from 1 million to over 1.5 trillion! For example, a 70-billion-parameter model requires roughly 168 GB of GPU memory, while Llama 3.1’s full 405-billion-parameter model requires nearly one terabyte.

Note
These memory requirements were estimated using https://token-calculator.net/llm-memory-calculator.

Source: https://ourworldindata.org/scaling-up-ai
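The math behind these figures is straightforward to approximate. Here’s a back-of-the-envelope estimate assuming 16-bit (2-byte) weights and roughly 20% overhead for things like the KV cache and activations; exact numbers depend on precision, sequence length, and serving framework:

```python
def estimated_gpu_memory_gb(params_billions: float,
                            bytes_per_param: int = 2,
                            overhead: float = 0.2) -> float:
    """Rough GPU memory needed to serve a model: weights plus ~20% overhead."""
    weights_gb = params_billions * bytes_per_param  # billions of params * bytes each = GB
    return weights_gb * (1 + overhead)

print(estimated_gpu_memory_gb(70))   # ~168 GB for a 70B-parameter model
print(estimated_gpu_memory_gb(405))  # ~972 GB, close to a terabyte, for Llama 3.1 405B
```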

Scaling to this size isn’t effortless, even in a cloud environment. You need to determine how much capacity to provision for the AI model(s) you want to deploy, plus additional capacity for user requests and other services. The “cold start” problem makes this even harder: models take extra time to respond to requests while they’re being deployed or scaled. Each step in the deployment process takes time, even in the cloud:

  1. Instances and containers need to be provisioned.
  2. Models must be transferred from storage (like HDDs or SSDs) onto the new instances and into GPU memory.
  3. Routers and switches must redirect traffic to the new instances and set up health checks.
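Step 2 alone can dominate the cold start. As a rough, hypothetical back-of-the-envelope calculation (the throughput figures below are assumptions, not benchmarks):

```python
def load_time_seconds(model_size_gb: float, throughput_gb_per_s: float) -> float:
    """Rough time to move model weights through one stage of the loading pipeline."""
    return model_size_gb / throughput_gb_per_s

model_gb = 810  # ~405B parameters at 16-bit precision, weights only
print(load_time_seconds(model_gb, 2.0))  # ~405 s (about 7 minutes) from a fast NVMe SSD at ~2 GB/s
print(load_time_seconds(model_gb, 0.5))  # ~1620 s (about 27 minutes) over a ~4 Gbit/s network link
```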

One option is to keep the model loaded even during periods of low demand, but this results in higher costs. Another is to use a faster model-loading system, like Amazon’s Fast Model Loader, though these may be limited to specific services. Ultimately, the solution you choose will have to balance responsiveness and cost, while also dynamically scaling to meet surges in user demand.

Test and prove the reliability of AI-based services

Building fault tolerance into your AI services is important, but how do you prove it’s effective?

The most effective way to do this is by recreating failure conditions on the systems you want to test using fault injection. Chaos Engineering and reliability testing tools like Gremlin use fault injection to create failure modes on your systems in a safe and controlled way. Gremlin, in particular, adds a number of safety features and controls so you can apply specific failure modes to specific services, down to an individual process or application.

Here’s how these can apply to AI.

Validate resilience to network issues

Gremlin provides a range of network-based tests, including blackhole (dropping network packets), latency, packet loss, DNS failure, and checking for expiring TLS certificates. These help ensure that your AI models and other services can work around poor network conditions or performance.

If you’re unsure where to start, Gremlin automatically detects dependencies that your services communicate with. After you install the Gremlin agent and add a service, Gremlin uses DNS traffic to identify dependencies. Next, it generates a set of network tests that validate the service’s response when the dependency is unavailable or slow. It also checks the dependency’s entire TLS certificate chain to look for expired or expiring certificates.
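For context, here’s a small sketch of the kind of check involved in that last step: inspecting a dependency’s TLS certificate for its expiration date. This is a simplified, single-certificate version of the idea (Gremlin checks the entire chain automatically); the host name is just an example.

```python
import socket
import ssl
from datetime import datetime, timezone

def days_until_cert_expiry(host: str, port: int = 443) -> float:
    """Return how many days remain before a host's TLS certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # notAfter is formatted like "Jun  1 12:00:00 2026 GMT".
    expires = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)
    return (expires - datetime.now(timezone.utc)).total_seconds() / 86400

print(days_until_cert_expiry("example.com"))
```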

Test your ability to scale

The best way to test scalability is to subject systems to load and observe how they respond. With modern AI models, the GPU is arguably the most critical component, which is why Gremlin provides the GPU experiment. It uses OpenCL to create and run workloads that consume the GPU’s computing capacity. In a cloud environment like AWS, you can also use GPU metrics as the basis for autoscaling Kubernetes deployments and regular clusters.

This experiment also simulates noisy neighbors. For example, if multiple AI models run on the same host, or a single host handles many user requests simultaneously, those models all contend for the same GPU hardware. The GPU experiment recreates this situation, letting you test your model’s responsiveness when a key resource is limited.
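If you want a feel for what GPU contention looks like before running a controlled experiment, here’s a crude load-generator sketch using PyTorch (my own assumption for illustration; Gremlin’s experiment uses OpenCL, not PyTorch). Running it alongside an inference workload on the same GPU approximates a noisy neighbor.

```python
import time
import torch

def burn_gpu(seconds: float = 30.0, size: int = 4096) -> None:
    """Keep a GPU busy with large matrix multiplications for a fixed duration."""
    assert torch.cuda.is_available(), "This sketch needs a CUDA-capable GPU."
    device = torch.device("cuda")
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    deadline = time.time() + seconds
    while time.time() < deadline:
        a = a @ b                 # Repeated matmuls keep the GPU's compute units saturated.
    torch.cuda.synchronize()      # Wait for queued kernels to finish before returning.

burn_gpu(seconds=10)
```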

Gremlin also tests resource scalability using CPU and RAM utilization, storage capacity, and storage throughput. While these aren’t as AI-specific as GPU usage, they’re still vital for other services, such as your user-facing front end. After all, it doesn’t matter how scalable your AI model is if you can’t handle an increase in user demand.

See Gremlin in action with an interactive product tour

Ready to make your AI and LLM workloads resilient against failure? Give Gremlin a try with a free 30-day trial, or see how easy it is to get started with our interactive product tours.

Andre Newman
Sr. Reliability Specialist