Serverless computing requires a significant shift in how organizations think about deploying and managing applications. No longer do Ops teams need to think about provisioning servers, installing operating system patches, and writing shell scripts to manage deployments. While serverless takes away much of this responsibility, one aspect still needs to be handled thoughtfully: reliability.

In this blog, we’ll look at three important facts about serverless reliability that teams often overlook. We’ll explain what they are, what the risks are of not addressing them, and how you can make your serverless applications more fault-tolerant.

What makes serverless different from other architectures?

“Serverless” doesn’t really mean “no servers.” Instead, it means shifting the responsibility of provisioning, deploying, and maintaining servers from the customer to the platform provider. Providers like AWS (Lambda), Google (Cloud Functions), and Azure (Functions) manage the physical infrastructure, operating system, and runtimes, letting you focus almost entirely on deploying code.

As a result, many teams assume they no longer have any control over, or responsibility for, reliability. But the truth is:

  1. Serverless architectures don’t guarantee reliability.
  2. You do have control over serverless reliability.
  3. Serverless reliability practices can benefit all platforms, not just serverless platforms.

The key thing to remember, especially when coming from an infrastructure-centric environment, is to shift your attention from infrastructure reliability to code reliability. While you might not have direct control over the environment, you do have control over your code. Most of these recommendations focus on that fact.

#1: Serverless doesn’t guarantee reliability

Cloud providers do their best to maintain high availability, but they’re not infallible. Power outages, data center fires, networking errors, and even bad configuration changes can take your serverless functions offline at any time. This is the trade-off of using a cloud provider for any managed service: you give up some control in exchange for less effort spent deploying and maintaining infrastructure.

This isn’t to say providers fail often. Most large providers offer at least three nines (99.9%) of availability, equating to roughly 8.8 hours (about one workday) of downtime per year. Many also offer credits if downtime exceeds this amount: AWS Lambda offers a full service credit if uptime falls below 95%. In the meantime, however, your applications are still offline. This brings us to the next fact:

#2: You still have control over serverless reliability

Just because you gave up some control when switching to serverless doesn’t mean you’re out of options. Using Lambda as an example, you can still choose:

  1. Your application code and runtime environment (Java, Node.js, Python, etc.).
  2. How much CPU and RAM to allocate to your function.
  3. How to network your function to users and other services.
  4. The maximum number of application instances that can run simultaneously (concurrency).

The first and most basic step is to ensure your function has enough CPU and RAM to run from start to finish. This is often as easy as just sending a typical request to your function and measuring how long it takes to finish.
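As a rough illustration, you can time a typical invocation locally before tuning your function's resources. This sketch assumes a hypothetical handler and payload; substitute your own:

```python
import time

def handler(event, context=None):
    # Hypothetical function body: stands in for the real work your function does.
    total = sum(range(100_000))
    return {"statusCode": 200, "body": str(total)}

start = time.perf_counter()
response = handler({"path": "/example"})
duration_ms = (time.perf_counter() - start) * 1000
print(f"Handler returned {response['statusCode']} in {duration_ms:.1f} ms")
```

In Lambda, CPU allocation scales with the memory setting, so if a typical request runs close to your timeout, raising memory also buys more CPU.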

Concurrency as a source of redundancy

The next step is to adjust concurrency. Concurrency isn’t just about running multiple instances of your function at once: it also creates redundancy automatically. If one instance of your application fails, the serverless platform can immediately spin up a new instance. In the case of Lambda, you can configure your function to run in multiple subnets, with each subnet located in a different availability zone. As Lambda creates new function instances, it automatically distributes them across zones, preventing a single zone outage from taking your application offline.

One challenge with concurrency is that instances take time to start, and the bigger your function is, the longer it takes. You can get around this somewhat by maintaining a minimum number of “warm” instances that are already deployed and ready to serve traffic, although this will increase your operating costs. Some serverless platforms even have this feature built-in (Lambda calls this Provisioned Concurrency).

Writing resilient serverless code

When writing code for a serverless function, code for failure. In other words, approach application development with the expectation that your dependencies (connections made to services outside of the function) will fail, requests will contain malformed data, and that your serverless platform will experience high latency.

"The most common categories of failures are bad deployments and bad configurations. While some of these failures can be difficult to infer or reproduce, common symptoms include disruption of connectivity, increased latency, increased traffic due to retry storms, increased CPU and memory usage, and slow I/O."
- AWS Lambda: Resilience under-the-hood

There’s no one-size-fits-all solution to eliminate these risks, but there are a few steps you can take:

  • Use asynchronous communication, like Promises in JavaScript. If your code sends a network request, synchronous communication blocks the application from continuing until it receives and processes the response. Other services are never guaranteed to be available, so send the request and use an asynchronous process to watch for and handle the response.
  • Use message queues like Apache Kafka and RabbitMQ. Message queues create a buffer between services where messages can pool. This reduces the need to add retry logic or asynchronicity to your applications, but it introduces a new point of failure.
  • Wrap your dependency calls in circuit breakers. Instead of making direct API calls to external services, wrap them in a function that first checks whether the external service is available. If it’s not, the circuit breaker should prevent the call from happening and return an error. Meanwhile, it can check in the background to determine whether the dependency is available again, and if so, reopen the connection.
  • Deploy caches. Caches like Redis temporarily store data locally instead of calling a dependency for every request. This reduces the load on the dependency and provides a fallback in case the dependency fails.
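The first bullet names Promises in JavaScript; the same idea in Python uses asyncio, where a timeout keeps a slow dependency from blocking the whole invocation. This is a sketch with a simulated dependency, not a real network call:

```python
import asyncio

async def slow_dependency():
    # Stand-in for a network call to an external service.
    await asyncio.sleep(0.05)
    return "payload"

async def handler():
    try:
        # Bound the wait so an unresponsive dependency can't hang the function.
        return await asyncio.wait_for(slow_dependency(), timeout=1.0)
    except asyncio.TimeoutError:
        return "fallback"

print(asyncio.run(handler()))
```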
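A minimal circuit breaker might look like the sketch below. The failure threshold and recovery timeout are illustrative defaults, and the wrapped function stands in for any dependency call:

```python
import time

class CircuitBreaker:
    """Opens after max_failures consecutive errors; retries after reset_after seconds."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker opened

    def call(self, func, *args, **kwargs):
        # While open, fail fast until the recovery window has elapsed.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: dependency unavailable")
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success resets the failure count
        return result
```

Wrapping a hypothetical dependency call would look like `breaker.call(fetch_user, "id-123")`; once the breaker opens, callers get an immediate error instead of waiting on a dead service.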

#3: Serverless reliability practices aren’t just for serverless

Serverless isn’t so unique that we can’t apply these same practices to other deployment models. For example, any changes you make to your code for serverless functions—like circuit breakers—will also help improve reliability in a bare metal environment. Similarly, using cloud networking strategies—like using multiple subnets for zone redundancy—also works for infrastructure like compute instances and Kubernetes clusters.

Verify the resilience of your serverless applications

You’ve put a lot of effort into making your serverless applications more resilient: how do you prove your work has made a difference?

Gremlin helps you proactively test your applications to ensure they meet your organization’s reliability standards. Our Failure Flags feature lets you inject faults directly into your applications, just like how Feature Flags let you dynamically enable and disable features. With Failure Flags, you can add latency to dependency calls, trigger exceptions, inject malformed data, return unexpected HTTP status codes, or take any other action you can think of. Prove that you can withstand common failure modes, then schedule your Failure Flags to run periodically to pre-emptively catch regressions.

To learn more, check out this quick introduction to Failure Flags. Or, if you’d like to give it a try, sign up for a free 30-day trial.

Andre Newman
Sr. Reliability Specialist

Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30-day trial.

Close Your AWS Reliability Gap

To learn more about how to proactively scan and test for AWS reliability risks and automate reliability management, download a copy of our comprehensive guide.

Get the AWS Primer