Note
In this blog, we use “managed service providers” to refer to companies that provide hosted computing services, not managed IT service providers (MSPs).

When was the last time you thought about the reliability of your cloud dependencies?

The biggest challenge with using cloud platforms and SaaS services is also their biggest strength: the provider controls everything. You don’t have to worry about maintaining the service yourself, but if it fails, there’s little you can do. And if you’re not prepared for that failure, it can have cascading effects on your own services and applications. What can you do to keep your own services up and running even when your dependencies are down?

In this blog, we’ll look at how you can prepare for outages in your SaaS, FaaS, and other cloud-based services.

What are managed services, and how can they fail?

In the context of this blog, a managed service is any computing service provided by a third party. This includes Software as a Service (SaaS) offerings, cloud computing platforms, and even services owned by other teams in your organization. The distinguishing feature of managed services is that you can use them without the need to own or maintain them yourself. But because you don’t own them, you have limited visibility into them and no direct influence over their operations.

The Shared Responsibility Model of AWS. Source: https://aws.amazon.com/blogs/industries/applying-the-aws-shared-responsibility-model-to-your-gxp-solution/

Providers like AWS, Google, and Microsoft dedicate hundreds of teams, thousands of engineers, and countless hours to building resilient platforms. But despite this effort and preparation, managed services aren’t failure-proof. Like any other service, they’re prone to poor performance, over-saturation, and outages.

Network-based service outages are nothing new. There are established techniques for handling them, such as caching, circuit breakers, and asynchronous calls. The challenge is when the service provides a unique and critical function that your application requires. Just recently, one company’s mobile app lost a key feature (online ordering) due to cloud provider outages. Your services and applications need to be ready for these types of outages, or they could also fail.
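As an illustration of one of the techniques mentioned above, here is a minimal circuit breaker sketch in Python. The class name, thresholds, and the injectable clock are assumptions made for this example, not part of any particular library. After a set number of consecutive failures, the breaker stops calling the dependency for a cool-down period, then allows a single trial call.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the dependency call is skipped."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency calls are paused")
            # Cool-down has elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open (or re-open) the circuit after too many failures,
            # or when the trial call itself fails.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers would wrap each dependency call in `breaker.call(...)` and treat `CircuitOpenError` as a signal to serve a fallback instead of waiting on a dependency that is known to be failing.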

Managed services and single points of failure

When a managed service is essential to your application’s operation, it becomes a single point of failure (SPoF). SPoFs are a significant reliability risk, especially if your application expects them to always be available.

For example, a recent Amazon Kinesis outage impacted several other AWS services, including CloudWatch, ECS Fargate, and API Gateway. Kinesis is a data streaming service that receives data, processes it, and sends it to a destination (like an S3 bucket or Lambda function). When Kinesis failed, this streaming pipeline stopped, which prevented other applications from sending or receiving data. This had a cascading impact on other Amazon-owned services, like Whole Foods, Ring, and Flex. It also affected non-Amazon companies such as Xero, a SaaS-based accounting service. When something as critical as a central data-handling platform fails, how can you maintain availability?

To answer questions like this, we need a way to test what happens to our services when their dependencies (the services they rely on) go down.

Running tests on managed services

This should come as no surprise, but service providers don’t want customers running failure tests on their services. Even tools like AWS FIS (Fault Injection Simulator), which have direct access to the AWS platform, are limited in how they change the service’s behavior. The risk of accidentally impacting other customers—or the platform itself—is simply too high.

Fortunately, there’s another approach. Instead of testing the service itself, we can test our service’s network connection to its dependencies. We can’t change the behavior of the service, but we can change how our own services interact with it.

Think back to the Kinesis outage from the previous section. Instead of relying solely on Kinesis, you could use a different data streaming platform, like Apache Kafka. Kafka is an open source platform that you can host yourself, which requires setup and maintenance, or consume through a managed Kafka service such as Amazon MSK. The key benefit is that you can deploy backup Kafka clusters in different zones, regions, or even clouds.

Imagine we’re using Kafka, and AWS has another outage. What do we do when we can’t use our primary data streaming platform? How does this impact our application? More importantly, how does this impact our users? We can assume our applications will safely and reliably fail over to the backup cluster, but we won’t know for sure until it happens. Instead of waiting for another AWS outage, we can take a proactive approach by recreating the outage ourselves using Gremlin.
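To make the failover question concrete, here is a minimal sketch of application-side failover between a primary and a backup cluster. The function name and the `(name, send)` pairs are hypothetical, and the send functions stand in for real producer calls. Real Kafka clients usually report delivery errors asynchronously, so treat this as an outline of the control flow rather than client code.

```python
def send_with_failover(record, senders):
    """Try each (name, send) pair in order and return the name that accepted the record."""
    errors = []
    for name, send in senders:
        try:
            send(record)
            return name
        except (ConnectionError, TimeoutError) as exc:
            # Remember why this cluster was skipped, then try the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all clusters failed: " + "; ".join(errors))
```

A failure test that blocks traffic to the primary cluster should show this path being taken. If the function raises instead, the backup cluster isn't reachable or isn't configured the way we assumed.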

Testing cloud risks using the Well-Architected Cloud Test Suite

Gremlin comes with a pre-built suite of reliability tests called the Well-Architected Cloud Test Suite. This suite is designed to test how well services adhere to cloud reliability principles and AWS best practices. Unsurprisingly, many of these tests are built for testing dependencies. For each dependency that your services talk to, the Well-Architected Cloud Test Suite tests three things:

  1. Can your service handle a slow connection (latency)?
  2. Can your service handle an unresponsive connection (failure)?
  3. Is the dependency’s TLS certificate expiring soon (certificate expiry)?

For example, imagine we deployed a service called frontend that connects to DynamoDB at dynamodb.us-east-1.amazonaws.com. Since the frontend service is one that we run ourselves, we can run reliability tests on it using Gremlin. If we run a failure test, Gremlin drops all Internet Protocol (IP) packets from the service to dynamodb.us-east-1.amazonaws.com.

Since the test is only running on frontend, it doesn’t impact Amazon’s systems. From their perspective, our service simply stopped sending traffic to DynamoDB.

A list of reliability tests for a dependency.

While the test is running, it’s important to monitor the service to understand how it behaves. For a web service, things to look for include:

  • Do any elements on the page take longer to load than usual? Are any elements not loading at all?
  • Are we showing users that there’s an internal problem? If so, how are we showing it?
  • Is the failure cascading to other services? Are there other services (besides the frontend) showing errors or crashing?
  • How does this affect the user experience? Are users likely to get frustrated or confused by the way our frontend is handling the outage?

Gremlin makes it easy to monitor service health using Health Checks. During a test, Gremlin uses the service’s Health Checks to track its current state. If the service appears unhealthy, Gremlin stops the test and marks it as a failure. We recommend creating Health Checks that connect to your existing metrics, alarms, and Service Level Indicators (SLIs). This way, Gremlin uses the same criteria that you already use to determine service health. And if you don’t already have observability set up, Gremlin can automatically create Health Checks for you.

Testing for reliability risks in serverless applications

So far, we’ve only considered the reliability of network connections. However, managed services can fail in other ways. They might return bad data, hold network connections open, or trigger unexpected paths in your code. To test conditions like these, we need to go one layer deeper and use Failure Flags.

Failure Flags lets you run reliability tests and Chaos Engineering experiments on serverless workloads, such as AWS Lambda functions and containers. Like feature flags (the source of its name), Failure Flags can test specific areas of your application’s code without impacting other areas.

Failure Flags has a unique advantage over the pre-built reliability tests shown above: you can place Failure Flags anywhere in your code. We suggest adding them right before any network calls to external dependencies so that your experiments can directly impact those calls. For example, if you started a latency experiment that adds 200ms of latency, the application will pause for 200ms before making each call to DynamoDB. You can also apply other experiment effects, such as throwing exceptions or modifying variables.

You can also configure your Failure Flags to only fire in certain conditions based on the values of variables. For example, you can configure your Failure Flags tests to only affect traffic from internal test accounts, and not from customers.
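The idea of a flag placed before a dependency call, firing only for matching traffic, can be sketched in plain Python. This is a conceptual stand-in and not the Gremlin Failure Flags SDK: the `failure_flag` function, the experiment dictionary, and the `account_type` label are all hypothetical names chosen for illustration.

```python
import time


def failure_flag(name, labels, experiment, sleep=time.sleep):
    """Apply the experiment's latency if the flag name and labels match.

    Returns True when the effect was applied, False otherwise.
    """
    if experiment is None or experiment["flag"] != name:
        return False
    # Every label the experiment asks for must match the current request.
    if any(labels.get(key) != value for key, value in experiment["match"].items()):
        return False
    sleep(experiment["latency_ms"] / 1000)
    return True


def get_order(order_id, account_type, fetch, experiment=None, sleep=time.sleep):
    # The flag sits right before the call to the external dependency,
    # so an active experiment affects only this call.
    failure_flag("dynamodb-read", {"account_type": account_type}, experiment, sleep)
    return fetch(order_id)
```

With an experiment such as `{"flag": "dynamodb-read", "match": {"account_type": "internal-test"}, "latency_ms": 200}`, only requests from internal test accounts are delayed, and customer requests pass through untouched.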

Tip
Want to learn more about Failure Flags? Check out our tutorials.

Whether you use Gremlin’s pre-built tests or Failure Flags, the process is the same. Start by running tests that change your service’s connection to the managed service. Then, observe how your application behaves, whether manually or by using Health Checks. Finally, address any problems that you find, and repeat the process by testing again.

Fixing issues that you discover

Now that we’ve uncovered reliability risks, what can we do about them? This varies depending on the risk, but here are a few suggestions.

Failure tests

If your service fails during a dependency failure test, this means the dependency is a single point of failure. You can improve resilience by:

  • Using a caching tool like Redis or Varnish to create a buffer between your service and its dependencies.
  • Adding logic to your application to test the connection to the dependency. This can include adding an exception handling block in case the connection can’t be created.
  • Optionally, deploying a redundant fallback service. This can be difficult if you’re using a managed service unique to one specific cloud provider. It would also require having a failover method, such as a load balancer, which could introduce new reliability risks.
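The first two suggestions above can be combined in a small wrapper. The sketch below uses an in-memory dictionary in place of a caching tool like Redis or Varnish, and the class name, staleness limit, and `fetch` callable are assumptions made for the example. It catches connection errors and serves the last known good value if that value is recent enough.

```python
import time


class CachedDependency:
    """Wrap a dependency call and serve the last good value when it fails."""

    def __init__(self, fetch, max_stale_seconds=300.0, clock=time.monotonic):
        self.fetch = fetch                      # callable that talks to the dependency
        self.max_stale_seconds = max_stale_seconds
        self.clock = clock                      # injectable so tests can control time
        self.cache = {}                         # key -> (value, stored_at)

    def get(self, key):
        try:
            value = self.fetch(key)
        except (ConnectionError, TimeoutError):
            if key in self.cache:
                cached_value, stored_at = self.cache[key]
                if self.clock() - stored_at <= self.max_stale_seconds:
                    return cached_value, "stale"
            # Nothing usable in the cache: let the caller handle the error.
            raise
        self.cache[key] = (value, self.clock())
        return value, "fresh"
```

Returning the `"fresh"` or `"stale"` marker alongside the value lets the calling code decide whether to tell users that the data might be out of date.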

Latency tests

If a latency test results in higher response times, your service might be making synchronous (also called “blocking”) calls to a dependency. Synchronous calls will block code from executing until the service receives and processes the response from the dependency.

If your programming language or framework supports it, try switching these to asynchronous calls. Asynchronous network calls work by sending the request as part of a background thread. The background thread waits for the response while the rest of the service continues processing. For user-facing applications, you can use a visual cue like a spinning loading icon to show users that a request is still being processed. This won’t reduce actual latency, but it can reduce perceived latency, which lowers the chance of users getting frustrated and leaving the page.
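As a rough sketch of this pattern, the example below uses Python’s asyncio, which schedules the call on an event loop rather than a background thread. The function names and the simulated delay are hypothetical. The dependency call starts first, the rest of the page is built while it is in flight, and a timeout keeps a slow dependency from holding up the whole response.

```python
import asyncio


async def fetch_recommendations(delay):
    # Stand-in for a slow dependency call; asyncio.sleep simulates network latency.
    await asyncio.sleep(delay)
    return ["item-1", "item-2"]


async def render_page(dependency_delay, timeout=0.5):
    # Start the dependency call without waiting for it to finish.
    task = asyncio.create_task(fetch_recommendations(dependency_delay))
    # Work that doesn't need the dependency can proceed in the meantime.
    page = {"header": "Welcome back"}
    try:
        page["recommendations"] = await asyncio.wait_for(task, timeout)
    except asyncio.TimeoutError:
        # Degrade gracefully instead of making the user wait indefinitely.
        page["recommendations"] = []
    return page
```

Running `asyncio.run(render_page(2.0))` returns the page with an empty recommendations list after roughly half a second, instead of blocking for the full two seconds.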

Certificate expiry tests

Unfortunately, expiring (or soon-to-expire) certificates are the responsibility of the service provider. You can raise the problem with your provider and ask them to rotate their certificates, but otherwise, approach it like you would a failure test. You can also use this as an opportunity to check your own certificates by running certificate expiry experiments.
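If you want to check a certificate’s remaining lifetime yourself, a minimal sketch using Python’s standard `ssl` module might look like the following. The function names are hypothetical, and this is not how Gremlin implements its certificate expiry test.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter time, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def check_host(hostname, port=443, timeout=5.0):
    """Fetch a server's certificate and return the days left before it expires."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return days_until_expiry(tls.getpeercert()["notAfter"])
```

You could call `check_host` for each dependency hostname and alert when the result drops below a threshold such as 30 days. Note that a certificate that has already expired fails verification during the TLS handshake, so `check_host` raises an `ssl.SSLCertVerificationError` in that case instead of returning a negative number.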

How to prove and maintain resilience

After you’ve implemented the fixes, you still need to verify that they work. With Gremlin, this is as easy as re-running the tests that uncovered the issues in the first place. If your fixes are working correctly, your service should remain responsive, resulting in a high reliability score. If not, repeat this process of testing, fixing, and re-testing until you feel confident in your service’s resiliency.

Once you’ve completed testing, make sure you’re continuously monitoring your service’s resiliency by running regular tests. You can schedule reliability tests to run weekly directly in the Gremlin web app. You can choose whether to schedule all tests, only tests that you’ve run, or only tests that have passed. You can also set testing windows to ensure automatic testing doesn’t impact your users or your team.

Autoscheduling reliability tests in the Gremlin web app.

For more tips and insights into testing managed services with Gremlin, you can watch our Office Hours session on How to run fault injection tests on AWS managed services.

Andre Newman
Sr. Reliability Specialist