Making Your APIs More Resilient with Gremlin
Here's the thing: when a company measures its critical services, APIs are often treated as second-class citizens. But APIs are a core part of an organization's infrastructure, and not understanding their weaknesses can lead to performance issues and downtime.
The API brokers metadata between internal services, and there’s always the risk that a failure can affect the user experience or result in an outage. As adoption of your API scales, increased read/write usage can even end up acting like an unintended attack on your own infrastructure.
Here at Gremlin, we aim to help engineers build more resilient infrastructure. We believe focusing on API-related failure injection is critical to ensure your API never disrupts your user experience or causes a high severity incident. One of the ways to help accomplish this is to run an API GameDay with controlled chaos experiments.
* If you are unfamiliar with GameDays, they are like fire drills: you practice a potentially dangerous scenario in a safe environment to proactively identify weaknesses. To learn more, read our Introduction to GameDays and our guide on How to run a GameDay.
Example API GameDay Infrastructure: The MyStatus App
Let’s say we were running experiments on a status update sharing application called “MyStatus”. The MyStatus infrastructure is composed of an API Gateway (e.g. Open Source Kong), Memcached for caching, and MySQL for the database. This is demonstrated in the diagram below:
The best-case scenario is that when your instances are impacted by a chaos experiment, they are either able to handle the stress or are automatically removed from your fleet. After they are removed, they would be automatically replaced with fresh hosts; bringing up a fresh host is safer than rebooting an impacted one.
Install the Gremlin agent on your memcached instances.
Round 1: Small Blast Radius Chaos Engineering Experiment
- Send a large number of read requests (e.g. 1000) and verify that our system performs as expected and does not drop below SLA.
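The read-load check in Round 1 can be sketched in a few lines of Python. This is a minimal sketch, not Gremlin tooling: `run_read_load`, `percentile`, and `meets_sla` are hypothetical helper names, the 99th percentile is an assumed way of expressing the SLA, and `send_read` stands in for whatever call issues one read against your API.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def run_read_load(send_read: Callable[[], None],
                  total_requests: int = 1000,
                  concurrency: int = 50) -> Tuple[List[float], int]:
    """Call send_read total_requests times; return (latencies in seconds, error count)."""

    def timed_call(_: int) -> Tuple[float, bool]:
        start = time.perf_counter()
        try:
            send_read()
            ok = True
        except Exception:
            # During a chaos experiment, failed requests are data, not a reason to stop.
            ok = False
        return time.perf_counter() - start, ok

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = [latency for latency, _ in results]
    errors = sum(1 for _, ok in results if not ok)
    return latencies, errors


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_sla(latencies: List[float], sla_seconds: float, pct: float = 99.0) -> bool:
    """True if the chosen latency percentile is within the SLA threshold."""
    return percentile(latencies, pct) <= sla_seconds
```

To point this at a real service, pass a function that performs one GET against your gateway (for example with `urllib.request.urlopen` and a timeout), then compare `meets_sla(latencies, sla_seconds=0.5)` before and during an experiment. The 0.5 second threshold is only a placeholder; use the figure from your own SLA.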
Round 2: Medium Blast Radius Chaos Engineering Experiment
- Send a large number of read requests (e.g. 1000) and verify that our system performs as expected and does not drop below SLA.
- Run a memory attack using Gremlin at the same time as we trigger the large number of API requests.
Round 3: Large Blast Radius Chaos Engineering Experiment
- Send a large number of read requests (e.g. 1000) and verify that our system performs as expected and does not drop below SLA.
- Send a large number of write requests (e.g. 1000) and verify that our system performs as expected and does not drop below SLA.
- Kill a cache instance using Gremlin at the same time as we trigger the large number of API requests.
Additional Chaos Engineering Experiments
Network gremlins also allow you to see the impact of lost or delayed traffic to your application. You can test how your service behaves when you can’t reach one of your external dependencies.
Understanding how your system behaves if Memcached becomes overloaded will give you critical insight into your infrastructure. If Memcached crashes, how does this impact your SLA and database reliability? Does your database crash? Does it fail over?
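One way to reason about those questions is to look at what a cache-aside read path does when the cache disappears. The sketch below is an illustration under assumptions, not the MyStatus implementation: `StatusReader`, `CacheUnavailable`, and the injected `cache_get`, `cache_set`, and `db_get` callables are all hypothetical stand-ins for your own cache and database clients. The counters make the extra database load visible during an experiment.

```python
from typing import Callable, Optional


class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client when the cache cannot be reached."""


class StatusReader:
    """Cache-aside reader that falls through to the database if the cache fails."""

    def __init__(self,
                 cache_get: Callable[[str], Optional[str]],
                 cache_set: Callable[[str, str], None],
                 db_get: Callable[[str], Optional[str]]) -> None:
        self._cache_get = cache_get
        self._cache_set = cache_set
        self._db_get = db_get
        self.db_reads = 0       # how many reads reached the database
        self.cache_errors = 0   # how many cache calls failed

    def get_status(self, user_id: str) -> Optional[str]:
        try:
            cached = self._cache_get(user_id)
        except CacheUnavailable:
            self.cache_errors += 1
            cached = None  # treat a broken cache as a miss

        if cached is not None:
            return cached

        # Cache miss or cache outage: the database absorbs the read.
        self.db_reads += 1
        value = self._db_get(user_id)

        if value is not None:
            try:
                self._cache_set(user_id, value)
            except CacheUnavailable:
                self.cache_errors += 1
        return value
```

With a healthy cache, repeated reads of the same key reach the database once. With the cache down, every read reaches the database, which is exactly the load multiplication a cache-kill experiment is meant to expose.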
Preparing For Failures In 3rd Party APIs
After you’ve tested and confirmed the resiliency of your own API, the next step is ensuring you are prepared for what happens when 3rd party APIs fail. An outage of a 3rd party API can still affect customer experience, so it’s essential to have a plan for these outages as well. There have been a number of API outages that caused some well-known websites and applications to go down:
- The Facebook and Instagram API servers went down for an hour. The outage also impacted a number of well-known websites including Tinder and HipChat.
- Amazon Web Services (AWS) experienced a disruption that caused an increase in faults for the EC2 Auto Scaling APIs.
- Twitter experienced a major one-hour outage that impacted websites and applications using Twitter APIs.
API reliance is only going to increase, so testing the resilience of these is a key step to ensuring the resilience of your systems. If you have an application that is dependent on external APIs to perform a critical function, you need to have a plan for dealing with disruptions. API virtualization, synthetic / real-user monitoring, asynchronous scripting and caching are all common ways to mitigate failures. But how do you know the fallbacks you’ve put in place actually work in a real-world scenario if you’ve never tested them?
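As one illustration of the caching approach, here is a minimal sketch of a wrapper that serves the last known good response when a third-party call fails. Everything in it is an assumption made for the example: `StaleFallback`, its `max_stale_seconds` limit, and the injected `fetch` callable are hypothetical names, not part of any Gremlin or vendor API. Because the failure is injected through `fetch`, the fallback can be exercised in a test long before a real outage does it for you.

```python
import time
from typing import Any, Callable, Dict, Tuple


class StaleFallback:
    """Wrap a third-party fetch; on failure, serve the last good value if it is recent enough."""

    def __init__(self,
                 fetch: Callable[[str], Any],
                 max_stale_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._last_good: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str) -> Tuple[Any, bool]:
        """Return (value, served_from_fallback)."""
        try:
            value = self._fetch(key)
        except Exception:
            entry = self._last_good.get(key)
            if entry is None:
                raise  # nothing cached yet, so surface the original error
            cached_value, stored_at = entry
            if self._clock() - stored_at > self._max_stale:
                raise  # cached copy is too old to be trusted
            return cached_value, True

        # Successful call: remember the value for future outages.
        self._last_good[key] = (value, self._clock())
        return value, False
```

The boolean in the return value lets callers log or display that they are showing stale data, which is also a useful signal to watch on your dashboards while a network experiment blocks the dependency.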
In Conclusion
Running chaos experiments on a consistent basis is one of many things you can do to begin measuring the resiliency of your APIs. Making sure you have good visibility (monitoring) and increasing your fallback coverage will also help strengthen your own systems. But don’t stop there: with the number of connected devices and application ecosystems growing rapidly, it is more important than ever to safeguard applications from internal and external third-party API outages and errors.
Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30 day trial.