Intelligent Health Checks: one-click observability for reliability tests
Reliability testing and observability are similar in one important way: engineering teams know they should be doing both, but they’re not sure how to start, or they don’t have the right resources, or they need to focus on competing priorities like feature development and incident response.
In an ideal world, reliability and observability would be automated processes that configure, monitor, and run themselves. Imagine being able to deploy a new host or service to your production environment and have it automatically decide which metrics to monitor, which reliability tests to run, and when to stop a test if it exceeds certain thresholds.
That’s what led us to create Intelligent Health Checks. With Intelligent Health Checks, simply click a checkbox, and Gremlin creates a full set of Health Checks that can be used to determine service health during reliability tests—no third-party observability tools required.
In this blog post, we’ll explain how Intelligent Health Checks work, how they automate reliability testing, and how you can get up and running with Intelligent Health Checks in just a few minutes.
What are Health Checks, and what makes them “Intelligent”?
Health Checks are automated, periodic checks of a metric or HTTP endpoint. In the context of reliability testing and Chaos Engineering, they serve two key functions:
- Determine a service’s baseline performance and behavior.
- Monitor a service during a test, and stop the test if the metric goes outside acceptable thresholds.
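As a rough illustration of the second function, the sketch below evaluates a set of checks before each step of a test and halts at the first failure. The `HealthCheck` class, `run_test_with_checks` function, and step names are hypothetical and are not part of Gremlin's API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class HealthCheck:
    """Hypothetical Health Check: a metric reader plus an acceptable range."""
    name: str
    read_metric: Callable[[], float]  # returns the latest metric value
    lower: float
    upper: float

    def is_healthy(self) -> bool:
        value = self.read_metric()
        return self.lower <= value <= self.upper


def run_test_with_checks(test_steps: Sequence[str], checks: Sequence[HealthCheck]) -> str:
    """Evaluate every check before each step and halt at the first failure."""
    for step in test_steps:
        for check in checks:
            if not check.is_healthy():
                return f"halted before '{step}': {check.name} outside thresholds"
        # A real test would inject the fault for this step here.
    return "completed"


if __name__ == "__main__":
    # Simulated latency readings: the third one breaches the 500 ms ceiling.
    latencies = iter([120.0, 135.0, 900.0])
    check = HealthCheck("latency_ms", lambda: next(latencies), lower=0.0, upper=500.0)
    print(run_test_with_checks(["add latency", "drop traffic", "kill node"], [check]))
```

In this simulation the test halts before its third step, because the third latency reading falls outside the acceptable range.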
Prior to the introduction of Intelligent Health Checks, you’d need to create Health Checks yourself by connecting them to metrics in your observability tool (Gremlin natively integrates with several, including Datadog, AppDynamics, and PagerDuty). As you run reliability tests, Health Checks query your metrics to determine whether your service is still healthy. If the service becomes slow or unavailable because of the test, the Health Check will flag this, and Gremlin will stop the test. Setting up Health Checks manually raises a lot of questions, such as:
- What metrics should I use?
- What thresholds should I set?
- Should I use my existing metrics and alarms, or do I need to create new ones?
- How can I automate this process so I don’t need to set up new metrics and Health Checks for every service?
Intelligent Health Checks remove this uncertainty. When you click the checkbox to enable Intelligent Health Checks, Gremlin finds the relevant metrics in your cloud platform, watches them to get a baseline profile of your service, then uses this baseline to set reasonable failure thresholds. Gremlin treats Intelligent Health Checks just like any other Health Check: it automatically monitors them when running reliability tests, and it will stop an actively running test if any check exceeds its threshold.
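This post doesn't describe how Gremlin turns a baseline into thresholds, so the following is only a sketch of one common approach: treat the baseline mean plus or minus a few standard deviations as the acceptable range. The function name and the `tolerance` parameter are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def thresholds_from_baseline(samples: Sequence[float], tolerance: float = 3.0) -> Tuple[float, float]:
    """Return (lower, upper) bounds as mean +/- tolerance * standard deviation."""
    if len(samples) < 2:
        raise ValueError("need at least two baseline samples")
    center = mean(samples)
    spread = pstdev(samples)  # population standard deviation of the baseline
    lower = max(0.0, center - tolerance * spread)  # metrics such as latency cannot be negative
    upper = center + tolerance * spread
    return lower, upper


if __name__ == "__main__":
    baseline_latency_ms = [100.0, 110.0, 90.0, 100.0]
    low, high = thresholds_from_baseline(baseline_latency_ms)
    print(f"acceptable latency range: {low:.1f} ms to {high:.1f} ms")
```

A wider `tolerance` makes the check less sensitive to normal variation, at the cost of reacting later to a real degradation.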
Which metrics do Intelligent Health Checks use?
Intelligent Health Checks track three metrics: error rate, latency, and request rate. If you’ve read Google’s Site Reliability Engineering book, you might recognize these as three of the four Golden Signals (errors, latency, and traffic; the fourth is saturation). We chose these three because they most accurately represent the end-user experience, while also being common to most cloud-based services. An increase in latency or error rate, or a sharp change in request rate, is often a clear indicator of a problem no matter what service you’re running.
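To make the three signals concrete, here is a minimal sketch that derives them from a window of request records. The `Request` type and `summarize` function are hypothetical, and counting only 5xx responses as errors is an assumption rather than Gremlin's stated definition.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Request:
    status: int         # HTTP status code returned to the client
    duration_ms: float  # time taken to serve the request


def summarize(requests: Sequence[Request], window_seconds: float) -> Dict[str, float]:
    """Compute error rate, average latency, and request rate for one time window."""
    total = len(requests)
    if total == 0:
        return {"error_rate": 0.0, "avg_latency_ms": 0.0, "requests_per_second": 0.0}
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx counts as an error
    return {
        "error_rate": errors / total,
        "avg_latency_ms": sum(r.duration_ms for r in requests) / total,
        "requests_per_second": total / window_seconds,
    }


if __name__ == "__main__":
    window = [Request(200, 100.0), Request(200, 200.0), Request(500, 300.0), Request(200, 400.0)]
    print(summarize(window, window_seconds=2.0))
```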
How do Intelligent Health Checks work?
Intelligent Health Checks are a key feature of Gremlin for AWS. When you authenticate Gremlin to your AWS environment, Gremlin automatically detects your Elastic Load Balancers (ELBs). If you have the Gremlin agent installed on the systems where your services are running—EC2 instances and EKS clusters specifically—Gremlin maps your ELBs to the services running on your hosts. This is how we can identify which services are running in your environment.
Next, when you view an ELB-based service in Gremlin and open its Health Check settings, you’ll have the option to use Intelligent Health Checks. All you need to do is click the checkbox. Gremlin will find the relevant metrics for the ELB in CloudWatch, observe their current and past values, and create a Health Check with reasonable thresholds. These metrics are already available in AWS, so Gremlin only needs read-only permission to view them. During a reliability test, if any of the three metrics strays too far from the baseline, the Health Check will report as “failed” and Gremlin will stop the test. If the test runs to completion without any metric crossing its threshold, the Health Check reports as successful.
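The post doesn't say which CloudWatch metrics Gremlin reads. As an illustration only, the sketch below builds read-only query parameters for three metrics that CloudWatch publishes for Application Load Balancers in the `AWS/ApplicationELB` namespace. The mapping of signals to metrics, the `build_queries` function, and the example load balancer ID are all assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

# Assumed mapping from each signal to a (CloudWatch metric name, statistic) pair.
ALB_METRICS = {
    "request_rate": ("RequestCount", "Sum"),
    "latency": ("TargetResponseTime", "Average"),
    "errors": ("HTTPCode_Target_5XX_Count", "Sum"),
}


def build_queries(load_balancer: str, end: datetime,
                  lookback: timedelta = timedelta(hours=1)) -> Dict[str, dict]:
    """Build one set of query parameters per signal, at one-minute resolution."""
    queries = {}
    for signal, (metric_name, statistic) in ALB_METRICS.items():
        queries[signal] = {
            "Namespace": "AWS/ApplicationELB",
            "MetricName": metric_name,
            "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
            "StartTime": end - lookback,
            "EndTime": end,
            "Period": 60,  # seconds per datapoint
            "Statistics": [statistic],
        }
    return queries


if __name__ == "__main__":
    # Hypothetical load balancer identifier, used only for illustration.
    now = datetime.now(timezone.utc)
    for signal, query in build_queries("app/example-lb/0123456789abcdef", now).items():
        print(signal, query["MetricName"], query["Statistics"])
```

Each dictionary could be passed to a boto3 CloudWatch client as `get_metric_statistics(**query)`. That call only reads data, which fits the read-only permission described above.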
Intelligent Health Checks don’t necessarily replace your existing Health Checks. In fact, you can combine them to give your services as much coverage as you want. If you decide you don’t want to use Intelligent Health Checks (or you want to replace them with your own), you can remove them simply by unchecking the box. You can re-enable them at any time without impacting your test scores or services.
How to get started with Intelligent Health Checks
There are three prerequisites for using Intelligent Health Checks on Gremlin:
- Your service must be hosted on AWS.
- You must have an ELB directing traffic to your service.
- You must have the Gremlin agent running on the same host as your service. Gremlin supports EC2, EKS, and EC2-backed Fargate.
If you don’t yet have a Gremlin account, you can sign up for a free 30-day trial. Full instructions are available in our AWS Quick Start Guide. And if you want to learn more about Gremlin’s AWS support, you can read our blog on Gremlin for AWS.