How to measure Kubernetes cluster reliability

Measure and report the reliability of Kubernetes with resilience testing and risk monitoring

Editor’s Note: Metrics and reporting are only one part of the Kubernetes Resiliency Framework. Find out more about the framework in The Ultimate Guide to Kubernetes High Availability.

——

Why it’s important to track Kubernetes cluster reliability

Ask executives and engineers at any organization, and they’ll all agree that the availability and resiliency of their Kubernetes clusters could be improved. But when you ask them how, the discussion usually breaks down.

In this article, we’ll dig into how you can create metrics and reporting dashboards that show how reliable your Kubernetes clusters are, and that give your teams clear actions to improve the resilience and availability of those clusters.

Let’s start by looking at the benefits of having clear metrics and dashboards:

1. Identify Kubernetes reliability risks that need to be addressed

Most organizations lack a consistent, agreed-upon method for identifying reliability risks that can be shared and understood across their teams. It’s not that the information isn’t out there—almost every engineer knows the common ways their service will fail—it’s that there’s no centralized way for all reliability risks and potential failure points to be cataloged, tested for, and compared between services.

Tracking resilience tests gives you that central alignment. When you track the results over time, individual teams can show exactly what risks are and aren’t present in their services, taking that knowledge out of the engineers’ heads and putting it into a place where the entire organization can benefit from it.

2. Prove the results of Kubernetes engineering efforts

Reliability tracking also provides a framework to prove the effectiveness of a team’s efforts. Without a framework, a well-intentioned engineer could spend hours addressing an issue that they know could lead to an outage, but end up with little to no recognition or acknowledgement of their efforts. This is because they’re attempting to prove a negative. Yes, they prevented an outage, but how can they show that they stopped an outage that didn’t happen or fixed an issue that’s no longer there?

By tracking reliability risks over time, engineers and operators can show the effectiveness of their efforts by pointing to the test that previously failed but now passes, proving that the risk is no longer present.

3. Create a common Kubernetes reliability metric across the organization

Finally, tracking reliability risks creates a metric that can be used to track reliability over time across the organization. This is where standards and testing come together to produce actionable organizational alignment. By laying out the standardized test suites everyone should follow, you create a list of reliability risks everyone should track.

Over time, this gives the entire team a common metric to align around and an accurate picture of the reliability posture of their entire Kubernetes system.

How to use resilience testing to create reliability scores and dashboards

Tracking the results of resiliency tests makes it possible for each Kubernetes service to be given a reliability score. These scores, in turn, can create dashboards where the scores of all Kubernetes services are rolled up for review and alignment, thus creating a view of the entire deployment’s reliability posture.

How reliability scoring works

The current status of every resiliency test falls into one of three results:

Passed - The deployment performed as expected and no reliability risk exists.
Failed - The deployment did not perform as expected, and a reliability risk is known to exist.
Not run - The test hasn’t been run recently enough to be certain of the result. A known reliability risk may or may not exist—which is, in itself, a reliability risk.

When you’re looking at a service’s reliability posture, you’re only concerned about whether a reliability risk is present. If a risk is present, then you need to evaluate whether engineering time and effort should be spent resolving the risk. If not, then you can count on your system to be resilient in that area without further engineering effort.

By looking at it this way, test results can be pooled into a binary state: each passed test scores one point (no reliability risk present), and each failed or not-run test scores zero (known or possible reliability risk present).

*Example of how scores can be computed in Gremlin*

When we compile the results of an entire suite of tests, a score is created.
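
As a rough illustration of the binary scoring described above, here is a minimal Python sketch. It is not Gremlin's implementation: the `Status` enum, the `reliability_score` function, the test names, and the choice to express the score as a percentage of the suite are all assumptions made for this example.

```python
from enum import Enum


class Status(Enum):
    """The three possible current results of a resiliency test."""
    PASSED = "passed"
    FAILED = "failed"
    NOT_RUN = "not_run"


def reliability_score(results: dict[str, Status]) -> float:
    """Score one point per passed test and zero for failed or not-run tests.

    Returns the points as a percentage of the suite size (an assumed
    convention for this sketch). An empty suite scores 0.0.
    """
    if not results:
        return 0.0
    points = sum(1 for status in results.values() if status is Status.PASSED)
    return 100.0 * points / len(results)


# Hypothetical test suite for a single service.
suite = {
    "cpu_scalability": Status.PASSED,
    "zone_redundancy": Status.FAILED,
    "dependency_latency": Status.NOT_RUN,
    "host_redundancy": Status.PASSED,
}

print(reliability_score(suite))  # 50.0
```

Treating a not-run test exactly like a failed one keeps the score conservative: a service cannot raise its score by skipping tests.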

How regular Kubernetes resilience testing creates a metric of scores

When you run a series of tests to build a reliability score, this creates a numeric data point that shows the reliability posture of your Kubernetes deployment at a specific time.

By regularly running resiliency test suites, you create a metric of your reliability posture over time. Like any metric, this can be plotted to show trends, and each data point can be drilled into to see the individual test results.

*Example of how scores can be graphed in Gremlin*

Every organization will have different requirements, and your standards owner should set your specific testing cadence, but a good goal is to work towards weekly testing of production systems. A weekly cadence gives you an accurate view that will always be recent enough to be considered current, and by testing in production, you’ll be getting an accurate view of your actual Kubernetes deployment under real-world conditions.
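
One way to tie the testing cadence back to the "Not run" state is to treat any result older than the cadence as stale. The sketch below assumes a seven-day window to match the weekly goal above. The `MAX_AGE` constant and the `current_status` function are hypothetical names for illustration, and your standards owner may choose a different window.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed freshness window, matching a weekly testing cadence.
MAX_AGE = timedelta(days=7)


def current_status(
    last_passed: Optional[bool],
    last_run: Optional[datetime],
    now: datetime,
) -> str:
    """Return 'passed', 'failed', or 'not_run' for a single test.

    A test that has never run, or whose last run is older than MAX_AGE,
    is reported as 'not_run' regardless of its last result.
    """
    if last_passed is None or last_run is None or now - last_run > MAX_AGE:
        return "not_run"
    return "passed" if last_passed else "failed"


now = datetime(2024, 6, 15)
print(current_status(True, datetime(2024, 6, 12), now))   # passed
print(current_status(True, datetime(2024, 6, 1), now))    # not_run
print(current_status(False, datetime(2024, 6, 14), now))  # failed
```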

Create dashboards for alignment and reporting

By combining reliability scores with regular testing, you create reliability metrics. So the next step is to create a system for reporting those metrics with dashboards.

*Example of a dashboard for multiple services in Gremlin*

These dashboards should be used in regular reliability alignment meetings (or as part of existing engineering review meetings) for the entire team—including leadership—to review the current reliability posture of your Kubernetes deployment.

The goal with these dashboards isn’t to assign blame or point out failures. Instead, they should be used to plan engineering work and applaud successes. For example, if a team shipped a new feature and their reliability score decreased, this might be expected with the large amount of new code added to the system. The decrease in score then shows the team that time should be spent ensuring reliability of the new feature before moving on to the next one. At the same time, if they come back two weeks later and the score has increased, then they should be celebrated for how much they improved the new feature’s reliability.
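
To make the review concrete, a dashboard can show each service's latest score next to its change since the previous test run, so both drops and improvements are visible. The following is a minimal sketch under assumed inputs: the `dashboard_rows` function, the service names, and the score histories are all made up for illustration.

```python
def dashboard_rows(history: dict[str, list[float]]) -> list[tuple[str, float, float]]:
    """Build (service, latest score, change since previous run) rows.

    `history` maps a service name to its scores in chronological order.
    Services with no scores are skipped, and a service with a single
    score has a change of 0.0. Rows are sorted lowest score first.
    """
    rows = []
    for service, scores in history.items():
        if not scores:
            continue
        latest = scores[-1]
        change = latest - scores[-2] if len(scores) > 1 else 0.0
        rows.append((service, latest, change))
    return sorted(rows, key=lambda row: row[1])


# Hypothetical score history per service, oldest run first.
history = {
    "checkout": [80.0, 60.0, 90.0],
    "search": [70.0, 70.0],
    "payments": [50.0],
}

for service, latest, change in dashboard_rows(history):
    print(f"{service:10s} {latest:5.1f} {change:+6.1f}")
```

Sorting by lowest score first is just one design choice. It puts the services that most need engineering time at the top of the review, while the change column makes improvements easy to call out.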

How to track automatically detected Kubernetes reliability risks

The nature of Kubernetes cluster and node configurations makes it possible to continuously scan and monitor for known critical Kubernetes reliability risks, such as misconfigurations that would disable autoscaling. These risks and how to monitor for them are discussed in more detail below, but they also create their own reliability metric unique to Kubernetes.

*A sample team risk report from Gremlin*

As with reliability scores, these detected risks can be broken down into a binary metric: either the risk is present or it isn’t. Tracking the detection of these risks over time likewise creates a reliability metric. These scanned metrics should be reviewed in the same regular alignment meetings as the metrics gained from testing, then used to show when teams have successfully addressed the risks to make the systems more reliable.
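
The same binary treatment can be sketched for scanned risks. In the example below, each scan is a mapping from a risk name to whether that risk was detected. The risk names, the scan dates, and the helper functions `risks_present` and `resolved_between` are hypothetical and do not reflect the output of any particular scanner.

```python
from datetime import date


def risks_present(scan: dict[str, bool]) -> int:
    """Count the risks detected in one scan (True means the risk is present)."""
    return sum(1 for present in scan.values() if present)


def resolved_between(earlier: dict[str, bool], later: dict[str, bool]) -> list[str]:
    """List risks that were present in the earlier scan and are absent in the later one.

    A risk missing from the later scan is not counted as resolved, because
    its state is unknown.
    """
    return sorted(
        name
        for name, present in earlier.items()
        if present and name in later and not later[name]
    )


# Hypothetical weekly scan results for one team's services.
scans = {
    date(2024, 6, 1): {
        "missing_cpu_requests": True,
        "missing_liveness_probe": True,
        "single_availability_zone": False,
    },
    date(2024, 6, 8): {
        "missing_cpu_requests": False,
        "missing_liveness_probe": True,
        "single_availability_zone": False,
    },
}

for day in sorted(scans):
    print(day.isoformat(), risks_present(scans[day]))

print(resolved_between(scans[date(2024, 6, 1)], scans[date(2024, 6, 8)]))
```

The per-scan count is the data point to plot over time, and the resolved list is what a team can point to in an alignment meeting to show that a risk has been addressed.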


Next steps in your Kubernetes high availability journey:

Find out how to monitor Kubernetes reliability risks
Learn about resilience testing for Kubernetes clusters
Get started with Chaos Engineering on Kubernetes
Start your free 30-day Gremlin trial

Download the comprehensive eBook

Learn how to build your own resiliency management practice for Kubernetes in the 55-page guide Kubernetes Reliability at Scale: How to Improve Uptime with Resiliency Management.
