There’s more pressure than ever to deliver high-availability Kubernetes systems, but a combination of organizational and technological hurdles makes this easier said than done.

Technologically, Kubernetes is complex and ephemeral, with deployments that span the infrastructure, cluster, node, and pod layers. And as with any complex, ephemeral system, the large number of constantly changing parts opens the door to sudden, unexpected failures.

At the same time, enterprise Kubernetes deployments involve a number of different teams, each working on its own services. Variations in best practices, configurations, and deployment methods can create an inconsistent resiliency posture between services, which, due to the interconnected nature of microservices architectures, lowers the resiliency of your entire Kubernetes deployment.

To help address these technological and organizational issues, we developed a framework for improving Kubernetes resiliency at scale based on the application and testing of shared standards. Combining organizational standards with resiliency testing and reliability risk monitoring, it gives you a core set of best practices for creating Kubernetes resilience standards and verifying them across your organization, resulting in a stronger resiliency posture.

[Image: The framework for Kubernetes resiliency, consisting of organizational and deployment-specific standards, metrics and reporting, risk monitoring and mitigation, and validation test suites.]

The framework consists of four distinct sections. In this blog post, we’ll briefly go over each one at a high level. If you want to learn more, each section has its own chapter in the eBook Kubernetes Reliability at Scale, a comprehensive guide that shows you how to improve uptime with resiliency management.

Resiliency standards

Some reliability risks are common to all Kubernetes deployments. For example, every Kubernetes deployment should be tested for how it responds to a surge in demand for resources, a drop in network communications, or a loss of connection to dependencies.

These are recorded under Organizational Standards, which define the standard set of reliability risks that every team should test against. While you should start with common reliability risks, this list should expand to include risks that are unique to your company but shared across your organization. For example, if every service connects to a specific database, then you should standardize around testing what happens if there’s latency in that connection, or if the database suddenly becomes unavailable.
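To make this concrete, organizational standards can be captured in a declarative file that your test tooling consumes. The sketch below uses a hypothetical schema of our own invention (the field names, fault types, and the shared-db.internal hostname are illustrations, not any particular tool’s format):

```yaml
# Hypothetical organizational standards file -- illustrative schema only.
# Every team runs these baseline resiliency tests against its services.
organizationalStandards:
  - name: cpu-surge
    description: Service stays responsive during a surge in CPU demand
    fault: { type: resource, resource: cpu, utilizationPercent: 90, durationSeconds: 300 }
    passCriteria: "p99 latency stays within SLO and no pods restart"
  - name: network-loss
    description: Service degrades gracefully when network communication drops
    fault: { type: network, effect: blackhole, durationSeconds: 60 }
    passCriteria: "Requests fail fast with clear errors and recover within 30s"
  - name: shared-database-latency
    description: Company-specific risk; every service uses the shared database
    fault: { type: network, effect: latency, target: shared-db.internal, delayMs: 200 }
    passCriteria: "Timeouts and retries prevent cascading failures"
```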

Deployment-Specific Standards are deviations from the core Organizational Standards for specific services or deployments. These standards can be stricter or looser than the organizational standards, but either way, they’re exceptions that should be recorded. For example, an internal sales tool might have a higher latency tolerance for connecting to a database because your team is more willing to wait, while a customer-facing tool might need to access a faster database replica to avoid losing sales.
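Continuing the sketch above, a deployment-specific exception could be expressed as an override on the baseline, so the deviation is recorded rather than silently applied (again, a hypothetical format):

```yaml
# Hypothetical deployment-specific override for the internal sales tool.
# It loosens the shared-database latency standard and records why.
service: internal-sales-tool
overrides:
  - standard: shared-database-latency
    fault: { delayMs: 1000 }   # tolerate up to 1s of added latency
    reason: "Internal users can wait; availability matters more than speed"
    approvedBy: platform-team
```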

When defining standards, you should also consider the causes of previous outages or failure modes your deployment has been sensitive to in the past. Depending on the nature of the failure, these could fall under Deployment-Specific Standards (if the failure modes only affect a handful of key services) or Organizational Standards (if the failure modes could affect all services). 

Metrics and reporting

Reliability is often measured by either the binary “currently up/currently down” status or the backward-looking “uptime vs. downtime” metric. But neither of these measurements helps you see your deployment’s reliability posture before you experience incidents and outages, or whether that posture is improving over time.

This is why it’s essential to have metrics, reporting, and dashboards that show the results of your resiliency tests and risk monitoring. These dashboards give your teams a core set of data to align around and be accountable for. By showing how each service performs on tests built against the defined resiliency standards, you get an accurate view of your reliability posture that can inform important prioritization conversations.

Risk monitoring and mitigation

Some Kubernetes risks, such as missing memory limits, can be quick and easy to fix, but can also cause massive outages if left unaddressed. The complexity of Kubernetes makes these issues easy to miss. Fortunately, because these reliability risks are well known and common across Kubernetes deployments, their detection can be operationalized.
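As an example, the missing-memory-limits risk is fixed with a few lines in the pod spec. The manifest below is a minimal sketch; the service name, image, and resource values are placeholders, and the right numbers depend on your workload:

```yaml
# Container spec with explicit resource requests and a memory limit.
# Without a memory limit, one leaking container can exhaust its node's
# memory and take down unrelated pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout                 # placeholder service name
spec:
  replicas: 3
  selector:
    matchLabels: { app: checkout }
  template:
    metadata:
      labels: { app: checkout }
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.4.2   # placeholder image
          resources:
            requests: { cpu: 250m, memory: 256Mi }
            limits: { memory: 512Mi }
```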

Many of these critical risks can be found by scanning configuration files and container statuses. These scans should run continuously on your Kubernetes deployments so risks can be surfaced and addressed quickly, especially for frequently updated deployments. When you use Gremlin, the Detected Risks feature continuously monitors for these Kubernetes issues and gives you reports that show which services have which risks.

Validation testing using standardized test suites

Resiliency testing uses Fault Injection to safely create fault conditions in your deployment, such as a spike in CPU demand or a drop in network connectivity to key dependencies, so you can verify that your systems respond the way you expect them to.

Using the standards from the first part of the framework, you can create suites of reliability tests and run them automatically. This validation testing approach uncovers places where your systems aren’t meeting standards, and the pass/fail data can be used to create metrics that show how your reliability posture changes over time.
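As a sketch of what running these suites automatically can look like, here’s a scheduled CI job in GitHub Actions syntax. The run-resiliency-suite and publish-results scripts are hypothetical placeholders for whatever test runner and reporting pipeline you use:

```yaml
# Hypothetical scheduled CI job that runs the standard resiliency suite.
# 'run-resiliency-suite' and 'publish-results' are placeholders for your
# actual test runner and reporting pipeline.
name: weekly-resiliency-validation
on:
  schedule:
    - cron: "0 6 * * 1"   # every Monday at 06:00 UTC
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run the organizational test suite
        run: ./scripts/run-resiliency-suite --standards standards/organizational.yaml
      - name: Publish pass/fail results for reliability dashboards
        run: ./scripts/publish-results results.json
```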

When you’re starting with Gremlin, every service will have access to a core suite of validation tests based on reliability best practices. We recommend starting with these, then customizing your test suites over time as you further define your standards and cover previous incidents.

Next steps: Improve uptime with Kubernetes Resiliency Management

Improving Kubernetes resiliency doesn’t have to be a multi-year project. By embracing standards backed by automated monitoring and resiliency testing, you can uncover reliability risks across your entire Kubernetes deployment with very little lift from individual teams.

And by finding these risks early, you can mitigate them on your own schedule, before they cause incidents, outages, or customer-impacting downtime.

Find out more by reading the eBook Kubernetes Reliability at Scale: How to Improve Uptime with Resiliency Management, including a 30-day plan for improving your Kubernetes resiliency and proving your results.

Gavin Cahill
Sr. Content Manager