How to standardize resiliency on Kubernetes
There’s more pressure than ever to deliver high-availability Kubernetes systems, but a combination of organizational and technological hurdles makes this easier said than done.
Technologically, Kubernetes is complex and ephemeral, with deployments that span infrastructure, cluster, node, and pod layers. And as with any complex, ephemeral system, the large number of constantly changing parts opens the door to sudden, unexpected failures.
At the same time, enterprise Kubernetes deployments involve a number of different teams all working together on their own services. Variations between best practices, configurations, and deployment methods can create an inconsistent resiliency posture between services, which, due to the interconnected nature of microservices architecture, will lower the resiliency of your entire Kubernetes deployment.
To help address these technological and organizational issues, we developed a framework for improving Kubernetes resiliency at scale based on the application and testing of shared standards. Combining organizational standards with resiliency testing and reliability risk monitoring, it gives you a core set of best practices for creating Kubernetes resilience standards and verifying them across your organization, resulting in a stronger resiliency posture.
The framework consists of four distinct sections. In this blog post, we’ll briefly go over each one on a high level. If you want to learn more, each section has its own chapter in the eBook Kubernetes Reliability at Scale, a comprehensive guide that shows you how to improve uptime with resiliency management.
Resiliency standards
Some reliability risks are common to all Kubernetes deployments. For example, every Kubernetes deployment should be tested for how it responds to a surge in resource demand, a drop in network communications, or a loss of connection to dependencies.
These are recorded under Organizational Standards, which inform the standard set of reliability risks that every team should test against. While you should start with common reliability risks, this list should expand to include risks unique to your company that are common across your organization. For example, if every service connects to a specific database, then you should standardize around testing what happens if there’s latency in that connection, or if the database suddenly becomes unavailable.
Deployment-Specific Standards are deviations from the core Organizational Standards for specific services or deployments. The standards can be more strict or loose than organizational standards, but either way, they’re exceptions that should be noted. For example, an internal sales tool might have a higher latency tolerance for connecting to a database because your team is more willing to wait, while a customer-facing tool might need to access a faster database replica to avoid losing sales.
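One way to make this concrete is to encode standards as data that tooling can consume, with deployment-specific exceptions layered over the organizational defaults. The sketch below is a hypothetical illustration, not Gremlin's format; the standard names, services, and threshold values are all invented.

```python
# Hypothetical encoding of resiliency standards as data.
# Organizational standards apply to every service by default.
ORG_STANDARDS = {
    "db_latency_ms": 100,      # max tolerated latency to the shared database
    "cpu_surge_pct": 80,       # CPU demand spike the service must survive
    "dependency_loss": True,   # must degrade gracefully if a dependency drops
}

# Deployment-specific standards: explicit, documented exceptions.
OVERRIDES = {
    "internal-sales-tool": {"db_latency_ms": 500},  # looser: users will wait
    "checkout-api": {"db_latency_ms": 50},          # stricter: latency loses sales
}

def standards_for(service: str) -> dict:
    """Merge org-wide defaults with any per-deployment exceptions."""
    return {**ORG_STANDARDS, **OVERRIDES.get(service, {})}

print(standards_for("checkout-api")["db_latency_ms"])  # stricter than the default
print(standards_for("billing-api")["db_latency_ms"])   # falls back to the org default
```

Keeping exceptions in a separate override table, rather than editing the defaults per service, preserves the framework's intent: deviations stay visible and documented instead of silently replacing the organizational baseline.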
When defining standards, you should also consider the causes of previous outages or failure modes your deployment has been sensitive to in the past. Depending on the nature of the failure, these could fall under Deployment-Specific Standards (if the failure modes only affect a handful of key services) or Organizational Standards (if the failure modes could affect all services).
Metrics and reporting
Reliability is often measured by either the binary “currently up/currently down” status or the backward-looking “uptime vs. downtime” metric. But neither measurement shows your deployment’s reliability posture before you experience incidents and outages, or whether that posture is improving over time.
This is why it’s essential to have metrics, reporting, and dashboards that show the results of your resiliency tests and risk monitoring. These dashboards give the various teams core data to align around and be accountable for results. By showing how each service performs on tests built against the defined resiliency standards, you get an accurate view of your reliability posture that can inform important prioritization conversations.
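As a minimal sketch of the kind of number such a dashboard might trend, the snippet below aggregates per-service pass rates from resiliency test results. The services, test names, and results are invented for illustration.

```python
from collections import defaultdict

# Hypothetical test results: (service, test, passed)
RESULTS = [
    ("checkout-api", "db-latency", True),
    ("checkout-api", "cpu-surge", False),
    ("billing-api", "db-latency", True),
    ("billing-api", "cpu-surge", True),
]

def pass_rates(results):
    """Per-service pass rate: a simple posture metric teams can align around."""
    totals, passes = defaultdict(int), defaultdict(int)
    for service, _test, passed in results:
        totals[service] += 1
        passes[service] += passed
    return {s: passes[s] / totals[s] for s in totals}

print(pass_rates(RESULTS))  # {'checkout-api': 0.5, 'billing-api': 1.0}
```

Tracking this rate over successive test runs turns the binary up/down view into a trend line, which is what makes prioritization conversations possible.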
Risk monitoring and mitigation
Some Kubernetes risks, such as missing memory limits, can be quick and easy to fix, but can also cause massive outages if left unaddressed. The complexity of Kubernetes makes these issues easy to miss, but because many reliability risks are common across all Kubernetes deployments, their detection can be operationalized.
Many of these critical risks can be located by scanning configuration files and container statuses. These scans should run continuously on Kubernetes deployments so risks can be surfaced and addressed quickly, especially for frequently-updated deployments. When using Gremlin, the Detected Risks feature continuously monitors for these Kubernetes issues and gives you reports that show which services have which risks.
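To illustrate the kind of check such a scan performs, here is a minimal sketch that flags containers missing memory limits in a parsed Deployment manifest. This operates on a plain dict for simplicity; a real scanner (like Gremlin's) would read live cluster state, and the manifest shown is hypothetical.

```python
def containers_missing_memory_limits(deployment: dict) -> list[str]:
    """Return names of containers in a Deployment spec with no memory limit set."""
    containers = (deployment.get("spec", {})
                            .get("template", {})
                            .get("spec", {})
                            .get("containers", []))
    return [
        c["name"] for c in containers
        if "memory" not in c.get("resources", {}).get("limits", {})
    ]

# Example: a parsed manifest with one compliant and one risky container.
manifest = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "web", "resources": {"limits": {"memory": "256Mi"}}},
        {"name": "sidecar", "resources": {}},  # no memory limit: an OOM risk
    ]}}}
}
print(containers_missing_memory_limits(manifest))  # ['sidecar']
```

Running a check like this on every deploy, rather than once, is what turns a one-off audit into continuous risk monitoring.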
Validation testing using standardized test suites
Resiliency testing uses Fault Injection to safely create fault conditions in your deployment, such as a spike in CPU demand or a drop in network connectivity to key dependencies, so you can verify that your systems respond the way you expect them to.
Using the standards from the first part of the framework, suites of reliability tests can be created and run automatically. This validation testing approach uncovers places where your systems aren’t meeting standards, and the pass/fail data can be used to create metrics that show your changing reliability posture over time.
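The pass/fail logic behind such a suite can be sketched simply: compare behavior observed during each injected fault against the threshold the relevant standard defines. Everything below is hypothetical, the test names, observed values, and thresholds are invented, and in practice the observed values would come from your monitoring during the fault.

```python
# Hypothetical suite: each entry pairs a fault scenario with the threshold
# from the standard it validates.
SUITE = [
    {"test": "db-latency-100ms", "observed_p99_ms": 140, "threshold_ms": 200},
    {"test": "cpu-surge-80pct",  "observed_p99_ms": 260, "threshold_ms": 200},
]

def run_suite(suite):
    """Pass/fail per test: the raw data behind posture metrics and dashboards."""
    return {t["test"]: t["observed_p99_ms"] <= t["threshold_ms"] for t in suite}

print(run_suite(SUITE))  # {'db-latency-100ms': True, 'cpu-surge-80pct': False}
```

Because the thresholds come from the shared standards rather than per-team judgment, a failure here means the service is out of compliance with a documented expectation, not just that a test was flaky.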
When you’re starting with Gremlin, every service will have access to a core suite of validation tests based on reliability best practices. We recommend starting with these, then customizing your test suites over time as you further define your standards and cover previous incidents.
Next steps: Improve uptime with Kubernetes Resiliency Management
Improving Kubernetes resiliency doesn’t have to be a massive, multi-year project. By pairing standards with automated monitoring and resilience testing, you can uncover reliability risks across your entire Kubernetes deployment with very little lift from individual teams.
And by finding these risks, you can mitigate them on your own schedule, before they cause incidents, outages, or customer-impacting downtime.
Find out more by reading the eBook Kubernetes Reliability at Scale: How to Improve Uptime with Resiliency Management, including a 30-day plan for improving your Kubernetes resiliency and proving your results.