Kubernetes just celebrated its tenth birthday. That’s 10 years of microservices, containers, service meshes, and other paradigms that are now standard parts of many developers’ toolkits.

While a decade is a long time for software to grow, not everyone has adopted Kubernetes. And for those who are in the early stages of adoption, the process is still fraught with reliability risks and challenges. Many of these risks have been discovered and documented, but Kubernetes is a complex beast. There are still hidden risks, as well as unknown risks that are unique to each deployment. Some of these risks will arise as you migrate your services; others won’t appear until you’ve been running in production. Regardless, for a migration to succeed, reliability must be a focus throughout the process.

In this blog, we’ll present four strategies for successfully managing reliability while adopting Kubernetes.

Treat reliability as a practice, not a destination

Reliability isn’t a one-and-done exercise. Much like Quality Assurance (QA), performance testing, and other testing practices, reliability is ongoing. It involves:

  • Running reliability tests regularly (at least weekly; see the CronJob sketch after this list),
  • Scanning environments for configuration issues,
  • Identifying and restarting (or replacing) failed services.
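
One lightweight way to make the “at least weekly” cadence stick is to schedule the tests from the cluster itself. Here’s a minimal sketch of a Kubernetes CronJob that runs a test-runner container every Monday morning; the job name, image, and arguments are placeholders for whatever tooling your team actually uses.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: weekly-reliability-tests        # hypothetical name
spec:
  schedule: "0 6 * * 1"                 # every Monday at 06:00
  concurrencyPolicy: Forbid             # don't let runs overlap
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: test-runner
              image: registry.example.com/reliability/test-runner:latest  # placeholder image
              args: ["--suite", "weekly"]                                  # placeholder arguments
```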

Reliability needs can vary by team. For example, in large organizations, Kubernetes deployments are often managed by a centralized team called a Center of Excellence (COE) or Platform team. Their responsibility is to keep the cluster up and running so other teams can deploy to it. When testing reliability, their focus is ensuring that the cluster can scale, that data is replicated, and that failover systems are in place.

In contrast, application development teams aren’t concerned with the reliability of the cluster itself, but with the reliability of the services they deploy onto it. Their responsibility is making sure they’ve replicated their Kubernetes deployments across multiple zones, configured liveness and readiness probes, and set resource requests and limits.
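
To make those three concerns concrete, here’s a minimal sketch of how they show up in a Deployment manifest. The service name, image, port, and probe endpoints are hypothetical; it assumes the application exposes health endpoints at /healthz and /ready.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout                        # hypothetical service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      # Spread replicas across zones so a single zone outage can't take out every replica.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: checkout
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.4.2   # placeholder image
          ports:
            - containerPort: 8080
          # Liveness restarts an unresponsive container; readiness gates traffic to it.
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
          # Requests inform scheduling decisions; limits cap runaway resource usage.
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 512Mi
```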

Encourage engineers to learn and understand Kubernetes’ design

The more you know about how a system works, the better equipped you are to improve its reliability. You’ll have more information on its reliability risks and how to avoid common beginner mistakes. While you don’t need to know everything about Kubernetes (it’s far too large and complex), understanding the basic components and interactions will give you a massive head start on your reliability journey.

For platform engineers, areas of focus might include cluster provisioning, communication between nodes, and container scheduling. For developers, these might include container restart policies, affinity rules, and taints and tolerations.
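
For the developer-facing items, the relevant knobs all live in the Pod spec. The fragment below is an illustration under assumed values, not a recommendation: the zone name and taint key are hypothetical, and sensible settings depend entirely on your cluster.

```yaml
# Pod spec fragment (for example, under a Deployment's spec.template.spec)
spec:
  restartPolicy: Always        # Deployments require Always; Jobs typically use OnFailure or Never
  # Prefer nodes in a particular zone without making it a hard scheduling requirement.
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a"]   # hypothetical zone
  # Allow this pod onto nodes that are tainted for a specific workload class.
  tolerations:
    - key: "workload-class"              # hypothetical taint key
      operator: "Equal"
      value: "batch"
      effect: "NoSchedule"
```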

As you discover bugs, risks, and failure modes, document them. This gives you a record of your reliability work and of the steps taken to address each issue, and it doubles as a playbook if the problem recurs in the future. Lastly, share this knowledge with other teams to help them avoid the same pitfalls.

Proactively find failure modes

When adopting a new platform, finding its weak points is often the last thing on engineers’ minds. Though it might seem counterintuitive, the best way to understand how a complex system behaves is by making it fail. The approach:

  1. Challenge any assumptions you have about how Kubernetes works. For example, an application developer might think: “if my liveness probe fails, Kubernetes will automatically replace the container with a new one.”
  2. Use a tool like Gremlin to test the system by reproducing the conditions of your challenge. In this example, Gremlin can automatically detect whether a Kubernetes deployment has a liveness probe configured. Alternatively, you could run a blackhole experiment on the container to make it appear offline, which will (in theory) trigger the liveness probe.
  3. While the experiment is running, observe your system(s) to determine whether it proves or disproves your assumption. For example, did your liveness probe work as expected? If so, how long did it take for Kubernetes to restart the “failed” container (see the probe timing sketch after this list)? Did requests to your application fail in the meantime? Did something happen that you didn’t expect, such as Kubernetes failing to restart the container at all?
  4. If there was a problem or unexpected behavior, fix it, then repeat this process to validate the fix.
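
A detail worth knowing before you observe the experiment in step 3: how quickly a failed liveness probe turns into a restart is governed by the probe’s own timing fields. A sketch, with the endpoint and values purely illustrative:

```yaml
livenessProbe:
  httpGet:
    path: /healthz           # hypothetical health endpoint
    port: 8080
  initialDelaySeconds: 10    # grace period after the container starts
  periodSeconds: 10          # probe every 10 seconds
  timeoutSeconds: 2          # each probe must respond within 2 seconds
  failureThreshold: 3        # three consecutive failures before the kubelet restarts the container
```

With these values, a container that stops responding is restarted roughly 30 seconds after it goes unhealthy (three failed probes, ten seconds apart), which is exactly the kind of number your experiment should confirm or contradict.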

This also gives you the chance to practice responding to real-world failures. If you discover a similar failure mode in production, you can refer to your previous experiments to troubleshoot and resolve the problem faster.

Be open to learning from incidents

Experiencing a production incident doesn’t mean the migration was a failure. In fact, incidents can help make your Kubernetes deployment more reliable by revealing problems you might not have considered before.

When something fails (and the likelihood of failures when adopting a new platform is high), take it as a learning opportunity. Study the problem, identify the root cause, then design and implement systems to prevent its recurrence. It’s important to focus on the technical and organizational processes that led to the problem, not the people who caused it.

Continuing with the liveness probe example: what could you do to ensure that any containers deployed in the future have working liveness probes? A COE team might create a production deployment pipeline that automatically checks new deployments for valid probes. As developers push their updated containers to production, this pipeline could scan the deployment details, run a copy of the container, look for a valid probe, and flag (or block) deployments that don’t have probes defined.
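
As a sketch of what such a check could look like, here’s a CI job (GitHub Actions syntax, though any CI system works) that fails a pull request when a Deployment manifest defines no liveness probe. The workflow name, manifest directory, and grep-based check are illustrative; a production version would parse the manifests properly rather than pattern-match.

```yaml
name: require-liveness-probes            # hypothetical workflow
on:
  pull_request:
    paths:
      - "k8s/**"                         # hypothetical manifest directory

jobs:
  probe-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Flag Deployments without a livenessProbe
        run: |
          status=0
          for f in k8s/*.yaml; do
            if grep -q "kind: Deployment" "$f" && ! grep -q "livenessProbe:" "$f"; then
              echo "::error file=$f::$f defines a Deployment with no livenessProbe"
              status=1
            fi
          done
          exit $status
```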

Even if a deployment with reliability risks manages to make it through the pipeline, Gremlin can monitor and detect them for you. Liveness probe detection is one of Gremlin’s Detected Risks: high-priority reliability problems that Gremlin automatically identifies from the service’s configuration. While a pipeline check runs only at deploy time, Detected Risks are monitored continuously and tracked over time. You can see the current status of a risk (whether it’s currently at risk, mitigated, or irrelevant to the service), as well as the historical status of all risks for your team or organization. This helps with detecting and resolving regressions.

Team Risk report showing a decrease in Detected Risks over a period of three months.

Additionally, running Gremlin’s suite of reliability tests on a service generates a reliability scorecard for that service. In addition to a single score representing the service’s overall reliability, you can see the results for each test in the suite, along with execution details, logs, and the reason a test failed. Tests are also grouped by category, so you can see at a glance which risks your service is most susceptible to.

Gremlin Company Summary Report showing three services with scores of 94–95.

Conclusion

Building a reliable Kubernetes deployment is difficult, but not impossible. Engineers just need to take an active approach to discovering, understanding, and addressing reliability risks during the adoption process.

If there’s one thing to take away from this blog, it’s this: reliability is an ongoing project. It doesn’t stop when the production switch is turned on. Even after you start serving production traffic, keep running experiments, keep detecting risks, and keep preparing your incident response procedures.

Andre Newman
Sr. Reliability Specialist