How to keep your Kubernetes Pods up and running with liveness probes
Getting your applications running on Kubernetes is one thing: keeping them up and running is another thing entirely. While the goal is to deploy applications that never fail, the reality is that applications often crash, terminate, or restart with little warning. Even before that point, applications can have less visible problems like memory leaks, network latency, and disconnections. To prevent applications from behaving unexpectedly, we need a way of continually monitoring them. That's where liveness probes come in.
In this blog, we'll explain what liveness probes are, how they work, and how Gremlin checks your entire Kubernetes environment for missing or incomplete liveness probe declarations.
Looking for more Kubernetes risks lurking in your system? Grab a copy of our comprehensive ebook, “Kubernetes Reliability at Scale.”
What are liveness probes and why are they important?
A liveness probe is a periodic check that detects whether a container has failed and, if it has, restarts it. It's essentially a health check that periodically sends an HTTP request to a container (or runs a command inside it) and waits for a response. If the response doesn't arrive in time, or the container returns a failure, the probe triggers a restart of the container.
The power of liveness probes is in their ability to detect container failures and automatically restart failed containers. This recovery mechanism is built into Kubernetes itself, with no need for a third-party tool. Service owners define liveness probes as part of their deployment manifests, so their containers are always deployed with one. In theory, the only time a service owner should have to manually check their containers is when restarting doesn't fix the problem (like the dreaded <span class="code-class-custom">CrashLoopBackOff</span> state).
How do I address missing liveness probes?
Defining a liveness probe for each container takes just a few lines of YAML, and you don't need to change anything about how your application or container works.
For example, let's add a liveness probe to an Nginx deployment. When the container is up and running, it exposes an HTTP endpoint on port 80. Since other applications will communicate with Nginx over port 80, it makes sense to create a liveness probe that checks this port's availability:
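The full manifest isn't reproduced here, but a minimal sketch looks something like the following. The names, labels, and image are illustrative assumptions; the probe settings match the values discussed in the rest of this post:

```yaml
# Sketch of an Nginx Deployment with an HTTP liveness probe.
# Names, labels, and image are illustrative; probe values match the breakdown below.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          ports:
            - containerPort: 80
          livenessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 60
            periodSeconds: 3
```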
The section we're looking at in particular is:
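Reconstructed from the breakdown that follows, the probe definition looks like this (the path is an assumption; any endpoint on port 80 that returns a success status would work):

```yaml
livenessProbe:
  httpGet:
    path: /
    port: 80
  initialDelaySeconds: 60
  periodSeconds: 3
```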
If we break this down:
- <span class="code-class-custom">httpGet</span> indicates this is probe issues HTTP requests. There are also liveness probes that run commands, send TCP requests, and send gRPC requests.
- <span class="code-class-custom">path</span> and <span class="code-class-custom">port</span> are the URL and port number that we want to send the request to, respectively.
- <span class="code-class-custom">initialDelaySeconds</span> is the amount of time to wait between deploying the container and running the first probe. This is to give the container time to start up so we avoid false positives.
- <span class="code-class-custom">periodSeconds</span> is how often to run the probe after the initial delay.
Put together, this means that after an initial 60-second delay, Kubernetes will send an HTTP request to port 80 every 3 seconds. As long as the container returns a status code of at least 200 and below 400, Kubernetes considers the container to be healthy. If it returns an error code or can't be contacted at all, Kubernetes restarts the container.
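For comparison, here's a rough sketch of the other probe mechanisms mentioned above. These are separate fragments, not one manifest, and the command, ports, and intervals are placeholders rather than values from the Nginx example:

```yaml
# Exec probe: runs a command inside the container; a non-zero exit code counts as a failure.
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]   # placeholder command
  periodSeconds: 10
```

```yaml
# TCP probe: healthy as long as Kubernetes can open a TCP connection to the port.
livenessProbe:
  tcpSocket:
    port: 80
  periodSeconds: 10
```

```yaml
# gRPC probe: calls the standard gRPC health checking service on the given port.
livenessProbe:
  grpc:
    port: 9000   # placeholder port
  periodSeconds: 10
```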
How do I validate that I'm resilient?
After you've deployed your liveness probe, you can use Gremlin to ensure that it works as expected. Gremlin's Detected Risks feature automatically detects high-priority reliability issues like missing liveness probes. You can also use Gremlin's fault injection toolkit to run Chaos Engineering experiments that cause your liveness probes to report an error.
Imagine we've deployed Nginx with its liveness probe. First, we can check to make sure the liveness probe exists by querying the Pod and looking for the <span class="code-class-custom">Liveness</span> line. For example:
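One way to do this is with kubectl describe. The Pod name below is a placeholder; a Deployment will generate a suffixed name:

```bash
# Describe the Pod and filter for the probe definition.
# "nginx" is a placeholder; substitute your actual Pod name.
kubectl describe pod nginx | grep Liveness

# Expected output for the probe defined earlier:
#   Liveness:  http-get http://:80/ delay=60s timeout=1s period=3s #success=1 #failure=3
```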
Now that we've confirmed the probe is part of our deployment, let's run a Chaos Engineering experiment to test what happens when the probe gets tripped.
Using fault injection to validate your fix
With Gremlin, we can add just enough latency to the container's network connection to trip the liveness probe. The container won't know that the latency is generated by Gremlin, and will treat it as real latency. If we add enough latency to trip the 1 second timeout, we should see the liveness probe fail and the container restart.
To test this:
- Log into the Gremlin web app at app.gremlin.com.
- Select Experiments in the left-hand menu and select New Experiment.
- Select Kubernetes, then select our Nginx Pod.
- Expand Choose a Gremlin, select the Network category, then select the Latency experiment.
- Increase MS to 1000. This is the amount of latency to add to each network packet in milliseconds. Since the probe is set to time out after one second, this guarantees that any response sent from Nginx to Kubernetes takes at least that long.
- Increase Length to 120 seconds or higher. Remember: the liveness probe will hold for 60 seconds while waiting for the pod to finish starting. We want to run our experiment long enough to exceed that delay.
- Click Run Experiment to start the experiment.
Now, let's keep an eye on our Nginx Pod. In just a few seconds, we'll see the pod restart automatically.
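If you want to watch this from the command line, here's a quick sketch (the Pod name and timings are illustrative):

```bash
# Watch the Pod list; the RESTARTS counter increments when the liveness probe fails.
kubectl get pods --watch

# Example output:
#   NAME                     READY   STATUS    RESTARTS      AGE
#   nginx-7c5ddbdf54-abcde   1/1     Running   1 (15s ago)   6m
```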
What similar risks should I be looking for?
Kubernetes has two additional types of probes: startup probes and readiness probes.
Startup probes grant containers extra startup time by letting you set both a probe period (in seconds) and a failure threshold, which is the number of times Kubernetes will run the probe before killing the container. For example, if you set a period of 10 seconds and a failure threshold of 30, the probe will run every 10 seconds up to 30 times, giving the application 5 minutes (300 seconds) to start. It's important to note that liveness probes won't run until after the startup probe succeeds.
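As a sketch, that example would look something like this (the endpoint is an assumption):

```yaml
startupProbe:
  httpGet:
    path: /
    port: 80
  periodSeconds: 10      # probe every 10 seconds...
  failureThreshold: 30   # ...up to 30 times, giving the app 300 seconds to start
```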
Readiness probes work similarly to liveness probes, but they handle applications that are running yet not ready to receive traffic. A readiness probe keeps traffic away from the container until the application is ready to process it. For example, our Nginx container might start up within 10 seconds, but what if it had to load a massive configuration file that took an additional 30 seconds? If we just set a startup probe to allow for 10 seconds, other applications might send requests to the Nginx container while it's still processing its configuration. Readiness probes prevent this from happening.
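A readiness probe for the Nginx example might look like the sketch below. The path and timings are placeholders; the endpoint should only respond successfully once the configuration has finished loading:

```yaml
readinessProbe:
  httpGet:
    path: /          # placeholder: should only succeed once the app can serve traffic
    port: 80
  initialDelaySeconds: 10
  periodSeconds: 5
```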
While you're free to use all three probe types for your containers, the Kubernetes docs explain when you might prefer one over the other. In short:
- Use a readiness probe when you want to avoid sending traffic to a Pod until it's ready to process traffic.
- Use a liveness probe to detect critical container errors that might not be detected by Kubernetes.
- Use a startup probe for containers that take a long time to start, where the slow startup might otherwise trip a liveness probe.
We'll cover more Kubernetes Detected Risks in the future. In the meantime, if you're ready to scan your own Kubernetes environment for reliability risks like these, give Gremlin a try. You can sign up for a free 30-day trial, and after installing our Gremlin agent, get a complete report of your reliability risks.
For more on liveness probes and other Kubernetes risks, check out our comprehensive ebook, “Kubernetes Reliability at Scale.”