How to ensure your Kubernetes Pods have enough CPU
A common risk is deploying Pods without setting a CPU request. It may seem like a low-impact, low-severity issue, but a missing CPU request can have big consequences, up to and including preventing your Pod from running. In this blog, we explain why missing CPU requests are a risk, how you can detect them using Gremlin, and how you can address them.
Looking for more Kubernetes risks lurking in your system? Grab a copy of our comprehensive ebook, “Kubernetes Reliability at Scale.”
What are CPU requests and why are they important?
In Kubernetes, you can control how resources are allocated to individual Deployments, Pods, and even containers. When you specify a limit, Kubernetes won't let the Pod use more than that amount. Conversely, when you specify a request, you're declaring the minimum amount the Pod needs to run.
Kubernetes measures CPU request values in CPU units. 1 CPU unit is the same as 1 physical or virtual CPU core. This value can be fractional: 0.5 is half of one core, 0.1 is one tenth of a core, etc. Fractional values are often written in millicores (or millicpu), where 1000m equals 1 CPU unit, so 0.1 and 100m mean the same thing.
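For instance, this fragment (not a complete manifest) requests half a core:

```yaml
resources:
  requests:
    cpu: "0.5"   # half of one core; equivalent to writing cpu: 500m
```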
Requests serve two key purposes:
- They tell Kubernetes the minimum amount of the resource to allocate to a Pod. This helps Kubernetes determine which node to schedule the Pod on and how to schedule it relative to other Pods.
- They protect your nodes from resource shortages by preventing too many Pods from being packed onto a single node.
Without requests, Kubernetes might schedule a Pod onto a node that doesn't have enough capacity for it. Even if the Pod uses only a small amount of CPU at first, that usage could increase over time, leading to CPU exhaustion on the node.
How do I mitigate missing CPU requests?
To mitigate this risk, specify an appropriate resource request for each of your containers using `spec.containers[].resources.requests.cpu`. If you're not sure what value to set, you can get a baseline estimate using this process:
- Run your Pod normally.
- Collect metrics using the Kubernetes Metrics API, an observability tool, or a cloud platform. An easy way to do this is by running `kubectl top pod`. Ideally, you should gather these metrics from a production system for the most accurate results.
- Find the CPU usage for your Pod, then use that value as the CPU request amount. You may want to increase this amount to leave some overhead, especially if the Pod wasn't under load when you measured.
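For example, `kubectl top pod` (which requires the Metrics Server to be installed) produces output along these lines; the Pod name and values here are illustrative:

```
$ kubectl top pod
NAME                     CPU(cores)   MEMORY(bytes)
nginx-7c5ddbdf54-xt2lm   200m         24Mi
```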
For example, imagine we have a Pod running Nginx that we want to set a CPU request for. After some testing, we determined that the container uses `200m` of CPU. To be safe, we'll request `250m` by adding it to our Kubernetes manifest (see the `resources.requests` block in the manifest below):
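A minimal sketch of what that manifest could look like; the Deployment name, labels, and image tag are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          resources:
            requests:
              cpu: 250m   # measured ~200m baseline plus some headroom
```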
Then, apply the change and wait for Kubernetes to re-deploy your Pod:
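Assuming the manifest is saved as `nginx-deployment.yaml` (a filename we've chosen for this example):

```
kubectl apply -f nginx-deployment.yaml
kubectl rollout status deployment/nginx
```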
How do I validate that I'm resilient?
Once your Pod finishes restarting, you can use the Kubernetes Dashboard (or `kubectl describe node <node name>`) to list each Pod running on the specified node, along with their resource requests and limits. If your CPU request applied successfully, then the Nginx Pod should have a value listed in the "CPU Requests" column:
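The relevant part of the `kubectl describe node` output looks roughly like this (abbreviated, with illustrative names and percentages):

```
Non-terminated Pods:         (4 in total)
  Namespace    Name                      CPU Requests   CPU Limits   Memory Requests   Memory Limits
  ---------    ----                      ------------   ----------   ---------------   -------------
  default      nginx-7c5ddbdf54-xt2lm    250m (12%)     0 (0%)       0 (0%)            0 (0%)
```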
You can also use Gremlin to verify your mitigation. Gremlin's Detected Risks feature immediately detects any high-priority reliability concerns in your environment. These can include misconfigurations, bad default values, or reliability anti-patterns. If you've addressed this risk, then the CPU requests risk will show as "Mitigated" instead of "At Risk".
A more thorough way to validate this is by seeing how Kubernetes responds when the Pod grows beyond its request. For example, what happens when our Pod uses exactly 250m of CPU time? What about 300m? This requires an active approach to testing using a method called fault injection.
Using fault injection to validate your fix
With fault injection, you can consume specific amounts of CPU time within a Pod or container to ensure your Pod doesn't get evicted or moved to a different node. In Gremlin, an ad-hoc fault injection is called an experiment.
To test this scenario:
- Log into the Gremlin web app at app.gremlin.com.
- Select Experiments in the left-hand menu and select New Experiment.
- Select Kubernetes, then select our Nginx Pod.
- Expand Choose a Gremlin, select the Resource category, then select the CPU experiment.
- Change CPU Capacity to the percentage of CPU we want to consume. We want to use 250m of CPU time, which equates to 1/4 of a single core. In other words, we want to use 25%. In Gremlin, we'll set CPU Capacity to 25 and keep the number of cores set to 1.
- Click Run Experiment to start the experiment.
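While the experiment runs, one simple way to watch the Pod's CPU usage from a terminal (assuming the `app: nginx` label from the manifest above):

```
watch -n 5 kubectl top pod -l app=nginx
```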
Now, we keep an eye on our Nginx Pod. We'll see usage increase above 250m, but the Pod itself will keep running just fine. If it gets evicted or rescheduled, this tells us one of several things:
- We're requesting an unnecessarily high number of CPU units.
- We don't have enough capacity to run our workloads, and we need to scale our cluster vertically.
- We're not leaving this Pod enough overhead to grow, so we should increase its CPU request.
What similar risks should I be looking for?
You can use these same methods to test for memory requests. In fact, Gremlin's Detected Risks automatically finds Kubernetes resources that don't have memory requests defined, just like how it finds resources without CPU requests. For a complete list of the most critical Kubernetes risks, download a free copy of our ebook.
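The manifest change for memory is analogous to the CPU one; a fragment for illustration, with an assumed value:

```yaml
resources:
  requests:
    cpu: 250m
    memory: 64Mi   # illustrative; baseline it the same way you did for CPU
```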
Ready to find out which of your Kubernetes resources are missing CPU request definitions? Sign up for a free 30-day trial, install the Gremlin agent, and get a report of your reliability risks in minutes.