Introducing Process Exhaustion: How to scale your services without overwhelming your systems
We rarely think about how many processes are running on our systems. Modern CPUs are powerful enough to run thousands of processes concurrently, but at what point do our systems become oversaturated? When you’re running large-scale distributed applications, you might reach this limit sooner than you'd expect.
How can you determine what that limit is, and how does that affect the number and complexity of the workloads you deploy? In this blog, we’ll take a deep dive into Gremlin’s brand new Process Exhaustion experiment. We’ll explain how it works, why you’d want to use it, and how testing your process limits can lead to increased reliability and efficiency.
What is a process, and how does it affect reliability?
A process is an instance of an application running on a computer. As an example, consider a simple text editor: when you open the editor, the operating system loads its executable, pulls in any other resources needed to run it, passes along any options (e.g., the path to a text file), then starts it. This process runs independently of other processes, whether they’re other applications or background services.
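To make this concrete, here's a minimal Python sketch of that flow using the standard subprocess module. The editor name "nano" and the file "notes.txt" are stand-ins for whatever application and arguments you'd actually use:

```python
import subprocess

# Ask the OS to load an executable, pass it an option (a file path),
# and start it as a new, independent process.
proc = subprocess.Popen(["nano", "notes.txt"])

print(f"Editor running as PID {proc.pid}")

# The new process runs independently of this one; wait() blocks until
# it exits and returns its exit code.
exit_code = proc.wait()
print(f"Editor exited with code {exit_code}")
```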
Some applications start as a single process and create (i.e. fork) additional processes later. For example, the Apache web server uses this technique to assign processes and threads to incoming requests, which lets it distribute work across multiple CPU cores. Rather than fork indefinitely, Apache caps the pool with a configurable ServerLimit directive. Setting ServerLimit too high can destabilize the system during high-traffic periods, since the system must keep all of those processes in memory and context switch between them to ensure each one gets CPU time; each process can also consume file handles, storage, and network bandwidth, compounding the risk of resource exhaustion. Setting it too low, however, can result in slow or failed responses, since the application no longer uses its CPU cores efficiently.
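As a rough illustration of the pattern (this is a sketch, not Apache's actual implementation), the Python snippet below pre-forks a capped pool of workers; SERVER_LIMIT and handle_requests are illustrative names:

```python
import os

SERVER_LIMIT = 4  # cap on worker processes, analogous to Apache's ServerLimit

def handle_requests(worker_id):
    # Stand-in for real work, such as serving incoming requests.
    print(f"Worker {worker_id} handling requests as PID {os.getpid()}")

children = []
for worker_id in range(SERVER_LIMIT):
    pid = os.fork()  # clone the current process (Unix-only)
    if pid == 0:
        handle_requests(worker_id)
        os._exit(0)  # forked children exit without running parent cleanup
    children.append(pid)

# The parent reaps every worker so no zombie processes linger.
for pid in children:
    os.waitpid(pid, 0)
```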
Why are process limits important?
In the era of monolithic applications, process limits were less of a concern, since an entire application often ran as a single process. Today, organizations are adopting modular systems like containers, Kubernetes, and serverless computing. Each container (or pod, in the case of Kubernetes) runs as one or more processes, which means the more containers you run on a host, the more PIDs you consume. A key goal of container orchestration tools like Kubernetes is to use computing resources as efficiently as possible by packing many pods onto each host, which can eventually result in process exhaustion. It's enough of a concern that the official Kubernetes docs recommend configuring process ID (PID) limits and reservations on your nodes.
Process IDs (PIDs) are a fundamental resource on nodes. It is trivial to hit the task limit without hitting any other resource limits, which can then cause instability to a host machine.
"Process ID Limits and Reservations," Kubernetes documentation
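If you're running Kubernetes, the kubelet exposes these controls directly. Here's a hedged KubeletConfiguration sketch using the podPidsLimit field and PID reservations; the values are illustrative assumptions, not recommendations, and where this config file lives depends on how your cluster provisions the kubelet:

```yaml
# A sketch of kubelet PID controls (values are illustrative).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Cap the number of PIDs any single pod can consume.
podPidsLimit: 4096
# Reserve PIDs for system daemons and Kubernetes components so pods
# can't starve the node itself.
systemReserved:
  pid: "1000"
kubeReserved:
  pid: "1000"
```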
Understanding process limits can help you answer questions like:
- How many workloads can I reasonably expect my systems to handle?
- How do my systems and services react when a host runs out of PIDs?
- Are my PID limits being enforced (e.g. on Kubernetes)? (A quick way to check is sketched after this list.)
- If I have a sudden influx of traffic, can my systems scale without exhausting PID limits or crashing?
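For the third question, a minimal Python sketch like the one below can verify enforced limits from inside a container or pod. It assumes a Linux host and reads the cgroup pids controller; since the file paths differ between cgroup v1 and v2, it checks both:

```python
from pathlib import Path

def read_first(*paths):
    # Return the contents of the first file that exists, else None.
    for path in paths:
        f = Path(path)
        if f.exists():
            return f.read_text().strip()
    return None

# cgroup v2 exposes pids.max at the cgroup root mounted inside the
# container; cgroup v1 uses a dedicated "pids" controller directory.
limit = read_first("/sys/fs/cgroup/pids.max", "/sys/fs/cgroup/pids/pids.max")
in_use = read_first("/sys/fs/cgroup/pids.current", "/sys/fs/cgroup/pids/pids.current")

print(f"PID limit for this cgroup: {limit or 'not found'}")
print(f"PIDs currently in use:     {in_use or 'not found'}")
```

A limit value of "max" means no PID limit is actually being enforced on that cgroup.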
What is the default process limit for Linux systems?
In theory, a modern 64-bit Linux kernel can support just over 4 million concurrent processes (the kernel's pid_max setting caps out at 2^22, or 4,194,304). In practice, Linux also enforces two per-user limits: a soft limit and a hard limit. The soft limit restricts the number of processes a user can run, which prevents a single user from exhausting the entire system's processes (e.g. by running a fork bomb). The hard limit is the ceiling for the soft limit, and only the root user can raise it.
Even powerful modern systems will likely exhaust CPU and memory long before reaching that theoretical maximum. For this reason, some distributions, such as SUSE, set pid_max to a more conservative 32,768.
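If you want to see these values on your own systems, here's a minimal Python sketch; it is Linux-only, since it reads a /proc path and the Linux-specific RLIMIT_NPROC limit:

```python
import resource

# System-wide ceiling on concurrent PIDs.
with open("/proc/sys/kernel/pid_max") as f:
    pid_max = int(f.read())

# Per-user process limits: the soft limit is what's enforced; the hard
# limit is the ceiling that only root can raise.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)

def fmt(value):
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)

print(f"kernel.pid_max:      {pid_max}")
print(f"per-user soft limit: {fmt(soft)}")
print(f"per-user hard limit: {fmt(hard)}")
```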
How to test process exhaustion using Gremlin
Gremlin’s Process Exhaustion experiment works by creating new PIDs until it reaches the amount you specify. You can tell it to consume a specific number of PIDs, or a percentage of the available PIDs on the target. Process Exhaustion supports these parameters:
- Length: how long the experiment runs (60 seconds by default).
- Allocation Strategy: whether to consume an absolute number of PIDs, or a relative percentage of the target's PIDs.
- PIDs or Percent of PIDs: the number or percentage of PIDs to consume, depending on the strategy you chose.
As with all Gremlin experiments, you can change the length of time that the experiment runs for (60 seconds by default) and run on multiple targets simultaneously. If you’re new to Gremlin or to running experiments, we recommend starting small by targeting a single non-production host and using percentage-based allocation to avoid exceeding your PID limit.
To run a Process Exhaustion experiment:
- Log into your Gremlin account (or sign up for a free trial).
- Create a new experiment and select a host, container, or Kubernetes resource to target. Start small and select a single non-essential target.
- Under Choose a Gremlin, select the State category, then select Process Exhaustion.
- Set the Length of the experiment. By default, it will run for 60 seconds.
- Select the Allocation Strategy. “Absolute” consumes the exact number of PIDs you specify, while “Relative” consumes PIDs until total PID usage on the target reaches the percentage you specify. For this example, choose “Relative.”
- Enter the Percent of PIDs to allocate. We recommend starting small for your first experiment, so enter 25%.
- Click Run Experiment to start the experiment.
While the experiment is running, keep an eye on your systems using a monitoring or observability tool (a simple PID-watching sketch follows the questions below):
- How did the system respond? Did it crash, or did it keep running throughout the experiment?
- Did any behaviors arise that you didn’t expect, like applications or services failing?
- If a crash or other critical failure occurred, how long did it take to happen? Was it sudden, or did it happen only after the experiment had been running for a while?
- What steps could you take to avoid or mitigate these issues in production?
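If you don't have an observability tool pointed at the target yet, even a minimal sketch like this one, run on the target host, will show PID consumption climbing during the experiment. It is Linux-only and simply counts the numeric entries under /proc once per second:

```python
import time
from pathlib import Path

def count_pids():
    # Every running process appears as a numeric directory under /proc.
    return sum(1 for entry in Path("/proc").iterdir() if entry.name.isdigit())

# Sample once per second for 60 seconds, matching the default
# experiment length.
for _ in range(60):
    print(f"{time.strftime('%H:%M:%S')}  running processes: {count_pids()}")
    time.sleep(1)
```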
Make your systems more resilient
Process Exhaustion is only one of many major features added to Gremlin recently. To learn more about our other features - including support for AWS Key Management Service, Restricted Time Windows, improved dependency detection, and improved API auditing tools - check out our release roundup blog post for March 2024. And if you’re ready to give the Process Exhaustion experiment a try, sign up for a free 30-day trial.
Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30-day trial.
START YOUR TRIAL