Getting the most out of Gremlin Resource Experiments
Using Gremlin to simulate resource contention is a great way to understand how your application responds under boundary conditions, to test and validate autoscaling, and to ensure your monitoring system has the proper notifications configured. To do this, Gremlin provides mechanisms for creating CPU load, consuming memory, and consuming disk I/O and disk space. While Gremlin does this well, the Linux kernel also does a great job of balancing application load in your systems, and to the kernel Gremlin is just another application. For the most extreme experiments, it often helps to skew the kernel’s balancing in Gremlin’s favor.
To do that, let’s explore the tools in the Linux OS that can help unbalance your system and test the extremes of what your applications and systems are capable of. Linux provides a great toolbox for this: commands such as <span class="code-class-custom">nice</span>, <span class="code-class-custom">chrt</span>, and <span class="code-class-custom">ionice</span> come to mind, as well as adjustments to <span class="code-class-custom">OOM Killer</span>. In this article, we’ll dive into the use and use case for each one as it relates to Gremlin and injecting chaos experiments into your systems.
Pushing the Compute Boundaries
By default, the Linux scheduler, CFS, gives every process a niceness of 0 and the <span class="code-class-custom">SCHED_OTHER</span> scheduling policy. The Gremlin daemon, <span class="code-class-custom">gremlind</span>, is treated no differently from any other process out of the box. This means that a CPU contention experiment will, at worst, simulate the boundaries of a normal runaway process. For most experiments, this is desired and expected. After all, you don’t normally run your processes with elevated priority.
Sometimes, however, we want to smash past those boundaries, the built-in safeguards, and the fairness of the Linux scheduler. To do this, you’ll need to engage two mechanisms to tell the Linux scheduler that <span class="code-class-custom">gremlind</span> is your priority process, and therefore your system’s priority process.
The nice value sets a process’s priority on a scale from +19 to -20, with 0 being the default. The scale describes how nice we want a process to be: at +19 a process is very nice, meaning it readily defers processor time to other applications, while at -20, the other end of the spectrum, it is a very un-nice process indeed. We want <span class="code-class-custom">gremlind</span> to be very un-nice to our system; that is, to receive more CPU time and be favored by the scheduler over other processes.
This command will set <span class="code-class-custom">gremlind</span> to the highest priority on the machine:
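A minimal sketch of such a command, assuming the daemon’s process name is <span class="code-class-custom">gremlind</span>, that <span class="code-class-custom">renice</span> is installed, and that you run it as root, since lowering niceness below 0 requires elevated privileges. The helper name <span class="code-class-custom">set_niceness</span> is hypothetical:

```shell
# Hypothetical helper: set the niceness of one process by PID.
# Defaults to -20, the most favourable value; values below 0 require root.
set_niceness() {
  pid="$1"
  value="${2:--20}"
  renice -n "$value" -p "$pid"
}

# Example, run as root, assuming the daemon's process name is gremlind:
#   set_niceness "$(pgrep -o -x gremlind)"
```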
The other mechanism we need to adjust is the scheduling policy for Gremlin. As mentioned above, the default scheduling policy is <span class="code-class-custom">SCHED_OTHER</span>. Linux scheduling policies include <span class="code-class-custom">SCHED_FIFO</span>, <span class="code-class-custom">SCHED_BATCH</span>, <span class="code-class-custom">SCHED_IDLE</span>, <span class="code-class-custom">SCHED_OTHER</span>, and <span class="code-class-custom">SCHED_RR</span>. Without diving too deeply into the technical details behind each one, <span class="code-class-custom">SCHED_BATCH</span> is designed for CPU-intensive workloads. Setting the <span class="code-class-custom">gremlind</span> process to use this scheduling policy will, in conjunction with making it very un-nice, enable it to consume the majority of the CPU resources available to your host.
To view the current policy, run the following command:
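One way to do this is with <span class="code-class-custom">chrt</span>. The sketch below assumes <span class="code-class-custom">chrt</span> from util-linux is installed; the helper name <span class="code-class-custom">show_sched_policy</span> is hypothetical:

```shell
# Hypothetical helper: print the scheduling policy and priority of a PID.
show_sched_policy() {
  chrt -p "$1"
}

# Example, assuming the daemon's process name is gremlind:
#   show_sched_policy "$(pgrep -o -x gremlind)"
```

On a process that has not been modified, the reported policy would typically be <span class="code-class-custom">SCHED_OTHER</span>.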
This command will set <span class="code-class-custom">gremlind</span> to the <span class="code-class-custom">SCHED_BATCH</span> policy:
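A minimal sketch, again assuming <span class="code-class-custom">chrt</span> from util-linux and a hypothetical helper name, <span class="code-class-custom">set_batch_policy</span>. Changing the policy of a root-owned daemon such as <span class="code-class-custom">gremlind</span> would need root:

```shell
# Hypothetical helper: switch one process to the SCHED_BATCH policy.
# SCHED_BATCH takes no real-time priority, so the priority argument is 0.
set_batch_policy() {
  chrt -b -p 0 "$1"
}

# Example, run as root, assuming the daemon's process name is gremlind:
#   set_batch_policy "$(pgrep -o -x gremlind)"
```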
To return <span class="code-class-custom">gremlind</span> to normal operating conditions, run the following commands:
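As a sketch under the same assumptions (util-linux tools, root privileges for a root-owned daemon), the two changes could be undone together. The helper name <span class="code-class-custom">restore_cpu_defaults</span> is hypothetical:

```shell
# Hypothetical helper: return a process to SCHED_OTHER and niceness 0.
restore_cpu_defaults() {
  chrt -o -p 0 "$1"      # back to the default scheduling policy
  renice -n 0 -p "$1"    # back to the default niceness
}

# Example, run as root, assuming the daemon's process name is gremlind:
#   restore_cpu_defaults "$(pgrep -o -x gremlind)"
```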
Pushing the I/O Boundaries
Along with the compute scheduler, Linux also has an I/O scheduler, with scheduling policies of its own. The policies of the Linux I/O scheduler are: Idle, Best Effort and Real Time.
The default I/O policy is Best Effort, which takes some of its direction from the process’s niceness. Best Effort has a priority scale of 0-7, with 0 being the highest priority. The default equation to determine where a process falls on the priority scale is: <span class="code-class-custom">io_priority = (cpu_nice + 20) / 5</span>. Therefore, if you’ve already set your niceness to -20, you have the best I/O scheduling that Best Effort can offer without changing anything else. We can do better, though.
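To make the mapping concrete, here is the equation evaluated for a few niceness values. This is a sketch that assumes integer division; the function name <span class="code-class-custom">io_priority_from_nice</span> is hypothetical:

```shell
# Best Effort I/O priority derived from CPU niceness: (nice + 20) / 5,
# evaluated with integer division (an assumption of this sketch).
io_priority_from_nice() {
  echo $(( ($1 + 20) / 5 ))
}

io_priority_from_nice -20   # prints 0, the highest Best Effort priority
io_priority_from_nice 0     # prints 4, the level for the default niceness
io_priority_from_nice 19    # prints 7, the lowest Best Effort priority
```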
The Real Time policy gets first access to disk, regardless of what else is happening in the system. Like Best Effort, it also has a priority scale of 0-7, 0 being the highest priority.
To set <span class="code-class-custom">gremlind</span> to the Real Time policy with a priority of 0, run the following command:
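A minimal sketch using <span class="code-class-custom">ionice</span>, where class 1 is Real Time, 2 is Best Effort, and 3 is Idle. It assumes <span class="code-class-custom">ionice</span> from util-linux and root privileges, which the Real Time class requires. The helper name <span class="code-class-custom">set_io_priority</span> is hypothetical:

```shell
# Hypothetical helper: set the I/O scheduling class and level of a PID.
# Class 1 = Real Time, 2 = Best Effort, 3 = Idle; Real Time requires root.
set_io_priority() {
  pid="$1"
  class="${2:-1}"   # default: Real Time
  level="${3:-0}"   # default: highest priority within the class
  ionice -c "$class" -n "$level" -p "$pid"
}

# Example, run as root, assuming the daemon's process name is gremlind:
#   set_io_priority "$(pgrep -o -x gremlind)"
```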
Post experiment, to return <span class="code-class-custom">gremlind</span> to normal conditions, run the following command:
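One plausible form of that command, as a sketch: it places the process back in the Best Effort class at level 4, which is the level that the default niceness of 0 maps to under the equation above. The helper name <span class="code-class-custom">restore_io_defaults</span> is hypothetical:

```shell
# Hypothetical helper: put a process back in the Best Effort class (2)
# at level 4, the level that niceness 0 maps to under (nice + 20) / 5.
restore_io_defaults() {
  ionice -c 2 -n 4 -p "$1"
}

# Example, run as root, assuming the daemon's process name is gremlind:
#   restore_io_defaults "$(pgrep -o -x gremlind)"
```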
Pushing the Memory Boundaries
When Linux starts to run out of memory, it gets a bit defensive. Enter the OOM Killer, a mechanism the kernel uses to free up memory when it approaches memory exhaustion. OOM Killer works by giving each running process an <span class="code-class-custom">oom_score</span>; that is, a measure of how likely the kernel is to terminate that process when little or no memory is available.
It computes that score in proportion to the amount of memory used by the process. The equation is <span class="code-class-custom">oom_score = 10 * %_of_process_memory</span>. So if your host has 10 GB of memory, your application is using around 3 GB, another 1 GB is being used by other tasks, and <span class="code-class-custom">gremlind</span> is using roughly 5 GB, then your app would receive an <span class="code-class-custom">oom_score</span> of ~300, while <span class="code-class-custom">gremlind</span> would receive an <span class="code-class-custom">oom_score</span> of ~500. With the highest score, <span class="code-class-custom">gremlind</span> would be killed first and the system should return to normal.
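The arithmetic in that example can be checked with a few lines of shell. This is a sketch of the simplified equation above, not of the kernel’s actual accounting, and the function name <span class="code-class-custom">oom_score_estimate</span> is hypothetical:

```shell
# Simplified estimate: oom_score = 10 * percent of host memory in use
# by the process. Arguments are whole gigabytes: used, then total.
oom_score_estimate() {
  used_gb="$1"
  total_gb="$2"
  echo $(( 10 * used_gb * 100 / total_gb ))
}

oom_score_estimate 3 10   # the application: prints 300
oom_score_estimate 5 10   # gremlind: prints 500
```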
You can see the <span class="code-class-custom">oom_score</span> of any given process by running the following command:
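The kernel exposes the score as a file under <span class="code-class-custom">/proc</span>, so reading it is enough. The helper name <span class="code-class-custom">show_oom_score</span> in this sketch is hypothetical:

```shell
# Hypothetical helper: print the kernel's current oom_score for a PID.
show_oom_score() {
  cat "/proc/$1/oom_score"
}

# Example, assuming the daemon's process name is gremlind:
#   show_oom_score "$(pgrep -o -x gremlind)"
```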
For instance, on a t2.micro instance at idle, <span class="code-class-custom">gremlind</span> has an <span class="code-class-custom">oom_score</span> of around 8.
There are a couple of ways to modify the score. The first is through <span class="code-class-custom">/proc/$PID/oom_score_adj</span> and the second is through <span class="code-class-custom">/proc/$PID/oom_adj</span>. The first is a very granular scale, from -1000 to 1000, and works like nice: positive integers make the process more likely to be killed and negative integers less likely. The second is less granular, on a scale of 15 to -17, where -17 is a special value meaning never kill.
To set <span class="code-class-custom">gremlind</span> to never be killed by OOM Killer, run the following command:
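A minimal sketch using the more granular <span class="code-class-custom">oom_score_adj</span> file, where -1000 plays the same role as -17 does on the <span class="code-class-custom">oom_adj</span> scale. It assumes root privileges, which lowering the value requires; the helper name <span class="code-class-custom">set_oom_score_adj</span> is hypothetical:

```shell
# Hypothetical helper: write an OOM adjustment for a PID.
# Defaults to -1000, which exempts the process from the OOM Killer;
# lowering the value requires root.
set_oom_score_adj() {
  pid="$1"
  value="${2:--1000}"
  echo "$value" > "/proc/$pid/oom_score_adj"
}

# Example, run as root, assuming the daemon's process name is gremlind:
#   set_oom_score_adj "$(pgrep -o -x gremlind)"
```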
To return <span class="code-class-custom">gremlind</span> to normal conditions, run the following command:
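As a sketch, writing 0 to <span class="code-class-custom">oom_score_adj</span> removes any adjustment, assuming the process started with the neutral value of 0. The helper name <span class="code-class-custom">reset_oom_score_adj</span> is hypothetical:

```shell
# Hypothetical helper: restore the OOM adjustment of a PID.
# Defaults to 0, the neutral value; pass another value if the process
# started with a different adjustment.
reset_oom_score_adj() {
  pid="$1"
  value="${2:-0}"
  echo "$value" > "/proc/$pid/oom_score_adj"
}

# Example, run as root, assuming the daemon's process name is gremlind:
#   reset_oom_score_adj "$(pgrep -o -x gremlind)"
```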
Conclusion
Finding the edge cases where our systems break down, and recording what happens in those events, is one of the many use cases for Chaos Engineering. Coupled with the right observability and DevOps practices, you can start to understand what happens at the extreme end of scalability for your applications. Resource contention is one of those extreme cases.
You wouldn’t start here, but you may be able to take your experiments further by adjusting the policies around how Linux treats <span class="code-class-custom">gremlind</span>. By doing so, you’ll be able to push your hosts and applications into those extreme scenarios and prevent big problems should those scenarios occur naturally, while also finding new ways to tweak performance and enhance both reliability and process execution for your system.