Overview and Resources
The Simian Army is a suite of failure-inducing tools designed to add more capabilities beyond Chaos Monkey. While Chaos Monkey solely handles termination of random instances, Netflix engineers needed additional tools able to induce other types of failure. Some of the Simian Army tools have fallen out of favor in recent years and are deprecated, but each of the members serves a specific purpose aimed at bolstering a system's failure resilience.
In this chapter we'll jump into each member of the Simian Army and examine how these tools helped shape modern Chaos Engineering best practices. We'll also explore each of the Simian Chaos Strategies used to define which Chaos Experiments the system should undergo. Lastly, we'll plunge into a short tutorial walking through the basics of installing and using the Simian Army toolset.
Simian Army Members
Each Simian Army member was built to perform a small yet precise Chaos Experiment. Results from these tiny tests can be easily measured and acted upon, allowing you and your team to quickly adapt. By performing frequent, intentional failures within your own systems, you're able to create a more fault-tolerant application.
Active Simians
In addition to Chaos Monkey, the following three simians are the only members of the Army that were publicly released and remain available for use today.
Janitor Monkey - Now Swabbie
Janitor Monkey seeks out and disposes of unused resources within the cloud. It checks any given resource against a set of configurable rules to determine if it's an eligible candidate for cleanup. Janitor Monkey features a number of configurable options, but the default behavior looks for resources like orphaned (non-auto-scaled) instances, volumes that are not attached to an instance, unused auto-scaling groups, and more.
Have a look at Using Simian Army Tools for a basic guide to configuring and executing Janitor Monkey experiments.
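To make the rule-checking idea concrete, here is a minimal sketch of how a resource might be tested against configurable cleanup rules. It is not Janitor Monkey's actual code: the `Volume` record, the `detached_longer_than` rule, and the 30-day threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Optional


@dataclass
class Volume:
    """Hypothetical stand-in for a cloud volume record."""
    volume_id: str
    attached_instance: Optional[str]
    detached_since: Optional[datetime]


# A rule takes a volume and the current time and says whether it is stale.
Rule = Callable[[Volume, datetime], bool]


def detached_longer_than(days: int) -> Rule:
    """Build a rule flagging volumes left unattached for more than `days`."""
    def rule(volume: Volume, now: datetime) -> bool:
        if volume.attached_instance is not None or volume.detached_since is None:
            return False
        return now - volume.detached_since > timedelta(days=days)
    return rule


def cleanup_candidates(volumes: List[Volume], rules: List[Rule], now: datetime) -> List[Volume]:
    """Return the volumes that violate at least one rule."""
    return [v for v in volumes if any(rule(v, now) for rule in rules)]


now = datetime(2024, 1, 31)
volumes = [
    Volume("vol-attached", "i-123", None),
    Volume("vol-recent", None, datetime(2024, 1, 30)),
    Volume("vol-orphaned", None, datetime(2023, 12, 1)),
]
candidates = cleanup_candidates(volumes, [detached_longer_than(30)], now)
print([v.volume_id for v in candidates])  # ['vol-orphaned']
```

Keeping each rule as a small function means new cleanup criteria can be added without touching the selection logic.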
Update: Swabbie is the Spinnaker service that replaces the functionality provided by Janitor Monkey. Find out more in the official documentation.
Conformity Monkey - Now Part of Spinnaker
The Conformity Monkey is similar to Janitor Monkey -- it seeks out instances that don't conform to predefined rule sets and shuts them down. Here are a few of the non-conformities that Conformity Monkey looks for.
- Auto-scaling groups and their associated elastic load balancers that have mismatched availability zones.
- Clustered instances that are not contained in required security groups.
- Instances that are older than a certain age threshold.
Conformity Monkey capabilities have also been rolled into Spinnaker. More info on using Conformity Monkey can be found under Using Simian Army Tools.
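As a rough illustration of the first check in the list above, the sketch below compares the availability zones of each auto-scaling group with those of its load balancer. The data layout and the `find_zone_mismatches` helper are assumptions made for this example and do not reflect Conformity Monkey's real rule classes.

```python
from typing import Dict, Set


def find_zone_mismatches(groups: Dict[str, Dict[str, Set[str]]]) -> Dict[str, Set[str]]:
    """Map each group name to the zones its ASG uses but its ELB does not cover."""
    violations = {}
    for name, zones in groups.items():
        uncovered = zones["asg_zones"] - zones["elb_zones"]
        if uncovered:
            violations[name] = uncovered
    return violations


# Hypothetical inventory: group name -> zones used by the ASG and by the ELB.
groups = {
    "api": {
        "asg_zones": {"us-east-1a", "us-east-1b"},
        "elb_zones": {"us-east-1a", "us-east-1b"},
    },
    "web": {
        "asg_zones": {"us-east-1a", "us-east-1c"},
        "elb_zones": {"us-east-1a"},
    },
}

violations = find_zone_mismatches(groups)
print(violations)  # {'web': {'us-east-1c'}}
```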
Security Monkey
Security Monkey was originally created as an extension to Conformity Monkey, and it locates potential security vulnerabilities and violations. It has since broken off into a self-contained, standalone, open-source project. The current 1.X version is capable of monitoring many common cloud provider accounts for policy changes and insecure configurations. It also ships with a single-page application web interface.
Inactive/Private Simians
The simians in this group have either been deprecated or were never publicly released.
Chaos Gorilla
AWS Cloud resources are distributed around the world, with a current total of 25 geographic regions. Each region consists of one or more availability zones. Each availability zone acts as a separate, redundant private network, and the zones within a given region communicate with one another via fiber.
The Chaos Gorilla tool simulates the outage of an entire AWS availability zone. It's been successfully used by Netflix to verify that their service load balancers functioned properly and kept services running, even in the event of an availability zone failure.
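Chaos Gorilla itself was never released, but the question it answers can be explored on paper before any real experiment. The sketch below is only a capacity calculation under assumed numbers: the fleet layout, the `min_instances` threshold, and both helper functions are hypothetical.

```python
from typing import Dict, List


def surviving_capacity(instances_by_zone: Dict[str, int], failed_zone: str) -> int:
    """Count the instances left once every instance in `failed_zone` is gone."""
    return sum(count for zone, count in instances_by_zone.items() if zone != failed_zone)


def zones_that_break_service(instances_by_zone: Dict[str, int], min_instances: int) -> List[str]:
    """List the zones whose loss would leave fewer than `min_instances` running."""
    return sorted(
        zone
        for zone in instances_by_zone
        if surviving_capacity(instances_by_zone, zone) < min_instances
    )


# Hypothetical fleet: instance counts per availability zone.
fleet = {"us-east-1a": 4, "us-east-1b": 4, "us-east-1c": 1}

print(zones_that_break_service(fleet, min_instances=6))
# ['us-east-1a', 'us-east-1b']
```

In this made-up fleet, losing either of the two large zones drops capacity below the threshold, which suggests the instances should be spread more evenly.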
Chaos Kong
While rare, it is not unheard of for an AWS region to experience outages. Though Chaos Gorilla simulates availability zone outages, Netflix later created Chaos Kong to simulate region outages. As Netflix discusses in their blog, running frequent Chaos Kong experiments prior to any actual regional outages ensured that their systems were able to successfully evacuate traffic from the failing region into a nominal region, without suffering any severe degradation.
*Netflix Chaos Kong Experiment - Courtesy of Netflix*
Latency Monkey
Latency Monkey causes artificial delays in RESTful client-server communications. While it proved to be a useful tool, Netflix later discovered that this particular Simian could be somewhat difficult to wrangle at times. By simulating network delays and failures, it allowed services to be tested to see how they react when their dependencies slow down or fail to respond, but these actions also occasionally caused unintended effects within other applications.
Netflix never publicly released the Latency Monkey code, and it eventually evolved into their Failure Injection Testing (FIT) service, which we discuss in more detail over here.
Doctor Monkey
Doctor Monkey performs instance health checks and monitors vital metrics like CPU load, memory usage, and so forth. Any instance deemed unhealthy by Doctor Monkey is removed from service.
Doctor Monkey is not open-sourced, but most of its functionality is built into other tools like Spinnaker, which includes a load balancer health checker, so instances that fail certain criteria are terminated and immediately replaced by new ones. Check out the How to Deploy Spinnaker on Kubernetes tutorial to see this in action!
10-18 Monkey
The 10-18 Monkey (aka <span class="code-class-custom">l10n-i18n</span>) detects runtime issues and problematic configurations within instances that are accessible across multiple geographic regions and that serve unique localizations.
Simian Chaos Strategies
The original Chaos Monkey was built to inject failure by terminating EC2 instances. However, this provides a limited simulation scope, so Chaos Strategies were added to the Simian Army toolset. Most of these strategies are disabled by default, but they can be toggled in the <span class="code-class-custom">SimianArmy/src/main/resources/chaos.properties</span> configuration file.
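If you want to see at a glance which strategies are switched on, a small script can scan the properties text. The sketch below assumes each strategy is toggled through a key ending in `.enabled` (for example `simianarmy.chaos.burncpu.enabled`); treat that suffix, the sample values, and both helper functions as assumptions and confirm them against your own copy of the file.

```python
from typing import Dict, List


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple `key = value` lines, skipping blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def enabled_strategies(props: Dict[str, str],
                       prefix: str = "simianarmy.chaos.",
                       suffix: str = ".enabled") -> List[str]:
    """Return the strategy names whose toggle is set to true."""
    names = []
    for key, value in props.items():
        has_name = len(key) > len(prefix) + len(suffix)
        if has_name and key.startswith(prefix) and key.endswith(suffix):
            if value.lower() == "true":
                names.append(key[len(prefix):-len(suffix)])
    return sorted(names)


# Illustrative excerpt only; not copied from the real chaos.properties file.
SAMPLE = """
# Chaos Strategies
simianarmy.chaos.shutdowninstance.enabled = true
simianarmy.chaos.burncpu.enabled = false
simianarmy.chaos.networklatency.enabled = true
"""

print(enabled_strategies(parse_properties(SAMPLE)))
# ['networklatency', 'shutdowninstance']
```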
Instance Shutdown (Simius Mortus)
Shuts down an EC2 instance.
Configuration Key
<span class="code-class-custom">simianarmy.chaos.shutdowninstance</span>
Network Traffic Blocker (Simius Quies)
Blocks network traffic by applying restricted security access to the instance. This strategy only applies to VPC instances.
Configuration Key
<span class="code-class-custom">simianarmy.chaos.blockallnetworktraffic</span>
EBS Volume Detachment (Simius Amputa)
Detaches all EBS volumes from the instance to simulate I/O failure.
Configuration Key
<span class="code-class-custom">simianarmy.chaos.detachvolumes</span>
Burn-CPU (Simius Cogitarius)
Heavily utilizes the instance CPU.
Configuration Key
<span class="code-class-custom">simianarmy.chaos.burncpu</span>
Burn-IO (Simius Occupatus)
Heavily utilizes the instance disk.
Configuration Key
<span class="code-class-custom">simianarmy.chaos.burnio</span>
Fill Disk (Simius Plenus)
Attempts to fill the instance disk.
Configuration Key
<span class="code-class-custom">simianarmy.chaos.filldisk</span>
Kill Processes (Simius Delirius)
Kills all Python and Java processes once every second.
Configuration Key
<span class="code-class-custom">simianarmy.chaos.killprocesses</span>
Null-Route (Simius Desertus)
Severs all instance-to-instance network traffic by null-routing the <span class="code-class-custom">10.0.0.0/8</span> network.
Configuration Key
<span class="code-class-custom">simianarmy.chaos.nullroute</span>
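To get a feel for which peers the null-route strategy would cut off, you can test addresses against the `10.0.0.0/8` block with Python's standard `ipaddress` module. This sketch only performs the membership check; it does not touch any routing table, and the peer addresses are made up.

```python
import ipaddress

# The network that the Null-Route strategy black-holes.
NULL_ROUTED = ipaddress.ip_network("10.0.0.0/8")


def is_cut_off(address: str) -> bool:
    """Return True if traffic to `address` would fall inside the null route."""
    return ipaddress.ip_address(address) in NULL_ROUTED


# Hypothetical peer addresses for illustration.
peers = ["10.1.2.3", "10.255.0.1", "172.16.0.5", "8.8.8.8"]
unreachable = [peer for peer in peers if is_cut_off(peer)]
print(unreachable)  # ['10.1.2.3', '10.255.0.1']
```

Note that instances addressed outside `10.0.0.0/8`, such as those in a `172.16.0.0/12` range, would not be affected by this particular route.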
Fail DNS (Simius Nonomenius)
Prevents all DNS requests by blocking TCP and UDP traffic to port <span class="code-class-custom">53</span>.
Configuration Key
<span class="code-class-custom">simianarmy.chaos.faildns</span>
Fail EC2 API (Simius Noneccius)
Halts all EC2 API communication by adding invalid entries to <span class="code-class-custom">/etc/hosts</span>.
Configuration Key
<span class="code-class-custom">simianarmy.chaos.failec2</span>
Fail S3 API (Simius Amnesius)
Stops all S3 API traffic by placing invalid entries in <span class="code-class-custom">/etc/hosts</span>.
Configuration Key
<span class="code-class-custom">simianarmy.chaos.fails3</span>
Fail DynamoDB API (Simius Nodynamus)
Prevents all DynamoDB API communication by adding invalid entries to <span class="code-class-custom">/etc/hosts</span>.
Configuration Key
<span class="code-class-custom">simianarmy.chaos.faildynamodb</span>
Network Corruption (Simius Politicus)
Corrupts the majority of network packets using a traffic shaping API.
Configuration Key
<span class="code-class-custom">simianarmy.chaos.networkcorruption</span>
Network Latency (Simius Tardus)
Delays all network packets by <span class="code-class-custom">1</span> second, plus or minus half a second, using a traffic shaping API.
Configuration Key
<span class="code-class-custom">simianarmy.chaos.networklatency</span>
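The "1 second, plus or minus half a second" figure means each packet can be held anywhere from 0.5 to 1.5 seconds. The sketch below samples delays in that range so you can reason about client timeouts; it assumes the jitter is uniformly distributed, which the strategy description does not actually specify, and it does not perform any real traffic shaping.

```python
import random

BASE_DELAY_S = 1.0   # nominal delay from the strategy description
JITTER_S = 0.5       # plus or minus half a second


def sample_delay(rng: random.Random) -> float:
    """Draw one simulated packet delay (uniform jitter is an assumption)."""
    return BASE_DELAY_S + rng.uniform(-JITTER_S, JITTER_S)


rng = random.Random(42)  # seeded so repeated runs give the same samples
delays = [sample_delay(rng) for _ in range(1000)]

# Every sampled delay stays within the 0.5 s to 1.5 s window.
print(min(delays) >= 0.5 and max(delays) <= 1.5)  # True
```

A client with a request timeout below 1.5 seconds should therefore expect some requests to time out while this strategy is active.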
Network Loss (Simius Perditus)
Drops a fraction of all network packets by using a traffic shaping API.
Configuration Key
<span class="code-class-custom">simianarmy.chaos.networkloss</span>
Using Simian Army Tools
Prerequisites
Installation
Receiving Email Notifications
Configuration
Executing Experiments
Run the included Gradle Jetty server to build and execute the Simian Army configuration.
After the build completes, you'll see log output from each enabled Simian Army member, including Chaos Monkey 1.X.
Using Chaos Monkey 1.X
This older version of Chaos Monkey uses probability to pseudo-randomly determine when an instance should be terminated. The output above shows that <span class="code-class-custom">0.918...</span> exceeds the required chance of <span class="code-class-custom">1/6</span>, so nothing happened. However, running <span class="code-class-custom">./gradlew jettyRun</span> a few times will eventually result in a success. If necessary, you can also modify the probability settings in the <span class="code-class-custom">chaos.properties</span> file.
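The comparison described above can be sketched in a couple of lines. This is a simplified model rather than the Chaos Monkey 1.X source: it assumes a single random roll per run that must fall below the configured chance, and the `should_terminate` helper is a hypothetical name.

```python
import random


def should_terminate(roll: float, chance: float) -> bool:
    """Terminate only when the random roll falls below the required chance."""
    return roll < chance


CHANCE = 1 / 6

# A roll of 0.918 exceeds 1/6, so nothing happens on this run.
print(should_terminate(0.918, CHANCE))  # False

# Repeating the run draws a fresh roll each time, so one eventually succeeds.
rng = random.Random()
attempts = 1
while not should_terminate(rng.random(), CHANCE):
    attempts += 1
print(f"succeeded on attempt {attempts}")
```

Under this model each run is independent, so with a chance of 1/6 you should expect roughly six runs on average before a termination is selected.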
By default, the <span class="code-class-custom">simianarmy.chaos.leashed = true</span> property in <span class="code-class-custom">chaos.properties</span> prevents Chaos Monkey from terminating instances, as indicated in the above log output. However, changing this property to <span class="code-class-custom">false</span> allows Chaos Monkey to terminate the selected instance.
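The leash behaves like a dry-run switch: the instance is still selected, but the destructive step is skipped. Here is a minimal sketch of that pattern, using a hypothetical `run_termination` function and a plain list in place of a real cloud API call.

```python
from typing import List


def run_termination(instance_id: str, leashed: bool, terminated: List[str]) -> str:
    """Record a termination only when the leash is off; otherwise just report."""
    if leashed:
        return f"leashed: would have terminated {instance_id}"
    terminated.append(instance_id)  # stand-in for the real termination call
    return f"terminated {instance_id}"


terminated: List[str] = []
dry_run_message = run_termination("i-0abc", leashed=True, terminated=terminated)
live_message = run_termination("i-0abc", leashed=False, terminated=terminated)

print(dry_run_message)  # leashed: would have terminated i-0abc
print(live_message)     # terminated i-0abc
```

Running leashed first is a low-risk way to confirm that instance selection works as expected before allowing real terminations.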
Next Steps
Now that you've learned about the Simian Army, check out our Developer Tutorial to find out how to install and use the newer Chaos Monkey toolset. You can also learn about the many alternatives to Chaos Monkey, in which we shed light on tools and services designed to bring intelligent failure injection and powerful Chaos Engineering practices to your fingertips.