Azure
Search Chaos Monkey
Inspired by Chaos Monkey, the Azure Search team built its own tool, which it calls Search Chaos Monkey. The tool is first used to attack a test environment that contains a randomly and continuously changing search service. Running experiments there lets the team catch bugs before they reach production.
Once it is running in production, Search Chaos Monkey's destructive power is limited by a configurable chaos level:
- Low chaos failures are ones the system recovers from gracefully, with little to no interruption in service. Any alert raised at this level is treated as a bug.
- Medium chaos failures are also recovered from gracefully, but they may degrade service performance or stability. Low-priority alerts are sent to the engineers on call.
- High chaos failures definitively interrupt service and trigger high-priority alerts for the engineers on call.
These levels offer some control over experiments, but little granularity. The Azure Search team also assigns an extreme chaos level to any failure that incurs data loss, causes ungraceful degradation, or occurs silently. To maintain experimental control, Search Chaos Monkey is not permitted to induce extreme failures on a continuous basis.
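The four levels can be read as a small decision rule. The sketch below is only an illustration of the descriptions above, not the Azure Search team's implementation: the names `ChaosLevel`, `classify_failure`, `ALERT_HANDLING`, and `may_run_continuously`, and the boolean flags used to describe a failure, are all hypothetical.

```python
from enum import Enum


class ChaosLevel(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    EXTREME = 4


# How an alert raised at each level is treated, following the list above.
# The article does not describe alert handling for EXTREME, so it is omitted.
ALERT_HANDLING = {
    ChaosLevel.LOW: "treat as a bug",
    ChaosLevel.MEDIUM: "low-priority alert to on-call engineers",
    ChaosLevel.HIGH: "high-priority alert to on-call engineers",
}


def classify_failure(*, data_loss=False, silent=False,
                     ungraceful_degradation=False,
                     interrupts_service=False, degrades_service=False):
    """Map a failure's observed effects to a chaos level (hypothetical flags)."""
    if data_loss or silent or ungraceful_degradation:
        return ChaosLevel.EXTREME
    if interrupts_service:
        return ChaosLevel.HIGH
    if degrades_service:
        return ChaosLevel.MEDIUM
    return ChaosLevel.LOW


def may_run_continuously(level):
    """Extreme failures are not induced on a continuous basis."""
    return level is not ChaosLevel.EXTREME


if __name__ == "__main__":
    level = classify_failure(degrades_service=True)
    print(level.name, "->", ALERT_HANDLING[level])
```

The checks run from most to least severe, so a failure that both degrades service and loses data is classified by its worst effect.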
Causing Chaos on Azure with Gremlin
Performing Chaos Experiments on your Azure applications is simple, safe, and secure using Gremlin. Azure's distributed computing architecture all but requires proper failure injection testing with tools like Gremlin, which can strain resources, disrupt network traffic, and terminate instances.
Check out the official documentation or look through our in-depth community site for more information.
WazMonkey
WazMonkey is an open-source tool that selects a random Azure role instance and reboots it. Written in C# and run from the command line, WazMonkey is simple and straightforward to use.
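The core idea, picking one instance at random and rebooting it, fits in a few lines. The following Python sketch is not WazMonkey's code (which is C#) and does not call any Azure API. The function `reboot_random_instance`, the `reboot` callback, and the instance names are hypothetical placeholders for whatever management call a real tool would make.

```python
import random


def reboot_random_instance(instances, reboot, rng=None):
    """Pick one role instance at random and pass it to `reboot`.

    `reboot` is a caller-supplied callable standing in for a real
    management API call (an assumption for this sketch).
    """
    instances = list(instances)
    if not instances:
        raise ValueError("no role instances to choose from")
    rng = rng or random.Random()
    victim = rng.choice(instances)
    reboot(victim)
    return victim


if __name__ == "__main__":
    # Hypothetical instance names; the callback only prints here.
    names = ["WebRole_IN_0", "WebRole_IN_1", "WorkerRole_IN_0"]
    reboot_random_instance(names, lambda name: print("rebooting", name))
```

Passing the reboot operation in as a callable keeps the selection logic testable without touching a live deployment.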
Fault Analysis Service
Azure's Fault Analysis Service injects failures and runs test scenarios against applications built on Microsoft Azure Service Fabric. It executes actions, each of which is an individual fault that targets the system. Developers can combine multiple actions to perform complex tasks and Chaos Experiments, such as:
- Restarting nodes
- Simulating load balancing or application upgrades
- Inducing data or memory loss
- Removing or restarting a replica
Developers can induce controlled Chaos to simulate both graceful and ungraceful faults within Service Fabric clusters. An ungraceful fault is anything that terminates a process, such as restarting a node or application.
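To make the idea of combining actions concrete, here is a toy model in Python. It does not use the Fault Analysis Service or any Service Fabric API: the cluster is a plain dictionary, and `Action`, `restart_node`, `move_replica`, and `run_scenario` are hypothetical names. Treating a replica move as a graceful fault is an assumption of this sketch; the node restart is marked ungraceful as described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Action:
    """One individual fault applied to a toy cluster (hypothetical model)."""
    name: str
    graceful: bool
    apply: Callable[[Dict[str, str]], None]


def restart_node(node: str) -> Action:
    def apply(cluster: Dict[str, str]) -> None:
        cluster[node] = "restarting"
    # Restarting a node terminates processes, so it is ungraceful.
    return Action(f"restart-node:{node}", graceful=False, apply=apply)


def move_replica(node: str) -> Action:
    def apply(cluster: Dict[str, str]) -> None:
        cluster[node] = "replica-moved"
    # Assumption: a replica move is modeled as a graceful fault.
    return Action(f"move-replica:{node}", graceful=True, apply=apply)


def run_scenario(cluster: Dict[str, str], actions: List[Action]) -> List[str]:
    """Apply each action in order and return a log of what ran."""
    log = []
    for action in actions:
        action.apply(cluster)
        kind = "graceful" if action.graceful else "ungraceful"
        log.append(f"{action.name} ({kind})")
    return log


if __name__ == "__main__":
    cluster = {"node-1": "up", "node-2": "up"}
    for line in run_scenario(cluster, [move_replica("node-1"),
                                       restart_node("node-2")]):
        print(line)
```

Each action is a small, independent unit, so a scenario is just an ordered list of them.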
During execution, the Fault Analysis Service frequently takes snapshots of the current "run state" and records them as named event types. Event types such as ExecutingFaultsEvent and ValidationFailedEvent can then be retrieved via the API.
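Once events have been retrieved, a common next step is to group them by type and look at validation failures first. The sketch below assumes the events are already available as a list of dictionaries. The field names (`kind`, `timestamp`, `detail`) and the sample values are invented for illustration and are not the service's actual payload format; only the two event type names come from the text above.

```python
from collections import defaultdict


def group_events_by_kind(events):
    """Group event records by their `kind` field, preserving order."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["kind"]].append(event)
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical sample data, not real service output.
    sample = [
        {"kind": "ExecutingFaultsEvent", "timestamp": 1, "detail": "restart node-2"},
        {"kind": "ValidationFailedEvent", "timestamp": 2, "detail": "service unhealthy"},
        {"kind": "ExecutingFaultsEvent", "timestamp": 3, "detail": "move replica"},
    ]
    grouped = group_events_by_kind(sample)
    for event in grouped.get("ValidationFailedEvent", []):
        print(event["timestamp"], event["detail"])
```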