Chaos Engineering on Docker Swarm with Gremlin and Datadog

Ana M Medina
Sr. Chaos Engineer
Last Updated:
November 30, 2020

Introduction

Gremlin is a simple, safe and secure service for performing Chaos Engineering experiments through a SaaS-based platform. We will be deploying a web application using Docker Swarm and using Datadog to monitor it.

This tutorial will teach you how to replicate the demo environment I used for my DockerCon 2019 talk as part of the Black Belt Track. Feel free to watch it on YouTube.

Prerequisites

Before you begin this tutorial, you’ll need the following:

Overview

This tutorial will show you how to use Gremlin, Datadog and Docker Swarm

  • Step 1: Setup Docker
  • Step 2: Setup the demo application
  • Step 3: Setup Docker Swarm Visualizer
  • Step 4: Install Datadog
  • Step 5: Setup Datadog tags
  • Step 6: Setup Datadog Monitors
  • Step 7: Install Gremlin  
  • Step 8: Experiment #1: Test recoverability using a shutdown attack on the visualizer container
  • Step 9: Experiment #2: Validate monitoring by running a CPU attack on all hosts
  • Step 10: Experiment #3: Test disaster recovery using a host shutdown attack

Step 1: Setup Docker

First, install Docker onto all three of your hosts. Start by SSHing into the first host:

BASH

ssh username@your_server_ip

Next, add the official Docker GPG key:

BASH

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

Use the following command to add the Docker repository for stable releases:

BASH

sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

Update the apt package index:

BASH

sudo apt-get update

Before installing Docker, check to make sure you are installing from the Docker repository instead of the Ubuntu 18.04 repositories:

BASH

apt-cache policy docker-ce

BASH

docker-engine:
  Installed: (none)
  Candidate: 5:19.01.13-3\~xenial
  Version table:
     5:19.01.13-3\~xenial 500
        500 https://apt.dockerproject.org/repo ubuntu-xenial/main amd64 Packages

Install the latest version of Docker CE:

BASH

sudo apt-get install docker-ce

Verify that Docker is installed and running:

BASH

sudo systemctl status docker

BASH

● docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-11-02 11:06:14 EST; 3 weeks 1 days ago

Repeat this process on your other two hosts.

Next, we'll create a cluster. We'll be using Docker Swarm, an orchestration tool for clustering Docker nodes and scheduling containers. Docker Swarm comes installed with Docker automatically, we just need to enable it. First, we need to choose a host to run as the cluster leader, then specify the IP address we want to use to advertise the cluster (which Docker calls a “swarm”). Replace <span class="code-class-custom"><host ip></span> with the host's publicly accessible IP address:

BASH

docker swarm init --advertise-addr 

This displays a command that you can use to join other hosts to the swarm. Run this command on your other hosts. It should look like this:

BASH

docker swarm join --token SWMTKN-1-4d1n50au6ez00ob3b2c8bricdbn0k4celh4va0x71e65msheae-cflxrtwytcyjfp0eu6nynb4nv 138.197.212.101:2377

After running this command, let's go back to our leader and verify that the other two hosts joined by running:

BASH

docker node ls

We'll see a list of all three nodes and their status:

BASH

ID                           HOSTNAME        STATUS  AVAILABILITY  MANAGER STATUS

f2d5osdf5kllr39clj1nfeafe    swarm-worker2   Ready   Active
a7tfima3738d0e979a8yr5189    swarm-worker1   Ready   Active
is0ab8snrkkkaob9phls129nf *  swarm-manager1  Ready   Active        Leader

Step 2: Deploy the demo application

We'll be deploying a sample web application that lets visitors submit votes. This application has a web page with two clickable options (in this example, cats vs. dogs), and another page for viewing the vote results. You can learn more about this application on GitHub.

SSH into the host you're using as the swarm leader, then clone the application's GitHub repository:

BASH

git clone https://github.com/dockersamples/example-voting-app.git

Change to the new directory:

BASH

cd example-voting-app/

Deploy the application. This creates a new stack in the swarm named “vote”:

BASH

docker stack deploy --compose-file docker-stack.yml vote

In a web browser, enter the address of the leader to view the demo application. The voting page broadcasts on port 5000, and the results page broadcasts on port 5001.

In this example, we'll access the voting app via <span class="code-class-custom">http://142.93.82.142:5000</span>. It should look like this:

Docker example voting app vote page.

We'll access the results page via <span class="code-class-custom">http://142.93.82.142:5001 </span>. It should look like this:

Docker example voting app results page.

Step 3: Setup Docker Swarm Visualizer

Next, we'll deploy an application called Docker Swarm Visualizer to visualize our Docker Swarm cluster.

On your swarm leader, deploy the visualizer by running this command:

BASH

docker service create \
  --name=viz \
  --publish=9020:8080/tcp \
  --constraint=node.role==manager \
  --mount=type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock \
  dockersamples/visualizer

We will access this visualizer app via port 9020. If we open <span class="code-class-custom">http://142.93.82.142:9020</span> in a browser, we'll see the following screen:

Docker Swarm Visualizer dashboard.

Step 4: Install Datadog

Before we run any chaos experiments, we'll want to monitor our cluster. To do that, we'll set up Datadog (sign up here).

After creating an account, open “Integrations” on the left navigation bar and select “Agent”.

Datadog new integration.

Select Docker from the options, then follow the instructions under “Use our easy one-step install.”

Datadog Docker installation instructions.

Install the Datadog agent on each host (make sure to replace <span class="code-class-custom">DD_API_KEY</span> with your own Datadog API key):

BASH

DOCKER_CONTENT_TRUST=1 docker run -d --name dd-agent -v /var/run/docker.sock:/var/run/docker.sock:ro -v /proc/:/host/proc/:ro -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro -e DD_API_KEY=aaaabbbbccccddddeeeeffffgggghhhh datadog/agent:latest

Step 5: Setup Datadog tags

Next, we'll set up tags for each of our Docker hosts. This will make it easier to identify the hosts in Datadog. In the Datadog UI, use the left navigation bar to navigate to “Infrastructure” and select “Infrastructure List.”

Datadog infrastructure list.

We want to select the hosts we’ve installed Datadog on and give them a special tag. We are going to use the search function to find them faster. We will be adding the tag <span class="code-class-custom">project:docker_swarm_demo</span>. This way, we can identify all three hosts using a single tag.

Tagging hosts in Datadog.

Step 6: Setup Datadog Monitors

Now that we have Datadog installed on our hosts with tags, we want to create a monitor. Monitors let us track events in Datadog (like changes in host metrics or logs) and fire when certain conditions are met. We'll use these to generate alerts and send notifications.

In the left navigation bar, select “Monitors” and choose “New Monitor”.

Creating a new monitor in Datadog.

One of our attacks is testing CPU consumption, so we want to monitor CPU usage. Specifically, we want to make sure our monitor captures the combination of user and system CPU usage. Select “metric” from the given options, then add the <span class="code-class-custom">system.cpu.user</span> and <span class="code-class-custom">system.cpu.system</span> metrics together as shown in this screenshot.

Configuring a CPU Monitor in Datadog.

We are going to make the warning threshold <span class="code-class-custom">65</span> and the alert threshold <span class="code-class-custom">90</span>. This means that when the average CPU usage exceeds 65% during the last minute, the monitor will fire and we’ll get a warning notification.

Under “Say what’s happening”, we’ll customize the notification. I've made the subject of the email to be “Chaos! The CPU is really high on {{host.name}} {{host.ip}}.” Then, on the body of the notification, I've added some extra wording and entered my own email address so that the notification goes directly to me and provides some details on what happened.

Step 7: Install Gremlin

Now that we have monitoring and alerting configured, let's install Gremlin and start running chaos experiments on our swarm. First, SSH into your swarm leader:

BASH

ssh username@your_server_ip

Next, follow the instructions in our documentation to install Gremlin onto a virtual machine. Since we're running Ubuntu 18.04, follow the instructions for installing a Deb package. You can check if Gremlin authenticated successfully by clicking “Agents” in the navigation bar and making sure your host is marked as “Online”.

Step 8: Experiment #1: Test recoverability using a shutdown attack on the visualizer container

Now let's run our first chaos experiment! For this experiment, we want to validate that Docker Swarm can automatically recover container failures. Our hypothesis is: when we shutdown our visualizer application, Docker Swarm automatically restarts the container, and our monitoring tool, Datadog, shows this change.”

Let's go back to the Gremlin web app. Select “Attacks” from the navigation bar, then press the green “New Attack” button. Select “Containers,” then select the visualizer container. Gremlin auto-populates Docker tags such as the namespace and container name, so we can search for those using the search box as shown below.

Targeting a container via search in the Gremlin web app.

Our visualizer container is named "viz", so let's check the checkbox next to it:

Selecting a target container in the Gremlin web app.

We will now choose the attack we want to run, which in this case, is a shutdown attack. Select “State” and choose “Shutdown” from the options. We will set the delay to <span class="code-class-custom">0</span> so the attack runs immediately, then uncheck “Reboot”to prevent the container from automatically restarting. Now let's press the green “Unleash Gremlin” button to run the attack.

Configuring a shutdown attack in the Gremlin web app.

Experiment #1 Results

Our hypothesis was, “when we shutdown our visualizer container, Docker automatically restarts it, and our monitoring tool, Datadog, shows this change.” If we configured everything properly, Datadog will show that the container has shut down and Docker has brought it back up.

However, if we navigate to access the visualizer app via port 9020, we see that the app does not load as expected.

Broken visualizer application page.

This might just mean that the container hasn't had enough time to fully restart. Let's open the Datadog UI and use the left navigation bar to navigate to “Infrastructure” and select “Containers”.

Accessing the container list in Datadog.

We can then search our containers by using the tag we had created in step 3: <span class="code-class-custom">project:docker_swarm_demo</span>. After sorting by “Start” you can see that our viz container restarted and the status is that it’s back and running.

Viewing the container list in Datadog.

When we navigate to access the visualizer app via port 9020, we see that the app is loading like we saw it before.

Docker Swarm Visualizer dashboard.

While there was a small delay, this confirms our hypothesis that Docker Swarm automatically restarts stopped apps and Datadog shows the change.

Step 9: Experiment #2: Validate monitoring by running a CPU attack on all hosts

Our second chaos experiment will help us validate that we have configured our monitoring properly. Our hypothesis is: “when we consume CPU resources, our monitoring tool, Datadog, will show this increase and alert us if it passes our threshold.” Going back to the Gremlin web app, select “Attacks” from the menu on the left and press “New Attack”. Instead of selecting an individual container, we'll select all three hosts this time.

Targeting hosts in the Gremlin web app.

We will now choose the attack we want to run, which is a CPU attack. Select “Resource” and choose “CPU” from the options. Set the length to <span class="code-class-custom">180</span> seconds, select <span class="code-class-custom">All Cores</span> from the dropdown, then set “CPU Capacity” to <span class="code-class-custom">100</span>. Click the “Unleash Gremlin” button to run the attack.

Configuring a CPU attack in the Gremlin web app.

Experiment #2 results

Our hypothesis was: “when we consume CPU resources, our monitoring tool, Datadog, will show this increase and alert us if it passes our threshold.” If we configured everything properly, Datadog will show that the hosts have utilized a significant amount of CPU.

When we go back to the Datadog UI and use the left navigation bar to navigate to “Infrastructure” and select  “Host Map” and filter by the <span class="code-class-custom">project:docker_swarm_demo</span> tag we created, we see that the CPU resources on these hosts jumped up to 92%.

Viewing our Docker Swarm host map in Datadog.

This experiment also validated that we set up Datadog Monitoring properly, and that we're receiving email alerts as expected.

Validating monitors and alerts in Datadog.

Step 10: Experiment #3: Test disaster recovery using a shutdown attack

For our last experiment, we'll be thinking about what happens when our main application goes down. In this case, it will be the voting application. Our hypothesis is, “when we shut down all of our containers for the vote application, we will only suffer a few seconds of downtime as Docker Swarm provisions and deploys new containers.”

Going back to the Gremlin web app, select “Attacks” from the menu on the left and press “New Attack”. Select “Containers,” then using the search bar, type “vote” and select the vote namespace as shown below:

Searching for the vote app in the Gremlin web app.

We'll be running another shutdown attack. Select “State” and choose “Shutdown” from the options. Set the delay to 0 and uncheck “Reboot” to prevent the containers from automatically restarting. Press “Unleash Gremlin” to run the attack.

Experiment #3 results

Our hypothesis was, “when we shut down all of our containers for the vote application, we will only suffer a few seconds of downtime as Docker Swarm provisions and deploys new containers.” Let’s verify that first by opening the voting application and the voting results application in a browser.

We can see that our voting results application is running with no downtime, and no data loss.

The voting app results page.

If we check the container’s monitoring using Datadog and sort by start time, we see that all of our containers are up and running after the restart. This means that Docker Swarm was able to quickly restart the containers, which is exactly what we wanted.

List of running containers in Datadog.

Running exercises like these is great practice for making sure your applications can withstand failure, and that your monitoring is set up correctly.

Conclusion

Congrats! You’ve now learned how to:

  • Set up a web application on three hosts using Docker Swarm,
  • Monitor containers using Datadog, and
  • Use Gremlin to test the resilience of your application.

As a next step, setup PagerDuty Alerts for this application and verify they are working by using the same Chaos Engineering practices you just learned. If you have any questions at all or are wondering what else you can do with this demo environment, feel free to reach out on the Chaos Community Slack(join here)!

No items found.
Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30 day trial.
start your trial

Avoid downtime. Use Gremlin to turn failure into resilience.

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.

Product Hero ImageShape