How to use Gremlin with Amazon RDS
Amazon RDS is a managed relational database service that lets you easily deploy, scale, and replicate databases. You can create an instance of Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, or SQL Server and let AWS manage the infrastructure for you, including automatic updates, automatic recovery from host failures, and more.
Amazon RDS has many resiliency features built in, but these don’t account for the other types of failures that can impact our RDS applications: for example, losing network connectivity between our applications and RDS. Application misconfigurations, unexpected restarts, or availability zone (AZ) failures can happen at any time, and when they do, we want to know that our application can recover gracefully without disrupting the user experience.
Using Chaos Engineering, we can proactively test these different conditions, see how our application responds, then use these insights to make both our application and RDS deployment more resilient. This tutorial will show you how to use Gremlin to run chaos experiments that will test the resiliency of your applications when using Amazon RDS.
Prerequisites
Before starting this tutorial, you’ll need:
- A Gremlin account (request a free trial).
- An AWS account.
- A host with the Gremlin agent installed, such as an Amazon EC2 instance. See Gremlin's installation instructions for details on installing the agent.
Overview
This tutorial will show you how to:
- Step 1: Create a MariaDB instance in Amazon RDS
- Step 2: Deploy the MariaDB TODO demo application
- Step 3: Run a blackhole attack
Step 1: Create a MariaDB instance in Amazon RDS
First, let’s create a database instance in RDS. For this tutorial, we’ll use MariaDB. Log into the AWS Management Console and select Amazon RDS from the list of services. Scroll down to the Create database section and select Create database. Choose your preferred creation method (I used Standard for this example), then choose MariaDB for the engine. Since this is just a demo instance, I recommend choosing Free tier (db.t2.micro is more than enough for this example) or Dev/Test.
Give the instance a name of your choice. We’ll name ours <span class="code-class-custom">mariadb-gremlin-demo</span>. Add credentials for the admin account (or check the box to let RDS automatically generate a password) and make sure to copy these credentials as we’ll need to use them in the next step. Finish configuring the instance however you’d like, then click Create database. The new database will be provisioned in a few minutes.
Once the instance is created, we’ll need to populate it with a database and table. We’ll use the <span class="code-class-custom">mysql</span> command line tool to connect to the database, although any MariaDB/MySQL client will work.
Open a connection to the database using the following command, making sure to replace <span class="code-class-custom">YOUR_HOST</span> with your database server endpoint, <span class="code-class-custom">YOUR_PORT</span> with the port number, and <span class="code-class-custom">YOUR_USER</span> with your MariaDB username. You’ll be prompted to enter your password:
From here, we’ll run two scripts: the first creates a database named <span class="code-class-custom">todo</span>, and the second creates a table named <span class="code-class-custom">tasks</span>:
Now we’re ready to deploy our app and connect it to our database! Enter <span class="code-class-custom">quit</span> to close the client.
Step 2: Deploy the MariaDB TODO demo application
Next, we’ll deploy an application to our host and connect it to the database. This application provides a web page that lets users add items to a TODO list and persists these items to our MariaDB database. The application has two services: a client service that runs the website and frontend, and an API service that connects to the database and processes requests from the client. See the GitHub repository for more details.
First, we’ll open a terminal on the host where we have the Gremlin agent installed, then clone the source code from the GitHub repository provided by MariaDB:
Next, we’ll build the application. This requires npm, which is installed along with Node.js. Once npm is available, we’ll run the following commands to navigate to the <span class="code-class-custom">client</span> service folder and install its dependencies:
Before we actually run the client service, we need to run the API service, which acts as a middle layer between the client and MariaDB. The project provides several examples in different languages, but we’ll deploy the Node.js version. Open a second terminal window and run the following commands:
When running the API, we’ll provide our database connection details as environment variables. In the command, replace these placeholders:
- <span class="code-class-custom">YOUR_HOST</span>: Your MariaDB server’s hostname.
- <span class="code-class-custom">YOUR_PORT</span>: Your MariaDB server’s port number.
- <span class="code-class-custom">YOUR_USER</span>: The username you want to use to log in to MariaDB.
- <span class="code-class-custom">YOUR_PASS</span>: The password used to log in to MariaDB.
Note: If the database you created in step 1 has a name other than <span class="code-class-custom">todo</span>, replace <span class="code-class-custom">DB_NAME=todo</span> with <span class="code-class-custom">DB_NAME=<your database name></span>.
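If you want the API to fail fast when a connection detail is missing, a small validation step can help. The following is a minimal Node.js sketch, not code from the demo application. Only <span class="code-class-custom">DB_NAME</span> appears in this tutorial; the other variable names (<span class="code-class-custom">DB_HOST</span>, <span class="code-class-custom">DB_PORT</span>, <span class="code-class-custom">DB_USER</span>, <span class="code-class-custom">DB_PASS</span>) and the <span class="code-class-custom">readDbConfig</span> helper are assumptions for illustration, so match them to whatever the API service actually reads.

```javascript
// Sketch: collect MariaDB connection settings from environment variables.
// DB_NAME comes from the tutorial; DB_HOST, DB_PORT, DB_USER, and DB_PASS
// are assumed names. Adjust them to match the API service you deploy.
function readDbConfig(env = process.env) {
  const required = ['DB_HOST', 'DB_USER', 'DB_PASS'];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }

  // 3306 is the default MariaDB/MySQL port.
  const port = Number(env.DB_PORT || 3306);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid DB_PORT: ${env.DB_PORT}`);
  }

  return {
    host: env.DB_HOST,
    port,
    user: env.DB_USER,
    password: env.DB_PASS,
    database: env.DB_NAME || 'todo', // falls back to the tutorial's database name
  };
}
```

Calling <span class="code-class-custom">readDbConfig()</span> at startup surfaces a misconfiguration as a clear error message instead of a failed connection later on.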
Once the API is up and running, switch back to your client terminal and start the client:
Now, open your server’s address on port 3000 in a browser to see the TODO application. Try adding tasks and refreshing the page. If the data persists, then the database connection is working!
Now that we’ve set up our application, let’s run a chaos experiment!
Step 3: Run a blackhole attack
In this experiment, we’ll simulate a full-scale outage between our instance and RDS. We’ll do this by running a blackhole attack, which drops all network traffic. Since this instance communicates with multiple services, we’ll limit the scope of this attack (the blast radius) to only affect traffic going to and from the database server.
First, sign in to Gremlin at app.gremlin.com. In the left-hand sidebar, select Attacks, then New Attack. On the attacks screen, select the Infrastructure tab, then scroll down and select the host where the TODO application is running. Here, our host is named <span class="code-class-custom">rds-demo</span>.
Under Choose a Gremlin, select Network, then select Blackhole. In the Hostnames text box, add the hostname for the database server. You can optionally add <span class="code-class-custom">3306</span> to the Remote Ports field to only impact traffic on that port, but adding just the hostname is fine for this example.
Next, click Unleash Gremlin to start the attack. Once the attack is running, open your todo app in your browser. What happens when you try refreshing the page, or adding an item? As it turns out, the page loads just fine, but we don’t see any items.
For users of our site, this would be confusing. It appears as if all of their data just disappeared. If we take a look at our terminal output for the API server, we can see what happened. The API server sent an asynchronous (<span class="code-class-custom">async</span>) request to MariaDB and continued rendering the webpage in the meantime. The database is unavailable due to the blackhole attack, so the request eventually times out and returns an error. But at this point, the page has already been displayed to the user, so we get no visual indication of a problem.
To avoid this issue, we could add some code to our <span class="code-class-custom">.catch()</span> block that shows a user-friendly error message or popup. We could also add a loading indicator on the client side to show when a database request is being made. From an operations perspective, we should consider adding redundancy to our RDS instance and implementing load balancing to reduce the risk of complete outages like this in the first place.
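As a rough illustration of the first suggestion, here is a minimal sketch of a request wrapper that enforces a timeout and converts low-level failures into a message the client can display. It is not the demo application's code: <span class="code-class-custom">withTimeout</span>, <span class="code-class-custom">toUserMessage</span>, <span class="code-class-custom">loadTasks</span>, and the <span class="code-class-custom">fetchTasks</span> argument are hypothetical names, and the error codes listed are standard Node.js network codes, which may differ from the codes your database driver reports.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      const err = new Error(`Timed out after ${ms} ms`);
      err.code = 'ETIMEDOUT';
      reject(err);
    }, ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Translate low-level errors into a message that is safe to show to users.
// The codes below are Node.js network error codes (an assumption; your
// database driver may use its own).
function toUserMessage(err) {
  const networkCodes = ['ETIMEDOUT', 'ECONNREFUSED', 'ECONNRESET', 'ENOTFOUND'];
  if (err && networkCodes.includes(err.code)) {
    return 'We can\'t load your tasks right now. Please try again in a moment.';
  }
  return 'Something went wrong while loading your tasks.';
}

// fetchTasks is a hypothetical function that returns a promise of task rows.
// The result always has the same shape, so the client can render either the
// task list or the error message.
function loadTasks(fetchTasks, timeoutMs = 2000) {
  return withTimeout(Promise.resolve().then(fetchTasks), timeoutMs)
    .then((tasks) => ({ tasks, error: null }))
    .catch((err) => ({ tasks: [], error: toUserMessage(err) }));
}
```

With a wrapper like this, a blackholed database produces an explicit error after the timeout instead of an empty list, so the client can show a message rather than making it look as if the data has vanished.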
Conclusion
Running chaos experiments like this blackhole experiment can reveal unexpected behaviors in systems and the services they depend on. We should consider other conditions that might impact our application's behavior. For example, we saw what happened when we lost connection to the database, but what happens if a network misconfiguration adds an extra 100ms of latency to database traffic? What if it introduces packet loss or corruption? What if we lose connection to our DNS server and can no longer resolve our database's hostname? We should try these experiments with our database and other dependencies.
Now that you have an environment with an Amazon RDS instance and the Gremlin agent installed, try running these different experiments and record your observations. If you want to run more advanced experiments, check out our library of Recommended Scenarios.