Getting started with IO attacks
Storage devices remain one of the most significant bottlenecks in modern systems. CPU and RAM speeds keep climbing year over year, and although solid state (SSD) and NVMe drives have brought large improvements in IO performance, moving data to and from persistent storage is still orders of magnitude slower than moving it to and from memory. In scalable cloud applications, this slowness can have a major impact on performance, latency, and the user experience. To replicate this effect ourselves, we can use the IO attack.
In this blog, we take an in-depth look at the IO attack. We’ll explain how it works, present technical use cases, and show you how to tie these back to business objectives. By reading this blog, you’ll learn why IO attacks are useful and how they can help you deliver value to your organization.
Why should you run IO attacks?
Persistent storage is a big challenge in modern, distributed applications. Not only do our systems need to process and retrieve ever-increasing amounts of data, they need to do so quickly. Of course, storage bandwidth isn’t infinite: the more data we move around, the more saturated our devices become, and the longer it takes to perform storage-related actions. If we don’t account for this demand beforehand, this saturation can reduce performance and potentially cause system instability. Using an IO attack, we can simulate heavy IO operations and monitor our applications, services, and systems to understand how they handle the added stress. This way, we can mitigate potential problems before they happen in production and affect customers.
With IO attacks, we can validate that:
- Moving from a high-throughput storage device to a low-throughput device (e.g. moving from an SSD to network attached storage) won’t significantly reduce application performance.
- Our applications and systems remain responsive during disk-heavy workloads.
- Caches like Redis are working as expected.
This lets us:
- Prepare for high-traffic events, where increased user activity puts additional stress on storage devices.
- Prepare to launch high-bandwidth services, such as media streaming, file storage, or content delivery, by simulating demand in advance.
- Optimize and improve the user experience by uncovering and addressing performance gaps in our storage solution.
How does an IO attack work?
An IO attack continuously reads and/or writes data to a directory on a filesystem. This means it can be used with any storage solution that can be mounted to a filesystem, including HDDs, SSDs, NVMe drives, and network attached storage (NAS). You can configure these parameters:
- <span class="code-class-custom">Directory</span>: the root directory where the attack will be executed.
- <span class="code-class-custom">Mode</span>: whether to read, write, or read and write to the disk.
- <span class="code-class-custom">Workers</span>: the number of concurrent workers reading/writing to the disk.
- <span class="code-class-custom">Block Size</span>: the number of kilobytes (KB) that are read/written at a time.
- <span class="code-class-custom">Block Count</span>: the number of blocks read/written at a time.
- <span class="code-class-custom">Volume Percentage</span>: the percentage of the target volume to fill.
These attributes are called the magnitude of the attack. As with all Gremlin attacks, you can run an IO attack on multiple systems simultaneously. The set of systems you target is called the blast radius. Note that the Workers, Mode, Block Size, and Block Count options are only visible after clicking “Show Advanced Options” under Volume Percentage. Increasing the Workers, Block Size, or Block Count increases the amount of data written or read by the attack, which in turn increases the amount of storage bandwidth utilized.
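To make the relationship between these parameters concrete, here is a minimal Python sketch of a workload that reads and writes blocks in a directory. It is not Gremlin's implementation. The function names (`worker_pass`, `run_io_pass`) and the mode strings (`"r"`, `"w"`, `"rw"`) are assumptions made for illustration. It only shows how Workers, Block Size, and Block Count multiply into the amount of data moved in one pass.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def worker_pass(directory: str, mode: str, block_size_kb: int, block_count: int) -> int:
    """Move block_count blocks through a scratch file; return the bytes counted."""
    block_bytes = block_size_kb * 1024
    block = os.urandom(block_bytes)
    moved = 0
    # The scratch file is deleted when the context manager exits,
    # so nothing is left behind in the target directory.
    with tempfile.NamedTemporaryFile(dir=directory) as scratch:
        # Data has to exist before it can be read, so the file is always
        # populated; the writes are only counted when the mode includes "w".
        for _ in range(block_count):
            scratch.write(block)
        scratch.flush()
        os.fsync(scratch.fileno())  # ask the OS to push the data to the device
        if "w" in mode:
            moved += block_bytes * block_count
        if "r" in mode:
            scratch.seek(0)
            # Note: these reads may be served from the page cache.
            while chunk := scratch.read(block_bytes):
                moved += len(chunk)
    return moved


def run_io_pass(directory: str, mode: str, workers: int,
                block_size_kb: int, block_count: int) -> int:
    """Run one pass with the given number of concurrent workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(worker_pass, directory, mode, block_size_kb, block_count)
            for _ in range(workers)
        ]
        return sum(f.result() for f in futures)
```

In this sketch, one pass in `"rw"` mode moves `2 * workers * block_size_kb * 1024 * block_count` bytes, which is why raising any of the three values raises the bandwidth consumed.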
Unlike the disk attack, the IO attack won’t consume disk space or leave files on your storage device. At no point will Gremlin access or modify your data. During the attack, we recommend using an observability tool or command-line tool like iostat to monitor system performance.
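If you would rather script the monitoring than watch iostat, one option on Linux is to sample the kernel's `/proc/diskstats` counters yourself. The sketch below is an illustration under that assumption, not part of Gremlin. The helper names are hypothetical. It relies on the documented layout of `/proc/diskstats`, where the third field is the device name and the sixth and tenth fields are sectors read and written, counted in 512-byte units.

```python
SECTOR_BYTES = 512  # /proc/diskstats always counts 512-byte sectors


def parse_diskstats(text: str) -> dict[str, tuple[int, int]]:
    """Map device name -> (bytes_read, bytes_written) from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 10:
            continue  # skip blank or truncated lines
        name = parts[2]
        sectors_read = int(parts[5])
        sectors_written = int(parts[9])
        stats[name] = (sectors_read * SECTOR_BYTES, sectors_written * SECTOR_BYTES)
    return stats


def throughput(before: tuple[int, int], after: tuple[int, int],
               seconds: float) -> tuple[float, float]:
    """Return (read_bytes_per_sec, write_bytes_per_sec) between two samples."""
    return (
        (after[0] - before[0]) / seconds,
        (after[1] - before[1]) / seconds,
    )
```

To use it, read `/proc/diskstats` twice a known number of seconds apart, pass both texts through `parse_diskstats`, and call `throughput` on the entries for the device that backs your attack directory.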
When running your first IO attack, start small. Keep the Workers, Block Size, and Block Count low (try running it with the default values first), then gradually increase them until you reach your target utilization. It may help to know the maximum throughput of your storage device before running an experiment. You can use a tool like FIO to benchmark your device and get its maximum throughput when idle. Using this knowledge, and by monitoring the throughput used by the IO attack, you can get a clear understanding of how much bandwidth the attack will use based on your magnitude.
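The comparison between your benchmark and the throughput observed during the attack is simple arithmetic. The sketch below uses made-up numbers: a hypothetical idle baseline of 500 MB/s and a hypothetical 175 MB/s observed during the attack. The function name is an assumption for illustration.

```python
def utilization_percent(observed_mb_s: float, baseline_mb_s: float) -> float:
    """Share of the device's benchmarked throughput consumed, as a percentage."""
    if baseline_mb_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return observed_mb_s * 100 / baseline_mb_s


# Hypothetical values: replace them with your own benchmark and measurement.
baseline = 500.0   # MB/s measured on the idle device
observed = 175.0   # MB/s measured while the attack runs
print(f"{utilization_percent(observed, baseline):.1f}% of baseline")
```

If the result is below your target utilization, raise the magnitude one step and measure again.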
As you run these experiments, remember to record your observations in the Gremlin web app, discuss the outcomes with your team, and track any changes or improvements made to your systems as a result. This way, you can demonstrate the value of the experiments you’ve run to your team and to the rest of the organization.
Get started with IO attacks
Now that you know how IO attacks work, try running one for yourself:
- Log into your Gremlin account (or sign up for a free trial).
- Create a new attack and select a host to target. Start with a single host to limit your blast radius.
- Under Choose a Gremlin, select the Resource category, then select IO.
- Enter the Directory to run the attack in. This defaults to /tmp on Linux systems.
- Select the Mode depending on whether you want to test read performance, write performance, or both.
- Optionally, in the Advanced Options section, enter the number of Workers that will be simultaneously reading from or writing to disk, the amount of data to read or write at a time in Block Size, and the number of blocks read or written at a time in Block Count.
- If you want to monitor disk throughput utilization, open your observability tool or start your monitoring tool now.
- Click Unleash Gremlin to start the attack.
While the attack is running, try using your application. Do you notice any significant impact on responsiveness? What happens if you try to transfer a large amount of data? Does the system become slow or unstable? Once the attack completes, try increasing the magnitude, duration, or blast radius, and repeat the experiment. Do your application and systems behave as expected? If not, record your observations and bring them to your engineering team so that they can look into resolving the issue. If a fix is deployed, run the attack again to validate that the fix works.
If you want to put increasing pressure on your IO devices, try creating a Scenario. Scenarios let you run multiple attacks sequentially. You can create a Scenario consisting of multiple IO attacks that gradually increase in intensity (by changing the Mode, Workers, Block Size, and Block Count). Give it a try, and remember to record your observations!
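Before building the Scenario, it can help to write down the steps you intend to run. The sketch below plans a ramp in plain Python. It does not call any Gremlin API. The `IoStep` type, the `build_ramp` helper, and the choice to double Workers at each step are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IoStep:
    """One planned IO attack in the ramp (hypothetical planning type)."""
    mode: str
    workers: int
    block_size_kb: int
    block_count: int


def build_ramp(start: IoStep, steps: int, factor: int = 2) -> list[IoStep]:
    """Return `steps` attacks, multiplying the worker count by `factor` each time."""
    ramp = []
    workers = start.workers
    for _ in range(steps):
        ramp.append(IoStep(start.mode, workers, start.block_size_kb, start.block_count))
        workers *= factor
    return ramp


# Example plan: four read/write steps, starting from a single worker.
for step in build_ramp(IoStep("rw", 1, 4, 1), steps=4):
    print(step)
```

Each printed step corresponds to one IO attack you would add to the Scenario, in order.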