How reliability engineering can verify disaster recovery plans
Disaster recovery plans have always been a crucial part of running a business, especially for essential services like banks. These plans keep your business up and running during a disaster or extreme scenario so you can be there for your customers when they need you the most.
And in an age where technology and applications are essential for providing service, your disaster recovery plans need to include how you’ll be able to meet minimal thresholds of operation even in the face of outage-causing events.
In fact, IT disaster recovery plans are so crucial to modern business that they’re being required by regulations across the globe, including APRA CPS 230 in Australia, DORA in the EU, FCA PS21/3 in the UK, and OSFI Guideline E-21 in Canada.
This blog shows how reliability engineering and Gremlin can help test your disaster recovery plans to make sure you’re prepared—and compliant with regulations.
How can reliability engineering help?
No matter how good your disaster recovery plan is, you don’t know for sure that it will work until a disaster actually happens—which is the worst possible time to find out your plans aren’t effective. Reliability engineering is the practice of using a combination of technologies and processes to simulate failure scenarios and detect configuration risks, uncovering how your systems, plans, and processes respond under real-world conditions.
Reliability engineering offers a number of benefits for disaster recovery plan testing:
- Faults are simulated safely in a way that minimizes risk and can be instantly reverted if a failure is uncovered, giving you a more complete picture of how your system responds and how resilient it really is.
- Multiple faults can be combined to accurately simulate real-world scenarios, such as entire availability zones or regions becoming unavailable, dependencies going dark, networks being damaged, etc.
- Multiple scenarios can be combined into comprehensive test suites that can be run on a schedule to regularly verify your disaster recovery plans.
- Scenarios can be standardized, making it possible to verify plans and prove compliance across your organization.
With reliability engineering, you move from making educated guesses about the effectiveness of your disaster recovery plans to having confidence that you’ll be able to maintain minimum service thresholds in the event of a disaster.
1. Verify correct response to outages
There are a variety of disaster-related reasons why a data center or resource might go down, ranging from network cables being severed to the entire data center being destroyed. But as far as other resources are concerned, it doesn’t matter what happens—all that matters is that the data center is now unreachable.
Thus, to simulate a data center outage, we don’t have to actually take a data center offline. Instead, we can interrupt the connection between the data center and the application. In Gremlin, you can use Blackhole experiments to simulate this. Blackhole experiments make resources unavailable by cutting off traffic between the resource and your application.
Run them on targets like:
- Availability zones
- Regions
- Dependencies
- Specific IP addresses or ports
These tests form the backbone of testing any disaster recovery plan. You can start small by running experiments on individual services or resources, but to truly test disaster recovery, you should move up to shutting down network access to entire availability zones or regions.
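From the application’s point of view, a blackholed resource simply stops responding—requests time out. What matters is how your code reacts. Here’s a minimal Python sketch of the failover behavior a blackhole experiment should exercise; all of the endpoint names and the `fake_send` transport are illustrative, not Gremlin APIs:

```python
def call_with_failover(endpoints, send):
    """Try each endpoint in priority order; return the first successful reply.

    A blackholed endpoint looks like a TimeoutError to the caller, which is
    exactly what a dropped-traffic experiment produces from the app's side.
    """
    last_error = None
    for endpoint in endpoints:
        try:
            return send(endpoint)
        except TimeoutError as err:  # unreachable: traffic is being dropped
            last_error = err
    raise RuntimeError("all endpoints unreachable") from last_error


# Simulated transport: us-east-1 is "blackholed", us-west-2 is healthy.
def fake_send(endpoint):
    if endpoint == "https://us-east-1.example.internal":
        raise TimeoutError("connection timed out")  # simulated blackhole
    return f"200 OK from {endpoint}"


reply = call_with_failover(
    ["https://us-east-1.example.internal", "https://us-west-2.example.internal"],
    fake_send,
)
print(reply)  # the healthy region answers
```

A blackhole experiment verifies that this failover path actually fires in production-like conditions, rather than only in unit tests with a fake transport like the one above.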
2. Ensure your system can handle traffic pattern shifts
If your load balancing is working correctly, then when a resource like an entire region goes down, your system will reroute traffic and distribute it among the still active resources. It’s why redundancy is a key part of any disaster recovery plan.
But rerouting traffic is only one part of staying resilient when resources become unavailable. During normal operation, load balancers split traffic across multiple redundant resources.
As an example, your load balancer might split traffic between three redundant resources so each one handles only 33% of the total. As a result, you might reasonably provision each resource for just 40% of total traffic. But if one of those resources goes down, each of the two remaining resources suddenly has to handle 50% of the traffic. If they can’t scale past that 40% threshold, you’re headed for a cascading failure.
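The headroom math from the example above is worth making explicit, since the same calculation applies to any N-way redundant setup:

```python
# Three instances, each provisioned to absorb 40% of total traffic,
# with load spread evenly by the load balancer.
instances = 3
capacity_per_instance = 0.40          # fraction of total traffic each can absorb

share_normal = 1 / instances          # ~33% each during normal operation
assert share_normal <= capacity_per_instance  # healthy: within capacity

# One instance fails: the same traffic is split across the survivors.
share_after_failure = 1 / (instances - 1)     # 50% each
overloaded = share_after_failure > capacity_per_instance
print(f"load per survivor: {share_after_failure:.0%}, overloaded: {overloaded}")
```

The general rule this illustrates: with N instances, each needs capacity for at least 1/(N-1) of total traffic to survive a single-instance failure without overload.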
Fortunately, you can simulate traffic pattern shifts by running experiments that target specific resources, including:
- CPU - Generate a high CPU load for one or more CPU cores
- Memory - Allocate high levels of memory to take up capacity
- I/O - Increase read/write amounts to put pressure on I/O devices like hard disks
- Disk - Write files to a disk to fill its capacity
- Processes - Consume process IDs (PIDs) to simulate heavy process loads on OSes
These tests help you make sure your services respond and scale correctly to the sudden surges of traffic that would be caused by a resource going down.
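As a loose illustration of the pressure these experiments create, here’s a toy sketch of CPU and memory consumption in Python. Gremlin applies these loads in a controlled, revertible way across hosts and containers; this only shows the underlying mechanism:

```python
import time


def burn_cpu(seconds):
    """Keep one core busy for `seconds` by spinning on arithmetic."""
    deadline = time.monotonic() + seconds
    x = 0
    while time.monotonic() < deadline:
        x = (x * 31 + 7) % 1_000_003
    return x


def consume_memory(megabytes):
    """Hold `megabytes` of memory to shrink available capacity."""
    return bytearray(megabytes * 1024 * 1024)


ballast = consume_memory(64)  # 64 MB held until `ballast` is released
burn_cpu(0.5)                 # half a second of load on one core
del ballast                   # "revert" the experiment
```

While pressure like this is applied, you watch whether autoscaling kicks in, whether alerts fire, and whether latency stays within acceptable bounds.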
3. Simulate partial resource or network outages
It’s easy to assume that disaster recovery plans only need to cover scenarios where entire availability zones, regions, or other groups of resources go down, but that’s not always the case. For example, what if you’re using availability zone two in a region and availability zone one goes down? Your resources are still running, but a large amount of traffic from availability zone one suddenly shifts over to availability zone two, straining your resources.
This is where you want to be thorough in your testing and cover more scenarios than just outages within your own architecture. Ask yourself questions like: What happens if a region you’re not using goes down, but one of your dependencies is in it? What if a partial region failure affects some of your clusters, but not others? If damage to the network drastically reduces network performance without causing an outage, can your system react correctly?
This is why it’s also important to test partial resource outages, network slowdowns, or the unavailability of only a handful of services. These experiments should be part of your toolkit:
- Shutdown - Simulate losing one or more machines by causing a shutdown on host operating systems or containers
- Process killer - Simulate application or dependency crashes by killing specific processes
- Latency - Simulate slowdowns by increasing latency for all matching egress network traffic
- Packet loss - Simulate network instability by increasing packet loss on egress network traffic
- DNS access - Block access to DNS servers to simulate DNS provider outages
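The latency and packet loss experiments in particular exercise whether your clients degrade gracefully. Here’s a hedged sketch of the bounded-timeout-plus-retry behavior such an experiment should verify; `flaky_dependency` and the helper names are illustrative, not part of any real API:

```python
def call_with_timeout_and_retry(operation, timeout, retries):
    """Run `operation(timeout)`; retry on TimeoutError up to `retries` times."""
    for attempt in range(retries + 1):
        try:
            return operation(timeout)
        except TimeoutError:
            if attempt == retries:
                raise  # give up: callers should serve a fallback instead


# Simulated dependency that is slow on the first call (as if under a
# latency experiment) and fast afterwards.
calls = {"n": 0}


def flaky_dependency(timeout):
    calls["n"] += 1
    simulated_latency = 2.0 if calls["n"] == 1 else 0.01
    if simulated_latency > timeout:
        raise TimeoutError("deadline exceeded")
    return "ok"


print(call_with_timeout_and_retry(flaky_dependency, timeout=0.5, retries=2))  # ok on retry
```

Without a timeout budget like this, a latency experiment will often reveal threads or connections piling up behind a slow dependency, which is exactly the kind of hidden risk you want to find before a real disaster.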
These experiments expand the scope and breadth of your disaster recovery plan testing to give you a complete picture. Where outage tests verify that your system can respond when resources you control go down completely, these tests verify that it can handle partial failures, degraded performance, and outages that originate outside your own architecture.
4. Standardize, automate, and record tests across your organization
Once you have a collection of tests for your disaster recovery plan, you’ll want to roll them out across your organization. After all, it’s not enough for a handful of services to be compliant—all of the vital services for your applications need to be tested to make sure the disaster recovery plans will work.
With a tool like Gremlin, you can combine these tests into centrally controlled test suites that can then be assigned to services. At first, service owners should run these tests manually, then address any failures or reliability risks the tests uncover. As teams become more familiar with the test suites, they can be automated to run on a regular schedule to verify continued effectiveness of disaster recovery plans.
These results can also be used to document compliance. The results can be centrally gathered on dashboards and into reports, making it easy to prove that disaster recovery plans exist and that they’re able to maintain minimum service levels.
Gremlin takes disaster recovery testing from manual guessing to automated proof
For many organizations, disaster recovery compliance is a long, painstaking manual process that can take weeks and endless checklists. Even then, the best you can do is verify that a plan is in place and make an educated guess as to its effectiveness.
Gremlin changes that. The effectiveness of Gremlin’s reliability engineering platform has been proven at enterprise organizations around the globe—including in regulated industries like financial services and banking.
In addition to providing safe, secure testing, Gremlin has organizational features like centrally managed test suites, dashboards, reporting, and reliability scores to give you automated, standardized proof of the effectiveness of your disaster recovery plans.
Schedule a demo with one of our reliability experts to find out more!