Resilient and reliable IT systems have become a minimum requirement for modern businesses—a fact driven home by any number of high-profile outages over the past few years.

Unfortunately, when those outages hit the financial sector, they can have far-reaching and incredibly damaging results. To combat this, there’s been a recent wave of regulations across the globe around IT operational resilience and risk management in the financial sector.

Regulators in many jurisdictions without established IT operational resilience regulations, such as the OCC and Fed in the US, are currently in the midst of defining them.

The regulations aren’t limited to operational resilience alone. For example, payment processors in Australia have strict performance requirements as part of the New Payments Platform (NPP) regulations, requirements that come with heavy fines if they’re not met.

Meeting these regulatory requirements takes a team effort across the organization, but apart from complicated and time-consuming manual checklists, there aren’t many ways to show compliance. Even then, the checklists don’t fully prove how resilient a system is to failure or whether it’s truly compliant in case of failure scenarios.

This is where Gremlin can help. By simulating failures in a controlled, safe manner, Gremlin shows you the reliability posture of your systems—and documents the results for simple compliance reporting.

While each regulation has its own specific requirements, there are certain common threads between them where Gremlin can automate testing and reporting to provide a single source of truth for operational resilience compliance.

1. Standardize Operational Resilience testing

Regulations require companies to perform regular operational resilience testing with clear documented processes, but the exact makeup of those tests and processes is left up to individual companies. Unfortunately, there are limits to traditional testing methods (such as QA testing) once software is deployed, and processes often boil down to time-consuming manual resilience checklists.

Gremlin includes the ability to conduct a number of operational resilience tests:

  • Resource tests - Test against sudden changes in consumption of computing resources, including CPU, memory, I/O, disk, and process exhaustion.
  • Network tests - Test against unreliable network conditions, including dropped traffic, latency, packet loss, DNS access, and expired certificates.
  • State tests - Test against unexpected changes in your environment, such as power outages, node failures, clock drift, or application crashes.

Multiple individual tests can be used to create scenarios, which you can then put into test suites that can be automated and used across your organization. (Gremlin comes with pre-built test suites based on common failures to help you get started.)

How to test Operational Resilience and create resilience processes for compliance

  1. Develop operational resilience standards for service behavior.
  2. Create a test suite that verifies how well individual services comply with those standards.
  3. Run an initial set of tests on each service to get a reliability posture of the service.
  4. Where tests fail, create tickets with service owners to remediate issues.
  5. After fixes are implemented, run the tests again to verify resilience.
  6. Continue to regularly run the test suites to continually verify Operational Resilience.
  7. Gremlin automatically uses test results to create reliability scores and catalogs the results, giving you a system of record for compliance.
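
As a rough illustration of steps 3 through 5, the sketch below turns a set of pass/fail test results into a score and a remediation list. It is not Gremlin's scoring method: the `ResilienceResult` record, the simple pass-rate formula, the service name, and the test labels are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResilienceResult:
    """One test outcome (hypothetical record, not a Gremlin data type)."""
    service: str    # service the test ran against
    test_name: str  # illustrative label for the failure that was simulated
    passed: bool    # True if the service met your standard during the test


def reliability_score(results: list[ResilienceResult]) -> float:
    """Percentage of tests passed; 0.0 when no tests have run yet."""
    if not results:
        return 0.0
    passed = sum(1 for r in results if r.passed)
    return round(100.0 * passed / len(results), 1)


def remediation_tickets(results: list[ResilienceResult]) -> list[str]:
    """One ticket title per failed test, to hand to the service owner."""
    return [
        f"{r.service}: remediate failure under '{r.test_name}'"
        for r in results
        if not r.passed
    ]


# Made-up results from an initial run against a single service.
results = [
    ResilienceResult("payments-api", "cpu-exhaustion", True),
    ResilienceResult("payments-api", "dns-outage", False),
    ResilienceResult("payments-api", "node-shutdown", True),
    ResilienceResult("payments-api", "latency", True),
]

print(reliability_score(results))    # 75.0
print(remediation_tickets(results))
```

Re-running the same calculation after each fix gives you a before-and-after record for the service, which is the kind of evidence a compliance review tends to ask for.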

2. Test Business Continuity and Disaster Recovery plans

All regulations and guidance require documented and tested Business Continuity and Disaster Recovery plans. The specific language varies between regulations, but all of them require companies to test severe yet plausible scenarios, such as failure of third-party service providers, switchovers to redundant capacity and facilities, and widespread power outages.

Gremlin helps test your Business Continuity and Disaster Recovery plans by safely replicating severe failures such as the loss of a region or availability zone, or a third-party provider outage. By simulating these scenarios, you’ll be able to answer questions like:

  • Do my systems fail over correctly to redundant regions or availability zones?
  • Are my systems scaling correctly to handle a sudden increase in traffic?
  • Is the load being balanced correctly between the remaining systems?
  • Can our systems still meet minimum operational thresholds if a third-party service is suddenly unavailable?

These scenarios will often involve multiple simultaneous resilience tests. For example, you might combine network failure (black hole) tests with resource scaling tests, latency tests, disk I/O tests, and packet loss tests to simulate fluctuating connectivity and traffic levels.

How to test your Business Continuity and Disaster Recovery plan for compliance

  1. Determine specific scenarios to test your plans—and the expected system response under each scenario.
  2. Create Gremlin test suites that replicate those scenarios.
  3. Determine the individual services that would be affected by each scenario.
  4. Run the test suites on those services. This can be done individually or all at once using a GameDay process.
  5. Use the results to identify and address any areas that require remediation.
  6. Once fixes are implemented, run the tests again to verify.
  7. Use the results and the testing process as part of your compliance reporting.
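
Step 1 asks for the expected system response under each scenario. One way to write that expectation down before running anything is a simple capacity check, sketched below. The zone names, the capacity figures, and the 700 requests-per-second minimum are hypothetical values chosen for illustration, and the functions are not part of Gremlin.

```python
from itertools import combinations


def surviving_capacity(zone_capacity: dict[str, int], failed_zones: set[str]) -> int:
    """Total capacity left once the failed zones are removed."""
    return sum(cap for zone, cap in zone_capacity.items() if zone not in failed_zones)


def scenario_results(
    zone_capacity: dict[str, int], min_capacity: int, zones_lost: int
) -> dict[tuple[str, ...], bool]:
    """For every combination of `zones_lost` failed zones, report whether
    the surviving zones still cover the minimum capacity."""
    return {
        failed: surviving_capacity(zone_capacity, set(failed)) >= min_capacity
        for failed in combinations(sorted(zone_capacity), zones_lost)
    }


# Hypothetical capacities in requests per second for each availability zone.
zones = {"zone-a": 400, "zone-b": 400, "zone-c": 400}

one_lost = scenario_results(zones, min_capacity=700, zones_lost=1)
two_lost = scenario_results(zones, min_capacity=700, zones_lost=2)

print(one_lost)  # every single-zone loss leaves 800, so all entries are True
print(two_lost)  # every two-zone loss leaves 400, so all entries are False
```

If the real test disagrees with the expectation, for example a single-zone loss drops the system below its threshold, either the plan or the system needs remediation.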

3. Verify Operational Resilience monitoring and incident response

A core requirement of all regulations is for companies to have operational monitoring and incident response processes in place to detect outages and remediate them quickly. As part of these regulations, companies are expected to specify minimum viable operational standards for those processes and verify that they can meet those standards.

Gremlin’s resilience testing fits perfectly into monitoring/observability and incident response practices. Use resilience tests to verify that your monitoring is working correctly and to fine-tune your alerts so you minimize noise while still flagging any serious issues. And after an incident has occurred, resilience testing can be used to replicate the failure conditions so you can verify and prove that your systems won’t crash the same way again.

How to verify monitoring and incident response compliance

  1. Instrument your systems and set alert levels with a monitoring/observability tool.
  2. Use Gremlin’s native integrations to create Health Checks for your resilience tests.
  3. Create a test suite (or customize a pre-built one) to match your monitoring alerts.
  4. Run the tests and verify that your monitoring works and that alerts trigger correctly.
  5. Adjust your alerts as necessary and run tests again.
  6. Run tests regularly to uncover failures, improve resilience, and prove compliance.
  7. If an incident occurs, create a new scenario that matches the failure conditions.
  8. Run that test on all affected services to verify remediation actions.
  9. Add the test to your regular test suite to prove continued resilience and compliance.
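
Step 4 comes down to one question: did an alert fire soon enough after the failure began? A minimal sketch of that check follows, assuming you can export the test start time and the alert timestamps from your own tools. The function names and the two-minute detection standard are illustrative, not values taken from Gremlin or from any regulation.

```python
from datetime import datetime, timedelta
from typing import Optional


def detection_delay(
    test_start: datetime, alert_times: list[datetime]
) -> Optional[timedelta]:
    """Time from test start to the first alert at or after it.
    Returns None if no alert fired once the test had started."""
    after_start = [t for t in alert_times if t >= test_start]
    if not after_start:
        return None
    return min(after_start) - test_start


def alert_within_standard(
    test_start: datetime, alert_times: list[datetime], max_delay: timedelta
) -> bool:
    """True if an alert fired no later than `max_delay` after the test began."""
    delay = detection_delay(test_start, alert_times)
    return delay is not None and delay <= max_delay


# Made-up timestamps: the test starts at 12:00:00 and one alert fires at 12:01:30.
start = datetime(2024, 1, 15, 12, 0, 0)
alerts = [datetime(2024, 1, 15, 12, 1, 30)]

print(detection_delay(start, alerts))                             # 0:01:30
print(alert_within_standard(start, alerts, timedelta(minutes=2))) # True
print(alert_within_standard(start, alerts, timedelta(minutes=1))) # False
```

A missing alert and a late alert both count as failures here, which matches the goal of proving that monitoring meets a stated detection standard.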

4. Map third-party dependencies and verify resilience to outages

Third-party dependencies can create an inherent risk to operational resilience, which is why many regulations require companies to identify all third-party operational resilience risks and, where possible, test to verify that their systems will maintain a minimal operational threshold in the event a third-party dependency goes down.

Unfortunately, creating a comprehensive list of all third-party dependencies is easier said than done. Modern architectures are often a complex web of APIs and service calls so that a single app can have hundreds of dependencies. Microservice architecture can make this even trickier since every service will rely on a different set of dependencies.

Gremlin’s Dependency Discovery helps you uncover the actual dependencies impacting each of your services. By running dependency discovery as part of Reliability Management, you’ll get a list of all of your service’s dependencies. Then, you can run tests such as dependency failure, latency, or certificate expiry to verify your system's resilience if a dependency goes down.

How to test third-party dependency compliance

  1. Set up a service within Reliability Management.
  2. Gremlin will automatically detect any dependencies for that service.
  3. Run either a custom or pre-built test suite on the service that includes dependency tests.
  4. The tests will show you which dependencies will cause failures if they’re unavailable.
  5. Adjust systems accordingly to fit your dependency resilience standards.
  6. Rerun the tests, then run them regularly to verify third-party compliance.
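
Once the dependency tests have run, the output of step 4 can be summarized as a list of dependencies the service tolerates losing and a list it does not. The sketch below shows one way to organize that summary. The dependency names and the `classify_dependencies` helper are hypothetical, and the input is assumed to be a pass/fail outcome you recorded per dependency.

```python
def classify_dependencies(outcomes: dict[str, bool]) -> dict[str, list[str]]:
    """Split dependencies by whether the service stayed above its minimum
    operational threshold while that dependency was unavailable."""
    report: dict[str, list[str]] = {"tolerated": [], "critical": []}
    for name, stayed_healthy in sorted(outcomes.items()):
        report["tolerated" if stayed_healthy else "critical"].append(name)
    return report


# Made-up outcomes: True means the service kept working without the dependency.
outcomes = {
    "fraud-scoring-api": True,
    "card-network-gateway": False,
    "email-provider": True,
}

report = classify_dependencies(outcomes)
print(report["critical"])   # ['card-network-gateway']
print(report["tolerated"])  # ['email-provider', 'fraud-scoring-api']
```

Anything in the critical list is a candidate for a fallback, a cache, or a documented risk acceptance, depending on your dependency resilience standards.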

5. Meet performance obligations

Some regulations have strict performance obligations that lead to hefty fines if not met. For example, the Australian NPP regulation has specific Operational Performance, Availability, and Resilience requirements with fines that start at $25,000 for the first breach of the Mandatory Compliance Requirement and increase from there.

The material impact of these fines can place just as much importance on performance compliance as resilience compliance. By simulating conditions where performance would degrade, you can use resilience testing to verify compliance and uncover potential issues before they affect customers and lead to fines.

When constructing these test suites, a broad spectrum of tests can help you narrow down performance issues. For example, latency tests against a caching database can help you verify failover functionality works correctly, while scalability tests can help you verify performance even under increased traffic.

How to verify operational performance compliance

  1. Specify performance standards along with expected behavior.
  2. Gain a performance and resilience baseline using a custom or pre-built test suite.
  3. Use the results to uncover unexpected behavior or performance degradation.
  4. Make fixes, then run tests again for a new performance result.
  5. Continue running test suites regularly to uncover potential degradations before they impact customers or cause fines.
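
To make steps 1 through 3 concrete, the sketch below compares a 95th-percentile latency against a performance standard, once for a baseline run and once for a run under a latency test. The sample values and the 200 ms limit are invented for illustration and are not NPP figures. The nearest-rank percentile used here is one common definition, so check which definition your own standard specifies.

```python
def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("at least one sample is required")
    if not 0 < pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def meets_standard(samples: list[float], limit_ms: float, pct: int = 95) -> bool:
    """True if the chosen percentile is at or below the performance limit."""
    return percentile(samples, pct) <= limit_ms


# Made-up response times in milliseconds.
baseline = [90, 95, 100, 105, 110, 115, 120, 125, 130, 135]
under_latency_test = [90, 95, 100, 105, 110, 115, 120, 125, 130, 480]

print(percentile(baseline, 95), meets_standard(baseline, 200))
# 135 True
print(percentile(under_latency_test, 95), meets_standard(under_latency_test, 200))
# 480 False
```

Keeping the baseline and the under-test figures side by side makes it easier to show whether a fix actually moved the number you are measured on.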

Automate compliance and standardize reporting with Gremlin

Every regulation requires that companies verify, document, and report compliance. Unfortunately, without testing, Operational Resilience is almost impossible to document because you can only document the number of incidents that did or did not happen.

With resilience testing, however, you can verify and document the resilience of your system to known scenarios, effectively proving compliance by showing the ability of your system to still meet minimum operational standards when faced with specific failures.

Gremlin’s Reliability Engineering Platform makes the verification, documentation, and reporting of Operational Resilience compliance easy. By automating and scheduling test suites to run regularly, you automatically create an ongoing record of compliance and resilience. These results can be gathered centrally into dashboards and reports for executive leadership and compliance verification.

With Gremlin, organizations in regulated industries can go from manual, time-consuming, and unverifiable compliance checklists to automated, standardized, and provable operational resilience compliance.

Schedule a demo to find out how to use Gremlin to comply with the specific regulations your company faces.

Gavin Cahill
Sr. Content Manager