One of the biggest causes of outages and incidents is good old-fashioned human error. Despite all of our best intentions, we can still make mistakes, like forgetting to change defaults, making small typos, or leaving conflicting timeouts in the code. It’s why 27.8% of unplanned outages are caused by someone making a change to the environment.

Fortunately, reliability testing can help you catch these errors before they cause outages. But we’ve recently seen the rise of a different source of failures: AI errors.

Much like human errors, AI errors happen despite developers' best intentions. AI agents can increase efficiency, reduce toil, and generally help developers move faster. But they can also introduce the same kinds of minor errors as humans. At the same time, AI agents lack the domain expertise and system-specific knowledge of experienced engineers, meaning they can produce code that follows established best practices, yet still introduces errors in your specific systems.

That doesn’t mean you have to avoid AI agents, but it is another variable you need to account for in your testing. Fortunately, these four best practices for reliability testing will help you catch AI (and human) errors before they cause outages or incidents that impact your customers.

1. Test as close to production as you can

Reliability testing will always get the best results when done in production environments. This gives an accurate picture of your actual system reliability, and a tool like Gremlin makes sure that it’s done safely with little to no customer impact.

Unfortunately, that’s not always possible. Some organizations are too regulated to risk testing in production, while others have strict policies against it. But you can still get enormous benefits from reliability testing in close-to-production environments. Reliability testing is at its most effective when it can test holistically to give you a full picture of how all of your services interact with each other under production loads.

If you can’t test in production, then create an environment that’s as close to parity with production as possible. Several Gremlin customers in highly regulated financial markets have invested in production-parity testing environments, allowing them to detect and resolve P0 and P1 issues before they ever reached production.

This last-mile check is the best way to keep AI errors from impacting customers. Often these errors won’t be detected by standard checks or QA testing on isolated services, but then cause issues when those services are integrated with the larger system. Testing in a production (or production-parity) environment will uncover these conflicts and prevent outages.

2. Run standardized test suites against known issues

Every engineering team has a list of known issues that could go wrong with their system. Sometimes it’s because the issue caused outages in the past; other times it’s a best practice they’ve learned through experience. Regardless of the source, it’s a good idea to check new code against this list of failures to make sure it’s resilient to them. Common examples include verifying autoscaling, checking database failover configurations, and monitoring for expiring security certificates.
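As a concrete illustration of one such check, here is a minimal sketch of an expiring-certificate test. It assumes you've already pulled a certificate's `notAfter` field (in OpenSSL's text format) and flags anything expiring within a hypothetical 30-day threshold; a real check would fetch the certificate from each endpoint first.

```python
from datetime import datetime, timezone

# Hypothetical policy: flag certificates expiring within 30 days.
EXPIRY_WARNING_DAYS = 30

def days_until_expiry(not_after: str) -> int:
    """Parse a certificate's notAfter field (OpenSSL text format,
    e.g. 'Jun  1 12:00:00 2026 GMT') and return the days remaining."""
    expires = datetime.strptime(
        not_after, "%b %d %H:%M:%S %Y %Z"
    ).replace(tzinfo=timezone.utc)
    return (expires - datetime.now(timezone.utc)).days

def is_at_risk(not_after: str, threshold: int = EXPIRY_WARNING_DAYS) -> bool:
    """Return True if the certificate should be flagged for renewal."""
    return days_until_expiry(not_after) < threshold
```

Running a check like this on a schedule, rather than once at deploy time, is what catches the certificate that was fine last quarter but expires next week.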

Reliability test suites are a great way to create a standardized process around these issues. A test suite is a collection of tests designed to run as a group one after the other. When you build a test suite specifically to target that list of known failures, you give your teams an easy way to verify resilience to each failure.
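The structure of a suite like this can be sketched in a few lines: a named list of checks, run one after the other, with the results collected for review. The check functions below are hypothetical stubs standing in for real tests such as injecting load to verify autoscaling or shutting down a database primary to verify failover.

```python
# Minimal sketch of a reliability test suite: an ordered list of
# named checks run as a group. The checks are placeholders.

def check_autoscaling() -> bool:
    # A real test would inject CPU load and confirm new replicas
    # come online; stubbed to pass for illustration.
    return True

def check_db_failover() -> bool:
    # A real test would stop the primary and confirm a replica
    # is promoted; stubbed to pass for illustration.
    return True

KNOWN_FAILURE_SUITE = [
    ("autoscaling responds to load", check_autoscaling),
    ("database fails over to a replica", check_db_failover),
]

def run_suite(suite) -> dict:
    """Run each test in order and return {test name: passed?}."""
    return {name: test() for name, test in suite}
```

The point is less the code than the shape: every known failure gets a named, repeatable check, so "are we still resilient to this?" becomes a question the suite answers automatically.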

These test suites become even more important with AI agents. While AI agents can generate code based on common practices from across the internet, they don’t have the domain-specific knowledge of your system that comes with experience. So an AI agent might produce code that works perfectly well in most situations, but creates issues when integrated into your specific systems.

Standardized test suites give you that extra layer of proactive prevention to make sure that any new code shipped will comply with reliability standards and be resilient to known issues.

3. Embrace automated testing and risk detection

Just because your systems work now doesn’t mean they’ll work after the next deployment. Modern systems are complex and distributed, with many separate pieces all depending on each other. In systems like these, Service A can pass its individual QA tests while introducing conflicts that shut down Service B or C.

Unfortunately, issues like this can’t be caught until the code is actually shipped and in the full environment. To prevent these outages, you need to regularly test the entire system, including services already deployed.

In addition, production systems can have issues that pop up just due to the natural progression of time, such as security certificates that were fine two months ago, but now suddenly expire next week.

Automated test suites, such as those in Gremlin’s Reliability Management, run on a regular schedule against all of your services. For example, we run our reliability test suites against our production environment on a weekly basis. Most of the time there are no major issues, but sometimes we’ll uncover a reliability risk that could lead to an outage under the right conditions. Automated test suites give us the time to address the issue before there’s any customer impact.

Another automated tool is reliability risk monitoring, such as Gremlin’s Detected Risks, which automatically scans your systems and configurations, such as your Kubernetes orchestrator configurations, to uncover common reliability risks. This is especially important when your reliability standards specify settings different from the default values. Whether it’s an update to your Kubernetes orchestrator or an old config file reused in a new deployment, these defaults have a habit of creeping back in and creating outage-causing conflicts.
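To make the idea concrete, here is an illustrative sketch (not Gremlin's implementation) of what scanning for risky defaults can look like. It takes a Kubernetes pod spec already parsed into a dict and flags two common defaults: containers with no resource limits and containers with no liveness probe.

```python
# Illustrative risk scan over a parsed Kubernetes pod spec (a dict).
# Flags settings left at risky defaults: missing resource limits
# and missing liveness probes.

def find_default_risks(pod_spec: dict) -> list:
    risks = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        # No limits means the container can consume unbounded CPU/memory.
        if not container.get("resources", {}).get("limits"):
            risks.append(f"{name}: no resource limits set")
        # No liveness probe means Kubernetes can't restart a hung container.
        if "livenessProbe" not in container:
            risks.append(f"{name}: no liveness probe configured")
    return risks
```

A real scanner covers far more risk types and watches configurations continuously, but the principle is the same: encode your standards as checks, and let automation catch the defaults that creep back in.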

These automated tools are helpful for any team to improve their reliability, but they’re especially important as you embrace AI agents. A big selling point for AI agents is that they help you move faster, which also means there’s a greater chance of default settings or tiny issues being integrated without anyone noticing.

Automated, scheduled reliability test suites and reliability risk monitoring catch those failures that slip through the cracks.

Gremlin helps keep your availability high

Change, like using AI agents, can be good, especially when it allows you to move faster, innovate more, and produce better experiences. But change will also introduce new potential failures, so you need a plan to account for those failures.

Gremlin can help you take deliberate, effective action to verify the resilience and reliability of code introduced by AI agents. And it can help you do it safely on an organization-wide scale.

Which means your systems will be protected from both human and AI error.

Want to find out more? Check out our interactive demos to see what running a test is like for yourself.

Gavin Cahill
Sr. Content Manager