Grubhub and JPMC shift reliability testing left at Chaos Conf 2020
Gremlin’s Chaos Conf is always an exciting event, bringing together leaders at the forefront of Chaos Engineering practices. This year was no exception, moving beyond defining Chaos Engineering to more advanced discussions of adoption and best practices.
Two talks that stood out to me were Rahul Arya’s “Let Devs be Devs” and Doug Campbell’s “Self-Serve Chaos Engineering.” Both discussed empowering developers to perform Chaos Engineering and lowering the friction of meeting reliability goals across large corporations, but each approached the problem from a different angle.
When developers are busy building new features to delight customers, shifting their mindset so that reliability testing is seen as a benefit rather than an inhibitor of development velocity is powerful. Rahul and Doug have built scalable templates and training that provide fast feedback, enabling developers to expand the use of Chaos Engineering quickly.
Standardize starting points and trust your developers
The common thread in Rahul and Doug’s rollout strategies was immediate pervasiveness. JPMC created standard web frameworks, built with CloudFront and Terraform, that all new applications start from, while Grubhub added Gremlin to all of its pre-production services. Neither rollout was a drawn-out process: both companies brought Chaos Engineering to everyone immediately. The next part is important: after fully enabling developers and giving them a starting point, they left the decisions in their capable hands.
Both speakers emphasized that developers were granted the freedom to run their own experiments, with guidance. JPMC templatizes experiments as default Scenarios, but ultimately leaves it to each developer to set reliability targets and choose which Scenarios to run from the library. Grubhub, meanwhile, installs Gremlin agents everywhere and lets developers run experiments on any service, trusting them to set their own parameters and to coordinate with other teams when testing across services. All of this upfront investment in templates and training creates a solid base for developers to run chaos experiments and learn about their applications.
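To make this concrete, here is a minimal sketch of what a self-serve experiment launcher built on that kind of foundation might look like, written in Python against Gremlin’s REST API. The endpoint path, payload shape, tag names, and environment variable below are assumptions based on Gremlin’s public API conventions, not the setup either company described, so verify them against the current API docs before using anything like this.

```python
# A minimal sketch of a self-serve experiment launcher. The endpoint path,
# payload shape, and tags are assumptions -- check Gremlin's API docs.
import os

import requests

GREMLIN_API = "https://api.gremlin.com/v1"
API_KEY = os.environ["GREMLIN_API_KEY"]  # hypothetical variable name


def launch_cpu_attack(service_tag: str, length_seconds: int = 60) -> str:
    """Run a CPU attack against pre-production hosts tagged with a service."""
    payload = {
        # Target hosts by tag; the tag keys here are team conventions.
        "target": {
            "type": "Random",
            "hosts": {"tags": {"service": service_tag, "env": "pre-prod"}},
        },
        # Consume one CPU core for length_seconds.
        "command": {"type": "cpu", "args": ["-l", str(length_seconds), "-c", "1"]},
    }
    resp = requests.post(
        f"{GREMLIN_API}/attacks/new",
        json=payload,
        headers={"Authorization": f"Key {API_KEY}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.text  # the attack ID on success


if __name__ == "__main__":
    print(launch_cpu_attack("checkout-service"))
```

Wrapping the raw API in a one-call helper like this is the kind of template both talks point toward: the platform team decides the safe defaults, and developers only choose the target and duration.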
As infrastructure complexity grows, questions and assumptions multiply. Chaos Engineering helps answer those questions and surface the ones you didn’t know to ask.
Start with the low-hanging fruit, then educate and expand
Using templates and installing agents everywhere is the first step, but to kick-start teams into action, Rahul and Doug shared similar advice. Start with developers who have a DevOps mindset and a hunger to improve reliability. Then identify the “biggest sticks”: research which problem areas slow development down the most, and bring Chaos Engineering to them. This has the double benefit of increasing development velocity and creating a big, quick win. From there, work it into developers’ daily flow: teach them to form a hypothesis, then test that hypothesis with chaos experiments as they add reliability mechanisms to their applications.
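Here is a rough sketch of that hypothesis-first loop in Python. The SLO check and the attack helpers are hypothetical placeholders for your own monitoring queries and chaos tooling, and the 200 ms figure and “payments” dependency are purely illustrative.

```python
# A sketch of the hypothesis-first workflow. The helpers below are
# placeholders -- swap in real monitoring queries and chaos-tool calls.
import time


def check_slo() -> bool:
    # Placeholder: query your monitoring system for p99 latency / error rate.
    return True


def launch_latency_attack(dependency: str, delay_ms: int) -> str:
    # Placeholder: start a latency injection via your chaos tool; return an ID.
    return "attack-123"


def halt_attack(attack_id: str) -> None:
    # Placeholder: stop the running experiment.
    pass


def run_experiment() -> None:
    # Hypothesis: with retries in place, +200 ms of latency on the payments
    # dependency keeps checkout inside its SLO.
    assert check_slo(), "Steady state not met; fix that before injecting failure"
    attack_id = launch_latency_attack("payments", delay_ms=200)
    try:
        time.sleep(60)  # let the fault run while normal traffic flows
        assert check_slo(), "Hypothesis disproved: SLO breached under latency"
    finally:
        halt_attack(attack_id)  # always clean up, even if a check fails


if __name__ == "__main__":
    run_experiment()
    print("Hypothesis held: the service stayed within its SLO.")
```

The shape matters more than the details: verify steady state first, inject exactly one fault, compare the outcome against the hypothesis, and always halt the experiment on the way out.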
From there, it’s important to promote the tool often. Document use cases in the places where developers already look for references, demo the tool for large teams, and encourage developers to help each other so the practice spreads even faster. Teach developers how to perform attacks on their own; don’t perform the attacks for them. After all, they know their applications best.
Continue to iterate with your devs on what is working and what is not working; that is key. Don’t be afraid to go fast, but build that feedback into your process and always talk to your customers. For me, my customers are my developers.
As developers become comfortable with the tool and usage spreads across your organization, begin automating. Add chaos experiments to your CI/CD pipeline to provide fast feedback and prevent regressions. At JPMC, this was part of centralizing the company’s CI/CD tooling, which made meeting compliance requirements easier and left devs more time for development work.
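Neither talk prescribed an exact pipeline integration, but a chaos gate can be as simple as a script the pipeline runs after deploying to staging: inject a short fault, then fail the build if the service goes unhealthy. The health-check URL below is hypothetical, and the attack trigger is left as a placeholder for whatever tool your pipeline calls.

```python
# A sketch of a CI/CD chaos gate: run after a staging deploy, exit non-zero
# to block the release. The URL and attack trigger are placeholders.
import sys
import time

import requests

STAGING_HEALTH_URL = "https://staging.example.com/health"  # hypothetical


def healthy() -> bool:
    try:
        return requests.get(STAGING_HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False


def main() -> int:
    # Placeholder: trigger a short resource or network attack here, e.g.
    # via the launcher sketched earlier or your chaos tool's CLI.
    time.sleep(30)  # give the fault time to take effect
    if not healthy():
        print("Chaos gate failed: service unhealthy under fault; blocking release.")
        return 1
    print("Chaos gate passed: service stayed healthy under fault.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Because the gate runs on every build, reliability regressions surface in minutes rather than during an incident, which is exactly the fast feedback both speakers were after.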
Reliability at scale
When creating a new mindset for developers, it’s clear that the best path is the one with the lowest friction. Rahul and Doug have shown that spreading Chaos Engineering across two large, successful firms works best when developers can gain the benefits of increased reliability in the fastest, lowest-friction way.
To reach the massive scale of these two firms, usage had to spread from one team to many. They accomplished this by creating templates and lowering the barriers to getting started, then equipping champions with quick wins. All of this improves reliability and meets compliance requirements without slowing developers down. In fact, resolving problem areas can even speed development up.
If you want to learn more, I highly recommend watching these talks and the others from Chaos Conf 2020.
Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30-day trial.