Chaos Engineering and Resilience Testing Tools: Build vs Buy
Chaos Engineering and resilience testing, where engineers intentionally inject failure to test the reliability of their systems, are becoming a regular practice for companies that value uptime and availability. As cloud-based systems have grown more complex, Fault Injection testing has become a critical part of the software testing and release process: it uncovers surprise dependencies, fixes problems before they become 3am outages, and bakes reliability into every feature.
As these practices become more of a necessity to meet modern availability demands, teams are naturally debating whether they should build their own tools in-house, or buy a commercial offering. If you’re weighing these options, the pros and cons for each approach below—and the example cost analysis case study—should help you make your decision.
Why build your own Fault Injection tool?
Many internal tools start off as a fork of an open source tool (like Chaos Monkey) to allow a quicker path to a minimum viable product. Usually, the initial goal is to address simple concerns, such as random shutdowns or reboots of hosts. More failure states can be added over time, and ideally, an automation layer can be added on top.
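As a rough illustration of how small that first version can be, here is a minimal sketch of a random host shutdown experiment. It is not how Chaos Monkey or any other tool is implemented; the function names, the host names, and the `shutdown` callback are all hypothetical, and the sketch defaults to a dry run so nothing is actually stopped.

```python
import random
from typing import Callable, Optional, Sequence


def pick_victim(hosts: Sequence[str], rng: random.Random) -> Optional[str]:
    """Return one randomly chosen host, or None if the pool is empty."""
    if not hosts:
        return None
    return rng.choice(hosts)


def run_experiment(
    hosts: Sequence[str],
    shutdown: Callable[[str], None],
    rng: Optional[random.Random] = None,
    dry_run: bool = True,
) -> Optional[str]:
    """Pick a random host and shut it down, unless dry_run is set."""
    if rng is None:
        rng = random.Random()
    victim = pick_victim(hosts, rng)
    if victim is None:
        return None
    if dry_run:
        # Report the target without touching it.
        print(f"[dry-run] would shut down {victim}")
    else:
        # The caller supplies the real shutdown mechanism (hypothetical here).
        shutdown(victim)
    return victim


if __name__ == "__main__":
    # Hypothetical host names; the shutdown callback is a no-op stub.
    run_experiment(["web-1", "web-2", "db-1"], shutdown=lambda host: None)
```

Everything beyond this (more failure types, scheduling, safety limits, an API, access control) is where most of the engineering time in a built tool tends to go.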
Pros
Customization - Building an internal Chaos Engineering tool means you can customize it to the exact needs of your application or environment, so that it integrates deeply with systems like your monitoring and development pipelines.
Control over the roadmap - Building can also mean a shorter feedback cycle between development and production (budget and resources permitting). You also get full control over the product roadmap, including the features and direction of the product.
Control over traffic - When you build a tool, all connections will stay within your company’s internal network, which gives control over attack surface area. Everything can be kept on-prem or in a private cloud with no reliance on the outside world. This also potentially makes it easier to monitor and control traffic.
Cons
Costs - Building any Fault Injection tool requires the dedicated time of engineers to support and maintain the application. With buying, support is handled by the SaaS provider, leaving engineers time to develop chaos experiments and truly test their application’s reliability, rather than having to ensure the availability of the tool.
Time to use - Building a sophisticated Fault Injection platform takes several engineers roughly 14-18 months of focused development time, and the platform then needs ongoing maintenance. When you consider the potential for reprioritizations, reorgs, and other changes over that period, there’s a good chance it will take even longer.
Ease of use - Since internal tooling tends to be released as a minimum viable product first, interfaces might not be well-documented or easy to use. SaaS tools have to be approachable for novices, as well as extensible for the most advanced users, right from the start. Approachability and extensibility are both major factors in how well your organizational culture adopts Chaos Engineering and resilience testing.
Scalability - While building a Fault Injection tool to address one particular application stack allows for much more control, this can be a double-edged sword. The application may not be extensible to other application stacks and infrastructure types, such as container architecture vs. host-based, or running in AWS public cloud vs. Google Cloud Platform or Azure. Additionally, most built tools won’t have an API as an out-of-the-box feature, limiting your ability to automate your chaos experiments or integrate them into your SDLC.
Security - Using open source tools or developing in-house means security may be an afterthought or an implied feature. For example, a tool might assume that all users are authenticated through the corporate network, and thus automatically have the right access, rather than enforcing the principle of least privilege and multi-factor authentication.
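To make the least-privilege point concrete, here is a minimal sketch of a deny-by-default permission check. It assumes a simple role model; the role names, action names, and the `mfa_verified` flag are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass

# Hypothetical role-to-permission mapping: each role lists only the
# actions it explicitly needs.
ROLE_PERMISSIONS = {
    "viewer": frozenset({"view_experiment"}),
    "operator": frozenset({"view_experiment", "run_experiment"}),
    "admin": frozenset({"view_experiment", "run_experiment", "halt_all"}),
}


@dataclass(frozen=True)
class User:
    name: str
    role: str
    mfa_verified: bool  # set by the login flow after a second factor


def is_allowed(user: User, action: str) -> bool:
    """Deny by default: require MFA and an explicit grant for the action."""
    if not user.mfa_verified:
        return False
    # Unknown roles get an empty permission set rather than implicit access.
    return action in ROLE_PERMISSIONS.get(user.role, frozenset())
```

The point of the sketch is the default: being on the corporate network grants nothing, and every action that can inject failure has to be granted explicitly.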
Why buy a Fault Injection tool?
Buying a commercial offering for Chaos Engineering and resilience testing means you’re able to get up and running sooner with zero in-house development. Even starting with an existing open source project will require a non-trivial amount of time to build robust features, not to mention sufficiently hardening the tool for security.
Pros
Less engineering time - Building your own solution when commercial tools exist ends up being undifferentiated heavy lifting that pulls engineering resources away from business drivers. It’s often more cost-effective for your team to purchase a tool than to build their own at the expense of customer-facing, revenue-driving features.
Robust API and usability - A SaaS product needs to be extensible and generally available across diverse customer environments, which means an API layer is provided and maintained in addition to a graphical user interface. The API layer will also usually have feature parity with the in-app experience, and will have the same SLA. Additionally, because the app itself depends on the API, any feature the SaaS provider builds into the app will immediately be available through the API.
Greater compatibility - Cross-platform compatibility also comes into play. In the age of microservices and distributed systems, tooling needs to support various environments, rather than only being compatible with specific environments. Commercial tools are designed for multiple environments in order to support more customers.
Support from experts - Customer support will also be included in your contract. This can be helpful for a couple of reasons. First, your in-house engineers don’t need to provide support. Second, the support engineers included in your contract will be experts not only in their tool, but also in Chaos Engineering and resilience testing. They can provide guidance not just on how to use the tool, but on how to best use it in your organization.
Faster time to value and results - Purchasing a solution means you realize a more immediate impact on reducing downtime. An hour of downtime costs $100,000 on average, and that doesn’t include engineering costs to bring the site back up or the potential impact to your brand. Building an in-house tool can take years, and every month where an organization delays Chaos Engineering and resilience testing is a month that might contain a large outage.
More features and test coverage - A purchased product will also be refined by feedback and input from a number of customers. Lots of user feedback can mean a simple, intuitive UI, and a more robust feature set that covers use cases your internal engineers may not have considered, but would find very valuable.
Proven and dependable success - Buying a Chaos Engineering or resilience testing platform means you can look at the vendor’s track record of success with similar businesses in your industry. This is especially important in regulated industries or areas that require high availability like finance or retail. If others in your industry have safely and effectively used the tool to improve reliability, you can breathe easy and have more confidence in its capabilities.
Cons
Higher upfront costs - Buying a tool comes with more upfront costs before you can actually start running tests. However, since the development costs for the tool are borne by the SaaS company and spread out over all of their customers, these will often end up being much less than building your own tool in the long term.
Security needs to be double-checked - Because a SaaS tool is a hosted solution, there will be traffic going out of the network. SaaS vendors have this in mind when building their software, and with Chaos Engineering in particular, security is always a huge concern and must be baked into the product. This is another place to look at the track record of the company: if the tool is being used by companies with a similar level of security requirements to yours, then it’s a good sign the vendor can meet your requirements. Vendors might also offer features to reduce this risk, such as proxies and private VPC support.
Less roadmap control - As with any vendor software, you lack control over the roadmap. You’re buying a product you can begin using immediately, but you’ll be a layer removed from their product roadmap. But many companies, including Gremlin, have a close relationship with their customers, so you’ll be able to help influence the roadmap development.
Case study: Why a major insurance company chose to buy
This example table is based on the analysis a major insurance company did before choosing Gremlin for their Chaos Engineering and resilience testing tool. As with many enterprise companies, this company requires team members to perform a thorough analysis of the various options out there before making a recommendation and purchasing a tool.
When performing the analysis, they took a holistic approach to cost evaluation, and tried to include as many costs as possible. This includes licensing, the cost of engineering hours, how many tools would need to be combined for full test coverage, and more.
After performing the analysis, these key takeaways stood out to them:
- Individually, vendor-specific and open source tools were cheaper than Gremlin.
- Running all of the tests required for full coverage would need a combination of vendor-specific tools and open source tools, and they’d still have to build additional tests to cover everything they needed.
- Chaos Toolkit provided a similar level of test coverage to Gremlin, but required a substantially higher investment of engineering time and would take much longer before it returned value.
In the end, they chose Gremlin for its combination of best-in-class capabilities, price, and full test coverage.
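A holistic comparison like the one described above can be sketched as a small cost model that adds engineering hours to licensing. This is an illustrative sketch only: the `Option` fields, the formula, and every number below are placeholder assumptions, not figures from the insurance company’s analysis.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    annual_license: float      # licensing cost per year
    setup_eng_hours: float     # one-time hours to build or integrate
    annual_maint_hours: float  # hours per year to maintain and support


def total_cost(option: Option, hourly_rate: float, years: int) -> float:
    """Licensing plus engineering time, over the evaluation period."""
    eng_hours = option.setup_eng_hours + option.annual_maint_hours * years
    return option.annual_license * years + eng_hours * hourly_rate


if __name__ == "__main__":
    # Placeholder inputs: substitute your own estimates.
    options = [
        Option("build in-house", 0.0, 6000.0, 1500.0),
        Option("buy commercial", 80000.0, 200.0, 100.0),
    ]
    for option in options:
        cost = total_cost(option, hourly_rate=100.0, years=3)
        print(f"{option.name}: {cost:,.0f}")
```

A fuller model would also account for how many tools must be combined to reach full test coverage and for the cost of downtime during the months before the tool returns value.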
Conclusion
Downtime is expensive, as is the operations burden of building and maintaining a system. While companies won’t get the same level of control with a bought tool that they would with a built one, buying a tool gives them the time and capacity to start making their systems more reliable immediately, ultimately helping everyone sleep better at night.
Ready to figure out whether buying or building a tool is right for your team? Schedule a demo with one of our reliability experts to see everything Gremlin has to offer.