Hitting reliability goals in the face of layoffs
It’s never easy when layoffs hit your organization. In addition to the personal impact of losing friends and coworkers from your team, those who remain are left trying to achieve the same business goals with fewer people and resources.
Unfortunately, layoffs and restructuring have become a common part of business. But you’re not alone. Your partners (including Gremlin) are here to help you navigate your new reality.
If you find yourself having to meet the same reliability goals with a smaller team, it can be hard to justify taking the time for resilience testing. But the truth is: when done right, resilience testing can make it easier for you to do more with less and reach your reliability goals.
Reduce specialization and keep it simple
When your team is larger, specialization can be a boon that increases the effectiveness and quality of your efforts. We see this regularly with Chaos Engineering specialists, where their ability to expertly craft custom experiments uncovers unknown risks. This is especially advantageous if your team is large enough to have multiple specialists to help increase throughput. But when a team shrinks, so does the number of specialists you have, and any process that goes through them starts to bottleneck. If all of your testing has to go through one or two specialists, then you’re creating blockers.
The key here is to pivot from custom tests run by a small group of specialists to core tests that can be run by a wider group of people. Ideally, every engineer should be able to run a basic battery of scalability, redundancy, and dependency tests. While this means you’re not running custom experiments against some of the more complicated failure modes, it has the benefit of removing bottlenecks, increasing how broadly you test, and catching a broader range of core reliability risks.
Automate what you can
Automation reduces bottlenecks and the amount of lift required to perform tests. Running an experiment or suite of experiments manually means each person has to set aside time to set it up and run it.
Reliability Management is an invaluable tool for helping automate your reliability testing efforts. Once set up, it can run on a regular schedule, then report back any failed tests on your dashboard. You can review the dashboard and test results, prioritize your efforts, and focus your time on making reliability improvements instead of running tests.
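To make the pattern concrete, here is a minimal Python sketch of "run a standard suite, then report only the failures." It is an illustration under stated assumptions, not Gremlin's API: the names `ReliabilityCheck`, `CheckResult`, `run_suite`, `failing`, and `example_suite` are hypothetical, and the three checks are placeholders standing in for whatever scalability, redundancy, and dependency tests your tooling actually runs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReliabilityCheck:
    """One test in the suite (hypothetical structure, for illustration only)."""
    name: str
    category: str               # "scalability", "redundancy", or "dependency"
    run: Callable[[], bool]     # returns True if the service stayed healthy


@dataclass
class CheckResult:
    name: str
    category: str
    passed: bool
    detail: str = ""


def run_suite(checks: List[ReliabilityCheck]) -> List[CheckResult]:
    """Run every check; a check that raises is recorded as a failure."""
    results = []
    for check in checks:
        try:
            passed = bool(check.run())
            results.append(CheckResult(check.name, check.category, passed))
        except Exception as exc:
            results.append(CheckResult(check.name, check.category, False, str(exc)))
    return results


def failing(results: List[CheckResult]) -> List[CheckResult]:
    """Keep only the failed checks, which is what the team needs to review."""
    return [r for r in results if not r.passed]


def example_suite() -> List[ReliabilityCheck]:
    """Placeholder checks with hard-coded outcomes (assumptions, not real tests)."""
    def slow_dependency() -> bool:
        raise TimeoutError("checkout did not degrade gracefully")

    return [
        ReliabilityCheck("scale up under CPU load", "scalability", lambda: True),
        ReliabilityCheck("survive losing one host", "redundancy", lambda: False),
        ReliabilityCheck("tolerate slow payment API", "dependency", slow_dependency),
    ]


if __name__ == "__main__":
    for result in failing(run_suite(example_suite())):
        print(f"FAILED [{result.category}] {result.name} {result.detail}".rstrip())
```

In practice a scheduler (a cron job or a CI pipeline, for example) would call something like `run_suite` on a regular cadence, so that the only manual step left is reviewing the failures and deciding what to fix first.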
Leverage existing knowledge
It takes time and effort to gain expertise in resiliency testing, then to map all of that out into tailored test suites. If you can afford that time, then it can really be worth it, but not when you’re trying to do more with less. In situations like this, it’s better to lean on existing expertise and knowledge.
Gremlin has worked with hundreds of companies, and our team of experts has millions of collective hours of experience solving reliability problems. All of that knowledge and expertise has gone into isolating a core set of key tests in the recommended Reliability Management test suites. Instead of taking months to run experiments and build your own suites, you can use these suites and have teams running effective, valuable tests in a matter of hours.
Focus on high-value work
When people talk about doing more with less, they’re really talking about trying to be as efficient as possible, even if it means comprehensive coverage takes a back seat. Instead of spending a lot of effort to spread a thin layer of testing across 100% of your systems, this is the time to maximize your effort by focusing on getting valuable results with 80% test coverage.
Between Gremlin’s Detected Risks and Reliability Management test suites, you have your common failure modes covered. These failure modes account for a significant portion of major outages and, more importantly, they usually lead to larger, more impactful outages. So by embracing pre-built, automated, and low-lift methods of testing, you can create more value with less work.
We’re here to help
While our roots are in the custom Fault Injection experiments of Chaos Engineering, all of us at Gremlin have worked hard to make it easier and simpler for teams to run resilience tests with Reliability Management. These validation tests can be automated, scheduled, and controlled centrally to minimize the time and effort you and your team spend running tests. The goal: find and fix your reliability risks with less time and energy than running manual tests and chasing incidents.
If you find yourself facing impossible resiliency goals with a small team and few resources, we’re here to help you make your systems more reliable.
Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30-day trial.