Ronnie Chen: "Staying Alive: Patterns for Failure Management from the Depths" - Chaos Conf 2018
Hi everyone. So, this talk is about failures. What they look like, how to prepare for them, and how to survive them. My name is Ronnie Chen, and I'm an engineering manager at Twitter. I'm also a technical diver.
My first introduction to the world below the surface was when I was standing on a tiny rowboat in the middle of a lake when I was about two years old. I saw some seaweed floating in the water, so I reached over for it. I lost my balance, and fell in. Everything went dark.
I passed out in that water, and almost drowned that day. Since then, I've had a healthy fear of the water my entire life. Diving was a way to give me a new relationship with the ocean. A way for me to explore safely. To help myself become more comfortable, I started pursuing more and more training. First, advanced certifications, then rescue diving. And then I started my technical training. In my first year, I did almost 200 dives.
My training certified me to go on increasingly deeper dives. I learned how to safely explore shipwrecks buried way below recreational scuba limits. Once I learned how to dive on a rebreather, I was able to do expeditions that lasted up to five hours in a single stretch. A rebreather is a portable gas blending and recycling device. It purifies the air that you're exhaling so that you waste less gas. This is the same life support technology that's used in spacesuits.
330 feet below the surface means that your body is being exposed to 10 atmospheres of pressure. The deeper you go, the higher the partial pressure of nitrogen in your breathing gas, which acts as a narcotic that slows down your brain.
Time literally starts to slow down. As you breathe, gas accumulates in your bloodstream, and it needs to be off-gassed slowly, or else you face decompression sickness, which can cause muscle pain, vertigo, nausea, difficulty breathing, what's called the chokes, paralysis, internal bleeding, and of course, death.
In these conditions, you can no longer rely on the surface as a means to escape. You're frequently hours from the surface, and even longer from emergency medical support.
That's just a fraction of the things that can go wrong while diving on a rebreather. I'm not going to go through the whole list, but one of my favorites is number 12, which is that the chemical scrubber that's used to remove carbon dioxide from the gas that you exhale can turn into a caustic soda that will give you chemical burns on your mouth, airways and lungs. Yes, mixed with water, the thing that is purifying your breathing gas will literally kill you.
Underwater explorer Jill Heinerth gave a talk on rebreather safety when I first started technical diving. If you own a rebreather for five years, two percent of you are going to die on it. One person out of 50 in just five years. When you're a diver facing these kinds of odds, you know that your survival depends on your ability to persevere through failure, because avoiding them isn't an option.
When we think about failures, we tend to think that they occur because of a solitary catalyst, but that's a really simplistic and harmful way to think about failures. If it were that simple, mortality rates for diving wouldn't be as high as they are. All you'd have to do is slap in a redundant system, budget in a little bit of extra breathing gas, and you'd be done.
Every person who starts technical diving already knows how failures start. Your very first day of technical training is when you start preparing for compounding failures. In the pool, on that first day, in the middle of doing all of those safety drills, my dive instructor swam up behind me when I didn't notice, and reached over and turned off the gas on my tank. My next breath never came.
How I responded in those next few moments would determine whether or not I was able to continue pursuing technical training. Panic, freak out, swim to the surface, and I would be removed from the class. If I could figure out how to survive and use the tools and my training, my lessons could continue. In the kinds of environments that we would be exposed to, every person on the dive team needed to be able to deal with compounding factors and multiple catalysts.
In all of my years of diving, in the hundreds of dives that I've done, I've never done the same dive twice. The current was different, the surge, the chop at the surface, the visibility, the water temperature, the time of day, the time of year, my dive team, my gear, my level of experience, my state of mind. A thousand tiny interactions that all contribute to every outcome, good or bad, on every dive. No dive has ever been under the same conditions, and no failure has, either.
Here, we see a failure cascade. A catastrophe where a chain of factors all contribute to the progression. You have a rebreather malfunction, which maybe you would have caught if you hadn't skipped your pre-dive check. Your back-up tank has a leak, and it's running low because of a faulty O-ring. Your buddy is too far away from you, and they're distracted by something else. The dive light doesn't get their attention because the beam has lower contrast in the daytime.
In all of this activity, you accidentally kick up some silt, and the visibility drops. Your air consumption goes up because you're stressed, and you breathe through the last of your air reserves in your tank, at which point, you swim for the surface, even though you have a decompression obligation. That's a catastrophe.
But, a change in any of these circumstances could have stopped that cascade in its tracks. At least eight different things had to go wrong for this critical accident to occur.
A change of time, a difference in visibility, the choice of a different partner, and the story that you tell at the end of your day would be different. Was that O-ring in item number three faulty because it was a factory defect? Was it because it got pinched when you were screwing in your regulator because you were distracted in the parking lot earlier? Was it under greater stress because it was a cold day? Or because you store all your gear in the parking garage, and the exhaust fumes degraded the rubber?
If you focus just on that initial factor, that rebreather malfunction, you have failed to capture all of the nuances of how that accident occurred. A post-mortem that blamed this incident on a simple mechanical failure would cover only 12.5% of all of the issues that conspired to lead up to this accident. Focus your investigation there, and what's upstream of that, and you're choosing to ignore all of the ways that situational and behavioral factors impacted your outcome.
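To make that arithmetic concrete, here is a minimal sketch that treats the cascade as a list of eight contributing factors, all of which had to occur. The factor names are paraphrased from the scenario above, and the all-or-nothing model and function names are simplifying assumptions for illustration, not a real accident model.

```python
# Hypothetical model: the accident happens only if every factor occurs.
CASCADE = [
    "rebreather malfunction",
    "skipped pre-dive check",
    "leaking backup tank (faulty O-ring)",
    "buddy too far away and distracted",
    "dive light hard to see in daylight",
    "silt kicked up, visibility drops",
    "stress raises air consumption",
    "ascent despite decompression obligation",
]


def accident_occurs(factors_present):
    """The cascade completes only if all contributing factors are present."""
    return all(factor in factors_present for factor in CASCADE)


def postmortem_coverage(factors_examined):
    """Fraction of the contributing factors that a post-mortem looked at."""
    examined = [factor for factor in CASCADE if factor in factors_examined]
    return len(examined) / len(CASCADE)


everything = set(CASCADE)
print(accident_occurs(everything))                               # True
print(accident_occurs(everything - {"skipped pre-dive check"}))  # False
print(postmortem_coverage({"rebreather malfunction"}))           # 0.125
```

Removing any single factor stops the cascade in this model, and a post-mortem that examines only the mechanical failure covers one factor in eight.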
So then, what does it even mean to have safety in a complex system? Most of the time, we can't even reliably identify all of the contributing factors for this complex failure. It's a long and frustrating battle to try to play Whack-A-Mole to eliminate failures one-by-one.
Staying alive in this world requires us to prepare ourselves to manage those inevitable failures when they happen. Pressure makes people worse at carrying out even the most basic tasks, and it's so important to build up familiarity before you need it. One of my first tech training drills combined a very standard out-of-air situation with running a safety line. Two incredibly simple tasks that I was already comfortable with independently. Out-of-air, you share your regulator and put your own back up in your mouth. Running a safety line, you hold it, keep tension, and swim. Put them together, and I hit my cognitive load. My faulty human brain simply couldn't keep my buoyancy and handle both things together.
My instructor swam up to me and signaled that he was out of air. I gave them my regulator, but in doing so, released tension on my line. Now it's catching on my fins. By the time I put my own back up in my mouth and catch my breath, they're swimming for the exit, but my line is tangled around my fins. As this happens, the air just keeps dropping.
My air consumption actually tripled in this exercise. Stress had completely overloaded my brain, but after we practiced that drill a few times, it was just simple. Share, back up, keep swimming.
When the pressure is on, whether it's being on call or managing an outage, having built that confidence and that muscle memory from running previous drills frees up your mental CPU for key decision-making. Even excellent, detailed run books are not a match for exposure to the real process. Don't wait for a failure to test that recovery plan. If you actually do run-throughs and practice operating under pressure, you'll not only understand the limits of your system, but raise your own capacity to act smoothly as circumstances get tougher.
By the time I was ready for my rebreather certification tests, a three-part failure just felt routine. Straightforward enough that another instructor friend of mine took it upon himself to do some additional fault injection during my exams. While I was going through my drills, he was floating above me in the water, selectively deflating buoyancy gear, tangling up my line. What he didn't know that day was that I was also fighting through a wicked case of the stomach flu, which was making me seasick in the unusually high chop that day.
Layers and layers of chaos. Nothing that I could have anticipated or practiced for without being in that situation. Surviving that experience gave me the tools to keep my cool when I faced real incidents that started off challenging and got tougher. Nothing could have prepared me for how to handle that stress except actually being there.
One strategy that we use a lot in dive training is having the least experienced person lead the dive. Your whole team works together, runs the drills, makes the dive plan, does contingency planning, but the person who's actually leading the mission, who swims in front, is the least experienced person in the group. This is a suggestion that tends to make a lot of people uncomfortable. If that's you, I'm going to ask you to lean into that feeling a little bit, and examine why it makes you feel that way. What are you nervous about? And what does that say about your confidence in the system?
One thing that becomes clear when you operate in this way is that the best way for the mission to succeed is to make sure that all of the members of the team are succeeding. If your most experienced person leads, you're more likely to leave someone behind without realizing it. If you want to increase the chances that everyone makes it back to the boat, and that the team isn't pushed past its true operating capacity, look to bring up the floor. Your floor of experience.
In failure incidents, the limiting factor to how bad it gets is measured by that floor. If your inexperienced people are leading, they're learning and growing, and being able to operate with a safety net. When you do this, all kinds of hidden dependencies reveal themselves, too. Every undocumented assumption, every piece of ancient team lore that you didn't even know that you were relying on, comes to light.
If you fail to lift the newcomers up, you have to do everything yourself. You become a single point of failure that's weighing down the team. The moment that you're compromised, the operating capacity of your entire team tanks. Where's the resilience in that?
What happens when a team is trained and designed to ensure that even the most inexperienced person is empowered to make mission critical decisions? Everyone gets to tackle harder missions, because your overall competency level is elevated. Every time you do this, it's an opportunity to help your team build better judgment, and ultimately that's what it's all about.
Because complex systems are constantly evolving, the best weapon for failure that you have is judgment. All the training and rules are just a proxy for that. Training gives you a solution for a known well-defined problem. But, we know that the real incidents that we face are much more chaotic. In the midst of an incident, you're making a decision call based on the available information, under time pressure, and your judgment is what you're counting on to take you home at the end of the day.
So, there's a couple of ways that we can refine our judgment so that we're better at making these decisions. Some of these may look familiar to you in the tech industry. Post-mortems are the one that everyone already knows, but remember, if you focus on a single event, you're ignoring the bulk of the problem. Blameless isn't just about feeling good and making sure that everyone doesn't feel bad about what happened. When you focus on pointing blame, you're blinding yourself to all of the other factors in play, and engaging in post-mortem theater.
You want to look at your entire safety framework and identify key pieces where the cascade could have stopped, but didn't. Your whole goal is to gain a better understanding of the entire ecosystem, so you can address the weak points, but you don't have to wait for a failure to happen organically to start doing this kind of work. A pre-mortem is an effective way to start surfacing doubts and concerns and make contingency plans.
Before a dive, we would discuss the risks in play, and talk about what would happen if we had to abort a dive early, or the current is stronger than we expected, or if the rollout that needs to take eight hours goes haywire 30 minutes in. What are your options?
Talking about mitigation strategies for these failures helps to prepare you for when the pressure is really on. When you're doing this exercise, look at both the likely failures and the high regret failures. Ignoring those high regret incidents means you're playing the odds, and when that happens, you have to be willing to experience that corresponding level of regret.
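One way to picture that pre-mortem exercise is a short list that keeps a risk if it is either likely or high regret. This is only a sketch: the `Risk` fields, the 1-to-5 regret scale, the thresholds, and the example numbers are all hypothetical and were not part of the talk.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: float  # rough guess between 0.0 and 1.0 (assumed scale)
    regret: int        # 1 = minor, 5 = catastrophic (assumed scale)


def premortem_shortlist(risks, likely_threshold=0.3, regret_threshold=4):
    """Keep risks that are likely OR high regret, without duplicates."""
    likely = [r for r in risks if r.likelihood >= likely_threshold]
    high_regret = [r for r in risks if r.regret >= regret_threshold]
    seen = set()
    shortlist = []
    for risk in likely + high_regret:
        if risk.name not in seen:
            seen.add(risk.name)
            shortlist.append(risk)
    return shortlist


risks = [
    Risk("current stronger than expected", likelihood=0.4, regret=2),
    Risk("rollout goes haywire 30 minutes in", likelihood=0.1, regret=5),
    Risk("minor gear annoyance", likelihood=0.05, regret=1),
]

for risk in premortem_shortlist(risks):
    print(risk.name)
```

The unlikely but catastrophic rollout failure stays on the list even though a likelihood-only filter would drop it, which is the point of looking at high regret failures separately.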
Fire drills. Running a fire drill allows you to prove that your plans and your safety systems are functional. Doing this well requires that you start with the basics. Following a run book, verifying that it does what it claims to do. You need a shared baseline confidence in daily simple failure recovery before you can start getting fancy. Only then can you start adding complexity. Designate an arsonist to create a failure scenario and evaluate your response. See where your weak points are, and iterate from there. Every time you run a scenario, it's going to be different, just like every dive is different. Do you know what to do once you've identified the problem? Are your tools sufficient to resolve the issue? What training gaps have you identified?
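As a rough sketch of what a small fire drill could look like in software, the example below wraps an operation with an "arsonist" that injects failures and then checks that a simple recovery step copes. Every name here (`arsonist`, `runbook_fetch`, `fetch_primary`, `fetch_backup`) is hypothetical, and the fallback logic stands in for whatever your own run book actually prescribes.

```python
import random


class InjectedFailure(Exception):
    """Raised only by the drill, never by the real system."""


def arsonist(operation, failure_rate, rng):
    """Wrap an operation so that it fails some fraction of the time."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"drill: {operation.__name__} failed")
        return operation(*args, **kwargs)
    return wrapped


def fetch_primary():
    return "primary"


def fetch_backup():
    return "backup"


def runbook_fetch(primary, backup):
    """The recovery step under test: fall back to the backup on failure."""
    try:
        return primary()
    except InjectedFailure:
        return backup()


rng = random.Random(0)  # seeded so the drill can be repeated exactly
flaky_primary = arsonist(fetch_primary, failure_rate=0.5, rng=rng)
results = [runbook_fetch(flaky_primary, fetch_backup) for _ in range(20)]
print(results.count("primary"), "served by primary,",
      results.count("backup"), "served by backup")
```

Starting with a single injected failure and a single recovery step mirrors the advice to get the basics working first; more failure types can be layered on once this baseline holds.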
And last, revisit the past decisions that you've made. When we look back on our actions, we tend to restrict it to cases where failure has occurred, the classic post-mortem. But, survivorship bias means that we tend to assume that if nothing bad happened, your judgment was sound and the decisions were good. But, if you revisit successful operations, you may find a dependency on blind luck that lets you know that the clock is ticking until a repeat incident happens. Revisiting your rationale and sharing it is how judgment across your team gets refined.
The last thing I want to talk about today is success. In diving, the best rescue is the one that you didn't have to do. The best incident response is when there's not an incident at all. Our culture has a tendency to reward heroic action, but that's based on response to failure. It's really important to recognize that true success lies in a job done quietly and without incident day after day. When a failover is boring and routine, that often goes unrecognized. Nobody gets pizza and gets sent home early. But, we do the opposite of that all of the time. So, we need to remember that success can be boring and safe.
If you're interested in learning more about complex systems and how they fail, here's some links to get started, and I'll be sharing these slides from my Twitter account after this talk. With that, I hope that I gave you some concrete suggestions to make the work that you do a little bit safer. Thank you.