Podcast: Break Things on Purpose | Ep. 6: Subbu Allamaraju, Senior Technologist at Expedia
Break Things on Purpose is a podcast for all-things Chaos Engineering. Check out our latest episode below.
You can subscribe to Break Things on Purpose wherever you get your podcasts.
If you have feedback about the show, find us on Twitter at @BTOPpod or shoot us a note at podcast@gremlin.com!
In this episode, we speak with Subbu Allamaraju, Senior Technologist at Expedia.
Transcript of Today's Episode
Rich Burroughs: Hi, I'm Rich Burroughs, and I'm a Community Manager at Gremlin.
Jacob Plicque: And I'm Jacob Plicque, a Solutions Architect at Gremlin. And welcome to Break Things on Purpose, a podcast about Chaos Engineering.
Rich Burroughs: Welcome to Episode 6. We release new episodes monthly, which means that we've been at it for half a year now. It's been a really great opportunity to talk with such smart people about things like Chaos Engineering and resiliency.
Jacob Plicque: Wow. It's been a half a year? That's amazing. Thanks everyone for listening.
Rich Burroughs: Yeah, it's been a lot of fun. For this episode, we spoke with Subbu Allamaraju from Expedia. Jacob, what stands out to you from our conversation with Subbu?
Jacob Plicque: So one great thing that stood out was his unadulterated passion for reliability in all of its facets, and how it drives him both personally and professionally. What about you?
Rich Burroughs: I really liked his points about business realities. Businesses don't have unlimited resources and we're always making trade-offs. And his point about understanding how your business makes money.
Jacob Plicque: Very true. So great, let's go now to the interview with Subbu.
Rich Burroughs: Today we're speaking with Subbu Allamaraju. Subbu is a Senior Technologist at the Expedia Group. Welcome.
Subbu Allamaraju: Thank you. Thanks for having me here. Really excited to be talking about all things chaos.
Jacob Plicque: Absolutely. So why don't we just kick it off talking a little bit about your background. Looking at your LinkedIn, it looks like you spent some time at Yahoo in the 2000s, and at eBay, and you've also written some books on programming.
Subbu Allamaraju: Yes, I did. Let me walk you through my journey into what I call understanding system safety. My experience in this area started around July of 2012, when I had a chance to jump into cloud infrastructure and learn how to [inaudible] infrastructure. At a small scale: if I remember, the cluster of infrastructure that I inherited was 96 nodes. It was running OpenStack at the time, and I essentially played a role in understanding how to run it at a certain scale. And over the next three years, the rest of the team and I managed to bring it up to 14,000 nodes.
Rich Burroughs: Wow.
Subbu Allamaraju: If I remember correctly. And I think that made me go through an entire journey of understanding how things work, how to automate, and how things fail. And how to react to things when they fail. I think that's how my journey started.
Subbu Allamaraju: And in 2016, I joined the Expedia Group, where I was asked to help lead the cloud migration for the company. We have on-prem data centers with a ton of [inaudible] platforms running on them, and we have started a journey to take them to the public cloud, which meant that I had to learn, understand, and influence others on how to build resilient architectures in the cloud. So it's been a fairly long journey of dealing with system safety, automation, DevOps culture, governance, and a whole bunch of other topics. That's been my experience.
Rich Burroughs: Yeah, that's great. You recently did a talk at OSCON, and you also wrote a blog post about it. Jacob and I both read the blog post and we really loved it. I think we're going to want to kind of dig into some of the things that you covered in there first, for a little bit here.
Subbu Allamaraju: Absolutely.
Rich Burroughs: Yeah. So you mentioned in the post that you had studied 1500 incidents in a year's time. How did that come about?
Subbu Allamaraju: So let me take a step back. I think that was one of the most fruitful pieces of work I did in the last two years. As part of our cloud journey, I was pretty worried from the beginning about making sure that as we got to the cloud, we were getting better in terms of our architecture, more particularly about the availability of our platforms to customers, making sure that we are able to serve traffic around the clock, in multiple countries, across multiple lines of business. So how do you get to that architecture in a very systematic, incremental manner, so we're getting better as we get to the cloud and increase our investments?
Subbu Allamaraju: My initial push was to create a set of guardrails to help teams make trade-offs. As an example, a guardrail could be, “Automate everything.”
Rich Burroughs: Sure.
Subbu Allamaraju: Or, make sure you build multiple fault domains internally. We call this the Vegas rule. Essentially, you create multiple fault domains and make sure that there's no crisscross of traffic between those fault domains. So when there is an incident, or a failure in one of the fault domains, you can shift traffic away, which is a classic, well-known pattern. And so, I started with some of those guardrails in mind. Let's call them hypotheses, to help make trade-offs. And as I approached late 2017 and early 2018, I was beginning to doubt myself, because I knew that we don't automate everything. We automate what's most important, but we don't automate a whole lot of things.
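[Editor's note: To make the "shift traffic away" idea concrete, here's a minimal sketch of draining a fault domain using weighted DNS records via boto3 and Amazon Route 53. The hosted zone ID, record name, and fault domain identifiers are hypothetical placeholders; this illustrates the general pattern Subbu describes, not Expedia's actual implementation.]

```python
# Hypothetical sketch: drain traffic from an unhealthy fault domain by
# setting its weighted DNS record to zero, so the healthy domain takes over.
import boto3

HOSTED_ZONE_ID = "Z0000000EXAMPLE"   # hypothetical zone
RECORD_NAME = "www.example.com."     # hypothetical record

def set_fault_domain_weight(route53, set_identifier: str,
                            target_ip: str, weight: int) -> None:
    """UPSERT a weighted A record; weight 0 drains that fault domain."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": f"Set weight={weight} for {set_identifier}",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "A",
                    "SetIdentifier": set_identifier,
                    "Weight": weight,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": target_ip}],
                },
            }],
        },
    )

if __name__ == "__main__":
    r53 = boto3.client("route53")
    # Incident in fault domain A: drain it, keep B serving all traffic.
    set_fault_domain_weight(r53, "fault-domain-a", "192.0.2.10", weight=0)
    set_fault_domain_weight(r53, "fault-domain-b", "192.0.2.20", weight=100)
```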
Rich Burroughs: Sure.
Subbu Allamaraju: We have great ideas and architecture available from prior experiences at other companies, as well as plenty of literature. And yet, we don't invest enough. This is true across the board, and I don't necessarily mean just the Expedia Group, because we have a lot of priorities, a lot of work on multiple fronts: adding features, growing the business, as well as re-architecting, re-platforming, and a whole bunch of other work.
Subbu Allamaraju: So how do you influence teams to invest in this area, and how do I make a reasonable hypothesis that teams can actually buy into? I wanted to push my opinions aside and understand what is actually happening in the company. What is happening in terms of incidents? How do people react to them? What are the patterns across incidents?
Subbu Allamaraju: So at this time last year, in 2018, I looked at a sample of a hundred incidents to see what was going on in them. Most of the applications that were impacted were on the cloud. I was just making a simple observation about the change behind each incident, something that happened before the incident. And I noticed around 70% of incidents were triggered by a change. That made sense, because we have invested quite a bit in CI/CD in the last two or three years, and that has meant a lot of speed. Teams are able to move faster and faster.
Subbu Allamaraju: And so, I noticed the pattern, and I showed this to a number of my colleagues and peers, and people said, "Oh, goodness, this is a big number. So what do we do about it?" And so, we started thinking about release safety. How do you make releases safely into the cloud? How do you make sure your pipelines have some sense of progressive delivery? And so, later last year I spent about 10-plus days of my winter break looking at a bigger corpus of incidents. I looked at around a thousand incidents from January 1st of last year to the end of 2018. And that showed me a similar pattern.
Subbu Allamaraju: And in addition to changes, I also noticed configuration drift. There are other patterns too, including a set of incidents where we don't quite know why they happen, but they happen. So I came across a number of interesting patterns, and I keep doing that work mainly to find a way to learn from incidents. Because oftentimes, as you found in my blog post and the talk at OSCON, most of us see incidents as things that are not supposed to happen.
Jacob Plicque: Yeah.
Subbu Allamaraju: We don't like them. We believe they are annoyances, because we have work to do in a sprint. We want to deliver the features and other good stuff. And so, when an incident happens, we get distracted, we don't like it, and we think something didn't work as it was supposed to, or someone didn't do their job as he or she was supposed to. That's how we treat incidents.
Subbu Allamaraju: I want to change that culture. I want to say that incidents are feedback from the systems. Incidents are not annoyances, and there is no single root cause, so we need to use a different language when we speak about incidents.
Subbu Allamaraju: So that was the intent of going through all this, and I formulated some ideas on how to learn from incidents. I've fundamentally come to believe that by learning from incidents, we can influence our architectures and rethink how we invest in our tools, technologies, and culture. A whole bunch of avenues opened up when we studied incidents, and that's what I saw.
Rich Burroughs: That's fantastic. I have to say, Subbu, I follow you on Twitter, and I've read a number of your blog posts, and you strike me as a very, very thoughtful person. I find it really interesting that somebody with your level of experience in the industry didn't just go off your gut and say, "Hey, I know how this stuff works," but actually dug in and did a bunch of research to try to confirm or disprove your hypothesis about what's going on.
Subbu Allamaraju: Actually, there's one additional thing I learned in this process: I began to question Chaos Engineering itself. Why should it actually work? What's the theoretical, mathematical, or even philosophical background to say that it should work? I wanted to answer that question for myself. And in the course of trying to answer it, I came across a number of works by folks like Richard Cook and others, and actually the best book I read was Drift Into Failure, which helped me understand why systems fail. This kind of foundational research helped me make sense of system stability and safety.
Rich Burroughs: For those folks listening, we'll link to some of this in the show notes, including Subbu's blog post. I think you're talking about How Complex Systems Fail, that paper by Richard Cook. We'll link to that as well. And then the Dekker book, Drift Into Failure.
Rich Burroughs: It's funny that you mentioned the Richard Cook paper, because that came to my mind when I was reading your blog posts.
Jacob Plicque: Exactly.
Rich Burroughs: There's a number of things that you talk about that sort of echo some of the points that he brings up. And one of the things that I find really fascinating about that paper is, it's just 18 bullet points, but there's so much information in there; they're so dense. He talks about the fact that our systems are always running in a somewhat degraded state. And I love that idea that these incidents are points of feedback for us, and not something that we should look at in a negative way.
Subbu Allamaraju: Absolutely right. In fact, once you finish reading Drift Into Failure, you realize, yes, things are always in some state of drift, and they fail. What do we do about it? I don't think you find answers to that question. But on the other hand, once you start thinking of systems giving you feedback, and you're constantly on the lookout to learn from incidents, then you get a different approach. Then you think of guardrails, and you think of how you invest in your culture and your processes to take feedback from the systems and react to that feedback on an ongoing basis.
Jacob Plicque: I think for me, the trap that I fell into three or four years ago, right before I learned about Chaos Engineering, was the assumption that incidents are not preventable. I kind of enjoyed being the incident superhero. Not that anyone wanted to be woken up at 4:00 in the morning, but I always wore it as a badge of honor. And so I love that this levels it up and says, "Let's gather all this data and see what I can learn." Because a key component of Chaos Engineering is having that hypothesis and proving it out. And you took all that data and put it in this blog post, which I think is really cool.
Jacob Plicque: But one thing that I'm really interested in finding out more about is: that number of 1500 is really high, right? But did you find that it was a particular subset of industries, or different sizes of companies, or anything like that? Or did any of that even matter?
Subbu Allamaraju: I don't have a general answer; I'm actually looking for answers from my fellow practitioners in the industry. What I found was that within the Expedia Group, we have systems that are young and old, based on investments at different points in time. Some systems are a bit older than we wanted, and at the same time we have newer systems. And so, across this gamut there is a set of patterns, and what I really want to do in the coming months, with my teams at Expedia, is look into specific domains and see if the patterns are different, and let those teams form hypotheses based on those observations.
Subbu Allamaraju: And then invest in either improving architectures, improving release safety practices, or testing for failures at different levels. I think all those things can follow from the specific patterns in a specific domain. At work we have a number of fault domains and a number of systems, so I think those might provide more helpful clues.
Subbu Allamaraju: You touched on something about forming hypotheses. I think there is a tendency, and this is something we can do a better job on in the community: there is a belief that randomly testing for failures will make systems resilient and eliminate a number of incidents. I think we should start explaining that that is not the case. You have to form hypotheses; you have to first know the safety boundaries of your system. Because if you start randomly breaking things on purpose, you are going to cause harm to your customers and your stakeholders.
Subbu Allamaraju: So you want to know the boundaries. How far can you go? And for that, you have to really have an understanding of the existing system, its architecture, its fault boundaries, so that you can play within those boundaries. That's something I would really insist on for people doing this kind of work.
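[Editor's note: Here's a minimal sketch of the hypothesis-first approach Subbu describes. The metric checks and fault injection are stubbed out, and the hypothesis text and blast radius are hypothetical; the point is that an experiment states its steady-state hypothesis and safety boundary up front, rather than breaking things at random.]

```python
# Hypothetical sketch: a chaos experiment as a stated hypothesis with an
# explicit blast radius, not a random failure.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Experiment:
    hypothesis: str                    # what we believe, stated up front
    blast_radius: str                  # the safety boundary we stay inside
    steady_state: Callable[[], bool]   # True while customers are unaffected
    inject: Callable[[], None]         # introduce the failure
    rollback: Callable[[], None]       # halt and restore

def run(exp: Experiment) -> bool:
    print(f"Hypothesis: {exp.hypothesis} (blast radius: {exp.blast_radius})")
    if not exp.steady_state():
        print("System not in steady state; refusing to start.")
        return False
    exp.inject()
    held = exp.steady_state()          # did the hypothesis hold?
    exp.rollback()                     # always restore, pass or fail
    print("Hypothesis held." if held else "Disproved; we learned something.")
    return held

# Example wiring with stubs (replace with real metrics and fault injection).
run(Experiment(
    hypothesis="Checkout error rate stays under 1% if one cache node is lost",
    blast_radius="one cache node, one fault domain, staging traffic only",
    steady_state=lambda: True,         # stub: query your error-rate metric
    inject=lambda: print("stopping one cache node (stub)"),
    rollback=lambda: print("restarting cache node (stub)"),
))
```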
Rich Burroughs: It's interesting to me. So people talk a lot about Chaos Monkey, and the model for Chaos Monkey was, this thing is always running. It's going to kill your instances. So if you deploy an application into this environment, you know that it has to be resilient to that form of failure, right? Because it's just going to happen constantly. And I feel like there is value to that, because it's a forcing function, you know? I mean, people just know that when they're going to go into that environment, they have to be able to withstand that form of failure. But I also think that there's a lot of value to what you're talking about to experimentation that's coming from a hypothesis and a more kind of thoughtful point of view.
Subbu Allamaraju: Right. I think there is a balance; you have to find out what is right for your environment. If you are starting fresh, you could start with an environment that has certain conditions, certain rules of the road, saying that any instance could go away, all infrastructure is ephemeral, and there is this monkey killing your instances. That could be a rule of the road, to force you into that. Or you could have a rule of the road that says no compute in this environment shall talk to compute outside it, except through well-defined means, and that puts in a safeguard.
Subbu Allamaraju: I've seen companies where that kind of model makes it easy to force-fit you into fault domains so that you can actually fail over nicely. So I think depending on the environment, you can make different hypotheses. Having that understanding and awareness is, I think, the key.
Jacob Plicque: And my assumption is, I think we all agree that there's a place for both in our world, so to speak, but that it's not necessarily where folks should be starting. I don't want to say there's a fallacy behind Chaos Monkey, but that's kind of where folks are getting their hands dirty, saying, "Okay, well, it has to be random," when we know that's not the case, which I think is what we're all agreeing on. Right?
Subbu Allamaraju: That's right.
Rich Burroughs: I agree with that as well. I think your point, Subbu, is great, in that it's going to be a lot easier to do that sort of work in a new environment where people just know, you know? It's sort of the "you have to be this tall to get on the ride" thing, you know?
Subbu Allamaraju: Right.
Rich Burroughs: They just know that those are the operating conditions of that environment. It's going to be harder to force that sort of randomness on a longer-lasting environment, something that's been around for a while. You mentioned older applications. I'm glad you didn't use the term "legacy," because that has such negative connotations for so many people. But the reality is that if you've got a team that's been operating together for a while and they're used to doing things a certain way, you're going to have difficulty, I think, coming in and saying, "Tomorrow we're going to start randomly shutting off your systems," you know?
Subbu Allamaraju: Right. And in fact, let me pick up the point you were making about legacy versus older. I think this is an important point. Nobody has infinite resources and time. We have finite resources and finite time, and based on that, we make investments in different areas. This year, we may invest in X. Next year, we may invest in Y. And when you don't invest in Z for three years, it gets old. That's just a fact of life, and that's okay.
Subbu Allamaraju: I make the point because when you think of investing in a resiliency activity, or chaos testing, or something else, we have to be able to articulate the value of it. It's not going to come for free. So you have to make an argument: by doing this, I'm going to generate this much value for the company, value for the team. I would actually say that one of your hypotheses should be about the value you're trying to bring. And based on that value, you decide what hypothesis you want to test, how you want to test it, and when and why you want to test it. So I think once you bring in value, things will change.
Jacob Plicque: That actually really gets at something that I talk a lot about, or get questions about, which is where to start. I tend to go down, I won't say a rabbit hole, but I tend to talk a lot about incident reproduction and runbook validation. But I think you've found an interesting way to really bullet-point it, in that maybe I'm spending time building a hypothesis for something that doesn't provide me much from a value perspective.
Jacob Plicque: Sure, I have some additional knowledge based on my findings. But, as an example, does this allow me to confirm that I'm able to fail over to another availability zone or region, versus spiking up CPU on one host, right? So I think that's a really succinct way to put it.
Subbu Allamaraju: Exactly. So imagine you have, let's say, two lines of business, and one line of business brings you a million dollars a day, and the other one brings you a thousand dollars a day. So you have different values attached to these two lines of business. You should spend more time, money, and resources on improving the architecture for the higher-value thing than the lower-value thing. And that's totally okay, I think. Having that business context, bringing that context into the picture, will help you pick and choose what you want to do instead of saying, "This is the rule for everything."
Rich Burroughs: Yeah. You mentioned that word trade-offs earlier, and that's what this is all about, right?
Subbu Allamaraju: Exactly.
Rich Burroughs: Every business has realities, and you can't have all the resources that you want. You're not going to get the resources for everything that you would like to do in a dream world, you know? And so, I think that ability to demonstrate value is really important.
Rich Burroughs: If you were someone spinning up a new Chaos Engineering program, how is it that you would want to try to demonstrate that value to the upper management folks?
Subbu Allamaraju: I think there's no one single answer, but I'm looking for examples in the industry of doing this successfully. I'll give you an example that I'm familiar with, from one of my teams. We were having a debate between taking an application, a fairly complicated, important application, from the data center to one region in the cloud, versus investing a bit more time and spinning up the same stack in two regions, making sure there's a way to operate in an active/active model with some trade-offs.
Subbu Allamaraju: Both of these have different engineering investments and timelines attached to them. And now the question is, how do I motivate the team to make the investment to do the latter, which is going into two regions and having the redundancy, the ability to fail over, and also testing that the failover can occur? To make the argument, we really had to go back and ask: how much traffic are we serving? How much revenue does this piece of work generate? And how much downtime can we tolerate? Is 15 minutes reasonable? Is two hours reasonable? What is the reasonable amount of time that the system can be down for?
Subbu Allamaraju: So once we started bringing those numbers into the picture, then we could say that by investing in B versus A, we can actually save this much revenue from getting lost in a routine incident. Now the product owner making the decision has a number to look at: "I'm looking at an investment of this much money to improve the architecture, versus a feature I want to build that might generate something else." Then I can make a comparison between the two and make a value-based decision to do one or the other. It's not perfect, but it can help make better decisions. When you have investment constraints, like X amount of money and time and resources, how do you invest them?
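[Editor's note: Here's a worked version of the arithmetic Subbu sketches. Every number below is a hypothetical placeholder; the point is the shape of the comparison, revenue protected per year against the cost of the second-region investment.]

```python
# Hypothetical numbers: compare the cost of an active/active second region
# against the revenue it protects during routine incidents.
daily_revenue = 1_000_000            # $/day for this line of business
revenue_per_minute = daily_revenue / (24 * 60)

incidents_per_year = 6               # routine incidents taking the region down
single_region_downtime_min = 120     # recovery time with no failover target
multi_region_downtime_min = 15       # time to shift traffic to region two

loss_single = incidents_per_year * single_region_downtime_min * revenue_per_minute
loss_multi = incidents_per_year * multi_region_downtime_min * revenue_per_minute
revenue_protected = loss_single - loss_multi

investment = 400_000                 # engineering + infra cost of option B

print(f"Expected annual loss, one region:  ${loss_single:,.0f}")
print(f"Expected annual loss, two regions: ${loss_multi:,.0f}")
print(f"Revenue protected per year:        ${revenue_protected:,.0f}")
print(f"Payback in ~{investment / (revenue_protected / 12):.1f} months")
```

With these placeholder numbers, the two-region option protects about $437,500 of revenue a year, so the $400,000 investment pays back in roughly 11 months; a product owner can weigh that directly against a feature of comparable cost.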
Rich Burroughs: Yeah. So you're talking about tying it to specific business metrics?
Subbu Allamaraju: Yeah, something along those lines. I think to make those decisions, you have to understand how your business works. How do you make money? How does your business work? Having that understanding will help you make those value-based trade-offs. If you're disconnected from those kinds of things, you may find yourself frustrated, because you don't know how to make a case for your ideas, your thoughts.
Subbu Allamaraju: Let's say you want to stand up a Chaos Engineering team and you need X number of people for X number of months or years. How would you make a case to get it funded? You need to equip yourself with data that makes sense, and for that you have to understand how your business works.
Rich Burroughs: Yeah, that's a great answer. I feel like risk is an important part of that equation too, you know? Like that scenario where you're only in the one region is great until it goes down.
Subbu Allamaraju: Right.
Jacob Plicque: Oh man, you beat me to it, Rich. As soon as you said that about the region, I had a grin on my face, because I think I mentioned this before on the podcast, but at my previous company we had a long conversation about whether or not multi-region was the right way to go, and we ended up turning it down due to cost. And then two weeks later, us-east-1 went down, and so did we. So it bit us right away. If we had more data, or had even done some experiments on it, we would probably have come to a different conclusion, I hope anyway.
Subbu Allamaraju: Absolutely. That's why engineers who are close to the technology must know how the business works, so that they can help make better trade-offs between saving dollars by turning off a region and investing in a second region, or some sort of failover mechanism.
Jacob Plicque: Right.
Subbu Allamaraju: So that you can prevent loss of customer experience. You need to be able to articulate value. In fact, in one of the talks I'm giving later this year at Serverlessconf, I plan to touch on this point. Because with any cool technology trend, like serverless, you still need to make the same case. How do you articulate value for the kind of transformation that you want to start at a company?
Rich Burroughs: Now, Subbu, in your blog post you actually mentioned, while we're talking about the region failover thing, that most of the incidents you saw in your research weren't coming from infrastructure problems. And you actually used region failover as an example of something you thought people overemphasize, in that they kind of expect it to happen more than it really does?
Subbu Allamaraju: I think if you look back over the last five to seven years, cloud providers have gotten better and better at keeping their fault domains, like regions, up and running. There are still failures. I don't say there are no failures, but the occurrence of those failures has become less frequent nowadays. And what is tripping up most companies these days is not a cloud provider going down, but their own applications not working as intended, which is fairly natural. Because we are moving things faster. We are deploying more frequently. Our architectures are becoming more and more complex, because we are building more and more microservices and interconnecting them in different ways to produce value.
Subbu Allamaraju: So there is complexity that is increasing, and this complexity makes the systems less deterministic. There are a lot more stochastic events going on in our systems, and we don't understand what's going on at any given time. It's a natural trend. So what's hurting us is not the cloud provider; more often, what's hurting us is our own applications and data.
Rich Burroughs: And you mentioned specifically changes being the cause of a lot of failures, and you mentioned configuration drift. It was interesting to read that because I'm somebody who comes from a configuration management background. I worked at Puppet for a while and was in that field, and that seems like an area that should be a solved problem by now, but it really isn't.
Subbu Allamaraju: It isn't, and it's fascinating that you mentioned Puppet. I learned about configuration management and drift, and even closed-loop automation, by using Puppet at a certain scale. Some of the earliest incidents I dealt with when I was at eBay were with Puppet, and not because Puppet failed, but because we were not able to detect drift in the infrastructure, and that led to a poor experience for a number of users over a period of time.
Subbu Allamaraju: And then we realized, "Oh, there's a name for this thing we are seeing. It's called config drift." That made me realize why it happens and why it's important to pay attention to it. I have come across a number of incidents where teams found different root causes, so-called root causes, but at the core it was configuration drift. Things were set up one way, and over time they drifted from that configuration, and suddenly there's a surprise on a given day. It's a fairly common occurrence. I've seen a number of incidents like that.
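[Editor's note: A minimal sketch of drift detection as Subbu describes it: compare the configuration a system is supposed to have against what is actually running, and surface the differences before they surprise you. The configuration keys and values here are hypothetical.]

```python
# Hypothetical sketch: detect configuration drift by diffing desired state
# against the configuration actually observed on a running host.
from typing import Any

def detect_drift(desired: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple]:
    """Return {key: (desired, actual)} for every setting that has drifted."""
    drift = {}
    for key in desired.keys() | actual.keys():
        if desired.get(key) != actual.get(key):
            drift[key] = (desired.get(key), actual.get(key))
    return drift

desired = {"max_connections": 500, "tls": "1.2", "timeout_ms": 3000}
actual  = {"max_connections": 200, "tls": "1.2", "timeout_ms": 3000,
           "debug_logging": True}   # someone hand-edited this host

for key, (want, got) in detect_drift(desired, actual).items():
    print(f"DRIFT {key}: desired={want!r} actual={got!r}")
```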
Rich Burroughs: Yeah.
Jacob Plicque: And that ties back to one of your major points, around the single root cause fallacy, I think, specifically. A great example of that for me is, I was a big Pokemon Go user back in the day, and I remember there was always this intermittent downtime, and I'd be like, "Come on, you guys know how to autoscale. It's not that difficult." When in actuality, of course it's that difficult, especially with the amazing amount of scale that it ended up being. So it's almost never as simple as it looks. You do a base-level root cause analysis, but you've really got to dig very, very deep there.
Subbu Allamaraju: That's a fascinating example. What I saw, again, looking at my own experience across a lot of teams, is that we don't get trained to operate our systems. Oftentimes you join a team, you get a pager, and you start being on call. I see a lot of my friends joining different tech companies, and they start their journeys like this: "Today I have a pager," and they feel good about it because they are empowered now to deal with incidents.
Jacob Plicque: Right.
Subbu Allamaraju: But what's happening is that, in this process, we are not really trained to understand complex systems and how complex systems fail. Most of us discover these things over time, many of us by fluke, I would say. Even my own introduction to complex systems and their failures was just by fluke, and I never approached it systematically. And that's true at a lot of companies, I think, except for a few folks like John Allspaw and others, who have been very systematic about talking about this and preaching it for a number of years.
Subbu Allamaraju: Other than that, most of us don't pay enough attention, and suddenly you're empowered, thrown into the pool, and told, "Hey, deal with these complex systems," and you have no idea how to deal with them. We start with this idea that systems are simple and linear, when the reality is that systems are non-linear and stochastic, and that realization doesn't happen overnight. That's how we end up getting locked into the single root cause fallacy: "Our systems are simple; only that one thing failed. So let's fix that, and the incident won't recur."
Jacob Plicque: Right.
Rich Burroughs: When the reality is by the time there's a big incident like that, a lot of times, multiple things have failed along the way.
Subbu Allamaraju: Exactly. You only learn that through experience, not formally. And that's something we need to improve in the industry.
Jacob Plicque: Agreed.
Rich Burroughs: So that's one of the things that I've actually found really exciting about Chaos Engineering, is that idea of being able to do these experiments on your system and improve your mental model of how it is that it actually works.
Subbu Allamaraju: That's how I would explain chaos testing. It's not about introducing failures; it's about building better mental models of the dynamic nature of our systems, so you can reason about success as well as failure.
Rich Burroughs: Yeah. I mean, nobody's ever going to have a perfect model, right? They're going to be very flawed, but the more that we can improve them, the better position we're in when our pager goes off at 4:00 AM, and we've got to do something.
Jacob Plicque: It's understanding that it's never going to be perfect, because our systems are always going to be in some state of chaos, right? Under the hood, so to speak. But on the flip side, we're also human, right? So acknowledging that, at least at a high level, is really, really important.
Subbu Allamaraju: Yeah, that's right. That's why we hear a lot about human factors in complex systems and failures. I think that's a more recent realization that we are going through, and it's part of our culture. It's part of how we think about the failure of people and systems. So I think we are getting better as an industry, and there's a lot more we should talk about to share our collective wisdom.
Rich Burroughs: I was just going to say, in that Richard Cook paper, one of the points that he makes that I love is that a lot of times it's actually the humans who are preventing the incidents, right? It's the fact that somebody reacted and jumped into a situation and stopped it, before it got to the point where it was exposed to customers. That's the thing that's actually keeping the system safe a lot of the time.
Subbu Allamaraju: Exactly, and even in the case of the change statistic I shared, a lot of changes actually succeed.
Jacob Plicque: Right.
Subbu Allamaraju: We have, I think, 99.95% of changes succeed, which means that we're actually doing a pretty good job. It's just that there are some changes that are causing issues. So I would still take pride in our teams, as well as many other companies, investing a lot in making thousands of changes every day very safely. It's just those narrow edge cases that are creating unsafe conditions.
Rich Burroughs: So you talked in your post about building confidence in deploying software safely, and I'm wondering what advice you have for people. How do you build that confidence in a team?
Subbu Allamaraju: You start with some understanding of your architecture. There is a famous, very useful paper by Werner Vogels at AWS on compartmentalization as an architecture principle, and I would start with compartmentalization from the get-go when you're building any large-scale service, so that your fault boundaries are clearly identified and established. Once you have those identified, you have a lot more flexibility in managing changes. Then you have more weapons at your disposal to deploy changes progressively across those compartments, if you will.
Subbu Allamaraju: Then you are serving your customers while improving your confidence that the change is going to go in successfully. This is how most large internet-scale systems are managed. There are very few companies that deploy the same change across the globe at the same time, and of course, when they do, we hear about them, like the Facebook incident that happened in March. Barring those kinds of large cases, I think having compartments is the number one thing you want to focus on and invest in. And of course, once you have that, you can improve your CI/CD. There are many, many techniques available for deploying code safely, like blue/green deployments, canary releases, and whatnot. But you have to make an explicit investment to systematically productize these kinds of best practices in your architecture and your tools.
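[Editor's note: Here's a minimal sketch of the progressive, compartment-by-compartment deployment Subbu describes: release to one fault domain, check health, and only then promote, rolling back on the first failure. The compartment names and the deploy and health-check hooks are hypothetical stubs, not any particular CI/CD product's API.]

```python
# Hypothetical sketch: canary-style rollout across compartments (fault
# domains), gated by a health check, with rollback on the first failure.
import time
from typing import Callable

def progressive_rollout(compartments: list[str],
                        deploy: Callable[[str], None],
                        healthy: Callable[[str], bool],
                        rollback: Callable[[str], None],
                        bake_seconds: int = 300) -> bool:
    """Deploy one compartment at a time; stop and roll back on bad health."""
    done: list[str] = []
    for name in compartments:
        deploy(name)
        time.sleep(bake_seconds)          # let metrics accumulate ("bake")
        if not healthy(name):
            for prev in reversed(done + [name]):
                rollback(prev)            # unwind everything touched so far
            return False
        done.append(name)                 # promote to the next compartment
    return True

# Example wiring with stubs (replace with real deploy and metrics hooks).
ok = progressive_rollout(
    compartments=["fd-canary", "fd-us-west", "fd-us-east", "fd-eu"],
    deploy=lambda c: print(f"deploying build to {c}"),
    healthy=lambda c: True,               # stub: query error rate for c
    rollback=lambda c: print(f"rolling back {c}"),
    bake_seconds=0,                       # 0 for the demo; minutes in practice
)
print("rollout succeeded" if ok else "rollout aborted and rolled back")
```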
Rich Burroughs: My favorite quote from your blog post was, "Every change is a test in production."
Subbu Allamaraju: I'm not the first one to say that; it's been said by many, many people in recent years. I think it comes from the realization that our environments have high entropy, and reproducing an environment on a laptop or on a separate compute server is next to impossible. You can make approximations, but once you start making approximations, you no longer have the same environment. It's a different environment, with a different set of failure modes and a different set of assumptions. So you can never test the same change in the same entire environment. It's always different.
Rich Burroughs: Yeah. And in fact the production environment is not the same environment after you've made that change, right? Every time you make a change, every time you iterate on it, you're creating a whole new system.
Subbu Allamaraju: Exactly. So when you have a system that's undergoing 2,000 changes a day, it has changed its shape 2,000 times. And with those 2,000 changes you have brought in a number of new assumptions that were not there before, and very few people know about those assumptions. So how do you keep track of that? How do you manage such a complex system?
Jacob Plicque: Having said all of that, it's still the place where you're making your money.
Subbu Allamaraju: Exactly. Exactly. That's why people are fundamental in making systems work.
Jacob Plicque: So what do you think is missing? What should folks and companies be focusing on to build confidence in production, so that it's not something they're scared of? Because they still have to make those changes and build these services and new features. What do you think the missing link is?
Subbu Allamaraju: I was actually speaking to a friend of mine who works at a tech company; he joined recently. He said, "I'm allowed to make this change and push the feature, but I'm scared, because if I touch it, it might break everything in the company. I don't know what's going to happen once I submit it." That means that engineer is hesitant to deploy a change into a large, complex environment, which means the team could have invested in incrementally increasing the safety of changes, through compartmentalization or other ways of deploying in increments over a period of time.
Subbu Allamaraju: So instead of deploying all at once, deploy in increments. To me, that is the fundamental shift that needs to happen in the industry: there should be a way to make changes safely. Systems are getting increasingly complex, so we have to invest in change safety.
Rich Burroughs: Yeah. I mean, even the tools are more and more complex. You think about something like Kubernetes and then you put a service mesh on top of it, and how many people who are operating Kubernetes and Istio really understand all the nuances of both of those tools?
Subbu Allamaraju: Don't even get me started on that.
Jacob Plicque: On part two, then.
Subbu Allamaraju: Yeah, part two. It comes with a whole set of new assumptions and new failure modes to understand.
Rich Burroughs: That's one of the other points that Richard Cook makes in that paper: when you adopt these new technologies, they come with new failure modes, right? And they may actually be worse than the failure modes that you had before.
Subbu Allamaraju: Yeah, exactly. I think it's one of the Netflix engineers, Lorin, who makes this observation, and I'm paraphrasing in my own words, that the system you build to keep your systems up and running can itself cause a problem. Like the monitoring agent, or that control loop you downloaded from the internet, or that thing that starts with K, can create a new failure mode you don't understand.
Jacob Plicque: It's true.
Rich Burroughs: Yeah. Lorin's great. Super smart.
Subbu Allamaraju: He's a good guy.
Rich Burroughs: I wanted to ask you about metrics. We were talking earlier about that idea of business value and showing the business that the resources that you're investing in Chaos Engineering are paying off. What are some of those metrics that you would look at? Would you look at things like Mean Time To Detect, and Mean Time To Repair? Or are there other things?
Subbu Allamaraju: We do look at that, though not necessarily Mean Time To Detect; we look at, on average, how comfortable the team is feeling about recovery. Because what I have seen in the past is that when there is an incident, teams tend to debug and spend a lot of time. In recent months, I've seen teams that went through validation through chaos testing become much more comfortable failing over or recovering quickly and debugging later. It's more of a cultural change that I'm seeing. It doesn't necessarily show up in all the metrics, all the numbers, but those are the kinds of stories I look for, because they come from real experience and they spread the good word about how those investments are paying off.
Jacob Plicque: It probably ties into the management piece, I have to imagine, because they're more comfortable essentially flipping over to a failover. I have to imagine a lot of incidents tend to shrink from a time perspective in that regard, because they flip over, validate, and go back to bed, versus debug, debug, debug, fix, and then go to bed.
Subbu Allamaraju: Yes. We are seeing those cases, though I'm still waiting a bit longer to conclude that this is working. But the journey has been about focusing on compartmentalization, focusing on traffic shifting over firefighting in place as much as possible, and of course, the third piece is release safety. Those are the things we are looking at internally.
Jacob Plicque: Got it. And then, how does that get communicated with the engineering teams that are essentially the ones fighting those fires? Because as someone at the VP level, how does that communication spin up? Is there a postmortem that you're involved in, or is it a little less formalized than that?
Subbu Allamaraju: We are not formal in this process. We are figuring out how to structure ourselves to have these conversations, stop using terms like "postmortem," and focus more on learning from incidents. How do we facilitate that? How do we spread the lessons learned from the grassroots level? We still haven't figured out what the right mechanism is, but there are plenty of ideas we are experimenting with in the company. The aim is to change the language we use to describe incidents and post-incident processes, change the tools we use, and set up working models for teams to jointly learn from incidents.
Jacob Plicque: I bet that's going to be a talk sometime soon, too.
Subbu Allamaraju: Absolutely.
Rich Burroughs: Speaking of talks, I think we're about out of time. We could talk to you all day about this stuff, Subbu, so thanks so much for coming on. I just wanted to point out that you're speaking at Chaos Conf in September. We're really excited to have you there. Are there other places on the internet that you want to point people to, where they can find out more about you or follow you?
Subbu Allamaraju: Absolutely. First of all, thank you both for having me. It's been fun talking to you, and I look forward to speaking with you in person at the chaos conference.
Jacob Plicque: Absolutely.
Subbu Allamaraju: And the best way to find me and my prior work is at my blog, subbu.org.
Rich Burroughs: All right. We'll link to that in the show notes as well. We'll link to that blog post, and also your Twitter. Like I said, I really enjoy following you on Twitter. You're somebody who has what I would call a very high signal-to-noise ratio. You don't tweet a lot, but when you do, you're usually saying very important things, so I like that a lot. Thanks again for coming on the show today. We really appreciate it.
Jacob Plicque: Absolutely. Thanks, Subbu.
Subbu Allamaraju: Thank you both.
Our music is from Komiku. The song is titled, Battle of Pogs. For more of Komiku's music, visit loyaltyfreakmusic.com or click the link in the show notes. For more information about our Chaos Engineering community, visit gremlin.com/community. Thanks for listening and join us next month for another episode.