Podcast: Break Things on Purpose | Ep. 5: Adrian Hornsby, Senior Technical Evangelist at Amazon Web Services
Break Things on Purpose is a podcast for all-things Chaos Engineering. Check out our latest episode below.
You can subscribe to Break Things on Purpose wherever you get your podcasts.
If you have feedback about the show, find us on Twitter at @BTOPpod or shoot us a note at podcast@gremlin.com!
In this episode, we speak with Adrian Hornsby, Senior Technical Evangelist at Amazon Web Services.
Transcript of Today's Episode
Rich Burroughs: Hi, I'm Rich Burroughs and I'm a Community Manager at Gremlin.
Jacob Plicque: And I'm Jacob Plicque, a Solutions Architect at Gremlin and welcome to Break Things on Purpose, a podcast about Chaos Engineering.
Rich Burroughs: This episode we spoke with Adrian Hornsby from Amazon Web Services. We had a really fun conversation about Chaos Engineering and resiliency. Jacob, what sticks out to you from our conversation with Adrian?
Jacob Plicque: So far and away that this discussion just flew by and that we covered so much. Also, that we were able to pack so much in and yet there could definitely be a part two in the future. What about you?
Rich Burroughs: Agreed. I really liked our discussion about randomness in Chaos Engineering experiments and also hearing about Adrian's experiences with customers at AWS.
Jacob Plicque: Awesome. So, we're five episodes in now, so we just wanted to thank everyone that has listened so far and we'd love to get some of your feedback about the podcast. Maybe things you want to hear more about or even suggestions for guests we should speak to. So our email is podcast@gremlin.com or you can reach us on Twitter at @btoppod.
Rich Burroughs: All right, great. Let's go now to the interview with Adrian. Today we're speaking with Adrian Hornsby. Adrian is a senior tech evangelist at AWS. Welcome Adrian.
Adrian Hornsby: Hey, Rich. Hi, Jacob. How are you?
Jacob Plicque: Doing awesome. Thanks for joining us.
Adrian Hornsby: Thank you for inviting me. That's an honor to be on your podcast.
Rich Burroughs: So, tell us a little bit about how you got interested in Chaos Engineering.
Adrian Hornsby: Oh, that's a long story, actually. I think I was born very curious and always kind of tried to learn things by breaking them. At least that's what my mom would say. She never liked that I never rebuilt them afterward. So, I took apart a lot of radios and TVs like that, to try to understand how they work. So, I guess it's always out of curiosity, right? I think you learn a lot by breaking things, anything, right? So, actually, when I moved to software engineering, it felt kind of natural when I started to read the stories from Netflix and things like this. It was like, "Hey, that sounds like a very interesting concept. I can apply this breaking things to software." So, it's really by reading the blogs of Netflix and also all the engineering blogs of different companies back in the day that I really got interested in the bigger picture of resiliency, I would say. Anyway, I think Chaos Engineering is just one part of it.
Rich Burroughs: Sure.
Adrian Hornsby: Definitely a fascinating part of it.
Rich Burroughs: Sure.
Adrian Hornsby: That's how it all happened. It's kind of a natural path, I think.
Rich Burroughs: I couldn't agree more. You mentioned that it started very early in childhood, and I can think back similarly to breaking things like robots and creepy crawlies and stuff like that. Was that something that stayed with you through the early stages of your career, before joining Amazon?
Adrian Hornsby: Yeah, I mean, back then we didn't call it Chaos Engineering, we called it curiosity or [inaudible]. I think that's what is interesting with Chaos Engineering: it's kind of setting up a process for deep diving into some potential problems, and especially making sure that it surfaces problems we might not have thought about. And I think that's the interesting thing about Chaos Engineering: it's a reminder of how bad we are at remembering everything that has happened in the past in our heads.
Jacob Plicque: I'm so glad you called that out because it's so true. I was actually just running a Chaos Engineering bootcamp a few days ago. I always have a slide about the scientific method and I joke like, "Hey, raise your hand if you remember sixth grade biology class when we were talking about the scientific method. Well, it's back."
Adrian Hornsby: Exactly.
Jacob Plicque: It's still around because it's still important. Setting up this hypothesis is really important. But just to level set for the folks before we dive right into Chaos Engineering, can you tell us a little bit about what brought you to being a senior tech evangelist at AWS? Rich mentioned that, looking at your LinkedIn, you've been a solutions architect, a director of engineering, a research scientist. So, how did it all come together?
Adrian Hornsby: Well, I was working back in the day at Nokia Research, on distributed systems and large-scale systems, trying to set up experiments on updating things across a large fleet of cell phones in real time and things like this. And already back then, I remember we were trying to figure out the effect of having a disruption in the network and [inaudible]. So, again, it was not called Chaos Engineering, but it was kind of trying to break that network to create network conditions that we didn't think of. And we could always find some things to improve, right? The way the protocol would work, or things like this. And I think after that, I was hired by a company to build white label applications, and that company was already interested in the cloud. Then we just selected AWS because there was no other choice back in the day.
Jacob Plicque: Sure.
Adrian Hornsby: So, I started to work on AWS very, very early on. And I think, for me, the way I learn new things is by learning from others, right? So, as I said, I started to read all the engineering blogs and see how people would build cool systems. You'd look at the Netflix or Instagram engineering blogs back in the day, and that was pretty fascinating. WhatsApp, Airbnb, all of these have gone through spectacular engineering challenges, and I think it's that curiosity, and then reading that, and then getting inspired and saying, "Oh, that's kind of the goal. Being able to understand your system, find failures and prevent them, and inspect those." I think those are engineering challenges that are fascinating, at least for me. I love it. It keeps me awake at night. And I just love it. It's just so interesting, in my opinion.
Rich Burroughs: I think a lot of people have done some sort of failure injection even before there was a name for it. I've certainly talked to people who have, and I'm sure that I did things at previous jobs where I was manually injecting failure in ways before I knew what Chaos Engineering was.
Adrian Hornsby: Exactly. I remember when I was at school, I did a training at my dad's work. My dad was working in a nuclear power plant, a scientific nuclear power plant, and in the control room there were tons and tons of redundant systems, literally eight levels of redundancy. And I remember it kind of shocked me, all of that resilience already back in the day. And they also do tests like this at scale, trying to verify if the alarm system works and if the escalation path works and all that stuff.
Rich Burroughs: Right.
Adrian Hornsby: So, I remember it was actually very fascinating back in the day, and I was 16, so that's a long time ago. It stayed with me, I think. And again, it's the curiosity of understanding the system itself and the failure points of the system, which is then: how do you make it better and how do you improve it? I think it's just that science is fascinating, in a way.
Jacob Plicque: I think that redundancy is something that's a pretty obvious step and even back in the jobs that I had in like the '90s, early 2000s, we were thinking about redundancy, but I think that we weren't necessarily thinking on the levels of resiliency that people do today.
Adrian Hornsby: Well, the demand was very different as well. Back in the days you would build an application for your local market, right? And it was totally okay to be down for the weekend and say, "Oh, we'll come back Monday." There was no choice for other people to go and download another application. So, nowadays you can move from one application to another in a matter of seconds, right? Customer retention is a lot more important, therefore you need to have a lot more resilient systems, otherwise you lose your customers.
Jacob Plicque: The cost of downtime is such an interesting discussion right there, because there are things that are obvious, right? Those customers aren't going to come back. But then there are other things that are harder to quantify, around brand recognition and trust. That's something that's nearly impossible to get back once you lose it.
Adrian Hornsby: Absolutely. And it gets worse and worse, right? I think people are a lot less tolerant for mistakes nowadays than they used to be back in the day.
Rich Burroughs: Employee attrition is another one too, right? If your folks are fighting fires all the time and getting woken up by their pagers, they're probably going to be thinking about looking for something else.
Adrian Hornsby: Exactly. There's a lot of folks that ask me, "What was the main advantage of building or using Chaos Engineering?" I say, "Increasing my sleep."
Jacob Plicque: It's the truth. If you could see a failure at two o'clock in the afternoon versus getting woken up at 4:00 AM, would you do it?
Adrian Hornsby: Exactly.
Jacob Plicque: Everyone in the room's going to say yes.
Adrian Hornsby: Absolutely. Especially the ones that have pagers.
Jacob Plicque: Especially.
Rich Burroughs: So, Adrian, you wrote a really great blog post about Chaos Engineering and we'll link to that in the show notes.
Adrian Hornsby: Thank you.
Rich Burroughs: And in that blog post you talked about Jesse Robbins doing Game Days at Amazon in the early 2000s, and you mentioned that Jesse had trained as a firefighter.
Adrian Hornsby: Yeah. For me, it's one of those fascinating human stories, right? Behind the technology. So, Jesse Robbins is a firefighter in his free time, right? And a fascinating engineer as well, a software engineer. And he had this idea that firefighters build this intuition to go and fight fires, and he wondered how he could do the same with his engineers, basically. The problem he had is, and very often when you ask people, no one really trains to fight an outage. Basically, your training happens on the fire. At least that's how it happened for me.
Rich Burroughs: Same.
Adrian Hornsby: The first time I experienced outages, I'd never been trained for that. So, I started to sweat. My heart rate really went up.
Jacob Plicque: So, true.
Adrian Hornsby: I became so stupid and made even more mistakes while trying to fight the outage. I always joke that I cleaned the databases in production, and it happened exactly during an outage. You're like, "Ah, okay." And then you make more mistakes, the rookie mistakes, where you're in the wrong terminal and do the wrong thing trying to fight a fire, and you actually put oil on the fire.
Rich Burroughs: Right.
Adrian Hornsby: So, he had the idea: how could we enforce that continuous learning, and especially learning to fight fires in the production system? So, he started to do what they called back in the day Game Days, right? And he would literally go around the data center and unplug servers, unplug things, or unplug the power system of the data center, even unplug the whole data center, so folks could practice recovering from an outage.
Adrian Hornsby: And I love the idea of building an intuition. I think if you've gone through 10 or 15 or 20 outages, you start to develop these kinds of patterns of, hey, if I see, for example, the Nginx concurrent connection curve all of a sudden become flat, you can think, okay, maybe it's the open file limit on your Linux system, that the limit is not high enough. You can start to see patterns in the way outages are happening. You can see, oh, is this a cascading failure? Is it an overload failure? Is it ... all of these kinds of different little things that give you a hint, and you only get that through training. You're not born understanding the consequences of small things. And I think this is the whole idea, in my opinion, that is very interesting: how do you continuously build this kind of intuition to fight fires before it really happens?
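One concrete version of the signal Adrian describes: when a connection curve flattens, the process may have hit its file descriptor limit. Here is a minimal, hypothetical Python sketch of checking and raising that limit for your own process; on a real host you would inspect the nginx worker itself (for example, via its worker_rlimit_nofile setting).

```python
# Minimal sketch (hypothetical values): inspect the open-file limit that often
# caps concurrent connections, and raise the soft limit if it is too low.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open file limit: soft={soft}, hard={hard}")

TARGET = 65536  # assumed connection budget for this example
if soft < TARGET <= hard:
    # Raising the soft limit up to the hard limit needs no extra privileges.
    resource.setrlimit(resource.RLIMIT_NOFILE, (TARGET, hard))
```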
Jacob Plicque: Exactly. I think of the analogy of baseball players, right? A baseball player takes batting practice and they think about their swing when they're in batting practice, but the idea is that when they get in the game, they're not thinking about the swing. They're just kind of going with the moment, and I think that, in a way, it's the same for people responding to an incident, right? Somebody shouldn't have to be learning how to respond to an incident, how to use their tools, and how to understand the systems during the incident; they should have done that training already.
Rich Burroughs: Or actually understanding what commands you should run to figure out the number of concurrent connections on your database, right? Then, if you go in and see the documentation at that time, there's a lot of things that can happen in between. You should know the commands, you should know how to follow your intuition and very fast.
Jacob Plicque: Or, God forbid, the documentation that you're reading hasn't been updated in a year.
Adrian Hornsby: Exactly.
Jacob Plicque: Because that's the last time that that outage happened, which is one thing, but during that entire 12-month period you could have done a fire drill and said, "Hey, let's make sure that we're resilient to this particular failure. Let's make sure we didn't regress." Or maybe even have automated that after you fixed it the first time.
Adrian Hornsby: That's actually the best practice, right? And that's a problem I talk about regularly. I see customers doing Chaos Engineering experiments, they surface a lot of problems in their architecture or systems, and then they don't necessarily fix them. I had a customer that did this literally a couple of months ago, and two weeks after the chaos experiment, they experienced a 16-hour outage.
Rich Burroughs: Oh, no.
Adrian Hornsby: That was very costly. Simply because they didn't get the approval from the company management to stop the development of new features and fix the things that were bad. They kind of said, "Okay, let's do it a bit later."
Rich Burroughs: Oh, wow.
Adrian Hornsby: A 16-hour outage is very costly.
Jacob Plicque: Oh, man. At least in your mind, hearing about it after the fact, was that a situation where they ran the experiment and said, "All right, well, we'll put this on our backlog," and then the exact experiment they ran wasn't the root cause of the outage, but something similar is what happened in production?
Adrian Hornsby: Exactly that.
Jacob Plicque: Wow.
Adrian Hornsby: So, what surfaced out of the experiment kind of became the root cause of the outage two weeks later. 16 hours of downtime.
Jacob Plicque: Wow.
Adrian Hornsby: So, it's important to fix it.
Jacob Plicque: Of course. You've got to complete the loop, so to speak. Right.
Adrian Hornsby: In my opinion, this was a problem of buy-in, right? So, the engineers really were into Chaos Engineering, but there was not full buy-in from the company, right? So, it was semi-buy-in.
Rich Burroughs: That's hard. I think a lot of what people have to do, as SREs or other operations or development people who are operating applications in production, is advocate for resilience. And like you're an evangelist, they kind of have to be evangelists within their own organizations for these things.
Adrian Hornsby: Exactly.
Rich Burroughs: But, it's hard. You have to make a case to management and explain to them why you need the resources and why these things are important.
Adrian Hornsby: And this is why I always tell them: never call it Chaos Engineering to your management. Call it resiliency engineering, because usually, if you tell people chaos, they get scared. It happens all the time. It's funny. It just happens all the time.
Jacob Plicque: We have enough chaos, right?
Adrian Hornsby: Exactly. Why create more chaos when there's already so much? Whereas, if you call it resiliency engineering and say, "Hey, we're going to improve our mean time to recovery and availability," and all that stuff, they're like, "Sounds good."
Jacob Plicque: It's interesting because, on the flip side, you could emphasize that if we're being reactive, then let's respond to those issues and resolve them. And then, let's prove that we fixed them with chaos, or in this case, resiliency engineering, so you can verify it. So, it ties back into the earlier point. You're, by default, eliminating the unknown chaos in the systems that you're running.
Adrian Hornsby: Absolutely. But you just need buy-in from the top management.
Jacob Plicque: Right, buy in first. Absolutely.
Adrian Hornsby: And trust from your management, right? And that's another problem I talk about in the blog post: I see a lot of folks recommending doing Chaos Engineering in production, and very fast. And I think a lot of people take pride on Twitter in talking about it. And, for me, it's not a good thing to do initially, because you don't earn trust from the organization, right? So, I think, like any very risky engineering method or system, you need to earn trust from everyone, right? From your organization, from your manager, and from your customers. And, I mean, that trust is also about understanding how to do it and building this intuition for running chaos experiments as well. The same thing. So, it's about practice, practice, practice. If you start to do chaos experiments right away in production, you don't have practice and therefore you're probably going to make more mistakes. And you are allowed to make only one mistake.
Jacob Plicque: If that, right?
Adrian Hornsby: Exactly. I also love Chaos Engineering as a methodology for a test environment, for local development, even for learning how to do software engineering in general. I think it's a very interesting methodology for learning.
Jacob Plicque: It's funny you say that, because I think there's still this misconception, I don't think it's as huge as it used to be, but that you have to start in production, you have to break something in production to learn and to build resiliency. And I think you're spot on. What if I'm designing an application and, let's say, I'm working locally on my MacBook, right? But I'm still testing things out. You mentioned concurrent connections to my database, right? So, I'm still talking to it. I can still interrupt it or add latency at that level, as small as that level, and that may impact my design.
Adrian Hornsby: Absolutely.
Jacob Plicque: And that's before you even go into a lab or development environment, much less staging or production for that matter.
Adrian Hornsby: Absolutely. And I love to use that, actually, to fight against biases, right? We all have biases around which programming language we want to use or what framework we want to use. I love to use Chaos Engineering to break that bias and to show that my hypothesis for that particular choice is totally wrong, right?
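As an illustration of the kind of local experiment Jacob describes, here is a minimal sketch, not a Gremlin or AWS feature, of injecting latency around a development database call. The dev.db file, the orders table, and the delay range are all hypothetical.

```python
# Hypothetical sketch: wrap a local database call with injected latency to see
# how the application design copes before it ever reaches staging or production.
import random
import sqlite3
import time

INJECT_LATENCY = True          # toggle the experiment on or off
LATENCY_RANGE_S = (0.2, 1.5)   # simulated network delay, in seconds

def with_latency(fn):
    """Make every call to fn pay a simulated network delay."""
    def wrapper(*args, **kwargs):
        if INJECT_LATENCY:
            time.sleep(random.uniform(*LATENCY_RANGE_S))
        return fn(*args, **kwargs)
    return wrapper

@with_latency
def query_orders(conn):
    return conn.execute("SELECT * FROM orders LIMIT 10").fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect("dev.db")   # local development database (hypothetical)
    start = time.monotonic()
    try:
        rows = query_orders(conn)
        print(f"{len(rows)} rows in {time.monotonic() - start:.2f}s")
    except sqlite3.OperationalError as exc:   # e.g. table missing locally
        print(f"query failed after {time.monotonic() - start:.2f}s: {exc}")
```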
Rich Burroughs: So, on the flip side of it, I've been thinking about this a lot lately. I hear people usually talking about Chaos Engineering in a couple of ways, and one is the sort of Chaos Monkey example where you've got this thing running all the time, right? And people know if they deploy an application into this environment that it's going to be subjected to these experiments on a regular basis, right? That's just kind of the contract of being able to run an application in that environment. And then, on the flip side you've got the kind of planned Game Days where people get together as a group and they kind of conduct these thoughtful experiments, and I think I tend to recommend to folks the latter, the Game Day approach, but there's a forcing function to the Chaos Monkey thing, right? I think that one of the things that's hard is getting that buy in sometimes and I wonder what your thoughts are about that?
Adrian Hornsby: I have to say, actually, I love both methods.
Rich Burroughs: Okay.
Adrian Hornsby: I love the first method of running, let's say, schedulers in production eventually. I think they're great for enforcing rules. I think that's the whole point of why Netflix did it in production: they wanted to have very fast development, very agile development cycles, so that people could do updates in production very fast. And when you want that, when you're increasing the velocity of deployment, you also eventually increase the potential risk of failures, right? So, I think regular experiments running as a rule enforcement mechanism are great. For example, they wanted all their applications to be resilient to rebooting. Well, what's better than just trying to reboot them regularly, right?
Jacob Plicque: Exactly.
Adrian Hornsby: I think if you just wait for a Game Day to do that, it might work during the Game Day, but then you make a version update of your library and all of a sudden you realize there is a dependency, usually a hidden dependency, that actually fails, right? And then the hypothesis that you had a week ago or two weeks ago is no longer valid, because you've done a library update. So, a lot of the idea is running those chaos experiments as a Game Day, but after that, turning them into a continuous test in your CI/CD pipeline, or eventually in production, right?
Rich Burroughs: Yes.
Adrian Hornsby: I would say the CI/CD pipeline first is actually a great way to do it, to also build this confidence in what the effect of your monkey is on your system.
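A rough sketch of what that kind of continuous check could look like in a pipeline, assuming a hypothetical orders-api container exposing a /health endpoint on port 8080; a real setup would more likely use a dedicated chaos or fault-injection tool than shelling out to Docker.

```python
# Hypothetical CI-stage chaos check: restart the service and verify it recovers
# within its recovery budget, so a library update can't silently break reboot
# resilience between Game Days.
import subprocess
import time

import requests

CONTAINER = "orders-api"                       # assumed container name
HEALTH_URL = "http://localhost:8080/health"    # assumed health endpoint
RECOVERY_BUDGET_S = 30

def test_service_survives_restart():
    subprocess.run(["docker", "restart", CONTAINER], check=True)

    deadline = time.monotonic() + RECOVERY_BUDGET_S
    while time.monotonic() < deadline:
        try:
            if requests.get(HEALTH_URL, timeout=2).status_code == 200:
                return  # recovered within the budget
        except requests.RequestException:
            pass        # still coming back up
        time.sleep(1)

    raise AssertionError(f"{CONTAINER} did not recover within {RECOVERY_BUDGET_S}s")
```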
Jacob Plicque: I think what's cool about that, especially if customers are moving from an on-prem workload to the cloud, is that engineers are so used to pointing to the six months and one year of uptime on their Windows SQL Servers, versus, "Hey, failure happens, right? It's going to. Are you ready?" And so, doing things like shutting things down more often, or just making sure that you're ready for that, is really important. And I know that's kind of where Chaos Monkey comes in, picking hosts at random to shut down. Now, we talk a lot about that, because usually when folks come to talk to us, whether it's at a booth or a demo or anything like that, Chaos Monkey is usually what they've heard of first. And so, we talk about slowing down and starting a little smaller and maybe not as random, but there's a forcing function to having something running all the time like that, isn't there?
Adrian Hornsby: I totally agree. I think the monkeys, especially, I would say, the first Chaos Monkey, right?
Jacob Plicque: Right.
Adrian Hornsby: The host failure monkey, not necessarily the availability zone or regional monkeys.
Jacob Plicque: Right.
Adrian Hornsby: But those ones push you toward the idea of microservices, right? If you want microservices and you want to be able to scale horizontally, the first thing you have to do anyway is build systems that are resilient to hosts coming up and down at any time, right?
Jacob Plicque: Right.
Adrian Hornsby: So, I think that's why that monkey is particularly popular, because it enforces the first rule of microservices, which is to be stateless and able to reboot. I think that's also why it's very popular. And that's also why Netflix left it in production all the time: the most important thing for Netflix was being able to scale up and down without causing downtime, because that was happening literally all the time, right?
Jacob Plicque: Right.
Adrian Hornsby: So, it made total sense for them. Now, if you don't have a system that is really dynamic, scalable up and down all the time, I don't know if it totally makes sense to do that? Maybe not. But, there's probably other things that are more important.
Jacob Plicque: Right.
Adrian Hornsby: It's about understanding, I think, also, the traffic that goes through your system, and then from there defining what's the most useful experiment. And that's actually the second part of the chaos blog that I'm writing: how do you get started choosing which experiment you're going to run? And I can tell you what we do at Amazon. We look at our COEs, right? We look at all our outages, all the collections of COEs, and then we define the list of the most common outages and how we could have fixed them last year, for example, or over the last few years. And then we implement some experiments around that to make sure it never happens again.
Rich Burroughs: So, what's a COE?
Adrian Hornsby: Corrections of Errors. So, it's like a post-mortem type of stuff.
Rich Burroughs: Gotcha.
Adrian Hornsby: Once we have an outage, we kind of start the post-mortem process. We just call them COEs because, I guess, we need to have a name and an acronym for everything, right?
Rich Burroughs: Of course.
Adrian Hornsby: AWS, COE, AMI, you know all those ones?
Rich Burroughs: Yeah. I like that one, though, and I certainly like it a lot better than post-mortem. That's not a phrase that I'm a big fan of.
Adrian Hornsby: No, I'm not either.
Jacob Plicque: It's funny, we were talking about COEs because my brain went right to Center of Excellence and I was like, "Wait a minute, we're talking about incidents. Hold on."
Adrian Hornsby: No, we call that a corrections of errors.
Jacob Plicque: There we go.
Adrian Hornsby: It's a process where we actually have a team. It's led by what we call a bar raiser for COEs, who oversees the whole COE process. And then we have several engineers deep diving into the incident. So, taking the whole timeline of the incident, what the response was and all the important metrics, putting that in the document, and then they start deep diving into the root cause analysis, with the five whys, for example. We run several five whys, but we still use the five whys. And then into how we can avoid that in the future, right? Especially if it affects customers, which is very, very important for us, that it doesn't affect customers. And then we go into implementing the fix.
Rich Burroughs: I also saw a talk of yours that you gave in Oslo. We'll link to that as well in the show notes. You mentioned in there that Amazon does Chaos Engineering before a new region is brought up?
Adrian Hornsby: So, this is something that we do as well. I will write a little bit about that later, too. But every time we bring a new AWS region up around the world, before it opens for customers, we run Game Days, the same as what we used to do back in the day: we put all the service teams together and we start breaking things and see if the region is actually consistent with the rest, in terms of the way we can detect outages and things like this. Well, you can imagine regions are not exactly the same, right? Because they've been implemented at different times. So, there are different versions of everything. So, we need to make sure everything is up and running.
Jacob Plicque: And that's an important thing to note, right? Because your production environment isn't the same as your staging environment, isn't the same as your development environment. And that's at such a larger scale when you're talking about worldwide regions: ap-southeast, I think is what it's called, is not the same as us-west-1, is not the same as us-west-2. And that goes down to the hardware level, right? And then the network layer, and I can only imagine the detail that has to go into that. Can you elaborate on that a little bit?
Adrian Hornsby: I can't talk that much about it because, first, I don't know all the details, and we can't give away that much.
Jacob Plicque: Sure.
Adrian Hornsby: But, you can imagine, right? When you run an application in a multi-region environment, it's very common to have configuration drift, right? So, it's something that you always need to fight against. The APIs are the same, right? It's what's underneath that might have differences in software versions, in hardware versions, or things like this. That's why cloud computing is cool, because basically it's kind of agnostic of the underlying hardware layer, right? But that's always the theory, right?
Jacob Plicque: Exactly.
Adrian Hornsby: And that's why we do chaos experiments is because theoretically it should work, but then you need to verify.
Jacob Plicque: Exactly.
Rich Burroughs: So, someone who says that they don't have time to do Chaos Engineering because they're in the middle of migrating to AWS, maybe they should be doing that while they're migrating?
Adrian Hornsby: It's a very good question. And I don't think there's a definite answer for that. It's all about the application, all about the company, and the priorities. I think the problem comes when the expectations for availability are so high and you haven't done the homework.
Rich Burroughs: I guess what I meant is more in terms of like that validation, just as you folks are validating your systems before you open up a new region, I would want to do the same thing if I was setting up a new application environment in a new cloud. To me, I would want to do some sort of validation, too.
Adrian Hornsby: Absolutely. It sounds like common sense to me, but you want to make sure it works as expected, right?
Jacob Plicque: And I think, to dig deeper into that, there's a lot of thought that, by default, moving to the cloud automatically makes me more reliable, when it really is about understanding the complexities of it. Because, to your point, cloud computing is super cool, right? But it's different, and it's a new thing for a lot of folks. So, you want to expose the differences. Maybe you battle test the on-prem workload that you just moved to the cloud: how does that work differently, and are there different signals, right? Because I don't have auto scaling on my on-prem workloads, but I do in AWS, right? How is that different?
Adrian Hornsby: I think we go back to building that intuition, right? A company that's been running on premises might understand all the complexity, the fixes, and how to run that on-prem environment, but you basically have to rebuild that on the cloud, because you have different kinds of paradigms, right? Scaling is one of them. On demand is one of them. Limits are another, there are still some of them, and then the configurations and the deployments. These are, in my opinion, things that you have to relearn, right? You can't just bring your past experience managing your own data center and application deployments and make it work right away in the cloud. The ideas are similar, but the intuition is slightly different, in my opinion. The signals are different, like you said. I think that's important.
Jacob Plicque: An article of yours that I read last year is the Medium piece on acloud.guru about The Quest for Availability, which I love the name of, and the tagline is, "How many nines of happiness are your customers?" Which is absolutely brilliant, right? Because you can talk a lot about the nines that you have, but the downtime you have when you hit a certain nine could still be negatively affecting your customers and your brand. So, having said that, do you have any tips for building more resilient systems to help with that?
Adrian Hornsby: There are a lot of tips, but I'll say, if I look at the last maybe 10 years of outages that I've read about, documented, been a victim of, or created myself, it's often the same things, right? Actually, you'd be surprised by the effect of timeouts and retries. I think they're probably behind a very big chunk of the outages I've lived through. The fact that people don't set timeouts correctly, and then you have retry storms because they haven't set the backoff mechanism for when the retry occurs. Or simply the alerts: there was no alert on the critical systems, and then no one finds the issue, right? So, not having a canary on an alert to verify that the alert is actually working is also responsible for a lot of outages. So, it's funny, it's usually very simple things. But the timeouts and retries are definitely very, very important. And, of course, configuration drift.
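For readers who want the shape of that fix, here is a minimal sketch of the pattern Adrian points at: an explicit timeout on every call plus a bounded number of retries with jittered exponential backoff, so a fleet of clients doesn't turn one slow dependency into a retry storm. The URL and the specific numbers are hypothetical.

```python
# Hypothetical sketch: explicit timeouts plus bounded, jittered exponential
# backoff, the two settings whose absence drives many retry-storm outages.
import random
import time

import requests

def call_with_backoff(url, max_attempts=4, base_delay=0.2, max_delay=5.0):
    for attempt in range(1, max_attempts + 1):
        try:
            # Never call a dependency without a timeout: (connect, read) seconds.
            return requests.get(url, timeout=(1.0, 2.0))
        except requests.RequestException:
            if attempt == max_attempts:
                raise  # give up and let the caller degrade gracefully
            # Full jitter keeps a fleet of clients from retrying in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))

if __name__ == "__main__":
    response = call_with_backoff("https://internal.example/recommendations")
```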
Rich Burroughs: I was just going to say, every outage it seems like you look at what they call the root cause and it's a configuration change.
Adrian Hornsby: Yep.
Rich Burroughs: Or something like that.
Adrian Hornsby: It's ridiculous. The first company I worked with, we were deploying on AWS, and initially, like pretty much every company still does, we allowed SSH to instances in production to "hot fix", and I quote with my fingers when I say that. Usually people "hot fix" on Friday, then go away for the weekend, come back on Monday, and forget to commit that into the code or into the playbook. Back in the day there was no infrastructure as code.
Rich Burroughs: Right.
Adrian Hornsby: And then later you make a redeployment and, of course, you redeploy over things that have changed, and that's an outage right there. That's literally 90% of the outages we had back in the day. The easy fix was to move to an immutable environment where we disabled SSH to instances so people couldn't modify things.
Rich Burroughs: Amazing.
Adrian Hornsby: But that fixed pretty much all the outages we had. So, it was being able to redeploy your immutable infrastructure. We had a golden AMI, and then we'd redeploy into a new autoscaling group, then move over to the new autoscaling group. And if it didn't work, fall back to the old one almost instantly. And that literally solved pretty much all the problems of configuration drift, because it's usually people doing those drifts.
Rich Burroughs: So, kind of like a blue/green deployment?
Adrian Hornsby: It was like, instead of doing an update in place, you roll an update through a new environment, and then you do a canary deployment through a DNS canary. So, you move some of the traffic to the new path, to the new, let's say, load balancer and autoscaling group, and slowly ramp up that traffic, and if you find errors, you roll back to the previous DNS configuration, to the old systems. I think that works very nicely. It's a very good way to avoid problems. Disabling SSH is the best way to force people, and your deployments, to work, right? I think the problem is people use SSH simply because they haven't automated everything, and that's being lazy in a way. And being lazy always creates outages, in my opinion.
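A rough sketch of the DNS-canary step Adrian describes, using Route 53 weighted records via boto3. The hosted zone ID, record name, and load balancer DNS names are hypothetical, and a real rollout would ramp the weights gradually and roll back automatically on errors.

```python
# Hypothetical sketch: shift a small weight of traffic to the new load balancer
# (the new autoscaling group), then ramp up -- or roll back by restoring 100/0.
import boto3

route53 = boto3.client("route53")

def set_weights(old_weight, new_weight):
    changes = []
    for identifier, weight, target in [
        ("blue", old_weight, "old-env-lb.example.com"),
        ("green", new_weight, "new-env-lb.example.com"),
    ]:
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "CNAME",
                "SetIdentifier": identifier,
                "Weight": weight,
                "TTL": 60,
                "ResourceRecords": [{"Value": target}],
            },
        })
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000EXAMPLE",        # hypothetical hosted zone
        ChangeBatch={"Changes": changes},
    )

set_weights(old_weight=95, new_weight=5)   # start the canary at 5%
# Watch error rates, then either ramp up (50/50, 0/100) or roll back:
# set_weights(old_weight=100, new_weight=0)
```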
Rich Burroughs: When you take that step to disable SSH, you just need to provide people with the tools they need to solve problems. Part of the reason why people want that shell is so that they can go in and run a bunch of commands while they're troubleshooting an issue and you need to give them other ways to get that same information.
Adrian Hornsby: I think it's about delivering the logs off the instance to a centralized logging system. Having maybe better visibility, or what people nowadays call observability, into your system. But, at the end of the day, why do people go into the system? To do what? To change configurations. I just don't see any other reason. It's very rare that you would run commands on a production system like that.
Rich Burroughs: At least you hope, right?
Adrian Hornsby: What do you mean?
Rich Burroughs: Just to kind of, that's the hope, right? Let's not go hop in just to hop in and look around and let's just run this, see what happens.
Adrian Hornsby: Exactly. And that creates more outages, right? I told you I cleaned the database in production. Why do you think I cleaned it? It's because I logged into an instance over SSH and had all the rights to connect to the database and clean it.
Rich Burroughs: No.
Adrian Hornsby: I should have never had the right to do that.
Jacob Plicque: Definitely an argument not to.
Adrian Hornsby: Exactly. And that's the thing: to avoid human stupidity, don't give them access.
Rich Burroughs: So, in your post about Chaos Engineering, you talk about forming a hypothesis and one of the ideas that I loved that I hadn't heard before was you talked about having everybody write down their hypothesis on a piece of paper and then share them. Do you want to talk about that?
Adrian Hornsby: I have to say, this is one of my favorite parts of the experiment, actually. You put people in the room and you ask ... actually, usually, what I like to do is bring everyone, from the product owner to the designers to the backend and database people, anyone that is pretty much linked to the application, and then you make a hypothesis. If you don't have them write it on paper, there is this mechanism where people start talking and then they all agree eventually, just by listening to each other. It's like a convergence, a natural convergence of social human beings. And if you do the same thing but ask them to write it on paper, there is the opposite effect, which is divergence. Everyone has different ideas of what happens. And usually I stop there, because: why does everyone have such different ideas of what happens if I, for example, unplug the cache, or increase the latency, or remove the master database, or remove the recommendation service? Literally no one has the same description of what would happen. It's super interesting to understand why.
Jacob Plicque: I actually had that happen my second Game Day that I ever ran, we were running a latency experiment between our customer's application and their database and I asked, "Hey, what does everyone think the steady state is?" Right? Because I realized kind of right before the experiment I was like, "Wait a minute, that's something that we need to expose a little bit. Let's talk about that." And everyone had a different opinion. And then what ended up happening is that all of those opinions were actually not right. And I had to let them know that's okay. That's why we're doing these experiments. So, it's okay for our hypothesis to be incorrect. That's what we're trying to expose.
Adrian Hornsby: I've never done that for a steady state, but that's actually a super good idea as well, to do the same thing there. "Okay, guys, what's the steady state?" Everyone writes it on paper. Actually, I've never done that. I need to do that.
Jacob Plicque: Take it away.
Adrian Hornsby: It's brilliant. I will.
Rich Burroughs: All right. I think that's all the time that we have today. Thanks so much for joining us Adrian.
Adrian Hornsby: Thank you, guys. Been very nice.
Rich Burroughs: Where can people find you on the internet and do you have any kind of talks or things that you want to plug?
Adrian Hornsby: I'm pretty much all over the internet as Adhorn, A-D-H-O-R-N, whether it's Twitter, Medium, or even GitHub. Those are usually the three things I use the most. I don't have anything to plug, I think. If people want to find it, they'll go and figure it out. I love anything related to resiliency, chaos, and stuff like this, but there are so many amazing people in the field that there's so much to read as well. So, I would say, if you start with me, also read everyone else, because there's so much great stuff. You guys on the Gremlin blog have some very cool things as well. There are some very good things on pretty much every chaos or resiliency podcast out there, too. What is great about this community is that we talk a lot, I think, and we share a lot. I like that, because there's so much more to learn from others than just being stuck in your own thing.
Rich Burroughs: Agreed. Well, we'll link to your Twitter and Medium and all those things in the show notes, so thanks again. We really enjoyed talking with you.
Jacob Plicque: Absolutely.
Adrian Hornsby: Thank you. It was my pleasure as well and see you hopefully at the Chaos Conf.
Jacob Plicque: Exactly.
Adrian Hornsby: I've just booked my flights.
Jacob Plicque: Awesome.
Adrian Hornsby: I'll see you there then.
Jacob Plicque: Sounds good.
Rich Burroughs: All right. Bye-bye.
Adrian Hornsby: Take care. Bye-bye.
Rich Burroughs: Our music is from Komiku. The song is titled Battle of Pogs. For more of Komiku's music, visit loyaltyfreakmusic.com or click the link in the show notes. For more information about our Chaos Engineering community, visit gremlin.com/community. Thanks for listening and join us next month for another episode.