Podcast: Break Things on Purpose | Ep. 9: Kolton Andrus, CEO and Co-Founder at Gremlin
Break Things on Purpose is a podcast for all-things Chaos Engineering. Check out our latest episode below.
In this episode, we speak with our own Kolton Andrus, CEO and co-founder of Gremlin.
Transcript of Today's Episode
Rich Burroughs: Hi, I'm Rich Burroughs and I'm a Community Manager at Gremlin.
Jacob Plicque: And I'm Jacob Plicque, a Solutions Architect at Gremlin and welcome to Break Things on Purpose, a podcast about Chaos Engineering.
Rich Burroughs: Welcome to episode nine. In this episode, we speak with Kolton Andrus, the co-founder and CEO of Gremlin. Kolton has been building Chaos Engineering tools for years at Amazon, Netflix, and now at Gremlin. He has a ton of experience with reliability. Jacob, what leaps out to you from our chat with Kolton?
Jacob Plicque: Yes, so I was really excited to catch up with Kolton about Chaos Engineering in general and just pick his brain on where things are based off of his experience both before and after co-founding Gremlin. It's also cool to think about the first iteration of tools he helped build at Amazon for Chaos Engineering, and how that helped kickstart his journey that has taken him to where he is today. What about you?
Rich Burroughs: It's always really fun for me to talk with Kolton. When I interviewed at Gremlin, my first round was Kolton asking me to share some of my oncall horror stories with him. He wanted to hire someone who had felt the pain of being oncall and I really appreciated that. He has so much experience and insight about how people can improve the reliability of their systems.
Jacob Plicque: Yes, I had a pretty similar experience during my interviews as well. So just a reminder for everyone, you can subscribe to our podcast on Apple Podcasts, Spotify, Stitcher, and other podcast players. Just search for the name Break Things on Purpose.
Rich Burroughs: Great. Let's go to the interview.
Rich Burroughs: Today we're speaking with Kolton Andrus. Kolton is the CEO and co-founder at Gremlin. Welcome.
Kolton Andrus: Thank you. Pleasure to be here.
Jacob Plicque: It's really, really nice to have you on our podcast.
Kolton Andrus: Indeed, indeed.
Jacob Plicque: So just to kind of kick things off. So before you co-founded Gremlin, you worked at both Amazon and Netflix as an engineer. You were actually, I believe, a call leader at both. Can you tell us what that means and how you started doing it?
Kolton Andrus: Yes, so the call leader role is the person responsible for managing an incident while it's happening. So something's gone wrong, there's an outage, somebody has made a call that it's a high severity or a large customer impacting outage. And the call leader is the person that hops on the call or the chat room or the bridge and really takes control and manages that incident. They're not the person doing all the work. There's a lot of other smart engineers on the phone, but they’re the person coordinating and collaborating on what changes to be made. And probably most importantly, their role’s really one of judgment. If there's a scary decision to be made and people are unsure what the outcome will be, the call leader really has hopefully the business context and the technical context to make that judgment call.
Rich Burroughs: It's interesting that you mentioned the business context. How do you get that business context when you're working at a place as big as Amazon?
Kolton Andrus: Yes, I think that's one of the difficult parts. So becoming a call leader at Amazon is something you have to be nominated for. And you typically don't see more junior folks on that rotation. It's typically managers, senior managers, directors, and more senior engineers.
Rich Burroughs: Interesting. So when you're on those calls, sometimes an incident can drag on hours and hours, did you have people who would relieve you after a certain amount of time or?
Kolton Andrus: So at Amazon, it was a 24-hour shift and we had, I don't know, between 10 and 15 of us. So you were on about once every two weeks. And that was your day. And yes, if we had an incident that went on for an extended period of time, I can recall one that went five hours, even at five hours, the call leader was still there. I think if it's something that goes much beyond that, typically there are people on the call that can help and you have the opportunity to pass the baton either temporarily or for a longer period.
Jacob Plicque: So based off of just that experience, I have to imagine you have the, I guess, you could say the opportunity to have to resolve a lot of different incidents. So what trends did you observe in terms of the causes of those incidents?
Kolton Andrus: Yes, I mean, in terms of an education in reliability, I would highly recommend everyone, maybe not be part of every incident, but listen to every incident or listen to every retrospective after the fact. Because you just get to see the wide variety of things that happen. I remember one outage that was an order drop in Japan, and we did some research and we found out it was because there was an earthquake and all the systems were fine. People had more important things to worry about. I've dealt with power issues in a data center in China, where there's two levels of indirection and a language barrier while you're trying to make these complicated decisions. But we've also had a bad service or a service that made a bad deploy that quickly reverted or kind of your more garden variety, a cache that fell over, and so the service behind it fell over, so everything that relied on that fell over.
Rich Burroughs: It seems that there's this big trend that a lot of incidents nowadays have to do with configuration changes.
Kolton Andrus: That's an interesting observation. One of the things I've hit on or I've been speaking about the last couple of months, really, is that I think Chaos Engineering is a way to do runtime validation. We have a lot of real time validation in our CI/CD pipelines and we're doing a lot of unit tests, we've usually got a set of integration tests. But you think about all the production configuration necessary to run a system: thread pools, timeouts, security groups, auto scaling rules, and then all of the things that monitor and check those, the alerting thresholds and everything that fires. There's a ton of that out there.
Rich Burroughs: And some of those things might be configured manually. You might be using multiple tools, you might be using Terraform, along with Ansible, or something like Puppet or Chef, it gets really complex doing all that configuration.
Kolton Andrus: Yes, I mean, Rich, you probably know that background better than me and what best practices are. I think the key is, how do you know what the right value to put there is?
Rich Burroughs: Right. And a lot of times as the Operations person configuring the service, you really don't, you're looking for guidance from engineering, but... Yes. For those folks listening who might not know, I worked at Puppet for a while and was in that community for a few years before that. And definitely had a lot of experience with these things. One thing that people were trying to do at that point in time was actually write unit tests for their infrastructure code. And some people are still doing that with tools like Puppet and Chef and Terraform, but it's hard and the tooling around that stuff isn't great in some cases. So I do also love that idea of using Chaos Engineering to help with that validation.
Kolton Andrus: Yes, one of the examples that I’d pull here, so Hystrix is a circuit breaker library. So the circuit breaker pattern is something's overloaded, trip the circuit, allow it to recover, don't just continue to overwhelm it. But knowing where that threshold is, where's the timeout on the network call or how much... what is overloaded? Where are you going to trip that circuit so the circuit breaker doesn't catch on fire, so to speak?
Kolton Andrus: And what we found was... so Matt Jacobs is on our team, he's a core contributor for Hystrix. We worked together at Netflix. And we would often take our best guess at what that configuration was and then watch it. We would deploy it, we would watch that for a period of time and that would help us see if we got the base case, the happy case right. But we would have no idea whether we got the failure case correct. And so oftentimes, that was, well, if that service falls over, or we see a problem, we'll know whether or not we got it right and whether to tune it. And what we landed on was that that wasn't good enough. There's just too much configuration, too many of these touch points to wait for it to fail and feel the pain once before going back and fixing it the second time. And so we started doing these proactive failure experiments to trip those circuit breakers. Basically, kind of like a load test. Let's take it to the point where it falls over and make sure that's the point we want it to fall over. And then from that, we've obtained enough information to actually make an informed decision about what the right value should be.
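To make the idea of proactively tripping a breaker concrete, here is a minimal sketch in Python. It is not Hystrix's API or its algorithm (Hystrix is a Java library with its own configuration model); the `CircuitBreaker` class, the consecutive-failure rule, and the `calls_until_open` helper are all assumptions made for illustration. The sketch injects failures on purpose and reports how many calls it took for the breaker to open, so the configured threshold can be compared with the observed behavior.

```python
from dataclasses import dataclass


@dataclass
class CircuitBreaker:
    """Toy breaker: opens after `failure_threshold` consecutive failures."""
    failure_threshold: int
    consecutive_failures: int = 0
    is_open: bool = False

    def call(self, operation, fallback):
        # While open, skip the dependency entirely and serve the fallback.
        if self.is_open:
            return fallback()
        try:
            result = operation()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.is_open = True
            return fallback()
        # A success resets the failure count.
        self.consecutive_failures = 0
        return result


def failing_dependency():
    # Injected failure: stands in for a dependency that is down or timing out.
    raise TimeoutError("injected failure")


def calls_until_open(breaker, max_calls=100):
    """Inject failures until the breaker opens; return how many calls that took."""
    for attempt in range(1, max_calls + 1):
        breaker.call(failing_dependency, fallback=lambda: "fallback")
        if breaker.is_open:
            return attempt
    return None


breaker = CircuitBreaker(failure_threshold=5)
print(calls_until_open(breaker))  # prints 5
```

The sketch leaves out recovery (a half-open state) and time-based windows, which a production breaker would need. The point is only the shape of the experiment: cause the failure deliberately, then check that the breaker trips where you intended.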
Jacob Plicque: So I have to imagine that this was sort of... you already were in the midst of doing Chaos Engineering when kind of kicking that off or going through that process, but kind of to bring it back a little bit, like a story I tell all the time is, after surviving as I like to say, a Cyber Monday outage I heard a keynote at re:Invent 2017 about Chaos Engineering and it completely blew my mind. And I was like, "Oh, yes, duh, of course. Be proactive. That makes a lot of sense." I didn't even know that that was possible that breaking things on purpose was something that you could do and it was okay to start that process. So what was your sort of your introduction to Chaos Engineering and why did it resonate with you?
Kolton Andrus: So it's interesting because it makes me want to tell two stories. So let's put a pin in the where did you first hear about Chaos Engineering one? At Netflix when I joined, one of the ironies was I wasn't on the Chaos team or the Operations team, and I didn't join Netflix to work on Chaos Engineering, per se. I had the opportunity to build a Chaos Engineering system at Amazon in 2009, 2010, just before the Chaos Monkey blog post came out. But I joined Netflix to work on this team that had built Hystrix and on the proxy and on the API gateway. And what I learned was due to needing to validate Hystrix fallbacks and these timeouts that we've discussed and these kind of operational issues that arose. We discovered that we needed better Chaos Engineering tools at Netflix. And so that was my impetus to shift and to build the failure injection system at Netflix that let us test these fallbacks and these timeouts.
Rich Burroughs: And that tool was called FIT?
Kolton Andrus: FIT. Yes. F-I-T.
Rich Burroughs: So tell us about FIT.
Kolton Andrus: FIT's a little bit different than kind of Chaos Monkey or some of these other approaches in that it's application level. So there was a little bit of code integrated, kind of at the function level, like an annotation or a cut point. And we could filter the traffic that was being impacted based on a variety of attributes. Was it this user ID? Was it this email? Was it this device type? Was it in this region? And so from that, we were able to run very precise experiments. We're going to test in the API service, this function that calls for recommendations and we're going to test it first only for Kolton at Netflix and see the outcome. And if that works well, I wasn't broken and I'm able to do my tests, then I'm willing to put some small number of customers at risk. Let's put 0.1% of customers or let's put just 100 Xbox customers and see if it behaves well for them. And that really helped us do this incremental approach, which again, helps with de-risking where we might start in dev or stage and we go from those small percents to the large percents, but ultimately, we'd run these tests at 100% of production traffic.
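The scoping described here (one named user first, then a small percentage, optionally limited to one device type) can be sketched as a simple predicate. This is a Python illustration only, not FIT's actual design: `FailureScope`, `user_bucket`, and `should_inject` are hypothetical names, and hashing the user ID into a stable bucket is an assumption about how a percentage rollout might be kept consistent between requests.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FailureScope:
    """Which requests an experiment may impact. All fields are hypothetical."""
    user_ids: set = field(default_factory=set)  # explicit allow-list, e.g. the tester
    device_type: Optional[str] = None           # e.g. "xbox"; None means any device
    percent: float = 0.0                        # share of other matching users, 0-100


def user_bucket(user_id: str) -> float:
    """Map a user ID to a stable value in [0, 100) so the same users stay in scope."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return (int(digest, 16) % 10_000) / 100.0


def should_inject(scope: FailureScope, user_id: str, device_type: str) -> bool:
    # Allow-listed users are always in scope, regardless of device.
    if user_id in scope.user_ids:
        return True
    if scope.device_type is not None and device_type != scope.device_type:
        return False
    return user_bucket(user_id) < scope.percent


only_me = FailureScope(user_ids={"kolton"})
print(should_inject(only_me, "kolton", "web"))        # True
print(should_inject(only_me, "someone-else", "web"))  # False
```

Widening the blast radius is then a change of data rather than code, for example `FailureScope(device_type="xbox", percent=0.1)`, and `percent=100.0` with no device filter covers all traffic.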
Rich Burroughs: Well, that's a lot of traffic at Netflix, I bet.
Kolton Andrus: You’ve got to be careful.
Rich Burroughs: One of the things I find really interesting about that environment specifically too from talking to people is the vast amount of different kinds of devices that connect. So again, being able to kind of pare down that blast radius seems like it would be super valuable.
Kolton Andrus: If you talk about my philosophy of incident management, a lot of it is trying to narrow down the variables, so you can find that core subset that's impacted. I can in my head today see the dashboard of all the Netflix devices, because, okay, there's a streaming impact, cool. Is it isolated to a region? Check the different region dashboards. No signal there. All right. Is it related to a specific device? Check the device dashboards. Okay. Oh, Xbox is the only one impacted. Cool, what changed on Xbox or what service... That ability to narrow down what you're looking for to help you understand the true scope of the impact is an important part of the game.
Rich Burroughs: And so we have some more tooling in Gremlin where you can do the application level injection. And I remember seeing you do a demo, where it was that same idea where you're only experimenting on your own account. And I love that idea from a safety perspective.
Kolton Andrus: Yes. So when I joined Amazon in 2009, I'd been working for a few years in industry, I'd been on call, I'd work for some small startups, but it was my first kind of big gig. And I showed up and it was a little intimidating. I had a lot to learn and there were a lot of smart folks there. And I ended up being put on a team that was the availability team. A team responsible for making sure the website didn't go down. Now, what's funny is when I joined that team, there were five project managers and no engineers. And so I was the first engineer on this team.
Rich Burroughs: That's usually pretty backwards. The ratio of engineers to the project managers.
Kolton Andrus: That's super backward for Amazon. And so I joined this team and I sat next to the efficiency team, I sat next to the latency team, the latency focused on performance. There was really about five or six of us and we were sharing some ideas, five or six engineers sitting together. And one of my managers and one of the senior engineers on my team came up with this idea of causing failure, unbeknownst to the engineer, in the background with some kind of malicious intent or some availability causing failure. So the teams would have to respond, investigate, understand it and fix it. And at Amazon, they have this concept of a one pager. So it's one to two pages kind of essay format to pitch an idea. And so this one pager was actually half a page. It wasn't even really that fleshed out. And they handed it to me and they said, "Kolton, this sounds interesting, important. Do you want to work on this?" And of course, I read it and I was like, "Well, this sounds amazing. Yes, I would love to work on this."
Kolton Andrus: And so what I had was kind of a two paragraph statement idea to begin from, and I spent the next year, again, interviewing teams, attending every incident review, being on every call, and really figuring out, what do we need to build, what do people need, how does it need to work? And so we landed on... it's funny. It's very similar to some of the lessons we've learned that we built into Gremlin, Gremlin is like the third iteration of our platform. So we get to cheat and take past learnings. But it was not just rebooting hosts, it was also taking up disk space, we'd seen a few outages related to a disk filling up, it was dropping network traffic, it was introducing delay. We built an integration with our monitoring tools, so that if an alert fired, we could halt an attack or clean things up. And rolled it out with a good user interface so that it could be self-serve, so that the teams and engineers could go and play with this without necessarily needing our guidance.
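The safety feature mentioned here, halting an attack when a monitoring alert fires and cleaning up afterwards, can be sketched in a few lines of Python. This is an assumption-laden illustration, not the Amazon tool or Gremlin's implementation: `run_experiment` and its callback parameters are hypothetical, and a real runner would wait between checks instead of polling in a tight loop.

```python
def run_experiment(start_fault, stop_fault, alert_firing, max_checks):
    """Apply a fault, poll an alert check, and always clean up.

    Returns "halted" if the alert fired before the experiment finished,
    otherwise "completed".
    """
    start_fault()
    try:
        for _ in range(max_checks):
            if alert_firing():
                return "halted"
        return "completed"
    finally:
        # Cleanup runs on every exit path, including unexpected exceptions.
        stop_fault()


events = []
readings = iter([False, False, True])  # simulated alert state per check
outcome = run_experiment(
    start_fault=lambda: events.append("fill disk"),
    stop_fault=lambda: events.append("free disk"),
    alert_firing=lambda: next(readings),
    max_checks=10,
)
print(outcome, events)  # halted ['fill disk', 'free disk']
```

Putting the cleanup in `finally` is the design choice that matters: the fault is reverted whether the experiment completes, is halted by an alert, or crashes partway through.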
Jacob Plicque: Makes a lot of sense. So it wasn't even like... from my scenarios, a pain point that then was exposed, yours was more along the lines of, oh, this is a really cool idea and we should be doing this. And so the value was really immediate, especially as the first engineer on the availability team.
Kolton Andrus: Yes, I mean it was one, it was software I could write to help solve the problem. The idea wasn't new. There was Jesse Robbins back in the early 2000s, who was shutting down data centers and racks and pulling network cables. They were doing it physically to test both the infrastructure and the team's responses. And what happened ultimately, when I first joined, actually, in our Seattle office down in Union Station, we pulled a network cable to a data center, and I think it was the Seattle data center, and it cut off internet access for every engineer in two Seattle offices.
Rich Burroughs: Wow.
Jacob Plicque: Oh, my goodness.
Kolton Andrus: And I believe, don't quote me on that now, it's on a podcast, so maybe quote me on it. But I believe there was AWS impact as well related to that Chaos Engineering experiment. And so the business and AWS obviously stepped in and said, "Look, these experiments are worthwhile and valuable, but we can't have thousands of engineers be unproductive for a whole afternoon and we can't break our publicly facing customers. So figure out a better way." And the answer was to build software, to make it more discreet, to build in these safety features. A lot of that inspiration came from the pain we felt by doing it by hand.
Rich Burroughs: Yes. Shout out to Jesse Robbins, though, who I think doesn't really get the credit that he deserves, like in the sort of history of Chaos Engineering.
Jacob Plicque: Father of GameDays, right?
Rich Burroughs: Yes.
Kolton Andrus: The father of GameDays. I think Jesse deserves a lot of credit. And I learned a lot from him. I went and looked at a lot of his docs and a lot of his ideas when I joined the availability team at Amazon. So he's been a huge source of inspiration for me.
Rich Burroughs: So it seems a lot of companies aren't great at training people to be on call. And I'm wondering, with your kind of vast experience at handling incidents, what kind of advice that you have for people who are in those roles where they're on call or they're call leaders or just responding to incidents in any way?
Kolton Andrus: Yes, as you know, one of my favorite jokes when I speak is my on-call training amounted to here's your pager, good luck. You're smart. We'll figure it out.
Rich Burroughs: I've heard about people being put in rotations on their first week of work. It just blows my mind.
Jacob Plicque: Guilty as charged. Been there.
Kolton Andrus: I mean, what's the analogy there? You're a firefighter, but you haven't yet gone through any training. Maybe you have your equipment, but hop on the truck, good luck. Here it comes.
Jacob Plicque: Here's the hose.
Kolton Andrus: Yes, I think we need to do a lot better there. I think as an industry, it's a place that everyone wants to do DevOps and everyone wants to do SRE, but are our people and our management and our teams willing to put in the effort to train people and prepare them, or is this just an unreasonable expectation, everybody's doing it, so you should do it. Figure it out as you go. I would love a world where, okay, it's your first week on call or it's your first week on the team. You're going to be on call in a few weeks. We'll have you shadow someone your first week. Join any incidents that happen, ask questions in the background, let's have a quick five minutes afterwards to talk about what we learned. You're ready to be on call, cool, let's prepare you. Let's run a drill. Let's break something small, but something that will trigger the whole process and let's have somebody get paged, check the alert, look at their dashboard, look at the logs. And I think just going through that, we're going to find all of these process-related problems or things that we could do better or that we just have never thought about and then we can have a discussion and make decisions about.
Jacob Plicque: Yes, especially when I was on call at Fanatics, I always took to it like, feeling like a superhero. I always was like Spider-Man, like, yes, I'm going to save the day if something goes wrong. And it's only now like benefit of hindsight that that really isn't like the right... as fun as that is to say and as that is to think like, that wasn't exactly the right way to think about it. It was sort of looking forward to the fires and there's ways for them to be proactively like sussed out, so to speak, by doing these types of experiments ahead of time versus, “All right 4:00 AM, let me put on my tights and go to work.”
Kolton Andrus: Well, you're braver than I am. I remember a couple of the services that I owned at Amazon for a time, I didn't write and I knew very little about and I had some out of date runbook that really, if something major happened, I was paging the... I was calling the engineer that wrote it. And there was a few weekends, it was like, just praying that the pager didn't go off and thankfully that service never failed on me, that was never one that I got an issue on. But you bring up a point about, sometimes we have a little bit of the hero culture, the person that swoops in, and as call leader, I got to feel like that. It was never my fault, but I could come in and fix it and be a hero and make things better. But I think now as a CEO, as we think about how to incentivize the behavior we want, that's not the behavior we want. If we're rewarding the firefighters and everyone's super grateful to them, and the team that worked really hard that never fails, maybe they have the reputation of being rock solid, but does that team get promoted, or is it the squeaky wheel gets the grease?
Rich Burroughs: Yes, I think the hero culture is actually a very negative thing and I'm glad that you brought that up. But on the flip side of it, I think that... I complain a lot about the time that I spent in my career on call. But there is sort of a bonding, a team bonding experience that I think happens when you're like working on these incidents with your team and trying to work through them and figure them out.
Kolton Andrus: I agree. I have a kinship with my fellow call leaders and my team members from Amazon. From those, it's a bit of that, hazing is not the right word, but there's a bit of a group dynamic where if you go through tribulation together, it does form a closer bond. And I think there is an aspect of that in Ops. I'm a gamer, so it's a little bit of this like go to war analogy. Okay, things are broken, the team's getting together. We're going to fix it, we're going to make it right, and then we're going to go home and be happy.
Rich Burroughs: What's the longest incident you were ever a part of?
Kolton Andrus: I think Amazon had one that was about five hours, and I think it was 2010. And that was the longest one that... I wasn't call leader for that one though. So I was not on the phone for five hours. I was running around trying to investigate in the background and had even joined an hour late because ironically, I was up all night working on the Chaos Engineering tooling I was building for Amazon. I was coding, I was a little younger, I coded till 3:00 or 4:00 in the morning, which meant I slept until 10:00 or 11:00, got into the office at like 11:30, 12:00, and that incident hit at like 11:30 AM Pacific and then went on until about 5 PM.
Jacob Plicque: Wow, that's a really long time. I mean, so even though it was like a ... what was the biggest learning from that specific incident because I have to imagine it's one that comes to mind pretty easily.
Kolton Andrus: It was one that was hard because it was hours and hours of no one knowing what was really wrong, and that's the worst feeling on an outage.
Rich Burroughs: Yeah.
Kolton Andrus: Let me contrast it for you. So I remember an outage, a caching service in China, there was only one data center, they had a script that basically rebuilt the cache in a region per zone. But as China had one zone, when they ran the script, it took down the cache and then it had to be rebuilt. And rebuilding that cache was like the catalog for the retail website. So it took 45 minutes or an hour to rebuild. And there's a VP on the call asking every five minutes, “Where are we at? Can we go faster? What's happening?” And I was call leader for that one and it was like, “Cache is rebuilding, we're doing everything we can.”
Kolton Andrus: This incident, this five hour incident, two hours in and out, usually 20-30 minutes in we've narrowed in on the symptoms and scoped out where the problem is. And we knew it was related to some network traffic in some places, and that was about all we knew. And even three hours in, we'd narrowed it down to routes of traffic, but we hadn't still yet identified what had changed or what was causing it. And it was a colleague of mine, Dan Vogel, who helped a lot with the Chaos Engineering tooling we built at Amazon that helped to suss out that it was related to asymmetric packet loss around one set of [inaudible] in between either one set of regions or one set of zones. But it had this knock on effect that things piled up and lost, and so it caused a bunch of retries and connection resets that any service that flowed through that was feeling the pain.
Kolton Andrus: So that was definitely one of the trickiest ones I've been part of and I take no claim or credit to solving it or managing it, I just watched the horror unfold and went, wow.
Rich Burroughs: I'm sure I've been a part of at least one incident that involved asymmetric routing. I think there's probably more than one and yes, those are pretty hairy to figure out.
Jacob Plicque: Yes, the ones that I've been a part of I don't think ever got as granular as that, it was just the network team blaming the database team, and the database team blaming the network team, whether it was latency or not.
Kolton Andrus: It's funny you joke about that. So we talk about a kinship on calls. There's another gentleman I really enjoyed working with, Brian Scanlon. He's in Dublin, but he was part of the core L7 team at Amazon. And look, that team got blamed for everything. Every time there's an outage, everyone went, “It's the network, it's the load balancers, it's not me.” And so they were the best team at responding. They were the first on the call, they had their dashboards, they had the data to back it up, and they'd be clear, if it was them, they would say up front, “Yes, we're having a problem, we're on it, here's what we're doing.” If it wasn't them and it was some service owner, they'd be like, “Here's the data, respectfully, here's your service failing behind the load balancer. No, not the load balancer's fault.”
Kolton Andrus: And I really respected that because again, yeah, I think that network teams, they get a bad rap and sometimes it's often not earned.
Jacob Plicque: Very much so. So you can respect the ... something that I've learned a lot recently is from companies talking about their public facing outages. It's funny, I didn't even know GitLab existed until they had that big incident a couple of years ago. So, yeah, I didn't even hear about the company, right. And then someone who had messaged me and was like, hey they're, I don't remember what the cause was, but they're on Twitch talking about their incident and they're fixing it live, do you want to check it out? And I was working overnight at the time just doing network monitoring and some server patching and was just learning about the cloud at the time and I took a minute to, just hop in and it was super valuable.
Jacob Plicque: So, I think it's something that we're getting to, you talked about the industry a little earlier as a whole that I think we're getting a lot better at talking about the ways that companies are running into incidents and they're talking about it. But I'm curious as to if you can zone in on what the value is of talking about those incidents publicly?
Kolton Andrus: Yeah, I mean, I think there's, like you just said, in terms of education, whether it's preparing to be on call, whether it's preparing to be a call leader, or whether it's preparing for a retrospective, just seeing it done many, many times is very useful. Seeing other smart people take their best guess, take their shot, take their approach and hearing some of the rationale. “Given what limited information you had at the time, you made this judgment call, why?” And not why in that you made the wrong one and they judge you for it, but just help us understand your mental process. What were you seeing? What were you thinking? Because if I find myself in that situation, that might help me.
Kolton Andrus: So I think that's part of it. I think everybody thinks they're bad at this. Everybody. Everybody I talk to. Engineers, in general, we've got our flaws but we tend to be pretty humble about the quality of our code. We all know that we could do better. And so when I talk to so many teams, they're like, “Well, we're bad at this, we know we need to do better, we don't even know where to start.” But the truth is, Amazon was bad at this 10 years ago, and Netflix was bad at this 10 years ago, and five years ago. And the sausage is made the same in many places.
Kolton Andrus: I think if people were more willing to share about their outages, we got to remove that stigma that they failed and they're bad at their jobs, and their company is bad. We may have as engineers made some headway in that ground, but the executives haven't, and the PR people haven't, it's happening, it's showing up in TechCrunch and in the mainstream news, but they're not embracing it. They're not debunking it by talking about what happened, and what they did, and why they were good engineers, they're trying to hide it or they're trying to not talk about it.
Kolton Andrus: And I think when people start sharing those stories, one, we have better data. No one's really sharing their incident or outage data, and so I'll tell you from my experience, most companies are between three and four nines, they're having between 45 minutes and eight hours of downtime a year. Those are the better companies. A lot of the companies that are up and coming are between two and three nines where they're having probably more than an hour of outage a month, and even some really top tier companies are in that bucket. And that's okay because there's a path forward and there's ways to make it better but you have to acknowledge it, you have to shine the light on it, you have to tell yourself and your company, hey, we're going to do better and we're going to invest in it because sticking our head in the sand, and just hoping things get better, and just hoping to dodge the bullet, and hoping that when it happens, we can just shirk out of the limelight is a poor strategy, in my opinion.
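For reference, the downtime implied by a given number of nines is simple arithmetic, and the figures quoted above are rounded versions of it. The short Python sketch below works it out; the function name and the 365-day year are choices made for this illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines (3 -> 99.9%)."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


for nines in (2, 3, 4):
    per_year = downtime_minutes_per_year(nines)
    print(f"{nines} nines: {per_year:.1f} min/year, {per_year / 12:.1f} min/month")
```

Three nines (99.9%) allows about 8.8 hours of downtime a year and four nines (99.99%) about 53 minutes, which lines up with the "45 minutes to eight hours" range. Between two and three nines, the monthly budget runs from about 44 minutes up to about 7.3 hours, consistent with "more than an hour of outage a month."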
Rich Burroughs: Yes, agreed. I mean, I remember when that GitLab thing happened where they were live streaming the incident. And I remember thinking, gosh, I hope I never have to do this. I really admire the level of transparency but I like ... just dealing with the incidents is enough for me, let alone having an audience.
Jacob Plicque: Yes, it's funny because growing up in tech, I can't even imagine talking about the incidents that I've either run into or even ones that I've caused because I'm a human being and now talking about it for a living. So now I wouldn't have it any other way.
Rich Burroughs: Hey, so to shift gears a little bit Kolton, so you're one of the authors of a paper about Lineage Driven Fault Injection or LDFI. And I tried to read that paper and it was a bit over my head. So, I'm hoping you can explain to me and the listeners like we're five years old what LDFI is.
Kolton Andrus: Yes, it's both a mouthful and as an academic paper, it can be a little hard to digest. There is the Netflix tech blog where we try to show some pictures and simplify it for folks that may be able to follow along at home. So the idea behind Lineage Driven Fault Injection is systems really stay up because there's some amount of redundancy. Whether it's hardware redundancy, a host failed, we had another host to take its place, or it's a logical redundancy. We had a bit of code and it failed, but we have some other way to fill that data or to have a fallback for that data.
Kolton Andrus: And so the key idea was, if we have some way to walk the system, we have some way to graph it, think like tracing, and we can see how the pieces fit together, so we can see the dependencies, then we could start to reason about, if one of these dependencies failed, could something else take its place? And so at its heart, it's an experiment, it's really we're walking this graph, and we're failing a node, and then we're checking to see what the user response was. So this is a key part. You have to build a measure: did the failure manifest to the user, or was the user able to continue doing what they wanted to do?
Kolton Andrus: And that sounds easy. It's like, oh, just check if the service returned a 200, or a 500. But in reality, you have to go all the way back to the user experience and measure that à la real user monitoring to see if the user had a good experience or not because the server could return a 200, and then the device that received that response could find that inside that 200 is a JSON payload that said error, everything failed. It happened. That's not a hypothetical. That was a learning from the process.
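To illustrate why a status code alone is not enough, here is a minimal sketch of a success check that also looks inside the payload. The `error` field is a hypothetical convention, not something described in the episode, and a check like this is still only a stand-in for measuring the real user experience on the device.

```python
import json


def request_succeeded(status_code: int, body: str) -> bool:
    """Count a response as a success only if the payload is usable too."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False  # a 200 with an unparseable body still fails the user
    # Hypothetical convention: failures are reported in an "error" field.
    if isinstance(payload, dict) and payload.get("error"):
        return False
    return True


print(request_succeeded(200, '{"error": "everything failed"}'))
print(request_succeeded(200, '{"titles": ["a", "b"]}'))
```

The first call is treated as a failure even though the server answered with a 200, which is the case described above.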
Kolton Andrus: So, we build this service graph, we walk it, we fail something, and then we rerun that request, or we look for another one of the same type of request. And we see if something else popped up and took its place or if that request failed. And then the other computer sciencey piece is, in the end, these service graphs are something that we can put into a satisfiability solver, a SAT solver. And so we can basically reduce it down to a bunch of ORs and ANDs. Hey, we've got this tree, obviously, if we cut off one of the root nodes of that tree, we're going to lose all of the children and all of those branches. And so we don't have to search all of those if we find a failure higher up because we can be intelligent that we'll never get to those.
Kolton Andrus: So at its root, it's build a graph in steady state, build a formula that tells us what things are most valuable for us to fail first, on subsequent or retried requests, fail those things and see if the system either has redundancy that we find, that the request succeeds, or if the request fails. And then as we go, we're getting into more and more complicated scenarios where we start failing two, or three, or four things at the same time.
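The search described above can be sketched in a few lines. This is an illustration only, not the LDFI implementation: the dependency formula and service names are hypothetical, the formula is evaluated in memory instead of injecting real faults and measuring real users, and a brute-force search with pruning stands in for a SAT solver.

```python
from itertools import combinations

# Hypothetical formula for one request type: the request succeeds if the
# gateway AND the API are up, AND ratings come from either the live service
# OR a cache fallback.
FORMULA = ("and", ["gateway", "api", ("or", ["ratings", "ratings_cache"])])


def services(node):
    """Collect every service name mentioned in the formula."""
    if isinstance(node, str):
        return {node}
    _, children = node
    return set().union(*(services(child) for child in children))


def succeeds(node, failed):
    """Evaluate the formula with the given set of services failed."""
    if isinstance(node, str):
        return node not in failed
    op, children = node
    results = (succeeds(child, failed) for child in children)
    return all(results) if op == "and" else any(results)


def minimal_failure_sets(formula, max_faults=3):
    """Find the smallest fault combinations that break the request.

    Combinations containing an already known breaking set are skipped,
    because running them would teach us nothing new.
    """
    names = sorted(services(formula))
    breaking = []
    for size in range(1, max_faults + 1):
        for combo in combinations(names, size):
            candidate = set(combo)
            if any(known <= candidate for known in breaking):
                continue  # pruned: a subset already breaks the request
            if not succeeds(formula, candidate):
                breaking.append(candidate)
    return breaking


print(minimal_failure_sets(FORMULA))
```

In this toy formula, failing `ratings` alone is survivable because the cache takes its place, so the search only reports it as a problem in combination with `ratings_cache`. That is the redundancy the experiment is looking for.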
Rich Burroughs: Oh, wow. Yeah, we actually just had Haley Tucker from Netflix on our last episode and I think that we talked about some of this and I didn't realize that we were talking about LDFI, so thank you for that explanation.
Kolton Andrus: Yes, I mean, there's a lot of cool things. Building FIT at Netflix really enabled LDFI because we needed that framework to cause the failure very precisely to run the experiments. It enabled CHAP, so the Chaos Automation Platform is entirely built on FIT, where it's essentially routing traffic to canary and control clusters, and then causing failures with FIT to see how they respond and how they behave. And then I believe Haley and her team are continuing that forward and even looking at other ways to do more of this A/B canary style testing around failure.
Rich Burroughs: Yes, she mentioned that they're adding in load testing along with the Chaos Engineering in that scenario, which I think is super cool. I love that idea of doing that A/B testing and doing the actual statistical analysis on what's going on.
Jacob Plicque: Yes, I think it's interesting too because I feel like we're seeing a lot of the different pieces come together. Obviously, things like continuous chaos within a CI/CD pipeline is typically where we first start with that more automated chaos. So of course you have your build or the canary cluster like you mentioned, but adding the load testing in front of that to help drive a steady state metric before you even kick it off makes a lot of sense. So, I mean, is that the next step or is there something from an overall Chaos Engineering perspective, is that where you think things are going Kolton or am I missing the gap?
Kolton Andrus: Well I think part of the trend here is the move from synthetic to real users. We saw this back in the days with Gomez and Keynote and some of that tooling, where it used to be very canned traffic that we'd throw at servers to see what the response time and then Gantt charts look like. But we moved to just sampling the user base because we got a higher diversity of customer types, customer locations, devices, it was just better data. With load testing, we can throw synthetic load at a server, but we can still get surprised. There's a new use case, there's an edge case over here, oh, this is a very expensive operation and we didn't weight it appropriately.
Kolton Andrus: And so at Netflix, a lot of ... on the platform teams for the proxy and the API, a lot of our performance testing was to push real traffic to a single node or a couple of nodes and watch them under duress, watch them hit that elbow point where the graph really starts to go exponential, and where things start to break down, and learn from that. I think when it comes to a lot of the testing we did, we did synthetic testing and we had canned use cases, and canned test user accounts, but the best testing we did at Netflix was our canary builds. One out of 300 hosts is running new code and we can use really three out of 300 because you can't just have one, right.
Kolton Andrus: But then we can sample those three against the behavior we had before to see if there's any interesting changes or switches. And I think Chaos Engineering just fits right in here as well. We can do synthetic things, and we can do planned failure modes, but really testing real failures, on a real system, with real users is what is ultimately going to prepare us best.
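One way to compare a canary against a control group is to ask whether the difference in error rates is larger than chance would explain. The sketch below uses a standard two-proportion z-test; the threshold and the request counts are made-up assumptions, and this is not a description of how Netflix's canary analysis works.

```python
import math


def two_proportion_z(errors_a: int, total_a: int, errors_b: int, total_b: int) -> float:
    """z statistic for the difference between two error rates (pooled variance)."""
    rate_a = errors_a / total_a
    rate_b = errors_b / total_b
    pooled = (errors_a + errors_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        return 0.0  # no errors at all (or all errors): nothing to distinguish
    return (rate_a - rate_b) / std_err


def canary_looks_worse(canary_errors: int, canary_total: int,
                       control_errors: int, control_total: int,
                       z_threshold: float = 2.58) -> bool:
    """Flag the canary when its error rate is significantly above the control's.

    The default threshold of 2.58 is an assumed, fairly conservative cutoff.
    """
    z = two_proportion_z(canary_errors, canary_total, control_errors, control_total)
    return z > z_threshold


# Made-up numbers: 10,000 requests sampled from each group.
print(canary_looks_worse(60, 10_000, 30, 10_000))
print(canary_looks_worse(32, 10_000, 30, 10_000))
```

With these sample numbers, doubling the error rate is flagged, while a difference of two errors in ten thousand requests is not.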
Rich Burroughs: So one of the things I talked to people about when I talk about why I think people should be experimenting in production when they're ready for that, you don't want to start off there. But when you've gotten to that point, is that your data is just going to be so different, especially if you've got long lived applications with data, you might have data that goes back 10 years that's been through all kinds of schema changes and migrations over that time. And typically when people in QA test something, they're generating maybe some test data based on the latest schema, and they're never going to run into the weird things that you'll see in your real data.
Kolton Andrus: 100%.
Rich Burroughs: So, Kolton you've had the opportunity to talk to a lot of people who are doing Chaos Engineering. What kinds of challenges have you seen them facing?
Kolton Andrus: Yes, I think it really falls into two buckets. So one you just touched on with when you're ready to run in production, which is the risk you're taking. To someone that is new to the idea, it just sounds risky, why would you do this? And it takes a little context and understanding to realize that it's de-risking things, it's taking small risks to prevent much larger risks. So I think that's a big part of general apprehension. Oh, you want to do this in production, oh, hell no. And the answer is, if you're not comfortable with failure happening in production, you're just not ready. And that's okay, but just acknowledge that, don't hide from it. So I think that's one bucket.
Kolton Andrus: The other bucket is cultural. It's difficult to be able to get people to make cultural changes.
Rich Burroughs: Yes.
Kolton Andrus: A lot of folks, they've just got a lot to do, they're busy, and if they don't understand up front why it's going to save them time or make them better at their jobs, then just like going to the gym, or eating healthier, or flossing every day, these are things that people know they should do but sometimes don't. And so I think that's really what it comes down to: how do we incentivize our organizations? Who feels the pain when these things go wrong? I'm a big fan of the DevOps movement in general because I saw teams where the engineers were oncall, and they felt the pain, they fixed things more quickly, they cared a little bit more than the teams where it was thrown over the fence or it was some maintenance software that somebody didn't really care about.
Kolton Andrus: So I think it's, how do you incentivize the right behavior so that we have what we want. We want reliable systems that our customers have good experiences from. We as engineers don't want to be woken up and we don't want that awkward feeling of our code not doing what we expect. And Ops teams want to know that if they're taking over a bit of code or running it, that it runs smoothly, that it does what they think it should do, it does what the engineer thought it should do.
Rich Burroughs: Yes. I mean, if your system has some brittleness in it, it's like, you're just waiting for the shoe to drop, right. There's probably going to be a failure at some point and wouldn't you rather have that happen when you're actually causing it, in a thoughtful way, as opposed to at 2:00 AM when you've been woken up with no context?
Kolton Andrus: I'm coming to believe there's a set of folks out there, and maybe they're just a little behind the curve here, that really believe they can just get lucky and avoid it happening. And I just know for me what clicked at Amazon was, I read a paper that did the math on, if a disk fails this often, if you have a data center full of these disks, this failure that was happening once every three years is now happening once a day. And I think it's that magnitude of scale and seeing it at Amazon and Netflix where it will happen. It doesn't matter how good you think you are, it will happen. And so you can embrace it and prepare for it, or you can clench and hide and just hope it doesn't happen. But one of those is a plan for success and one of those is hoping.
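The disk math is easy to reproduce. This is a minimal sketch that assumes each disk fails independently at a constant average rate of once every three years; the fleet sizes are made up for illustration and do not come from the paper mentioned above.

```python
def expected_failures_per_day(fleet_size: int, mean_years_between_failures: float) -> float:
    """Expected disk failures per day across a fleet of identical disks."""
    per_disk_daily_rate = 1 / (mean_years_between_failures * 365)
    return fleet_size * per_disk_daily_rate


for fleet in (1, 1_095, 100_000):
    rate = expected_failures_per_day(fleet, 3)
    print(f"{fleet:>7} disks: {rate:.4f} expected failures per day")
```

Under that assumption, a fleet of about 1,100 disks is already enough to turn a once-every-three-years event into a daily one, and at 100,000 disks it becomes dozens of failures a day.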
Jacob Plicque: And I think something that's always super fascinating to me on top of that, because we're talking about Amazon and Netflix as two prime examples of very large distributed systems, is that scale like this is not a prerequisite to doing these types of experiments, right. You can google architecting for failure on whatever system and there are papers, and blogs, and talks about where to start with that. And a key component that you'll hear about in all those talks is that failure is inevitable, so let's get in front of it.
Kolton Andrus: Yes. I mean, look, one of the things I like to touch on as well is, we were all sold that microservices and the cloud are the answer to all our problems, and they were the answer to some of our problems, they're the cause of some of our problems now. Microservice architectures come with a tradeoff and that is, we put the unreliable network in between everything. So I think that's exacerbated some of the pain we felt. And I think actually this focus you've seen on reliability and Chaos Engineering the last five years is because people are feeling more and more of that pain. But it's something that will always be important. It's something that we should always prepare for. It's just that, like styles, like fads, it's going to change how and where we need to prepare based on our current technology decisions and the way the market is shaping up.
Rich Burroughs: Yes.
Jacob Plicque: Right on.
Rich Burroughs: Hey, Kolton, I think we're about out of time here. Really appreciate you coming on the podcast. It's been very fun to get a chance to talk to you. I always appreciate having conversations with you. Do you have anything you want to plug like your Twitter or anything like that?
Kolton Andrus: No, I'm good. I'm easy to find.
Rich Burroughs: Well, we'll link to it in the show notes anyway. And also the LDFI paper so people could go in and beat their heads against that.
Kolton Andrus: Yes, it's probably good to link to the Netflix tech blog articles on FIT and on the LDFI project we did, that'll be useful for folks. But really thank you for having me. I love to share stories. I love to talk about how this came to be, and I am honored to be able to participate in and be able to play a role and help move this forward. Our mission at Gremlin is to help build a more reliable internet. So, if I can help my grandma, or my aunts, or my mom, not get a delayed flight, or not have their phone break, or be able to buy something on the first click, then I've done a little good in the world and that's what we're here for. Try to help each other out and have safe boring systems that just do the right thing.
Jacob Plicque: And get some sleep, right.
Kolton Andrus: Well, at least if you're going to stay up late, let it be a choice. Don’t be paged awake.
Jacob Plicque: Yes, let's play some video games, right.
Rich Burroughs: All right. Well, thanks again, Kolton. We really appreciate you spending time with us.
Jacob Plicque: Thanks.
Rich Burroughs: Our music is from Komiku. The song is titled Battle of Pogs. For more of Komiku's music, visit loyaltyfreakmusic.com or click the link in the show notes.
Rich Burroughs: For more information about our Chaos Engineering community, visit gremlin.com/community. Thanks for listening, and join us next month for another episode.