Break Things on Purpose is a podcast for all things Chaos Engineering. Check out our latest episode below.

In this episode of the Break Things on Purpose podcast, we speak with Alex Hidalgo, Director of SRE at Nobl9.


Transcript

Patrick Higgins: Matilda has been an absolutely atrocious puppy since we got home. She's been growling at the door. It's been ...

Alex Hidalgo: Matilda.

Patrick Higgins: I don't know.

Alex Hidalgo: You're wearing headphones, nevermind.

Jason Yee: We could have a special podcast guest.

Patrick Higgins: Welcome to season two of Break Things on Purpose. My name is Patrick Higgins. I'm a Chaos Engineer at Gremlin, and I'm joined here today by Jason Yee, who is the Director of Advocacy at Gremlin. How are you doing, Jason?

Jason Yee: Hey Pat, I'm doing great.

Patrick Higgins: Jason, happy new year. We're back for the new year. We had a chat late last year with Alex Hidalgo, who is at Nobl9 and the author of the SLO book. I wanted to ask you to kind of reminisce and talk about what you enjoyed about that chat, how that went for you.

Jason Yee: Yeah, I really enjoyed, I always enjoy chatting with Alex every time I get the opportunity to. He's got just such a wealth of experience. And one thing that we were talking about earlier that I loved was how he relates that experience to non-technical experiences as well. We all realized that at some point in our lives, we had worked in the service industry as bartenders and chefs and cooks and things, and a lot of what we do, the technical stuff, really comes down to people processes and how we deal with customers. And having those experiences of having to serve people really informs that.

Patrick Higgins: Yeah, definitely. He's also an excellent storyteller. He told a bunch of different stories that I really got into and, as you said, he's got so much experience. These different things that have happened to him have been really funny and really interesting. I'm really excited for everyone to hear them. So without further ado, let's get into it. Our conversation with Alex Hidalgo.

Patrick Higgins: Today, Jason and I are welcoming Alex Hidalgo to the Break Things on Purpose podcast. Alex is a principal SRE at Nobl9, and he's the author of the SLO book. How are you doing, Alex?

Alex Hidalgo: Doing great. Thanks so much for having me.

Patrick Higgins: Thanks so much for being here. So on the podcast, we like to ask our guests about specific horrible incidents that they've encountered in their career. What happened? How did you discover the issue? How did you go through resolving it? And what was that process like for you personally?

Alex's Adventure Into The Absurd

Alex Hidalgo: I've been doing this for a while, so I have way too many of these stories, unfortunately. Some of them sad and some of them honestly hilarious. Especially being on the prod mon team at Google for a long time. Right, we're the team responsible for ensuring everyone else gets their alerts and can know how their service is doing. So ensuring that was running was definitely a feat.

But my favorite version of the story is one that's kind of filled with absurdity, and I just love to share it. A long time ago, I was working for a managed service provider, like an IT firm, and I was the designated Linux guy and networking guy. And one of our clients wanted to upgrade this enterprise software. I won't name the vendor. They're still around today and doing very well. But it involved like four different components and they all talked to each other, almost like early microservices in a way. And this was in the days before everything was just packaged. I had a checklist of like 40 steps I had to take.

This was late enough that early virtualization existed. So I spun up three servers and I followed the checklist very carefully. I edited each config file exactly how I was supposed to and placed the license key in the right place. And then I started the services in the order that they were supposed to be started. And the last one, and none of this worked without all four services running, right? The last one just started, and then crashed, and dumped the entire heap to standard out, which is problematic enough. And there was just too much data there for me to dig through and figure out what was wrong. So I was like, okay, cool. I've got these new virtual machines. Let me just blow them away and I'll start from scratch.

And so I did it again, and followed the checklist very carefully. And I did it a third time, and same result every time. As soon as that last service started, it just crashed. I could not get it up. And so eventually I realized, oh, wait, this is paid for enterprise software. Let me just contact their support. And so I contact the support, first support engineer can't figure anything out. And the second support engineer can't figure anything out. We're on like day two or three or even four by now. And eventually I get escalated to, I believe it was the Vice President of Engineering, or something, but this person wrote most of the original code, right? He was one of the co-founders of this company.

And we can't figure it out. We're trying everything. The most minor things like, well, it runs on Red Hat for everyone else, let's try Scientific Linux, and just every possible thing we could think of. Eventually he and I even exchanged personal cell phone numbers, and every morning we'd wake up, and he's on Colorado time, so I'd wait a bit. But we'd call each other and try to figure it out. And I've given him access to the boxes by now so he can poke around. Still, everything is supposed to be right as far as we can tell, except this one last service just keeps crashing every single time. And we're out of ideas at this point, but I had to get it done. We were being paid as a company to upgrade this software. Luckily, this was a whole separate environment, right? This is like version five of the software, and version four was still running on some other server somewhere else. It was okay that it was taking a bit, but it couldn't take forever. And I just keep poking around and I keep Googling, but the product wasn't that well-known, right? Not that many people used it. Couldn't find anything.

And eventually, during all this troubleshooting, I turn on those Vim settings that display every special character. It shows you tabs and spaces. I randomly, on a lark, just because, I don't know, I was checking everything. I was rechecking the config files and this and that. I opened the license key file, and I noticed at the end of the file: backslash R, backslash N, because this license key had been copied from a Windows machine. And this caused the entire service to crash. I ran dos2unix, right? I replaced the carriage return plus newline with just a newline character. Everything came up perfectly, no problems at all after that. A single character in the license key file that otherwise, just looking at it with your eyes, seemed totally perfect caused this entire service to not even be able to start.
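
As a rough sketch of what dos2unix was doing in that story: detect Windows-style CRLF line endings in a file and rewrite them as plain LF. The filename below is just a placeholder, and the snippet is an illustration rather than anything from the episode.

```python
# Minimal sketch of the dos2unix fix: find Windows-style CRLF line endings
# and rewrite them as Unix-style LF. "license.key" is a placeholder path.

def normalize_line_endings(path: str) -> bool:
    """Rewrite CRLF as LF; return True if the file actually contained CRLF."""
    with open(path, "rb") as f:
        data = f.read()
    if b"\r\n" not in data:
        return False  # already Unix-style line endings; nothing to change
    with open(path, "wb") as f:
        f.write(data.replace(b"\r\n", b"\n"))
    return True


if __name__ == "__main__":
    if normalize_line_endings("license.key"):
        print("Stripped CRLF endings -- the invisible \\r that crashed the service.")
```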

Patrick Higgins: What do you take away from a story like that? What do you learn for next time?

Alex Hidalgo: I do tell the story mostly because it is so absurd and it's kind of funny. But on the same token, it teaches you that the problem can lie anywhere and you should never make assumptions. If you encounter an incident, the first thing you should ask yourself is: what is different about reality from what I thought, right? What difference is there from my understanding of the world? What has changed to have caused this problem? And yeah, in this case, it was a very kind of outlier one that I may never run into again, but on the same token, I'll always be checking the format of license key files for the rest of my life.

Patrick Higgins: How did you go about explaining this to your bosses afterwards? What was that story like?

Alex Hidalgo: Well, my bosses didn't care cause we were charging this client by the hour.

Patrick Higgins: Right.

Alex Hidalgo: It was a very small company. There were like seven, eight of us total, and we're all really good friends. And there was no leadership chain I had to worry about placating. And the client, they hired us because they didn't have the technical chops, right? They knew how to use this product really well. They built a very successful business. I'm not going to call out who they are either, but they were eventually bought by Google. But they didn't know how to do this, and that's why they hired my company to help them with this upgrade in the first place. And sure, it took a lot longer than expected, but I don't really remember there being any friction there either. In many situations I can imagine there being some, but I kind of lucked out. Everyone involved was kind of understanding.

Jason Yee: I feel like there's a good story here too from the software developer perspective, right? Of constantly checking your input filters. The fact that it crashed because of that rather than returning something like, "You have an invalid license key."

Alex Hidalgo: Yeah, I don't think I ever even asked, and if I did I cannot remember, what the details are of exactly how the code couldn't handle this in that catastrophic of a manner. I remember it was a Java app, so it was all proprietary code. This wasn't anything I could go poke at. This is not stuff any of us could poke at. But hey, if anyone out there is listening and thinks of a way that a Java program may completely crash, especially back in like 2010, because the input wasn't validated just right, please let me know because I've always kind of wondered about that.

Google's Pager List Mishaps

Alex Hidalgo: Here, I'll tell another quick one, because it also relates to like input validation in a way, or being able to input the wrong thing. I was on prod mon at Google, and I was on call for the alert manager side of things, right? So the entire infrastructure at Google that delivers pages to people. So very important. And I had already packed up for the day and I'm 20 feet away from my desk, something like that. And I get a page, it's got very weird text on it. And then I get another and another, and I'm like, okay, these pages don't even make sense to me what they're saying, and I'm in charge of the alerting infrastructure right now. So let me go back, I better turn around and sit down at my desk.

And by the time I get back to like the seating area, I hear everyone's pagers going off. And I'm like, well, this is going to be fun. And it takes us a little bit to figure out exactly what's going on, and that part isn't really interesting. But turns out what happened was someone went to go share a document, and you can share documents, Google Docs, with mailing lists. And these mailing lists will autocomplete. And at Google, you don't need to have the person's address, right? Normally if you're sharing like a G Suite doc, often you have to have talked to that person before. In some organizations, they may set it up so you have everyone. But at Google it was set up so everyone had access to every single mailing list. And I guess through some kind of typo or something, this person hit underscore twice and then shared it with a mailing list that had been purposely named with two underscores precisely so it wouldn't accidentally auto-complete, things like that.

And turns out this was an old mailing list that existed during an internal mail server migration at Google. And so it contained something like 40,000 email addresses, many of them old pager addresses, including those of people who no longer worked at Google. This was years and years old. I think it was like a six-year-old mailing list.

Patrick Higgins: Wow.

Alex Hidalgo: So people got alerted all over the world. All over the world, whether you still worked at Google or not. And the problem compounded itself because a lot of people were getting these as emails, right, because it was [inaudible 00:12:07] an email list, but many of them were plus-pager addresses, which would redirect to your pager. So people were like, reply all, "Please stop," or reply all, "I think you shared the wrong document with me."

Patrick Higgins: Oh no.

Alex Hidalgo: Which then of course sent pages to everyone else as well. And within an hour we got it all under control. It was one of those, in our incident retrospective, we had great items, like where did we get lucky? The person in charge of that mailing list in Australia happened to be up and recognized it immediately.

Patrick Higgins: Wow.

Alex Hidalgo: That person even went as far as finding the Buganizer bug, in Google's internal ticketing system, that was still open and said, "We need to delete this list."

Patrick Higgins: That's amazing.

Alex Hidalgo: We also got really fun stuff. There was another engineer in Australia who had forgotten to set their alarm that morning, but they got woken up anyway. We got to add things to the what-went-well section of the retrospective too: multiple people reached out and said, "Hey, it was really nice to hear from Telebot again. It was really nice to hear from Google's paging system because I've been gone for so long," and all of that because a long-lost mailing list that should have been deleted half a decade before accidentally autocompleted in someone's share-this-document dialog.

Patrick Higgins: Wow. It really raises the question of how many of these bigger, older companies have that foot gun just lying around all the time.

Alex Hidalgo: Yeah. Technical debt is difficult, right? Even those of us who care about it the most, we're always leaving it behind. And the engineer who was responsible for that list originally was one of the best engineers I've ever worked with. I knew him personally. Stuff happens, and that's fine. Stuff breaks sometimes, but sometimes it breaks in funny and interesting ways.

Patrick Higgins: Yeah, 100%.

Jason Yee: I can see a new Gremlin Chaos Engineering attack of sending emails to old email addresses.

Patrick Higgins: Mailing list.

Jason Yee: Yeah. That's some good Chaos Engineering.

Patrick Higgins: Particularly if it's recursive and it automatically replies all a couple of times. That would be in order.

Crashing NYU's Exchange Server and Hyrum's Law

Alex Hidalgo: There are so many stories of entire companies' mail servers going down, right? Not quite as frequently now that most people have hosted email services, but I think it was just like eight years ago or so that NYU's Exchange server went down because someone accidentally emailed every student at NYU. And being college students, they knew what they were doing. They were purposely going to reply all. Or at least I think it was NYU. If anyone's listening and I'm wrong, it was some college. But it caused such a feedback loop that the Exchange server just died.

A good way to think about it is: if someone's able to do it, they will, right? Hyrum's Law, are you familiar with Hyrum's Law? So Hyrum, an engineer at Google, actually just wrote Software Engineering at Google, I think the book is called. But anyway, his observation, his law, is, "With a sufficient number of users of an API, it does not matter what you promise in the contract, all observable behaviors of your system will be depended upon by someone."

Patrick Higgins: Yeah.

Alex Hidalgo: Right? If someone can do something, they're going to do it eventually. And yeah, I think it's an interesting thing to think about when you're talking about Chaos Engineering and exploring the outliers of your systems because at some point it's going to happen, whatever it is.

Jason Yee: So I think it's interesting, right, the whole funny thing of having this false alarm because of that email list. But I think it ties back to something that you've written a lot about, and that's SLOs and just generally monitoring and what we should be tracking and alerting on. So I'm curious if you could dive a little bit more into that and let's chat about your thoughts on SLOs. How did you get there, number one? What brought you to SLOs?

Alex Hidalgo: Yeah. In a way it was just introduced to me naturally because I was an SRE at Google, right? I may have eventually gone on to write the book, but I didn't come up with the concept, at least not how it was originally formulated. And it was just a thing that you did, and I didn't totally get it at first, to be honest. I'd spent years and years in industry already. I'm like, what are we doing with this? But then we were forced to because the product I was working on was also a cloud product, or at least backed a cloud product. And since Google had SLAs, all their GCP services needed an SLO set at a level below that, so we would know if we were out of error budget before we might violate our SLA. And that made it make a little bit more sense, but it still didn't like resonate with me.

What did resonate eventually, though, is when we deleted all the rest of our alerts. When we moved to a world where we only got alerted on fast burn and got tickets on slow burn, right? So the idea being, our math says we're burning through the error budget at a rate that is likely not recoverable without human intervention. And when you get to that point, and it's difficult to get there, but when you can get to a point where you're reasonably sure you're only catching a page if it's actually going to cause you to violate your SLA, then that's pretty awesome. There are so many false positives that go away. There's just so much general pager load that just disappears. And I was like, wow, these things are kind of cool. But I didn't quite understand how to use them for things that didn't have an SLA sitting in front of them.
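
To make the fast burn versus slow burn idea concrete, here's a rough sketch of error-budget burn-rate math. The 99.9% target and the thresholds are illustrative assumptions (14.4x is a common example from the SRE literature), not the values Alex's team actually used.

```python
# Rough sketch of error-budget burn-rate classification: page on a fast burn,
# open a ticket on a slow burn. Target and thresholds are illustrative only.

SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # fraction of requests allowed to fail

def burn_rate(errors: int, total: int) -> float:
    """How fast we're consuming error budget relative to the allowed rate."""
    if total == 0:
        return 0.0
    return (errors / total) / ERROR_BUDGET

def classify(errors: int, total: int) -> str:
    rate = burn_rate(errors, total)
    if rate >= 14.4:
        return "page"    # fast burn: the budget would be gone in a couple of days
    if rate >= 1.0:
        return "ticket"  # slow burn: fix it during business hours
    return "ok"

# 60 failures out of 10,000 requests in the window is a 6x burn rate -> ticket.
print(classify(errors=60, total=10_000))
```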

But then I joined the CRE team, the Customer Reliability Engineering team, which is a group of veteran SREs tasked with teaching Google's largest cloud customers how to SRE. And when you're trying to have conversations, not just cross-team or cross-org but cross-company, cross-industry, right? It's not like Google's cloud customers are all other large tech companies. How do you have the proper vernacular? How do you figure out how to have conversations about things? And what the CRE team decided is that was going to be SLOs. So the idea was basically, if we engage with you, we will help you. We will teach you what we've learned. We will examine your systems. We'll make you more robust, we'll make you more resilient, and therefore make you more reliable. But we need to know how to speak the same language first.

So the idea was basically, we will come onsite with you. We will run an SLO workshop. It'll be hands-on, we'll spend a whole week with you even. I'd spend a whole week at various different companies' offices. But the goal was, we need you to establish at least starter SLOs, and then once we're measuring your reliability from that standpoint, then we can engage further. Then we can figure out how to really isolate where the problems are, and things like that. And that's when they really clicked with me. That's when I started to understand the potential behind these kinds of approaches. And that's when it clicked with me that this is maybe a new formalization, but it's something that everyone already knows. Nothing's ever perfect, right? Don't shoot for 100%. Humans are actually okay with failure. And I started recognizing this in everything I'd ever done for a living.

Bartending Makes You Better

When I was a bartender, I'd try to greet all my customers within 30 seconds. And I knew I couldn't greet all of them within 30 seconds if it was busy. But I also knew that if I greeted enough of them, I'd still have a good night. But if it was way too busy and too many people were walking out, then it wasn't a good night anymore, right? And that little story, that's all SLOs really are when you really get down into the nitty gritty. It's accepting the fact that you're going to have failures. It's accepting the fact that your customers, your users, are actually okay with that. Every human is cool with something breaking every once in a while, as long as it doesn't break too much. That's what SLOs really are.
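
As a toy way of writing that bartending example down as an SLI and SLO: the 30-second threshold comes from the story, but the 95% target and the greeting times below are made-up numbers for illustration.

```python
# The bartending analogy as a service level indicator: what fraction of
# customers were greeted within 30 seconds, measured against a target.
# The 95% target and the sample times are assumptions, not from the story.

GREETING_THRESHOLD_SECONDS = 30
SLO_TARGET = 0.95

def greeting_sli(greeting_times_seconds: list[float]) -> float:
    """Fraction of customers greeted within the threshold."""
    if not greeting_times_seconds:
        return 1.0
    good = sum(1 for t in greeting_times_seconds if t <= GREETING_THRESHOLD_SECONDS)
    return good / len(greeting_times_seconds)

times = [12, 25, 31, 18, 45, 22, 28, 29, 60, 15]  # seconds until each greeting
sli = greeting_sli(times)
print(f"SLI: {sli:.0%}, good night: {sli >= SLO_TARGET}")  # SLI: 70%, good night: False
```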

Patrick Higgins: I absolutely love that you use experiences from bartending in how you think about your current work, because I absolutely do that as well. Thinking about any number of things, like queues and code promotion, how to be successful, I always take it back to bartending and like the physicality of getting things to people.

Alex Hidalgo: Mm-hmm.

Patrick Higgins: Yeah. That's awesome, I love it. It's really interesting you bring that up. It seems like so much of establishing best practices is establishing this common vernacular with the people you're trying to convey concepts to. And really a lot of it's just about getting terminology succinct and correct, and establishing a common agreement around it as well. I think that's really interesting.

Alex Hidalgo: So many problems are just miscommunication. The problem there is that humans are very emotional creatures, right? And we establish definitions of things in our heads, and it's difficult to convince ourselves that maybe this thing we were convinced meant one thing actually means something else, right? It's difficult to convince people of. If you catch them when they're first learning, you can be like, no, no, no, actually this thing means this. But preconceived notions can be very difficult to dissuade people of. We hold onto them. They help form our reality, right? If we suddenly learn this thing we thought our whole lives is actually wrong, that can be shocking. That can be jarring.

I think it's true just in the workplace as well, to a lesser extent, but once you believe something, when you think something, then it's difficult to change that. And that's exactly why establishing a common vernacular is so important because people likely know all these words, but they may have entirely separate definitions of them. And that can make things even worse, right? If someone doesn't know the word, they'll be, okay, what does that mean? But if you both know the phrases, but you have even slightly different definitions, you end up talking past each other without even realizing you are. And that ends up in disaster all the time.

Nobl9

Jason Yee: So we've been chatting about your time at Google. You've recently joined a new organization. So congrats on the new gig. But tell us about, you were saying Nobl9, right, is the name of it? Tell us a little bit more about what you're doing there.

Alex Hidalgo: Yeah, before I do, I do want to give a shout out to Squarespace. I was there for two years between Google and Nobl9, and I absolutely loved it. The only reason I left Squarespace is because I'm so excited about what Nobl9's doing. Yeah, as we've alluded to, SLOs are kind of my thing at this point, and Nobl9 is aiming to build the most comprehensive SLO platform. People often think that SLOs are something that can be simple to do, and this is often because the philosophies are simple. Let's define what that SLI means, let's define what an SLO means, let's define what an error budget means. And then people will go start to do it, and then they realize, oh wait, my monitoring tool can't actually calculate error budgets. Very few do. And even those that do only do so in like one way, and there's four or five potential ways that you can calculate error budgets.

And then you realize, okay, cool, this is fine. I'll build some tooling. So now you build your own internal service because nothing exists out there to help you do this stuff. And then you realize some of your metrics, when you're talking about SLOs you're generally talking about high volume request, response, API things. That's fine, but you don't just want to measure the latency of your API requests. You want to measure a whole user journey. So now suddenly you have to build tooling to allow you to actually probe or trace across many different services.

Then after that, you run into a service that only has like four data points per hour. And if you have a single error per hour, that almost makes it seem like you're only being 75% reliable, but you know that's not actually the case, because you're actually running fine the rest of the time. You just don't have the data points to prove it. So then you're like, okay, cool. I can solve this with stats. And so you go out there and you learn about binomial distributions and ways to normalize this data over time. And then suddenly you realize you need a whole team to build all this for you, because there aren't vendors doing this. There aren't metric systems that do this. There aren't time series systems that can actually do this.
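
A toy illustration of that sparse-data problem, with made-up numbers; it only shows the distortion, not the binomial normalization Alex mentions.

```python
# With only four probes an hour, one failed probe drags the naive hourly SLI
# to 75%, even if the service was actually healthy for all but a moment.

hourly_probes = [True, True, False, True]  # one failure out of four checks
naive_hourly_sli = sum(hourly_probes) / len(hourly_probes)
print(f"Naive hourly SLI: {naive_hourly_sli:.0%}")  # 75%

# Widening the window gives a more honest picture: the same single failure
# across a week of probes at 4 per hour.
probes_per_week = 4 * 24 * 7  # 672 probes
weekly_sli = (probes_per_week - 1) / probes_per_week
print(f"Weekly SLI with that same single failure: {weekly_sli:.2%}")  # ~99.85%
```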

And then next thing you know, you've spent two years building this tooling, which is basically the story of my time at Squarespace. There's so much more to it outside of the generic examples. The generic example of your web API, and let's make sure that your latency isn't too high, or it's not too high too often. That's not what most people's services look like. It's the easiest to explain, and it's what a lot of Google's services look like. And that's how they defined it and that's how they wrote about it in the first two SRE books. But that's not what everyone else's stuff looks like, and that makes it really difficult to adopt this approach in any kind of meaningful way.

So that's what we're doing at Nobl9. We're looking to build that tooling so you don't have to keep building it. So people don't have to keep building it themselves. But beyond that, because we're an entire company focused on this, as opposed to you and your side project, it's going to be the most comprehensive version of this possibly imaginable. It's still very early for us, so I have no timelines on any of this, but I have a list of 36 data sources we currently hope to integrate with, for example. So we're not just talking about, we'll convert your Prometheus metrics for you. We're talking about, let's talk to your business logic systems. Let's talk to literally anything that can send us data. We'll do the math for you, and we'll give you better data to make better decisions.

Patrick Higgins: And you're obviously dealing with a varied set of circumstances when it comes to different potential customer use cases. Have you had any edge cases yet where that's kind of happened and you've been like, oh, I did not see that coming, like I didn't expect that at all?

Alex Hidalgo: A little bit. We only have a handful of beta customers right now. Again, we're still very early. But yeah, we've already run into situations where queries against certain monitoring vendors are not returning the data we expected, not returning the data we thought. And the data looked totally different once we got it, once we grabbed it out of their API, versus what the customer thought it looked like inside that vendor's tool. So it's not a super exciting example, but yeah, we're already running into things where we're following the API docs, and we thought we were following the query language documentation, and it still operated in a way that we didn't expect. The data did not look like how we expected it to.

Jason Yee: That's something that we encounter a lot just in Chaos Engineering, right? I was chatting with a customer the other day and they were like, "We injected some chaos that was supposed to consume all the CPU, and we see it in one graph and we're not seeing it in the other, but it clearly says 'cluster CPU percentage,' why isn't this working?" Yeah, and you dig down through multiple layers of docs and suddenly you find the note that says, "Oh, this doesn't actually mean that, it means this other thing."

Alex Hidalgo: Yep. And chances are that graph had been looked at for years, thinking it represented something when actually it didn't, right?

Patrick Higgins: That's such a good example of the fact that we're looking at these things trying to discover those preconceived notions that we're trying to break out of, trying to really push the boundaries of what we believe, and trying to generate these new models, a whole new world of new beliefs.

Alex Hidalgo: And sometimes you never even figure out what's actually going on. I remember at Squarespace we had a dashboard for the ELK stack, right, the big Elasticsearch log ingestion stuff. And we were running into some problems with something and I'm trying to dig into it. I can't really figure out what's up. I'm like, "Oh, maybe the network links are saturated." And I go and look at the graphs that we had set up, and they looked fine. Okay, but this really feels like maybe the network links are getting saturated. So I logged onto one of the servers, and when I ran iftop, it showed a whole different story, right? We were pushing 16 times the amount of data that these graphs were showing us.

Okay, so the graphs are wrong. So I go to the graphs and I look at the query and I look at the metrics, and it was just a statsd exporter, and it looked right. And I go check the statsd docs, and it looked right. And I never once figured out what it was. I never quite figured out what that discrepancy was. But the kernel, via iftop, was telling me something entirely different than what statsd was. And I just replaced the graphs with a totally different data source. That was fine, but right, as far as I know, those graphs existed for several years and people had just assumed that they were accurate. And when it was time to actually examine the data within those graphs, they just simply weren't.

What Alex Is Currently Excited About

Patrick Higgins: Well, Alex, I would like to ask you about the things that are going on for you at the moment, what you're excited about. Could you plug your pluggables in terms of what you're super excited about right now?

Alex Hidalgo: Yeah. So I think we're at a really interesting time in the industry because I feel like, for the first time that I've been involved, at least, in Tech with a capital T, people are starting to understand that we need to be looking outside of our own discipline, that we can learn so much from others, and that we shouldn't just be trying to come up with everything from scratch. I see this in everything from people just discovering that statistics as a discipline can help you learn about numbers, all the way to the adoption of what safety engineering and resilience engineering can teach us. It just seems like people are finally more open than they ever have been before to, let's learn from others instead of trying to be the best. Software is not different enough to not be able to learn from others. So in a very large scheme, that's something I'm incredibly excited about.

I'm very happy that we are finally starting to see some people out there truly understand what observability means, as opposed to just metrics collection. Two companies I'm not affiliated with, but I love them both very much, Lightstep and Honeycomb, are both absolutely phenomenal. Go check them out. I absolutely love what Gremlin is doing. I love just the general acceptance of, let's make sure that we understand our systems by not necessarily always breaking them. Chaos Engineering doesn't have to involve breaking. Let's understand our systems better, right? We cannot make them safe or resilient or robust, and therefore reliable, and that's my whole thing, right, reliability, without understanding them better. And to understand them better we can't just let them sit stagnant. Yeah, broadly, those are some things I'm most excited about, just seeing it spread across the rest of the industry.

I have some qualms. I hope that these things aren't all subsumed by the marketing departments of various companies like we've seen happen with DevOps. It originally was a philosophy, and now it's a Microsoft Azure product name. I can't even wrap my head around that, much less people with the title DevOps. Sorry, I'm not trying to be insulting to anyone. I'm old, and that term's taken now. It's fine, language evolves. I get that. I just hope it doesn't happen with things like observability or Chaos Engineering or resilience engineering. Or even just the word reliability itself, right? I see it very often get conflated with availability, and they're very different things.

So yeah, I'm excited about a lot that's going on. Slightly pensive, hoping that the popularity of some of these things doesn't ultimately become their downfall. But I think it's actually a really, really cool time to be in the reliability space, to be in this space of how can we make these global scale, distributed, multi-component, deep systems, how can we make our complex systems, how can we make them reliable? How can we make them more useful to our users as well as the people that have to maintain them? I'm generally pretty optimistic.

Patrick Higgins: Awesome. Well on that note, because I think that's a beautiful note to end this on, thanks so much for joining us today, Alex.

Alex Hidalgo: Thanks so much. I had a blast being here.
