Podcast: Break Things on Purpose | Ep. 4: Caroline Dickey, Site Reliability Engineer at Mailchimp
Break Things on Purpose is a podcast for all-things Chaos Engineering. Check out our latest episode below.
You can subscribe to Break Things on Purpose wherever you get your podcasts.
If you have feedback about the show, find us on Twitter at @BTOPpod or shoot us a note at podcast@gremlin.com!
In this episode, we speak with Caroline Dickey, Site Reliability Engineer at Mailchimp.
Transcript of Today's Episode
Rich Burroughs: Hi, I'm Rich Burroughs and I'm a Community Manager at Gremlin.
Jacob Plicque: I'm Jacob Plicque, a Solutions Architect at Gremlin. Welcome to Break Things on Purpose, a podcast about Chaos Engineering.
Rich Burroughs: Welcome to episode four. This episode we have an interview with Caroline Dickey of Mailchimp. Everyone we've spoken to has a different story, and Caroline is a Site Reliability Engineer at a company that lets many businesses communicate with their customers, so reliability and resilience are very important to them.
Jacob Plicque: Yeah, it's really cool to see how Chaos Engineering has become an important part of the reliability story. We're also excited to announce that Caroline will be one of the speakers at this year's Chaos Conf, our yearly conference about Chaos Engineering. Chaos Conf will be held September 26th at the Regency Ballroom in San Francisco. You can get more info about the conference at ChaosConf.io.
Rich Burroughs: Okay, great. Let's go now to the interview with Caroline.
Rich Burroughs: Today we're speaking with Caroline Dickey. Caroline is a Site Reliability Engineer at Mailchimp. Welcome.
Caroline Dickey: Hi. Thank you. Thank you so much for having me.
Jacob Plicque: Let's kind of kick things off. What was it that got you started in the kind of, I guess, the overall like tech field or maybe even just computing in general?
Caroline Dickey: Yeah, that's a great question. I got my undergraduate degree from Georgia Tech. I started out majoring in biomedical engineering. Many freshmen do and many freshmen leave. I ended up switching from BME into Computer Science, found it to be a great fit for me and never looked back.
Jacob Plicque: That's interesting. I have a friend who's actually in that biomedical engineering sort of field, and at the local college level there are like three separate degrees. There's biomedical engineering, then IT, and then there's the information tech management type of field. It's interesting that instead of going that route, you kind of dug deep right away.
Caroline Dickey: Yeah, I think I just found my niche. I found something that I found very interesting and was able to get a great career out of it, which is kind of the dream.
Rich Burroughs: Yeah, it's great that you were able to recognize that you needed a different course like that early on. Some people don't realize that until a lot later.
Caroline Dickey: I'm actually, I guess, a glutton for punishment because I'm working on my Master's degree right now in computer science. I know, couldn't keep me away.
Rich Burroughs: Oh wow.
Jacob Plicque: Not out of the woods yet then, right?
Caroline Dickey: Exactly.
Jacob Plicque: That's interesting. Was there a particular event that caused you to make that switch? Then where did that lead into going back for the Master's degree?
Caroline Dickey: The switch from biomedical engineering into software engineering?
Jacob Plicque: Yep.
Caroline Dickey: My first chemistry class was what prompted that switch. It just was not for me. I took my first computer science class, I think it was either Python or Java, and I really enjoyed it. I just continued down that path. Then as far as the Master's degree, as a Site Reliability Engineer, you kind of need to know it all at least a little bit. You're working with a lot of different teams and it really helps to have a working knowledge of the type of work that they're doing, whether it's Kubernetes, Kafka, databases, or networking. I just kind of realized that I wanted to know a little bit more about some areas that I wasn't as strong in. I'm on my second class. It's been pretty good so far.
Rich Burroughs: After school you get your undergraduate degree in computer science. What happens in between then and when you showed up at Mailchimp?
Caroline Dickey: Before I worked at Mailchimp, I was a backend software developer at Delta Airlines. I started working there during college and then continued a little bit after graduating.
Rich Burroughs: Okay, and then on to Mailchimp from there?
Caroline Dickey: Yes, then on to Mailchimp.
Rich Burroughs: Great. Obviously at Mailchimp you've got a lot of customers who are depending on your platform to be able to send out messages about what's going on with their products and services and just stay in touch with folks. How important is reliability to you all there?
Caroline Dickey: It's incredibly important. We are a marketing platform for small businesses. Small businesses rely on us to do their marketing, their emails, e-commerce. If we're down, they may not be able to make a sale. It affects our customers directly if we're down, if we're not available. It can be hard. Whenever we experience an outage, like all companies do, you kind of see the pain that our customers are having. They're not able to send out the newsletter they send out every week. It really makes us want to give them the experience that they deserve, as high a level of reliability and resiliency as they can possibly get.
Rich Burroughs: It's interesting because we've done several of these interviews now, and one thing that comes up repeatedly is that idea of looking at things from the customer's perspective, having empathy for them, because really, in essence, that's what reliability is all about. We want to build these reliable platforms so our customers have a better experience.
Caroline Dickey: Yeah, absolutely. Mailchimp is ... Or I guess the Site Reliability Engineering team is really starting to prioritize Service Level Indicators and Objectives. I think that's frequently associated with Site Reliability Engineering, basically setting that level of service your customers should expect and then measuring whether or not you're meeting it.
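For readers who want to see that idea made concrete, here is a minimal sketch, not Mailchimp's implementation, of an availability SLI measured against an SLO, with the remaining error budget. The 99.9% target and the request counts are illustrative assumptions.

```python
# A minimal SLI/SLO sketch: define the level of service users should expect,
# then measure whether you're meeting it. Numbers below are made up.

SLO_TARGET = 0.999  # assumed example objective: 99.9% of requests succeed


def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    return successful_requests / total_requests if total_requests else 1.0


def error_budget_remaining(sli: float, slo: float = SLO_TARGET) -> float:
    """How much of the allowed failure rate (1 - SLO) is still unspent."""
    allowed_failure = 1.0 - slo
    actual_failure = 1.0 - sli
    if allowed_failure == 0:
        return 0.0
    return max(0.0, 1.0 - actual_failure / allowed_failure)


if __name__ == "__main__":
    sli = availability_sli(successful_requests=998_800, total_requests=1_000_000)
    print(f"SLI: {sli:.4%}  meets SLO: {sli >= SLO_TARGET}")
    print(f"Error budget remaining: {error_budget_remaining(sli):.1%}")
```

Once the measured SLI dips below the objective, the error budget is spent, which is often the signal teams use to slow feature work and prioritize reliability.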
Jacob Plicque: I think something that you said that was really interesting was, "Hey, these things do happen." We talk about outages and we see things like public-facing postmortems, or in some cases there are companies that, even in the middle of their incident, go live on Twitch and YouTube and say, "Hey, we want to understand better and be very transparent about the things that are happening internally," because, going back to the point, customer experience is everything. I think that's really interesting, to just kind of jump in and say, "Hey, these things happen," but we're tying these directly into those SLOs and SLIs that you mentioned so that we can get better.
Caroline Dickey: Yes, absolutely. I think it's so important that you are transparent with your customers because, like we were talking about earlier, they are relying on you for a really important service. If you just leave them in the dark and they aren't able to use the service, that's really frustrating. I think we've all been on that other side of it where you want to use a service and it's down and they're not communicating effectively to you. We obviously want to give our customers as much transparency as we possibly can, with that understanding that these things do happen. We're dealing with incredibly complicated applications and we are trying to move quickly. We're trying to just keep momentum and keep building new features. Sometimes, every once in a while, that rapid growth can come at a cost, and maybe you do experience a short outage.
Jacob Plicque: Sure, or in some cases things go wrong when that happens, right? Tying that together, how did Chaos Engineering kind of enter your radar based off of those previous points?
Caroline Dickey: Well, so I first learned about Chaos Engineering from a talk given by Nora Jones at SREcon in 2017.
Rich Burroughs: I was at that talk.
Caroline Dickey: Oh, I guess we were in the same room. That's so funny. She did a fantastic job. I kind of brought that context back to my team at Mailchimp. I was fortunate enough to have a very supportive team and manager that allowed me to start investigating what that could look like at Mailchimp. Over the next few months, that was kind of like a side hustle as I continued to do normal SRE day to day work. We've kind of grown from there and it's become very successful.
Rich Burroughs: Yeah, I actually met your manager at SREcon.
Caroline Dickey: Peter is great.
Rich Burroughs: Yeah, he is. It's interesting to me because, from my perspective, before I started working at Gremlin I hadn't done any Chaos Engineering, but I'd been to SREcon a few times. I had seen Nora talk about it. I had seen some other folks from Netflix talk about what they're doing, but I had never actually done any of it. I think most people who are working in operations tend to have a pretty big backlog of things that they'd like to do that sometimes aren't a priority because of other things that the business needs. Maybe you've got to deliver something to engineering, and so you're not able to make these other improvements that you'd like to make. What's your experience been like in terms of being able to prioritize Chaos Engineering and make that something that becomes a regular part of your schedule?
Caroline Dickey: I think that is absolutely true. It's true for operations teams. It's true for development teams. Feature development is very important. You've got to continue building and putting out great new features, but yeah, I think prioritizing resiliency and reliability is equally important and doesn't always get the attention it deserves. At Mailchimp we have committed to once a month, we will do Game Day every single month. It doesn't have to be anything crazy. Committing to that cadence has allowed us to grow. In the beginning it was just the Site Reliability Engineering team. As we continued, we were able to bring in people from different operations teams, from database engineering, from services. They, of course, had things that they were interested in testing. We came up with Game Days for them and it just kind of expanded. I think you just got to push through those first few ... whatever that cadence is. For us a month made the most sense, but you've just got to put it on the calendar.
Caroline Dickey: People won't always be able to make it, and that's okay. I think our second Game Day we only had like three people other than me. That was a little bit disappointing, but now we have quite a bit more attendance. As soon as you stop doing it, nobody else is going to pick it up for you. Well, it depends on what team is championing it. For us, it's the Site Reliability Engineering team. For us, it was important that our team continued to be committed to that once a month.
Jacob Plicque: What I think is really interesting there is it seems like it was a natural growth process just based off of the fact that A, you were doing it, and B, what it sounds like, and correct me if I'm wrong, is you were talking about doing it. You were talking about the things that you found. To be frank, some of this stuff is interesting and really fun, right? I think that kind of naturally spreads. I'm curious if that's what you ran into at Mailchimp.
Caroline Dickey: Yes, definitely. Between Game Days I would reach out to engineering managers directly and kind of share with them what we were doing and see if they had anybody on their team who was interested in participating. They typically did. They would give me a name and we would invite them to the next Game Day and they would kind of tell a friend. Just like you said, it did spread somewhat organically. We also communicated a lot. We created a chaoseng Slack channel and used that for sharing updates about Game Days. We published an internal newsletter after Game Days.
Rich Burroughs: Oh, that's great.
Caroline Dickey: Yeah, it was a really great way you could kind of come up with a nice summary and send it out. We like our newsletters at Mailchimp.
Rich Burroughs: I was just going to say, you've got a tool to send out newsletters.
Caroline Dickey: Yeah, I use Mailchimp for it. That was great. I gave an internal tech talk. I think that was definitely one of the tipping points, where I was able to share some successes that we'd had with other groups. People who had been unfamiliar with Chaos Engineering, who didn't know what it meant, all of a sudden did, and were able to kind of start thinking about what that could look like for their team.
Rich Burroughs: Yeah. I mean I've been in that position of being a champion, not with Chaos Engineering, but with other tools. By a champion we mean kind of that person who's sort of advocating for the use of this thing inside the company. It's really, I think, important to be able to show successes. When you talk about something in kind of a vacuum, people don't necessarily get excited, but when you can show them that, hey, we discovered this problem with our systems and now we fixed it and we're going to be more resilient because of that, I think that goes a lot farther.
Caroline Dickey: Oh, definitely. I think we had one of our biggest kind of turning points or public turning points for the company when we were able to use Chaos Engineering to identify the source of kind of a tricky incident. We got some really smart people together in a room and were able to use Gremlin's tooling to do some network attacks and identify kind of a weird internal dependency and fix it. That was really, really great because we were able to show a very tangible success story. I think that got a lot of people thinking. From there we've had a whole team decide that they're going to start doing Game Days regularly for app performance and set aside four or five hours. Just that's something they're going to do regularly too. All of a sudden we have more people getting invested and involved.
Jacob Plicque: Yeah, that's awesome. I think we talked about this on previous episodes, and it's something that we're probably going to talk about almost every episode: those wins that you find. Part of it is definitely validating, hey, I want to make sure my monitoring is okay and this fails over the way I expect it to, but I can't think of a better example of a situation where, and correct me if I'm wrong, but if I recall correctly, that was a situation where you were able to validate a particular outage use case, a cause I should say, and then you were able to fix it in the Game Day and then rerun the experiment and prove out that you were able to resolve it.
Caroline Dickey: Yes, that's exactly what we did. It was incredibly gratifying.
Jacob Plicque: That's the crème de la crème right there.
Caroline Dickey: Exactly.
Rich Burroughs: Yeah, I mean those dependencies that you don't know about, that kind of stuff can just be brutal. I don't know. When I talk about Chaos Engineering, one of the things I talk about with people is the difference between like that architecture that you drew up on the whiteboard, the way that you intellectually think that the thing is working and what it's actually doing. Those dependencies that creep in are one of those big things that you might not even have documented.
Caroline Dickey: Oh, definitely. I think even if you do have it documented, it's, have the right people seeing that documentation and do they understand that documentation? Yeah, I mean, it's tricky. I think Chaos Engineering is a great way to kind of help a group of people understand how something fails and also how it works, but in a way that maybe documentation can't convey quite as effectively.
Rich Burroughs: Yeah. I love that use case of being able to recreate something that you actually saw in your production environment that maybe caused an outage, maybe it didn't. But to be able to actually use tooling to inject that network latency or do whatever it is to try to ... You're never going to exactly duplicate what happened in production but to just be able to simulate it to a certain extent and see what happens.
Caroline Dickey: Yes. We used the staging environment in this case. We didn't want to fully recreate that outage. We were able to recreate the exact same symptoms, the app going down whenever the dependency went down, which really shouldn't have happened, and then figure out exactly why that was the case, which was an HTTP error code that was being returned. It was a 503 instead of a 500. It was just one of those moments where we're like, "What? What is this?"
Rich Burroughs: Oh wow.
Jacob Plicque: So it was like, "Why?"
Caroline Dickey: Yeah, exactly. I'm sure it was just one of those funny things. The dependency application had never gone down before. It had always been fairly dependable until, of course, it did go down. We'd never experienced that before.
Rich Burroughs: Yeah. In the old days it used to be the thing where you rebooted the server that hadn't been rebooted for like three years, right? And then suddenly all kinds of stuff happens that you don't expect. I also had the pleasure of sitting in on one of your Game Days because you all are a Gremlin customer. It was super interesting. I have to say, I have to give you props. I thought you all worked really well together as a team. It seemed like you all communicated really well. The one that you were doing when I was listening was actually some experiments related to Kubernetes itself, to etcd.
Caroline Dickey: Yeah. This was a Game Day that I'd been looking forward to from the very beginning of Chaos Engineering. Because with Kubernetes, you can do things like kill pods and you can expect some resiliency. Our main Mailchimp application runs on bare metal CentOS. There's really ... There's not much of an expectation of resiliency if you kill both servers necessarily. It was exciting to be able to test kind of with a little bit more expectation of resiliency there. In this case, one of the things we were testing was... we use Puppet for configuration management and we've had a Puppet change cause our Kubernetes master nodes to change to regular worker nodes, which caused some impact, as you might imagine. We saw impact to etcd in a few different Kubernetes components.
Rich Burroughs: That is really interesting.
Caroline Dickey: It was. We don't have any ... I could be wrong about this, but I don't believe we have any true customer facing applications running on Kubernetes. We do have quite a few internal applications and tooling, things like that. It wasn't catastrophic, but it certainly was something we wanted to not have happen again.
Rich Burroughs: If I remember right, you all were looking at the etcd kind of promotion, like what happens when your primary goes away?
Caroline Dickey: Yes. Etcd is the backend key/value store for Kubernetes. If etcd becomes unstable then we expect the cluster to become unstable. If it can't communicate, then there won't be a leader. In this case we were hoping to see some alerting. The test that we ran was basically injecting latency. We wanted to see how much latency could be injected into outgoing traffic from the current etcd leader before a different leader was elected. We were doing this in our staging environment, which is set up a little bit differently than our production environment. In our staging environment, we only have three master nodes. In production we have more. We started with 200 milliseconds of latency, didn't see any impact, bumped it up to 1000 milliseconds, saw some lag and alerts, but still didn't see a leader reelection. At 2000 milliseconds, we didn't have any leader at all and our API server was all of a sudden read only. This was weird. This was not what we were expecting to see. It didn't match how we had configured it.
Caroline Dickey: In this case, what we believe happened was something a little bit off due to the way that we tested it. We believe that we did the network latency attack on a certain range of ports, and those ports covered requests to the leader but not the heartbeat to the other nodes. The other, I guess, master nodes believed that the one we were targeting was available, and it believed that it was not. Everybody got confused. It was interesting. It was definitely interesting to kind of think through and observe. That wasn't quite as applicable to production, but we definitely did learn a lot.
Caroline Dickey: Then after that, we kind of had another strange etcd issue where we decided to just fully blackhole the leader node to make sure that the leader election worked correctly. Instead of that working, we had accidentally put it in a kind of split brain state because we only had two master nodes left and they weren't able to elect a leader. It was very interesting. It was a great learning opportunity for everybody in the room who hadn't done much with Kubernetes. You kind of were able to see all the people that did do things with Kubernetes get very confused, but I think in this case it was some kind of weirdness related to it being stage, but in that Game Day we also did identify a few different misconfigured alerts. We definitely did have some actionable things we were able to take away from that Game Day in addition to the learning.
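For context, here is a rough sketch of how a latency step-up like the one Caroline describes could be approximated outside Gremlin's tooling, using Linux tc/netem on an etcd leader node. The interface name, step durations, and the use of etcdctl are assumptions for illustration, and unlike the port-scoped attack described above, this delays all egress traffic on the interface, heartbeats included.

```python
# Sketch only: step up egress latency on an etcd node and watch leader election.
# Requires root for tc, and etcdctl (v3 API) configured with your endpoints.
import subprocess
import time

INTERFACE = "eth0"                    # NIC carrying etcd peer traffic (assumption)
LATENCY_STEPS_MS = [200, 1000, 2000]  # mirrors the step-up described in the episode
OBSERVE_SECONDS = 120                 # how long to watch each step before escalating


def add_latency(ms: int) -> None:
    # netem adds a fixed delay to all egress traffic on the interface.
    subprocess.run(
        ["tc", "qdisc", "add", "dev", INTERFACE, "root", "netem", "delay", f"{ms}ms"],
        check=True,
    )


def clear_latency() -> None:
    # Remove the netem qdisc; ignore the error if none is installed.
    subprocess.run(["tc", "qdisc", "del", "dev", INTERFACE, "root", "netem"], check=False)


def show_etcd_status() -> None:
    # Prints endpoint status for the cluster, including which member is leader.
    subprocess.run(["etcdctl", "endpoint", "status", "--cluster", "-w", "table"], check=False)


if __name__ == "__main__":
    try:
        for ms in LATENCY_STEPS_MS:
            clear_latency()
            add_latency(ms)
            print(f"Injected {ms}ms egress latency on {INTERFACE}; observing...")
            time.sleep(OBSERVE_SECONDS)
            show_etcd_status()
    finally:
        clear_latency()  # always revert to a clean state when the experiment ends
```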
Jacob Plicque: That's awesome. It kind of makes me think of two specific things that are really, really important around what you expect to happen. What it sounds like to me is the hypothesis was very different at 200 milliseconds, very different at 1000, and very, very different at 2000. It sounds like all three of them weren't what you were expecting, and so it impacted differently than anticipated. That's kind of what started the discussion from there and the learnings from there. Is that accurate?
Caroline Dickey: Yeah, that sounds right. It was a very interesting discussion. A lot of people just not really understanding why we didn't have any leader. It was a pretty good Game Day. We enjoyed that one.
Rich Burroughs: I love this use case of actually experimenting on the Kubernetes pieces themselves because it's something where if you're operating a cluster you again can have read the docs and you can understand leader election to a certain extent, but you don't necessarily know how it's going to behave in every different kind of circumstance. Being able to sort of trigger some failures and see what actually happens and practice is super cool, I think.
Caroline Dickey: Absolutely. That was the feedback we got from our systems engineers. They went into the Game Day a little bit skeptical and they came out with kind of a newfound respect for it, because they found things they didn't expect to find and they learned things they didn't think they would. They didn't expect to learn anything from this test.
Jacob Plicque: Yeah. Yeah. See that's actually really interesting. Was it skepticism about like Chaos Engineering as a whole or that things would work the way that they said that they would, or what do you think?
Caroline Dickey: I think probably that things would work the way they expected them to. I think that's kind of a common expectation going into a Game Day where you've worked in a system and you feel fairly confident that things are going to go well.
Jacob Plicque: Right, and then sometimes things can fall over and then you'll go, "Oh, all right, well that happened." So then what do we do from there? I'm curious what the next steps were from that aspect afterwards.
Caroline Dickey: From that Game Day we did fix the monitoring. That was kind of a pretty easy first step. We've continued ... We haven't done too much more work in that area just because we aren't using Kubernetes for anything customer facing right now, but we are planning on at some point doing another Game Day on production now that we've developed some confidence testing in that staging environment. That's definitely on our radar.
Jacob Plicque: Yeah. You stole my next question. That's perfect. I was just about to ask, because I know that Mailchimp is doing some Game Days in production. A question we get a lot is around the common misconception that you have to start there. We talk a lot about, as you mentioned, starting in staging and figuring out what we know and then getting to a certain comfort level before we get there. It's awesome that Mailchimp as a whole is ready, and ready to learn from things where the money's made, so to speak.
Caroline Dickey: Totally agree. I think when I started exploring Chaos Engineering, you have that desire to break things in production because it just sounds like you're going to learn so much, but starting in stage was absolutely the right move for us. I think it allowed us to get confident in the practice of Chaos Engineering. We didn't take anything down and we were able to really push the limits of the application, all the way to knocking over a database, things like that that you would not want to do in production really under any circumstances, like [inaudible] cases. We do run Game Days in production. Typically when we're doing that, we will also run similar experiments in stage first, just to make sure that nothing crazy is going to happen before we move into production. We try to be fairly cautious around that just because there is risk. While you would rather have an outage during the daytime than wake somebody up in the middle of the night, I think the preferred thing would be to have no outage at all.
Rich Burroughs: No one here is going to argue with that.
Caroline Dickey: Right.
Rich Burroughs: We're with you. We've actually started publishing the results of some of our Game Days, the internal ones that we do. It's a very similar kind of pattern. Our coworker Tammy Butow, who was on our first episode, has put out a sort of roadmap for an example set of Game Days. The first one is testing your monitoring in staging, and the second one is doing it in production. And lo and behold, just like you all, we found some things that we can improve in our monitoring when we actually got in there and did some Chaos Engineering.
Caroline Dickey: It is amazing. You'd think that you wouldn't continue to find things but applications are incredibly complex. You just poke it at a slightly different angle and all of a sudden all these other things pop up.
Rich Burroughs: Things are changing all the time. You mentioned that you all are using Puppet. I was actually an SRE at Puppet in one of my previous jobs. Every time one of those PRs gets merged to change a configuration, you've in essence got a different system than what you had before. It's not the same anymore.
Caroline Dickey: Yeah, Mailchimp deploys continuously. We deploy around 100 times a day directly to production, often gated with feature flags. There is that level of gating, but with this type of fast-paced deployment right to production, there's a little bit of risk that you're balancing. You've got to be careful about that.
Rich Burroughs: It's interesting to me that you mention feature flags because I've been talking to folks about this lately, that to me that aspect of using feature flags is very, very similar with the way that we think about blast radius in Chaos Engineering, right? The idea is that we want to minimize the risks. Maybe you've got a new feature behind a feature flag and then you enable it for a few users, a certain segment, and then you expand it and expand it. That's the same thing that we do with blast radius. Like you talked about in your Kubernetes, your etcd example, you start off with a certain amount of latency and then you increase it and you increase it and you keep seeing what the impact is.
Caroline Dickey: Yeah, absolutely. Definitely agree.
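As a rough illustration of the parallel Rich is drawing, here is a minimal percentage-rollout sketch. The flag name, hashing scheme, and percentages are hypothetical, not how Mailchimp or any particular feature-flag product does it: each user is deterministically bucketed, and widening the percentage widens the blast radius in controlled steps.

```python
# Sketch: a percentage-based feature flag limits exposure the same way a small
# blast radius limits a chaos experiment.
import hashlib


def flag_enabled(flag_name: str, user_id: str, rollout_percent: float) -> bool:
    """Deterministically bucket a user into [0, 100) and compare to the rollout percentage."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000 / 100.0  # 0.00 .. 99.99
    return bucket < rollout_percent


if __name__ == "__main__":
    # Start with 1% of users, then widen -- much like starting a latency attack
    # at 200 milliseconds on a few hosts before expanding.
    for percent in (1, 10, 50, 100):
        enabled = sum(flag_enabled("new-editor", f"user-{i}", percent) for i in range(10_000))
        print(f"{percent:>3}% rollout -> {enabled} of 10000 users enabled")
```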
Rich Burroughs: What kind of experiments have you found to be really useful that you all do?
Caroline Dickey: I think kind of the most basic tests that we've done right from the beginning were testing just the interactions between our load balancers, our app servers and our databases. That's been interesting, just kind of trying to see at what point will the app fall over if we inject a certain amount of latency? We did a really interesting experiment where we were looking at making a change to the way our database is set up so that if the virtual IP switches from the primary to the secondary ... Right now both databases are read and write. We're looking at making that change so that only one of them would be read write and the other one would be read only. When the virtual IP switches from the primary to the secondary or back, both of them could be read only for a split second. We wanted to understand how this would affect the application if the database was read only.
Caroline Dickey: We ran a Game Day around that. This was one of my favorite discoveries: we found out that we were exposing a raw database error in certain views whenever the database was read only, because it was legacy code that didn't have error handling in place.
Rich Burroughs: Oh wow.
Caroline Dickey: That was kind of like, whoa, that's not good. That could be a security problem. That was a really interesting scenario that whenever that database change gets made, we certainly would not want that to be ever happening such that a user could see it. That was one that we definitely prioritized getting fixed.
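To make that failure mode concrete, here is a small sketch, with entirely hypothetical names and not Mailchimp's code, of the kind of guard that was missing: when a write hits a read-only database, the handler catches the driver error and returns a safe, generic message instead of leaking the raw database error into the view.

```python
# Sketch: degrade gracefully when the database is read only instead of
# surfacing the raw driver error to the user.

class ReadOnlyDatabaseError(Exception):
    """Raised when a write is attempted against a read-only database."""


class FakeDatabase:
    """Stand-in for a real connection; only exists to make the sketch runnable."""

    def __init__(self, read_only: bool = True):
        self.read_only = read_only

    def execute_write(self, statement: str) -> None:
        if self.read_only:
            # In a real driver this would be something like a "server is running
            # with the read-only option" error bubbling up from the database.
            raise ReadOnlyDatabaseError(f"cannot execute write: {statement!r}")


def save_campaign(db: FakeDatabase, name: str) -> str:
    """Handler-style function: catch the error, show a generic message, never raw SQL errors."""
    try:
        db.execute_write(f"INSERT INTO campaigns (name) VALUES ('{name}')")
        return "Campaign saved."
    except ReadOnlyDatabaseError:
        # Log the details internally; the user only sees a safe, friendly message.
        return "We couldn't save your changes right now. Please try again in a moment."


if __name__ == "__main__":
    print(save_campaign(FakeDatabase(read_only=True), "Weekly newsletter"))
```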
Rich Burroughs: Yeah, and beyond the security thing, it's getting back to that sort of user experience and the fact that they are the people who should be foremost in our minds. They don't know what that stuff on that screen probably means, unless they're an SRE or something, which some people are.
Jacob Plicque: Right. It's like, "Internal server error? I don't know what that means, but where's my mail?"
Caroline Dickey: Exactly.
Jacob Plicque: "Is my mail gone? I don't know what happened." Yeah, 100% about user experience.
Caroline Dickey: Yeah, I mean we've done things like ... Process killing is an interesting one and an easy one to pull off. I think just starting by identifying the interactions between your moving pieces is a really great place to start.
Rich Burroughs: Yeah. A lot of times we advise people ... It's sort of funny we were talking about Puppet earlier. When I was sort of that internal champion for Puppet at one of my previous jobs before I worked there, the kind of canonical examples of where you start with Puppet are things like managing ntp.conf or something like that that's going to be pretty low impact, touches the entire fleet, but it's real easy to demonstrate how you can just push out a change to all the hosts. I think that what we tend to like recommend to people more with Chaos Engineering is that you start with something that's higher impact because that's really where you're going to get the value is by looking at your most critical applications and how resilient they are.
Caroline Dickey: Yeah, definitely. That's what we did. We started with the Mailchimp application and then since then we've started working more with teams that support the Mailchimp application, things like Kafka, Kubernetes and then like our job runner team or delivery team. That's been really a great opportunity for learning because we are generally fairly familiar with the Mailchimp app. We hear about it all day long. Yeah, it's a big part of our days. Working with some of these other teams that do incredibly important work that isn't necessarily part of the Mailchimp monolith is really interesting and really eye opening because if those teams have an outage, we want to be able to help support them too, and not just only be able to support the developers that work on the Mailchimp application.
Jacob Plicque: See, that's interesting. That's something I don't think I've thought about a lot: the Mailchimp app's down, definitely SEV One, all hands on deck. This particular person's app is down, a SEV Two, but it's still impacting someone. The user may not see it right now, but that report that someone in, let's say, the C level is supposed to get every hour may be impacted. There's definitely things happening. It's interesting to be able to use Chaos Engineering to say, for lack of a better term, yeah, we care about you too. You're important to us as well, so let's see what we can do.
Caroline Dickey: Yeah, definitely. I think as a Site Reliability Engineer, we sometimes think about our customers as being internal. Of course, we certainly care about our external customers as well, but we support the developers, we support the operations engineers. If our developers aren't able to deploy, for example, that's a problem. That's something that we need to help them with. Even if the application isn't customer facing, it's definitely still something that SRE cares about. It's within our purview.
Jacob Plicque: You can't see it right now, but like my fists are like saying like, "Yes!" So, so accurate. I love it.
Rich Burroughs: Yeah. I mean everything's got a user or the app wouldn't be running.
Caroline Dickey: Exactly.
Rich Burroughs: Can you talk us through a little bit about how you all got started on the Chaos Engineering journey there at Mailchimp?
Caroline Dickey: Sure. I learned about Chaos Engineering at the conference, brought it back to my team at Mailchimp, and was encouraged to pursue it and to learn more. I did some research. I personally ended up writing kind of a formal proposal to start doing this, but I think that would just depend on how an organization communicates. For me, that was helpful just to get my thoughts together. Once we got the green light to go ahead with breaking things on purpose, as you all say, we decided to put that first Game Day on the calendar and came up with some basic scenarios. We used that first Game Day as an opportunity to evaluate Gremlin, the software. It's kind of that build versus buy problem. I think that's going to be different for every company.
Caroline Dickey: We chose not to build our own tool for a number of reasons. It wasn't the right use of time for our engineers when there's a solution out there that worked for us. From there, I think that allowed us to have the momentum to pick it up and run with it, versus trying to take a step back, build something, maybe it works, maybe it doesn't. At that point, maybe we've lost the momentum. We were able to iterate from there. We committed to running one Game Day every month. At first it was just the SRE team that attended, and then once we got the next one scheduled, it just kind of kept going from there. I always order lunch and bring in cupcakes. I set that precedent, so I've got to keep doing it. People expect the cupcakes now, so hey, whatever gets them in the room.
Rich Burroughs: I would come to a meeting about many things if there were cupcakes involved.
Caroline Dickey: Exactly.
Jacob Plicque: It's just amazing to think about the cupcakes being a single point of failure.
Rich Burroughs: Oh no.
Caroline Dickey: Oh man.
Jacob Plicque: We need some redundancy for those cupcakes. Can we have some cookies, perhaps?
Rich Burroughs: Yeah. I mean the build versus buy thing is interesting to me. Of course we're probably a little biased working for a vendor that actually sells Chaos Engineering software, but even when I was an SRE before I had like been doing Chaos Engineering, I tend to come down on that buy side a lot of the time because I'd rather have the people who are writing code be working on actually improving the things that are critical to our business rather than be reinventing some sort of infrastructure tool that other people have already written.
Caroline Dickey: Absolutely. Yeah, definitely. We're a medium sized company. We have about 350 engineers, about 1000 employees. We have one SRE team, about 12 engineers. It just didn't scale. We don't need to have our engineers pulled away from what they're doing on a day to day basis to build a tool that already exists. Like you said, we want our engineers to be building tools that support our business rather than building something that is kind of supporting that.
Rich Burroughs: Yeah, no hate to the people who build cool Chaos Engineering tools.
Caroline Dickey: You know what? We're happy to have it.
Rich Burroughs: Caroline, do you have any other advice that you would give to people who would want to get a Chaos Engineering practice going?
Caroline Dickey: I think one of the biggest things I would say is make sure you get your buy in. I would encourage folks not to try to do Chaos Engineering alone. You're going to lose some value there. We've had people at Mailchimp interested in doing Chaos Engineering like for their little specific tool. I think that you're kind of losing that aspect of knowledge sharing if you do that. I would encourage anybody interested in Chaos Engineering to make it a priority to try to share it with others within your organization and not just limit it to the one tool that you're trying to test. This is one of those things that it's going to depend on the company. I was fortunate enough to be able to kind of have that autonomy to move forward with pursuing this. It might be easier at a smaller company than a larger company to do this, but I think that regardless, it's worth putting in that initial effort to just get that buy in before you move forward.
Caroline Dickey: We were also fortunate that we didn't have anybody who just rejected the idea of Chaos Engineering, but we did hear a lot of, "We aren't ready, and we already know our system has chaos," that kind of thing. I think that's probably true. Our system certainly isn't perfect. I don't think any system is.
Rich Burroughs: What was your response to that?
Caroline Dickey: I mean, I think I was fortunate enough that they weren't saying no. Being able to do these experiments and then show them, hey, here's this issue that we've fixed, being able to show people that this is adding value versus just saying, "Oh, this will be great. Think how great it could be," makes a difference. If you're not able to go forward with it, if you have people really pushing back, there certainly are a lot of resources out there. I think you start by looking at the incidents you've had, looking at the type of things that page people, and seeing if you can identify areas where Chaos Engineering could mitigate and reduce that. People really like to see results. The more tangible the numbers and results you can give them, the better off you're going to be.
Rich Burroughs: Yeah. When you're making that case to management. On the build versus buy thing, you've got to do that either way, right? It's either going to be headcount or it's going to be software licenses or something, right? There's resources that you need to be able to do this work. I do think that like the more tangible you can make it, like bringing in actual metrics, is super helpful.
Caroline Dickey: Definitely.
Jacob Plicque: Especially since you're tending to ... For the most part, folks don't have Chaos Engineering budgeted for their fiscal year. You're essentially, in some cases, kind of taking it from something else. Then you're saying, "Hey, but this is the value that we're going to find with it," and running with that too. I think that's spot on.
Caroline Dickey: I think that's true. I also think that during a really bad incident, there is no budget. You will throw money at the problem until the problem is gone and then you'll deal with budget later. If you think about how much an outage costs, just on kind of every level on what it's costing your customers, what it's costing in man hours, what it's costing you as a company, like in just reputation, you could typically justify spending a little bit of time trying to reduce that.
Rich Burroughs: Yeah, I remember seeing GitHub's excellent postmortem for the big outage they had a few months ago. One of their line items was that they're going to start doing some Chaos Engineering. I think that's a great thing to tell your users, especially when they're technical people and understand what that means.
Jacob Plicque: Yeah. It's also an interesting recruiting tool as well, both from a customer perspective as well as a, whether it's an SRE or a developer saying, "Hey, this is what we do to make sure that we're resilient and come help us do that," is really awesome too.
Caroline Dickey: Yeah, definitely. I think another aspect of incidents is that they're bad for morale. You feel bad when you see that you're not providing the customers the experience that they're wanting. You have too many of those and you're just kind of feeling bad all the time. You want to work for a company that prioritizes resiliency and reliability. I think that's why Site Reliability is becoming more and more of a widespread discipline. It's important.
Jacob Plicque: Yeah. I mean like you feel like a superhero the first couple of times and then 4:00 AM is still 4:00 AM. It becomes a little less fun and a little more painful. Yeah, 100% with you on that one.
Rich Burroughs: All right. Well I think that's all the time that we have. Caroline, I really want to thank you for coming on to talk with us. This was super enjoyable.
Jacob Plicque: Absolutely.
Rich Burroughs: Is there anything that you want to plug, like places that people can find you on the Internet or anything like that?
Caroline Dickey: I do have a Twitter handle. I believe it's CarolineEDickey. I am not very active, so you can take that as you will. Then you can always email me, caroline at mailchimp dot com.
Rich Burroughs: That's great. All right. Thanks so much, Caroline.
Caroline Dickey: Thank you so much for having me.
Rich Burroughs: Our music is from Komiku. The song is titled Battle of Pogs. For more of Komiku's music, visit loyaltyfreakmusic.com or click the link in the show notes. For more information about our Chaos Engineering community, visit Gremlin.com/Community. Thanks for listening and join us next month for another episode.