Time for a crossover! Today, Page it to the Limit host Mandi Walls, DevOps Advocate at PagerDuty, joins Julie for a special episode. In this two-part episode, Julie and Mandi interview Kolton Andrus, co-founder of Gremlin, and Alex Solomon, co-founder of PagerDuty. Each of them shares the origins of their respective companies, how they build amazing cultures, and some fun anecdotes along the way. Kolton and Alex reflect on how they identified the space where they could build their respective companies and on the shift from larger entities to startups. Each of them offers up some excellent insight! Part 2: Break it 2 the Limit will premiere on the Page it to the Limit podcast March 15, 2022.

Episode Highlights

In this episode, we cover:

  • 00:00:00 - Intro
  • 00:01:56 - How Alex and Kolton know each other and the beginnings of their companies
  • 00:10:10 - The change of mindset from Amazon to the smaller scale
  • 00:17:34 - Alex and Kolton’s advice for companies that “can’t be a Netflix or Amazon”
  • 00:22:57 - PagerDuty, Gremlin and Crossovers/Outro

Transcript

Kolton: I was speaking about what I built at Netflix at a conference and I ran into some VCs in the lobby, and we got into a bit of a debate. They were like, “Hey, have you thought about building a company around this?” And I was like, “I have, but I don’t want your money. I’m going to bootstrap it. We’re going to figure it out on our own.” And the debate went back and forth a little bit and ultimately it ended with, “Oh, you have five kids and you live in California? Maybe you should take some money.”

Julie: Welcome to the Break Things on Purpose podcast, a show about chaos, culture, building and breaking things with intention. I’m Julie Gunderson and in this episode, we have Alex Solomon, co-founder of PagerDuty, and Kolton Andrus, co-founder of Gremlin, chatting about everything from founding companies to how to change culture in organizations.

Julie: Hey everybody. Today we’re going to talk about building awesome things with two amazing company co-founders. I’m really excited to be here with Mandi Walls on this crossover episode for Break Things on Purpose and Page it to the Limit. I am Julie Gunderson, Senior Reliability Advocate here over at Gremlin. Mandi?

Mandi: Yeah, I’m Mandi Walls, DevOps Advocate at PagerDuty.

Julie: Excellent. And today we’re going to be talking about everything from reliability, incident management, to building a better internet. Really excited to talk about that. We’re joined by Kolton Andrus, co-founder of Gremlin, and Alex Solomon, co-founder of PagerDuty. So, to get us started, Kolton and Alex, you two have known each other for a little while. Can you kick us off with maybe how you know each other?

Alex: Sure. And thanks for having us on the podcast. So, I think if I remember correctly, I’ve known you, Kolton, since your days in Netflix while PagerDuty was a young startup, maybe less than 20 people. Is that right?

Kolton: Just a touch before I joined Netflix. It was actually that Velocity Conference; we hung out at that suite, I think that was 2013.

Alex: Yeah, sounds right. That sounds right. And yeah, it’s been how many years? Eight, nine years since? Yeah.

Kolton: Yeah. Alex is being humble. He’s let me bother him for advice a few times along the journey. And we talked about what it was like to start companies. You know, he was in the startup world; I was still in the corporate world when we met back at that suite.

I was debating starting Gremlin at that time, and actually, I went to Netflix and did a couple more years because I didn’t feel I was quite ready. But again, it’s been great that Alex has been willing to give some of his time and help a fellow startup founder with some advice and help along the journey. And so I’ve been fortunate to be able to call on him a few times over the years.

Alex: Yeah, yeah. For sure, for sure. I’m always happy to help.

Julie: That’s great that you have your circle of friends that can help you. And also, you know, Kolton, it sounds like you did your tour of duty at Netflix; Alex, you did a tour of duty at Amazon; you, too, Kolton. What are some of the things that you learned?

Alex: Yeah, good question. For me, when I joined Amazon, it was a stint of almost three years from ’05 to ’08, and I would say I learned a ton. Amazon, it was my first job out of school, and Amazon was truly one of the pioneers of DevOps. They had moved to an environment where their architecture was oriented around services, service-oriented architecture, and they were one of the pioneers of doing that, and moving from a monolith, breaking up a monolith into services. And with that, they also changed the way teams organized, generally oriented around full service-ownership, which is, as an engineer, you own one or more services (your team, rather, owns one or more services), and you’re not just writing code, but you’re also testing it yourself. There’s no, like, QA team to throw it to. You are doing deploys to production, and when something breaks, you’re also in charge of maintaining the services in production.

And yeah, if something breaks, back then we used pagers, and the pager would go off, you’d get paged, then you’d have to get on it quickly and fix the problem. If you didn’t, it would escalate to your boss. So, I learned that was kind of the new way of working. I guess, in my inexperience, I took it for granted a little bit, in retrospect. It made me a better engineer because it evolved me into a better systems thinker. I wasn’t just thinking about code and how to build a feature, but I was also thinking about, like, how does that system need to work and perform and scale in production, and how does it deal with failures in production?

And my time at Amazon also served as inspiration for PagerDuty because, in starting a startup, the way we thought about the idea of PagerDuty was by thinking back to our time at Amazon (myself and my other two co-founders, Andrew and Baskar), and we thought about what are useful tools or internal tools that existed at Amazon that we wished existed in the broader world? And we thought about, you know, an internal tool that Amazon developed, which was called the ‘Pager Duty Tool’ because it organized the on-call scheduling and paging and it was attached to the incident, to the ticketing system. So, if there was a SEV 1 or SEV 2 ticket, it would actually page either one team, or lots of teams if it was a major incident that impacted revenue and customers and all that good stuff. So yeah, that’s where we got the inspiration for PagerDuty, by carrying the pager and seeing that tool exist within Amazon and realizing, hey, Amazon built this, Google has their own version, Facebook has their own version. It seems like there’s a need here. That’s kind of where that initial germ of an idea came from.

Kolton: So much overlap. So much similarity. I came, you know, a couple of years behind you. I was at Amazon 2009 to 2013. And I’d had the opportunity to work for a couple of startups out of college and while I was finishing my education, I’d tasted startup world a little bit.

My funny story I tell there is I turned down my first offer from Amazon to go work for a small startup that I thought was going to be a better deal. Turns out, I was bad at math, and a couple of years later, I went back to Amazon and said, “Hey, would you still like me?” And I ended up on the availability team, and so very much in the heart of what Alex is describing. It was a ‘you build it, you own it, you operate it’ environment. Teams were on call, they got paged, and the rationale was, if you felt the pain of that, then you were going to be motivated to go fix it and ensure that you weren’t feeling that pain.

And so really, again, and I agree, somewhat taken for granted that we really learned best-in-class DevOps and system thinking and distributed system principles, by just virtue of being immersed into it and having to solve the problems that we had to solve at Amazon. We also share a similar story in that there was a tool for paging within Amazon that served as a bit of an inspiration for PagerDuty. Similarly, we built a tool—may or may not have been named Gremlin—within Amazon that helped us to go do this exact type of testing. And it was one part tooling and it was one part evangelism. It was a controversial idea, even at Amazon.

Some teams latched on to it quickly, some teams needed some convincing, but we had that opportunity to go work with those teams and really go develop this concept. It was cool because, while a lot of folks are familiar with Netflix and Chaos Monkey, this was a couple of years before Chaos Monkey came out. And we went and built something similar to what we built at Gremlin: an API, a front end, a variety of failure modes, to really go help solve a wider breadth of problems. I got to then move into performance, and so I worked on making the website fast, making sure that we were optimizing things. Moved into management.

That was a very useful life experience. It wasn’t the most enjoyable year of my life, but I learned a lot, got a lot done. And then the next summer, as I was thinking about what was next, I bumped into Alex. I was really starting to think about founding a company, and there was a big question: Was what we built at Amazon going to be applicable to everyone? Was it going to be useful for everyone? Were they ready for it?

And at the time, I really wasn’t sure. And so I decided to go to Netflix. And that was right after Chaos Monkey had come out, and I thought, “Well, let’s go see—let’s go learn a bit more before we’re ready to take this to market.” And because of that time at Amazon—or at Netflix, I got to see, they had a great start. They had a great culture, people were bought into it, but there was still some room for development on the tooling and on the approach.

And I found myself again, half in the developer mindset, half in the advocacy mindset, where I needed to go and improve the tooling to make it safer and more scalable, and needed to go out and convince folks or help them do it well. But seeing it work at Amazon, that was great. That was a great learning experience. Seeing it work at Amazon and Netflix, to me, said, “Okay, this is something that everyone’s going to need at some point, and so let’s go out and take a stab at it.”

Alex: That’s interesting. I didn’t realize that it came from Amazon. I always thought Chaos Engineering as a concept came from Netflix because that’s where everyone’s—I mean, maybe I’m not the only one, but that’s—that was my impression, so that’s interesting.

Kolton: Well, as you know, Amazon, at times, likes to keep things close to the vest, and if you’re not a principal engineer, you’re not really authorized to go talk about what you’ve done. And that actually led to where my opportunity to start a company came from. I was speaking about what I built at Netflix at a conference and I ran into some VCs in the lobby, and we got into a bit of a debate. They were like, “Hey, have you thought about building a company around this?” And I was like, “I have, but I don’t want your money. I’m going to bootstrap it. We’re going to figure it out on our own.” And the debate went back and forth a little bit and ultimately it ended with, “Oh, you have five kids and you live in California? Maybe you should take some money.”

Mandi: So, what ends up being different? Full disclosure: I’ve never worked for Amazon. I went from AOL to Chef, and now I’m at PagerDuty. But I know what that environment was like, and I remember the early days; PagerDuty got started around the same time as Fastly and Chef and that sort of generation of startups. With all this stuff that emerged from Amazon, is there a change of mindset when you’re talking to developers and engineers who don’t work for Amazon? Looking into Amazon from the outside, you kind of feel like there’s a lot more buy-in for those kinds of tools and that kind of participation, like we said before, the full service-ownership and all of those attitudes and all the cultural pieces that come along with it. So when you’re taking these practices commercial outside of Amazon, what changes? Like, is there a different messaging? Is there a different sort of relationship you have with the developers that work somewhere else?

Alex: I have some thoughts, and they may not be cohesive, but I’m going to go ahead anyway. Well, one thing that was very interesting from Amazon is that by being a pioneer and being at a scale that’s very significant compared to other companies, they had to invent a lot of the tooling themselves because back in the mid-2000s, and beyond, there was no Datadog. There was no AWS; they invented AWS. There weren’t any of these tools, Kubernetes, and so on, that we take for granted around containers, and even virtual servers were a new thing. And Amazon was actually, I think, one of the pioneers of adopting that through open-source rather than through, like, a commercial vendor like VMware, which drove the adoption of virtual everything.

So, that’s one observation is they built their own monitoring, they built their own paging systems. They did not build their own ticketing system, but they might as well have because they took Remedy and customized it so much that it’s almost like building your own. And deployment tools, a lot of this tooling, and I’m sure Kolton, having worked on these teams, would know more about the tooling than I did as just an engineer who was using the tooling. But they had to build and invent their own tools. And I think through that process, they ended up culturally adopting a ‘not invented here’ mindset as well, where they’re, generally speaking, not super friendly towards using a vendor versus doing it themselves.

And I think that may make sense and made a lot of sense because they were at such a scale where there was no vendor that was going to meet their needs. But maybe that doesn’t make as much sense anymore, so that’s maybe a good question for debate. I don’t know, Kolton, if you have any thoughts as well.

Kolton: Yeah, a lot of agreement. I think what was needed, we needed to build those things at Amazon because they embraced that distributed systems, the service-oriented architectures early on, that is a new class of problem. I think in a world where you’re not dealing with the complexity of distributed systems, Chaos Engineering just looks like testing. And that’s fine. If you’re in a monolith and it’s more straightforward, great.

But when you have hundreds of things with all the interconnections and the combinatorial explosion you have with that, the old approach no longer works and you have to find something new. It’s funny you mentioned the tooling. I miss Amazon’s monitoring tooling, it was really good. I miss the first iteration of their pipelines, their CI/CD tooling. It was a great iteration.

And I think that’s really—you get to see that need, and that evolution, that iteration, and a bit of a head start. You asked a bit about what is it like taking that to market? I think one of the things that surprised me a little bit, or I had to learn, is different companies are at different points in their journey, and when you’ve worked at Amazon and Netflix, and you think everybody is further along than they are, at times, it can be a little frustrating, or you have to step back and think about how do you catch somebody up? How do you educate them? How do you get them to the point where they can take advantage of it?

And so that’s, you know, that’s really been the learning for me: we know aspirationally where we want to go. And again, it’s not that Amazon’s perfect; it’s not that Netflix is perfect. People that I talk to tend to deify Netflix engineering, and I think they’ve earned a lot of respect, but the sausage is made the same, fundamentally, at every company. And it can be messy at times, and things don’t always go well, but that opportunity to look at what has gone well, what it should look like, what it could look like really helps you understand what you’re striving for with your customers or with the market as a whole.

Alex: I totally agree with that because those are big learnings for me as well. Like, when you come out of an Amazon, you think that maybe a lot of companies are like Amazon, in that they’re… more like I mentioned: Amazon was a pioneer of service-oriented architecture; a pioneer of DevOps; and you build it, you own it; pioneer of adopting virtual servers and virtual hosting. And you, maybe, generalize and think, you know, other companies are there as well, and that’s not true. There’s a wide variety of maturities and these trends, these big trends like Cloud, like AWS, like virtualization, like containerization, they take ten years to fully mature from the starting point, with the usual adopter curve of very early adopters all the way to, kind of, the big part of the curve.

And by virtue of starting PagerDuty in 2009, we were on the early side of the DevOps wave. And I would say, very fortunate to be in the right place at the right time, riding that wave and riding that trend. And we worked with a lot of customers who wanted to modernize, but the biggest challenge there is, perhaps it’s the people and process problem. If you’re already an established company, and you’ve been around for a while you do things a certain way, and change is hard. And you have to get folks to change and adapt and change their jobs, and change from being a, “sysadmin,” quote-unquote, to an SRE, and learn how to code and use that in your job.

So, that change takes a long time, and companies have taken a long time to do it. And the newer companies and startups will get there from day one because they just adopt the newest thing, the latest and greatest, but the big companies take a while.

Kolton: Yeah, it’s both that thing—people can catch up quicker. It’s not that the gap is as large, and when you get to start fresh, you get to pick up a lot of those principles and be further along, but I want to echo the people, the culture, getting folks to change how they’re doing things, that’s something, especially in our world, where we’re asking folks to think about distributed system testing and cross-team collaboration in a different way, and part of that is a mental journey, just helping folks get over the idea—we have to deal with some misconceptions, folks think chaos has to be random, they think it has to be done in production. That’s not the case. There’s ways to do it in dev and staging, there’s ways to do it that aren’t random that are much safer and more deterministic.

But helping folks get over those misconceptions, helping folks understand how to do it and how to do it well, and then how to measure the outcomes. That’s another thing I think we have that’s a bit tougher in our SRE ops world is oftentimes when we do a great job, it’s the absence of something as opposed to an outcome that we can clearly see. And you have to do more work when you’re proving the absence of something than the converse.

Julie: You know, I think it’s interesting, having worked with both of you when I was at PagerDuty and now at Gremlin, there’s a theme. And so we’ve talked a lot about Amazon and Netflix; one of the things, distinctly, with customers at both companies, is I’ve heard, “But we’re not Amazon and we’re not Netflix.” And that can be a barrier for some companies, especially when we talk about this change, and especially when we talk about very rigid organizations, such as, maybe, FinServ, government, those types of organizations, where they’re more resistant to that, and they say, “Don’t say Amazon. Don’t say Netflix. We’re not those companies. We can’t operate like them.”

I mean, Mandi and I, we were on a call with a customer at one point that said we couldn’t use the term DevOps, we had to call it something different because DevOps just meant too forward-thinking, even though we were talking about the same concepts. So, I guess what I would like to hear from both of you, is what advice would you give to those organizations that say, “Oh, no. We can’t be Netflix and we can’t be Amazon?” Because I think that’s just a fear of change conversation. But I’m curious what your thoughts are.

Alex: Yeah. And I can see why folks are allergic to that because you look at these companies, and they’re, in a lot of ways, so far ahead that you don’t, you know—and if you’re a lower level of maturity, for lack of a better word, you can’t see a path in your head of how do you get from where you are today to becoming more like a Netflix or an Amazon because it’s so different. And it requires a lot of thinking differently. So, I think what I would encourage, and I think this is what you all do really well in terms of advocacy, but what I’d encourage is, like, education and thinking about, like, what’s a small step that you can take today to improve things and to improve your maturity? What’s an on-ramp?

And there’s, you know, lots of ideas there. Like, for example, if we’re talking about modern incident management, if we’re talking Chaos Engineering, if we’re talking about public cloud adoption and any of these trends, DevOps, SRE, et cetera, maybe think about how do you—do you have a new greenfield project, a brand new system that you’re spinning up, how do you do that in a modern way while leaving your existing systems alone to start? Then you learn how to do it and how to operate it and how to build a new service, a new microservice using these new technologies, you build that muscle. You maybe hire some folks who have done it before; that’s always a good way to do it. But start with something greenfield, start small, you don’t have to boil the ocean, you don’t have to do everything at once. And that’s really important.

And then create a plan of taking other systems and migrating them. And maybe some systems don’t make sense to migrate at all because they’re just legacy. You don’t want to put any more investment in them. You just want to run them, they work, leave them alone. And yeah, think about a plan like that. And there’s lots of—now, there’s lots of advice and lots of organizations that are ready and willing to help folks think through these plans and think through this modernization journey.

Kolton: Yeah, I agree with that. It’s daunting to folks that there’s a lot, it’s a big problem to solve. And so, you know, it’d be great if it’s you do X, you get Y, you’re done, but that’s not really the world we live in. And so I agree with that wisdom: Start small. Find the place that you can make an impact, show what it looks like for it to be successful.

One thing I’ve found is when you want to drive bottoms-up consensus, people really want to see the proof, they want to see the outcome. And so that opportunity to sit down with a team that is already on the cutting edge, that is feeling the pain, and helping them find success, whether that’s SRE, DevOps, whether it’s Chaos Engineering, helping them see it, see the outcome, see the value, and then letting them tell their organization. We all hear from other folks what we should be doing, and there’s a lot of that information, there’s a lot of that context, and some of it’s noise, and so how we cut through that into what’s useful becomes part of it. This one to me is funny because we hear a lot, “Hey, we have enough chaos already. We don’t need any more chaos.”

And I get it. It’s funny, but it’s my least favorite joke because, number one, if you have a lot of chaos, then actually you need this today. It’s about removing the chaos, not about adding chaos. The other part of it is it speaks to we need to get better before we’re ready to embrace this. And as somebody that works out regularly, a gym analogy comes to mind.

It’s kind of like your New Year’s, it’s your New Year’s resolution and you say, “Hey, I’m going to lose ten pounds before I start going to the gym.” Well, it’s a little bit backwards. If you want to get the outcome, you have to put in a bit of the work. And actually, the best way to learn how to do it is by doing it, by going out getting a little bit of—you know, you can get help, you can get guidance. That’s why we have companies, we’re here to help people and teach them what we’ve learned, but going out doing a bit of it will help you learn how you can do it better, and better understand your own systems.

Alex: Yeah, I like the workout analogy a lot. I think it’s hard to get started, it’s painful at first. That’s why I like the analogy [laugh]—

Kolton: [laugh].

Alex: —a lot. But it’s a muscle that you need to keep practicing, and it’s easy to lose, you stopped doing it, it’s gone. And it’s hard to get back again. So yeah, I like that analogy a lot.

Julie: Well, I like that, too, because that’s something that we talked a lot about for being on call, and understanding how to handle incidents, and building that muscle memory, right, practice. And so there’s a lot of crossover—just like this episode, folks—between both Gremlin and PagerDuty as to how they help organizations be better. And again, going back to building a better internet. I mean, Alex your shirt—which our viewers—or our listeners—can’t see, says, “The world is always on. Let’s keep it this way,” and Kolton, you talk about reliability being no accident.

And so when we talk about the foundations of both of these organizations, it’s about helping engineers be better and make better products. And I’m really excited to learn a little bit more about where you think the future of that can go.

For the second part of this episode, check out the PagerDuty podcast at Page it to the Limit. For links to the Page it to the Limit podcast and to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or wherever you listen to your favorite podcasts.

Jason: Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.
