Podcast: Break Things on Purpose | Tomas Fedor, Head of Infrastructure at Productboard
Tomas Fedor, Head of Infrastructure at Productboard, is here to talk about his personal passions and professional pursuits. Tomas takes us through some of the biggest adaptations he had to make when adopting the cloud. He also tackles the complexities of working through his POC process and how to keep consistency across various teams. Teams are a central focus for Tomas as well, and his techniques and experiences in growing and leading technical teams are insightful. Tune in for his meticulous processes, team-building insight, tips on communicating with non-technical members of your organization, and more!
Show Notes
In this episode, we cover:
- 00:00:00 - Introduction
- 00:02:45 - Adopting the Cloud
- 00:08:15 - POC Process
- 00:12:40 - Infrastructure Team Building
- 00:17:45 - “Disaster Roleplay”/Communicating to the Non-Technical Side
- 00:20:20 - Leadership
- 00:22:45 - Tomas’ Horror Story/Dashboard Organization
- 00:29:20 - Outro
Links:
- Productboard: https://www.productboard.com
- Scaling Teams: https://www.amazon.com/Scaling-Teams-Strategies-Successful-Organizations/dp/149195227X
- Seeking SRE: https://www.amazon.com/Seeking-SRE-Conversations-Running-Production/dp/1491978864/
Transcript
Jason: Welcome to Break Things on Purpose, a podcast about failure and reliability. In this episode, we chat with Tomas Fedor, Head of Infrastructure at Productboard. He shares his approach to testing and implementing new technologies, and his experiences in leading and growing technical teams.
Today, we’ve got with us Tomas Fedor, who’s joining us all the way from the Czech Republic. Tomas, why don’t you say hello and introduce yourself?
Tomas: Hello, everyone. Nice to meet you all. My name is Tomas, or call me Tom. I’ve been working at Productboard for the past two and a half years as an infrastructure leader. Throughout my career, my experience has been in the area of DevOps, and the last three or four years have been about management within infrastructure teams. Technology-wise, I’m passionate about the cloud, mostly Amazon Web Services, Kubernetes, and Infrastructure as Code such as Terraform, and recently I’ve also jumped toward security compliance, such as SOC 2 Type 2.
Jason: Interesting. So, a lot of passions there, things that we actually love chatting about on the podcast. We’ve had other guests from HashiCorp, so we’ve talked plenty about Terraform. And we’ve talked about Kubernetes with some folks who are involved with the CNCF. I’m curious, with your experience, how did you first dive into these cloud-native technologies and adopting the cloud? Is that something you went straight for, or is that something you transitioned into?
Tomas: I actually transitioned to cloud technologies slowly, because my career started at university, where I was, say, half developer and half Unix administrator. And I had experience building a very small data center. So, those times were amazing for understanding all the hardware aspects of how it gets built. And then later on, I got the opportunity to join a very famous startup in the Czech Republic [unintelligible 00:02:34] called Kiwi.com [unintelligible 00:02:35]. And that’s when I first experienced cloud technologies such as Amazon Web Services.
Jason: So, as you adopted Amazon, coming from that background of a university and having physical servers that you had to deal with, what was your biggest surprise in adopting the cloud? Maybe something that you didn’t expect?
Tomas: That’s a great question, and what comes to my mind first is switching to a completely different [unintelligible 00:03:05], because during my university studies and my career there, I mostly focused on networking [unintelligible 00:03:13]. But later on, you start thinking not about how to build a service, but about which service you need to use for your use case. And you don’t have just one service for one use case; you have plenty of services that can suit your needs, and you need to choose wisely. So, that was very interesting, and it took me some time to adapt to the new thinking, the new mindset, et cetera.
Jason: That’s an excellent point. And I feel like it’s only gotten worse with the, “How do you choose?” If I were to ask you to set up a web service and it needs some sort of data store, at this point you’ve got, what, a half dozen or more options on Amazon? [laugh].
Tomas: Exactly.
Jason: So, with so many services on providers like Amazon, how do you go about choosing?
Tomas: After a while, we came up with something like RFCs. That’s ‘Request for Comments,’ where we try to sum up all the goals, all the principles, and all the problems and challenges we’re trying to tackle. And with that, we also try to evaluate all the alternatives. Once you’ve gone through all this information, you try to sum up the possible solutions. You typically end up with one or two options, those options are validated with all your team members or the whole engineering organization, and you make the decision. Then you run a POC, and either it confirms that, yes, this is the technology or service you need and you’re going to implement it, or you revise your proposal.
Jason: I really like that process of starting with the RFC and defining your requirements and really getting those set so that as you’re evaluating, you have these really stable ideas of what you need and so you don’t get swayed by all of the hype around a certain technology. I’m curious, who is usually involved in the RFC process? Is it a select group in the engineering org? Is it broader? How do you get the perspectives that you need?
Tomas: I feel we have a great, well-established RFC process at Productboard. It’s transparent to the whole organization, which is what I love the most. In the first week, there are one or two reporters who are mainly focused on writing and summing up the whole proposal: writing down the goals and also the non-goals, because that is going to define your focus and also the focus of the reader. And then you describe alternatives and possible options, or maybe sum it up as, “Hey, okay, I’m still unsure about this specific decision, but I feel this is the right direction.” Maybe I have someone else in the organization who is already familiar with the technology or with my use case, and that person can help me.
So, we call it a draft state, and once you feel confident, you change the status of the RFC to open. Then it’s open to feedback from everyone, typically for two or three weeks, so everyone can give feedback. You also have the option to present it at the engineering all-hands, so every engineer, and everyone else joining the all-hands, is aware of the RFC and you can receive a lot of feedback. What else is important to mention is that you can iterate over RFCs.
So, you mark it as resolved after two or three weeks, but then maybe you come up with a new proposal, or you’d like to update it slightly with an important change. You can reopen it and add a new version there. That gives you space to update your RFC, improve the proposal, or completely change the context so it’s still up to date with what you want to resolve.
Jason: I like that idea of presenting at engineering all-hands because, at least in my experience, being at a startup, you’re often super busy so you may know that the RFC is available, but you may not have time to actually read through it, spend the time to comment, so having that presentation where it’s nicely summarized for you is always nice. Moving from that to the POC, when you’ve selected a few and you want to try them out, tell me more about that POC process. What does that look like?
Tomas: So, typically in my infrastructure team it’s slightly different, I believe, than when product teams or platform teams run POCs. In the case of the infrastructure team, we want to understand what the POC is actually going to be about, because typically the infrastructure team has plenty of services it’s responsible for maintaining, so we try to first choose one specific, small use case that’s going to suit the need.
For instance, I can share our adoption of HashiCorp Vault. At first, we leveraged only the key-value engine for storing secrets. And what was important to understand here was whether we wanted to spend hours building the whole cluster, or whether we could leverage their cloud service and try to integrate it with one of our services. And we needed to understand which of our services we were going to adopt Vault with.
So, we picked the cloud solution. It was very simple, the experience was seamless for us, and we understood what we needed to validate: is a developer able to connect to Vault? Is an application able to connect to Vault? What roles does it offer? What’s the difference between the cloud and on-premise solutions?
And in the end, it’s often about cost. So, in that case, for the POC, we spun up just the cloud service integrated with our system, chose the easiest service to adapt, ran the POC, validated it with developers, and provided all the feedback, all the data, to the rest of engineering. So, that was for us a small POC with a large service at the end.
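For illustration, the connectivity checks Tomas describes here (can a developer or an application write and read a secret?) might look something like the following sketch, assuming the hvac Python client for Vault’s KV v2 engine; the address, token, and secret path are placeholders, not Productboard’s actual setup.

```python
# Minimal sketch: confirm an application can write and read a secret in
# Vault's KV v2 engine. Address, token, and path are illustrative only.
import os
import hvac

client = hvac.Client(
    url=os.environ.get("VAULT_ADDR", "https://vault.example.com:8200"),
    token=os.environ["VAULT_TOKEN"],
)
assert client.is_authenticated(), "Vault token is invalid or expired"

# Write a test secret (KV v2, mounted at the default 'secret/' path).
client.secrets.kv.v2.create_or_update_secret(
    path="poc/demo-service",
    secret={"database_password": "not-a-real-password"},
)

# Read it back and confirm the round trip works.
read = client.secrets.kv.v2.read_secret_version(path="poc/demo-service")
assert read["data"]["data"]["database_password"] == "not-a-real-password"
print("KV round trip succeeded")
```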
Jason: Along with validating that it does what you want it to do, do you ever include reliability testing in that POC?
Tomas: We do, but it’s at a later stage, let’s say. For example, I can again mention HashiCorp Vault. Once we made the decision to spin up our first on-premise cluster, we started thinking: how many master nodes do we need to have? How many availability zones do we need to have? How are we going to maintain quorum?
And we were thinking, “Okay, so what’s actually the reliability of Amazon Web Services regions and their availability zones? What’s the reliability of going multi-region? What are the realistic expectations of an outage happening, how often does it happen, and when has it happened in the past?”
So, all those aspects were considered, and we arrived at the decision: okay, we’re still happy with one region, because AWS is pretty stable and I believe it’s going to stay that way. And we’re now successfully running with three availability zones. But before we jumped to the conclusion of having three availability zones, we ran several tests to make sure that with one availability zone down, we’re still fully able to run the HashiCorp Vault cluster without any issues.
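The availability-zone test Tomas mentions could be sketched roughly as below; the node URLs are hypothetical, and in practice you would stop the node (or its availability zone) with your chaos tooling before running it. It polls Vault’s /v1/sys/health endpoint, which by default returns HTTP 200 from the active node and 429 from standbys.

```python
# Sketch: after taking one availability zone offline, confirm the remaining
# Vault nodes still form a healthy cluster with exactly one active node.
# Node URLs are hypothetical placeholders for a three-AZ deployment.
import requests

NODES = {
    "eu-central-1a": "https://vault-a.internal.example.com:8200",
    "eu-central-1b": "https://vault-b.internal.example.com:8200",
    "eu-central-1c": "https://vault-c.internal.example.com:8200",
}

def node_status(url: str) -> str:
    try:
        resp = requests.get(f"{url}/v1/sys/health", timeout=3)
    except requests.RequestException:
        return "unreachable"
    return {200: "active", 429: "standby", 503: "sealed"}.get(
        resp.status_code, f"http {resp.status_code}"
    )

statuses = {az: node_status(url) for az, url in NODES.items()}
print(statuses)

active = [az for az, s in statuses.items() if s == "active"]
reachable = [az for az, s in statuses.items() if s in ("active", "standby")]
assert len(active) == 1, "expected exactly one active Vault node"
assert len(reachable) >= 2, "quorum at risk: fewer than two reachable nodes"
```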
Jason: That’s such an important test, especially with something like HashiCorp Vault because not being able to log into things because you don’t have credentials or keys is definitely problematic.
Tomas: Fully agree.
Jason: You’ve adopted that during the POC process, or the extended POC process; do you continue that on with your regular infrastructure work continuing to test for reliability, or maybe any chaos engineering?
Tomas: I can actually mention something we’re working on, what we’ve improved so far in terms of the post-mortem process, which is interesting. So, we started two and a half years ago with just two of us as infrastructure engineers. At that time, there was only one incident response on-call team. Our first initiative within the infrastructure team was the migration from Heroku, where we ran all our services, to Amazon Web Services. And at that point, we also needed to start thinking about the infrastructure team being on call as well. That required updating the process, because until then it had worked great: you have one team, people know each other, people know the whole stack. Suddenly, you’re adding new people, a separate team, and that’s going to change the way on-call should be treated and what the process should look like.
You may ask why. Within one team you have a shared understanding, you understand the expectations, but then suddenly you have people with a different skill set who are going to be responsible for a different part of the technical organization, so you need to align expectations between the two teams. And that was great, because the people at Productboard are amazing and always helpful. So, we sat down, we made a first proposal of how the new team was going to work and what its responsibilities would be. We took inspiration from the already-existing on-call process and just updated it slightly.
And we started by running our first test scenarios of being on call so we understood the process fully. Later on, it evolved into a more complex process, but it’s still very simple. What is more complex: first, we have more teams being on call; we have better separation of all the alerts, so you’re not routing every alert to one team, but you’re able to route each alert to the team that’s responsible for its service; the teams have also prepared a set of runbooks so anyone can easily follow a runbook and fix the incident; and then we also added a section about post-mortems, covering our expectations for writing down a post-mortem once an incident is resolved.
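As a conceptual sketch of the per-service alert routing and runbooks Tomas describes, the mapping might look something like this; the service names, team names, and runbook URLs are made up, and in practice this would live in your alerting tool’s routing configuration rather than in application code.

```python
# Conceptual sketch: route each alert to the team that owns the service
# and attach that team's runbook. Names and URLs are hypothetical.
from dataclasses import dataclass

@dataclass
class Route:
    team: str
    runbook_url: str

ROUTES = {
    "vault":       Route("infrastructure", "https://runbooks.example.com/vault"),
    "kubernetes":  Route("infrastructure", "https://runbooks.example.com/k8s"),
    "billing-api": Route("payments", "https://runbooks.example.com/billing"),
}
DEFAULT = Route("incident-response", "https://runbooks.example.com/general")

def route_alert(alert: dict) -> Route:
    """Pick the on-call team for an alert based on its 'service' label."""
    service = alert.get("labels", {}).get("service", "")
    return ROUTES.get(service, DEFAULT)

alert = {"name": "VaultSealed", "labels": {"service": "vault", "severity": "critical"}}
target = route_alert(alert)
print(f"Page team '{target.team}', runbook: {target.runbook_url}")
```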
Jason: That’s a great process of documenting, really—right—documenting the process so that everybody, whether they’re on a different team and they’re coming over or new hires, particularly, people that know nothing about your established practices can take that runbook and follow along, and achieve the same results that any other engineer would.
Tomas: Yeah, I agree. And what was great to see was that once my team grew (we’re currently five, and we started with two), we saw the team members’ excitement about updating the process, so that everybody else who joins the on-call rotation will be excited and take it as an opportunity to learn more. So, we added the disaster roleplay. That section says: you’re a new person joining the on-call team, and we want to make sure you understand all the processes and all the necessary steps, and that you’re aligned with all the expectations. But before you actually get your first on-call alerts, we’d like to run a roleplay. Imagine the HashiCorp Vault cluster is going down; you should be the one resolving it. So, what are the first steps, et cetera?
And at that point you realize that whatever needs to be done is not only technical, such as go to monitoring, check the runbook, et cetera, but also about communication, because you need to communicate not only with your shadowing buddy but also internally, or to the customers. And that changes the perspective on how an incident should be handled.
Jason: That disaster roleplay sounds really amazing. Can you chat a little bit more about the details of how that works? Particularly you mentioned engaging the non-technical side—right—of communication with various people. Does the disaster roleplay require coordinating with all those people, or is it just a mock, you would pretend to do, but you don’t actually reach out to those people during this roleplay?
Tomas: We like to combine both aspects. We want to make sure that the person understands all the communication channels that are set up within our organization and what they’re used for, and then we want to make sure that person understands how to involve other engineers within the organization. For instance, the biggest difference we found is that you have plenty of options for how to configure assigning or creating an alert, and for each of those you may have different notification settings. What happened is that some people had notification settings only for newly created alerts, so when an already-existing alert was reassigned to someone else, it could happen that that person didn’t notice because their notification settings were wrong. So, we encountered even these kinds of issues, and we were able to fix them thanks to the disaster roleplay. That was amazing to find out.
Jason: That’s one of my favorite things to do when we’re using chaos engineering for something similar to the disaster roleplay: really checking those incident response processes. And validating those alerts is huge. There are so many times I’ve found that we thought someone would be alerted for some random thing, and it turns out nobody knew anything was going on. I love that you included that in your disaster roleplay process.
Tomas: Yeah, it was also a great experience for all the engineers involved. Unfortunately, we’ve run it only within our team, but I hope we’ll have a chance to involve all the other engineering on-call teams, so the onboarding experience for the engineering on-call teams improves and becomes amazing.
Jason: So, one of the things that I’m really interested in is that you’ve gone from being a DevOps engineer, an SRE individual contributor role, and now you’re leading a small team. I think a lot of folks, as they look at their careers, and I think more people are starting to become interested in this, wonder what that progression looks like. This is sort of a change of subject, but I’m interested in hearing your thoughts: what are the skills that you picked up and have used to become an effective technical leader within Productboard? What advice can our listeners, as individual contributors, start applying to advance their own careers?
Tomas: Firstly, it’s important to understand what makes you passionate in your career: whether it’s working with people, understanding their needs and their future, or whether you’d like to stay on the individual contributor track and enlarge your scope of responsibility toward leading more technically complex initiatives that take a long time to implement. In the case of infrastructure, or in the case of platform leaders, I would say the position of manager or technical leader also requires certain technical knowledge so that you can stay in close touch with your team and your most senior engineers, and so you can set the goals and the strategy clearly. But still, it’s important to be, let’s say, a people person and be able to listen, because then people will be more open with you, and you can start helping them and making their dreams come true and achievable.
Jason: Making their dreams come true. That’s a great take on this idea, because I feel like so many times, having done infrastructure work, you start to get the mindset that maybe people are just making demands of you all the time. And it’s sometimes hard to keep the perspective of working together as a team and really trying to give them a platform they can leverage to get things done. We were talking about disaster roleplaying, and that naturally leads to a question we like to ask all of our guests: do you have any horror stories from your career, some incident or outage that you experienced, and what did you learn from it?
Tomas: I have one, and it actually happened at the beginning of my career as a DevOps engineer. What’s interesting is that it was one of the toughest incidents I’ve experienced. It happened after midnight. At the time I was still new to the company, and we received an alert about too many 502 and 504 errors returned from the API. At the time, the API processed thousands of requests per second, and the incident had a huge impact on the services we were offering.
And as I was shadowing my on-call buddy, I tried to check our main alerting channel to see what was happening, what was going on there, and how I could help. I started by checking the monitoring system and reviewing all the reports from the engineers who were on call, and I initiated an investigation on my own. I realized that something was wrong, or something was not right, and I realized I was just confused and wanted sleep, so it took me a while to get back on track. So, I made a side note: how can I get my brain working the way it does during the day? And then I got back to the incident resolution process.
So, it was really hard for me to start, because I didn’t know what [unintelligible 00:24:27]. You knew about the channel, you knew about your engineers working on the resolution, but there were plenty of different communication funnels. Some of the engineers were deeply focused on their own investigations, and some of them were on call. And we needed to provide regular updates to the customers and internally as well. I had that inner feeling of ‘let’s share something,’ but I realized I just couldn’t drop a random message, because a message with all the information should have a certain format and certain information. But I didn’t know what kind of information should be there.
So, I tried to ping someone: “Hey, can you share something?” And in the meantime, several other people sent me direct messages. And I saw there were a lot of different tracks of people trying to solve the incident and trying to provide status, but we were not aligned. So, this all showed me how important it is to have a proper communication funnel set up. And we were lucky to end up in one channel, and lucky to resolve the incident pretty quickly.
And something else I learned that I would recommend: make sure you know where to look. I know it’s a pretty obvious sentence, but once your company has plenty of dashboards and you need to find one specific metric, it sometimes looks like mission impossible.
Jason: That’s definitely a good lesson learned, and it feeds back to those disaster roleplays: practicing how you do those communications and understanding where things need to be communicated. You mentioned that it can be difficult to find a metric within a particular dashboard when you have so many. Do you have any advice for people on how to structure their dashboards, name them, or organize them in a certain way to make it easier to find the metric or the information that you’re looking for?
Tomas: I would take a slightly different approach, and that is to have a basic dashboard that gives you the SLOs of all the services you have in the company, so you first understand which service actually impacts overall stability or reliability. That’s my first piece of advice. And then you should be able to either click on a specific service, which redirects you to its dashboard, or have your favorite dashboards starred. So, I believe the most important thing is to really have one main dashboard where you have all the services and their stability summarized, and from there you have the option to look deeper.
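To make the single main dashboard idea concrete, here is a rough sketch of the per-service SLO summary such a dashboard might surface, with links out to each service’s own dashboard; the services, SLO targets, measurements, and URLs are hypothetical examples, not Productboard’s actual setup.

```python
# Sketch: summarize per-service SLO attainment for a top-level dashboard,
# flag anything below target, and link to the detailed service dashboard.
# Service names, targets, measurements, and URLs are hypothetical.
services = [
    # (service, availability target, measured availability, dashboard URL)
    ("api-gateway", 0.9990, 0.9993, "https://grafana.example.com/d/api-gateway"),
    ("vault",       0.9995, 0.9991, "https://grafana.example.com/d/vault"),
    ("billing",     0.9990, 0.9974, "https://grafana.example.com/d/billing"),
]

for name, target, measured, url in services:
    # Error budget consumed: share of the allowed unavailability already used.
    allowed = 1.0 - target
    consumed = (1.0 - measured) / allowed if allowed else float("inf")
    status = "OK" if measured >= target else "BREACHED"
    print(f"{name:<12} target={target:.4f} measured={measured:.4f} "
          f"budget_used={consumed:6.1%} [{status}] -> {url}")
```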
Jason: Yeah, when you have one main dashboard, you’re using that as basically the starting point, and from there, you can branch out and dive deeper, I guess, into each of the services.
Tomas: Exactly, exactly true.
Jason: I like that approach. And I think that a lot of modern dashboarding or monitoring systems now, the nice thing is that they have that ability, right, to go from one particular dashboard or graphic and have links out to the other information, or just click on the graph and it will show you the underlying host dashboard or node dashboard for that metric, which is really, really handy.
Tomas: And I love the connection with other monitoring services, such as application monitoring. That gives you so much insight, and when it’s also connected with your work management tool, it’s amazing, because you can have all the important information in one place.
Jason: Absolutely. So, oftentimes we talk about, what is it, the three pillars of observability, which I know some of our listeners may hate, but the idea of having metrics, performance monitoring/APM, and logs, and how they all connect to each other, can really help you uncover a lot of information when you’re in the middle of an incident. So Tomas, thanks for being on the show. I wanted to wrap up with one more question, and that’s: do you have any shoutouts, any plugs, anything that you want to share that our listeners should go take a look at?
Tomas: Yeah, sure. So, as we’re talking about management, I’d like to recommend one book that helped shape my career, and that’s Scaling Teams. It’s written by Alexander Grosse and David Loftesness.
And another book is from Google; they have a series of three, one of which is Seeking SRE, and I believe the other parts are also useful to read if you’d like to understand whether your organization needs an SRE team and how to implement it within the organization, and also technically.
Jason: Those are two great resources, and we’ll have those linked in the show notes on the website. So, for anybody listening, you can find more information about those two books there. Tomas, thanks for joining us today. It’s been a pleasure to have you.
Tomas: Thanks. Bye.
Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.