Podcast: Break Things on Purpose | Paul Marsicovetere, Senior Cloud Infrastructure Engineer at Formidable
Break Things on Purpose is a podcast for all things Chaos Engineering. Check out our latest episode below.
In this episode of the Break Things on Purpose podcast, we speak with Paul Marsicovetere, Senior Cloud Infrastructure Engineer at Formidable.
Episode Highlights
- Accidental SRE (1:41)
- Migrating to AWS (4:18)
- Prod vs Non-prod (8:43)
- Mentoring and advice (12:57)
- Failure is normal (19:41)
Transcript
Jason Yee: So before we get started, can you help us pronounce your last name?
Paul Marsicovetere: Yeah, it's Marsicovetere, it's Italian from my dad's side. It's at the ankle of the boot near Potenza. I'm originally from Melbourne and I live in Canada now, so it's all very confusing. My mum is English, so just throw the cultures all in there. It's great.
Jason Yee: Welcome to the Break Things on Purpose podcast, a show about failure, experimentation and learning. In this episode, Paul Marsicovetere joins us to chat about incidents and mentors.
Jason Yee: You mentioned back when you were in Australia, and that's actually how we got introduced, through Tammy Bryant, who is at Gremlin.
Paul Marsicovetere: Yeah, it was back in my second year of university doing my first internship, and I was fortunate to have Tammy as my assigned mentor, and yeah, they're super awesome and super great. I wasn't as invested in tech as a career at the time, I was happy just to get the tech internship experience as it was, but my mind was on where I was going to go and move post-degree, and it was really, really cool to see someone so passionate and just crushing it with everything that they did. So it's definitely something to aspire to. That's how we got along.
Accidental SRE
Jason Yee: So that was your entrance into tech as a job?
Paul Marsicovetere: Yeah, I did a Bachelor of Information Technology in Melbourne, primarily because it had the internship opportunities. It was a bit more geared towards project management and business analysis versus straight SRE, which I do now day to day. And so the internship experiences were really key to work for large orgs and see how they cut and slice things; working in a bank is quite eye-opening from a tech perspective. And that's how I got involved. When I finished university, I had a few graduate positions lined up in Melbourne, but I wanted to make the move, and I did so under the presumption that if I moved and didn't get a job here in Canada, and Toronto specifically, I'd just move back. What was funny is that I moved, I got a job in client support, and then eventually worked for a power lines company outside of Toronto in their IT department doing server admin, desktop support and client support, which involved bill print design.
Paul Marsicovetere: It's very far removed from what I do today, but it was really good grounding in systems admin and backup administration and how to support an AD cluster on-prem. So yeah, that's how the tech experience went along the journey; that particular role provided really good grounding for my day to day now. I landed a job with Benevity, they're a SaaS company in the corporate social responsibility space, and it was in their web operations team, as it was named at the time. So I owe my current career direction and path to both Rob Woolley and Nina d'Abadie, because they took a chance on me joining the team. I didn't have a lot of great Linux experience in supporting websites as a whole, but I was fortunate enough to be chosen by the company to migrate from our hosted provider into AWS. So that changed everything for me personally, and that team changed into SRE, and the rest is history. I like to tell people I'm an accidental SRE in a way. Like, was it an accident? Was it fate? Who knows? I think for me it was probably right time, right place, that's it.
Jason Yee: That's an interesting journey because I feel like because SRE is a relatively new field or a new career title, most of us who've ended up in this space, never intended to be in this space. We were always sysadmins or developers and we sort of happened into this. So it's always interesting to hear people's journeys.
Migrating to AWS
Jason Yee: You mentioned something that I think a lot of our listeners will be really interested in, and that is that migration to AWS and I'm curious what that experience was like for you. And if you have any stories to share around that.
Paul Marsicovetere: Yeah. So fortunately it was mostly in a hosted provider. It's quite funny because when I was working out in Toronto, everything was on-prem. And then I went to a hosted provider at Benevity and it was this VMWare setup, and for me, I was like, this is cool, I don't have to actually rack a server myself or pop in a hard drive, this is great. And I didn't really realize at the time that it wasn't super scalable. And that was the reason why Benevity chose AWS, because of the scalability factor. The team at the time grew, and we had a two-day migration which will kind of live on in infamy forever. It's one of those bonding experiences that you just can't replicate, to see it come to life and to fruition. It was definitely challenging, but totally worth it for all the success you get afterwards and everything that you get by going into AWS in terms of straight scalability. It was quite funny, it took me a little bit to realize: we had some clients when I was working for Benevity that were in Australia, and our hosted provider was in the US, and sometimes they would say things like, "Oh, it's really latent and we can't get to the website", and we now in AWS had the global reach to be like, why don't we just pop a server in Sydney and see what is actually going on? So just little things like that. It was totally, totally the right call, even if it's tricky to get into, but it's good.
Jason Yee: I'm curious if you could dive into that. You mentioned that it was two days of an extremely intense, maybe unpleasant experience. Any particular details that contributed to that?
Paul Marsicovetere: I would say the thing about our team then, and probably still now was that we had just really good comradery and really good trust. And we were all hyper-focused on making sure that we scaled a few things down. I think it was early Saturday morning and we were like, by the time Sunday, end of day rolls around and Monday morning everything's going to be online and it's going to be like nothing happened. And I think it was some point late on the Sunday.
Paul Marsicovetere: Rob was saying, "Hey, we're in, we're in now, and it's only forward that we're going to go", so that's probably what sticks out in my mind. We had a war room for the week, and there were a few different database oddities where it was like, we didn't realize this was still running on this old database that we moved. "How do we make it work?" So very much a fail forward kind of mentality at that point, because you have to. And I think it took a good three to four months to really get used to everything, but it was great, and we got to share our experience at a meetup with the greater Calgary area, and that just reinforced that we did the right thing. It was a good bonding experience for everyone.
Jason Yee: Yeah. That's especially nice when you can take a weekend or take some time to actually do that, because I know a lot of times we think that our systems have the requirement to be a hundred percent uptime, 24/7, no downtime. When in reality, I think a lot of organizations can afford a little bit of downtime, especially for major things like cloud migrations, where the customer will ultimately get some huge benefit from it. I think customers recognize when you're trying to make their lives better by providing a more reliable system or service or more features, and they're willing to accept some of this with the idea of, "We're upgrading, you're going to love this. You're going to really benefit from it".
Paul Marsicovetere: Yeah. That's the thing, the selling of the upgrade and what you get out of it is super key, because with our previous provider, if you wanted to make a port change, you would have to get a maintenance window in somehow, some way, get a support ticket. It was just not very efficient. So going from a lot of handholding in a hosted provider, where that was probably the upper limit of how many times we could deploy a day, to deploying multiple times a day going forward... You can't really put a price on it, right?
Prod vs Non-prod
Jason Yee: You mentioned that you had somewhat of a horror story from your experience.
Paul Marsicovetere: I particularly find that the incidents that you cause yourself are the ones that you learn the most from. So yeah, during my time at Benevity, we had a fleet of web servers behind a load balancer, and in front of that load balancer was another set of Nginx servers that did some custom path routing and redirects. We ran deploys during a protected day of the week, and that was the week where it was my turn to finally get some of the Nginx rewrites into prod. So you can probably see where this is going, but I actually spent the previous week really testing out the change thoroughly in our non-prod environment to ensure that it was safely rewriting and doing the things that it was supposed to be doing. Everything looked good in non-prod, we merged the review into our workflow. So yeah, happy days, and then deploy day rolls around.
Paul Marsicovetere: And we had our Nginx nodes in an ASG. We were kind of following the Golden AMI process. So we deployed the nodes, the health check of the ASG was green for the new nodes, instances were up, and I'm like, "Cool, job done, let's scale down the old nodes". And it had to be less than 10 seconds before the pager just starts going crazy: every site is offline and unreachable. So not a good feeling when that happens, but I guess the one good thing about that particular incident is that I knew straight away what the problem was. It was my Nginx push, clearly. So yeah, the team was online during the deploy, thankfully. And as soon as the question comes up, because it comes up straight away, "Hey, what's happening?", I let everyone know: yep, pushed Nginx, something's definitely wrong. And we followed the incident response process.
Paul Marsicovetere: And like I said before, at Benevity we kind of followed a fail forward approach, but this time we didn't really have the time or space to do that, or the error budget, because the sites were offline at midday. So we just said, let's roll back to the last known good image. Unfortunately we were only about four to five months into AWS, so we didn't have every rollback process in place. But I'd say within about five to 10 minutes of scrambling, we marked the previous Nginx image as the latest, scaled the ASG, and the sites were reachable again. The root cause was twofold: one, the rewrite rules were completely not compatible with the production website, unlike the non-prod environment; and two, the health check of the ASG only showed if the node was healthy, it didn't show if the Nginx service was up. So the Nginx service itself was down even though the rest of the node was perfectly healthy. Like, what good is that node at that point? So that's one I think about every now and again.
Jason Yee: That's great insight, having that monitoring into Nginx, because the nodes don't actually particularly matter unless they're completely down. You did a bunch of testing in non-prod and things worked just fine, and then you go to prod, and I think we can encounter this a lot, right? Prod is never the same as staging. What lessons were learned that you applied back to non-prod to help you ensure that those tests were more reliable?
Paul Marsicovetere: You hit the nail on the head. It's just a constant reminder that prod is so different to every other environment. I really dig the Charity Majors mindset and messaging around testing in prod, because it's true that the incident probably could have been avoided if non-prod was exactly like prod to a T, but who knows if another push to the prod web servers right before I pushed my changes would have caused them to fall out of lockstep anyway. So I think it was more just around trying to get non-prod more closely aligned with prod, but then being super vigilant about the fact that one health check can't really rule them all when it comes to the actual nodes themselves, because quite clearly, even if the server itself is fine, the service needs to be working correctly, and in that particular instance, it was definitely not.
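The ASG lesson maps to a concrete AWS setting. As a rough sketch, and assuming an Application Load Balancer target group plus a /healthz location served by Nginx (the ARN, names, and path below are placeholders, not details from the episode), combining an HTTP health check on the target group with an ELB-type health check on the Auto Scaling group makes the ASG replace nodes whose Nginx service is down, not just nodes whose EC2 instance is unhealthy:

```python
# Hypothetical sketch using boto3; the target group ARN, ASG name, and /healthz
# path are placeholders. The point: health-check the Nginx service itself,
# and let the ASG act on that signal instead of EC2 instance status alone.
import boto3

elbv2 = boto3.client("elbv2")
autoscaling = boto3.client("autoscaling")

# 1. Point the load balancer health check at a path Nginx actually serves,
#    so a dead or misconfigured Nginx fails the check even if the host is up.
elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/nginx-edge/abc123",
    HealthCheckProtocol="HTTP",
    HealthCheckPath="/healthz",
    HealthCheckIntervalSeconds=10,
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=2,
    Matcher={"HttpCode": "200"},
)

# 2. Tell the Auto Scaling group to use the ELB health check (service-level),
#    not just EC2 status checks (node-level), when deciding to replace a node.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="nginx-edge-asg",
    HealthCheckType="ELB",
    HealthCheckGracePeriod=120,
)
```

With EC2-only health checks, the scenario Paul describes, a healthy instance running a dead Nginx, still counts as healthy; switching to ELB-backed checks is what ties node replacement to the service actually answering.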
Mentoring and advice
Paul Marsicovetere: So it was something to look out for in the future. When I used to work at Benevity, a former colleague of mine, Alex Castro, probably gave me the best piece of SRE advice that I've ever gotten. I was very green in the web operations team at the time. So this is kind of a sidebar story, but I was actually running a shell script on a remote host and it had about 20,000 lines to get through, but it was only running at about one line a second. So that was going to be a really long time of me moving my cursor to make sure I didn't drop the session connection, and he helped me safely exit the script and we set it up in screen, which I would always do from then on if I ever had to SSH to a server again.
Paul Marsicovetere: But afterwards he said to me, "Hey, I know you're usually pretty calm, and I could see you getting quite anxious and a bit frazzled when that was happening with the script." He said, "Whenever something's happening on the system, or if there's a page and the site's down, remember to just breathe." And that lesson of literally just breathing has helped me a lot. It definitely helped me in that instance, because it's not great to cause a site outage for everyone, but the "just breathe" advice kicked in, and it really does give you a bit of focus and clarity, especially during those very stressful events. Another thing is that I'm also someone who, if I make a mistake, will put my hand up and own it, and in my opinion, good SRE teams have solid communication between teammates, but great SRE teams have trust.
Paul Marsicovetere: And at Benevity, it was kind of non-negotiable that you really had to trust each other, trust each other to do the right thing, not shy away from making mistakes, and let your teammates know when it was you or when something happened and you see something. So that was another key aspect of that particular incident. The other thing I was going to mention is that I'm a bit crazy in that I like to document playbooks and runbooks diligently, and at the time our team didn't have a "Hey, how do we roll back safely and quickly?" runbook. We ate a few extra minutes of downtime because of that, but after that point, with every new prod service launch, I always try to include a guide or a diagram, a simple runbook, something quick that helps step through a few different points and makes things a bit more efficient when you're time pressed. Right? So lots of, lots of learnings.
Jason Yee: I'm going to say, you're the type of teammate that I think everybody wants, especially with that documentation. So that's always so key. You always get in that situation and you're just like, there's no documentation and so you're trying to figure out who's the subject matter expert who can tell you what to try. And then at the end, most people get kind of lazy and you forget about documenting, because you're like, now I know that thing and you're not really thinking about the next person that doesn't know it.
Paul Marsicovetere: I always try and think of that. What if someone else received this page, what would they do? Where would they go? How do you log in? And I mean, I could do it 50 times in a row and then it's muscle memory. But the next person that picks up the pager and takes on an alert, they need to know, they need some help. So yeah, I really, I do enjoy doing that, makes me a bit weird, but I own it.
Jason Yee: I think that dovetails with something that's sort of a theme of what I've seen in your career: you've had all of these great mentors who have really helped you and taught you the things that you didn't know, to help you become a better engineer. And it sounds like documentation is sort of one of your efforts to do the same. As you think of mentorship and helping other engineers level up, what are some of the ways that you try to do that, or advice that you could give to others to help them mentor more junior engineers?
Paul Marsicovetere: Yeah. That's a good question. I've been fortunate to do a little bit of mentoring for interns and the like, and one of the things is definitely trying to know your audience, for documentation that's pretty key, and then alongside that, the good SRE principle of trust, because everyone, especially when they're green, is going to make mistakes. So it's really important to communicate often and openly, and even communicate your ideas and thoughts and feelings.
Paul Marsicovetere: If you have a sense that something isn't going to work well, please raise it. Don't just think to yourself, "I don't think this is going to work. I'm gonna let it fail." That's not great, because we can all learn from each other, and I do try to take on board a little bit of everyone's experience as well. Even if you're fresh into AWS or fresh into the job market, you can learn from every level, and you've got to be open to that, because people are showing me things like, "You don't need to write it this way, do it like that." And I'm like, "Wow, that saves me so much time. Thank you very much for turning 15 lines of code into three, that makes a huge difference." So it's about being open, right?
Jason Yee: That's an excellent point. You can learn something at every level. I like that point because even for folks that are extremely skilled, everybody's experience is different. You mentioned using screen.
Paul Marsicovetere: Yeah.
Jason Yee: And we've been ingrained to not log into servers. Right. Everything's done remotely through APIs. And even at that, like why should you log into a server? Just kill it and restart it. Right. That's the common troubleshooting method. And so I'm curious how many people have never heard of screen just because who SSHes into machines these days.
Paul Marsicovetere: Yeah. I mean, back when I was at the power line utility company, it was printer support and "Why isn't this printer talking?" Oh, the port is down, crazy things on a switch, and "Yeah, there's the cable, it's not flashing", things that nowadays are so far removed from anything you would expect in a cloud native environment, because the cloud guidance and method and framework is just kill it and get the next server online. But there are some situations where you've got to dig a little deeper and you want to protect that asset, because there might be some particular log or some security event where you're like, "I can't kill this right now. I have to save it." So you need to have a few different tools to be able to get the job done. So yeah, fun times in the cloud environment. It's always changing, but I think some of the principles, even around backups and things like that, are always going to stay pretty, pretty true. And if you have a good grounding in that, it can really take you far.
Jason Yee: So you've talked about your experience at your previous job. Now, with your current job, what's your role like?
Paul Marsicovetere: So I work for a company named Formidable. We are an awesome consulting company that works with large, mostly e-commerce clients. Right now I'm working with a large retail client in an SRE sub-team, and we're responsible for an internal CI/CD tooling framework. So that's where my day-to-day focus lies right now. And Formidable is a really great company. We were founded by engineers, for engineers, the company values are craft, autonomy and inclusion, and it's an incredible culture. I've really enjoyed being a part of it for the last nine months, learned a ton, great people and really fun and interesting projects for sure. Coming from an SRE background, I always gravitate more towards cool and exciting experimental projects within the cloud, and that's what I'm fortunate enough to be working on right now. It's pretty cutting edge, and it does make you think about situations in different ways.
Jason Yee: So you mentioned things that are experimental. I think one of the challenges that folks often face in the industry, especially in the SRE role is the primary concern is reliability. And so there's always this hesitancy to want to experiment, especially when it is extremely risky, not just meaning Chaos Engineering and actually causing incidents, but just straight up experimentation of, I don't know what this new technology does, let's try it out. Do you have any thoughts on that and advice for folks? How do you balance that experimentation with reliability?
Paul Marsicovetere: It's a good point. I personally am of the opinion that, especially now in the cloud, everything is going to fail at any point, and we really adopted that mindset when we were at Benevity as well. It's not going to be like the server is going to be up for 400 days straight and never have some sub-component fail. So you have to intrinsically build reliability into a lot of your design, and even then, sometimes you can't build in certain aspects, so you need to fall back on your playbooks or your runbooks so that you can recover your reliability if a certain incident happens. I think it's a real balance, because I'm also of the mindset that speed is everything. You need to have something fast and snappy, because if it is too slow, you lose adoption and you lose interest. So you have to balance speed, reliability and cost.
Failure is normal
Jason Yee: I'm curious if you want to share your second story that you have?
Paul Marsicovetere: Yeah. So it's another incident from my time at Benevity, and it was kind of funny in retrospect. There was another deploy day, a typical deploy day. We had a bunch of security patches and some general code changes for the web servers that had to go out, and at the time we were actually using AWS's EFS as a mounted NFS persistence layer for key website and service functionality. So we patched everything, and when we were about halfway through, a senior dev came over and asked, "Hey, did you patch the servers with our latest build?" And our team responded, "Yeah, we did your servers. Are you seeing something? Because the servers are up and the health checks look good." And they just very simply said, "Take a look at the homepage." And we go, and the homepage is literally just a menu bar with a gray screen, nothing else.
Paul Marsicovetere: So yeah, the health check was a 200, sure it was up, but it wasn't doing anything. There was nothing actually on the site. So we scrambled, checked logs, checked the build, checked the patches, followed the incident response process as we did, and we eventually found that it was actually the EFS mount itself returning an error saying that it wasn't mounting correctly. What was crazy was that we checked the AWS status page and it wasn't showing anything, and the actual EFS mount points in the console looked fine, everything was green. So we contacted support and they told us, "Oh yeah, we're having some issues with the stunnel channel for TLS on EFS mounting." So we followed incident response, I was on point, and we had to remount everything without TLS while AWS resolved the stunnel issues on their end, and then remount with TLS once everything was okay.
Paul Marsicovetere: So all in all it was about five hours of really having to handhold the servers and make sure we weren't eating unnecessary downtime as we went along. But yeah, it was just another reminder that systems architecture is complex. There's no easy way to say it. I've never truly seen a simple service or application, because, especially with the complex nature of the cloud, services can break anywhere, at any time, for literally anything. Think about it: who would have thought stunnel for TLS breaking would be the thing? "That's where I'm putting my money, that's what it is." If you'd bet on it, it would've paid a thousand to one. So it's just another reminder to expect the unexpected at all times in SRE.
Jason Yee: I think the other lesson is when AWS shows you a green light, do not trust it.
Paul Marsicovetere: Well, it's funny because I love AWS. I really do, it's really given my career a focus and a direction, but any cloud vendor, AWS, any application, any service, nothing is a silver bullet. It's not a silver bullet for all your problems. It helped Benevity scale a ton, way more than we would have ever thought. But like any service, you have to expect it's going to become unavailable or super degraded at any time, and if you're not really ready for that, especially when you've just migrated into it, it can be a bit of a shock when it's something as crazy as stunnel going down, because then your stack is compromised at that point. What can you do?
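One way to surface this class of failure earlier, independent of AWS's own status reporting, is to probe the mount itself. This is only a rough sketch under assumptions not stated in the episode: the /mnt/efs path is a placeholder, and in practice a hard-hung NFS mount can block rather than fail, so you would run the probe under an external timeout.

```python
# Hypothetical deep health probe for an EFS/NFS mount. The instance and the
# AWS console can both look green while the mount is unusable, so check that
# the path is really a mount point and that a small write round-trips.
import os
import tempfile

MOUNT_POINT = "/mnt/efs"  # placeholder path; adjust to your layout

def efs_is_usable(path: str = MOUNT_POINT) -> bool:
    if not os.path.ismount(path):
        return False
    try:
        # Write-and-delete catches read-only or failed mounts. Note: a hard-hung
        # NFS mount can block here, so run this under an external timeout.
        with tempfile.NamedTemporaryFile(dir=path, prefix=".efs-probe-") as probe:
            probe.write(b"ok")
            probe.flush()
            os.fsync(probe.fileno())
        return True
    except OSError:
        return False

if __name__ == "__main__":
    # Non-zero exit lets cron, a systemd timer, or a monitoring agent alert on it.
    raise SystemExit(0 if efs_is_usable() else 1)
```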
Jason Yee: I was complaining the other day to somebody about websites that return a 200 OK with an error message in the body of the response. And I'm like, no, it should be returning a 500, even if there's an error message in the body. And I think the other lesson was just thinking about monitoring, having extra visibility, the idea of diving deeper with your monitoring, diving deeper with how you verify that things are actually okay.
Paul Marsicovetere: Yeah, exactly. Because I think we even had something set up that just checked URL endpoints, but not the actual home page. And it's like, well, if a user is clicking through, they're not going to know to hit the exact URL that they want to go to. You have to look at the whole thing. So yeah, health checks are fun ground, for sure.
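A synthetic check along these lines is straightforward to sketch. Assuming a placeholder homepage URL and a marker string that only appears on a correctly rendered page (neither comes from the episode), the check below fails both on non-200 responses and on the "menu bar with a gray screen" case where the status code alone looks healthy:

```python
# Hypothetical synthetic homepage check: don't trust a bare 200. Fail unless
# the page also contains content a healthy render would include. The URL and
# marker string are placeholders.
import sys
import requests

URL = "https://example.com/"          # placeholder homepage URL
EXPECTED_MARKER = "Featured causes"   # placeholder string from a healthy page

def homepage_is_healthy(url: str = URL) -> bool:
    try:
        resp = requests.get(url, timeout=5)
    except requests.RequestException:
        return False
    # Status code AND content both have to look right.
    return resp.status_code == 200 and EXPECTED_MARKER in resp.text

if __name__ == "__main__":
    sys.exit(0 if homepage_is_healthy() else 1)
```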
Jason Yee: Thanks for sharing your stories and especially your advice on helping to educate others and helping other engineers level up. Before we log off, I'm curious if you would want to make any plugs, anything that listeners should know about.
Paul Marsicovetere: Yeah, for sure. Like I said, Formidable is hiring, so check us out at formidable.com or FormidableLabs on GitHub for more info. I myself am pretty easily found on LinkedIn because of my last name. I have a somewhat dormant Twitter, @paulmarsicloud, and a blog, thecloudonmymind.com, and I'm hoping to be more active on those platforms going forward. I also wanted to mention that there have been some really horrific events happening up here in Canada recently, so I just wanted to urge anyone to donate to the Indian Residential School Survivors Society, and to head to LaunchGood and search for London Community United Against Hate.
Jason Yee: We'll definitely have links to those online so that people can click over to that. Thanks for joining us, Paul.
Paul Marsicovetere: Yeah. Thank you again so much for having me. I hope to do it again one day, with new and exciting incidents as more time goes past.
Jason Yee: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.