Podcast: Break Things on Purpose | Jose Nino, Staff Software Engineer at Lyft
Break Things on Purpose is a podcast for all-things Chaos Engineering. Check out our latest episode below.
In this episode of the Break Things on Purpose podcast, we speak with Jose Nino, Staff Software Engineer at Lyft.
Episode Highlights
- Balancing consistency and complexity (1:22)
- Extending consistency to mobile clients (3:07)
- Know what you're trying to solve (9:50)
Transcript
Jason Yee: Welcome everyone to another episode of Build Things on Purpose, part of the Break Things on Purpose podcast. With this series, we like to talk to people that have built incredible things. So today with us, we have our co-host, Ana Medina, and I've got our special guest, Jose Nino. Jose, why don't you introduce yourself and tell us a little bit about what you've built.
Jose Nino: Yeah. Thank you. Thank you for having me, Ana and Jason, it's great to be here. So I am a software engineer and for the past little bit, over four years now, I've worked at Lyft in the networking space. The first half of that, I worked on the founding team for Envoy Proxy, which has been deployed in countless production environments to handle network traffic in the server ecosystem. And then for the last two years or so, I've worked with the team that created Envoy Mobile, which is pursuing bringing Envoy into mobile clients, and we'll dig deeper into why we wanted to do that crazy thing.
Jason Yee: Yeah. So for our listeners who aren't familiar with Envoy, tell us a little bit about how that works.
Balancing consistency and complexity
Jose Nino: Yeah. So starting with Envoy. The idea for Envoy really came out of Lyft's move from this giant monolith to an environment of a bunch of microservices. And what started happening when Lyft started pursuing that is that we had a lot of inconsistencies around our ecosystem. We had services in Python. We had a front proxy that needed to handle requests from our mobile clients. We had connections to databases. I mean a total mess. And we were really only adding to the mess by breaking apart into hundreds and hundreds of microservices. And really the insight there is that while we didn't want to limit what our product engineers could build at the application layer, we wanted consistency underneath that. We wanted consistency in three dimensions mainly. We wanted consistency of configuration, consistency of features and resilience features, and importantly, we wanted consistency of observability. And this is all at the network layer.
We wanted to make sure that regardless of what the application was doing, we had consistency at the network because we were creating this giant ecosystem of interconnected and networked applications. And so having consistency there, what it really did was lower the cognitive load of the operators, of people that respond to incidents at the time of incidents, so that we could get better times to resolution. And we could get better features across the board to prevent those problems in the first place. So Envoy came out of that insight that while we didn't want to limit the top layer, we really did want to limit and to create consistency at the network layer.
Extending consistency to mobile clients
Ana Medina: That's super, super cool. We hear about all these microservices and when folks are building out microservices, you just think that it's going to be simple and easy and that all your problems are solved by just being in a microservice architecture. It seems like that also just led for y'all to have more room to build. And you now wanted to bring this exact same consistency closer to the users over in their mobile devices. Is there another reason specifically why mobile devices really needed something like Envoy Mobile?
Jose Nino: Right. So two years ago, after Lyft successfully broke apart its monolith and moved into this microservices world and did it with some pain, but also a lot of gains from this layer of consistency, we started to think, we have consistency all across our server infrastructure, from the moment that a request comes in at our edge and then through, but really we're providing consistency, and we're providing reliability server-side, but our customers don't interact with our servers. At least materially, obviously every request goes to the servers, but what they're actually interacting with is the mobile client, right? If you open the Lyft app, and you can't request a ride, and you get a spinner that is not going anywhere, you're going to have a frustrated customer, both as a rider and as a driver. So the insight into moving from Envoy into Envoy Mobile was we've gained a lot from this consistency server-side.
There's no real reason that we should think of our mobile clients as completely separate from our infrastructure. In fact, it would benefit us to think of our mobile clients as just another node in this networked ecosystem. And so we investigated that from different angles and obviously from existing software. But we felt, ultimately, that the only way to bring this universal consistency at the network layer was to take the software that we had built for the servers, Envoy, and build it for the mobile clients. And so it seemed a little crazy when we started thinking about it, and not even from, "is it going to run," but is it even going to compile? How are we going to be able to compile the C++ codebase that we built for servers into Swift and Kotlin binaries that we can deploy with our iOS and Android applications? So it took a leap of faith, but at this point, we have Envoy Mobile deployed to both our rider and our driver production clients at Lyft, and it's handling production traffic that our riders and drivers are experiencing.
Jason Yee: So I'm curious, along with that, I have a little bit of experience with Envoy, at least within the context of having played with Istio and service meshes and things like that, because Envoy has become such a foundational piece of software in the service mesh space. I know almost nothing about how mobile applications work. So tell me about that. How did you get it to run on mobile applications? How does it run? What were some of the challenges that you faced?
Jose Nino: Well, that's you and me, Jason. Starting this project, I had actually never worked in the mobile space. And a personal note here: one of my favorite places to be in, in terms of work, is that in between place of you're stepping on this platform or something you know, in this case that was Android for me, but all around you is unknown, right? It's like, how do I even work in the mobile space here? I had never done that. Thankfully, our team is really quite, it's just a very fortunate intersection of people. The four engineers that we have in our team have deep expertise in different aspects and non-overlapping aspects. So we had a very talented iOS engineer, a very talented Android engineer and another, a server engineer like me, but they had worked in the industry in the networking space for over 15 years.
So we kind of all didn't know enough, but knew enough to kind of help each other during the early stages of this project. And so really it all started with a very, very rough proof of concept, right? Can we compile this? And at that stage, we actually had another engineer that helped us with all the compiler issues and getting this to compile for iOS and Android. But to our surprise, and we were glad about this, it actually wasn't as terrible as we thought it could be to get this to compile into something that we could deploy to iOS and to Android. And a lot of that came from Envoy being built with Bazel, which I'm not going to get into that whole space, but you know, Bazel at this point had enough interoperability between cross toolchain that we were able to package Envoy into something that we could deploy to iOS and to Android.
Ana Medina: It seems like your entire team learned a lot through the process. What are some things that y'all learned, things to consider when you're building something so fundamental on the edge for mobile devices, and specifically for two platforms at the same time?
Jose Nino: Yeah, that's actually one of the bigger points. Going back to the topic of consistency, I don't want to say a lot, but a not insignificant amount of the issues that we have had with our mobile clients is inconsistency between iOS and Android, right? Like you have iOS engineers that worked, to some degree, in collaboration with Android engineers. And because it's two codebases, you end up building things that are slightly different, right? And you then create this nightmare of scenarios where you think you're doing the same thing in both, and you're just not.
And so, one of the things that we wanted to bring with Envoy Mobile to the mobile clients was consistency of the codebase because now we were writing just one codebase, Envoy Mobile, and compiling and deploying to two different platforms. And there are some differences there. And that's one of the lessons that we learned. There will always be low level differences in how iOS and Android deal with things like network sockets and all that stuff. But we were at least eliminating the inconsistencies as much as possible. We were really getting to the bare minimum of inconsistencies that we could bring along.
Know what you're trying to solve
Ana Medina: That's still super fascinating to me. One of the things that I wanted to ask you, now that you have had so much experience with Envoy, specifically Envoy Mobile, what's the one thing that people need to think about regarding reliability, specifically, when they're actually implementing Envoy Mobile into their applications?
Jose Nino: Yeah. So I think we're going to sound like a broken record at the end of this with microservices and playing with the networking stack of your infrastructure. But really like the main principle, and I think we did this to begin the project anyways, was do we really need this? Are you at a point in your infrastructure, a point in the volume of traffic, are you at a point in number of customers that the complexity that you're adding into your infrastructure by deploying something like this? And at this point, we're talking about server and mobile infrastructure. Is that complexity really worth it? And that's a case by case scenario in every company, but at the point that Lyft is at with the amount of traffic and the amount of, at the mobile client specifically with riders and drivers moving around and spotty and lossy connections, it became clear that the complexity was worth it in this case.
So that's one thing. I think a second thing with Envoy specifically is, like you said Jason, Envoy has become this very foundational piece of networking infrastructure. And one of the benefits, but it can also be a double-edged sword there, is how feature rich it has to become. As the maintainers of Envoy, we have tried to guide the project a little bit by showing you what is the bare minimum and what is all the added functionality. But I think you need to come at Envoy knowing what it is that you're going to use it for. Because if not, you're going to be either overwhelmed by the amount of functionality or misguided into thinking that you need all of that to operate a reliable service. So I think both points really boil down to making sure that you know what it is that you're trying to solve before you go and pursue this added complexity.
Ana Medina: That also sounds like the perfect advice for anyone that's considering a lot of the cloud native technologies like this abstraction layer that you're putting in. Is it going to be helpful for your day-to-day application?
Jose Nino: Right. And that's what really on the flip side we're trying to do with something like Envoy/Envoy Mobile at Lyft is, are we providing the right level of observability, configurability and resilience features that we need to provide to our product engineers so they can operate these things without an increased cognitive load.
Ana Medina: So you mentioned that this has already been implemented on the Android side of the applications over at Lyft. Are there any learnings that y'all have taken as you're running this in production?
Jose Nino: Yeah. Sorry, just to clarify, it's on both the iOS and the Android mobile clients, but it's funny because we obviously experimented with replacing the legacy networking stacks and had a slow ramp up of the production traffic that we were taking. And it was one of those classic tales of you have these aggressive timelines and when you think you're going to be done. And the more production traffic we took, the more little wrinkles appeared here and there that is like, oh, this is going to take a little bit longer than we expected it to. One of the interesting ones, and this was actually on iOS, not on Android, that we didn't expect is, and this goes back to really diving into the low levels of how the networking technologies on iOS and Android work, but DNS had some very peculiar behavior with the third party library that Envoy uses for DNS resolution in iOS.
And it took some time to realize that it was DNS, and it's a meme, but it almost always ends up being DNS. But in this case it was DNS, and it required us to understand how DNS resolution was working, and how it was working differently in iOS. And so that's one of the things that I love about this project is there is enough technical depth that you can really focus on as deep or as high as you can think in the networking stack. And you will find things that you can optimize. And in this case that you need to fix in order to continue your production rollout.
Ana Medina: This is still continuing to be fascinating to me just cause it's like such a different complexity, but these problems totally need to be solved and so exciting to have you on this podcast and get to hear all of it. I hope that folks that are listening really do get a chance to dive a little bit more deep on Envoy and Envoy Mobile. And I mean, if you have any questions we have Jose Nino here that I'm sure is willing to answer some of them.
Jose Nino: Absolutely.
Ana Medina: Where can folks find you?
Jose Nino: Yeah. My Twitter is @junr03 and DMs are open. Also, you can find us on the Slack for Envoy. The Slack for Envoy is super friendly, and people are always willing to ask and answer questions. And within the Slack we have an Envoy-mobile channel. So that's also a great place and as always issues and pull requests are welcome and super appreciated.
Ana Medina: Yes, I'm definitely going to be checking out Envoy Mobile. Thank you, Jose Nino for having us. Thank you, Jason Yee, for also being a co-host for today. Very, very excited to see where this project goes. And thank you all very much for being here for Building Things on Purpose and we'll catch you at the next episode.
Jason Yee: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to Break Things on Purpose on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.