Podcast: Break Things on Purpose | Maxim Fateev and Samar Abbas, creators of Temporal
Join Jason for another round of “Break Things on Purpose.” This time Jason is joined by Maxim Fateev and Samar Abbas, co-founders of Temporal, to talk about the software and solutions they are developing for orchestrating micro services. Maxim and Samar talk about their joint work in the past on various projects to include the Cadence project, which has laid the foundation for what they are continuing to do at Temporal. Check out this “Build Things” conversation as Maxim and Samar go into the details on the nature of their work at Temporal!
Show Notes
In this episode, we cover:
- 00:00:00 - Introduction
- 00:04:25 - Cadence to Temporal
- 00:09:15 - Breaking down the Technology
- 00:15:35 - Three Tips for Using Temporal
- 00:19:21 - Outro
Links:
- Temporal: https://temporal.io
Transcript
Jason: And just so I’m sure to pronounce your names right, it’s Maxim Fateev?
Maxim: Yeah, that’s good enough. That’s my real name but it’s—[laugh].
Jason: [laugh]. Okay.
Maxim: It’s, in American, Maxim Fateev.
Jason: And Samar Abbas.
Samar: That sounds good.
Jason: Welcome to another episode of Build Things on Purpose, part of the Break Things on Purpose podcast. In our build episodes, we chat with the engineers and developers who create tools that help us build and operate modern applications. In this episode, Maxim Fateev and Samar Abbas join us to chat about the problems with orchestrating microservices and the software they’ve created to help solve those problems.
Jason: Hey, everyone, welcome to Break Things on Purpose, the podcast about reliability, chaos engineering, and all things SRE and DevOps. With me today, I’ve got Maxim Fateev and Samar Abbas, the cofounders of a company called Temporal. Maxim, why don’t you tell us a little bit more about yourself?
Maxim: Hi, thanks for inviting me to the podcast. I have been around for quite a while. In 2002, I joined Amazon and Amazon was pretty small company back then, I think 800 developers; compared to its current size it was really small. I quickly moved to the platform team, among other things. And it wasn’t AWS back then, it was just the software platform, actually was called [Splat 00:01:36].
I worked in a team which owned the old publish-subscribe technologies of Amazon, among other things. As a part of the team, I oversaw implementation of service and architecture, Amazon [unintelligible 00:01:47] roll out services at large scale, and they built services for [unintelligible 00:01:51] Amazon so all this asynchronous communication, there was something my team was involved in. And I know this time that this is not the best way to build large-scale service-oriented architectures, or relying on asynchronous messaging, just because it’s pretty hard to do without central orchestration. And as part of that, our team conceived and then later built a Simple Workflow Service. And I was tech leader for the public release of the AWS Simple Workflow Service.
Later, I also worked in Google and Microsoft. Later I joined Uber. Samar will tell his part of the story but we together built Cadence, which was, kind of, the open-source version of the same—based on the same ideas of the Simple Workflow. And now we driving Temporal open-source project and the company forward.
Jason: And Samar, tell us a little bit about yourself and how you met Maxim.
Samar: Thanks for inviting us. Super excited to be here. In 2010, I was basically wanted to make a switch from traditional software development like it used to happen back at Microsoft, to I want to try out the cloud side of things. So, I ended up joining Simple Workflow team at AWS; that’s where I met Maxim for the first time. Back then, Maxim already had built a lot of messaging systems, and then saw this pattern where messaging turned out—[unintelligible 00:03:08] believe that messaging was the wrong abstraction to build certain class of applications out there.
And that is what started Simple Workflow. And then being part of that journey, I was, like, super excited. Since then, in one shape or another, I’ve been continuing that journey, which we started back then in the Simple Workflow team while working with Maxim. So, later in 2012, after shipping Simple Workflow, I basically ended up coming back to Azure side of things. I wrote this open-source library by the name of Durable Task Framework, which looks like later Azure Functions team ended up adopting it to build what they are calling as Azure Durable Functions.
And then in 2015, Uber opened up office here in Seattle; I ended up joining their engineering team in the Seattle office, and out of coincidence, both me and Max ended up joining the company right about the same time. Among other things we worked on together, like, around 2017, we started the Cadence project together, which was you can think of a very similar idea as like Simple Workflow, but kind of applying it to the problem we were seeing back at Uber. And one thing led to another and then now we are here basically, continuing that journey in the form of Temporal.
Jason: So, you started with Cadence, which was an internal tool or internal framework, and decided to strike out on your own and build Temporal. Tell me about the transition of that. What caused you to, number one, strike out on your own, and number two, what’s special about Temporal?
Maxim: We built the Cadence from the beginning as an open-source project. And also it never was, like, Uber management came to us and says, “Let’s build this technology to run applications reliably,” or workflow technology or something like that. It was absolutely a bottoms-up creation of that. And we are super grateful to Uber that those type of projects were even possible. But we practically started on our own, we build it first version of that, and we got resources later.
And [unintelligible 00:05:09] just absolutely grows bottoms-up adoption within Uber. It grew from zero to, like, over a hundred use cases within three years that this project was hosted by our team at Uber. But also, it was an open-source project from the beginning, we didn’t get much traction first, kind of, year or two, but then after that, we started to see awesome companies like HashiCorp, Box, Coinbase, Checkr, adopt us. And there are a lot of others, it’s just that not all of them are talking about that publicly. And when we saw this external adoption, we started to realize that thing within Uber, we couldn’t really focus on external events, like, because we believe this technology is very widely applicable, we needed, kind of, have separate entity, like a company, to actually drive the technology forward for the whole world.
Like most obvious thing, you cannot have a hosted version [unintelligible 00:06:00] at Uber, right? We would never create a cloud offering, and everyone wants it. So, that is, kind of like, one thing led to another, Samar said, and we ended up leaving Uber and starting our own company. And that was the main reasoning is that we wanted to actually make this technology successful for everybody in the whole world, not just within Uber. Also the, kind of, non-technical but also technical reasons, one of the benefits of doing that was that we had actually accumulated quite pretty large technical debt when running, like, Cadence, just because we were in it for four years without single backwards-incompatible change because since [unintelligible 00:06:37] production, we still were on the same cluster with the same initial users, and we never had downtime, at least, like, without relat—infrequent outages.
So, we had to do everything in backwards-compatible manner. At Temporal, we could go and rethink that a little bit, and we spent almost a year just working on the next generation of technology and doing a lot of fixes and tons of features which we couldn’t do otherwise. Because that was our only chance to do backwards-incompatible change. After our first initial production release, we have this promise that they’re not going to break anyone, at least—unless we start thinking about the next major change in the project, which probably is not going to come in the next few years.
Samar: Yeah. One thing I would add is back then one of the value propositions that Uber was going after is provide rides as reliable as running water. And that translated into interesting system requirements for engineers. Most of the time, what ended up happening is product teams at Uber are spending a large amount of time building this resiliency and reliability into their applications, rather than going after building cool features or real features that the users of the platform cares about. And I think this is where—this is the problem that we were trying to solve with us back at Uber, where let us give you that reliability baked into the platform.
Why does every engineer needs to be a distributed systems engineer to deal with all sorts of failure conditions? We want application teams to be more focused on building amazing applications, which makes a lot of sense for the Uber platform in general. And this is the value proposition that we were going after with Cadence. And it basically hit a nerve with all the developers out there, especially within Uber. One of the very funny incidents early on, when we are getting that early adoption, one of the use cases there, the way they moved onto Cadence as an underlying platform is there was actually an outage, a multi-day outage in one of the core parts of that system, and the way they mitigated the outage is they rewrote that entire system on top of Cadence in a day, and able to port over that entire running system in production and build it on top of Cadence and run it in production. And that’s how they mitigated that outage. So, that was, in my opinion, that was a developer experience that we were trying to strive, with Cadence.
Jason: I think, let’s dive into that a little bit more because I think for some of our listeners, they may not understand what we’re talking about with this technology. I think people are familiar with simple messaging, things like Kafka, like, “I have a distributed system. It’s working asynchronously, so I do some work, I pass that into a queue of some sort, something pulls that out, does some more work, et cetera, and things act decoupled.” But I think what we’re talking about here with workflows, explain that for our listeners a little bit more. What does it provide because I’ve taken a look at the documentation and some of the demos and it provides a lot of really cool features for reliability. So, explain it a little bit more, first.
Maxim: [crosstalk 00:09:54] describe pretty well how systems are built now. A lot of people, kind of, call it choreography. But basic idea is that you have a bunch of callbacks which listen on queues, then update certain data sources’ databases, and then put messages into the queues back. And also they need to actually—in a real system also need to have durable timers, so you either build your own timer service, so you just poll your databases for messages to be in certain state to account for time. And skill in these things are non-trivial.
The worst part is that your system has a bunch of independent callbacks and you practically have very complex state machine and all these business requirements just practically broken in, like, a thousand, thousand little pieces which need to work together. This choreography in theory kind of works, but in practice is usually a mess. On top of that, you have very poor visibility into your system, and all this other requirements about retries and so on are actually pretty hard to get. And then if something goes wrong, good luck finding the problem. It goes into the orchestrat—it does orchestration.
Means that you implement your business logic in one place and then you just call into these downstream services to implement the business logic. The difference is that we know how to do that for short requests. Practically, let’s say you get a request, your service makes the five downstream API calls, does something with those calls, maybe makes a little bit more calls, than [unintelligible 00:11:14] data. If this transactions takes, let’s say, a second, is pretty easy to do, and you don’t care about reliability that much if it fails in the middle. But as soon as they come to you and say, “Okay, but any of those calls can fail for three minutes,” or, “This downstream call can take them ten hours,” you practically say, “Okay, I have this nice piece of code which was calling five services and doing a few things. Now, I need to break it into 50 callbacks, queues, and whatever in database, and so on.”
It’s Temporal [unintelligible 00:11:38] keep that code [unintelligible 00:11:39]. The main abstraction, which is non-obvious to people is that they practically make your process fully fault-tolerant, including stack variables, [unintelligible 00:11:49], and so on. So, if you make a call, and this call takes five hours, you’re still blocked in exactly the same line of code. And in five hours, this line of code returns and then continues. If you’re calling sleep for one month, you’re blocked on this sleep line of code for one month, and then it just returns to the next line of code.
Obviously, there is some magic there, in the sense that we need to be able to [unintelligible 00:12:11] and activate state of your workflow—and we call it workflows but this code—but in exactly the same state, but this is exactly what Temporal provides out of the box. You write code, this failure doesn’t exist because if your process fails, we just reconstruct the exactly the same state in a different process, and it’s not even visible to you. I sometimes call it a fault-oblivious programming because your program not even aware that fault happened because it just automatically self-healing. That is main idea. So, we’ll give you a fault-tolerant code is guaranteed to finish execution.
And on top of that, there are a lot of things which kind of came together there. We don’t invoke these services directly, usually, we invoke them from queues. But these queues are hidden in a sense because all you say execute, for instance, some activity, call some API, and then this API is in workflow asynchronously. But for your coding code, it’s not visible. It’s, kind of, just normal RPC call, but behind the scenes, it’s all asynchronous, has infinite retries, exponential retries, a [unintelligible 00:13:11] granular tasks, and so on.
So, there are a lot of features. But the main thing which we call workflow which ties all these together, is just your business logic because all it does is just practically makes this call. And also it has state. It’s stateful because you can keep state in variables and you don’t need to talk to database. So, it can have you—for example, one of the use cases we saw is customer loyalty program.
Imagine you want to implement the UI airline, you need to implement points, give points to people. So, you listen to external events every time your trip finished, your flight finished, and then you will get event or a need to increment that. So, in normal system, you need to get better-based cues, and so in our world, you would just increment local variables saying, “Yeah, it’s a counter.” And then when this counter reaches a hundred, for example, you will call some downstream service and say, “Okay, promote that person to the next tier.” And you could write this practically your type of application in 15, 20 minutes on your desktop because all of the rest is taken care of by Temporal.
It assumes that you can have millions of those objects and hundreds of millions of those objects running because you have a lot of customers—and we do that because we built it at Uber, which had hundreds of millions of customers—and run this reliably because you don’t want to lose data. It’s [unintelligible 00:14:29] your financial data. This is why Temporal is very good for financial transactions, and a lot of companies like Coinbase, for example, uses them for their financial transactions because we provide much better [unintelligible 00:14:39] that alternative solutions there.
Jason: That’s amazing. In my past as an engineer of working on systems and trying to do that, particularly when you mentioned things like retries, I’m thinking of timeouts and things where you have a long-running process and you’re constantly trying to tune, what if one service takes a long time, but then you realize that up the stack, some other dependency had ended up timing out before your timeout hit, and so it’s suddenly failed even though other processes are going, and it’s just this nightmare scenario. And you’re trying to coordinate amongst services to figure out what your timeout should be and what should happen when those things timeout and how retries should coordinate among teams. So, the idea of abstracting that away and not even having to deal with that is pretty amazing. So, I wanted to ask you, as this tool sounds so amazing, I’m sure listeners will want to give this a try if they’re not already using it.
If I wanted to try this out, if I wanted to implement Temporal within my application, what are, say, three things that I need to keep in mind to ensure that I have a good experience with it, that I’m actually maintaining reliability, or improving reliability, due to using this framework, give us some tips for [crosstalk 00:15:53] Temporal.
Maxim: One thing which is different about Temporal, it requires you rethinking how application is structured. It’s a kind of new category of software, so you cannot just take your current design and go and translate to that. For example, I’ve seen cases that people build system using queues. They have some downstream dependency which can be down. For example, I’ve seen the payment system, the guys from the payment system came to us and said, “What do we do? We have downstream dependency to bank and they say in the SLA that can be done for three days. Can we just start workflow on every message if the system is down, and keep retrying for three days?” Well, technically, answer is yes, you can absolutely create workflows, productivity retried options for retry for a month, and it is going to work out of the box, but my question to those people first time is what puts message in that Kafka queue which your are listen, you know?
And they are, “Oh, it’s a proxy service which actually does something and”—“But how does the service is initiated?” And they say, “Oh, it’s based on another to Kafka queue.” And the whole thing, I kind of ended up understanding that they had this huge pipeline, this enormous complexity, and they had hard time maintaining that because all these multiple services, multiple data sources, and so on. And then we ended up redesigning the whole pipeline as just one workflow, and instead of just doing this by little piece, and it helped them tremendously; it actually practically completely changed the way application was designed and simplified, they removed a lot of code, and all these, practically, technology, and reliability was provided by Temporal out of the box. So, my first thing is that don’t try to do piecemeal.
Again, you can. Yeah, actually, they initially even did that, just to try to prove the technology works, but at the end, think about end-to-end scenario, and if you use Temporal for end-to-end scenario, you will get [10x 00:17:37] benefits there. That probably would be the most important thing is just think about design.
Samar: So, I think Max, you gave a pretty awesome description of how you should be approaching building applications on top of Temporal. One of the things that I would add to that is think about how important durability is for your application. Do you have some state which needs to live beyond a single request response? If the answer to those questions is yes, then I think Temporal is an awesome technology, which helps you deal with that complexity, where a traditional way of building those applications, using databases, queues, retry mechanisms, durable timers, as Max mentioned, for retrying for three days because we have—instead of building a sub-system which deals with this retrying behavior, you can literally just, when you schedule an activity, you just put an activity options on it to, say, put your retry policy and then it will retry based on three days, based on your policy. So, I think—think holistically about your system, think about the statefulness and how important that state is, and I think Temporal is an amazing technology which really developer-friendly because the way currently industry—I think people have accepted that building these class of applications for cloud environment is inherently complex.
So, now a lot of innovation which happens is how to deal with that complexity as opposed to what Temporal is trying to do is, no, let’s simplify the entire experience of building such applications. I think that’s the key value that we are trying to provide.
Jason: I think that’s an excellent point, thinking of both of those tips of thinking about how your application is designed end-to-end, and also thinking about those pieces. I think one of the problems that we have when we build distributed applications, and particularly as we break things into smaller and smaller services, is that idea of for something that’s long-running, where do I store this information? And a lot of times we end up creating these interesting, sort of, side services to simply pass that information along, ensure that it’s there because we don’t know what’s going to happen with a long-running application. So, that’s really interesting. Well, I wanted to thank you for joining the podcast. This has been really fantastic if folks want to find more information about Temporal or getting involved in the open-source project or using the framework, where should they head?
Maxim: At temporal.io. And we have links to our Git repos, we have links to our community forum, and we also have, at the bottom that page, there is a link to the Slack channel. So, we have pretty vibrant community, so please join it. And we are always there to help.
Jason: Awesome, thanks again.
Maxim: Thank you.
Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.
Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30 day trial.
sTART YOUR TRIALWhat is Failure Flags? Build testable, reliable software—without touching infrastructure
Building provably reliable systems means building testable systems. Testing for failure conditions is the only way to...
Building provably reliable systems means building testable systems. Testing for failure conditions is the only way to...
Read moreIntroducing Custom Reliability Test Suites, Scoring and Dashboards
Last year, we released Reliability Management, a combination of pre-built reliability tests and scoring to give you a consistent way to define, test, and measure progress toward reliability standards across your organization.
Last year, we released Reliability Management, a combination of pre-built reliability tests and scoring to give you a consistent way to define, test, and measure progress toward reliability standards across your organization.
Read more