Mikolaj Pawlikowski: "Introducing Chaos in Bloomberg" - Chaos Conf 2018
I work for Bloomberg, but I work in London. I had the pleasure of the magic airplane: I spent 11 hours on the plane and landed three hours later. Be easy on me.
I'm going to talk about the Chaos Engineering experience we had in our team and the tool that we built, which hopefully will be useful to you too. On the menu: I'm going to start with the problem, what we encountered and how we dealt with it. Then the second part will be about the PowerfulSeal, and hopefully by the end of the twenty minutes or so that we have together, you'll have yet another tool in your toolbox to help you spread chaos.
Before we jump into that, just a quick shout out to my team. These are the wonderful people that I have the chance to work with. To give you a little bit of background on why I am here: we are building a platform called DTP. It's an internal microservices platform for other engineers in the company. I won't go into too much detail about that; if you are interested, we gave a talk at KubeCon in Seattle two years ago, and you can just find it on YouTube. The short version is that it's a quickly growing Kubernetes-based platform that has presented a lot of challenges to us.
Special note for Kelvin, who is interning with us this summer and who actually wrote some of the features that I am going to talk about in a minute.
We found ourselves with DTP two or three years ago, as we were adopting Kubernetes, with a thousand moving parts and a need to make sure that everything worked together. I don't know if you've reached that moment, but I definitely have, when you go and find some solace in Google, asking, "OK, so why are distributed systems so hard?" It's a little bit reassuring, because Google gives you, or it used to give you, some 49 million results confirming that they are indeed quite hard. If I update the same slide now, that number has pretty much doubled. If I extrapolate, by the time I retire it's going to be pretty much impossible to work with them ... but let's not do that.
If we are a little bit more serious about it, I really like this quote by Lamport that a friend introduced me to, and I think it's pretty much the essence of the problem: you know you're in a distributed system when some computer you didn't even know existed is capable of messing with your system and rendering your own computer unusable. I don't know if people have the same feeling about it, but I find it difficult to find a better quote to describe the problem.
We said it in previous talks, but the bottom line is that pretty much everything is plotting against you and trying to get you when you're not looking. Communications, hardware, and cascading errors are some of the main culprits.
Given that this is a chaos conference, I'm expecting that we all know this, or at least have an intuition about it. The problem arises when you try to talk about it to someone who is not that technical, your manager, for example.
I found that it is very useful to have an example ready with some very basic maths that can be quickly verified in your head, and the example that I usually go with is something like this: if you assume that your servers are super reliable, let's say ten years mean time between failures, you only need 3,650 of them to average one failure a day. This really gives people the mental image they need to understand that this is not a question of if, it's a question of when, because the scale doesn't have to be that big for it to happen. If you have a moment later, check out the slides. They're from 2009, but there is a very nice section there about the back-of-the-envelope calculations that everybody should probably be comfortable with.
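To spell out the arithmetic behind that figure (this is just me restating the back-of-the-envelope claim above): a ten-year mean time between failures is roughly 3,650 days, so a fleet of 3,650 such servers averages 3,650 servers × (1 failure / 3,650 days per server) ≈ 1 failure per day.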
Distributed systems are hard, obviously, and we have found some solace in Google confirming that. So what do we do? Surprise, surprise, we go and do some Chaos Engineering, and that's pretty much what we did on our team. The Principles of Chaos Engineering is a nice website that's also manager friendly, because it is short enough to actually be read, but we can do even better and summarize it in three bullet points, another useful tool: we increase confidence in our distributed systems by introducing failure on purpose, and then we try to detect bugs and unpredictable outcomes.
The key word here is really increasing confidence. We are not proving anything; we are just increasing that confidence. That sounds good, and most of the time it's pretty easy to convince technical people that it's a valuable thing to do, but if you ever run into trouble talking about it, here are four common objections that, I think, are useful to have a counterargument against.
The first one you are most likely to encounter is some kind of variant of, "Okay, if it ain't broke, don't fix it." It's a fallacy, and I think it can be pretty easily rebutted by a combination of two things: the thing we just illustrated, the when instead of if, and also the cost of finding bugs later versus finding them sooner. That's usually enough to win this argument.
The second one that usually arises is, "We've got enough trouble already, and on top of that we already have a backlog that we can't get through." I've found that it helps to say something along the lines of, "We are only going to limit this to, for example, the time when everybody is in the office, so we minimize the number of calls that people get during the night."
Then, the next one is usually some kind of argument about expertise, that we don't have it. Well, that's obviously no longer the case. There are tools easily available, and I'm going to be talking about one of them. The other speakers have been selling you others all morning.
The last one is usually production. Production is always tricky. It's the Holy Grail, but I think that if you start by selling at least a little bit of it, rather than running in production continuously from day one, and then ramp it up progressively, it is much more realistic to introduce.
That's the setup, so now I want to share with you what we came up with. I think everybody recognizes this logo by now. As was said in the first talk this morning, in the beginning there was Chaos Monkey. We wanted to use Chaos Monkey, but we needed something a little bit more customizable. We were testing Kubernetes, so we wanted something that actually spoke Kubernetes instead of just taking nodes up and down, and we came up with PowerfulSeal, which tries to fulfill exactly that. Don't ask me why it's called PowerfulSeal. It doesn't matter right now, but it is indeed quite powerful. It's also trying to be very simple, very simple to set up. Basically, the setup means giving it access to the things it needs to be able to speak to. It's going to need to speak to your cloud API; in our case that was OpenStack, and there is now an AWS driver contributed by the community too. It's going to need to talk to Kubernetes, so some kubeconfig, and then some SSH connectivity to the nodes, so that it can kill things via Docker on the hosts you want to expose to it.
It is pretty easy to get started with, and it comes with, hopefully, batteries included. It has four modes of operation, and I am going to talk about each of them and show you what they feel like and what they look like. In general, we wanted to cover use cases ranging from an interactive mode, where you go and explore, you have tab completion, and it feels much like any other console tool.
Then a label mode, which behaves much more autonomously: you just mark the things that you want to be killed by The Seal.
A demo mode, that is even more hands off, where you basically point it at some metrics and it kills things for you.
And finally, the autonomous mode, where the actual serious work is done: we already know what we are trying to break, and we just need a good, flexible way of describing it.
To give you a feel of what it's like to use, in the interactive mode we work on nodes and we work on pods. On the nodes side, you are going to be able to do things like list the nodes in this availability zone, in this range of IPs, with this string in the name, in this particular state; then you can filter them, take them up and down, and explore that first question: my application is running, so what happens when I take things up and down from this central console?
Then, the second layer is pods. Here we are already at the Kubernetes layer, in which we can list them again by deployment, by labels, by namespace, pretty much all the usual stuff. Then we take a sample of them and we crash them. It's set up so that we actually crash them through Docker, so that it shows up as a failure and the pod is restarted by Kubernetes for you. That's the interactive mode.
The second mode I mentioned was the label mode. It strives to be even simpler to use. In this mode, you fire up The Seal somewhere in your cluster, and then you go and mark the particular pods that you want it to interact with using a couple of labels. To enable The Seal with the defaults, you can just label something with seal/enabled: "true". Then you can tweak some other things, like the probability of a pod being killed, and you can configure the days and times of day, so that you can satisfy that earlier promise from when you were trying to convince your manager it's a good idea.
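As a rough sketch of what that looks like on a pod, here is roughly what I mean. The exact label names and value formats below are my assumptions based on the description above, so check the PowerfulSeal README for the current ones:

```yaml
# Illustrative sketch only: label names such as seal/enabled and
# seal/kill-probability are assumptions and may not match the current
# PowerfulSeal release exactly.
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  labels:
    app: my-app
    seal/enabled: "true"          # opt this pod in with The Seal's defaults
    seal/kill-probability: "0.5"  # tweak how likely a kill is on each pass
spec:
  containers:
    - name: my-app
      image: my-app:1.0.0
```

As mentioned above, there are also labels to restrict the days and times of day when The Seal is allowed to act on the pod.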
The third mode I mentioned is the demo mode, which is a new addition. In this case, you point it at Heapster metrics, and we have a little algorithm that tries to be a little bit clever and figure out what's actually worth crashing for you. It looks at things like CPU and RAM usage and makes the assumption that if something is busy, it's probably important, so let's kill that. You can configure it with an aggressiveness level so that it's less or more aggressive. The idea here is really to just fire it up and have something running quickly.
Then, the real work really comes in the autonomous mode. The idea in autonomous mode is that you already know how your application is defined, what the weak points are, and you want to write scenarios that basically prove that the weak points are handled by the application.
Scenarios live in policy files, in PowerfulSeal speak. A policy is basically a YAML file, following a certain schema, that contains two arrays of scenarios, one for nodes and one for pods, and then The Seal goes and executes them.
The idea behind the scenarios is that you match things, then you filter them, and then you act on the ones that remain. In this example, you can imagine that you have two matchers; whether they match pods or nodes doesn't really matter, it works the same way. One resolves to A, B and C, and the other to C and D, so we turn that into a set to deduplicate it. Then you have a filter, and you'll see a few examples in a second, but say the filter removes two of them and we are left with A and C. Then we have the actions, and we execute them on all the remaining items.
To give you an example of a real-life policy, or scenario rather, it looks a little bit like this. We have the matchers, which are a little bit different for pods than they are for nodes. For nodes, we can match on any of the properties that are available, for example the name of the machine, the IP, the availability zone, the group, that kind of thing. For pods, on the other hand, we can list them either through a namespace or a particular deployment, or, if we want to do something more funky, we can just use whatever labels are available and match on those.
After that, we have a set of pods or nodes, and we filter things out from there. Again, we can filter on properties, or on things like a random sample, whether that's a percentage or some fixed number, or on probability-related things. There are also filters for the day of the week and the time of day and so on, so that you can implement the work-day restriction again. Then, after that, we have the actions. Currently, for pods there is just one action: you can kill them. For nodes, you can take them up and down, you can execute arbitrary stuff if you really want to, or you can delete them.
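To make that structure concrete, here is a rough sketch of what such a policy file might look like. I'm writing the field names from memory of the version we open sourced, so treat it as illustrative and check the schema in the repo for the exact spelling:

```yaml
# Illustrative policy sketch; field names are assumptions and may not
# match the current PowerfulSeal schema exactly.
config:
  minSecondsBetweenRuns: 60
  maxSecondsBetweenRuns: 300

nodeScenarios:
  - name: "Take down one master node at a time"
    match:
      - property:
          name: "group"        # match on node properties, e.g. its group
          value: "master"
    filters:
      - randomSample:
          size: 1              # only one of the matched nodes per run
    actions:
      - stop:
          force: true
      - wait:
          seconds: 180
      - start:

podScenarios:
  - name: "Kill one pod of my deployment during working hours"
    match:
      - deployment:
          name: "my-app"
          namespace: "default"
    filters:
      - randomSample:
          size: 1
      - dayTime:               # implement the "work day" restriction
          onlyDays: ["monday", "tuesday", "wednesday", "thursday", "friday"]
          startTime: {hour: 10, minute: 0, second: 0}
          endTime: {hour: 17, minute: 0, second: 0}
    actions:
      - kill:
          probability: 0.5     # pods have a single action: kill
          force: true
```

The match, filters, actions structure maps directly onto the set-then-filter-then-act flow described above.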
That's it, really. It's pretty quick to get started. The usual kind of use case is that you discover some problem with your application, you write a policy that crashes it in the right way, and then you try to fix it. Then you leave that policy running continuously, so that you can show that your application is no longer prone to crashing because of that.
The autonomous mode also comes with a couple of nice features out of the box. It comes with Prometheus metrics; you probably already have a Prometheus plus Grafana setup scraping your metrics, so you can just plug it into that. Then you can tweak your alerts to exclude, for example, the crashes that The Seal is executing for you, or you can alert on The Seal itself not working properly. It also comes with another new addition, a web-based UI. If you are not into the command line, you can click and play, start and stop, change the config, and see all the logs from a nice-looking Bootstrap UI as an alternative.
That's really what I have prepared about The Seal. You can go grab it on GitHub. We open sourced it last year at KubeCon in Austin, and I would be really keen to get feedback from people. We've had some people tell us how they have used it. If you can, please do tell us if you have success with it or, maybe even more importantly, if you don't, and tell us why. And don't forget to star the repo.
The conclusion I have for you is that life is too short to worry about the next outage. Go make your own, and now you have one more tool to do it with. Go embrace The Seal, please.