Summary
Chaos engineering is the practice of injecting failures into your production systems in a controlled manner to identify weaknesses in your applications. In order to build, run, and report on chaos experiments, Sylvain Hellegouarch created the Chaos Toolkit. In this episode he explains his motivation for creating the toolkit, how to use it for improving the resiliency of your systems, and his plans for the future. He also discusses best practices for building, running, and learning from your own experiments.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. Podcast.__init__ listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media.
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
- Your host as usual is Tobias Macey and today I’m interviewing Sylvain Hellegouarch about Chaos Toolkit, a framework for building and automating chaos engineering experiments
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by explaining what Chaos Engineering is?
- What is the Chaos Toolkit and what motivated you to create it?
- How does it compare to the Gremlin platform?
- What is the workflow for using Chaos Toolkit to build and run an experiment?
- What are the best practices for building a useful experiment?
- Once you have an experiment created, how often should it be executed?
- When running an experiment, what are some strategies for identifying points of failure, particularly if they are unexpected?
- What kinds of reporting and statistics are captured during a test run?
- Can you describe how Chaos Toolkit is implemented and how it has evolved since you began working on it?
- What are some of the most challenging aspects of ensuring that the experiments run via the Chaos Toolkit are safe and have a reliable rollback available?
- What have been some of the most interesting/useful/unexpected lessons that you have learned in the process of building and maintaining the Chaos Toolkit project and community?
- What do you have planned for the future of the project?
Keep In Touch
Picks
- Tobias
- Sylvain
- Playing Guitar
- Step away from the computer
Links
- Chaos Toolkit
- Chaos IQ
- Gremlin chaos engineering service
- Russ Miles Chaos IQ co-founder
- Zope
- CherryPy minimalist Python web framework
- CherryPy Essentials book
- Chaos Engineering
- Chaos Engineering Book
- DevOps
- SRE (Site Reliability Engineering)
- Dark Debt
- Netflix Simian Army
- Chaos Monkey
- Terraform
- Kubecon
- Istio service mesh
- Chaos Platform
- PyInstaller
- Composition vs Inheritance
- Open Chaos Initiative
- CNCF
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it, so say hi to our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And if you're like me, then you need a simple and easy to use tool to keep track of all of your projects.
Some project management platforms are too flexible, leading to confusing workflows and days' worth of setup, and others are so minimal that they aren't worth the cost of admission. After using Clubhouse for a few days, I was impressed by the intuitive flow. Going from adding the various projects that I work on, to defining the high-level epics that I need to stay on top of, to creating the various tasks that need to happen only took a few minutes. I was also pleased by the presence of subtasks, seamless navigation, and the ability to create issue and bug templates to ensure that you never miss capturing essential details. Listeners of this show will get a full 2 months for free on any plan when you sign up at pythonpodcast.com/clubhouse.
So help support the show and help yourself get organized today. And don't forget to visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And don't forget to keep the conversation going at pythonpodcast.com/chat. Your host as usual is Tobias Macey, and today I'm interviewing Sylvain Hellegouarch about Chaos Toolkit, a framework for building and automating chaos engineering experiments. So, Sylvain, could you start by introducing yourself? Yeah. Thank you very much, Tobias, for having me. So my name is indeed Sylvain Hellegouarch. You said it very well, which is quite,
[00:02:07] Unknown:
it's not easy. Yeah. I've been a software developer for, I don't know, 20 years or something, and mostly a Python developer for just as long, pretty much. And I've been on that path of graduating from merely a software developer to almost a systems engineer, basically: seeing the system as a whole, not just as a piece of code which lives somewhere else on a machine. Right? So that's where my passion for system reliability and resiliency came from. That's why I started Chaos Toolkit with my,
[00:02:44] Unknown:
associate Russ Miles. And do you remember how you first got introduced to Python?
[00:02:49] Unknown:
Yeah. I started with Python in, when was that, early 2000 — probably 2000 or 2001 — through the framework called Zope. And, honestly, it was quite something to start there; it showed dedication to go beyond that. Zope was a full-stack web framework back then where you could actually use a UI to navigate your server, basically. And it was quite ahead of its time for the era, for what browsers could do and what servers could do. So I started there, and then I moved to another framework called CherryPy, and I stuck with it for years. I wrote a book on it, and I participated in the project until, what, two or three years ago. It was a fantastic path, really. We loved building the community and the project. I've never been a very vocal Python developer in the Python community, but I've loved the community and the environment ever since — for pretty much 18 years or something. And so
[00:03:55] Unknown:
in recent years, you've been working on the Chaos Toolkit, which is for building and running these chaos engineering experiments. So before we get into the particulars of your project, can you start by giving an overview of what chaos engineering is and why somebody might be interested in it? Of course.
[00:04:11] Unknown:
So, I'd like to preface this by saying it's probably my own view on chaos engineering. Because it's a fairly nascent field at large, you'll find many definitions, so I don't expect to say that this is a definitive one. I don't think there is one. But to me, chaos engineering is a vehicle. It's a vehicle to explore your system. And more importantly, it's to explore a system so that you discover your vulnerabilities, your weaknesses. All too often, what I saw in my software experience, or test experience, was that we, as software developers — it's not that we don't care about our system or users or where these things live, but we don't get the direct feedback. We don't know what's going on. So usually, what happened, at least back a few years ago, was you'd basically get a bug report from somewhere. Someone would say, you know, this is broken.
And most likely, you'd be really annoyed with it because it interrupted you in your flow. You weren't that interested in finding the issue or the bug — probably it was the users doing something stupid, or the operations people not understanding what you did. Not to say developers were condescending or anything, but more that they were remote from the system, from the live system. And I think for the past few years, we've seen DevOps, SRE, the whole game of saying we need to get closer. And by being closer, it means you need to take responsibility about what you're producing, right, and what you're putting in production.
Chaos engineering is, to me, the vehicle that is going to help you get there. It gives you the opportunity to explore the system. So by doing that, it says: your system is full of unknowns. You need to surface those unknowns so that they become knowns. Right? And you make good decisions about them. When I say decision, it doesn't mean you have to fix anything. That's why it's interesting. It doesn't actually expect a fix. It expects you to know, and maybe prioritize, or maybe say this isn't that important if that happens. So chaos engineering is really about that. It's about surfacing what you don't know — what we could call, sometimes, the dark debt — about your system. And it makes you a better software engineer all around, a better team, because you can make decisions based on data, which you surface, more than on your intuition, which may or may not be right.
So, yeah, chaos engineering is all about that. Now, obviously, people are most aware of Chaos Monkey from Netflix. And it's a tool. Right? It's just a tool for a given scenario. It does it well, and it's well recognized, but it's still a tool. So what we're trying to do here is to realize that chaos engineering goes beyond that tool, or any specific tool, basically. It's a thought process. It's a culture change. Those are big words, I know, but they're trying to make you a better software engineer, not make you good at using a tool. So that's where chaos engineering is coming from, in my opinion.
[00:07:21] Unknown:
And in order to help people explore this area, start to find these areas of dark debt, as you called it, in their systems, and expose the unintended consequences of how the different components play against each other, you ended up building the Chaos Toolkit. So can you talk a bit about
[00:07:40] Unknown:
what the project is and your motivation for creating it in the first place? Of course. Yeah. And what you said is exactly that. Basically, about 18 months ago, I think, or something like that, Russ Miles and I were looking at those various tools around chaos engineering. We were actually also interested in applying chaos engineering experiments as we had read about them in the chaos engineering principles. And we were looking at the tooling that was there, and we realized none of the tools actually took us by the hand on that workflow.
Everything was about degrading your system. Everything was about breaking something, which to us was a means, not an end. It didn't tell us how to improve. It didn't tell us what we could learn. It only told us: use that, it's going to provoke that error. So what? That didn't seem very powerful to us in the long term. So we realized there was a gap there, and that's when we started Chaos Toolkit. The Chaos Toolkit itself, on its own, doesn't create any chaos. What it does is give you a way to orchestrate those actions of breaking, or impacting, or inducing latency — any degradation, any change in your system, basically. It doesn't have to be about breaking your system.
So it tries to put that in a format that makes it sensible for you to come back afterwards and look at the results. It makes it simple for you to share the experiment and discuss it, and also make sense of its results. So the Chaos Toolkit really was about trying to formalize, if you will — or come up with a framework, like you said, a library — that applies the vocabulary of chaos engineering: your hypotheses, your experiments, your findings. All the words that you would use in chaos engineering when you're exploring your weaknesses, when you're discovering what's going on. It's the scientific approach, really. And we tried to come up with a tool and a format — a simple one, that was very important to us — to make it easy for people to get there and to understand that it goes well beyond impacting a node or that sort of thing. So the Chaos Toolkit can be seen as an orchestrator for chaos engineering more than anything else. And you mentioned in your discussion about what chaos engineering is, some existing toolkits
[00:10:14] Unknown:
were primarily from Netflix, in terms of Chaos Monkey and the larger suite of projects that they've built in the form of the Simian Army. And there's also another company called Gremlin that has come about recently that seeks to formalize some of these chaos engineering experiments and provide a way to make it easier to apply them. So I'm curious if you can give a quick comparison of what you're doing with Chaos Toolkit to what's available in the Simian Army and what Gremlin is doing?
[00:10:44] Unknown:
Of course. So the Simian Army comes from Netflix. Usually, a lot of the tools we see happening in the chaos engineering ecosystem come from companies who created their own tool for their own specific needs. So if you have the same sort of environment or the same need, usually you can apply them. Still, you're no better off, because you don't have the learning workflow that goes with it. But you can get there because you have the tooling. At least that's one aspect of it. So the Simian Army is fantastic if you're living in AWS and you have all those sorts of questions that Netflix probably asked — failover, redundancy, losing a region and that sort of thing, or losing nodes. It makes sense to go for the Simian Army, definitely. And they are fantastic tools, but, again, they don't cover all your problems, all your difficulties.
Gremlin has a fantastic product, because, as far as I understand, they initially came up with something that helps you define degraded experiences in your system — create latency, network issues, CPU issues, that sort of thing — which are actually good things to look for. Recently, they came up with ALFI, which is another side of the product, which is more application oriented. So, as I understand it, you can apply chaos at the application level, where you say: for this specific application, induce that sort of degradation. That's interesting, because what we see here is that chaos engineering is not just about the infrastructure. Right? It goes all the way up to the application and, honestly, even to the people, in the sense that you're trying to look at how people react and what they are missing to actually do their job properly.
So Gremlin has that sort of tooling, and the Chaos Toolkit is different in that it's going to drive, for example, Gremlin. Because Gremlin exposes an API, you could use the Chaos Toolkit to drive their API as a way to create some degradation in your system. But the Chaos Toolkit will give you the experiment vocabulary and results that you will be able to share with others and compare with. So I think they all work together seamlessly, in my opinion, but the toolkit tries to be really that orchestrator for all those tools.
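To make that orchestration idea concrete, here is a minimal sketch of what a Chaos Toolkit action driving an external fault-injection service over HTTP could look like. The endpoint, headers, and payload below are hypothetical placeholders, not Gremlin's actual API; the point is only the shape of an `http` provider, with the token pulled from the experiment's configuration or secrets via `${}` substitution.

```python
# A sketch of a declarative action (a fragment of an experiment file,
# shown here as a Python dict) that drives a fault-injection API over
# HTTP. The URL and payload are hypothetical, not Gremlin's real API.
action = {
    "type": "action",
    "name": "trigger-latency-attack",
    "provider": {
        "type": "http",
        "url": "https://faults.example.com/attacks",  # hypothetical endpoint
        "method": "POST",
        "headers": {"Authorization": "Bearer ${fault_api_token}"},
        "arguments": {"kind": "latency", "delay_ms": 300},
    },
}
```

Because any HTTP API can be driven this way, the same mechanism works for a cloud provider, a service mesh, or a commercial fault-injection product.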
[00:13:09] Unknown:
And so it sounds like with Chaos Toolkit, it's more the laboratory environment where you're recording the outcomes after you posit the hypotheses, and you're trying to get the overall feedback to see, from a system view, what are the learnings that we can get out of this, what were the impacts of these different tests — and actually being able to record the different events as they happen. As opposed to things like Gremlin or the Simian Army, where those are just the tools that are used as part of the experimentation, and that don't necessarily have any concept of recording the outcomes or reporting them or providing any useful analysis of what went on during those,
[00:13:50] Unknown:
either hard or soft failures? That's my understanding of those tools. I'm not necessarily an expert. And I know, for example, Netflix has ChAP, which is probably closer to what the Chaos Toolkit and the Chaos Platform — which I'll briefly introduce later on — are all about. But it's internal; they haven't released it, and it seems to be so Netflix oriented that I don't know if they'll ever release it, because I don't know if it's going to be applicable to someone else. I haven't used Gremlin, the product, for a long time, so I don't know exactly where they are now, and I wouldn't want to judge on the face of what exists now. But definitely the Chaos Toolkit tries to take you by the hand through the experiment. Like you said, the laboratory angle, the scientific approach of: okay, I've got a question.
What happens if this happens? Right? Okay. I'm going to set up an experiment, try to find the indicators that will help me actually figure out what it means when that happens, and I'll make a report of that — the Chaos Toolkit is going to help me create that report. Once that's done, two things can happen. I can shelve the report, and that's it. Or I can take that report and we can discuss it, analyze it — but I can also reuse that experiment, because it's fully automated at that stage, and say, well, I'm going to run that automatically, schedule it to run every day. Right? So you move from exploratory to validation.
At that stage, you know you found a weakness. You're just trying to make sure that you don't have a regression later on. And that gets closer to what the Simian Army probably tries to do sometimes by running continuously. It says: well, we knew we had that weakness, we fixed it, and we don't want to reintroduce it later on. So by running continuously, you get there. And can you talk through the overall approach and workflow for getting started with the Chaos Toolkit to identify a potential hypothesis that you want to validate, and then building and running the experiment to get the results that will either support or refute that hypothesis? Completely. So the first thing, in my opinion, to get is at least good indicators. It's funny, because they are probably the most difficult part of the whole process: to say, well, what does matter to us? Ask any engineer, or even management, or any team, basically:
you've got a superb piece of software and system running, but what matters to you? What can't you live without? Right? If I take Amazon, for example — the e-commerce site — and I say, well, what matters the most to Amazon in terms of functionality? Is it that people can pay? Is it that people can search for products or can put them in their baskets? By really making sure that you think about those things, and saying, okay, we need at least people to be able to search — even if they can't buy, that's fine, because they can still use the service; that's the minimal thing we need the system to be able to do — then you start realizing that your system has priorities. You care more for some things. Right? So those indicators are going to be translated into things you can actually look for. So the way it would work is, assuming you know those things, you probably have some form of metrics somewhere that gives you that information.
If you don't, you've already learned that you're probably missing some quite important information to even make sense of your business. Right? But assuming you do have metrics, the toolkit would then allow you to say: well, I'm going to talk to my monitoring system, however it's packaged, and say, give me that value. If that value is under a certain threshold, or is x or y, then my system is healthy — in the sense that it means, from a business perspective, we're happy with the system. It may have some broken state here and there, but at least we know people can do something with it.
So that's your business metric, and you usually have a few of them. Then you have more technical ones. For example, you may want an experiment where you say, well, I want to make sure I don't have an alert of x or y. So you start with the Chaos Toolkit, and what you're going to do is create a steady-state hypothesis. It's a block where you define — it's all declarative; the Chaos Toolkit is, for now, all declarative. Think Terraform, basically. You define that block and you say: I'm going to query that value. So I'm going to make an HTTP call, however it's packaged, I'm going to ask for that value, and I'm going to give a tolerance. That tolerance could be binary — true/false — or a number, whatever, but it can be a bit more complex. Right?
What the Chaos Toolkit says is: when we run that experiment, we're going to run those probes first. If any of the tolerances is not met, then the system is not even in a state where we can learn from it. It's already in a state that is not good from our perspective. Right? So we bail; the toolkit stops there. If everything is fine, then we move on to the method. The method is basically the body of your experiment. It inflicts that thing — maybe create latency, maybe remove or change a configuration parameter somewhere. Whatever your question is, that's your hypothesis.
If I do this, if this happens, then tell me how the system behaves after that. At that stage, it's the same: it could be an API. So, for example, if you're using AWS, you can call EC2 or any of the APIs of AWS to change your system. On Kubernetes, you can stop a pod, you can remove a service — all those things that already exist in the API can be driven by the toolkit. You don't need any fancy tooling, really. And, actually, most of the extensions that our community comes up with are basically driving those APIs in a way that makes sense to them. So we run that thing, and then we come back to the steady state that we had initially, and we run it again. If the system has changed, then likely that check is going to fail, in the sense that we have deviated from what we had initially.
Now, the difference between that and testing is: when we have a deviation, we celebrate it. We say: fantastic, we've learned something about our system. Under those conditions, we know the system is impacted. Right? We don't necessarily know how much it is impacted, but we know it is impacted. And that's basically it. So your experiments usually are fairly short, because they just drive existing APIs and endpoints if you've got them. And they package the results into a JSON file. You can have a rollback, and I think we may come back to that later on. For example, in the demo that we did at KubeCon in December, we called the Istio endpoint — we used Istio as a fault-injection API. So we say: create some delay, some latency, for that user. In the rollback, what you're going to say is: revert that thing you did — remove the delay, the latency, for that user. So the undoing, the rollback, is more of an undo, because if your system has deviated, first of all, you want to analyze what's happening, and we can't promise a rollback in the sense of putting the system back in place, because it has changed and we don't know how. It's not testing, from that perspective. And that's it. Basically, you define all those things in a declarative form, you give it to the Chaos Toolkit CLI, and it runs it. That's the workflow, basically. Most of the time, you end up writing those extensions in Python or driving a process directly. We're also talking about being able to drive other runtimes, like Golang or whatever, and see how that works. But, basically, that's the idea. You define your experiment as a file you put on GitHub. People can read it, share it, make sense of it, and reuse it as long as they have the right extensions, because they can just run it. And that's really the workflow that you keep running. After that, once you've done it once, you run it continuously, basically. So to run it continuously, we move on to the Chaos Platform. The Chaos Toolkit is more of a CLI.
So it's in isolation — you're exploring, like you said, in the laboratory; you move into the lab, so you're doing that on your own. And then you've got the Chaos Platform, which is more of a set of APIs where you can give visibility. Right? Think about it almost like this: if the Chaos Toolkit is Git, well, the Chaos Platform is GitLab or GitHub, in the sense that it puts in front of everyone the experiments, the scheduling, the policies for that scheduling — saying you can't run that experiment at that time because there is another one running already, that sort of thing — and all the reports. So you can basically see what happened, compare those, and that sort of thing. And, as you were saying, once you've got the experiment built, it becomes essentially cheap or free to run it on a regular basis.
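To ground the workflow just described, here is a minimal sketch of a complete experiment, written as a small Python script that emits the declarative JSON file the `chaos` CLI consumes. The title, URL, tolerance, and Kubernetes label selector are illustrative assumptions, and the pod-termination action assumes the chaostoolkit-kubernetes extension is installed.

```python
import json

# A minimal sketch of a Chaos Toolkit experiment, as described above:
# probe the steady state, degrade the system, then re-probe. All the
# concrete values here are illustrative assumptions.
experiment = {
    "version": "1.0.0",
    "title": "Search stays available while a catalog pod is lost",
    "description": "If we terminate one catalog pod, users can still search.",
    # Steady-state hypothesis: probes run before the method (to check the
    # system is in a learnable state) and again after it (a failed
    # tolerance then is a deviation -- something learned, not a test failure).
    "steady-state-hypothesis": {
        "title": "Search endpoint responds",
        "probes": [
            {
                "type": "probe",
                "name": "search-returns-200",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "http://search.example.com/health",
                    "timeout": 3,
                },
            }
        ],
    },
    # Method: the degradation itself, here driving the Kubernetes API
    # through the chaostoolkit-kubernetes extension.
    "method": [
        {
            "type": "action",
            "name": "terminate-one-catalog-pod",
            "provider": {
                "type": "python",
                "module": "chaosk8s.pod.actions",
                "func": "terminate_pods",
                "arguments": {"label_selector": "app=catalog", "qty": 1},
            },
        }
    ],
    # Rollbacks: undo only what the method introduced. Empty here, since
    # Kubernetes reschedules the pod itself; an injected Istio latency
    # rule would be removed here instead.
    "rollbacks": [],
}

with open("experiment.json", "w") as f:
    json.dump(experiment, f, indent=2)
```

You would then execute it with `chaos run experiment.json`, which walks through the probes, the method, and the rollbacks, and records everything it saw into a journal file.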
[00:23:18] Unknown:
And a lot of the focus of chaos engineering is on executing these experiments in your production environment, which can be quite fear inducing for a lot of operators in particular, but it's definitely useful for being able to ensure that you're getting the most value out of your learnings, because it's hard to simulate production-scale traffic in a different environment. But I'm curious what your advice and experience are in terms of whether to run in a production or preproduction environment, particularly when you're executing these on a regular basis, possibly in an automated fashion.
[00:23:53] Unknown:
I think chaos engineering, like I was saying earlier, involves a culture shift. Right? It's a bit like DevOps or SRE. Every time there is a new practice, it doesn't just bring new tools. Right? It changes also the way we think about our system, the way we behave in front of our system. So in effect, what that tells me here is: you need to start — and I say need, not lightly — you probably need to start in staging or places like that. Not necessarily because you're going to learn more or less, but more because you're getting used to it. That fear you're mentioning is something chaos engineering is trying to help you fight. Because more and more, you'll be able to look at your system as not something made of crystal glass. Right? It's not something to be particularly worried about.
It needs to be something you're able to confront, to actually say: you don't dominate me, you don't give me orders, basically, system. Right? I am your boss, basically. And people need to start learning how not to be afraid of their system. So probably the workflow would be: it's best to start in staging. But then what you do is move to prod and look at the system — and, again, look at maybe a side of the system where, if it goes down because of a mistake we made while doing chaos engineering, first, we know how to put it back quickly. That's always important, to be able to do a recovery.
Then, when we do the first few iterations of chaos engineering, we do that all together, so that the people who know the system can be there to fix it if need be. And probably on something that is, yeah, not really impactful to the business. But the more you do that, the more you're going to feel comfortable moving to something meatier. So it takes a little while, like any practice. I wouldn't expect people to get as fluent with chaos engineering as Netflix or LinkedIn probably are for a couple of years. It goes beyond tooling. Right? It's a learning curve.
So it's a long game, definitely, but it's worth it, because at the end of the day, you should be a better software engineer, and you should be less afraid of your system. So, yeah: start in staging, start small, then move to prod, be all around the system. Start with something that — I wouldn't say doesn't matter, but I wouldn't go for the biggest thing in your system; gradually start from there, basically. Yeah. It's a similar approach to
[00:26:29] Unknown:
the DevOps idea of releasing more often, with the idea that if it hurts, then you should do it more frequently until it starts to hurt less. So it goes with chaos engineering. If you're afraid of it, then that means that you don't fully understand your system, and the only way to really gain that knowledge and familiarity is to break it and see what happens, so that you can fix it and then get a better grasp of how everything fits together and how you can either improve things or create mitigations or recovery strategies.
[00:27:01] Unknown:
Completely. Completely. That's exactly the point. You said it well. That's the only way to learn, basically: by failing often but small, and then it hurts less. That's it.
[00:27:17] Unknown:
And so, as far as when you're running the experiment, what are some of the strategies that you have found to be most useful for being able to identify the points of failure, either ahead of time or during execution — particularly when you have some of these complex systems where there might be unexpected cascade effects — and identifying how they're playing out and being able to
[00:27:41] Unknown:
trace the sequence of events. So what we have realized — alongside the fact that a tool like the Chaos Toolkit was missing, in our opinion — is also, quite surprisingly maybe, that there's no easy answer to that question, because, obviously, every system is different. Even if you're using the same platform — let's say you're using Kubernetes — you're still running different applications on top of it. You have different constraints. You have different needs. Pretty much everything is different except for the API. Right? So at least you still have that API, and that's probably the biggest thing that Kubernetes has given us: the API. It's a bit like Java back in the day, or Spring, I guess. You knew you could get a Spring engineer, and you were aware that they knew the API. Right? So there was a baseline of knowledge.
But when it comes to failure and those sorts of runtime aspects, well, there doesn't really exist a catalog of that. So that's what we're trying to build with Chaos IQ, the company I work for, where we're trying to leverage the knowledge that people have — because there is such knowledge, but it's not easy to share. And we think chaos engineering, through the experiment vocabulary, is fantastic for doing that. So the idea here is that you could benefit from others' knowledge, from other places, to gain that knowledge. And we call that the Open Chaos Initiative, where you share the knowledge you gained for the benefit of everyone else.
I was presenting that at KubeCon, and one thing I realized is that most of the time when you go to tech conferences, obviously you're interested in the speakers and talks, but what you're really looking for is: how do you do this? Talking to your peers, learning from them, in corridors or on stage, when you get that thing — someone saying: this is what we did, this is what we learned. Usually those are packed rooms, because you're trying to figure that out yourself and apply it to your own environment. Well, we're trying to do that with the Open Chaos Initiative and Chaos IQ, through those tools, Chaos Toolkit and Chaos Platform, by saying: what if we could encode those questions of failure points — saying, you should pay attention to this, someone else in the industry has said this is important.
Right? Maybe you could just grab the experiment and run it. So that shifts the problem from "you have to have all the answers" to "where can I find someone that can help me? Where can I find that knowledge?" So the strategy is simply to go and talk to your peers. There won't be any tool, any machine learning, any of this, that will ever replace the good old human relationship where you say: this is what I see, and someone else gives you hints and that sort of thing. So I think it's not easy to respond to that, but the strategy is going to be, again, at a high level: look at your indicators. What matters to my system? Right? What matters to my business?
And then: what am I worried about? That gives you almost like a perimeter of things you should worry about. Everything that falls outside of that, you don't worry about. And it's within those boundaries — the indicators of things that matter to my business, and the things I'm worried could happen — that you start exploring. To give you a random example: let's say your users don't really care whether your system took one or five seconds. Maybe latency of a few hundred milliseconds is not your problem, so that's not where you should be looking. And that's why it's so important that you figure out what matters to you as a system, as a business, because that's your starting point. Okay, from there, if we look back: what are the indicators, technical and otherwise, that we should be looking at? And those indicators are going to say: okay, now we know what to look for, what to measure, what are the failure points we have in front of that. Other failure points don't seem to matter to your business, so you leave them out. Right? So that's probably good: don't try to go and test for everything. It doesn't really matter. That's why starting with a tool is never really efficient, in my opinion, because if you're just applying Chaos Monkey or any tool like that — yeah, you broke something, but in what way does it matter to you? How do you measure that it actually improved your knowledge of your system? That's why you really need to figure out what matters to you, and walk back, basically, to find the failure points that would impact those indicators.
[00:32:36] Unknown:
And when you're executing a test run with the Chaos Toolkit, what kinds of information or statistics or feedback are being captured, and what kinds of reporting are available as an output of an experiment run? So, the toolkit, again —
[00:32:56] Unknown:
I've implemented it, I've designed it, but I'm still going to say it: the toolkit is fairly dumb. In the sense that it does what you tell it to do — it's a bit like Terraform. If you give it an empty file, then it won't do anything. Right? So it comes back to the realization Russ and I had that we don't have the answers. You have more answers and more expertise in your system than we do. And because those indicators I was mentioning are yours, it seems only fair that, for now, you make sense of what data you need to collect. So, like I explained earlier with the body of an experiment, usually not only do you have an action that does something to your system — often you also put there what we call probes, or data-collection actions. Those things are going to simply ask whatever monitoring system, whatever data you collect yourself, and try to aggregate that.
And that will become part of the report that is generated at the end, so that when you start to do the analysis, you can look at it and say: okay, we did that, and we saw that. One simple example: let's assume you've got two services talking over HTTP, and you ask: what happens when service A, consuming service B, gets an HTTP error back from service B? You can certainly simulate that HTTP error. That's fine. And you'll likely see an error on service A. What you're going to collect as data is probably the logs, maybe some monitoring, some alerts. So you're going to talk to your system; you're going to actually query your logs and see what error actually showed up.
Because if it happens in real life, that's the error that my operations people are going to see. Can they act on that? Can they make sense of a traceback that happened in production? Often, developers don't realize that saying "oh, I logged the traceback" doesn't really help operations, because they still don't know what to do with it. That's for you to analyze after the fact. Right? So it's interesting to go that way, to look at your system and say: well, okay, I'm going to cherry-pick that data, that data, that data. Obviously, if you choose that beforehand, you have a good idea of the data you need for a solid report. So sometimes what happens is you play the experiment once and realize: okay, I didn't collect that data. And you play it again until you've got enough information collected in there to actually build a solid report.
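As a sketch of that kind of data collection, a probe can sit in the method right alongside the action, shelling out to a real CLI through the toolkit's `process` provider; the label selector and time window here are assumptions for the example.

```python
# A sketch of a data-collection probe placed in the experiment's method:
# after the error is injected, capture service A's recent logs so the
# traceback operators would actually see ends up in the run's journal.
# The label selector and --since window are illustrative assumptions.
collect_logs = {
    "type": "probe",
    "name": "collect-service-a-logs",
    "provider": {
        "type": "process",
        "path": "kubectl",
        "arguments": "logs -l app=service-a --since=1m",
    },
}
```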
Now, as for the report, the Chaos Toolkit is not clever enough yet to actually analyze your report for you by default. So the report that is generated is mostly, almost, a UI view of your run — it can generate PDF, HTML, that sort of thing. One aspect I didn't mention that is important for reports: when you run experiments — let's say you've got three, four, five, or ten experiments that cover a certain surface of your system, so they are related in that each of them tries to showcase something specific from a different angle, but they're all related — well, what you can do is add a small piece of metadata, which is called a contribution. That's free-form text with a weight. Basically, you're saying: that experiment cares for availability at a high level, but doesn't care for security; it cares for reliability, and that sort of thing. And what the reporting will do, once it aggregates all of those results, is build a form of punch card, if you will. You can look at that punch card and say: okay, I realize I do a lot for availability, but I don't do anything for security. That's interesting, because sometimes that helps you figure out where you put a lot of effort in terms of chaos engineering. And what that says is: that's fair, we don't care for security, but that's a known. Right? It's not something hidden. It's a conscious decision. So the more chaos engineering experiments you do, the more useful the report we can generate is going to be. Right? But, again, it can't yet make a prediction about what you should do about that error.
That's not there yet.
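As a rough illustration of those contributions, the metadata lives at the top level of the experiment file; the property names and weights are free-form, so the particular values below are only an assumption for the example.

```python
# A sketch of the "contributions" block described above, extending the
# experiment dict from the earlier sketch. Properties and weights are
# free-form; these values are illustrative assumptions.
experiment["contributions"] = {
    "availability": "high",   # this experiment mostly exercises availability
    "reliability": "medium",
    "security": "none",       # declared out of scope on purpose: a known, not a gap
}
```

Aggregated over many runs, weights like these are what the reporting can fold into the punch-card view he describes.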
[00:37:11] Unknown:
And can you take some time now to discuss how the Chaos Toolkit itself is actually implemented, in terms of the software design and architecture,
[00:37:19] Unknown:
and how the implementation has evolved since you began working on it? Yes, of course. So it's written in Python, as you'd probably expect. It's in Python 3 — 3.5, actually, and we're moving to 3.6 as a minimum. If I were starting it today, it would probably be 3.7 or 3.8 anyway. I try to keep it to at least 3.5; it's a conscious choice not to piss people off. But I've been using Python for a long time, and while Python 2 was really nice, Python 3 — especially since Python 3.5 — has, in my opinion, moved to a different level; it has become, I would say, a modern language again. I don't know if that's something people would contest or not, but that's my take anyway. And the decision I made when we started designing was that we wanted the system to be extensible.
So the Chaos Toolkit has a core, which is fairly small, and which basically takes the JSON — or the other declarative form, the YAML — and turns it into something runnable. Now, extensions are usually either HTTP — in that case, the toolkit makes the call for you through requests; you just put in your URL, and it calls it and gives you the feedback — or a process — you can call a process, which is useful if you have a tool that induces some load; you can just call it — or, more often than not, you extend through a Python library. And I wanted each of those to be a separate library. It creates a lot of small libraries, of course, but it helps the core move at its own pace and the libraries move at their own pace. And it has worked out very well, I must say, because the community feels happy to move fast on their extensions, and they make sure they don't impact the core. I think we would see fewer contributions if we had said everything is packaged at once, because people would be afraid of breaking the core.
Right now, the core is solid. It's reliable. To be fair, the initial design has hardly moved since the beginning. And I think it's due to one simple thing: we didn't go for a state-based design. There are no classes; it's only functions. So we designed the toolkit in a functional approach, if you will. I wouldn't say that Python is a functional language by any stretch, but it doesn't force you into a stateful design with classes. And I think it has really worked out, because functions are so easy to reason about. You've got a function; it's just inputs and outputs, and that's it.
And, basically, when we pass things to the extension functions, we always pass a copy — because, you know, mutability. We said: if you change that, it won't be seen from the outside. So we pass a copy of the data that they would expect. And that has worked very well. That has helped us move fast. The documentation is what it is, but contributors never had to go too deep into it: people could just grab the code of an extension, copy and paste, and go from there. So it's been working very well.
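As a sketch of that functional style, an extension is just a module of plain functions; the module name, endpoint, and parameters below are hypothetical, but the shape — inputs in, a value out, an exception to signal failure — is the pattern described.

```python
# chaosexample/probes.py -- a hypothetical extension module, sketched to
# show the functional style described above: a plain function, no classes,
# no shared state. The toolkit passes in the arguments declared in the
# experiment file and records the return value in the run's journal.
import requests


def search_latency_ms(base_url: str, timeout: int = 3) -> float:
    """Probe: return the search endpoint's response time in milliseconds."""
    response = requests.get(
        f"{base_url}/search", params={"q": "ping"}, timeout=timeout
    )
    # Raising an exception here is how a probe or action reports failure.
    response.raise_for_status()
    return response.elapsed.total_seconds() * 1000
```

An experiment would then point at it with a `python` provider — `{"type": "python", "module": "chaosexample.probes", "func": "search_latency_ms", "arguments": {"base_url": "http://search.example.com"}}` — and put a tolerance on the returned value.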
There are two things I may have done differently. The first is not blocking per se, but it's a question I've always had with Python: I'm using a lot of exceptions — not as control flow necessarily, but they do sometimes have a control-flow aspect. I never quite know if that's a good approach. I've been doing that for 15 years, and I'm still not sure about exceptions versus returning an error the way Golang does. It's a pain in the ass as a consumer of that error, obviously, if you just have a code or a number, but at the same time, it makes the function even simpler to reason about. So that one is a question I still have, and probably will have for as long as I write Python. The other one I see as a problem — it was really the only issue I had with choosing Python initially. If I had chosen Golang, I could easily create a binary that people could download. With Python, it's not easy at all. The whole generation of an artifact with Python is a better story than it used to be, but it's still not perfect. It's not as lean and nice as what Golang gives you. Obviously, they have different philosophies and different ways of working, but still, it's a bit of a problem to me. I've managed, because people using Python programs are used to doing pip install or whatever you use. But I'm not completely happy about that, because I would love to be able to say: package the Chaos Toolkit wherever and run it. As a consumer, you just need one thing: the Chaos Toolkit, its extensions, and the experiment. Right now, it's a bit more involved if you want to do that. So what I did was use PyInstaller to create a bundle. That was fine — it was a learning curve as well, but it does work. It never feels quite right either, for some reason; I don't know why. So Python was a good choice. I would love for Python to be more and more used in the DevOps world again, versus Golang. Not that Golang is bad or anything, but Python is so nice when you start dealing with strings and dynamic payloads like that. It's just a breeze, really; it's fantastic. And now with type hints, it's really nice to code in Python, to be honest. So I don't have any issues with that. So the design hasn't changed that much. Actually, it hasn't changed at all.
I've been really happy, and I think we got it right, luckily. Yeah. There's been a lot of conversation
[00:42:50] Unknown:
over the years — and it seems to be coming up again recently — about the whole idea of inheritance versus composition, and whether or not you need classes, or if classes are useful in different contexts. So it's interesting to see that you have gone with a function-oriented approach, using those functions for composing the system together and making it easier to add in these different plugins, because you don't have to worry about accrued state as the objects are being passed around. Exactly. Exactly. The point was that I didn't want people to think about those questions. For them, it was basically: write a function. That function becomes an action,
[00:43:28] Unknown:
and it is an extension to the Chaos Toolkit now, period. And that has been very easy: a lot of people who've been using the Chaos Toolkit have said, I haven't done a lot of Python for a long time, but they've caught up really quickly. I think a function is something that pretty much any developer has run into in one way or another. Right? I didn't want to impose Python idioms on them too much. That's why I went away from classes, and I've been happy with it. The code is linear, in my opinion, and more robust.
[00:43:59] Unknown:
And in terms of building the toolkit in a way that lets you create safe and reversible experiments, what have been some of the most challenging aspects of that? Particularly as far as providing a rollback path for people who are implementing their own plugins or custom logic, and also for building experiments in a way that allows people to have the confidence that they can run an experiment without irreversibly changing their system.
[00:44:29] Unknown:
It's always interesting, because we got that question recently as well, and it's a very sound question. That's probably where we are more on the exploratory side than on the testing side — because it's exploratory, you're exploring. By default, you don't know what you're going to break. So it's really, really hard to say in advance: I know I'm going to put it back. So while the Chaos Toolkit talks about rollbacks, I think, more and more, that was the wrong choice of word. We talk more about remediation, cleanup, or undoing, in the sense that what we promise in terms of API, of interface — or what you should think about when you do chaos engineering anyway — is: I'm going to impose a degradation on the system, and what I can do is reverse that degradation.
It doesn't mean I'm putting the system back on track, because you're trying to figure out what it meant that you degraded the system. So I certainly see your point about fail-safes and, again, coming back to not being too afraid about what you're going to do. But because you want to learn, you need to take the risk. It's a leap of faith, basically — you learn that way. You learn by doing and, yeah, hurting yourself initially, exactly. So that's why, as we said, we need to do that within a small perimeter initially. But the more you're doing it, the less afraid of it you are. And are all the mechanisms already in place? Say you run something and you try to put the system back on track. Shouldn't that be automatic?
Why is putting it back on track part of your experiment? I should think that if something is down, the system should be almost self-healing, to a point, anyway. Right? So it's dangerous to say: I'm going to put it back for you as part of the experiment, because it prevents you, in a way, from learning that you don't have those mechanisms in place. If a user falls into a trap and starts breaking something, just because there was a bug or anything, do you think they're going to try to put it back? No, they're not. Right? So what you need is to learn: could I learn about that issue beforehand? Could I get alerted?
Could I actually self-heal, resolve it? Maybe put that side of the website in maintenance automatically, just so that further damage isn't done. So the whole discussion about rollbacks should be thought of in these terms: let's say I've created latency on that service; my rollback is to remove that latency, and that's pretty much the perimeter of the rollback, at least as a promise. I don't mean to say you can't try to put a few safeguards in place, but I think that's out of band for the Chaos Toolkit and chaos engineering. It may come. Maybe in the future we'll get better at this, and the industry will say: yes, we can go as far as that. For example, one of our goals with the Open Chaos Initiative is to go to places like Google Cloud, Amazon, and all those folks and say: right, you should be able to provide experiments proving that your system can cope with x or y. But, also, you should probably expose a rollback so that, if this happens, you're basically telling me how you're going to cope with it. And that puts things in perspective, because you're now relying on them to put things in place — not your application, but at least their side of the infrastructure.
So you see, my point here is: the experiment surfaces what you don't know and what your assumptions are, but you should also challenge the fact that you have to do everything yourself. That's why it's so interesting to have more microservices, in my opinion. A lot of people say: microservices, blah blah blah, they are bad. But one thing they did well, to me, was to say: as a developer, you can't decide everything anymore. Because before, with a monolith, you made all the decisions inside, and as an app, I couldn't see what you were doing. But now, because you've split that into smaller pieces, I can question the choices you've made. You have less surface on which to make a decision. That can be good or not. But the first thing we do as developers, truly, is say: I'm going to fix that. Yeah, but your fix may not be the right fix.
Maybe it's not that fix at all that we need. Maybe it needs to happen beforehand. Right? But chaos engineering is going to help you with that as well — to say: as a system, we need to prove the system copes with x or y. We need to do that because we've made a promise to our users. But that's pretty much how far a chaos engineering experiment is going to go. Everything else is on your shoulders, or on your service provider's shoulders — probably on both, really. So it's interesting, because now you start seeing the perimeter you have — the functional perimeter, the application perimeter — much more clearly, thanks to chaos engineering. You don't assume you have to care for all of it, basically.
[00:49:35] Unknown:
And in terms of your own experience and lessons learned in the process of building and maintaining the Chaos Toolkit project and the organization that you're building on top of it, what have been some of the most interesting or useful or unexpected lessons?
[00:49:52] Unknown:
So there are a few things. Knowing that your users are actually using your tool — in the open source world, it's not easy. It's not easy because people grab your project and run it, but they don't always come back and say whether it's bad or good. Right? So you don't really know much about your tool or what people see in it. And that's sometimes frustrating, because you'd be happy with criticism, but you don't even see it; and you'd be happy with surprises, and although you see some of them, you don't see all of them either. Right? And that means you need to be patient and to have passion about what you're doing, because you don't know what people do.
So that's always, yeah, sort of frustrating. Sometimes you wish you could know a bit more. So you don't really have visibility on what people do unless they tell you explicitly. One thing that has always been on my mind is that we need a nice and kind and civil community. I don't like, and I don't appreciate, when people come and barge in and criticize. There is a tone, basically, to asking things. And I say that because I don't think there are stupid questions. I never think there are stupid questions. Usually, when a question is asked many times, it tends to say the documentation is bad or the project has a problem — there is a bug here. But sometimes I think the tone matters, because you're still doing open source. Right? You're doing it because you love it, and you'd hope people are nice about it. Luckily, in this community, rudeness is extremely, extremely rare.
Pretty much everyone has been very happy, very nice, and very, very patient with us. So that's fantastic. I think the architecture we've set up also means we've seen a lot of contributions that we wouldn't have seen otherwise, so I'm really happy about that. That was a good surprise. So we've got quite a few people not just using, but actually contributing to, the extensions. So that's been good. Yeah. Other than that, it's been all positive all around, really, that experience so far. It's been really fun. So we are thinking about moving the governance from us — or mostly me, basically, making all the decisions — to a more open governance, where, for extensions, maybe people could take the lead on their own extensions and basically not have me doing the review for everything. And
[00:52:26] Unknown:
in addition to the Chaos Toolkit and the various extensions that you've built for plugging into it, you have also started a draft proposal for a common specification for chaos engineering, particularly in terms of the agreed-upon language to use around how experiments are created, and ways to ensure that everyone is using the same format for conversing about chaos engineering. So can you talk briefly about that, and also about any plans that you have for the future of that specification
[00:53:01] Unknown:
and the Chaos Toolkit project itself? Of course. So, yeah, it's called the Open Chaos Initiative, and we call it an initiative, not a manifesto, because we want it to be short-lived, hopefully. The idea is really to federate the industry around a shared vocabulary. We reckon that helping people get their vocabulary right means we all talk about the same thing, and that's actually more important than you'd think. Words have meaning, and it's difficult to have a discussion when you're not talking about the same thing. Words give a remit, give a scope, and that means things that fall outside that scope can be discussed elsewhere. It doesn't mean they are not important, but it means we can focus on one aspect of chaos engineering rather than having an ad-lib discussion.
The point is, we think chaos engineering, resiliency, and reliability should be an industry-wide discussion. We don't think ChaosIQ, or any other provider, is meant to say: do chaos engineering this way. What we say is that the industry has a ton of knowledge, a ton of information, a ton of ways of dealing with these things. If we could start sharing at least the vocabulary, what it means, and how we learn from each other, then we start making the whole industry better, right? We make it more aware of these problems. I think it's a bit similar to what the DevOps culture has done, and to what we see coming up with SREs. So the sort of thing we're looking at is asking: can we all agree on a vocabulary, on a way to think? That would be fantastic, and that's what we're trying to do. So if people from across the industry are listening to this, please join us on the Chaos Toolkit Slack, because we want your input on that. As for the project, the Chaos Toolkit itself, we continue evolving it at its own pace. I think we've got a 1.0 which is way overdue, and I haven't had the chance to actually release it.
We want to say: this is a milestone. It has worked out really well, thanks to all the contributors, and you can bet on it. Even if we change things, you can always come back and fall back to that version. So I think it's important to leave the 0.x versioning scheme and move into the 1.0 versioning scheme. That's probably the major short-term thing. A longer-term goal is to ensure that whatever the Open Chaos Initiative comes up with in terms of specification, the Chaos Toolkit applies and implements, so that you have a de facto standard implementation of those findings.
It will remain open source, you know, for a long time. We have plans to talk about hosting it at the CNCF, and to consider whether the governance should live there, because we think it fits with the goals of the CNCF. So that's the sort of thing we're looking at this year, definitely. Quite an exciting year in which to grow the project quite extensively.
[00:56:19] Unknown:
And are there any other aspects of chaos engineering, the Chaos Toolkit, or the Open Chaos Initiative that we didn't cover yet that you'd like to discuss before we close out the show? I think we have covered everything. The important,
[00:56:33] Unknown:
I think, part is that the Chaos Toolkit is a CLI, and it's just a year old, but to me it feels like it's been there for years, because people talk about the toolkit now, and that's fantastic. But what we have coming this year is the Chaos Platform, which is the big sister of the toolkit, where we want to take things to the next level so that you can install the platform and start having that visibility. So from a future perspective, the platform is where I'm going to be mostly this year, more than on the toolkit.
[00:57:07] Unknown:
That doesn't mean the toolkit is going to be left alone, definitely not, but the new development will happen on the platform, and it's going to be exciting. Really fun. And for anybody who wants to follow along with you and get in touch or ask any questions about the projects that you're involved with, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose a movie I watched recently called Time Trap. It was a very interesting sci-fi movie about a group of people who end up in a cave that has some weird aspects to how time flows internally and externally. I'm not going to give away too much more, other than to say that it was just a very interesting movie, a different take than a lot of the movies I'm used to watching. It was a lot of fun, and I recommend it for anybody who's looking for something to make you think, and just for a fun sci-fi film. So with that, I'll pass it to you, Sylvain. Do you have any picks this week? Very good. Very good. I'll make sure to look it up.
[00:58:09] Unknown:
I think my pick this week is actually that I'm trying to get back to playing my guitar. I've left it in its case for a long time, and I realized how much I miss it. So I picked it up the other day, and it was fun, although it really hurt my fingers because I haven't played in a long time. But, yeah, I'm trying, I guess, to force myself to get away from the machine from time to time. However exciting the machine is, and developing open source and all that, I'm realizing there is a life outside this tiny box. So that's my pick: doing something else.
[00:58:51] Unknown:
Definitely good advice, and something that bears repeating every now and then, so thank you for that reminder to everybody who works in this industry. And thank you again for taking the time today to talk about your work on the Chaos Toolkit and chaos engineering. It's definitely something I'll be taking a closer look at, so I appreciate that, and I hope you enjoy the rest of your day. Thank you very much, Tobias. Thank you for having me.
Introduction to Sylvain Hellegouarch and Chaos Toolkit
Understanding Chaos Engineering
Motivation Behind Chaos Toolkit
Comparison with Other Chaos Engineering Tools
Workflow for Using Chaos Toolkit
Running Experiments in Production vs. Preproduction
Strategies for Identifying Points of Failure
Information Captured by Chaos Toolkit
Implementation and Evolution of Chaos Toolkit
Challenges in Creating Safe and Reversible Experiments
Lessons Learned from Building Chaos Toolkit
Open Chaos Initiative and Future Plans
Closing Remarks and Picks