Summary
Chaos engineering is the practice of injecting failures into your production systems in a controlled manner to identify weaknesses in your applications. In order to build, run, and report on chaos experiments, Sylvain Hellegouarch created the Chaos Toolkit. In this episode he explains his motivation for creating the toolkit, how to use it for improving the resiliency of your systems, and his plans for the future. He also discusses best practices for building, running, and learning from your own experiments.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. Podcast.__init__ listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media.
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
- Your host as usual is Tobias Macey and today I’m interviewing Sylvain Hellegouarch about Chaos Toolkit, a framework for building and automating chaos engineering experiments
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by explaining what Chaos Engineering is?
- What is the Chaos Toolkit and what motivated you to create it?
- How does it compare to the Gremlin platform?
- What is the workflow for using Chaos Toolkit to build and run an experiment?
- What are the best practices for building a useful experiment?
- Once you have an experiment created, how often should it be executed?
- When running an experiment, what are some strategies for identifying points of failure, particularly if they are unexpected?
- What kinds of reporting and statistics are captured during a test run?
- Can you describe how Chaos Toolkit is implemented and how it has evolved since you began working on it?
- What are some of the most challenging aspects of ensuring that the experiments run via the Chaos Toolkit are safe and have a reliable rollback available?
- What have been some of the most interesting/useful/unexpected lessons that you have learned in the process of building and maintaining the Chaos Toolkit project and community?
- What do you have planned for the future of the project?
Keep In Touch
Picks
- Tobias
- Sylvain
- Playing Guitar
- Step away from the computer
Links
- Chaos Toolkit
- Chaos IQ
- Gremlin chaos engineering service
- Russ Miles Chaos IQ co-founder
- Zope
- CherryPy minimalist Python web framework
- CherryPy Essentials book
- Chaos Engineering
- Chaos Engineering Book
- DevOps
- SRE (Site Reliability Engineering)
- Dark Debt
- Netflix Simian Army
- Chaos Monkey
- Terraform
- Kubecon
- Istio service mesh
- Chaos Platform
- PyInstaller
- Composition vs Inheritance
- Open Chaos Initiative
- CNCF
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it, so say hi to our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And if you're like me, then you need a simple and easy to use tool to keep track of all of your projects.
Some project management platforms are too flexible, leading to confusing workflows and days' worth of setup, and others are so minimal that they aren't worth the cost of admission. After using Clubhouse for a few days, I was impressed by the intuitive flow. Going from adding the various projects that I work on, to defining the high-level epics that I need to stay on top of, to creating the various tasks that need to happen only took a few minutes. I was also pleased by the presence of subtasks, seamless navigation, and the ability to create issue and bug templates to ensure that you never miss capturing essential details. Listeners of this show will get a full 2 months for free on any plan when you sign up at pythonpodcast.com/clubhouse.
So help support the show and help yourself get organized today. And don't forget to visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And don't forget to keep the conversation going at pythonpodcast.com/chat. Your host as usual is Tobias Macey, and today I'm interviewing Sylvain Hellegouarch about Chaos Toolkit, a framework for building and automating chaos engineering experiments. So, Sylvain, could you start by introducing yourself? Yeah. Thank you very much, Tobias, for having me. So my name is indeed Sylvain Hellegouarch. You said it very well, which is quite,
[00:02:07] Unknown:
it's not easy. Yeah. I've been a software developer for, I don't know, 20 years or something, and mostly a Python developer for just as long, pretty much. And I've been on that path of graduating from merely a software developer to almost a systems engineer, basically: seeing the system as a whole, not just as a piece of code which lives somewhere else on a machine. Right? So that's where my passion for system reliability and resiliency came from. That's why I started Chaos Toolkit with my,
[00:02:44] Unknown:
associate Russ Miles. And do you remember how you first got introduced to Python?
[00:02:49] Unknown:
Yeah. I started with Python in, when was that, early 2000 — probably 2000 or 2001 — through the framework called Zope. And, honestly, it was quite something to start there; it showed dedication to go beyond that. Zope was a full-stack web framework back then where you could actually use a UI to navigate your server, basically. And it was quite ahead of its time for the era, for what browsers could do and what servers could do. So I started there, and then I moved to another framework called CherryPy, and I stuck with it for years. I wrote a book on it, and I participated in the project until, what, two or three years ago. It was a fantastic path, really. We loved building the community and the project. I've never been a very vocal Python developer in the Python community, but I've loved the community and the environment ever since — for pretty much 18 years or something. And so
[00:03:55] Unknown:
in recent years, you've been working on the Chaos Toolkit, which is for building and running these chaos engineering experiments. So before we get into the particulars of your project, can you start by giving an overview of what chaos engineering is and why somebody might be interested in it? Of course.
[00:04:11] Unknown:
So, I'd like to preface this by saying it's probably my own view on chaos engineering. Because it's a fairly nascent field at large, you'll find many definitions, so I don't expect to say that this is a definitive one. I don't think there is one. But to me, chaos engineering is a vehicle. It's a vehicle to explore your system. And more importantly, it's to explore a system so that you discover your vulnerabilities, your weaknesses. All too often, what I saw in my software experience, or test experience, was that we, as software developers — it's not that we don't care about our system or users or where these things live, but we don't get the direct feedback. We don't know what's going on. So usually, what happened, at least back a few years ago, was you'd basically get a bug report from somewhere. Someone would say, you know, this is broken.
And most likely, you'd be really annoyed with it because it interrupted you in your flow. You weren't that interested in finding the issue or the bug — probably it was the users doing something stupid, or the operations people not understanding what you did. Not to say developers were condescending or anything, but more that they were remote from the system, from the live system. And I think for the past few years, we've seen DevOps, SRE, the whole game of saying we need to get closer. And by being closer, it means you need to take responsibility about what you're producing, right, and what you're putting in production.
Chaos engineering is, to me, the vehicle that is going to help you get there. It gives you the opportunity to explore the system. So by doing that, it says: your system is full of unknowns. You need to surface those unknowns so that they become knowns. Right? And you make good decisions about them. When I say decision, it doesn't mean you have to fix anything. That's why it's interesting. It doesn't actually expect a fix. It expects you to know, and maybe prioritize, or maybe say this isn't that important if that happens. So chaos engineering is really about that. It's about surfacing what you don't know — what we could call, sometimes, the dark debt — about your system. And it makes you a better software engineer all around, a better team, because you can make decisions based on data, which you surface, more than on your intuition, which may or may not be right.
So, yeah, chaos engineering is all about that. Now, obviously, people are most aware of Chaos Monkey from Netflix. And it's a tool. Right? It's just a tool for a given scenario. It does it well, and it's well recognized, but it's still a tool. So what we're trying to do here is to realize that chaos engineering goes beyond that tool, or any specific tool, basically. It's a thought process. It's a culture change. Those are big words, I know, but they're trying to make you a better software engineer, not make you good at using a tool. So that's where chaos engineering is coming from, in my opinion.
[00:07:21] Unknown:
And in order to help people explore this area, start to find these areas of dark debt, as you called it, in their systems, and expose the unintended consequences of how the different components play against each other, you ended up building the Chaos Toolkit. So can you talk a bit about
[00:07:40] Unknown:
what the project is and your motivation for creating it in the first place? Of course. Yeah. And what you said is exactly that. Basically, about 18 months ago, I think, or something like that, Russ Miles and I were looking at those various tools around chaos engineering. We were actually also interested in applying chaos engineering experiments as we had read about them in the chaos engineering principles. And we were looking at the tooling that was there, and we realized none of the tools actually took us by the hand on that workflow.
Everything was about degrading your system. Everything was about breaking something, which to us was a means, not an end. It didn't tell us how to improve. It didn't tell us what we could learn. It only told us: use that, it's going to provoke that error. So what? That didn't seem very powerful to us in the long term. So we realized there was a gap there, and that's when we started Chaos Toolkit. The Chaos Toolkit itself, on its own, doesn't create any chaos. What it does is give you a way to orchestrate those actions of breaking, or impacting, or inducing latency — any degradation, any change in your system, basically. It doesn't have to be about breaking your system.
So it tries to put that in a format that makes it sensible for you to come back afterwards and look at the results. It makes it simple for you to share the experiment and discuss it, and also make sense of its results. So the Chaos Toolkit really was about trying to formalize, if you will — or come up with a framework, like you said, a library — that applies the vocabulary of chaos engineering: your hypotheses, your experiments, your findings. All the words that you would use in chaos engineering when you're exploring your weaknesses, when you're discovering what's going on. It's the scientific approach, really. And we tried to come up with a tool and a format — a simple one, that was very important to us — to make it easy for people to get there and to understand that it goes well beyond impacting a node or that sort of thing. So the Chaos Toolkit can be seen as an orchestrator for chaos engineering more than anything else. And you mentioned in your discussion about what chaos engineering is, some existing toolkits
[00:10:14] Unknown:
were primarily from Netflix, in terms of Chaos Monkey and the larger suite of projects that they've built in the form of the Simian Army. And there's also another company called Gremlin that has come about recently that seeks to formalize some of these chaos engineering experiments and provide a way to make it easier to apply them. So I'm curious if you can give a quick comparison of what you're doing with Chaos Toolkit to what's available in the Simian Army and what Gremlin is doing?
[00:10:44] Unknown:
Of course. So the Simian Army comes from Netflix. Usually, a lot of the tools we see happening in the chaos engineering ecosystem come from companies who created their own tool for their own specific needs. So if you have the same sort of environment or the same need, usually you can apply them. Still, you're no better off, because you don't have the learning workflow that goes with it. But you can get there because you have the tooling. At least that's one aspect of it. So the Simian Army is fantastic if you're living in AWS and you have all those sorts of questions that Netflix probably asked — failover, redundancy, losing a region and that sort of thing, or losing nodes. It makes sense to go for the Simian Army, definitely. And they are fantastic tools, but, again, they don't cover all your problems, all your difficulties.
Gremlin has a fantastic product, because, as far as I understand, they initially came up with something that helps you define degraded experiences in your system — create latency, network issues, CPU issues, that sort of thing — which are actually good things to look for. Recently, they came up with ALFI, which is another side of the product, which is more application oriented. So, as I understand it, you can apply chaos at the application level, where you say: for this specific application, induce that sort of degradation. That's interesting, because what we see here is that chaos engineering is not just about the infrastructure. Right? It goes all the way up to the application and, honestly, even to the people, in the sense that you're trying to look at how people react and what they are missing to actually do their job properly.
So Gremlin has that sort of tooling, and the Chaos Toolkit is different in that it's going to drive, for example, Gremlin. Because Gremlin exposes an API, you could use the Chaos Toolkit to drive their API as a way to create some degradation in your system. But the Chaos Toolkit will give you the experiment vocabulary and results that you will be able to share with others and compare with. So I think they all work together seamlessly, in my opinion, but the toolkit tries to be really that orchestrator for all those tools.
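To make that orchestration idea concrete, here is a minimal sketch of what a Chaos Toolkit action driving an external fault-injection service over HTTP could look like. The endpoint, headers, and payload below are hypothetical placeholders, not Gremlin's actual API; the point is only the shape of an `http` provider, with the token pulled from the experiment's configuration or secrets via `${}` substitution.

```python
# A sketch of a declarative action (a fragment of an experiment file,
# shown here as a Python dict) that drives a fault-injection API over
# HTTP. The URL and payload are hypothetical, not Gremlin's real API.
action = {
    "type": "action",
    "name": "trigger-latency-attack",
    "provider": {
        "type": "http",
        "url": "https://faults.example.com/attacks",  # hypothetical endpoint
        "method": "POST",
        "headers": {"Authorization": "Bearer ${fault_api_token}"},
        "arguments": {"kind": "latency", "delay_ms": 300},
    },
}
```

Because any HTTP API can be driven this way, the same mechanism works for a cloud provider, a service mesh, or a commercial fault-injection product.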
[00:13:09] Unknown:
And so it sounds like with Chaos Toolkit, it's more the laboratory environment where you're recording the outcomes after you posit the hypotheses, and you're trying to get the overall feedback to see, from a system view, what are the learnings that we can get out of this, what were the impacts of these different tests — and actually being able to record the different events as they happen. As opposed to things like Gremlin or the Simian Army, where those are just the tools that are used as part of the experimentation, and that don't necessarily have any concept of recording the outcomes or reporting them or providing any useful analysis of what went on during those,
[00:13:50] Unknown:
either hard or soft failures? That's my understanding of those tools. I'm not necessarily an expert. And I know, for example, Netflix has ChAP, which is probably closer to what the Chaos Toolkit and the Chaos Platform — which I'll briefly introduce later on — are all about. But it's internal; they haven't released it, and it seems to be so Netflix oriented that I don't know if they'll ever release it, because I don't know if it's going to be applicable to someone else. I haven't used Gremlin, the product, for a long time, so I don't know exactly where they are now, and I wouldn't want to judge on the face of what exists now. But definitely the Chaos Toolkit tries to take you by the hand through the experiment. Like you said, the laboratory angle, the scientific approach of: okay, I've got a question.
What happens if this happens? Right? Okay. I'm going to set up an experiment, try to find the indicators that will help me actually figure out what it means when that happens, and I'll make a report of that — the Chaos Toolkit is going to help me create that report. Once that's done, two things can happen. I can shelve the report, and that's it. Or I can take that report and we can discuss it, analyze it — but I can also reuse that experiment, because it's fully automated at that stage, and say, well, I'm going to run that automatically, schedule it to run every day. Right? So you move from exploratory to validation.
At that stage, you know you found a weakness. You're just trying to make sure that you don't have a regression later on. And that gets closer to what the Simian Army probably tries to do sometimes by running continuously. It says: well, we knew we had that weakness, we fixed it, and we don't want to reintroduce it later on. So by running continuously, you get there. And can you talk through the overall approach and workflow for getting started with the Chaos Toolkit to identify a potential hypothesis that you want to validate, and then building and running the experiment to get the results that will either support or refute that hypothesis? Completely. So the first thing, in my opinion, to get is at least good indicators. It's funny, because they are probably the most difficult part of the whole process: to say, well, what does matter to us? Ask any engineer, or even management, or any team, basically:
you've got a superb piece of software and system running, but what matters to you? What can't you live without? Right? If I take Amazon, for example — the e-commerce site — and I say, well, what matters the most to Amazon in terms of functionality? Is it that people can pay? Is it that people can search for products or can put them in their baskets? By really making sure that you think about those things, and saying, okay, we need at least people to be able to search — even if they can't buy, that's fine, because they can still use the service; that's the minimal thing we need the system to be able to do — then you start realizing that your system has priorities. You care more for some things. Right? So those indicators are going to be translated into things you can actually look for. So the way it would work is, assuming you know those things, you probably have some form of metrics somewhere that gives you that information.
If you don't, you've already learned that you're probably missing some quite important information to even make sense of your business. Right? But assuming you do have metrics, the toolkit would then allow you to say: well, I'm going to talk to my monitoring system, however it's packaged, and say, give me that value. If that value is under a certain threshold, or is x or y, then my system is healthy — in the sense that it means, from a business perspective, we're happy with the system. It may have some broken state here and there, but at least we know people can do something with it.
So that's your business metric, and you usually have a few of them. Then you have more technical ones. For example, you may want an experiment where you say, well, I want to make sure I don't have an alert of x or y. So you start with the Chaos Toolkit, and what you're going to do is create a steady-state hypothesis. It's a block where you define — it's all declarative; the Chaos Toolkit is, for now, all declarative. Think Terraform, basically. You define that block and you say: I'm going to query that value. So I'm going to make an HTTP call, however it's packaged, I'm going to ask for that value, and I'm going to give a tolerance. That tolerance could be binary — true/false — or a number, whatever, but it can be a bit more complex. Right?
What the Chaos Toolkit says is: when we run that experiment, we're going to run those probes first. If any of the tolerances is not met, then the system is not even in a state where we can learn from it. It's already in a state that is not good from our perspective. Right? So we bail; the toolkit stops there. If everything is fine, then we move on to the method. The method is basically the body of your experiment. It inflicts that thing — maybe create latency, maybe remove or change a configuration parameter somewhere. Whatever your question is, that's your hypothesis.
If I do this, if this happens, then tell me how the system behaves after that. At that stage, it's the same: it could be an API. So, for example, if you're using AWS, you can call EC2 or any of the APIs of AWS to change your system. On Kubernetes, you can stop a pod, you can remove a service — all those things that already exist in the API can be driven by the toolkit. You don't need any fancy tooling, really. And, actually, most of the extensions that our community comes up with are basically driving those APIs in a way that makes sense to them. So we run that thing, and then we come back to the steady state that we had initially, and we run it again. If the system has changed, then likely that check is going to fail, in the sense that we have deviated from what we had initially.
Now, the difference between that and testing is: when we have a deviation, we celebrate it. We say: fantastic, we've learned something about our system. Under those conditions, we know the system is impacted. Right? We don't necessarily know how much it is impacted, but we know it is impacted. And that's basically it. So your experiments usually are fairly short, because they just drive existing APIs and endpoints if you've got them. And they package the results into a JSON file. You can have a rollback, and I think we may come back to that later on. For example, in the demo that we did at KubeCon in December, we called the Istio endpoint — we used Istio as a fault-injection API. So we say: create some delay, some latency, for that user. In the rollback, what you're going to say is: revert that thing you did — remove the delay, the latency, for that user. So the undoing, the rollback, is more of an undo, because if your system has deviated, first of all, you want to analyze what's happening, and we can't promise a rollback in the sense of putting the system back in place, because it has changed and we don't know how. It's not testing, from that perspective. And that's it. Basically, you define all those things in a declarative form, you give it to the Chaos Toolkit CLI, and it runs it. That's the workflow, basically. Most of the time, you end up writing those extensions in Python or driving a process directly. We're also talking about being able to drive other runtimes, like Golang or whatever, and see how that works. But, basically, that's the idea. You define your experiment as a file you put on GitHub. People can read it, share it, make sense of it, and reuse it as long as they have the right extensions, because they can just run it. And that's really the workflow that you keep running. After that, once you've done it once, you run it continuously, basically. So to run it continuously, we move on to the Chaos Platform. The Chaos Toolkit is more of a CLI.
So it's in isolation — you're exploring, like you said, in the laboratory; you move into the lab, so you're doing that on your own. And then you've got the Chaos Platform, which is more of a set of APIs where you can give visibility. Right? Think about it almost like this: if the Chaos Toolkit is Git, well, the Chaos Platform is GitLab or GitHub, in the sense that it puts in front of everyone the experiments, the scheduling, the policies for that scheduling — saying you can't run that experiment at that time because there is another one running already, that sort of thing — and all the reports. So you can basically see what happened, compare those, and that sort of thing. And, as you were saying, once you've got the experiment built, it becomes essentially cheap or free to run it on a regular basis.
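To ground the workflow just described, here is a minimal sketch of a complete experiment, written as a small Python script that emits the declarative JSON file the `chaos` CLI consumes. The title, URL, tolerance, and Kubernetes label selector are illustrative assumptions, and the pod-termination action assumes the chaostoolkit-kubernetes extension is installed.

```python
import json

# A minimal sketch of a Chaos Toolkit experiment, as described above:
# probe the steady state, degrade the system, then re-probe. All the
# concrete values here are illustrative assumptions.
experiment = {
    "version": "1.0.0",
    "title": "Search stays available while a catalog pod is lost",
    "description": "If we terminate one catalog pod, users can still search.",
    # Steady-state hypothesis: probes run before the method (to check the
    # system is in a learnable state) and again after it (a failed
    # tolerance then is a deviation -- something learned, not a test failure).
    "steady-state-hypothesis": {
        "title": "Search endpoint responds",
        "probes": [
            {
                "type": "probe",
                "name": "search-returns-200",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "http://search.example.com/health",
                    "timeout": 3,
                },
            }
        ],
    },
    # Method: the degradation itself, here driving the Kubernetes API
    # through the chaostoolkit-kubernetes extension.
    "method": [
        {
            "type": "action",
            "name": "terminate-one-catalog-pod",
            "provider": {
                "type": "python",
                "module": "chaosk8s.pod.actions",
                "func": "terminate_pods",
                "arguments": {"label_selector": "app=catalog", "qty": 1},
            },
        }
    ],
    # Rollbacks: undo only what the method introduced. Empty here, since
    # Kubernetes reschedules the pod itself; an injected Istio latency
    # rule would be removed here instead.
    "rollbacks": [],
}

with open("experiment.json", "w") as f:
    json.dump(experiment, f, indent=2)
```

You would then execute it with `chaos run experiment.json`, which walks through the probes, the method, and the rollbacks, and records everything it saw into a journal file.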
[00:23:18] Unknown:
And a lot of the focus of chaos engineering is on executing these experiments in your production environment, which can be quite fear inducing for a lot of operators in particular, but it's definitely useful for being able to ensure that you're getting the most value out of your learnings, because it's hard to simulate production-scale traffic in a different environment. But I'm curious what your advice and experience are in terms of whether to run in a production or preproduction environment, particularly when you're executing these on a regular basis, possibly in an automated fashion.
[00:23:53] Unknown:
I think chaos engineering, like I was saying earlier, involves a culture shift. Right? It's a bit like DevOps or SRE. Every time there is a new practice, it doesn't just bring new tools. Right? It changes also the way we think about our system, the way we behave in front of our system. So in effect, what that tells me here is: you need to start — and I say need, not lightly — you probably need to start in staging or places like that. Not necessarily because you're going to learn more or less, but more because you're getting used to it. That fear you're mentioning is something chaos engineering is trying to help you fight. Because more and more, you'll be able to look at your system as not something made of crystal glass. Right? It's not something to be particularly worried about.
It needs to be something you're able to confront, to actually say: you don't dominate me, you don't give me orders, basically, system. Right? I am your boss, basically. And people need to start learning how not to be afraid of their system. So probably the workflow would be: it's best to start in staging. But then what you do is move to prod and look at the system — and, again, look at maybe a side of the system where, if it goes down because of a mistake we made while doing chaos engineering, first, we know how to put it back quickly. That's always important, to be able to do a recovery.
Then, when we do the first few iterations of chaos engineering, we do that all together, so that the people who know the system can be there to fix it if need be. And probably on something that is, yeah, not really impactful to the business. But the more you do that, the more you're going to feel comfortable moving to something meatier. So it takes a little while, like any practice. I wouldn't expect people to get as fluent with chaos engineering as Netflix or LinkedIn probably are for a couple of years. It goes beyond tooling. Right? It's a learning curve.
So it's a long game, definitely, but it's worth it, because at the end of the day, you should be a better software engineer, and you should be less afraid of your system. So, yeah: start in staging, start small, then move to prod, be all around the system. Start with something that — I wouldn't say doesn't matter, but I wouldn't go for the biggest thing in your system; gradually start from there, basically. Yeah. It's a similar approach to
[00:26:29] Unknown:
the DevOps idea of releasing more often, with the idea that if it hurts, then you should do it more frequently until it starts to hurt less. So it goes with chaos engineering. If you're afraid of it, then that means that you don't fully understand your system, and the only way to really gain that knowledge and familiarity is to break it and see what happens, so that you can fix it and then get a better grasp of how everything fits together and how you can either improve things or create mitigations or recovery strategies.
[00:27:01] Unknown:
Completely. Completely. That's exactly the point. You said it well. That's the only way to learn, basically: by failing often but small, and then it hurts less. That's it.
[00:27:17] Unknown:
And so, as far as when you're running the experiment, what are some of the strategies that you have found to be most useful for being able to identify the points of failure, either ahead of time or during execution — particularly when you have some of these complex systems where there might be unexpected cascade effects — and identifying how they're playing out and being able to
[00:27:41] Unknown:
trace the sequence of events. So what we have realized — alongside the fact that a tool like the Chaos Toolkit was missing, in our opinion — is also, quite surprisingly maybe, that there's no easy answer to that question, because, obviously, every system is different. Even if you're using the same platform — let's say you're using Kubernetes — you're still running different applications on top of it. You have different constraints. You have different needs. Pretty much everything is different except for the API. Right? So at least you still have that API, and that's probably the biggest thing that Kubernetes has given us: the API. It's a bit like Java back in the day, or Spring, I guess. You knew you could get a Spring engineer, and you were aware that they knew the API. Right? So there was a baseline of knowledge.
But when it comes to failure and those sorts of runtime aspects, well, there doesn't really exist a catalog of that. So that's what we're trying to build with Chaos IQ, the company I work for, where we're trying to leverage the knowledge that people have — because there is such knowledge, but it's not easy to share. And we think chaos engineering, through the experiment vocabulary, is fantastic for doing that. So the idea here is that you could benefit from others' knowledge, from other places, to gain that knowledge. And we call that the Open Chaos Initiative, where you share the knowledge you gained for the benefit of everyone else.
I was presenting that at KubeCon, and one thing I realized is that most of the time when you go to tech conferences, obviously you're interested in the speakers and talks, but what you're really looking for is: how do you do this? Talking to your peers, learning from them, in corridors or on stage, when you get that thing — someone saying: this is what we did, this is what we learned. Usually those are packed rooms, because you're trying to figure that out yourself and apply it to your own environment. Well, we're trying to do that with the Open Chaos Initiative and Chaos IQ, through those tools, Chaos Toolkit and Chaos Platform, by saying: what if we could encode those questions of failure points — saying, you should pay attention to this, someone else in the industry has said this is important.
Right? Maybe you could just grab the experiment and run it. So that shifts the problem from "you have to have all the answers" to "where can I find someone that can help me? Where can I find that knowledge?" So the strategy is simply to go and talk to your peers. There won't be any tool, any machine learning, any of this, that will ever replace the good old human relationship where you say: this is what I see, and someone else gives you hints and that sort of thing. So I think it's not easy to respond to that, but the strategy is going to be, again, at a high level: look at your indicators. What matters to my system? Right? What matters to my business?
And then: what am I worried about? That gives you almost like a perimeter of things you should worry about. Everything that falls outside of that, you don't worry about. And it's within those boundaries — the indicators of things that matter to my business, and the things I'm worried could happen — that you start exploring. To give you a random example: let's say your users don't really care whether your system took one or five seconds. Maybe latency of a few hundred milliseconds is not your problem, so that's not where you should be looking. And that's why it's so important that you figure out what matters to you as a system, as a business, because that's your starting point. Okay, from there, if we look back: what are the indicators, technical and otherwise, that we should be looking at? And those indicators are going to say: okay, now we know what to look for, what to measure, what are the failure points we have in front of that. Other failure points don't seem to matter to your business, so you leave them out. Right? So that's probably good: don't try to go and test for everything. It doesn't really matter. That's why starting with a tool is never really efficient, in my opinion, because if you're just applying Chaos Monkey or any tool like that — yeah, you broke something, but in what way does it matter to you? How do you measure that it actually improved your knowledge of your system? That's why you really need to figure out what matters to you, and walk back, basically, to find the failure points that would impact those indicators.
[00:32:36] Unknown:
And when you're executing a test run with the Chaos Toolkit, what kinds of information or statistics or feedback are being captured, and what kinds of reporting are available as an output of an experiment run? So, the toolkit, again —
[00:32:56] Unknown:
I've implemented it, I've designed it, but I'm still going to say it: the toolkit is fairly dumb. In the sense that it does what you tell it to do — it's a bit like Terraform. If you give it an empty file, then it won't do anything. Right? So it comes back to the realization Russ and I had that we don't have the answers. You have more answers and more expertise in your system than we do. And because those indicators I was mentioning are yours, it seems only fair that, for now, you make sense of what data you need to collect. So, like I explained earlier with the body of an experiment, usually not only do you have an action that does something to your system — often you also put there what we call probes, or data-collection actions. Those things are going to simply ask whatever monitoring system, whatever data you collect yourself, and try to aggregate that.
And that will become part of the report that is generated at the end, so that when you start to do the analysis, you can look at it and say: okay, we did that, and we saw that. One simple example: let's assume you've got two services talking over HTTP, and you ask: what happens when service A, consuming service B, gets an HTTP error back from service B? You can certainly simulate that HTTP error. That's fine. And you'll likely see an error on service A. What you're going to collect as data is probably the logs, maybe some monitoring, some alerts. So you're going to talk to your system; you're going to actually query your logs and see what error actually showed up.
Because if it happens in real life, that's the error that my operations people are going to see. Can they act on that? Can they make sense of a traceback that happened in production? Often, developers don't realize that saying "oh, I logged the traceback" doesn't really help operations, because they still don't know what to do with it. That's for you to analyze after the fact. Right? So it's interesting to go that way, to look at your system and say: well, okay, I'm going to cherry-pick that data, that data, that data. Obviously, if you choose that beforehand, you have a good idea of the data you need for a solid report. So sometimes what happens is you play the experiment once and realize: okay, I didn't collect that data. And you play it again until you've got enough information collected in there to actually build a solid report.
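As a sketch of that kind of data collection, a probe can sit in the method right alongside the action, shelling out to a real CLI through the toolkit's `process` provider; the label selector and time window here are assumptions for the example.

```python
# A sketch of a data-collection probe placed in the experiment's method:
# after the error is injected, capture service A's recent logs so the
# traceback operators would actually see ends up in the run's journal.
# The label selector and --since window are illustrative assumptions.
collect_logs = {
    "type": "probe",
    "name": "collect-service-a-logs",
    "provider": {
        "type": "process",
        "path": "kubectl",
        "arguments": "logs -l app=service-a --since=1m",
    },
}
```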
Now, as for the report, the Chaos Toolkit is not clever enough yet to actually analyze your report for you by default. So the report that is generated is mostly, almost, a UI view of your run — it can generate PDF, HTML, that sort of thing. One aspect I didn't mention that is important for reports: when you run experiments — let's say you've got three, four, five, or ten experiments that cover a certain surface of your system, so they are related in that each of them tries to showcase something specific from a different angle, but they're all related — well, what you can do is add a small piece of metadata, which is called a contribution. That's free-form text with a weight. Basically, you're saying: that experiment cares for availability at a high level, but doesn't care for security; it cares for reliability, and that sort of thing. And what the reporting will do, once it aggregates all of those results, is build a form of punch card, if you will. You can look at that punch card and say: okay, I realize I do a lot for availability, but I don't do anything for security. That's interesting, because sometimes that helps you figure out where you put a lot of effort in terms of chaos engineering. And what that says is: that's fair, we don't care for security, but that's a known. Right? It's not something hidden. It's a conscious decision. So the more chaos engineering experiments you do, the more useful the report we can generate is going to be. Right? But, again, it can't yet make a prediction about what you should do about that error.
That's not there yet.
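As a rough illustration of those contributions, the metadata lives at the top level of the experiment file; the property names and weights are free-form, so the particular values below are only an assumption for the example.

```python
# A sketch of the "contributions" block described above, extending the
# experiment dict from the earlier sketch. Properties and weights are
# free-form; these values are illustrative assumptions.
experiment["contributions"] = {
    "availability": "high",   # this experiment mostly exercises availability
    "reliability": "medium",
    "security": "none",       # declared out of scope on purpose: a known, not a gap
}
```

Aggregated over many runs, weights like these are what the reporting can fold into the punch-card view he describes.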
[00:37:11] Unknown:
And can you take some time now to discuss how the Chaos Toolkit itself is actually implemented, in terms of the software design and architecture,
[00:37:19] Unknown:
and how the implementation has evolved since you began working on it? Yes, of course. So it's written in Python, as you'd probably expect. It's in Python 3 — 3.5, actually, and we're moving to 3.6 as a minimum. If I were starting it today, it would probably be 3.7 or 3.8 anyway. I try to keep it to at least 3.5; it's a conscious choice not to piss people off. But I've been using Python for a long time, and while Python 2 was really nice, Python 3 — especially since Python 3.5 — has, in my opinion, moved to a different level; it has become, I would say, a modern language again. I don't know if that's something people would contest or not, but that's my take anyway. And the decision I made when we started designing was that we wanted the system to be extensible.
So the Chaos Toolkit has a core, which is fairly small, and which basically takes the JSON — or the other declarative form, the YAML — and turns it into something runnable. Now, extensions are usually either HTTP — in that case, the toolkit makes the call for you through requests; you just put in your URL, and it calls it and gives you the feedback — or a process — you can call a process, which is useful if you have a tool that induces some load; you can just call it — or, more often than not, you extend through a Python library. And I wanted each of those to be a separate library. It creates a lot of small libraries, of course, but it helps the core move at its own pace and the libraries move at their own pace. And it has worked out very well, I must say, because the community feels happy to move fast on their extensions, and they make sure they don't impact the core. I think we would see fewer contributions if we had said everything is packaged at once, because people would be afraid of breaking the core.
Right now, the core is solid. It's reliable. To be fair, the initial design has hardly moved since the beginning. And I think it's due to one simple thing: we didn't go for a state-based design. There are no classes; it's only functions. So we designed the toolkit in a functional approach, if you will. I wouldn't say that Python is a functional language by any stretch, but it doesn't force you into a stateful design with classes. And I think it has really worked out, because functions are so easy to reason about. You've got a function; it's just inputs and outputs, and that's it.
And, basically, when we pass things to the extension functions, we always pass a copy — because, you know, mutability. We said: if you change that, it won't be seen from the outside. So we pass a copy of the data that they would expect. And that has worked very well. That has helped us move fast. The documentation is what it is, but contributors never had to go too deep into it: people could just grab the code of an extension, copy and paste, and go from there. So it's been working very well.
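As a sketch of that functional style, an extension is just a module of plain functions; the module name, endpoint, and parameters below are hypothetical, but the shape — inputs in, a value out, an exception to signal failure — is the pattern described.

```python
# chaosexample/probes.py -- a hypothetical extension module, sketched to
# show the functional style described above: a plain function, no classes,
# no shared state. The toolkit passes in the arguments declared in the
# experiment file and records the return value in the run's journal.
import requests


def search_latency_ms(base_url: str, timeout: int = 3) -> float:
    """Probe: return the search endpoint's response time in milliseconds."""
    response = requests.get(
        f"{base_url}/search", params={"q": "ping"}, timeout=timeout
    )
    # Raising an exception here is how a probe or action reports failure.
    response.raise_for_status()
    return response.elapsed.total_seconds() * 1000
```

An experiment would then point at it with a `python` provider — `{"type": "python", "module": "chaosexample.probes", "func": "search_latency_ms", "arguments": {"base_url": "http://search.example.com"}}` — and put a tolerance on the returned value.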
There are two things I may have done differently. The first is not blocking per se, but it's a question I've always had with Python: I'm using a lot of exceptions — not as control flow necessarily, but they do sometimes have a control-flow aspect. I never quite know if that's a good approach. I've been doing that for 15 years, and I'm still not sure about exceptions versus returning an error the way Golang does. It's a pain in the ass as a consumer of that error, obviously, if you just have a code or a number, but at the same time, it makes the function even simpler to reason about. So that one is a question I still have, and probably will have for as long as I write Python. The other one I see as a problem — it was really the only issue I had with choosing Python initially. If I had chosen Golang, I could easily create a binary that people could download. With Python, it's not easy at all. The whole generation of an artifact with Python is a better story than it used to be, but it's still not perfect. It's not as lean and nice as what Golang gives you. Obviously, they have different philosophies and different ways of working, but still, it's a bit of a problem to me. I've managed, because people using Python programs are used to doing pip install or whatever you use. But I'm not completely happy about that, because I would love to be able to say: package the Chaos Toolkit wherever and run it. As a consumer, you just need one thing: the Chaos Toolkit, its extensions, and the experiment. Right now, it's a bit more involved if you want to do that. So what I did was use PyInstaller to create a bundle. That was fine — it was a learning curve as well, but it does work. It never feels quite right either, for some reason; I don't know why. So Python was a good choice. I would love for Python to be more and more used in the DevOps world again, versus Golang. Not that Golang is bad or anything, but Python is so nice when you start dealing with strings and dynamic payloads like that. It's just a breeze, really; it's fantastic. And now with type hints, it's really nice to code in Python, to be honest. So I don't have any issues with that. So the design hasn't changed that much. Actually, it hasn't changed at all.
I've been really happy, and I think we got it right, luckily. Yeah. There's been a lot of conversation
[00:42:50] Unknown:
over the years — and it seems to be coming up again recently — about the whole idea of inheritance versus composition, and whether or not you need classes, or if classes are useful in different contexts. So it's interesting to see that you have gone with a function-oriented approach, using those functions for composing the system together and making it easier to add in these different plugins, because you don't have to worry about accrued state as the objects are being passed around. Exactly. Exactly. The point was that I didn't want people to think about those questions. For them, it was basically: write a function. That function becomes an action,
[00:43:28] Unknown:
and it is an extension to the Chaos Toolkit now, period. And that has been very easy: a lot of people who've been using the Chaos Toolkit have said, I haven't done a lot of Python for a long time, but they've caught up really quickly. I think a function is something that pretty much any developer has run into in one way or another. Right? I didn't want to impose Python idioms on them too much. That's why I went away from classes, and I've been happy with it. The code is linear, in my opinion, and more robust.
[00:43:59] Unknown:
And in terms of building the toolkit in a way that lets you create safe and reversible experiments, what have been some of the most challenging aspects of that? Particularly as far as providing a rollback path for people who are implementing their own plugins or custom logic, and also for building experiments in a way that allows people to have the confidence that they can run an experiment without irreversibly changing their system.
[00:44:29] Unknown:
It's always interesting, because we got that question recently as well, and it's a very sound question. That's probably where we are more on the exploratory side than on the testing side — because it's exploratory, you're exploring. By default, you don't know what you're going to break. So it's really, really hard to say in advance: I know I'm going to put it back. So while the Chaos Toolkit talks about rollbacks, I think, more and more, that was the wrong choice of word. We talk more about remediation, cleanup, or undoing, in the sense that what we promise in terms of API, of interface — or what you should think about when you do chaos engineering anyway — is: I'm going to impose a degradation on the system, and what I can do is reverse that degradation.
It doesn't mean I'm putting the system back on track, because you're trying to figure out what it meant that you degraded the system. So I certainly see your point about fail-safes and, again, coming back to not being too afraid about what you're going to do. But because you want to learn, you need to take the risk. It's a leap of faith, basically — you learn that way. You learn by doing and, yeah, hurting yourself initially, exactly. So that's why, as we said, we need to do that within a small perimeter initially. But the more you're doing it, the less afraid of it you are. And are all the mechanisms already in place? Say you run something and you try to put the system back on track. Shouldn't that be automatic?
Why is putting it back on track part of your experiment? I should think that if something is down, the system should be almost self-healing, to a point, anyway. Right? So it's dangerous to say: I'm going to put it back for you as part of the experiment, because it prevents you, in a way, from learning that you don't have those mechanisms in place. If a user falls into a trap and starts breaking something, just because there was a bug or anything, do you think they're going to try to put it back? No, they're not. Right? So what you need is to learn: could I learn about that issue beforehand? Could I get alerted?
Could I actually self-heal, resolve it? Maybe put that side of the website in maintenance automatically, just so that further damage isn't done. So the whole discussion about rollbacks should be thought of in these terms: let's say I've created latency on that service; my rollback is to remove that latency, and that's pretty much the perimeter of the rollback, at least as a promise. I don't mean to say you can't try to put a few safeguards in place, but I think that's out of band for the Chaos Toolkit and chaos engineering. It may come. Maybe in the future we'll get better at this, and the industry will say: yes, we can go as far as that. For example, one of our goals with the Open Chaos Initiative is to go to places like Google Cloud, Amazon, and all those folks and say: right, you should be able to provide experiments proving that your system can cope with x or y. But, also, you should probably expose a rollback so that, if this happens, you're basically telling me how you're going to cope with it. And that puts things in perspective, because you're now relying on them to put things in place — not your application, but at least their side of the infrastructure.
So you see, my point here is: the experiment surfaces what you don't know and what your assumptions are, but you should also challenge the fact that you have to do everything yourself. That's why it's so interesting to have more microservices, in my opinion. A lot of people say: microservices, blah blah blah, they are bad. But one thing they did well, to me, was to say: as a developer, you can't decide everything anymore. Because before, with a monolith, you made all the decisions inside, and as an app, I couldn't see what you were doing. But now, because you've split that into smaller pieces, I can question the choices you've made. You have less surface on which to make a decision. That can be good or not. But the first thing we do as developers, truly, is say: I'm going to fix that. Yeah, but your fix may not be the right fix.
Maybe it's not that fix at all that we need. Maybe it needs to happen beforehand. Right? But chaos engineering is going to help you with that as well — to say: as a system, we need to prove the system copes with x or y. We need to do that because we've made a promise to our users. But that's pretty much how far a chaos engineering experiment is going to go. Everything else is on your shoulders, or on your service provider's shoulders — probably on both, really. So it's interesting, because now you start seeing the perimeter you have — the functional perimeter, the application perimeter — much more clearly, thanks to chaos engineering. You don't assume you have to care for all of it, basically.
[00:49:35] Unknown:
And in terms of your own experience and lessons learned in the process of building and maintaining the Chaos Toolkit project and the organization that you're building on top of it, what have been some of the most interesting or useful or unexpected lessons?
[00:49:52] Unknown:
So there are a few things. Knowing that your users are actually using your tool — in the open source world, it's not easy. It's not easy because people grab your project and run it, but they don't always come back and say whether it's bad or good. Right? So you don't really know much about your tool or what people see in it. And that's sometimes frustrating, because you'd be happy with criticism, but you don't even see it; and you'd be happy with surprises, and although you see some of them, you don't see all of them either. Right? And that means you need to be patient and to have passion about what you're doing, because you don't know what people do.
So that's always, yeah, sort of frustrating. Sometimes you wish you could know a bit more. So you don't really have visibility on what people do unless they tell you explicitly. One thing that has always been on my mind is that we need a nice and kind and civil community. I don't like, and I don't appreciate, when people come and barge in and criticize. There is a tone, basically, to asking things. And I say that because I don't think there are stupid questions. I never think there are stupid questions. Usually, when a question is asked many times, it tends to say the documentation is bad or the project has a problem — there is a bug here. But sometimes I think the tone matters, because you're still doing open source. Right? You're doing it because you love it, and you'd hope people are nice about it. Luckily, in this community, rudeness is extremely, extremely rare.
Pretty much everyone has been very happy, very nice, and very, very patient with us. So that's fantastic. I think the architecture we've set up also means we've seen a lot of contributions that we wouldn't have seen otherwise, so I'm really happy about that. That was a good surprise. So we've got quite a few people not just using, but actually contributing to, the extensions. So that's been good. Yeah. Other than that, it's been all positive all around, really, that experience so far. It's been really fun. So we are thinking about moving the governance from us — or mostly me, basically, making all the decisions — to a more open governance, where, for extensions, maybe people could take the lead on their own extensions and basically not have me doing the review for everything. And
[00:52:26] Unknown:
in addition to the Chaos Toolkit and the various extensions that you've built for plugging into it, you have also started a draft proposal for a common specification for chaos engineering, particularly in terms of the agreed-upon language to use around how experiments are created, and ways to ensure that everyone is using the same format for conversing about chaos engineering. So can you talk briefly about that, and also about any plans that you have for the future of that specification
[00:53:01] Unknown:
and the Chaos Toolkit project itself? Of course. So, yeah, it's called the Open Chaos Initiative, and we call it an initiative, not a manifesto, because we want it to be short-lived, hopefully. The idea is really to federate the industry around a shared vocabulary. We reckon that helping people get their vocabulary right means we all talk about the same thing, and that's actually more important than you'd think. Words have meaning, and it's difficult to have a discussion when you're not talking about the same thing. Words give a remit, give a scope, and that means things that fall outside that scope can be discussed elsewhere. It doesn't mean they are not important, but it means we can focus on one aspect of chaos engineering rather than having an ad-lib discussion.
The point is, we think chaos engineering, resiliency, and reliability should be an industry-wide discussion. We don't think ChaosIQ, or any other provider, is meant to say: do chaos engineering this way. What we say is that the industry has a ton of knowledge, a ton of information, a ton of ways of dealing with these things. If we could start sharing at least the vocabulary, what it means, and how we learn from each other, then we start making the whole industry better, right? We make it more aware of these problems. I think it's a bit similar to what the DevOps culture has done, and to what we see coming up with SREs. So the sort of thing we're looking at is asking: can we all agree on a vocabulary, on a way to think? That would be fantastic, and that's what we're trying to do. So if people from across the industry are listening to this, please join us on the Chaos Toolkit Slack, because we want your input on that. As for the project, the Chaos Toolkit itself, we continue evolving it at its own pace. I think we've got a 1.0 which is way overdue, and I haven't had the chance to actually release it.
We want to say: this is a milestone. It has worked out really well, thanks to all the contributors, and you can bet on it. Even if we change things, you can always come back and fall back to that version. So I think it's important to leave the 0.x versioning scheme and move into the 1.0 versioning scheme. That's probably the major short-term thing. A longer-term goal is to ensure that whatever the Open Chaos Initiative comes up with in terms of specification, the Chaos Toolkit applies and implements, so that you have a de facto standard implementation of those findings.
It will remain open source, you know, for a long time. We have plans to talk about hosting it at the CNCF, and to consider whether the governance should live there, because we think it fits with the goals of the CNCF. So that's the sort of thing we're looking at this year, definitely. Quite an exciting year in which to grow the project quite extensively.
[00:56:19] Unknown:
And are there any other aspects of chaos engineering, the Chaos Toolkit, or the Open Chaos Initiative that we didn't cover yet that you'd like to discuss before we close out the show? I think we have covered everything. The important,
[00:56:33] Unknown:
I think, part is that the Chaos Toolkit is a CLI, and it's just a year old, but to me it feels like it's been there for years, because people talk about the toolkit now, and that's fantastic. But what we have coming this year is the Chaos Platform, which is the big sister of the toolkit, where we want to take things to the next level so that you can install the platform and start having that visibility. So from a future perspective, the platform is where I'm going to be mostly this year, more than on the toolkit.
[00:57:07] Unknown:
That doesn't mean the toolkit is going to be left alone, definitely not, but the new development will happen on the platform, and it's going to be exciting. Really fun. And for anybody who wants to follow along with you and get in touch or ask any questions about the projects that you're involved with, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose a movie I watched recently called Time Trap. It was a very interesting sci-fi movie about a group of people who end up in a cave that has some weird aspects to how time flows internally and externally. I'm not going to give away too much more, other than to say that it was just a very interesting movie, a different take than a lot of the movies I'm used to watching. It was a lot of fun, and I recommend it for anybody who's looking for something to make you think, and just for a fun sci-fi film. So with that, I'll pass it to you, Sylvain. Do you have any picks this week? Very good. Very good. I'll make sure to look it up.
[00:58:09] Unknown:
I think my pick this week is actually that I'm trying to get back to playing my guitar. I've left it in its case for a long time, and I realized how much I miss it. So I picked it up the other day, and it was fun, although it really hurt my fingers because I haven't played in a long time. But, yeah, I'm trying, I guess, to force myself to get away from the machine from time to time. However exciting the machine is, and developing open source and all that, I'm realizing there is a life outside this tiny box. So that's my pick: doing something else.
[00:58:51] Unknown:
Definitely good advice, and something that bears repeating every now and then, so thank you for that reminder to everybody who works in this industry. And thank you again for taking the time today to talk about your work on the Chaos Toolkit and chaos engineering. It's definitely something I'll be taking a closer look at, so I appreciate that, and I hope you enjoy the rest of your day. Thank you very much, Tobias. Thank you for having me.
Introduction to Sylvain Hellegouarch and Chaos Toolkit
Understanding Chaos Engineering
Motivation Behind Chaos Toolkit
Comparison with Other Chaos Engineering Tools
Workflow for Using Chaos Toolkit
Running Experiments in Production vs. Preproduction
Strategies for Identifying Points of Failure
Information Captured by Chaos Toolkit
Implementation and Evolution of Chaos Toolkit
Challenges in Creating Safe and Reversible Experiments
Lessons Learned from Building Chaos Toolkit
Open Chaos Initiative and Future Plans
Closing Remarks and Picks