StackStorm with Tomaž Muraus and Patrick Hoolboom

Hello, and welcome to podcast.init,

the podcast about Python and the people who make it great. You can subscribe to our show on iTunes, Stitcher, TuneIn Radio, or add our RSS feed to your podcatcher of choice.

You can also follow us on Twitter or Google Plus, and please give us feedback. Leave us a review on iTunes to help other people find the show, send us a tweet or an email, or leave us a message on Google Plus. You can also join our community. Visit discourse.pythonpodcast.com

for your opportunity to find out about upcoming guests, suggest questions, propose show ideas, and follow up with past guests.

I'd like to thank everyone who has donated to the show. Your contributions help us make the show sustainable.

For details on how to support the show, you can visit our site at pythonpodcast.com.

Linode is sponsoring us this week. Check them out at linode.com/podcastinit

and get a $20 credit to try out their fast and reliable Linux virtual servers for your next project.

Coming up in Boston on May 21st and 22nd is the Open Data Science Conference.

So if you are in the area and you're interested in learning about how Python can be used in the areas of data science, then you should check it out. You can use the discount code

EP for 20% off your ticket.

That's odsc.com.

And your hosts, as usual, are Tobias Macey and Chris Patti. Today, we're interviewing Tomaž Muraus and Patrick Hoolboom about the StackStorm project, which is an event-driven system automation framework.

So could you please introduce yourselves? How about you go first, Patrick?

Yeah. I'm Patrick Hoolboom. I'm a DevOps engineer at StackStorm.

I like to jokingly say I've been doing DevOps since before it was called DevOps.

So I'm very passionate

about auto remediation, event driven automation, and so I'm really excited to be a part of the StackStorm project.

Remember when they used to call us generalists?

Yeah. I remember when they used to call me a lot of things.

And, Tomaž, how about you? Alright. I'm Tomaž Muraus. I'm one of the engineers at StackStorm.

I started with the whole, I would say, like, system admin stuff, as it was called back in the day, and

programming pretty early on when, like, my father got me a 286.

At the beginning, it was Linux and stuff, then

it started with the IRC and, like, the

I don't know how much should I dive into the

background right now or Whatever you think is relevant or as much as you wanna share.

Okay. So

I guess, for me, one interesting starting point was when I got the Internet connection, and it kinda started with IRC.

At that time, one of my first programming slash scripting languages was Tcl,

which I don't think many people use these days, but back in the day, it was used to write

the scripts for those IRC bots. And I think this is kinda interesting because this was probably, like, 12 years ago, and this was, like, the really early days of ChatOps. At that time, it was mostly, like, querying information from the system, like, uptime

and quizzes and things like that, but I would call this the early ChatOps days.

And I think Tcl as a language is also interesting because

I know the SQLite project uses it for integration testing, and also

the Redis project also uses it for integration tests. So even though not many people use it these days, it's still used in some specific areas.

So after the IRC, there was a whole PHP websites era. Did a bunch of that.

And after that, I found, like, the Django project because I was doing a bunch of websites and also, like, some fun and stuff. And

with Django, I kinda, like, learned about Python.

That's kinda

how it started for me, and maybe we can talk more about it later.

So how did you get introduced to Python?

So I was actually I think I was mostly looking how to, like, make it easier to develop websites. And I guess at that time,

one of the hot things was Ruby on Rails and also Django. And I was already slightly exposed to Python for some system automation and things like that. So I was thinking, like, okay. Let's see if I can use, like, that knowledge and

figure Django out, and that's kinda how it started.

And I really love the simplicity of Django. This was, like, end of the web days when you, like, used PHP and there were, like, really many PHP frameworks. There was a lot of duplication. There were, like, no MVC frameworks or many things like that. So it was a lot better experience than

PHP.

Yeah. For me, you know, I actually come from a Perl background, so it's the the old school systems administration scripting and things like that, Perl and Bash and all that.

As I started looking for

other jobs or other things, I noticed that Python was kind of the new hotness as Facebook and a lot of Google guys and stuff like that were using it. So I started doing Python on the side. And then when I came on board with StackStorm, it was our primary language. So it was really when it picked up for me.

So what is StackStorm, and what problems does it solve?

Okay. Sure. So I would say a lot of people from our team come from, like, different backgrounds. We have experience with operating software-as-a-service products and running those. And, also, we have experience, like, from large enterprises where you potentially have a lot of, like, legacy software. And StackStorm kinda, like, allows you to glue all of these different environments and potentially all these different tools together and, like, automate them. So this could be, like, I would say, these new tools which are popular, like Chef, Puppet, and things like that, but it can also be, like, a bunch of legacy tools enterprises have. So StackStorm allows you to, like, integrate all of that together in an easy manner and then do all kinds of cool automation with that.

I kinda get on my soapbox at this point. I like to preach a little bit. One of the big things that we're bringing is we're bringing

transparency

and collaboration

to operational patterns. So when you look at the way that a typical sysadmin works, they're no longer just shelled into a box and working in this black box and then reporting back what they did at the end of the day. They're now doing this through workflows that they're sharing with their team, and they're doing it all via ChatOps and things like that. So everyone's working together and everyone can see what each other are doing.

So, yeah, Patrick had a good point. So, basically, one of the things we're trying to do is... like, I would say in the old days, or a lot of companies still do it today, they have, like, these runbooks where it says, if something goes wrong, perform these steps, and, like, usually a person runs

these commands manually. But with StackStorm, we allow you to automate it, which means we allow you to codify this process, which means you can, like, share these different ops practices with other people. And on top of that, as Patrick already mentioned, you get visibility into that and audit and everything which comes with that.

And so does StackStorm actually have a facility for being able to define runbooks and have those stored in a central location to be able to maybe

correlate that with a given event that's happening so that you can then,

know which particular routines need to be run. I'm assuming that most of that is intended to be automated, but for the case of where somebody has not yet gotten to the point of actually automating it entirely, I'm curious how that plays out.

Sure. So maybe I wanna give you, like, some background on some basic concepts in StackStorm.

One of the concepts is an event, or a trigger as we call it. This is basically some kind of information which comes into the system. This could be, like, a monitoring check was triggered. Someone posted a message on Twitter.

Something else has

happened. And another thing is an action, which is basically a response to this event. An action could be open a Jira ticket, run some kind of workflow. Workflow is another thing we'll dive into later. And then there's another thing which is called rules. Rules basically allow you to glue those together. Rules allow you to say, if this event has happened, run this action or run this workflow. And as far as workflows go, workflows are basically

a bunch of steps

codified in, in our case, in a YAML file.
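
A minimal Python sketch of the trigger, action, and rule concepts described above. The class names, the trigger type string, and the `open_ticket` function are illustrative assumptions for this sketch, not StackStorm's actual API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trigger:
    """An event coming into the system, e.g. a monitoring check firing."""
    type: str
    payload: Dict[str, str]


@dataclass
class Rule:
    """Glue between the two: if a trigger of this type arrives, run this action."""
    trigger_type: str
    action: Callable[[Dict[str, str]], str]


def open_ticket(payload: Dict[str, str]) -> str:
    # Hypothetical stand-in for a real action such as opening a Jira ticket.
    return f"ticket opened for {payload['host']}"


def dispatch(trigger: Trigger, rules: List[Rule]) -> List[str]:
    """Run every action whose rule matches the incoming trigger."""
    return [r.action(trigger.payload) for r in rules if r.trigger_type == trigger.type]


rules = [Rule(trigger_type="monitoring.check_failed", action=open_ticket)]
results = dispatch(Trigger("monitoring.check_failed", {"host": "web01"}), rules)
print(results)
```

A trigger whose type matches no rule simply produces no actions, which mirrors the idea that events can arrive without anything being run in response.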

The common pattern that we have people break things down into if they're not quite ready for a full auto remediation, which I believe is kinda what you're asking,

what we push people towards then is what we call facilitated troubleshooting. And that's where you have an event come into the system still, and the system will collect a little bit of diagnostic data and make some recommendations. And it can do that via email or chat or SMS or whatever your preferred communication mechanism is. And then you, as the human, will go fire off those remediation workflows or tasks that need to be done by hand after getting that recommendation.

See, that's really interesting. That's kinda what I was gonna ask you is when I think about when I've worked in organizations that had really effective,

sort of, you know, production management processes, I guess, that were fairly rigorized,

delineated,

you know, like, now we're in diagnosis phase. This person is running comms. This person is you know what I mean? Like like, there were roles assigned and they were fairly, sort of, well defined,

demarcated

phases of troubleshooting

and fixing the problem?

And would StackStorm

sort of enable that kind of playbook as well?

So yeah. Basically, I wanted to say, in an ideal world, everything would be automated, and it's also our end goal. But we're trying to approach it with an iterative approach. So usually, the first step is potentially getting data and, I guess, transmitting this data to the operator, to the sysadmin, things like that, and then allowing the person to manually act on that. So this is, like, a first step. Then usually, the next step is automating a simple task, like checking if a disk is full and, like, potentially deleting logs and things like that, until you, like, get more advanced and more and more stuff is automated.
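
As a rough illustration of that first step (collect data, then recommend, and leave the fix to a human), here is a small Python sketch. The function names and the 90% threshold are assumptions made up for this example; it only reports and does not delete anything:

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return how full the filesystem containing `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def recommend(percent_used: float, threshold: float = 90.0) -> str:
    """Turn the diagnostic number into a recommendation for the operator."""
    if percent_used >= threshold:
        return "disk nearly full: consider deleting old logs"
    return "disk usage ok"


# Diagnostic step only; the operator decides whether to act on the message.
print(recommend(disk_usage_percent("/")))
```

Once a team trusts the recommendation, the same check could be wired to a cleanup action instead of a message, which is the iterative path described above.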

What you described there is really a well-oiled war room. Right? It's a war room that's well orchestrated and everyone knows their role or someone's clearly

calling out those roles for people during an outage, during an incident. And StackStorm really does facilitate that. It allows you to collect a lot of that data, and StackStorm itself becomes

another soldier in the war room, if you will. It's working alongside you. It's collecting data. It's resolving issues that it can. It's doing those things, and it's sharing those for you in those collaborative fashions. So it definitely does help with that.

That's great. You know, talking about the roles in the war room, I remember a presentation I saw in this stuff that we actually used to use. You know, it's basically, like, okay, who gets to be Kirk, who gets to be Spock, and who gets to be Uhura? You know, it's it's kind of like putting a geeky sheen on the whole thing, but it but it actually sort of gets to the meat of something very

critical about who's doing what, who is assigned to what role. And if everyone

knows what their roles are, it just makes the process flow much more smoothly. So having something having intelligence that enabled that

baked into

some interface that's running in the chat, because that's where everyone in in a really efficient organization comes together on this stuff anyway,

that sounds like it would be tremendously helpful.

It's hugely helpful. And even then, the other thing that that StackStorm allows you to do is it it cuts the low hanging fruit. Right? So StackStorm goes, and by doing that, it it'll go collect a bunch of the diagnostic data for you. So instead of you having to say, hey, Jim or hey, Sally, can you go look at these servers and collect these logs and get this data for us? StackStorm is gonna collect that for you and and spit it out in chat or spit it out in email so the team can see it. And the team can already be working towards a resolution

or working on the more complicated problems while the system is automating the collection of basic data and basic diagnostics.

What was the inspiration for creating StackStorm in the first place, and what are some of the biggest architectural and design challenges that were encountered during the process of building it?

So I can speak a little bit to inspiration, and then I'll kinda pass the baton for challenges over to Tomaž. So,

really, the inspiration for StackStorm was that we saw there was a need for operations processes to be improved.

Like I said, 1 of the biggest problems that we saw was that there's an outage that happens, and the ops guys get together in a war room, and they start working on an issue. And they go about their different ways, and they're doing things, and they're assuming their standard role.

And at the end of the day, they come back and they report back what they did,

so that the postmortem can happen. But that's not really transparent. Right? And so we saw that there's really a place there where these things can be improved. And the other thing is that we realized that an engineer doing something like that is not able to properly leverage himself across a number of different data centers or a number of different servers within the data center in order to be able to work as efficiently as possible.

As far as our architecture challenges go, I will first, like, just give... I know this is a Python podcast, but I'll, like, give some reasons why I decided to go with

Python.

So a lot of us in the team like Python because it allows us to be productive and things like that. And in the end, I think when you're designing a product like this, in the beginning phase, the most important thing is developer productivity, and it allows you to get a prototype out fast.

Besides that, the next important thing for any kind of product is the architecture. This means some basic principles. Usually, in a distributed product like this, this means horizontal scalability and high availability. And for those things, the actual language doesn't matter at all. It's the architecture which matters. In our case, we also decided to, like, follow those horizontally scalable and highly available principles, which means we went with

the practice which people today call, kind of, microservices or services.

So we have a bunch of services with a pretty strictly defined API, which

communicate with each other, in our case, over a message bus. So you could view a message bus. In our case, the message bus is really the

potential limitation

because there's a thing which could get saturated.

But other components are horizontally scalable, which means,

we have a thing which allows you to execute the workflows and actions. And this component is horizontally scalable. So if you wanna increase the throughput, you basically run more instances of the service.
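
The scaling idea here (more runner instances pulling from the same bus means more throughput) can be sketched in Python. This is only an analogy under stated assumptions: an in-process `queue.Queue` stands in for the message bus, threads stand in for separate service instances, and `run_workers` is a hypothetical helper name:

```python
import queue
import threading


def run_workers(jobs, num_workers):
    """Process all jobs with `num_workers` identical workers sharing one queue."""
    bus = queue.Queue()  # in-process stand-in for the message bus
    done = []
    lock = threading.Lock()

    def worker(name):
        while True:
            job = bus.get()
            if job is None:  # sentinel: no more work for this worker
                break
            with lock:
                done.append(f"{name} ran {job}")

    threads = [
        threading.Thread(target=worker, args=(f"runner-{i}",))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for job in jobs:
        bus.put(job)
    for _ in threads:
        bus.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return done


completed = run_workers(["restart-nginx", "rotate-logs", "open-ticket"], num_workers=2)
print(len(completed), "jobs completed")
```

Adding workers does not change the producer side at all, which is why the shared bus, rather than the workers, becomes the component that can saturate.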

And what are you using for the message bus? Is it just RabbitMQ or is there something else that you need to put in place for that? So, yeah, it's RabbitMQ.

We actually used the Oslo messaging library from OpenStack, which kinda allows you to be, kinda like an ORM, like,

underlying-software agnostic. But in real life, as soon as you get to some scale and you wanna, like, leverage some more native functionality like queue mirroring and things like that, then you need to basically

tie yourself to a particular software. So, yeah, in our case, it's RabbitMQ.

So what made you choose Python for StackStorm's implementation rather than another language like Go?

Yeah. I know Go is pretty popular these days. When we started, it was already kinda, I think, starting to pick up, but I think our team had more experience in Python.

And at that time, the Go ecosystem was still pretty new. It's kinda like the early days with Node.js. In the past, I worked, like, on a project when Node.js was really young, and we had a lot of issues related to that. And we needed to, like, reinvent libraries for parsing XML and things like that. So in short, it was, like, the ecosystem and the community, which is really important, because Go, until, like, not a long time ago, didn't really even have a good debugger. And, like, things like that are really important, especially if you wanna get, like, something out fast to validate the idea that's gonna help customers and things like that.

I totally agree. And I and I also think that Python just offers

a higher level of abstraction than Go does. I mean, I know Go's adherents say this is a feature, not a bug, that Go forces you to think in the small, but sometimes you don't wanna think in the small. That's a great answer. Thank you.

And also the fact that Python has been used for system automation and systems

administration for a number of years, particularly in terms of, like, Linux installers. I know that a lot of the, you know, Anaconda for Red Hat and

the Ubuntu 1 are all based on Python. So it's something that is readily available on pretty much any, at least, Linux system that you're going to be running up against.

Yeah. Another thing I forgot to mention is also, like, the integrations. One of the really cool things behind StackStorm is we integrate with all these systems and tools. And integrations can also be written by users. And as you have said, a lot of system administrators know Python, so

we can leverage that.

Another idea is you could basically write StackStorm actions in any language, but we have this, like, first-class citizen support for Python actions. So our idea, or one of the cool ideas, is we allow you to reuse your existing scripts. If you have some Python script to perform some action, maybe you use Fabric to run a deployment and things like that, you basically just need to add a little metadata

for StackStorm, and then you can already

reuse those scripts.
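
To make the "existing script plus a little metadata" idea concrete, here is a hedged Python sketch. The metadata layout, the `run_action` helper, and the `deploy` function are all hypothetical and invented for illustration; this is not StackStorm's actual metadata schema or runner:

```python
def deploy(host: str, version: str = "latest") -> str:
    """Pre-existing script logic, left unchanged."""
    return f"deployed {version} to {host}"


# Hypothetical metadata describing the script; NOT StackStorm's real schema.
DEPLOY_METADATA = {
    "name": "deploy",
    "description": "Deploy a version to a host",
    "parameters": {
        "host": {"type": "string", "required": True},
        "version": {"type": "string", "required": False},
    },
}


def run_action(func, metadata, **params):
    """Validate parameters against the metadata, then call the existing script."""
    spec = metadata["parameters"]
    unknown = set(params) - set(spec)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    missing = [name for name, p in spec.items() if p["required"] and name not in params]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    return func(**params)


print(run_action(deploy, DEPLOY_METADATA, host="web01"))
```

The point of the design is that the script itself stays untouched; only the description around it is new.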

Yeah. It's definitely an important feature to be able to

take advantage of work that's already been done rather than telling somebody, oh, if you wanna use this tool, you need to throw away everything that you've been working on for the past couple of years and orient it all around the way that we've set up. So that's good that it provides that sort of ease of, extension.

Can you dig a little bit more into the architecture of Stack Storm and what the setup looks like and the different components that are involved?

Sure. So in our case, as I previously mentioned, we have these 3 concepts, the rule,

the action, and,

another thing I didn't mention is the sensors.

So sensors basically allow you to retrieve information from a 3rd party system. Sensors basically generate events or so called triggers.

The sensors could basically be, like, polling some third-party system, or we have, like, inbound, kinda push-based sensors, which means

you can send events to us via webhook or similar.
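
A minimal sketch of the polling style of sensor, assuming a hypothetical `fetch` callable that returns item IDs from some third-party system. The function name and the trigger dictionary shape are assumptions for illustration, not StackStorm's sensor interface:

```python
from typing import Callable, Dict, List, Set


def poll_sensor(fetch: Callable[[], List[str]], seen: Set[str]) -> List[Dict[str, str]]:
    """Poll the external system once and emit a trigger for each unseen item."""
    triggers = []
    for item_id in fetch():
        if item_id not in seen:
            seen.add(item_id)  # remember it so the next poll does not re-emit
            triggers.append({"trigger": "example.new_item", "id": item_id})
    return triggers


# Fake third-party system whose contents grow between polls.
remote_items = ["alert-1", "alert-2"]
seen_ids: Set[str] = set()

first_poll = poll_sensor(lambda: list(remote_items), seen_ids)
remote_items.append("alert-3")
second_poll = poll_sensor(lambda: list(remote_items), seen_ids)
print(len(first_poll), len(second_poll))
```

A push-based sensor would skip the polling loop entirely and build the same kind of trigger from the body of an incoming webhook request.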

So as far as the architecture goes, we have a couple of processes,

couple of services. One is the action runner. This is a service which is responsible for

scheduling and running actions.

Another one is the rules engine.

So in the rules engine, we have this, like, I don't wanna call it DSL because,

some people might hate it, but we have this,

let's call it a language,

which allows you to say, let's say, if this string contains this or matches this regular expression, and simple operators like that. So we have a service which is responsible for evaluating those

rules against the incoming events. So seeing if it matches. And if it matches, it schedules an action execution.
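
The kind of operator-based matching described here can be sketched as a small lookup table in Python. The operator names and the criteria layout are assumptions chosen for this example and are not claimed to match StackStorm's rule syntax:

```python
import re

# Hypothetical operator table; names are illustrative only.
OPERATORS = {
    "equals": lambda value, expected: value == expected,
    "contains": lambda value, expected: expected in value,
    "regex": lambda value, pattern: re.search(pattern, value) is not None,
}


def matches(event, criteria):
    """Return True only if every criterion holds for the event's fields."""
    return all(
        field in event and OPERATORS[c["op"]](event[field], c["value"])
        for field, c in criteria.items()
    )


criteria = {
    "service": {"op": "equals", "value": "nginx"},
    "message": {"op": "regex", "value": r"disk .* full"},
}
event = {"service": "nginx", "message": "disk /var is full"}
print(matches(event, criteria))
```

When `matches` returns true, a rules engine in this style would hand the event's payload to whatever action the rule names.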

And another component we have is an API.

So we allow you to interact with StackStorm via the API.

We also have, actually, a web UI and a ChatOps interface, but we can dive into ChatOps later. Besides that, we have a sensor container. That's a service which is responsible for scheduling and running sensors.

So the

let's call it the long-running processes which potentially poll other systems for information.

And, yeah, as a message bus, we use RabbitMQ.

And as far as the database goes for our, I would say, metadata or cache or operation data, we use MongoDB. I know, like, some people are not fans of MongoDB, but

right now it works for us because we only mostly store, like, metadata, which is

which is relatively,

it's not write heavy. It's mostly reads and a small amount of data, so it's not a big deal yet. And then we have another component, which is called OpenStack Mistral. That's our workflow engine for basically executing

all kinds of workflow.

And that's actually an open, upstream OpenStack project, and we have a couple of

upstream OpenStack,

developers working on, OpenStack Mistral, which is a pretty important component of our system.

And you mentioned that part of the setup is a web UI. So, I'm curious what sorts of information and capabilities that exposes.

So, yeah, we kinda started, I would say, with graceful degradation. So the first step was just providing all this information: which action happened when and potentially who triggered it. So kinda like an operational dashboard.

It is kinda cool because usually you have, like, 5 different monitoring dashboards and you need to go check through all of them. But in this case, you can kinda see all these different events from different systems in a single dashboard.

And now we also allow you to do simple things like potentially run an action. Or let's say some action was automatically scheduled by StackStorm, but it failed, and you can kinda, like, easily rerun it, like, manually to potentially debug it or see what happened.

Other than chat driven events, what types of event sources does StackStorm

support?

And what use cases do these alternate event streams enable?

So events are actually a really interesting topic to me because

anything can be an event.

An event could be a monitoring event, which is a fairly classical use case that you think of, or or something within chat. But an event could

be

a change in state on on a record in a database. An event could be that the user actually chooses to run something. Right? So the event could also be internal events within StackStorm itself, such as

an action execution finished or

a value in our internal data store or key-value store expired, things like that. So, really,

the sky is the limit when it comes to determining what is an event. And I think that that's really where it comes down to the user, because the extensibility of the StackStorm system, with how easy it is to write sensors and to write the integration points, makes it so that

you can turn anything into an event.

That's very cool. Like, I can definitely see the advantage of something like that when you think about managing systems complexity with so many companies and and other groups

moving to a service oriented architecture and microservices

and that whole model, when you have all these services interacting in interesting ways, well, there's plenty of tools out there for things like message queuing and the like. But having a system like this where

an event can be something generated from any source,

I can see I can easily see use cases

in these SOA architectures

where something like this could come in incredibly handy for what a developer might think of as janitorial work, but which, you know, the folks who live in ops might, you know, deem

very important indeed. That's very cool.

So one of the ideas behind events is we basically want you to push every event to our system. You don't necessarily need to perform any kind of action on this event. But once we have all those events in the system, we can do all kinds of cool correlation. We could potentially, like, see this application failure is maybe related to a deployment or maybe, if you could,

potentially tie it into the payroll system, you can see maybe

you've seen fewer commits because developers haven't got paid yet or something like that. But, basically, that's the whole idea. I would say our, like, next goal once we get these basics out is providing kinda, like, high-level correlations

based on all these kinds of event streams.
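
As a hedged illustration of what a simple correlation over stored events might look like, the Python sketch below pairs application failures with deployments that happened shortly before them. The event fields, the type names, and the ten-minute window are all assumptions for this example; the speakers describe correlation as a future goal and do not specify how it would work:

```python
def failures_after_deploys(events, window_seconds=600):
    """Pair each app failure with any deployment in the preceding time window."""
    deploys = [e for e in events if e["type"] == "deployment"]
    pairs = []
    for e in events:
        if e["type"] != "app_failure":
            continue
        for d in deploys:
            # A failure counts as related if it follows the deploy within the window.
            if 0 <= e["time"] - d["time"] <= window_seconds:
                pairs.append((d["id"], e["id"]))
    return pairs


events = [
    {"id": "deploy-1", "type": "deployment", "time": 1000},
    {"id": "fail-1", "type": "app_failure", "time": 1200},
    {"id": "fail-2", "type": "app_failure", "time": 5000},
]
related = failures_after_deploys(events)
print(related)
```

This only works if every event is pushed into one place, whether or not any rule acts on it, which is the motivation given above.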

One thing that I've seen in my own work and that can potentially be a bit of a problem when you do have a system that's as extensible and open ended as StackStorm

is it can be sometimes hard to determine where to start and what a base level implementation would look like because there are so many different directions that you can take it that it's kind of hard to know

exactly

how to get started and what areas to focus on first. So I'm wondering if anything in the documentation or anything in the

architecture or initial,

installation

of StackStorm provides any sort of guidance to people in that situation.

We provide some really good examples.

That's 1 thing. But the other thing too is that we have a number of blog articles about different use cases. So if you go read the StackStorm blog, you'll see things about tying into Nagios or Sensu and how to,

remediate a low disk space alert, we'll say, or how to remediate Cassandra cluster issues, and then all sorts of things like that. So there are a lot of good examples, but, really, the big thing that I hammer on a lot of people and I tell everyone is keep it simple.

You know? Yeah. Okay. You see this great event-driven automation platform, and the first thing you wanna do is you wanna make it so that it can rebuild an entire data center on its own. But why not pick the easy things, like a low disk space alert or a process that needs to be bounced because it ballooned in memory or it died or something like that. Pick those easy things. Look at

the things that you're getting paged about all the time by monitoring and that you resolve super fast with the same steps every time. Take those things and automate those. See how much easier that makes your life, and then move on from there to the more advanced use cases.

Right. And that also gives you the space and the

mental fortitude to have the time and inclination to actually do the extra work as well. Because if you're trying to go for the big fish and you're still spending too much time getting interrupted by

the easily fixable

recurring issues, then you're not really going to make much headway. But if you go ahead and clear up those small annoyances that are eating up minutes of your time but they're happening pretty frequently then it can definitely add up to some huge savings over the long run.

Exactly. Relieve the pager fatigue so that you have room to breathe.

And as far as, like, these things go, we have, like, this platform, which is great. It can do everything and stuff, but I think focus is really important. We also struggle with that as a company. We have this great platform and stuff, but in the end, the user wants to solve a problem. Let's say, like,

a Postgres or a MySQL slave goes down and they wanna spin a new one up or something like that. So the user, basically, in most cases, wants to solve a specific problem.

So over the lifetime of the company and the product, we've kind of been talking to our users and trying to figure out what to focus on.

And over the time, we decided to focus on auto remediation.

At the beginning, we also kinda did, like, continuous integration, continuous deployment. You can still do that with us, and I still think it's a good fit. But in the end, we decided to focus in 1 area and, like, make it really useful and build documentation and blog posts and things like that.

And,

you mentioned using it for CI and CD. I'm curious what the implementation of that looked like and sort of what approach you were taking with that.

So, yeah, I think Patrick should go with this question because he actually built our first, like, continuous integration and deployment pipeline for StackStorm, so basically dogfooding it.

Yeah. So

our first

CI/CD pipeline was built entirely using StackStorm.

There's no Jenkins in there. There's nothing else in there. And part of it was

that it was a really great dogfooding exercise for us, but the other thing was that I just wanted to do it to say I did it and to prove that I could do it. And so we built out

really this great

end-to-end CI/CD pipeline. Everything from a merge commit all the way through to packages getting promoted to production. And in between, we run the unit tests, we do full deployment tests and end-to-end tests.

All of that gets done in there, and it's all along the way. If there's a failure at any point, it'll stop and escalate to the humans and notify us via Slack, and we can step back in, fix the problem at that stage, and rerun from that point forward.
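
The stop, escalate, and rerun-from-that-point behavior described here can be sketched in a few lines of Python. The stage names, the `run_pipeline` helper, and the `notify` callback (standing in for a Slack notification) are hypothetical and are not taken from the actual StackStorm pipeline:

```python
def run_pipeline(stages, start_at=0, notify=print):
    """Run stages in order; on failure, notify and return the failing index."""
    for index in range(start_at, len(stages)):
        name, step = stages[index]
        try:
            step()
        except Exception as exc:
            notify(f"stage '{name}' failed: {exc}")  # escalate to the humans
            return index
    return None  # every stage from start_at onward succeeded


completed_stages = []
state = {"tests_broken": True}


def unit_tests():
    if state["tests_broken"]:
        raise RuntimeError("2 tests failed")
    completed_stages.append("unit_tests")


stages = [
    ("build", lambda: completed_stages.append("build")),
    ("unit_tests", unit_tests),
    ("promote", lambda: completed_stages.append("promote")),
]

messages = []
failed_at = run_pipeline(stages, notify=messages.append)

# A human fixes the problem, then the pipeline resumes from the failed stage.
state["tests_broken"] = False
final = run_pipeline(stages, start_at=failed_at, notify=messages.append)
print(completed_stages)
```

Returning the failing index is what lets the rerun skip the stages that already succeeded instead of starting over from the merge commit.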

So the home page describes StackStorm as being an event driven framework for automating the user's infrastructure. And I'm wondering what kinds of capabilities are made possible by this and

wondering if you think that it simplifies or complicates the work of operations engineers.

I really hope that it's simplifying the work of operations engineers because that was our goal.

Really,

our main use case these days, as Tomaž said earlier, is auto remediation, and that's mostly what we're focused on. But due to the flexibility of the platform and the fact that a single atomic action within StackStorm

could be anything from a complex workflow with branches and joins and retry logic all the way down to, you know, a one-line

bash script that you've written, it could be anything in there. You could use StackStorm for anything that requires

a workflow of sorts. So it could be CICD.

It could be security remediation and mitigation,

or it could be auto remediation like we've talked about. So really, anything where there's an "if this happens, then do this," you can use StackStorm for. And I'm really hoping that the transparency that the platform brings and the ease of writing integrations and all that

are simplifying

the job of operations engineers.

Yeah. My thinking for any ways in which it could potentially complicate the work is just because of the basically in terms of the initial implementation of trying to reason about how do you map together the different events and actions and also possibly

how do you make sure that StackStorm itself is sufficiently highly available

and resilient to any sorts of network failures or partitions or hardware issues?

Yeah. I mean, like any other tool, there's definitely some thought that needs to be put into high availability and, I mean, what do you do if the orchestrator of all orchestrators goes down? So, of course, you need to put some thought into that, and you need to have resiliency in there.

But, overall, as for when users are actually starting to use the system, I think that if they're

how do I put this? I guess, really, what it is is that there's an end user education aspect to this. Right? And if users are using this appropriately and they're following procedures that we've kind of pushed down towards them of sharing their patterns and version controlling all their workflows and doing

peer review with those workflows and everyone sees what's going on, I think that there's a much

shorter time to education across the entire team than there is generally when one engineer works on something and then has to go try and share it.

And have you seen any large adoption uptake in terms of,

developers using StackStorm and contributing

to implementations of StackStorm in an organization as opposed to it just being the realm of operations engineers?

Sure. So I can actually talk about the community side. I would say one of the really cool parts about StackStorm is not just the product and the team itself. It's also our community. We have a Slack channel with over 400 users, and those are, like, different users. Some users are from big companies, who use it to automate their different environments, and then we have other open source users, which also contribute a lot. So, yeah, we definitely have seen a lot of uptake from the community, and then we also have seen a lot of community contributions.

This goes from

integrations with GitHub to integrations with Microsoft SQL and many other systems. And another big part of our whole product is this thing we call st2contrib. It's basically a GitHub repository

with all those integrations.

And right now, I think we have probably more than 1,000 different actions

and integrations.

So is there a minimum or maximum size of infrastructure for which it would make sense to use StackStorm?

We haven't hit the maximum size yet, so I can't tell you on that, to be honest. There's really not a size that I would say is too big for StackStorm. And as for a minimum, I mean, I'm running an instance of StackStorm here that's allowing me to chat ops the control of my Hue light bulbs in my house. So

That is awesome.

I was most curious about the small end because of the fact that there are a number of different components that go into actually deploying the StackStorm

stack, if you will. So I wasn't sure if it would be easy and,

reasonable to run that all on a single node for the purposes of a small deployment of maybe a small handful of machines. But if you're using it for something as frivolous as your Hue light bulbs, then I suppose that would answer the question of the minimum deployment.

I'm envisioning your children figuring out how to get on to your local, you know, Slack or HipChat instance

to run denial of service attacks on your Hue light bulbs.

You laugh, but there's actually no figuring out. We use Slack at my house as a communication tool for my entire family. And so my kids do flip the lights on and off via Slack, and they will play around with that. So it's kinda funny, but they already do it.

Wow.

It looks like StackStorm is made up of a number of discrete components.

What are the components used to communicate, and how did those choices influence the design of StackStorm's

overall architecture?

So in our case, the components communicate over the message bus, so we kinda follow this message passing architecture. And

this influenced the architecture in that basically everything is separate and decoupled, which

allows us to scale things horizontally.

And on top of that, because each of the services has a well defined API, this also allows the developers to potentially focus on a single service. So it doesn't just impact, I would say, the

operation itself, but also the development process.
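
To make the decoupling concrete, here is a minimal sketch of two services that only ever talk to a message bus and never to each other. It is an in-memory stand-in written for illustration, not StackStorm code and not RabbitMQ: the `MessageBus` class, the topic names, and the message fields are all assumptions.

```python
from collections import defaultdict, deque


class MessageBus:
    """Toy in-memory bus: one FIFO queue per topic (hypothetical)."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def publish(self, topic, message):
        self._queues[topic].append(message)

    def consume(self, topic):
        """Return the next message on a topic, or None if it is empty."""
        pending = self._queues[topic]
        return pending.popleft() if pending else None


def rules_service(bus):
    """Read one event and, if it is critical, request an action."""
    event = bus.consume("events")
    if event is not None and event["severity"] == "critical":
        bus.publish(
            "action_requests",
            {"action": "page_on_call", "source": event["source"]},
        )


def action_service(bus):
    """Execute one pending action request and describe what was run."""
    request = bus.consume("action_requests")
    if request is None:
        return None
    return f"ran {request['action']} for {request['source']}"


bus = MessageBus()
bus.publish("events", {"source": "db01", "severity": "critical"})
rules_service(bus)
print(action_service(bus))
```

Because each service depends only on the bus and the message shape, either one could be replaced, or run as several copies, without the other noticing. That is the property being described when the components are called separate and decoupled.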

If everything is using the message bus, in large StackStorm installations,

I guess RabbitMQ,

which is what you're using, in and of itself can be a cluster. Right? I guess I'm just wondering, do you run into instances

where you have large StackStorm installations

where there can be problems with RabbitMQ becoming overloaded? Or does that become a bottleneck or a single point of failure for you?

So, yeah, as you have said, RabbitMQ already has some kind of mechanism for, like, master-master scaling out there. And, yeah, we haven't encountered any of those issues yet, as Patrick has mentioned.

We hear that question a lot. You know, people ask about whether we've hit that because other users have hit issues at the upper end with Rabbit, but we actually haven't hit any of those problems yet. I mean, there's always talk of, oh, hey, you know, we could move to Kafka or something like that if we start to hit those, but we haven't seen those yet.

And potentially, if a single instance of RabbitMQ couldn't handle the load, we could also do some kind of sharding or something like that. So I would say there's always, like, a way

out. Right. I mean, I know in my case, 1 of the companies that I was at, we actually did have a problem with overloading Rabbit. But I think that the problem there was that we were using it

not just as kind of a message passing architecture, but effectively, they were using it almost as a high throughput work queue, and they were really just

pounding it with tremendous, tremendous, tremendous amounts of input. And so it's not like Rabbit had a problem per se, but it would end up melting the VM that it was running on, which did create some problems.

So I suspect it's a matter of you folks using RabbitMQ more intelligently, so you don't run into those issues.

Yeah. And, also, in the past, I would say probably 4 or 5 years ago now, I also worked at a software-as-a-service company where we used RabbitMQ.

And we also had some, I would say, similar issues. But, yeah, in those 4 or 5 years, RabbitMQ has also matured. And a lot of those issues have also been fixed and improved.

So

as I briefly mentioned before, I use SaltStack in my day job, which is a tool that also focuses on event driven architecture.

I'm wondering if either of you can compare and contrast the capabilities and focus of StackStorm with the features of SaltStack. And I'm wondering if it would make sense to use both frameworks in the same infrastructure

or

where you think the best overlap would be between them?

Yeah. We have a lot of users who use different configuration management systems, I would say, similar to SaltStack. Some of our users use Puppet. Some of our users use Chef. Some of them use Ansible,

and some also use Salt. So the thing StackStorm allows you to do is potentially connect Salt or another system with the other tools and

environments you use.

So to give you a concrete example, some of our users actually use StackStorm to orchestrate a deployment. And to actually perform the deployment, they use Ansible. So Ansible potentially

downloads a tarball, unpacks it, and things like that. But the actual system which orchestrates all of that is StackStorm. And this could be,

a GitHub commit or a merge event comes in, and then StackStorm kicks off this

Ansible action.

So StackStorm kinda allows you to tie all of this together.

There are quite a lot of similarities when you look at the core functionality of remote execution and the reactor and things like that between SaltStack and StackStorm.

But a couple of the big aspects that we bring, and where I think we were able to surpass Salt, are, one,

our concept of events can be really anything. The idea that all of these external events can be consumed and then turned into what we call trigger instances and triggers within StackStorm

is really, really powerful. And that's one of the things that makes us such an amazing platform. The other thing is that we've added a lot of

very nice

graphical utilities and very powerful command line utilities

around the platform in order to make it a lot easier for all different levels of user to interact with StackStorm.

Yeah. I will say that the

UI in particular is something that SaltStack has

not quite got a great answer for yet. I think right now they're actually working on building out

a UI for their enterprise customers, so it's not gonna be an open source 1. There are a couple of open source offerings available right now but neither of them are very feature complete at the moment.

So

having a good UI that gives you a comprehensive overview of everything across your stack would definitely be very useful.

Neither does Chef for that matter. I think, honestly, everybody's playing catch up to Ansible with Tower. So everybody kind of all of a sudden realized that there's real need here and that having a nice GUI

drives customer adoption, but,

I don't think anybody really had their head in that particular game for a long time.

Well, SaltStack did have Halite for a while, but they kinda let it peter out because it was never a fully fleshed out project.

Well, it's all about personas. Right? When you look, configuration management tools are amazing tools, and they really allow

your

sysadmins and your SREs and things like that to leverage themselves out to a wider environment. But they are really geared towards infrastructure hackers and things like that, people that are comfortable

running things from the command line and doing those kind of things. As you start to branch out to the lower level engineers or the people that aren't as comfortable doing those kind of things, who aren't comfortable working with the DSL and the various tools, the graphical utilities become really powerful. They allow

the end users that may not have those capabilities to start leveraging themselves out to that same infrastructure that your power users were doing before.

One of the advertised features of StackStorm is a strong focus on chat ops. I'm wondering if you can explain that concept for people who might not be familiar with it and describe why it's such a useful paradigm.

So chat ops is really neat, and chat ops is kind of the new hotness these days. What chat ops is, is

being able to interact with your infrastructure and with your tools

directly from your chat room. So instead of, you know, spinning up a special war room, a channel in IRC or a channel in Slack, where you're starting to deal with an issue and you need to jump out of Slack or jump out of your tool and go hop on the box and run some commands, you're now interacting with your infrastructure

directly from that chat room. And the way you do that is via bots. So a bot will sit in your room, just like back in the IRC days. A bot will sit in your room that is gonna report information to you about your infrastructure,

about what's going on, and you're gonna be able to tell the bot to go take action against your infrastructure. So what StackStorm really allows you to do is all of that power of all those integrations and those workflows and everything else you've loaded into StackStorm

is now exposed to you through that chat room with chat ops. So you can do something like

reboot a server or you can

push the latest version of your Puppet module up to the forge, or whatever information you need there, or whatever things you need to do, or information you may need to collect, you run that command directly from chat.

The power of this is that you're now not losing your context when you're working with others. That transparency and that collaboration

stays linear. You can now see directly within the flow of the conversation

when it was that the other engineer went and restarted a service, or when the last time was that we went and checked the load on that server, things like that. It's all there.

You have time stamps and everything for all of that data. So it's powerful in the moment when you're troubleshooting an issue, but it's also incredibly helpful when you go to do your post mortem afterwards.
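
The core of the bot side of this, turning a line typed in chat into an action plus its parameters, can be sketched roughly as follows. This is not how Hubot, Lita, or StackStorm's aliases are implemented: the patterns, action names, and the `parse_chat_command` helper are hypothetical and exist only to show the shape of the idea.

```python
import re

# Hypothetical chat patterns mapped to hypothetical action names.
ALIASES = [
    (re.compile(r"^reboot (?P<host>[\w.-]+)$"), "reboot_server"),
    (re.compile(r"^check load on (?P<host>[\w.-]+)$"), "check_load"),
]


def parse_chat_command(text):
    """Return (action_name, parameters) for a chat line, or None."""
    for pattern, action in ALIASES:
        match = pattern.match(text.strip())
        if match:
            return action, match.groupdict()
    return None


print(parse_chat_command("reboot web01"))
print(parse_chat_command("check load on db01.example.com"))
print(parse_chat_command("anyone up for lunch?"))
```

Ordinary conversation falls through and returns `None`, so the bot only acts on lines that match a known pattern, and both the command and its result stay in the chat history alongside the discussion.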

That's an area that VictorOps as a company tries to focus on with their tool, but they're more specifically focused on managing your on call and then trying to aggregate your different alert events into a chat stream.

But having

control over the actual infrastructure itself from that same location is useful.

So in terms of the bots, I'm curious if you just have plugins for some of the more popular chat bots that are out there, such as Hubot,

or if you have your own bot that you write and maintain for purposes of plugging into the different chat rooms?

No. We just leverage the existing bots that are out there. So we have plugins for both Hubot and for Lita.

I mean, the bots are doing their job, and they're doing it very well. So there's no reason to reinvent the wheel there.

And as far as Hubot goes, this is our officially supported one, and the Lita one is actually developed and supported by the community.

Yeah. Hubot is definitely 1 of the more well known ones now,

particularly because of its association with GitHub and the work that they use it for.

So extensibility is a critical capability for an operations platform due to the wide variety of environments that people are inclined to build. In StackStorm, it looks like the unit of extensibility is a pack. I'm wondering if you can describe what a pack is and how you arrive at that abstraction.

So a pack is really a shippable unit of content for StackStorm. So it can be made up of rules that are processed by the rules engine, sensors that are collecting and putting events into StackStorm,

as well as all the actions, whether those actions are single atomic actions like an rm command or they're workflows,

as I said before, with complex logic in them and things like that. All of those things will drop down to the pack level.

Really, I don't see a problem, and I haven't run into a problem

with the concept of a pack as the shippable unit that couldn't be solved through user education.

So, really, a pack doesn't have to be,

all about a piece of technology

or a group within your organization. A pack could be designed for a use case. So you may have your Sensu pack or you may have your AWS pack, but you may also have your,

you know, level 1 NOC remediations pack,

Or you may have your Database Remediations Pack or your Database Deployments Pack. So you could actually be developing packs around use cases. And as long as you're

designing your packs in an intelligent way, then I don't think that the concept of pack as a shippable unit is a problem at all.

And have you encountered any situations in which the concept of a pack has been the wrong abstraction and made something more difficult than it may have been otherwise?

No. Just as I mentioned previously, I haven't, as long as the end user is designing their packs in an intelligent way. Yeah, if you take one pack and you shoehorn absolutely everything you've ever done into that pack, you're gonna have a bit of a problem updating that pack in production. It's just not an intelligent way to do that. So as long as the user is thinking ahead and designing their packs in the best way possible and following some guidelines that we give to them, I haven't seen anything where that's a problem.

So in very large scale environments like Netflix, how would one build a StackStorm cluster to handle the immense load? More specifically, how does one determine what kinds of machine resources each component needs?

Specifically, in the Netflix use case, they actually have a really interesting design. So Netflix has a monitoring system called Atlas, and Atlas is emitting all of these monitoring events out to SQS. So they're throwing their monitoring events onto a message bus. And then each one of their StackStorm nodes

is running a sensor that's picking up these monitoring events off of SQS.

And so they're able to just scale their StackStorm nodes horizontally,

and they're all identical clones of each other just running that SQS sensor, picking the events up, processing them through the rules engine, and firing off workflows.

The beauty of this for them is that all they have to do is spin up another node if they start to get bogged down.
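
The pattern described here, identical nodes all pulling from one shared queue, is often called competing consumers. Below is a small self-contained sketch of it using threads and an in-process queue in place of real nodes and SQS. It is not Netflix's setup or StackStorm's sensor API: the `worker` and `run_cluster` names and the event strings are made up for the example.

```python
import queue
import threading


def worker(node_name, events, processed, lock):
    """One identical 'node': drain events from the shared queue."""
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return  # nothing left to do, the node goes idle
        with lock:
            processed.append((node_name, event))
        events.task_done()


def run_cluster(event_list, node_count):
    """Process all events with node_count identical workers."""
    events = queue.Queue()
    for event in event_list:
        events.put(event)

    processed = []
    lock = threading.Lock()
    nodes = [
        threading.Thread(
            target=worker, args=(f"node-{i}", events, processed, lock)
        )
        for i in range(node_count)
    ]
    for node in nodes:
        node.start()
    for node in nodes:
        node.join()
    return processed


processed = run_cluster([f"alert-{n}" for n in range(20)], node_count=3)
print(len(processed), "events handled by",
      len({node for node, _ in processed}), "node(s)")
```

Each event is handled exactly once no matter how many workers are running, which is why adding capacity is just a matter of raising `node_count` in this sketch, or spinning up another node in the setup described above.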

And so it sounds like

the way that they approach that is by taking RabbitMQ out of the picture and using SQS instead?

Well, actually RabbitMQ still exists internally, and it's how the different components in StackStorm communicate.

They have SQS sitting out front. And so Atlas is dropping all the messages to SQS

instead of sending them directly to StackStorm, and then StackStorm is just picking them off of SQS.

That makes sense. I've actually worked at a company that did it as well. That way, you get kind of the infinite extensibility

and infinite scale of SQS,

and your subsystem, StackStorm, in this case, can pull the messages off the queue as it can handle them.

Exactly. If you're a company like Netflix that is using AWS,

you might as well leverage the services that are available from them.

Right. The company that I was working for, we were doing donations processing for the Barack Obama campaign,

and we had our own proprietary

queue system.

But we used SQS as a front end for that explicit purpose, both in terms of being able to keep up, but also in terms of fault tolerance. That way, if our queue went down, the messages would still get queued in SQS and we could consume them when we came back up.

Yeah. And that's also one of the beauties of StackStorm, the one I've previously mentioned: the flexibility.

Some of our users use SQS, some of them may use Kafka, some of them use Microsoft SQL Server as a queuing system. And it doesn't necessarily even need to be a different company. You have different teams and different projects in the same company, and all these teams potentially use different tooling. So StackStorm allows you to ingest events from all those different systems.

And management of credentials is always a difficult problem in operations. Does StackStorm attempt to tackle that issue or does it defer that responsibility to other systems such as the user's configuration management platform?

The answer as of this moment is that we defer it to other tools like the configuration management platform. But we have some really neat things coming down the pipeline. And I hate to leave a teaser dangling, but I'm just gonna say keep an eye out for the next couple of versions, and we're gonna drop some really neat stuff around, secrets management.

And do you have any rough timeline on that?

Oh, man. I don't know. Tomaž, do you know when we're talking about dropping that stuff?

I think usually the best answer is to say when it's done, because with software, you never know. But, hopefully, sometime in the next, like, 1 to 3 months.

Okay. Fair enough.

Does StackStorm interface with Kibana, Splunk, or other log or metric aggregation packages?

Interface in what way?

I guess I was just thinking in terms of

what you mentioned previously.

Both of you guys mentioned how you would kinda like to see

virtually every event in a large system come through StackStorm.

And I guess I was wondering,

as a result of

running, you know, like you mentioned, there could be automated actions that StackStorm may take in reaction to various events

that it consumes.

I would think that it would be handy to have the results of those events in any logs that get generated or any sort of other metrics.

Basically, I'm just wondering, if you have a centralized system for collecting this kind of information,

like Kibana or Splunk or Graphite,

does

StackStorm allow you to feed the data that it generates into whatever centralized system you're using?

Absolutely.

What StackStorm does is StackStorm has a concept of what we call audit logs. And so we output everything that the system has done, every action execution, every rule enforcement,

every trigger instance. Everything that the system has done gets output

into log files that are structured data. They're all JSON. So you take that JSON and you put it into your log aggregation tool of choice, and you now can see everything that StackStorm did, all the events that StackStorm received. You get all of that data, so you can now start to crunch it in your tool of choice.
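
Because the logs are structured JSON, crunching them takes very little code. The sketch below counts records by type from a few JSON lines. The field names (`type`, `rule`, `action`, `status`) and the sample records are invented for illustration and are not StackStorm's actual audit log schema.

```python
import json
from collections import Counter

# Hypothetical sample: one JSON object per line, as a log shipper
# might deliver it. The schema here is an assumption.
sample_log = """\
{"type": "trigger_instance", "trigger": "monitoring.alert"}
{"type": "rule_enforcement", "rule": "restart_on_high_load"}
{"type": "action_execution", "action": "demo.restart", "status": "succeeded"}
{"type": "action_execution", "action": "demo.restart", "status": "failed"}
"""


def summarize(log_text):
    """Count log records by their 'type' field, skipping blank lines."""
    counts = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[record["type"]] += 1
    return counts


def failed_actions(log_text):
    """Return the action names of executions whose status is 'failed'."""
    records = (json.loads(l) for l in log_text.splitlines() if l.strip())
    return [
        r["action"]
        for r in records
        if r["type"] == "action_execution" and r.get("status") == "failed"
    ]


print(summarize(sample_log))
print(failed_actions(sample_log))
```

A log aggregation tool does the same kind of filtering and counting at scale; the point is only that structured records make those questions easy to ask.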

And do you expose an interface for being able to query that data via your bot plugins?

We don't actually have any chat ops aliases that query Splunk that I know of right now. But once again, I know I kinda sound like a broken record as I say this, it's so easy to write these integrations,

and there is a Splunk query action, so you could write an alias very easily that would allow you to do that.

It's almost kind of like they don't need to create the interface because the interface is already there. My guess, anyway, is that this kind of query would be the kind of query

that by its very nature, it almost requires domain knowledge. Right? Like, you have to understand

what you're querying

to know what kinds of queries or reports

you may want to be able to invoke from your chat ops interface.

Exactly. You know, we're creating a very powerful generalized automation platform, and we're letting the end users determine how exactly they're gonna leverage it.

And what are some of the most surprising uses that you've heard of from people using the platform?

We just recently had a really neat contribution of a Tesla pack. So if you're running StackStorm and you have a Tesla, you can now, via chat ops, honk the horn on your car.

That sounds potentially really bad. Can you imagine your kids getting access to the horn on your car? You're, like, driving down the highway, you're gonna end up inducing road rage because your kids are playing.

There was a lot of joking around that when we saw that pack land. So

Are there any questions that we didn't ask that you think we should have or anything else that you wanna bring up?

Nothing for me. So let's see. Go check our website and especially the community part. If there's an integration missing, if you use some kind of tool or some kind of system we don't integrate with,

go have a look at whether the system has an API or something like that. As Patrick has said, it's really easy to integrate,

and we're growing our community and really open to accepting contributions.

So, if you're having any problems, join us on Slack, and we'll basically help you get your contribution

accepted upstream and help you succeed with whatever you wanna do.

And for people who are interested in making contributions, do you have a labeling system in place on your repository for being able to demarcate the issues that are for people getting familiarized with the system? Because I know that that's a tactic that some projects have taken to sort of foster adoption and foster people getting involved in the code contribution portion?

I do think in our main repo, which contains the StackStorm code, we do have some labels. And, yeah, you can find some issues which are potentially

a good fit for beginners.

But as far as the st2contrib repo goes, with packs, I don't think we have many issues there. But if you just wanna try it out, there are a lot of different services which are really easy to integrate with and have really easy APIs. So you can start with something like that.

So for anybody who wants to keep in touch with you guys and follow what you're up to and see what's new with StackStorm, what would be the best way for them to do that?

So, I guess, Twitter is probably the best way. I am at

kamisl0,

or you can go to my website, which is tomaz.me.

You'll find all the links to GitHub and, Twitter there.

This is Patrick. You can find me on Twitter.

My handle is Dorifto Shoes, d like dog, o r I f, t like Tom, o,

s h o e s. Okay.

So with that, I'll move us into the picks.

For my picks today, I'm going to choose the SAWS wrapper for the AWS

CLI tool. So

it's a Python wrapper that,

uses the prompt toolkit library to give a much more friendly interface and gives you auto completion

because the number

of sub commands and their sub commands,

ad infinitum that the AWS CLI exposes can be somewhat daunting. So having a good autocomplete interface and a good editing interface for that is definitely very useful. So

I've been making good use of that recently. I definitely recommend other people check that out.

And I'm also going to choose the author Bill Peet. He's a children's author who has a

really great storytelling

method, and also all of his illustrations are very

entertaining and engaging. And most of his stories are pretty funny. So definitely recommend checking that out,

whether or not you have kids because he's definitely enjoyable for adults as well. And I will pass it on to you, Chris.

Thanks, Tobias. After a bit of a hiatus from beer picks,

I'm gonna come back with a beer pick this week. It's from Grim Brewing; I have picked their beers several times on the show before because I just love their stuff.

They use wild yeast and just make some really interesting brews.

The latest one that I've tried is called Subliminal Message, and it is a sour red ale brewed with tart cherry.

It is really delicious.

It's

got a really sort of fresh

flavor. You get the tart cherry at first, and then it smooths out to kind of a nutty finish. It's really delicious stuff. They just can't go wrong as far as I'm concerned. I've loved everything I've tried. My next pick is a website.

It's called Lobsters, oddly enough. I'm really not sure why they chose the name. I should probably ask at some point. But essentially, it is a curated technical community.

Think Hacker News,

but without all the founder and startup

stuff. It's all technology focused, and it's a much smaller community. It's much more civil. There's way less raginous

going on there.

The community,

basically, when you invite someone, you're responsible for their behavior. There's an invite tree. So, you know, you can always see who invited whom.

It's it's a really great community.

If anybody if any of our listeners want an invite and are willing to play nice,

drop me a line and I'll be happy to set you up. I've really enjoyed it. I feel like the signal to noise ratio is incredibly high. So, for my last pick, I'm going to pick Medium.

It's really kind of interesting. I realize a lot of people by now have probably clicked through articles on Medium, but it's really worth, in my opinion, signing up. It's basically an interesting conglomeration

between social network and blogging platform, but what's interesting about it is the pervasive adoption. Right? Like, I've read posts on here by, you know, everyone from

the president to presidential candidates to Alanis Morissette to fellow geeks writing about technology things to

people writing about philosophy and meditation,

and

people writing about science

and psychology.

There are actually several homeland security oriented topics. So I've been reading things by homeland security professionals

on the nature of ISIS and what kinds of threat it actually poses to American society. It's just, if you're a curious person, as I tend to be, infinitely so,

it's a very deep well. It's definitely worth looking into. And with that, I'll pass it off to Tomaž. What picks do you have for us, please?

Sure. So my first pick is gonna be a book. It's a book about the Air France 447 incident.

It was a flight from Brazil to Paris, and it crashed in the Atlantic Ocean. I think it's really interesting because it also ties into the whole automation space, especially with the Airbuses and stuff these days. There's a lot of pilots and, I guess, specialists stuck in there. The pilots basically over-rely on automation. And then when something goes wrong, because it's a complex system, with many different events and many lights and alarms going off,

pilots don't react in the right way. So in this crash, one of the, I guess, root causes, one of the reasons for the crash, was this human mistake in the interaction with the system.

And my second pick is gonna be a website. It's on a related note. It's called Aviation Herald.

It includes incident and root cause analysis reports from different aviation

incidents. And I think it's interesting because

software engineering is not really a regulated industry, and it's a pretty young industry. We have some postmortems and things like that. But if you look into a more regulated industry like aviation, it's really fascinating how deep it goes into root cause analysis and things like that. And, basically, that's how everything works. There's an incident, you dig into all the details, and then you learn a lot from that.

It's interesting,

you know, it's funny that you mentioned that Air France disaster.

On one somewhat technical forum,

an acquaintance of mine had set up a subforum called disaster porn,

and that particular crash was showcased rather heavily in quite a number of write ups. But it's definitely sort of a good habit to get into, I think, for people

who work in our industry

to understand

these kinds of occurrences, not just in terms of what went wrong, but what can we learn about how the industry,

their industry reacted to these disasters and made widespread changes? I think there's a lot for us to glean from there.

Yeah. And it's almost never, basically it's never, a single person's fault. It's basically a process. And, yeah, with all these learnings, you can improve the processes.

And, Patrick, what do you have for picks?

So my first pick is actually going to be a,

website called trunutrition.com.

I go there and I buy a lot of my supplements, but specifically I buy my protein powder from there.

Their shipping is really fast. The flavor choices are amazing. And honestly, the price to quality ratio, I haven't beat anywhere yet, whether it's, you know, stuff just bought from the store or any of the other websites that I can order stuff off of. And let's see. For my second pick, I think I'm gonna have to go with another website,

JP Cycles.

It's where I get all the parts for my Harley. Their prices are great, and shipping's fast once again. So as you guys can tell, if I'm not on my motorcycle or in the weight room, I don't leave the house a lot.

Well, we definitely appreciate the both of you taking time out of your day to tell us more about StackStorm and the myriad ways that it can be used. So I appreciate it. It's definitely one that I'll be keeping an eye on and

possibly using in my own infrastructure. So

I appreciate that, and I hope you enjoy the rest of your days.

Yeah. Me too, guys. This has been really excellent. Thank you for taking the time. Thanks, guys. Thanks.

The Python Podcast.init
