Summary
Kubernetes is a platform that aims to simplify the work of running applications in production, but it forces you to adopt new patterns for debugging and resolving issues in your systems. Robusta is aimed at making that a more pleasant experience for developers and operators through pre-built automations, easy debugging, and a simple means of creating your own event-based workflows to find, fix, and alert on errors in production. In this episode Natan Yellin explains how the project got started, how it is architected and tested, and how you can start using it today to keep your Python projects running reliably.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at pythonpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
- Your host as usual is Tobias Macey and today I’m interviewing Natan Yellin about Robusta, a tool chain for debugging your applications on Kubernetes
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what Robusta is and the story behind it?
- What are some of the challenges that teams face when running their systems in Kubernetes?
- How does Robusta help address those difficulties?
- How does Robusta compare to e.g. Rookout?
- What are some of the ways that Robusta is able to provide specific insights for Python applications?
- Can you describe how Robusta is implemented?
- What are some of the most challenging engineering tasks that you have had to work through while building Robusta?
- How have the capabilities and components evolved from when you started working on it?
- What is the workflow for integrating Robusta into a Kubernetes environment and a team’s maintenance processes?
- What are some examples of the kinds of questions that Robusta can help answer out of the box?
- What are some tasks that Robusta facilitates which require manual exploration?
- What are the interfaces available for customizing and extending the functionality of Robusta?
- What is involved in adding a new automation capability to Robusta?
- How have you approached the design of the tool to make it ergonomic and intuitive so that it doesn’t contribute to the stresses of dealing with errors in production?
- Given that it is a tool to help resolve problems in production infrastructure, how have you worked to ensure its reliability and resilience?
- What is the governance and sustainability model for Robusta?
- What are the most interesting, innovative, or unexpected ways that you have seen Robusta used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Robusta?
- When is Robusta the wrong choice?
- What do you have planned for the future of Robusta?
Keep In Touch
Picks
- Tobias
- Kubernetes: Up And Running (affiliate link)
- Natan
- Kubernetes for SysAdmins YouTube video by Kelsey Hightower
- Learn to delegate
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Robusta
- GHOP
- Objective-C
- Snyk
- Heroku
- Google AppEngine
- OOM Killer
- Bin Packing/Knapsack Problem
- Prometheus
- Kubernetes Pods
- py-spy
- tracemalloc
- Pyrasite
- VSCode Debugger
- Pydantic
- Helm – Kubernetes package manager
- Why Profiler
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. So now your modern data stack is set up. How is everyone going to find the data they need and understand it? Select Star is a data discovery platform that automatically analyzes and documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who's using it in the company and how they're using it, all the way down to the SQL queries. Best of all, it's simple to set up and easy for both engineering and operations teams to use.
With Select Star's data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at pythonpodcast.com/selectstar. You'll also get a swag package when you continue on a paid plan. Your host as usual is Tobias Macey. And today, I'm interviewing Natan Yellin about Robusta, a tool chain for debugging your applications on Kubernetes. So, Natan, can you start by introducing yourself?
[00:01:54] Unknown:
Yeah. Hi. I'm Natan. Today, I'm 1 of the founders of robusta.dev. Been a developer for many years. Got involved in open source back in high school, with GNOME on Linux, and kinda been developing ever since. Very recently, about 6 months ago, I got started with Robusta. Before that, we had a cybersecurity startup; we were running on Kubernetes, and we were selling various cybersecurity solutions to customers. That was good. But the experience of running on Kubernetes was a really interesting experience for me and for my cofounder. And out of that, and out of the difficulties we saw there, the idea for Robusta was born. And do you remember how you first got introduced to Python?
So I think that it was in GHOP. Google had this program for teenagers, kinda like the teenage equivalent of Google Summer of Code, called GHOP, and they've changed the name since then. You basically went online, and there were all these open source projects, and you picked out something that interested you to do at 1 of the open source projects. And I think 1 of the tasks was in Python, and I had never done anything in Python before. I had programmed a bit in Objective-C. I had a Mac, so I had programmed in Objective-C, done some HTML and JavaScript. And then I learned Python for 1 of the open source projects that was involved there, which I wanted to get involved in. It was probably GNOME, but I don't remember for certain.
[00:03:12] Unknown:
That's brought you to building the Robusta platform and business, and you've shared a little bit about some of the motivation for it. But I'm wondering if you can describe a bit more about what it is that you're building there and why it is that this is the problem space that you wanted to spend your time and energy on?
[00:03:28] Unknown:
Yeah. So what we saw is, like I said, we were really doing security solutions for Kubernetes. And there's this thing that everyone does in the security space, whether it's Snyk, which does vulnerability scanning, whether it's companies that do container scanning, whether it's companies like Wiz that do general cloud security. They basically look at your environment, and they say, you don't have to be a security expert because we are. And we'll bring this really opinionated view to your environment, and we'll tell you what the issues are, whether it's vulnerabilities in your code or issues in your cloud infrastructure.
And then we'll tell you what the issues are and how to fix them: click a button here and you'll fix that. And we were selling that to people on Kubernetes and saying, okay, we'll fix your security issues on Kubernetes. And they were like, okay. Great. Fantastic. And they loved that. And then we came back to our own environments, and we'd onboard a new customer, and things were crashing left and right. We had issues with Cassandra and MongoDB, and then we had to profile a Java application, and then we had an issue with a Go application.
And we were running into live issues, really, with DevOps and maintaining stuff in production. We started thinking, could we apply the same concept of looking at your environment, taking this really opinionated view and saying, okay, here are the issues, and here's how you fix them. Taking a more opinionated view than a traditional monitoring tool, which just gives you alerts but doesn't necessarily say, this is the problem, and this is how you should fix it. And then from there, the idea for Robusta was born.
[00:04:52] Unknown:
Before we get too much into Robusta, before the recording, we were talking a little bit about some of the relative merits of Kubernetes versus some of the other offerings and some of the layers that are built on top of it. And so for people who are writing applications and running them in production, what do you see as the heuristic or the litmus test of when it makes sense to actually explore Kubernetes as the runtime for your applications versus building something yourself on a VM or just using Docker on its own? Or sort of what is the sort of tipping point of complexity where you actually need Kubernetes?
[00:05:30] Unknown:
Okay. So it's a good question. I would say the point you need Kubernetes is when you end up writing Kubernetes yourself, or writing parts of it but doing it poorly. If you're writing it yourself, you probably are doing it more poorly, because writing distributed systems is hard. And to give a longer answer: if you can get away with using something like Heroku or App Engine or a more simplified platform, you should do so. But as soon as you start getting into orchestration, as soon as you start getting into all these different concerns like configuration management, secret management, self healing, liveness probes, when you start getting into those areas yourself, then, really, that's the point where I think it makes sense to take something off the shelf. And Kubernetes might have, like, 20 things that you don't need, but the 3 things that you do need, even if it's just self healing and auto scaling, are written very well and they're extremely battle tested. So by taking those, you can save yourself a lot of headache down the road. Also, people tend to invent stuff in house, and there's always not-invented-here syndrome. But I think if you're doing stuff that's complicated and you're writing naive in-house code for that, then you should see whether an existing orchestration system like Kubernetes can help. And maybe 1 more example I'll give here is just service discovery.
And some people say, oh, service discovery is really easy. You just use DNS, and then you know where the services are. And it's not really true if you think about it. Like, someone has to be keeping those DNS records up to date. So you might be managing a few different machines, and now you're going and updating those DNS records. But if a machine crashes, you have to remove that from the DNS records, and you have liveness probes that are checking. There's all this stuff that goes into making something as simple as service discovery work, and you can either implement that yourself or you can take an existing solution like Consul. And you take something else for secret management, and something else for auto scaling. You can take all these different components, or you can just take the 1 that's emerging as the operating system for distributed systems, which is Kubernetes.
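For contrast with the hand-rolled DNS approach described above, this is what the standard Kubernetes primitive looks like: a Service object gives pods a stable DNS name, and the endpoints behind it are added and removed automatically as pods come and go or fail their probes.

```yaml
# A Service named "api" gets a stable cluster DNS name
# (api.<namespace>.svc.cluster.local). Kubernetes keeps the
# endpoint list behind it up to date automatically, so nobody
# hand-edits DNS records when a pod dies or is rescheduled.
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api          # any pod with this label becomes a backend
  ports:
    - port: 80        # port clients connect to
      targetPort: 8080  # port the application listens on
```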
[00:07:29] Unknown:
Yeah. I was going to play devil's advocate for a moment there and bring up Consul as an option for service discovery, but you already addressed my question there. So
[00:07:38] Unknown:
No. You can. You could use Consul for that. You could do something else for secret management as well. You can do something else in terms of managing state and doing rollouts. There's solutions for everything. But there are solutions for everything too if you wanna write code without an operating system and put it on the computer. You can do it. It won't be compatible with anyone else. Everyone who comes to your company will have to learn what exactly it is that you're doing. You can get by without it. And the main reason people say I wanna get by without it is because it's complex.
So I think the main question to ask yourself is: is the alternative that I'm developing in house, or that I'm putting together from other tools, really simpler? And if the answer is yes, then you don't need Kubernetes.
[00:08:20] Unknown:
I think that's a good way to summarize it. There are definitely lots of complex and conflicting arguments for when you would use Kubernetes. And if you do a Google search, you'll probably find 1,000 different people having 2,000 opinions.
[00:08:34] Unknown:
Yep. Yep. I wrote a blog post that was a little bit controversial, and I'll tell you about it: Kubernetes is complex because you want complex things. And that was the gist of it. You want all these things: auto scaling and service discovery, etcetera. And to put it simply, you really want Google-level infrastructure with something like a 3 person DevOps team. So there's complexity there just in your requirements, and the solution is going to have to be complex too. Fair enough. And so for people who
[00:09:03] Unknown:
have decided that they're going to go into the Kubernetes camp and run their infrastructure on this platform, what are some of the challenges that they face when they start to move into production and they actually do want to run their systems in Kubernetes and be able to maintain uptime and understand the problems that occur?
[00:09:22] Unknown:
Yeah. So the number 1 problem is just a lack of knowledge and a lack of experience. There's a running joke about resumes: you see job positions looking for people with 15 years of Kubernetes experience. And, obviously, Kubernetes hasn't even been around for that long, but you do sometimes see job posts like that. And, really, the big real issue here is that Kubernetes is a relatively new technology, not a lot of people have experience with it, and it's nontrivial. If Kubernetes is kinda like an operating system, or if you compare it to Linux, then today you can write applications on Linux without being familiar with the kernel or with all the implementation details. You have higher level stuff, and you're not calling syscalls directly; you're using glibc and so on. If you look at Kubernetes, then it's really complex.
Like I said earlier, I think that complexity is justified, but it's very complex. And that complexity often meets you in weird places. Combine that with a lack of experience, and things can be a pain to troubleshoot, and visibility is often very poor. Maybe I'll give an example of that, just to make it obvious. Take, for example, let's say you have a container that's getting out-of-memory killed because it's using up its memory, and it's getting killed by the infamous Linux OOM killer. Right? And then you go online and you Google how to handle that, and you'll find different advice. There's all this stuff about requests and limits, and you'll even find a lot of stuff saying that the best practice is you have to set memory requests, and you have to set memory limits, and you should calculate them this way and that way. And I've actually reached out a few times, and I've spoken to different Kubernetes maintainers. The mainstream opinion, among people who are running Kubernetes at large scale, is that you should always just set your request to your limit. Essentially, when you're running these applications, you should make sure that your application requests, and is allocated, the same amount of memory that it has available. What I mean by that is you can think of it as like a bin packing problem or a knapsack problem. Right? You're allocating different amounts of space. But Kubernetes will let you say: allocate 1 gigabyte for my application, and yet you can actually go and consume up to 2 gigabytes.
And when you do something like that, it causes all sorts of weird behavior, because you can have stuff that goes over the limit, and then your entire node is over the limit. So it causes all sorts of nontrivial issues. And the common advice for that is really not useful.
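The "set your request to your limit" advice maps onto a standard container spec fragment like the following. The values here are placeholders, and whether to set a CPU limit at all is a separate debate; the key point is the memory request equalling the memory limit.

```yaml
# Container resources with the memory request equal to the memory
# limit, so the scheduler reserves exactly what the container may
# consume. When every container in a pod does this for every
# resource, the pod gets the "Guaranteed" QoS class, making it the
# last candidate for eviction under node memory pressure.
resources:
  requests:
    memory: "1Gi"
    cpu: "500m"
  limits:
    memory: "1Gi"
    cpu: "500m"
```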
[00:11:38] Unknown:
So for people who are using Kubernetes and they are running into these issues of visibility and experience, what are some of the ways that the tool that you're building at Robusta is going to help address those challenges?
[00:11:53] Unknown:
Yeah. So what we're doing with Robusta is really 4 things that all build on top of 1 another. And the general vision for Robusta is that we'll look at your production environment, and we'll say: here are the issues, here's what you should do about them, and click here to fix it. And when that's not sufficient, then we'll give you really good tools to troubleshoot it yourself. To explain how we do that, I'll walk you through a little bit of the architecture. There are essentially 4 parts of Robusta. The first is the automation engine. You define, in YAML, triggers, actions, and sinks. A trigger is: a pod crashed, a Prometheus alert fired, or you reached the maximum scale limit with the horizontal pod autoscaler. Those are all triggers. They're events that happen in your cluster. They can come from Prometheus. They can come directly from Kubernetes.
We can do a trigger like: a specific log line was written. So there are all sorts of triggers, and we're always adding new ones. And then an action is essentially a Python function that runs, which you just configure in YAML, and which gathers some data or takes some remediation action. So a simple workflow here is: the trigger is a pod crashed, and the action is go fetch the logs so I can see why it crashed. A more complicated example is: I reached the maximum auto scaling limit, so send me a message in Slack with a button I can press. And if I press that button, it ups the maximum auto scale limit, so I can go back to bed at 3 AM and then deal with it properly in the morning. So, as I said, there are 3 parts to an automation: there are the triggers, there are the actions, and then there are the sinks. The sinks are just where you send that data, like, do you get a Slack message or an MS Teams message or whatever?
So that's part 1 of what we're building with Robusta, the automations engine. And at the beginning we thought, okay, we'll give you this automations engine, and then you're gonna implement all these different automations. You kind of encode your own team's knowledge as code, and then you'll have these automations that show you exactly the right issue. And it turns out that people are really busy, and they just want you to provide those for them. So the second part of what we do is we have a monitoring stack that's based on Prometheus. Either we bundle Prometheus with us, or you can send data from your existing Prometheus.
And then we essentially just have out-of-the-box rules and out-of-the-box automations that just do the right thing. So, for example, if you're out of disk space, then we can actually go and fetch the files from the disk and say, okay, here's the biggest file. If there's some other issue happening with CPU throttling or with OOM kills, then we can tell you, okay, this is why it's happening, and your CPU request should actually be x, not y. So that's the second part of it. And then the third part of it is, well, every automated solution only covers, in the best case scenario, 70 or 80% of the issues.
So why don't we give you really strong troubleshooting tools for when the automations don't cover things? That's deep non-breaking debuggers and profilers and stuff to debug memory leaks and various solutions over there. And then the 4th and last part of it is the only part here that's not open source and not MIT licensed, which is our cloud platform. You see all this data in 1 place, and you have 1 dashboard that shows you everything running in all of your clusters, what the issues are for each 1, how to fix them, and it gives you those buttons there in the UI instead of in Slack.
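Pulling the automation-engine description together, a trigger/action/sink playbook in YAML might look roughly like the following. This is an illustrative sketch based on the conversation, not the exact Robusta schema; the trigger and action names here are assumptions, so consult the Robusta documentation for the real ones.

```yaml
customPlaybooks:
  - triggers:
      # fire when a pod enters a crash loop (name is illustrative)
      - on_pod_crash_loop: {}
    actions:
      # a Python function, configured in YAML, that fetches the
      # crashed pod's logs so you can see why it crashed
      - logs_enricher: {}
    sinks:
      # where the gathered data gets delivered
      - slack
```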
[00:15:02] Unknown:
As far as the capabilities and use cases that it supports, I'm wondering what you see as some of the alternative players in the ecosystem. 1 of the ones that I'm aware of is a platform called Rookout that focuses on sort of those non breaking debuggers and being able to help with the profiling. But, you know, it's definitely a very large and complex and wide open ecosystem of tools built around Kubernetes. So I'm wondering if you can give some sort of framing as to where Robusta fits in that overall space and when you might want to choose that over some of the other tools or some of the tools that it might replace that folks are used to from running their own applications on, you know, a virtual machine, for instance?
[00:15:44] Unknown:
Yeah. It's a good question. So we're firmly in the observability and monitoring space. I'll start with what we don't replace, which is Prometheus. If you have Prometheus, then you configure a webhook, and it just sends data to Robusta. And if you don't have Prometheus, we actually install Prometheus with us. So we're not replacing the monitoring part. The monitoring part is you gather all these different metrics, and then you show graphs and stuff. We're not replacing that. We're just adding a very opinionated layer on top of that, which identifies specific issues and tells you what to do about them. And specifically regarding Rookout, it's not quite the same space. Rookout has taken the non-breaking debugger part, and I haven't used their product, but I think they've done a very good job of taking that really to the next level. For us, we didn't write our own non-breaking debugger. The dirty secret is we don't have a non-breaking debugger in Robusta. We're actually just deploying the VS Code non-breaking debugger to production, and we're just doing the wiring there. And the problem that we're trying to address is not specifically: how can I run a non-breaking debugger in production? That happens to be 1 specific use case. The broader problem that we're trying to address with the manual troubleshooting side is: there are all these great Unix tools out there for debugging CPU issues and running a CPU profiler, for gathering data on memory leaks, for debugging applications.
There are all these great observability tools out there, and the only problem with them is that you can't really use them well in Kubernetes. The traditional Unix philosophy is 1 tool, 1 purpose, and you put them together in a bash pipeline or a bash script, and you can do all these different things with them. And that doesn't work well when you have machine boundaries in the middle, when you have pod boundaries. It's no longer trivial to say, okay, take this great CPU profiler for Python, say py-spy, which Robusta happens to wrap, and now run this in production on a Kubernetes cluster. Because your Docker container doesn't have that profiler inside of it. And even if it does, it doesn't necessarily have the right permissions. And even if it does, there are all these other issues just in terms of getting it to run at the right time. So a big part of what Robusta is doing is taking these really good traditional Unix tools and wrapping them in the right way. So at the exact right moment, we can run the exact Linux tool that you want to gather the exact data, and you can orchestrate that. As an example, you can say: when there's a high CPU alert, go run py-spy, a CPU profiler, which is open source, on my application.
Pick 1 pod which has high CPU, run it on that pod for 10 seconds, and then send me the profile on Slack. And you can do that all without any prior setup, which is cool, because I always find out I need to have set something up, which I never did. So it's cool to be able to do that on something that's running without needing to set it up first. A different example of this, one that's not Python specific and is more on the monitoring side: let's say you have a pod or a node that's running out of disk space, and you wanna know, okay, what's taking up the space? In the normal world, without Kubernetes, you just run du or df or duc; there are all these great tools. But it's certainly nontrivial to do that here. If you wanna run du on the node, you have to manually log in, and maybe you don't even have access. So being able to define a YAML file that says, essentially, when there's high disk usage, just run this for me and send me the result in Slack, or being able to say, okay, this is happening right now, so just right now I'll run 1 command line, and that command line will set up these tools on that node, run them for me, then stop running and just send me the data: it's very useful. As far as the specifics of a Python application, you mentioned py-spy.
[00:19:26] Unknown:
What are some of the other ways that Robusta is able to provide specific insights for Python engineers to be able to debug some of the challenges that occur from running dynamic applications on dynamic infrastructure?
[00:19:39] Unknown:
It's a good question. So it always splits up into 2. There's the stuff which isn't really Python specific and is more Kubernetes specific. Like, I have the wrong CPU request, or I have the wrong memory limit, and stuff that has to do more with orchestration itself or how I should auto scale. All that's just stuff that Robusta will monitor out of the box and will give you a very strong opinionated view on: you're doing this wrong, you actually should have this CPU request, you should have this CPU limit. And that's the more generic Kubernetes monitoring side. And there, we work hand in hand with Prometheus, and we bundle Prometheus, and we modify some of the alerts. So that's just really the normal Kubernetes monitoring side. I guess I'll just elaborate there: that's the stuff you would care about, or be looking at, if you're a DevOps engineer. Take as an example high CPU; there are 2 views on this. The DevOps guy will say, okay, let's put on more CPU resources. And then there's the developer, and the developer will say, no, it's actually not my problem; it's a problem with the infrastructure, you're not giving me enough CPU. So it helps being able to, on the 1 hand, look at the infrastructure side, but then also look inside your code and say, okay, who's using the CPU? Do I need to give my application more CPU, or is there really a bug here that I just made worse in the last version and I should fix, like an infinite loop in my code? So if you now move over to the application side of things and put aside the Kubernetes side, then I always like to say that the 3 pillars of Python troubleshooting are memory, CPU, and logical stuff. Turn that into tools, and it's a memory profiler that will tell you why stuff is leaking, and a CPU profiler that will tell you what's using the CPU.
And then it's a debugger, possibly a non-breaking debugger, but that doesn't really matter as much, which will help you debug the logical stuff. And for each of those 3 things, with Robusta, we just took the best open source tool that was out there, and we made it possible to use it on Kubernetes. For the memory stuff, we essentially use tracemalloc. And I don't know if you're familiar with it, but there used to be this tool, this library called Pyrasite, which injected code into Python processes. Yeah, I see you're nodding your head; I see you're familiar with that. Pyrasite is really cool. It just has a bug that's been there for, like, 10, 20 years since it first came out, which occasionally causes things to deadlock.
So we forked it, and we fixed that, and we ended up rewriting it. It no longer looks anything like the original. But we essentially can inject tracemalloc into your Python application. tracemalloc is, like, the go-to library in the standard library for debugging memory leaks. Essentially, it lets you say, okay, give me right now, this moment, a copy of everything in the heap. And then you take another snapshot later on, and you say, okay, now give me the diff between them. And the diff is what you leaked in that time period, whatever you allocated and didn't free. So, essentially, in Robusta, we have something that lets you kind of inject tracemalloc into an application at the exact right moment, and then we'll tell you what's leaking, and then we remove it. You kind of forget it was even there, and it stops doing anything. So we're just using the go-to standard in Python, but we'll send you a message in Slack or another location saying, okay, here are the actual Python objects that you allocated, and here's the stack trace, and here's who allocated that. So that's the memory side. If you move over to the CPU side, then my go-to tool, which I absolutely love, is py-spy. So we're just wrapping py-spy and making it easy to use on Kubernetes in 1 command. And then if you move over to the debugging side and the logical side, then I really love the VS Code debugger. And, again, if you were running stuff without Kubernetes, then everything is simple, because you would just do attach-to-process or whatever. But you obviously can't do that on Kubernetes. It's a different machine, and there's no debugger listening. So we essentially just implement attach-to-process for Kubernetes clusters, and we make it easy to get up and running.
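The snapshot-and-diff workflow described here is plain standard library tracemalloc; a minimal sketch is below. The part Robusta adds, injecting this into an already-running process via its Pyrasite fork, is not shown.

```python
import tracemalloc

# Start tracing allocations, then take a baseline snapshot.
tracemalloc.start()
first = tracemalloc.take_snapshot()

# Simulate a leak: allocations that are never freed between snapshots.
leak = [bytearray(1024) for _ in range(1000)]

second = tracemalloc.take_snapshot()

# The diff is what was allocated and not freed in the interval,
# grouped by source line and sorted with the biggest offenders first.
top = second.compare_to(first, "lineno")
for stat in top[:5]:
    print(stat)
```

Running this prints the source line of the `bytearray` allocations at the top of the diff, with roughly a megabyte of leaked memory attributed to it.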
[00:23:14] Unknown:
Digging into the Robusta tool itself, can you describe a bit about how it's implemented and some of the specific engineering challenges you've had to overcome as you've worked through building it and making it robust and reliable for people who are relying on it for their production environments?
[00:23:32] Unknown:
Yeah. So I think the core of what we do, actually, like, the heart of everything in Robusta, is Pydantic, which is by far, hands down, my favorite Python library. And we use Pydantic everywhere. And the way I often approach software engineering is I think about it all in terms of data. So I think of it in terms of data and what data has to get to what location. Like, the data is the truth, and then the functions that act on that data are kinda trivial sometimes. So everything in Robusta is just Pydantic models. So, for example, a trigger is just a Pydantic model with certain data that fires at a certain time, and an action is just a Python function that accepts a Pydantic model, which is your configuration, as input. And then the sink, the destination where you send these things, like Slack or whatever, is, again, just a Pydantic model defining that sink and where you should send the data, like Slack, the API key, the Slack channel, whatever.
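The models-all-the-way-down idea can be sketched without any dependencies. Robusta's real models are Pydantic; this stdlib `dataclasses` version only mimics the pattern, and every class and field name here is hypothetical:

```python
from dataclasses import dataclass, fields
from typing import Type

# Hypothetical stand-ins for Robusta's Pydantic models: a trigger,
# an action's parameters, and a sink are all just typed data.
@dataclass
class OnPodCrashTrigger:
    restart_threshold: int

@dataclass
class ProfileParams:
    seconds: int
    process_substring: str

@dataclass
class SlackSink:
    api_key: str
    channel: str

def load(model: Type, raw: dict):
    """Deserialize a config dict (e.g. parsed from YAML or JSON) into
    a model, rejecting unknown keys -- a tiny echo of what Pydantic
    does with validation."""
    allowed = {f.name for f in fields(model)}
    unknown = set(raw) - allowed
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    return model(**raw)

# This dict is what a YAML parser would hand back for one automation.
config = {
    "trigger": {"restart_threshold": 3},
    "action": {"seconds": 30, "process_substring": "gunicorn"},
    "sink": {"api_key": "xoxb-...", "channel": "#alerts"},
}

trigger = load(OnPodCrashTrigger, config["trigger"])
params = load(ProfileParams, config["action"])
sink = load(SlackSink, config["sink"])
print(trigger, params, sink)
```

With Pydantic instead of dataclasses you also get type coercion and JSON Schema generation for free, which is what makes the YAML-to-schema and future-UI story below possible.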
And what's nice about that is now you can take, like, a YAML file, could be JSON as well, it doesn't really matter, and you can deserialize, like, from that to your entire automation schema. And then you can even build, like, a user interface, which we haven't done yet, but we're planning to. You can build a user interface on top of that just automatically by turning Pydantic models into JSON Schema, like OpenAPI. Pydantic is the core of everything that we do, and there were some challenges at the beginning, especially for people on the team who weren't familiar with Pydantic. But once we chose Pydantic as, like, the cornerstone, the foundation on which we built Robusta,
[00:24:58] Unknown:
it was a great decision that we've never looked back on. I'll definitely second the choice of Pydantic. It's a library that I've been incorporating into my own infrastructure automation to be able to schematize the data models and the data structures and to add some validation to the configuration values that are accepted by the different systems that we run. So I wholeheartedly agree on that front. In terms of the sort of capabilities and tools and utilities that you have been building in Robusta, I'm wondering if you can talk through some of the evolution and maybe some of the recent additions to the toolchain and the ecosystem that you're building around it.
[00:25:38] Unknown:
Yeah. So for us, the big surprise is we started off with the automations platform. We started off with the concept of, okay, triggers, actions, and sinks, and we'll do the orchestration. And you just define, like, what extra data you wanna see when different events happen, and we'll, like, orchestrate all that and gather the right data. And big surprise number 1, like, a mini pivot, was people don't actually go and define these very often. They will tell you they want to, and they won't actually do it because they're just too busy and it's never a priority. So from there, we then shifted to, like, part 2. Like, when I say part 1, 2, 3, and 4 about what we do, it actually did develop that way historically.
So part 2 then was, okay, we'll write the automations for you, and we'll give you that out-of-the-box insight, and we'll get a little bit more into the monitoring space and play well with Prometheus and other stuff that's out there. And then the second surprise was we had a paid customer who was using this, and they kept saying, I wanna debug stuff on Kubernetes. I wanna debug stuff on Kubernetes. Like, let me connect VS Code. Let me connect PyCharm. And for us, it was, like, kinda not such an interesting case because, like, there's no big insight there. It's relatively easy to do. It's just annoying to set up. It's, like, really annoying to set up, and you have to understand, like, what PID namespaces are, or you have to use ephemeral containers, which aren't in GA. So, like, it's very annoying to set up. It's not just attach to process, but it's trivial to do on the other end. Like, it's really trivial to, like, write a script that automates that. So, finally, we gave in, and we're like, okay, fine. Like, we'll give you, like, your VS Code. Like, okay, here. We didn't say it that way. It was more like, okay, you're a paying customer, and you want this. This is important to you, and, like, of course we'll do this for you. So I just sat down 1 afternoon. I, like, spent 2 hours on it. I wrote something that wrapped the VS Code debugger. It just set it up, like, on the pod that you chose at the right time, and we built it on top of the automations engine. So it's also just like a Robusta automation action. Just the trigger isn't, like, an alert fired or something bad happened. The trigger is, like, you push the button or you run the command line to trigger it. And then we were really surprised that people actually loved it. And then they started requesting all this other stuff. And they'd go, hey, give me stuff like that for memory leaks too. And then can you do that for Java now? Can you do that for, like, Go? Can you do that for other things?
So that whole area with the manual troubleshooting tools really surprised us.
So that was another big surprise. And then, actually, we started using that stuff internally for debugging Robusta itself on Kubernetes clusters after we had released it for a customer. And we're like, okay, yeah, I can see why people like this. Yeah. It's definitely always interesting seeing the ways that people who are building tools end up applying them to the tools that they're building. Yeah. Yeah. Like, with the monitoring stuff, we saw those issues, like, at our previous company, me and my cofounder. So, like, we understood that really well. But then with the manual troubleshooting side, we were a little bit surprised. Like, we ourselves had never run debuggers and stuff in production, not even non-breaking debuggers. Like, we had never done that. So we were a little bit surprised. And at first, we emphasized to people, like, okay, you should just use this in test environments. Right? Like, you're just gonna use this in a test environment.
And they were like, oh, no. I used it on my production, like, with a million servers yesterday. Like, it was fine. Like, I shouldn't do that? And at that point, we were like, interesting. And you write it down, you take note, and then, like, start to learn. And at some point, we went from maybe we shouldn't recommend that people use it in production environments to, well, let's stop telling them it's a bad idea if they wanna do it, because it does seem to work. But I thought production was the test environment.
[00:28:57] Unknown:
Yeah. Yeah. It's a whole other story. Absolutely. And so for people who are adopting Robusta and they want to start integrating it into their Kubernetes environment and their workflows and their maintenance processes, I'm wondering if you can talk through some of the steps that are involved in the workflow that teams might adopt to actually start using Robusta in their day-to-day efforts?
[00:29:21] Unknown:
So if you asked me 2 weeks ago, I would have told you it's a 60 second install, and then someone on YouTube put that to the test with a stopwatch. And it's a 97.68-second install, so it's really fast. But it's not 60 seconds. And the gist of it is there are just 2 steps. We install with Helm. So Helm is a package manager for Kubernetes, for people who aren't familiar with it. So, essentially, you just do helm install, and then you give it a file with, like, a bunch of configuration settings. And that's actually step 2. Step 1 is you have to generate those configuration settings. So, typically, people handwrite this stuff. What we did with Robusta is there's just a Python CLI. You run Robusta's generate-config command, and then it asks you a few questions. Like, do you wanna connect Slack? Yes. No. Do you wanna install Prometheus as well? Yes. No. And you answer a few questions, and it generates that config file. And then, like I said, you take that config file, you run helm install, and it's up and running pretty fast. It's really easy to do. Again, I should say the only part of Robusta that isn't open source is the UI, which is disabled by default, actually, at least as of now when I'm saying this. So everything that you install by default is open source, all MIT licensed, and doesn't communicate stuff outside of the cluster, like, unless you tell it to, like, communicate with Slack or whatever, or if you choose to route, like, certain stuff through our cloud. But it's completely up to you. It's very easy to install in environments with, like, heavy security concerns as well because we make it easy to put everything under your control. As far as the
[00:30:51] Unknown:
types of questions that people are asking of their infrastructure when they are using some of these profiling and debugging tools, what are some of the common issues that they might encounter and some of the ways that they will actually be able to discover the answers and then feed that back into some of the definitions of the types of automations that they want to register in the system?
[00:31:16] Unknown:
Yeah. So it comes down to, like, 3 different areas. Area number 1, which is, I'd say, the second most common: you know you have an issue, you know what you wanna do about it, you just want it to be easy. So an example of that is, like, you have a memory leak. Just tell me what Python objects are actually responsible for that memory leak. Tell me what line of code is allocating them. Like, I know what I wanna do, and I just wanna go and run that right now, or I wanna run it next time my application has memory above a certain value. The second part of it, which is what we do best, is you get a notification in Slack saying, like, right now, there's this issue in your cluster, and this is what you should do about it. If we do our job well, then every alert that arrives will tell you what the alert means, what you should do about it, and, like, in your specific case, because we ran, like, a decision tree on the alert, and we know what the solution is. Like, click this button to fix it. And we don't have enough of those click-this-button-to-fix-it cases yet, but we're adding on more of those. And, essentially, the rationale behind that is, like, if you take Prometheus and you just set it up, there are alerts that require fine-tuning. And, like, it's gotten to the point where I can tell you, like, if you're on EKS, then you're going to have KubeSchedulerDown. If you're on GCP, then you're going to have, like, CPUThrottlingHigh, and you're going to have all these different alerts which require fine-tuning. So we're trying to fine-tune them in advance, but also when stuff happens, then we try and do the whole fine-tuning and investigation for you, but then just give you the bottom line, like, change this value to this, and that will fix it. So I hope people don't have to extend it is my answer to this part, because I hope the default thing just does a good job. And if it doesn't, then open up an issue on GitHub, and let's fix it.
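The per-alert "decision tree" idea can be pictured as a small dispatch table. KubeSchedulerDown and CPUThrottlingHigh are real Prometheus alert names, but the diagnosis text and the structure below are purely illustrative, not Robusta's actual logic:

```python
# A toy version of per-alert enrichment: map a firing alert to a
# human-readable diagnosis plus a suggested fix. The real logic is
# richer (it inspects the cluster), but the shape is the same.
REMEDIATIONS = {
    "KubeSchedulerDown": (
        "On managed control planes (e.g. EKS) the scheduler is not "
        "scrapeable, so this is usually a false positive there.",
        "Disable or silence the alert on managed clusters.",
    ),
    "CPUThrottlingHigh": (
        "The container is hitting its CPU limit and being throttled.",
        "Raise or remove the CPU limit for latency-sensitive pods.",
    ),
}

def enrich(alert_name: str) -> str:
    diagnosis, fix = REMEDIATIONS.get(
        alert_name, ("Unknown alert.", "Investigate manually.")
    )
    return f"{alert_name}: {diagnosis} Suggested fix: {fix}"

print(enrich("CPUThrottlingHigh"))
```

The point of pre-tuning is that entries like these ship with sensible defaults, so the Slack message already carries the bottom line instead of a bare alert name.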
And then the third part of it is people are taking this and doing totally unexpected things. Like, yesterday, someone reached out on our Slack channel, and they said, I have a system which has, like, object storage, and I wanna run a garbage collector on, like, the objects that I'm storing. Can I actually, like, use Robusta on certain Prometheus alerts to trigger a Kubernetes job that then cleans stuff up? Kinda like the garbage collector equivalent for a file system or for object storage. And, yeah, you can do it, but it's, like, not something we ever would have thought you would use Robusta for. It's not monitoring. It's not observability. It's not troubleshooting.
It's really just automation.
[00:33:24] Unknown:
To the point of extensibility, what are the interfaces that are available for being able to customize and extend the platform and the functionality of Robusta?
[00:33:34] Unknown:
So there are 2 extension levels. Extension level number 1 is you just write stuff in YAML. For example, we have that action which profiles Python applications. So it takes as input a Pydantic model, essentially just taking as input a bunch of fields, like what you should profile, how long you should profile it for, how many seconds, and so on. So the first step, like, the first extensibility there, is you're just going to configure stuff in YAML, or you're going to, like, run a command line which triggers that Python profiler, and you put in certain parameters. So you're taking the stuff as is, and you're just tweaking the knobs that we expose to you in these existing actions.
And then the second layer of extensibility is, okay, I'm going to write my own actions. And I said earlier that Robusta has 3 parts: triggers, actions, and sinks. So each 1 of those is extensible. You can add on new types of triggers for stuff that happens. You can add on new actions. Like, when this alert fires, I want some sort of weird data that I'm going to fetch from, like, a different API in my company, and I wanna draw a pie graph of that. And I wanna send it to your mom. Like, I don't know why you wanna do that, but, like, that's what you would write a custom action for. And then the last part is sinks, where we typically just do that for you. Like, all of this, we're happy to do for you in tandem with the open source community if you open an issue on GitHub. But the area that we add on to the most by user request is sinks. Just recently, we added on Telegram. We added support for a service called, like, Notify or something, which sends notifications to your cell phone. Someone just requested support for, I think, Mattermost, ClusterStack, MS Teams, even Kafka, like, all sorts of weird exotic locations, Opsgenie. These things are the last area where you can extend it. Typically, people just open an issue, and we go ahead and write it for them. In terms of the design of the tool and the
[00:35:18] Unknown:
way that you interact with it, I'm curious how you have approached the design and implementation and feedback cycles to make it ergonomic and intuitive to work with, given the fact that you are running in people's production environments. You are providing a service that people are going to be relying on to be able to remediate critical alerts in their infrastructure. And so a lot of the time, when they are going to be interacting with your tool, it might be in a state of panic because everything is on fire. Their whole system is down, and they need to just get it back up and running right now. What are some of the ways that you think about how to design those interaction paths for people, to sort of help them, you know, work with it in those times of stress?
[00:36:03] Unknown:
Yeah. So there are, like, 3 pillars in terms of how we think about it. The first pillar is be concise and show the data you actually care about. So let's say you're running out of disk space. Like, I can show you, for every single thing running in your cluster, all the data, or I can, like, tell you to go into Grafana and look at, like, all these different graphs. And I've just, like, shoved a whole bunch of information at you, and that doesn't solve your problem when you're in a rush and when you're busy and something's on fire. So be concise and show people the data that they most likely wanna see and that will help them. Part number 2, let people dig deeper. So if that isn't good enough for you, make it easy to dig deeper, and don't do, like, weird magic things that no one understands. So we'll give you a really good, strong, opinionated view. Like, okay, here's what you need to see. Here's why. Here's the bottom line.
But there will always be an option there, like, no, I want more detail about this, or, like, how did you reach that conclusion? And then there's a link you can click, which will, like, take you to a wiki page which explains the logic behind the alert and why it's firing and how we analyze it and so on. And then the third part of that is hire people who think like developers. Like, I don't know. Like, this isn't always good advice. And what I mean by that, I'll give you an example. Like, if you set up Robusta and use the SaaS platform as well, you actually sign up for the SaaS platform using the CLI.
So that's probably a case of, like, me coming from a developer background. Maybe taking that a little bit too far probably hurts the onboarding process. But hire people, especially for product. Like, anyone who does a product job by us is someone who was once a developer, and we won't hire nontechnical people for that position. That's advice we got from friends at Snyk. Assaf Hefetz, who was CTO of Snyk, invested in Robusta, and I'm close to a bunch of other people there. So we're very influenced by Snyk's thinking in terms of building tooling for developers. And a big part of that is to hire product people, and to hire for key functions, and, of course, developer advocates and marketing people who think like developers.
And not just bringing in a sales team, which, like, really doesn't speak the same language, or a technical support team which isn't technical enough itself.
[00:38:09] Unknown:
The other piece of the fact that you are running in these production contexts and people are relying on the utilities that Robusta provides, how do you approach the design and implementation and validation of your system to make sure that it is reliable and resilient enough for people to be able to depend upon it in those times where everything else might be breaking and falling
[00:38:35] Unknown:
over? Lots of stress testing and lots of reliance, where possible, on typed stuff so that you don't discover stuff at runtime. I've seen very big Python systems in the past which were written without type annotations. Either you had to have unit tests, like, that covered every single line of your code base, or you were constantly discovering stuff when an exception was thrown at runtime. So we make heavy use of type annotations in Python to try and really have more type safety
[00:39:02] Unknown:
and more knowledge, like, at compile time, so to speak, that you didn't do something really dumb, which is the kind of issue we have found. In your work of building Robusta and working with your customers and the community, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:39:20] Unknown:
So we've seen it used or we've seen interest in using it, and I touched on this earlier, but we've seen interest in using it for general purpose orchestration or for building different, like, automations, not necessarily for monitoring, but using Robusta to orchestrate, like, different company specific automations that have to work at specific times. An example of that is, like, what I mentioned earlier with the garbage collector for an object storage system. So that's really been a surprise. Other than that, like I said earlier, we were very surprised by how much people wanted the manual troubleshooting tools here, which we didn't expect when we got started.
[00:39:58] Unknown:
In your experience of building this tool and the business around it, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:40:06] Unknown:
I've been surprised by how many VCs get open source. I thought when we got started that I'd have to explain to VCs, like, okay, yeah, you can actually build a profitable business around open source. And I'd open my mouth and say we're open source, and they'd be like, okay, are you more like Red Hat or like HashiCorp? And the answer is HashiCorp. I've been very pleasantly surprised by how many VCs really understood it. And they understood, like, I guess, from a VC's perspective, you look at it, and, like, the space is huge. It's the future of the cloud, and Kubernetes is a huge, giant opportunity. And then the question is, can you execute on that? Like, do you have an independent existence here, really, as, like, a unique tool, or is it just something that's part of an APM, or is it something that's part of something else, and why? And then there's the whole question about the business model, which I thought would be really the hard part for an open source tool. And in almost every case, I've asked VCs, either current investors or people who might invest in the future, about it. And I've said, should we prioritize growth, or should we prioritize revenue? Everyone says growth all the time, and that always surprises me. And I'm happy with that. Like, we don't charge money for most things today. Like, the open source obviously is open source and will always be open source and free. But even for the cloud platform, you can sign up and use it for free today. There will always be a free tier anyway, but we're under 0 pressure to monetize that anytime in the future from any of our VCs, which has been a bit of a surprise.
[00:41:26] Unknown:
And to that note of the open source system and the commercial entity and the fact that you are taking venture funding, what is your overall approach to the governance and long term sustainability of the project and how you think about the boundaries between the open source and the commercial aspects?
[00:41:47] Unknown:
You should try really hard up front to make sure that there's no conflict there between the 2. And what we've been able to do, which is nice, is we've been able to really create an open source project that's completely standalone, completely useful on its own. You can use it. You don't need the commercial side. We see people taking it and using it. But the nice thing is, when the open source gets better, suddenly the commercial platform gets better too. So the commercial platform here, what will be the commercial platform 1 day, is just giving you that single pane of glass into your cluster, and you can look at everything. You can see what's going on. And all the data for that, like, the whole brain, is coming from the open source, and the open source is also completely usable without it. So you can use it. You can just send data to Slack instead, or to other locations. For people who are using the open source, if you have a healthy and active open source community, then you're always growing kind of your coverage of different things and things you can troubleshoot and all this. And it's never in competition with the commercial side, and you never have to try and put on the brakes there. Like, it's just 2 completely different products, but they go really well together. So when 1 gets better, then suddenly the other gets better as well. But the guidelines there are be open, be honest, involve the community, and make something that is genuinely useful in and of itself and which is not just something to get people into the commercial platform. I asked Shay Banon, the CEO of Elastic, about this once, about monetizing open source. And he said, the trick with open source is to make the denominator so big that it doesn't even matter if the top part of the fraction is 10% or 1% or whatever.
The bottom part of that fraction is so huge that, like, the bottom of the funnel is just gigantic. So it's okay if a lot of people take your open source tool and they don't use, sorry, they don't use the commercial platform. That's fine. Like, our goal is to run in every Kubernetes cluster in the world, and the commercial side should never be a barrier because you can just use it without that. But then there are enough people who decide, okay, yeah, I want the commercial side as well, and I want all of that. And, like, I want the user interface where I can see, like, a diff between what Kubernetes resources changed recently. I can see that on the timeline, and I can see why alerts fired, like, right after I did an upgrade, and what line did I change in the YAML during that upgrade, which is, like, a feature in the SaaS platform. But it doesn't compete at all with the open source side. For people who are interested in the capabilities
[00:44:00] Unknown:
of understanding what is happening in their clusters, being able to automate remediation because of some of these sort of cluster level constraints, what are the cases where Robusta is the wrong choice and maybe they're better suited building some of their own in house tooling or relying on some other open source or commercial offering?
[00:44:21] Unknown:
If you're not using Kubernetes, we're definitely not the right choice for you. Like, we can do it. We're not fundamentally tied to Kubernetes, but it's not a focus. If you're using Prometheus, then, like, we are the right choice. But if you're not using Prometheus and you're using some other monitoring system, then you'd have to check whether we have compatibility with that today. Like, we can send data to Datadog and stuff, and we can integrate with other monitoring systems. But if you don't use Prometheus, then you should check first whether we support your system. And if you don't use Prometheus yet, but you want to, then we are the right choice. Just type y into the command line when you set up Robusta, and we'll install a bundled Prometheus stack as well.
[00:44:57] Unknown:
As you continue to iterate on the open source and the commercial aspects of Robusta and work with the community and keep up to date with the evolution of Kubernetes and the ways that it's being applied, what are some of the things you have planned for the near to medium term, or any areas that you're particularly excited to dig into, or aspects of contribution that you're looking for help with?
[00:45:19] Unknown:
So we're looking for help with everything from, like, improvements to the docs to adding different automation actions to new triggers to new sinks. Really, just open up GitHub and the GitHub issues page. And I got started, like I said at the beginning of the podcast, programming in Python when I was, like, 14 years old with the Google Highly Open Participation Contest. I wouldn't be where I am today in my career. I wouldn't be CEO of a startup had I not been so heavily involved in open source and really learned from the community and got feedback from people and got code reviews and really learned that way. So it's something where we also try and give back. And when people come and wanna contribute, then we try and find something that's appropriate for each person. We try and really help you and mentor you and, like, let you advance your career goals by getting involved in open source and get you involved in stuff that advances what you wanna work on and what will help you. So if you're interested in getting involved in an open source project, I think we're 1 of the friendlier projects for that, especially because of our own backgrounds.
And in terms of the future and what the future holds, it's really just coverage, coverage, coverage: to cover more types of errors that can occur, to tell you for more types of errors what the issue is and how to fix them, to have more troubleshooting tools for more languages, not just Python and Java. Essentially, just to really accomplish that vision of: anything that goes wrong in your cluster, we'll tell you why it happened and what you can do about it. And, eventually, we wanna go into, like, application-specific stuff as well. Like, if you're running MongoDB on Kubernetes and there's, like, a stalled query, then we wanna be able to tell you, okay, this query is stalled because of x, y, and z. Or if you have Postgres, we wanna be able to tell you, like, this is the index that you're missing. So, really, to give you full coverage and to accomplish that vision where every DevOps problem is always accompanied by something that tells you why it happened and what's the solution, and, preferably, that gives you a button for an automated fix.
[00:47:05] Unknown:
Are there any other aspects of the Robusta platform and technology and the surrounding ecosystem and utilities that we didn't discuss yet that you'd like to cover before we close out the show?
[00:47:16] Unknown:
So it's technically not part of Robusta itself, and, actually, it has very little to do with everything else we do. But I would love for a moment just to mention a new open source tool we released called WhyProfiler, and that's "why" spelled w-h-y. And WhyProfiler is a profiler for Jupyter notebooks, and it does 2 things that are cool. First, it colors the lines in your Jupyter notebook so you can see which lines use the most CPU. And it took me some time to actually realize that's the feature that probably most people care about and not the second feature, which is unique. And the second feature is that it uses Semgrep, a static analysis tool. So we can not only tell you, like, why your code is slow, like, which line it is, but also we can recommend fixes. So if you're using, for example, like, the built-in json library, and the line for that is taking up 50% of your runtime, we'll show you, like, an icon saying, okay, there's a recommendation.
Would you like to replace using json with the orjson library, which is much faster? And if you hit yes, then it installs orjson, it reruns your code, and then it'll tell you, okay, your code just got 50% faster. 2 interesting facts about this. First of all, the original name was JuProfiler, as in Jupyter profiler. And I was told I had to change that name because it sounds like you're profiling Jews. And I, myself, am Jewish, so it was not antisemitic, but we had to change that name the day before the release. And the other part of this is this was 1 of many startup ideas that we played around with before Robusta, kinda looking at, okay, like, can we make it easier for you to figure out why your code is slow and then tell you how to speed it up? And the answer was, yeah, we can do that. And then when we pitched it to people, they were like, oh, I don't care about that as much. I have all these Kubernetes issues, so I'm never going to get around to fixing my code and making it faster. It just isn't a priority. Like, it's okay. I'll add on another 2 CPUs.
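WhyProfiler itself targets Jupyter cells, but the underlying idea, attributing CPU time to the code that spends it and then looking for a faster substitute, can be approximated with the standard-library profiler. The json-heavy workload below is only an illustration of the kind of hotspot that would get colored red (and that a Semgrep-style rule might flag for replacement with a faster JSON library):

```python
import cProfile
import io
import json
import pstats

def hot_path():
    # A deliberately json-heavy workload: serialize and parse the
    # same blob repeatedly so the json calls dominate the profile.
    blob = {"numbers": list(range(200))}
    for _ in range(2000):
        json.loads(json.dumps(blob))

profiler = cProfile.Profile()
profiler.enable()
hot_path()
profiler.disable()

# Render the top entries by cumulative time into a string report.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
print(report[:200])
```

Reading the report by hand is the manual equivalent of the colored-lines view; the "replace it and rerun" step is what WhyProfiler automates on top.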
[00:49:00] Unknown:
And then we thought, okay, well, maybe we should work on that problem instead, and Robusta was born. That's definitely a very interesting insight into the ways that companies are formed and the ways that their focus is determined.
[00:49:12] Unknown:
Yeah. Yeah. It was a surprise. You think, like, okay, I'll tell you you can make your code, like, 90% faster, and then you tell it to the developer, and they say, like, okay, but I'm working on this other feature right now, and, like, I don't necessarily wanna revisit that code I wrote, like, 2 years ago to make it faster. And in most cases, it's code that someone else wrote. I'm not even familiar with that code. Now you want me to change it, and you're claiming that this change, like, will maintain correctness, and it's 100% equivalent, but it'll be faster. Like, I don't necessarily care. And then you say to them, like, oh, reduce your cloud costs. Well, the developer typically doesn't care. It's not their budget. And then you tell it to the DevOps guy, and the DevOps guy says, like, it's not his area either because you're changing the code. And you tell it then to, like, the FinOps guy or the head of finance or whatever, and he says, okay, yeah, that sounds like a great idea. But everyone else in the company is against it. So it was kind of a surprising thing. It's, like, 1 of those cases where you're doing something that's good for the company as a whole, but it basically makes every single individual's life worse, and you're just creating work for everyone.
[00:50:14] Unknown:
Yeah. It's definitely an interesting Venn diagram of concerns and sort of figuring out how to make somebody care enough to do something about it.
[00:50:23] Unknown:
Yeah. Yeah. So either it's a case of, like, a solution looking for a problem, where there is a problem, it's just not a problem anyone has, so it's not a problem. Or, if you prefer, it's a case of there really being no persona. Like, in startup terms, there's no persona that this answers to.
[00:50:40] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose a book that I started reading recently called Kubernetes: Up and Running from the folks at O'Reilly. So I started reading that because I'm starting my own journey into Kubernetes. So I just wanted to get up to speed with the different considerations and concepts and figure out, you know, what I need to dig deeper into. So I recommend that for folks who are in the same situation. And so with that, I'll pass it to you, Natan. Do you have any picks this week? Yeah.
[00:51:16] Unknown:
So my number 1 pick is Kubernetes for Sysadmins. It's a YouTube video by Kelsey Hightower, who is 1 of the pioneers of Kubernetes at Google, or at least the public figure most associated with Kubernetes from Google. And it's a great video where he plays Tetris, and he speaks about Kubernetes and how they're related. I'm not gonna give any more spoilers, but, honestly, it's the best introduction to Kubernetes, in my opinion, hands down. And that's, like I said, Kubernetes for Sysadmins by Kelsey Hightower. And other than that, I don't know. Like, I'm getting married next week. So before a wedding, you have to make a ton of decisions, and you're always asked to pick 1 thing versus another, and it gets to the point where you have this decision fatigue. It's like I no longer care what my favorite food is, and you're hearing inputs from everyone. So it's probably a bad week for me to pick anything. Like, I'm picking a million things, but I'm not gonna go and pitch any single thing. Anything that would just get it done, that there's consensus around among all the parents and everyone who has a say on the wedding, that's good enough for me. My pick, obviously, of course, is my wonderful fiancée, who's extremely supportive and who I could never do Robusta without.
But that's up for someone else to pick.
[00:52:29] Unknown:
Well, sounds like you're picking delegation. So congratulations, and definitely best wishes for the ceremony next week. Thank you very much. It's been a pleasure. Thank you for having me. Yeah. Thank you again for taking the time. Definitely great to be able to explore the work that you're doing. I appreciate all the effort you're putting into helping make running Kubernetes more palatable and sustainable for everybody. So thank you again for the time and effort you're putting in there, and I hope you enjoy the rest of your day. Thank you. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Natan Yellin and Robusta
Challenges in Kubernetes Security and Operations
When to Use Kubernetes
Common Issues in Kubernetes Production Environments
How Robusta Helps with Kubernetes Challenges
Robusta's Place in the Observability and Monitoring Space
Implementation and Engineering Challenges of Robusta
Adopting Robusta in Your Workflow
Designing for Reliability and Resilience
Lessons Learned and Future Plans for Robusta
New Open Source Tool: Yprofiler