Healthchecks.io: Open Source Alerting For Your Cron Jobs with Pēteris Caune

Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. I would like to thank everyone who supports the show on Patreon. Your contributions help to make the show sustainable. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your app. And now you can deliver your work to your users even faster with the newly upgraded 200 gigabit network in all of their data centers.

If you're tired of cobbling together your deployment pipeline, then it's time to try out GoCD, the open source continuous delivery platform built by the people at ThoughtWorks who wrote the book about it. With GoCD, you get complete visibility into the life cycle of your software from one location. To download it now, go to podcastinit.com/gocd. Professional support and enterprise plugins are available for added peace of mind.

You can visit the site at podcastinit.com to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions, I would love to hear them. You can reach me on Twitter at podcastinit or email me at host@podcastinit.com.

To help other people find the show, please leave a review on iTunes or Google Play Music. Tell your friends and coworkers and share it on social media. Your host as usual is Tobias Macey. And today, I'm interviewing Pēteris Caune about Healthchecks, a Django app which serves as a watchdog for your cron tasks. So, Pēteris, could you please introduce yourself?

Thank you, Tobias. So, yep, I'm the author of Healthchecks.io. I'm a guy living in Riga, Latvia, doing a lot of cross country cycling and programming, and, yeah, that's about it.

And do you remember how you first got introduced to Python?

Yep. It was in 2003 or 2004, at my first workplace. I was working on an internal information system. Later on, I worked on semantic web technologies and started writing tests and doing test driven development. And then in the later years, I wrote more conventional web applications in Django and Flask.

And can you start by explaining what the Healthchecks application is and what motivated you to build it in the first place?

So Healthchecks is a monitoring service that monitors your cron jobs, scheduled tasks, and different things like that that need to run semi-regularly, where you want to be notified if they stop working or stop running. The canonical example would be backup jobs that run nightly. What can easily happen is that you set up backups once and then forget about them. Then a year later, there's an emergency, and you go looking for your backup and only discover that your script stopped working six months ago, and there are no backups. And it was a silent failure. No one was notified, and you're out of luck. So this service helps with that. It will stay silent as long as things work, and it will send you notifications as soon as, on a given night, the backup script doesn't check in and doesn't say, "I'm fine. I just completed the backup."

My motivation for making it was, basically, I needed a service like that. I looked at the existing solutions and decided that I'd like to take a stab at it myself, and started this as a side project.

And you mentioned that there are other services that provide a similar kind of service. So I'm wondering if you can do a bit of compare and contrast with some of the other offerings that are available, and what you found lacking in those that made you want to build your own?

Yeah, sure. So two very well known ones that existed at the time when I was looking for solutions were Cronitor and Dead Man's Snitch. At this point, there's a bunch more. They've been cropping up, and they do work. I've tried them. I guess the main reason why I started my own was that I thought the existing solutions were overpriced for what I needed to do. The pricing is probably fair, depending on how you look at it. But at the point in time when I just needed this one task monitored, or these three tasks, I didn't want to invest much in them. I thought, oh, come on, this can be done cheaper. Well, so I thought.

So one difference between them is the pricing structure. The other difference is that Healthchecks is an open source project. The code is on GitHub, so there's an option of hosting it yourself, and I know that people actually do that. Functionality-wise, I think they're quite similar. Of course, there are UI differences, and there are differences under the hood in how things are implemented, but they're basically direct competitors.

Yeah. I currently use Healthchecks for some of my scheduled tasks, and one of the things that motivated choosing it over some of the other options is the fact that, one, it is very fairly priced compared to some of the other options, and two, I do have the option of self-hosting it if I get to a point where I decide that that's more worth my time. And the pricing, as you said, is quite generous for what you get on your hosted version of the service. So I'm wondering if you can discuss how you arrived at the cost structure that you're using right now and whether it has proven to be profitable for you.

Yeah. So I guess the whole reason why I started the project, the premise of why do anything, was that I thought that I can do this cheaper, or for free for most people. If you are not using it heavily, then this should be a free service, like many free services out there. And then if you really are using it heavily, with a lot of team members and a lot of checks, then maybe it makes sense for you to give back some money. So that was the idea going in: can I make a product that's at least self-sustaining and works like that?

The other thing was that I wasn't profit driven. It was more a hobby project, a way to learn about different aspects of building your own service: not just full stack development, but also all the paperwork that goes around it, and marketing, and deployment, and monitoring the service itself, and so on. So it's a learning experience for me. It's also something you can put on your CV, in your portfolio. So, at least in the beginning, I wasn't too concerned about whether it would be highly profitable. As long as it pays its own bills, that's fine.

And that's basically the state it's in now. The bills have been going up. The money I get from it has also been going up, kind of, and they are in balance. But I'm not leaving my day job just now.

And what does the deployment architecture look like, and how is Healthchecks itself implemented? And I'm wondering too how both of those designs have evolved since you began working on and using it.

So Healthchecks, the Django app, is pretty standard. It's intentionally not doing anything too clever. It's not using many dependencies. It uses basic Django features. It uses either a Postgres or MySQL database for storing data, and that's about it. It's not a big elaborate setup using some fancy new data storage layer or anything like that. The idea was: let's not overcomplicate it from the beginning, and let's see how far we can get scaling-wise just by doing this kind of naive approach. And so far, it's been fine. I've needed to move it to more powerful servers from time to time. But at the moment, it's doing fine, and it can handle the load pretty well, as far as I know.

The deployment part, how it's actually set up on the servers, which servers, and what services it uses, that's an area that has changed a lot, and that's actually something I didn't expect initially. I've spent a lot of time working on this. I started off with a single DigitalOcean droplet, a $5 droplet, and that was it. That was when it first launched, and nobody was using it. Since then, I've moved around to try different services, testing my hypotheses about whether this will work better or that will work better. So I moved to Linode. I moved to Google Cloud. I considered using Heroku, and considered using Amazon Web Services. And by "consider", I mean actually write deployment scripts, test it out, be almost ready to deploy and switch over, and then decide against it.

There's a long shopping list of features that the hosting environment needs for this to really work. For example, the hosting environment needs some kind of load balancing, or the ability to have a flexible IP, some kind of switchover so it can do seamless deployments. It needs to have IPv6 support. Ideally, it would have SSL termination. So there are various things like that that you need to look out for as you move on.

Also, there have been, well, interesting discoveries. Let's say I moved to Google Cloud and started using Google's load balancer, and after a while I started hitting errors that came from the load balancer, not from my code. It took quite a while to track them down and follow up with Google support, and in the end, I wasn't able to sufficiently resolve them. So my options were to migrate over to a different service or live with them, and ultimately I migrated over. And that's something you cannot tell will be the case when you're just looking at their spec sheet and googling around for other people's experience. You have to try the service to see if it will work or not.

Yep. So on the deployment side, I've learned a ton of new technologies with this: different cloud providers and also deployment technologies. Let's say I started with Fabric, and then I moved to Ansible. A while later, I moved back to Fabric. At some point, I had a setup that used Dokku, and then sometime later, I used Kubernetes.

Anyway, one last note on that aspect is the current deployment. Currently, the service is running on Hetzner bare metal servers. Hetzner is a hoster in Germany. It's a pretty simple setup. You have your bare metal servers that you can SSH into. It's reasonably quick to provision a new bare metal server if you need to, let's say, if something happens to your existing ones. I use Cloudflare as a load balancer in front of them. So I can take one web server out of rotation, update it, make sure it works, and put it back into the load balancer, and that seems to work fine so far.

The database is also on a bare metal server. That's actually currently one of my concerns and a thing I am looking into: how you do failover on Postgres. There are many solutions for that, and each one has its own pros and cons. And I want to make a system that is actually usable and is worth it. If you make a failover system that's very complicated, then when it's finally time to fail over, what can happen is, well, it breaks, and it's no better than the single point of failure that I currently have on the database side.

Yeah. Running Postgres in a highly available configuration is definitely a nontrivial task and one that I've dealt with in the past. When I was working on it, I ended up going with a streaming replication setup where the secondary was a hot spare for being able to do read queries against, and then I used Pgpool-II as the query balancer and the failover trigger as well. But that ended up having a few weird edge cases as well, so it's definitely nontrivial. Actually, on my other podcast, I recently interviewed the folks behind Citus, the Postgres extension for being able to run it in a distributed context, which was an interesting discussion as well. So I'll add the link to the show notes.

And one of the other things that I was curious about: given the fact that the Healthchecks use case is designed for a lot of fairly regularly scheduled tasks, but ones that are most of the time probably going to be fairly spread out, I'm curious if you ended up finding any benefit from or need for caching in the web architecture, or if it's sufficient to just go directly against the database given the very regular and known periodicity of the interactions?

Yeah. The traffic that the site gets is interesting. It's very spiky. There are huge spikes of services checking in around, let's say, round hours, or every 15 minutes, every quarter hour. So you'd see a base load of, let's say, a few tens or 50 or so pings per minute, sorry, per second. And then at certain times, when it's midnight somewhere, there'd be a huge spike where, for a short amount of time, you'd get thousands of requests per second. And you need to be able to keep up with them.

One of the hardest things there is also that they're all using HTTPS, each being a separate new request. So there's a TLS handshake, which is computationally expensive. If you do that on a weak system, that alone can easily overwhelm the system, and then it cannot keep up with them. This is where Cloudflare helps, in that TLS is terminated there, and from Cloudflare to my servers, the connections use keep-alive, so I don't need to do as many TLS handshakes.

Caching-wise, there are a few clients that do a lot of pings, like multiple pings per second. There's no huge benefit for us or for the users themselves in doing them so often, but I guess it's just easier for them to be set up that way. The problem there is actually writing to the database, not reading from the database. There are only so many writes you can do to the database. Writes, at some point, get expensive, and you need to start saving on them. So what I'm doing is I have a little cache in the code that receives pings. If there has been a ping from a given service in the last, let's say, 10 seconds, then we won't do another write when the next ping comes. So there's a kind of 10 second timeout. That helps with the cases when somebody sends pings every second or multiple pings every second. Then it's harder for that single client to overwhelm the whole server with lots of IO operations on the database. On the read side, I haven't really needed any kind of caching so far. But, yeah, we'll see how it goes.
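
The write-throttling idea described above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the actual Healthchecks code: the class name `PingWriteThrottle`, the `should_write` method, and the in-process dictionary are all hypothetical names invented for this example. Only the roughly 10 second window comes from the conversation.

```python
import time
from typing import Callable, Dict


class PingWriteThrottle:
    """Remembers when each check was last written, to skip redundant writes.

    Hypothetical helper for illustration only; the real service may
    implement its cache quite differently.
    """

    def __init__(
        self,
        window_seconds: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behavior can be tested
        self._last_write: Dict[str, float] = {}

    def should_write(self, check_id: str) -> bool:
        """Return True if this ping should be persisted to the database."""
        now = self.clock()
        last = self._last_write.get(check_id)
        if last is not None and now - last < self.window_seconds:
            # A ping for this check was written recently: skip the write.
            return False
        self._last_write[check_id] = now
        return True
```

A ping handler would call `should_write(check_id)` and only touch the database when it returns `True`. The dictionary here lives in one process and grows without bound, so a real deployment with several web servers would need eviction and would still see up to one write per window per process.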

Yeah. It's always interesting how different services and different use cases can have such different needs in terms of whether caching ends up being beneficial or just an extra burden that could potentially have negative impacts on the overall throughput of the system. And I'm wondering too what you have found to be some of the most challenging aspects of building and running the Healthchecks application

and the

isn't the hardest part and isn't the part that you spend the most time on. And that's not really a surprise. If you read, let's say, Hacker News, from time to time you'll see solo entrepreneurs saying that the bulk of the work will go into marketing, doing paperwork, doing customer support, and monitoring your servers, and not into actually working on your product. And that's how it is. That's what I've found out in practice, working on this alone. This is something I've been shielded from before in the day job. And now on this project, I get to see that, yep, that's the case: I'd be responding to queries from customers, fixing tickets, and doing stuff like that.

Also, on the billing side, accepting payments is trickier than I initially thought: getting different payment processors to work, filling in the paperwork, and handling invoices. That's one area that I've been working on just now.

And the other tricky part, of course, is the fact that, ideally, it would be fully highly available, and the closer I can get to that state, the better. So there's continuous improvement in that area. It's pretty easy to make the web services pretty much bulletproof. You can have many web servers load balanced, and they can all hit the database. Then it's trickier to decide what to do with the database. We already talked about what kind of database you use, how you do the failover, how you do scaling, and whether you maybe don't use MySQL or Postgres at all but use some kind of distributed database. I guess Citus would help with scaling up. I'm not sure how much help it would be in having more availability, but I haven't really looked into it in detail.

And for somebody who wants to run Healthchecks on their own infrastructure, what are some of the things that they should be aware of as they're deploying and setting up the service, particularly if they want to make sure that it is going to be fairly reliable?

I think the setup I would recommend for hosting it yourself would be to use Heroku. There is a fork of Healthchecks on GitHub that's adapted so that it's very easy to deploy to Heroku. In fact, you can click the "Deploy to Heroku" button, answer a few questions, and you'd be up and running. Of course, you need your own domain name, and you'll probably want to customize your mail settings, what your email address will look like, and so on. But then it's there. It's working. And if you need the service to be reliable, then it's easy to just move the sliders in Heroku's dashboard and go from the free dyno to a professional dyno, and from hobby Postgres to professional Postgres, which has a high availability feature as well. That should be pretty easy to set up and to maintain going forward. So that's the one I'd suggest.

It's also possible to do it all yourself. I know there are people running it in Docker containers, or people running it similarly to how I'm doing it. But that's something that you'd be spending more time on: upgrading, keeping an eye on it, and general maintenance. That might make sense to do as well if you are using it seriously. But for small to medium setups, I think Heroku would work great.

One of the critical aspects of the service is the fact that if the different tasks don't check in within the scheduled time frame, it will actually send out a notification. And I know that for that, you have a number of different integrations available. So I'm wondering if you can just talk through what a typical workflow for a single task would be, in terms of how the application identifies that it is failing, and then what some of the options are for sending out that notification to let somebody know that they need to check in on that particular job?

Sure. So let's say you have an application that needs to send out emails every week. It's a bunch of code in some programming language that goes through a loop and sends an email for each one, and then it's done. And let's say you want to make sure that this loop always completes every week and never gets interrupted in the beginning or somewhere in the middle. What you can do in the app is, after that loop, instrument your code with a call to the Healthchecks ping endpoint. Depending on your language, it'd be a different looking snippet of code, and we have examples of those for various languages. You'd just stick it in your code. What would happen from then on is, when the app runs, it sends its weekly emails, then it runs the snippet, which pings Healthchecks, and it's done. On the Healthchecks side, what you'd see is that every week at around the same time, it would receive a ping, the check would have a green icon next to it, and everything would be fine.

So let's say one week something goes wrong. For example, you hit the sending quota, and you cannot send all your emails, so your code throws an exception. The ending part of the code doesn't complete, so the ping isn't sent. What would happen next on Healthchecks.io is the check would go from status green to status red. At that point, Healthchecks would send an email to the same email address you used for signing up. So you'd just receive an email: hey, this check has stopped reporting in. What you can also do is set up multiple different channels. The most popular is notifications to Slack, but there's a bunch more. There are notifications to Telegram, notifications to PagerDuty, SMS messages, and so on. So there's a list of those, and different people, depending on what types of services they already use in their workflow, would just use whatever messaging service or channel they prefer.

So what you do next is you receive that notification. You check your app. You check your logs. You figure out why it didn't work, what has gone wrong. So you fix your app and run that task to send weekly emails again. This time, it completes correctly. Another ping is sent. On the Healthchecks.io side, the check goes from red to green. You'll get another notification saying that, hey, it's all good again, and life goes on.
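
As a rough illustration of the workflow just described, here is what instrumenting a job with a ping at the very end might look like in Python, using only the standard library. The URL below is a placeholder and the function names `send_ping` and `run_job` are hypothetical, made up for this sketch; a real ping URL would be the one shown for your check in the service.

```python
import urllib.request
from typing import Callable

# Placeholder URL (an assumption for this sketch); substitute the ping URL
# that the service displays for your own check.
PING_URL = "https://example.com/ping/your-check-id"


def send_ping(url: str = PING_URL, timeout: float = 10.0) -> bool:
    """Report success. Returns False rather than raising if the ping fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        # A failed ping should not crash a job that already finished its work.
        return False


def run_job(job: Callable[[], None], url: str = PING_URL) -> bool:
    """Run the job, then ping only if it completed without an exception."""
    job()  # if this raises, the next line never runs and no ping is sent
    return send_ping(url)
```

The important design point is the ordering: the ping comes after the work, so an exception anywhere in the job means the monitoring side simply never hears from it and eventually raises the alert.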

And are there any particularly interesting or unusual uses of Healthchecks that you've seen?

Yeah. So one interesting use that I've come across is on GitHub. It looks like there's a company in Africa that's using Healthchecks for educational purposes, as a playground for teaching test driven development. From what I can tell, what they do is create a fork of the code base, remove some of the tests, and then give it as an assignment to somebody to write the missing tests. And I guess the fact that the code base is pretty straightforward helps. It's not too convoluted. For the most part, it's readable, standard Django code. It makes for a good sandbox environment that you can easily get running locally, make sure that, yep, there are some tests that are failing, write those tests or fix those tests, write some more, and so on. So that's been a use case I didn't anticipate at all, that somebody would use it like that, but that's great.

So what are some of the improvements or features that you have planned for future releases of Healthchecks?

So for the app itself, at the moment, I don't have big groundbreaking features in mind. It's more about small refinements, fixing issues, and adding smaller missing features. Basically, it is already useful as it is. Actually, I think it was already useful six months after I started to work on it. But, of course, when you get feedback from users, you'll find out, oh, yeah, I could add this as well, and that would make it better. So it's probably going to be evolution, not something big that I'm going to add. In other areas, there are improvements I can make on the high availability front, most notably the database. So I'll be working on that. But development-wise, I'm happy with the shape it's in currently.

And are there any other topics that we didn't discuss that you think we should have before we start to close out the show?

I think we covered it pretty well.

Well, for anybody who wants to get in touch with you or follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. For my pick this week, I'm going to choose the TV I just got over the holidays. Since 4K is starting to become more affordable, I ended up getting one that's fairly reasonably priced. I got the LG 55uj63100, which, for anybody who doesn't want to try and remember all those random numbers and letters, I've got in the show notes. It's a decent sized TV. It's got 4K display capabilities. It's got smart functionality built in. And for the price, it's a really nice TV: good picture, good feature set, and I've been enjoying using it. So, yeah, for anybody who's in the market, it's a decent choice. And so with that, I'll pass it on to you. Do you have any picks for us this week, Pēteris?

Yes. So two training apps for cycling. One is called Zwift, and the other is called TrainerRoad. For these, you need a bike and a smart trainer. You put the bike on the trainer, and then you have your screen in front of you, and you can use them to work out. They're great for this season, when it's cold and wet outside. Zwift is kind of a virtual environment where you cycle in a virtual world with other cyclists like you. So you can interact with different people, and it's kind of a social experience. TrainerRoad, on the other hand, is more structured workouts, more focused on getting fit and attaining your goals. I've been using both for more than a few years now. Well, TrainerRoad, at least. Zwift is newer. And they've been working great. So if you're a cyclist and you are not completely opposed to training inside, then these would be great to have a look at.

Well, I appreciate you taking the time out of your day to join me and talk about the work you're doing with Healthchecks. It's a service that I take advantage of and have gotten benefit from, and it's an interesting problem space. So I am sure other people will be able to gain some benefit from it as well. So thank you again for your time, and I hope you enjoy the rest of your day.

Thanks. Thank you for interviewing me. This has been great.


The Python Podcast.__init__