Summary
Your backups are running every day, right? Are you sure? What about that daily report job? We all have scripts that need to be run on a periodic basis and it is easy to forget about them, assuming that they are working properly. Sometimes they fail and in order to know when that happens you need a tool that will let you know so that you can find and fix the problem. Pēteris Caune wrote Healthchecks to be that tool and made it available both as an open source project and a hosted version. In this episode he discusses his motivation for starting the project, the lessons he has learned while managing the hosting for it, and how you can start using it today.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- I would like to thank everyone who supports us on Patreon. Your contributions help to make the show sustainable.
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your awesome app. And now you can deliver your work to your users even faster with the newly upgraded 200 Gbit network in all of their datacenters.
- If you’re tired of cobbling together your deployment pipeline then it’s time to try out GoCD, the open source continuous delivery platform built by the people at ThoughtWorks who wrote the book about it. With GoCD you get complete visibility into the life-cycle of your software from one location. To download it now go to podcastinit.com/gocd. Professional support and enterprise plugins are available for added peace of mind.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media.
- Your host as usual is Tobias Macey and today I’m interviewing Pēteris Caune about Healthchecks, a Django app which serves as a watchdog for your cron tasks
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by explaining what Healthchecks is and what motivated you to build it?
- How does Healthchecks compare with other cron monitoring projects such as Cronitor or Dead Man’s Snitch?
- Your pricing on the hosted service for Healthchecks.io is quite generous so I’m curious how you arrived at that cost structure and whether it has proven to be profitable for you?
- How is Healthchecks functionality implemented and how has the design evolved since you began working on and using it?
- What have been some of the most challenging aspects of working on Healthchecks and managing the hosted version?
- For someone who wants to run their own instance of the service what are the steps and services involved?
- What are some of the most interesting or unusual uses of Healthchecks that you are aware of?
- Given that Healthchecks is intended to be used as part of an operations management and alerting system, what are the considerations that users should be aware of when deploying it in a highly available configuration?
- What improvements or features do you have planned for the future of Healthchecks?
Keep In Touch
Picks
- Tobias
- Pēteris
Links
- Healthchecks.io
- GitHub
- Riga
- Latvia
- Cross Country Cycling
- Semantic Web
- Django
- Flask
- Cron
- Cronitor.io
- Dead Man’s Snitch
- IPv6
- Load Balancing
- PostgreSQL
- MySQL
- Fabric
- Ansible
- Dokku
- Kubernetes
- Hetzner
- Cloudflare
- Pgpool-II
- Streaming Replication
- Citus Data
- Heroku Fork
- The Evolution of healthchecks.io Hosting Setup
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. I would like to thank everyone who supports the show on Patreon. Your contributions help to make the show sustainable. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your app. And now you can deliver your work to your users even faster with the newly upgraded 200 gigabit network in all of their data centers. If you're tired of cobbling together your deployment pipeline, then it's time to try out GoCD, the open source continuous delivery platform built by the people at ThoughtWorks who wrote the book about it. With GoCD, you get complete visibility into the life cycle of your software from one location. To download it now, go to podcastinit.com/gocd. Professional support and enterprise plugins are available for added peace of mind. You can visit the site at podcastinit.com to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions, I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email me at hosts@podcastinit.com.
To help other people find the show, please leave a review on iTunes or Google Play Music. Tell your friends and coworkers and share it on social media. Your host as usual is Tobias Macey. And today, I'm interviewing Pēteris Caune about Healthchecks, a Django app which serves as a watchdog for your cron tasks. So, Pēteris, could you please introduce yourself? Thank you, Tobias.
[00:01:33] Unknown:
So, yep, I'm the author of Healthchecks.io. I'm a guy living in Riga, Latvia, doing a lot of cross-country cycling, programming,
[00:01:43] Unknown:
and, yeah, that's about it. And do you remember how you first got introduced to Python? Yep. It was in,
[00:01:49] Unknown:
2003 or 2004, at my first workplace. I was working on an internal information system. Later on, I worked on semantic web technologies, started writing tests and doing test-driven development. And then in the later years, I wrote more conventional web applications in Django and Flask.
[00:02:09] Unknown:
And can you start by explaining what the Healthchecks application is and what motivated you to build it in the first place?
[00:02:17] Unknown:
So Healthchecks is a monitoring service that monitors your cron jobs, scheduled tasks, and different things like that that need to run semi-regularly. And you want to be notified if they stop working, stop running. So the canonical example would be backup jobs that run nightly. And what can easily happen is that you set up backups once and then forget about them. Then a year later, there's an emergency, and you go looking for your backup and only discover that your script stopped working six months ago, and there are no backups. And it was a silent failure. No one was notified, and you're out of luck. So this service helps with that.
It will, well, stay silent as long as things work, and it will send you notifications as soon as, on a given night, the backup script doesn't check in and say, I'm fine, I just completed the backup. My motivation for making it was, basically, I needed a service like that. And I looked at the existing solutions and decided that I'd like to take a stab at it myself, and started this as a side project.
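The check-in pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `PING_URL` is a made-up placeholder (a real ping URL would come from the check you create), and `run_backup`, `send_ping`, and `backup_and_check_in` are hypothetical names, not part of Healthchecks itself.

```python
import urllib.request

# Hypothetical ping URL; a real one would come from your own check's settings.
PING_URL = "https://example.com/ping/your-check-id"


def run_backup():
    """Placeholder for the real backup work; it should raise on failure."""
    return "backup complete"


def send_ping(url, timeout=10):
    """Make a plain HTTP GET request to tell the monitor the job finished."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def backup_and_check_in(backup=run_backup, ping=send_ping, url=PING_URL):
    """Run the job first and ping only afterwards.

    If the backup raises, the ping line is never reached, and the missing
    check-in is what the monitoring service notices.
    """
    result = backup()
    ping(url)
    return result
```

The ping is deliberately the last step, so a crash anywhere earlier results in silence rather than a false "I'm fine". In a real job you might also catch network errors around the ping so that a monitoring outage does not fail the backup itself.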
[00:03:24] Unknown:
And you mentioned that there are other services that provide a similar kind of service. So I'm wondering if you can do a bit of compare and contrast with some of the other offerings that are available and what you found lacking in those ones that made you want to build your own?
[00:03:41] Unknown:
Yeah. Sure. So two very well-known ones that existed at the time when I was looking for solutions were Cronitor and Dead Man's Snitch. At this point, there's a bunch more. They've been cropping up, and they do work. I've tried them. And I guess the main reason why I started my own was that I thought the existing solutions were overpriced for what I needed to do. Like, the pricing is probably fair, depending on how you look at it. But at the point in time when I just needed this one task being monitored, or these three tasks, and I didn't want to invest much in them, I thought, oh, come on, this can be done cheaper. Well, so I thought.
And, yeah. So one difference between them is the pricing structure. The other difference is that Healthchecks is an open source project. The code is on GitHub, so there's an option of hosting it yourself. And I know that people actually do that. Functionality-wise, I think they're quite similar. Of course, there are UI differences, and there are differences under the hood in how things are implemented, but they're basically direct competitors.
[00:04:46] Unknown:
Yeah. I currently use Healthchecks for some of my scheduled tasks, and one of the things that motivated choosing it over some of the other options is the fact that, one, it is very fairly priced as compared to some of the other options, and two, I do have the option of self-hosting it if I get to a point where I decide that that's more worth my time. And the pricing, as you said, is quite generous for what you get on your hosted version of the service. So I'm wondering if you can discuss how you arrived at the cost structure that you're using right now and whether it has proven to be profitable for you. Yeah. So I guess the whole reason why I started the project, the premise of why do anything, was that I thought that I can do this
[00:05:31] Unknown:
cheaper or for free for most people. Like, if you are not using it heavily, then this should be a free service. Like, there are many free services out there. And then if you really are using it heavily, with a lot of team members and a lot of checks set up, then maybe it makes sense for you to give back some money. And so that was the idea going in: can I make a product that works like that and is at least self-sustaining? And the other thing was that I wasn't profit driven; it was more a hobby project. It was to learn about different aspects of building your own service, like, not just full stack development, but also all the paperwork that goes around it, and marketing, and deployment, and monitoring the service itself, and so on. So it's a learning experience for me. It's also something you can put on your CV, in your portfolio.
So, at least in the beginning, I wasn't too concerned about whether it will be highly profitable. As long as it pays its own bills, that's fine. And that's basically the state it's in now. Like, the bills have been going up. The money I get from it also has been going up, kind of, and they are in balance. But I'm not leaving my day job just now.
[00:06:48] Unknown:
And what does the deployment architecture look like, and how is Healthchecks itself implemented? And I'm wondering too how both of those designs have evolved since you began working on and using it.
[00:07:00] Unknown:
So Healthchecks, the Django app, is pretty standard. It's intentionally not doing anything too clever. It's not using many dependencies. It uses basic Django features. It uses either a Postgres or a MySQL database for storing data, and that's about it. It's not a big, elaborate setup using some fancy new data storage layer or anything like that. And the idea was, let's not overcomplicate it from the beginning, and let's see how far we can get scaling-wise just by doing this kind of naive approach. And so far, it's been fine. I've needed to move it to more powerful servers from time to time. But at the moment, it's doing fine, and it can handle the load pretty well, as far as I know.
The deployment part, how it's actually set up on the servers, and which servers and what services it uses, that's an area that has changed a lot, and that's actually something I didn't expect initially. I've spent a lot of time working on this. So I started off with a single DigitalOcean droplet, a $5 droplet, and that was it. That was when it first launched, and nobody was using it. And since then, I've moved around to try different services, testing my hypotheses about whether this will work better or that will work better. So I moved to Linode. I moved to Google Cloud.
I considered using Heroku, considered using Amazon Web Services. And by consider, I mean actually write deployment scripts and test it out and be almost ready to deploy and switch over, and then decide against it. So there's a long shopping list of features that the hosting environment needs for this to really work. So, for example, the hosting environment needs some kind of load balancing, or the ability to have a flexible IP, some kind of switchover, so it can do seamless deployments. It needs to have IPv6 support. Ideally, it would have SSL termination.
So there are various things like that that you need to look out for as you move on. And, also, there have been, well, interesting discoveries. Like, let's say, I moved to Google Cloud and started using Google's load balancer, and after a while, I started hitting errors that came from the load balancer, not from my code. And it took quite a while to track them down and follow up with Google support. And in the end, I wasn't able to sufficiently resolve them. And so my options were, well, migrate over to a different service or live with them, and ultimately I migrated over. And that's something you cannot tell when you're just looking at their spec sheet and googling around for other people's experience. You have to try the service to see if it will work or not.
Yep. So on the deployment aspect, I've learned a ton of new technologies with this, different cloud providers and also deployment technologies. Like, let's say, I started with Fabric, and then I moved to Ansible. A while later, I moved back to Fabric. And at some point, I had a setup that used Dokku, and then sometime later, I used Kubernetes. Anyway, one last note on that aspect is the current deployment. So currently, the service is running on Hetzner bare metal servers. Hetzner is a hoster in Germany. So it's a pretty simple setup. You have your bare metal servers that you can SSH into. It's reasonably quick to provision a new bare metal server if you need to, let's say, if something happens to your existing ones. I use Cloudflare as a load balancer in front of them. So I can take one web server out of rotation, update it, make sure it works, put it back into the load balancer, and that seems to work fine so far.
And the database is also on a bare metal server. That's actually currently one of my concerns and a thing I am looking into: how you do failover on Postgres. There are many solutions for that, and each one has its own pros and cons. And I want to make a system that is actually usable and is worth it. Like, if you make a failover system that's very complicated, then when it's finally time to fail over, what can happen is, well, it breaks, and it's no better than the single point of failure that I have currently on the database side.
[00:11:41] Unknown:
Yeah. Running Postgres in a highly available configuration is definitely a nontrivial task and one that I've dealt with in the past. When I was working on it, I ended up going with a streaming replication setup where the secondary was a hot spare for being able to do read queries against, and then I used Pgpool-II as the query balancer and the failover trigger as well. But that ended up having a few weird edge cases as well, so it's definitely nontrivial. Actually, on my other podcast, I recently interviewed the folks behind Citus, the Postgres extension for being able to run it in a distributed context, which was an interesting discussion as well. So I'll add the link to the show notes in here as well. And one of the other things that I was curious about is, given the fact that Healthchecks' use case is designed for a lot of fairly regularly scheduled tasks, but ones that are most of the time probably gonna be fairly spread out, I'm curious if you ended up finding any benefit from or need for caching in the web architecture, or if it's sufficient to just go directly against the database given the very regular and known periodicity of the interactions?
[00:13:00] Unknown:
Yeah. The traffic that the site gets is interesting. It's very spiky. There are huge spikes of services checking in around, let's say, round hours, or every 15 minutes, every quarter. And so you'd see a base load of, let's say, a few tens or 50 or so pings per minute, sorry, per second. And then at certain times, when it's midnight somewhere, there'd be a huge spike where, for a short amount of time, you'd get thousands of requests per second. And you need to be able to keep up with them. And one of the hardest things there is also that they're all using HTTPS, and each ping is a separate new request. And so there's a TLS handshake, which is computationally expensive. So if you do that on a weak system, that alone can easily overwhelm that system.
Well, then it cannot keep up with them. And so this is where Cloudflare helps, in that the TLS is terminated there. And from Cloudflare to my servers, the connections use keep-alive, so I don't need to do as many TLS handshakes. Caching-wise, there are a few clients that do a lot of pings, like multiple pings per second. And there's no huge benefit for us or for the users themselves in doing them so often, but I guess it's just easier for them to be set up that way. And in order for me to be able to better handle them, the problem there is actually writing to the database, not reading from the database. There's only so many writes you can do to the database. Writes, at some point, get expensive, and you need to start to save up on them. And so what I'm doing is I'm having a little cache in the code that receives pings. If there has been a ping from a given service in the last, let's say, 10 seconds, then we won't do another write when the next ping comes. So there's kind of a 10-second timeout. And that helps with the cases when somebody sends pings every second or multiple pings every second. Then it's harder for that single client to overwhelm the whole server with lots of I/O operations on the database. On the read side, I haven't really needed any kind of caching so far. But, yeah, we'll see how it goes.
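The write-throttling idea described above might look roughly like the sketch below. It is not the actual Healthchecks code: the `PingThrottle` class, its `should_write` method, and the in-process dictionary are assumptions made for illustration, with the 10-second window taken from the conversation.

```python
import time


class PingThrottle:
    """Skip database writes for checks that were written very recently."""

    def __init__(self, window=10.0, clock=time.monotonic):
        self.window = window      # seconds during which repeat pings are not written
        self.clock = clock        # injectable so the behaviour can be tested
        self._last_write = {}     # check id -> time of the last recorded write

    def should_write(self, check_id):
        """Return True if this ping should be written to the database."""
        now = self.clock()
        last = self._last_write.get(check_id)
        if last is not None and now - last < self.window:
            return False          # a recent ping was already stored; skip this one
        self._last_write[check_id] = now
        return True
```

A ping handler would call `should_write(check_id)` and only touch the database when it returns `True`, so a client pinging several times per second costs at most one write per window. Note that this sketch keeps its state in one process and never evicts old entries; a real deployment with several web servers would need to account for both.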
[00:15:23] Unknown:
Yeah. It's always interesting how different services and different use cases can have such different needs in terms of whether or not caching ends up being beneficial or just an extra burden that could potentially actually have negative impacts on the overall throughput of the system. And I'm wondering too what you have found to be some of the most challenging aspects of building and running the Healthchecks application and the
[00:15:59] Unknown:
isn't the hardest part and isn't the part that you spend the most time on. And that's not really a surprise. If you read, let's say, Hacker News, from time to time you'll have solo entrepreneurs saying that the bulk of the work will go into marketing, into doing paperwork, into doing customer support, into monitoring your servers, and not actually working on your product. And that's how it is. And that's what I've found out in practice, working on this alone. This is something I've been shielded from before in the day job. And now on this project, I get to see that, yep, that's the case, that I'd be responding to queries from customers, fixing tickets, and doing stuff like that. Also, on the billing side, accepting payments is trickier than I initially thought. Getting different payment processors to work, filling in the paperwork, and handling invoices. That's one area that I've been working on just now. And, yeah, the other tricky part, of course, is the fact that, ideally, it would be fully highly available. And the closer I can get to that state, the better. And so there's continuous improvement in that area. It's pretty easy to make the web servers pretty much bulletproof. You can have many web servers load balanced, and they can hit the database. And then it's trickier to decide what to do with the database. We already talked about what kind of database you use, how you do the failover, how you do scaling, whether you maybe don't use MySQL and Postgres at all, but use some kind of distributed database. So I guess Citus would help with scaling up. Not sure how much help it would be in having more availability, but I haven't really looked into it in detail.
[00:17:45] Unknown:
And for somebody who wants to run Healthchecks on their own infrastructure, what are some of the things that they should be aware of as they're deploying and setting up the service, particularly if they want to make sure that it is going to be fairly reliable?
[00:18:00] Unknown:
I think the setup I would recommend for hosting it yourself would be to use Heroku. There is a fork of Healthchecks on GitHub that's adapted so that it's very easy to deploy to Heroku. In fact, you can click the Deploy to Heroku button, answer a few questions, and you'd be up and running. Of course, you need your own domain name. Probably, you'll want to customize your mail settings, what your email address will look like, and so on. But then it's there. It's working. And if you need the service to be kind of reliable, then it's easy to just move the sliders in Heroku's dashboard and go from the free dyno to a professional dyno, and go from hobby Postgres to professional Postgres that has a high availability feature as well. That should be pretty easy to set up and to maintain going forward. So that's the one I'd suggest. It's also possible to do it all yourself. I know there are people running it in Docker containers, or people just running it similarly to how I'm doing it. But that's something that you'd be spending more time on, time upgrading and keeping an eye on it and just general maintenance. And that might make sense to do as well if you are using it seriously. But for small to medium setups, I think Heroku would work great.
[00:19:27] Unknown:
One of the critical aspects of the service is the fact that if the different tasks don't check in within the scheduled time frame, it will actually send out a notification. And I know that for that, you have a number of different integrations available. So I'm wondering if you can just talk through what a typical workflow for a single task would be in terms of how the application identifies that it is failing and then what some of the options are for being able to send out that notification to somebody to let them know that they need to check in on that particular job?
[00:20:02] Unknown:
Sure. So let's say you have an application that needs to send out weekly emails every week. So it's a bunch of code in a programming language that goes through a loop and sends an email for each one, and then it's done. And let's say you want to make sure that this loop always completes every week, and it never gets interrupted in the beginning or somewhere in the middle. So what you can do in the app is, after that loop, you can instrument your code with a call to the Healthchecks ping endpoint. So depending on your language, it'd be a different looking snippet of code, and we have examples of those for various languages. And you'd just stick it in your code. And what would happen from then on is, when the app runs, it sends its weekly emails, and then it runs the snippet, which pings Healthchecks, and it's done. On the Healthchecks side, what you'd see is that every week at around the same time, it would receive a ping, and the check would have a green icon next to it, and everything would be fine. So let's say one week something goes wrong. For example, you hit the sending quota, and you cannot send all your emails. And so your code throws an exception.
The ending part of the code doesn't complete, so the ping isn't sent. And what would happen next on Healthchecks.io is the check would go from status green to status red. And at that point, Healthchecks would send an email to the same email address you used for signing up. So you just receive an email: hey, this check has stopped reporting in. What you can also do is you can set up multiple different channels. The most popular is notifications to Slack, but there's a bunch more. There's notifications to Telegram, notifications to PagerDuty, SMS messages, and so on. So there's a list of those. And different people, depending on what types of services they already use in their workflow, would just use whatever messaging service or channel they prefer. And yeah. So what you do next is you receive that notification. You check your app. You check your logs. Figure out why it didn't work, what has gone wrong. So you fix your app, run that task to send weekly emails again. This time, it completes correctly. Another ping is sent. On the Healthchecks.io side, the check goes from red to green. You'll get another notification saying that, hey, it's all good again, and life goes on.
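The green-to-red-to-green cycle described above can be reduced to a small sketch. It rests on assumptions: the status names, the `check_status` and `transition_message` functions, and the message texts are illustrative rather than taken from the Healthchecks code base, and the sketch ignores any extra slack a real service would probably allow for jobs that run slightly late.

```python
from datetime import datetime, timedelta


def check_status(last_ping, period, now):
    """Return "up" while pings arrive on schedule, "down" once one is overdue."""
    if last_ping is None:
        return "new"  # the check has never received a ping
    return "up" if now - last_ping <= period else "down"


def transition_message(previous, current):
    """Return a notification text only when the status actually changes."""
    if previous == current:
        return None
    if current == "down":
        return "This check has stopped reporting in."
    if current == "up" and previous == "down":
        return "This check is reporting in again."
    return None
```

For the weekly email job, `period` would be `timedelta(days=7)`. A periodic task on the monitoring side would recompute the status for each check and hand any non-empty message to whichever channels are configured, such as email or Slack.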
[00:22:32] Unknown:
And are there any particularly interesting or unusual uses of Healthchecks that you've seen? Yeah. So
[00:22:40] Unknown:
One interesting use that I've come across is on GitHub. It looks like there's a company in Africa that's using Healthchecks for educational purposes, as a playground for teaching test-driven development. From what I can tell, what they do is they create a fork of the code base, remove some of the tests, and then give it as an assignment to somebody to write the missing tests. And I guess the fact that the code base is pretty straightforward helps; it's not too convoluted. For the most part, it's readable, standard Django code. It makes for a good sandbox environment that you can easily get running locally, make sure that, yep, there are some tests that are failing, then write those tests or fix those tests, write some more, and so on. So that's been a use case I didn't anticipate at all, that somebody would use it like that, but that's great. So what are some of the improvements
[00:23:36] Unknown:
or features that you have planned for future releases of Healthchecks?
[00:23:40] Unknown:
So for the product, for the app itself, at the moment, I don't have big groundbreaking features in mind. It's more about small refinements and fixing issues and adding smaller missing features. Basically, it is already useful as it is. Actually, I think it was already useful when I was six months into working on it. But, yeah, of course, when you get feedback from users, you'll find out, oh, yeah, I could add this as well, that would make it better. So it's probably going to be evolution, not something big that I'm going to add. In the other areas, there are improvements I can do on the high availability front, most notably the database. So I'll be working on that. Yeah.
But mainly, I'm happy with the shape it's in currently, development-wise.
[00:24:35] Unknown:
And are there any other topics that we didn't discuss that you think we should have before we start to close out the show? I think we covered it pretty well. Well, for anybody who wants to get in touch with you or follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And for my pick this week, I'm going to choose a TV I just got over the holidays. So since 4K is starting to become more affordable, I ended up getting one that's fairly reasonably priced. So I got the LG 55uj63100, which, for anybody who doesn't wanna try and remember all those random numbers and letters, I've got in the show notes. But it's a decent sized TV. It's got 4K display capabilities. It's got smart functionality built in.
And for the price, it's a really nice TV, you know, good picture, good feature set, and I've been enjoying using that. So, yeah, for anybody who's in the market, it's a decent choice. And so with that, I'll pass it on to you. Do you have any picks for us this week, Pēteris?
[00:25:36] Unknown:
Yes. So two training applications, training apps for cycling. One is called Zwift, and the other is called TrainerRoad. So for these, you need a bike and a smart trainer. You put it on, and then you have your screen in front of you, and you can use them to work out. And they're great for this season, when it's cold and wet outside. So Zwift is kind of a virtual environment where you cycle in a virtual world with other cyclists like you. So you can interact with different people, and it's kind of a social experience. Whereas TrainerRoad is more structured workouts, more focused on getting fit and attaining your goals. And I've been using both for more than a few years now. Well, TrainerRoad, at least; Zwift is newer. And they've been working great. So if you're a cyclist and you are not completely opposed to training inside, then these would be great to have a look at.
[00:26:32] Unknown:
Well, I appreciate you taking the time out of your day to join me and talk about the work you're doing with Healthchecks. It's a service that I take advantage of and have gotten benefit from, and it's an interesting problem space. So I am sure other people will be able to gain some benefit from it as well. So thank you again for your time, and I hope you enjoy the rest of your day. Thanks. Thank you for interviewing me. This has been great.
Introduction to Pēteris Caune and Healthchecks
What is Healthchecks and Motivation Behind It
Comparing Healthchecks with Other Services
Cost Structure and Profitability
Deployment Architecture and Evolution
Handling Traffic Spikes and Caching
Challenges in Building and Running Healthchecks
Self-Hosting Healthchecks
Workflow and Notification Options
Interesting Use Cases
Future Improvements and Features
Closing Remarks and Picks