Summary
Docker is a useful technology for packaging and deploying software to production environments, but it also introduces a different set of complexities that need to be understood. In this episode Itamar Turner-Trauring shares best practices for running Python workloads in production using Docker. He also explains some of the security implications to be aware of and digs into ways that you can optimize your build process to cut down on wasted developer time. If you are using Docker, thinking about using it, or have only just heard of it, then it is worth your time to listen and learn about some of the cases you might not have considered.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- To connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out Angel List where you can invest in innovative businesses, find a job, or post a position of your own. Sign up today at pythonpodcast.com/angel and help support this show.
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
- Your host as usual is Tobias Macey and today I’m interviewing Itamar Turner-Trauring about what you need to know about running Python workloads in Docker
Interview
- Introductions
- How did you get introduced to Python?
- For anyone who is unfamiliar with it, can you describe what Docker is and the benefits that it can provide?
- What was your motivation for dedicating so much time and energy to the specific area of using Docker for Python production usage?
- What are some of the common issues that developers and operations engineers run into when dealing with Docker and its build system?
- What are some of the issues that are specific to Python that you have run into when using Docker?
- How does the ecosystem for Python in containers compare to other languages that you are familiar with?
- What are some of the security issues that engineers are likely to run into when using some of the advice and pre-existing containers that are publicly available?
- One of the issues that you call out is the speed of container builds. What are some of the contributing factors that lead to such slow packaging times?
- Can you talk through some of the aspects of multi-layer packages and useful ways to take proper advantage of them?
- There have been some recent projects that attempt to work around the shortcomings of the Dockerfile itself. What are your thoughts on that overall effort and any specific tools that you have experimented with?
- When is Docker the wrong choice for a production environment?
- What are some useful alternatives to Docker, for Python specifically and for software distribution in general that you have had good luck with?
Keep In Touch
Picks
- Tobias
- Itamar
Links
- Itamar’s Best Practices Guide
- Docker
- Zope
- GitLab CI
- Heresy In The Church Of Docker
- Poetry
- Pipenv
- Dockerfile
- 40 Years of DSL Disasters (Slides)
- Ubuntu
- Debian
- Docker Layers
- Bitnami
- Alpine Linux
- PodMan
- Nix
- Heroku Buildpacks
- Itamar’s Docker Template
- Hashicorp Packer
- Rkt
- Solaris Zones
- BSD Jails
- PyInstaller
- Snap
- FlatPak
- Conda
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models and running your CI/CD pipelines, they just launched dedicated CPU instances. They've also got worldwide data centers, including a new one in Toronto and one opening in Mumbai at the end of the year. So go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And to grow your professional network and find opportunities with the startups that are changing the world, AngelList is the place to go. Go to pythonpodcast.com/angel today to sign up. You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis.
For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference, with upcoming events including the O'Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to pythonpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes.
[00:01:51] Unknown:
And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host as usual is Tobias Macey. And today, I'm interviewing Itamar Turner-Trauring about what you need to know about running Python workloads in Docker. So, Itamar, could you start by introducing yourself? Hi. My name is Itamar Turner-Trauring.
[00:02:12] Unknown:
I help teams using Python ship features faster, and I've been using Python for about 20 years. And most recently, I've been spending a lot of time focusing on figuring out how to really run Python applications on top of Docker in the best way possible. And do you remember how you first got introduced to Python? Yeah. It was 1999, and I was a web developer doing PHP. And I heard about this web framework called Zope, which had a lot of really neat features. It had a built in management UI. It had a built in transactional object database. So I ended up learning how to use Zope, built some websites with it, worked at a dotcom where we ran Zope. And then at some point, I decided I didn't wanna be a web developer anymore. But since Python was a general purpose language, I ended up using Python for a whole bunch of other things like distributed systems and networking and so on. And so
[00:03:11] Unknown:
now as you said, you've been working with Docker for being able to package up and deploy Python applications. And for anybody who isn't familiar with it, can you just start by describing what Docker is and some of the benefits that it can provide to this overall problem space?
[00:03:27] Unknown:
Yeah. So Docker is sort of a generic name for, and the earliest implementation of, a set of technologies for running applications in an isolated and reproducible manner. And basically, it lets you take a full file system for your application, so all the files it needs to run, basically a whole Linux distribution, plus all your code, and package it into an image. And then you can run that image, and when you run it, it will typically run in isolation in terms of the networking stack and in terms of the other processes it can see. And so, basically, it's a little bit like having a virtual machine, except unlike virtual machines, you can run a whole bunch of containers within a single Linux kernel. And so you have less overhead in terms of running lots of Linux kernels. And if you do need to access a shared file system, it's easier to get to it. But you get the benefits of the virtual machine, which is isolation
[00:04:28] Unknown:
And the images let you sort of ship everything you need to run your application's code. And what was your motivation for dedicating so much of your time and energy to the specific area of using Docker for Python usage in production?
[00:04:42] Unknown:
So I started using Docker a number of years ago, 5 years maybe? I worked on 1 of the early storage back end systems for Docker, so I had a bunch of experience with packaging applications for Docker. And at a previous job, I was packaging our code up for Docker, and I was using some of the latest features to try to get smaller images and faster builds, like multistage builds. And I noticed that even though in theory we were using some of Docker's features to make faster builds, in practice the builds were actually taking a long time. We weren't using any of the caching. Every time we built an image, it would basically build everything from scratch and ignore caching.
And so I spent a bunch of time trying to figure out why, and I spent a bunch of time then trying to get our tests to run in GitLab CI, which is where we were running them using Docker. And, eventually, I realized that, basically, there's a whole bunch of details you need to get right, and many of these details are not written down anywhere. I then spent a lot of time just doing research to see what else I'd missed. And it's just a huge amount of effort to get all this information. You have to wade through blog posts that are 4 years old and far out of date. You have to slog through a whole bunch of tutorials that are using things that are very much not best practices, things that are insecure, and that's fine for a first tutorial.
But they don't tell you that they're doing the wrong thing. So lots of people just copy them and end up shipping insecure images. It's just 1 of those areas of ops work where you have to get a lot of details right, and it's not well documented. And so I figured there was value in spending the time to do the research and then write it up, so people can learn all the details they actually need to know to make things fast and secure and maintainable.
[00:06:37] Unknown:
For the overall use cases of Docker, there's a bit of a divide between people who are using it for their development environments just to isolate different projects and not pollute their laptop or desktop with all of the different dependencies that go into being able to get an application running, and people who are using it for production environments where they need to package up all the code that developers are creating and then be able to put it on their servers for end users to be able to take advantage of. And I'm wondering if you can just talk through some of the differences in requirements as far as the level of effort and detail necessary for both of those different use cases.
And since production is your primary focus, some of the common edge cases and issues that engineers run into when trying to take advantage of the production capabilities for Docker. Yeah. And so
[00:07:28] Unknown:
if you're using Docker for a development environment, it's kinda nice for that because, for example, whether you're on Windows or Mac or Linux, you can run the exact same versions of the code. On Mac and Windows, it'll run a Linux VM behind the scenes and transparently proxy Docker commands to it. So it's good for that, but you tend to emphasize things like a fast feedback cycle for the developer. So you want, for example, code reload: if you're building a web application, every time you save a file it restarts the web application or reloads the new code. And so your Docker image will basically be hooked up to the code checkout the developer's working on and reload from that.
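As a hedged sketch of the development-oriented setup described here (the framework, module name, port, and image name are illustrative placeholders, not from the episode, and requirements.txt is assumed to include the web framework): the interpreter and dependencies live in the image, while the source tree is bind-mounted in at run time so edits show up without rebuilding.

```dockerfile
# Development-oriented image: dependencies live in the image, the source
# tree is bind-mounted over /app at run time so code changes reload.
FROM python:3.7-slim-buster
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
# Flask is just a stand-in web framework here; app.py is a hypothetical module.
ENV FLASK_APP=app.py
CMD ["flask", "run", "--host=0.0.0.0", "--reload"]
# Run with the developer's checkout mounted over /app, e.g.:
#   docker run -v "$PWD":/app -p 5000:5000 my-dev-image
```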
Whereas in production, you would actually have a particular version of the code. Another thing, which I actually wrote an article about today, is database schema upgrades. If you're developing your code, the sort of natural way to do a schema upgrade is as part of your application startup. So when you start your container, when you start your application, the first thing you do is the schema upgrade, and the next thing you do is start the application. That way you're always running matched versions of the latest code and the latest schema. And for development, that's fine. For production, that can cause problems, because if you migrate your database schema on application startup, unlike a developer environment where there's only 1 process running, in production you might have 2 or 3 or 4 or 200 processes running your application just to scale across multiple machines.
And so now you have 200 processes trying to migrate the database schema at the same time. And unless you're using some sort of database migration tool that can deal with this, this can corrupt your database, because unless you have some sort of lock, you can't really migrate your database schema concurrently. And so they do have different requirements. And there's a whole bunch of other details involved in production, like security. Security matters, but when you're doing development locally, you don't really care about security. You care about ease of use.
Having a really large image is not really a big deal necessarily because, usually, you're just mounting your code in. You're not repackaging each time, and you're not copying the image around; it's just local on your hard drive. So a few hundred megabytes more don't matter. So the ways in which you use Docker, if you want to do it right, are probably gonna be
[00:10:10] Unknown:
somewhat different between packaging for quick developer feedback and packaging for what you're gonna run in production. Yeah. There's actually a great presentation I saw a while ago called Heresy in the Church of Docker that talks about some of those differences in use case, where a lot of people will see Docker and say, oh, it's just this panacea where it solves all of my problems. I just put it into a Docker container and then ship it off to production, and I'm done. And then he talks through some of the different layers that need to go into it once you're actually in a production environment, as far as security, ensuring proper networking, ensuring, you know, distribution across multiple hosts. And so I'll add a link to the show notes for anybody who wants to watch that. But, yeah, the overall set of issues that come up when you're actually running it in production are a lot more than people might initially suspect. And I'm also curious if there's anything specific to Python itself as far as being able to build containers that are production ready that would differ from just the generic case of Docker on its own. A lot of it comes down to, for Python,
[00:11:14] Unknown:
just the specific tools that are different. If you're shipping a Go binary in a container, Go basically compiles everything down to 1 binary, so packaging up Go is very easy. Just copy 1 file, and then you're mostly done. Maybe you need to install a few libraries. For Python, you need the Python interpreter, you need a bunch of C libraries you depend on, you need all the Python libraries you depend on, and you need your code. You have different choices for a base image. And, again, these issues apply to other languages, but there are different options for Python. There are different ways that people specify their dependencies.
So there's poetry and there's pipenv and there's just plain old requirements.txt. And for those, you want to do things in different ways. And then often you want to install all your Python code into 1 thing that's easy to copy over if you're doing a multistage Docker build. You basically compile things like C extensions in 1 image, which means you have to install your compiler and so on in that image, and it's pretty big. And then you have a second image, which is your production image, and you just copy over the compiled code. So your final image that you run in production doesn't have a compiler in it. That means you're copying a bunch of files over, and Python will by default install files all over the place, so now you need to either use a virtualenv or do a user install. And so basically it's not that it's fundamentally different from many other languages. It's just another small detail and another small detail. And over time, it's a lot to keep track of and it adds up.
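Here is a hedged sketch of that multi-stage pattern (the package names, paths, and entry point are illustrative assumptions, not taken from the episode): the first stage has the compiler and builds everything into a virtualenv, and the final stage just copies that virtualenv over so no build toolchain ends up in production.

```dockerfile
# Build stage: has a compiler and headers for building C extensions.
FROM python:3.7-slim-buster AS build
RUN apt-get update && apt-get install -y --no-install-recommends build-essential
RUN python -m venv /opt/venv
# Install dependencies into the virtualenv so they live in one copyable tree.
COPY requirements.txt .
RUN /opt/venv/bin/pip install -r requirements.txt

# Runtime stage: no compiler, just the interpreter, the virtualenv, and the code.
FROM python:3.7-slim-buster
COPY --from=build /opt/venv /opt/venv
COPY . /app
ENV PATH="/opt/venv/bin:$PATH"
CMD ["python", "/app/main.py"]
```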
[00:13:00] Unknown:
And as far as the biggest pain points that you see for people who are getting started with Docker and trying to shift it into a production environment, what are some of the ones that you have found to be at least initially the most painful, and what advice do you have for overcoming those issues? Well, I guess there's
[00:13:22] Unknown:
the invisible but potentially extremely problematic pain points of security, and then there's the more obvious ones people encounter. Security is sort of a whole issue on its own, so maybe we can talk a bit about it afterwards. But putting security aside, some of the other pain points then feed into security problems. So I'll start with the ones that hit people first. Basically, when you're building a Docker image, you are installing, in some sense, an operating system from scratch and then compiling all your code and then installing your libraries. You have to install a whole bunch of stuff and build a whole bunch of stuff. And depending on how much compilation you have and how many libraries you depend on, this can end up being a fairly slow process. And then building a Docker image becomes a bottleneck in your development process. Say you have some integration tests that use your Docker image, and now before you can merge a pull request you need to build a Docker image. If it takes 10 minutes to build your Docker image, then you've added 10 minutes to your feedback loop for catching bugs, and that can end up being very expensive, especially as the size of your team grows. And then, maybe less of an immediate concern, but it just feels wrong: it's very easy to end up with really giant images. And if you have, like, a gigabyte image, it just takes a while to download, so upgrades are slower, it uses bandwidth, it's not ideal. And there are ways you can deal with the slow builds, and basically Docker's build system has caching, but that then starts introducing security problems.
So basically, the way the caching works is you have a series of steps you're doing to build your Docker image. You might say, I'm starting with the Ubuntu 18.04 image, and that's step 1. And then step 2, I'm gonna install these packages. In step 3, I'm going to copy my code in. Step 4, I'm gonna build and install it. And since typically only your code is changing, you cache the earlier steps so you don't reinstall the Ubuntu packages you need every single time. And the problem there is you can be 2 months in, and now you haven't updated any of the Ubuntu packages that you're running in production, and so you haven't gotten any security updates. So you've gotten fast builds, but at the cost of not getting security updates. And so you need a process in place to rebuild your image from scratch without caching, let's say, once a week or whenever there's a security update. And often you need build secrets. Another problem people often have, sort of a combination of a security and non-security issue, is that often you need access to some private, secret information to build your code. So let's say there's a private git repository you can only access if you have the right SSH key. And the Docker build is sort of an isolated file system on its own, so you have to get your SSH key into the Docker build.
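One hedged way to get a key into the build without it ending up in a layer, assuming a BuildKit-enabled Docker and a purely hypothetical private repository URL, is an SSH mount that exists only for the duration of a single RUN step:

```dockerfile
# syntax=docker/dockerfile:experimental
FROM python:3.7-slim-buster
RUN apt-get update && apt-get install -y --no-install-recommends git openssh-client \
    && rm -rf /var/lib/apt/lists/*
# Host-key setup for the git server is omitted for brevity.
# The SSH agent socket is mounted only for this step and is never written
# into a layer of the final image.
RUN --mount=type=ssh \
    pip install git+ssh://git@example.com/yourorg/private-package.git
# Built with something like:
#   DOCKER_BUILDKIT=1 docker build --ssh default -t myapp .
```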
But if you're not careful, it's very easy to end up leaking these secrets into the final image. So there's a pretty decent chance that if you go and look at a lot of Docker images that are publicly available and were built from private code, you'll find an SSH key or some other secret embedded in the image, and the people who built it didn't even know about it. And there's other annoyances. If you're not careful, it can be hard to shut down your Docker container. Like, you hit control-C, and then it'll just sit there, and then after 10 seconds it'll time out and kill it, because signal handling can be a little bit tricky to get right. And if you don't do it right, then you never get a clean shutdown. You always get, like, kill -9 after a 10-second timeout. And there's also things like, if you're running in a Docker environment, it might be slightly different than running on the host machine. So, for example, that can result in some cases in the gunicorn web server just freezing for a few seconds, because it expects certain locations in the file system to actually be a RAM file system, but within the Docker container, they're on disk.
And so any time the disk is slow, suddenly gunicorn's heartbeat system is blocked, and gunicorn just freezes. So there's just all these details that you have to get right, and some of them are somewhat unexpected. And the, I guess, last security issue that people don't often seem aware of is that you really don't want to run your Docker container as root. The default for, unfortunately, many of the official Docker images is to run as root, and they don't discourage it enough. And the problem with running as root is it makes it much easier for attackers to escape the confines of the Docker container and take over the machine that's hosting them. So, for example, in February 2019, there was an attack that allowed someone to basically escape Docker and get root on the host machine.
And running your container as a non-root user, as an unprivileged user, would have prevented that attack from happening. So for security, the sort of absolute minimum you should be doing when you're packaging stuff for Docker is making sure you don't run as root.
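Pulling those last few points together, here is a hedged sketch of what that hardening can look like (the user name, port, and app module are placeholders, not from the episode): an unprivileged user, the exec form of CMD so signals reach the server directly, and gunicorn's heartbeat directory pointed at a RAM-backed filesystem.

```dockerfile
FROM python:3.7-slim-buster
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt gunicorn
# Create and switch to an unprivileged user instead of running as root.
RUN useradd --create-home appuser
USER appuser
# Exec form (JSON array, no shell wrapper) so SIGTERM reaches gunicorn directly
# and shutdown is clean instead of hitting the 10-second kill timeout.
# --worker-tmp-dir /dev/shm keeps gunicorn's heartbeat file on a RAM-backed
# filesystem so a slow disk can't make the workers appear frozen.
CMD ["gunicorn", "--worker-tmp-dir", "/dev/shm", "--bind", "0.0.0.0:8000", "myapp.wsgi:application"]
```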
[00:18:58] Unknown:
One of the issues that you pointed out in that list is the problem of the overall speed of the Docker builds. And you've pointed out the fact that there are multilayer aspects of the container runtime, and some of that can factor into the way that the caching takes place. And I also know that some of the ways that the Dockerfile syntax works can contribute to the overall size of different layers. So I'm wondering if you have any particular advice or general resources or guidelines as to how to make best use of that layering system, and some of the potential pitfalls for people who are writing their own Dockerfiles?
[00:19:31] Unknown:
Right. So I guess we can start with a quick review of the Docker image format and then how that reflects in the Dockerfile instructions for building Docker images. A Docker image is basically a set of layers. For our purposes, you can think of each layer as a tarball: you untar the first layer, then you untar the second layer on top of it, and then a third layer on top of that. And you can't ever really delete files from earlier layers; you can only hide them. So when you delete a file in the 3rd layer, it hides it, but it's still there in the 2nd layer if that's where it was. So you have basically these stacked layers. And these layers allow you to have caching, because you have a Dockerfile, which is the instructions for creating a Docker image. Dockerfile is 1 word. And each line of instructions, to a first approximation, matches up to a layer. So first you create the 1st layer, then the 2nd layer, and the 3rd layer.
And so your instructions for building might be: start with the Ubuntu 18.04 base image, which basically says use these previous layers, which are marked as being the Ubuntu 18.04 image, as the base layers. And then run apt-get install for these packages, so you've added a new second layer. And then copy in some code from the host file system; that's another layer. And then install that Python code; that's a 4th layer. And then as part of the build process, there's caching, where it basically does a hash. If it's just a command like apt-get install, it hashes the text of the command, so if you haven't changed that line, it's considered not to have changed. And if you're copying files in, then it'll hash the files you're copying in. So if the files have changed, it'll invalidate the cache; if the files haven't changed, it'll be able to look it up in the cache.
And so as long as you have a layer in the cache that has the same hash as whatever you told it to do, it'll say, oh, I already have this layer pre-created, I'm not gonna rerun that step. And this is useful because, if you think about what it takes to install some code, you have to install a bunch of packages like a compiler and libraries and dev headers, and that takes a while. You may have to download them, you have to untar them; installing packages takes a while. And then you copy in your code, and then you install the Python libraries. That might be more compilation. And then you install your code.
And so basically all this copying and compiling and downloading and whatnot just takes time. And the caching lets you say, oh, I don't have to redo this step: the last time you built this image you already installed the same Ubuntu packages, so you don't need to reinstall them; there's already a layer that has all those files in it. Ideally you only rebuild the layers where things actually changed, so if only your code changed, you shouldn't have to reinstall any of the dependencies, because they're the same. But in order to ensure that happens, you need to make sure you copy things in and run the commands in the right order, because once you can't use a layer from the cache, all subsequent layers in your build are also invalidated, so you have to rebuild them. So say the very first thing you do is copy in your source code, and then you install the Ubuntu packages.
If the files you're copying in have changed, that invalidates the Ubuntu package install even though it's completely unrelated to your files. And so when you're writing your build instructions in your Dockerfile, you want to do things so that the minimal amount of invalidation happens. So first, you want to install the system-level packages, because they don't depend on anything else, and there's no reason to redo that every time your code changes. And then you copy in just enough of your files to install the libraries you depend on. And this is where some of the Python-specific stuff comes in. You might be copying in requirements.txt, or you might be copying in a Pipfile and Pipfile.lock, depending on how you are managing your dependencies.
And so you copy in your requirements.txt, then you install those dependencies. And now if your requirements.txt doesn't change, you won't have to reinstall those dependencies, because you'll be able to use the layer from the cache. And only then do you copy in your code, and by having that be almost the last thing you copy in, if your code changes, it only invalidates the last couple of layers.
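Put together, a hedged sketch of that ordering (the paths and package names are illustrative): system packages first, then the dependency file and dependency install, and the application code last, so that a code-only change reuses every earlier cached layer.

```dockerfile
FROM ubuntu:18.04
# System-level packages first: these change rarely, so the layer stays cached.
RUN apt-get update && apt-get install -y --no-install-recommends python3 python3-pip
WORKDIR /app
# Copy only the dependency list, then install. This layer is invalidated
# only when requirements.txt itself changes.
COPY requirements.txt .
RUN pip3 install -r requirements.txt
# Application code last: editing it only rebuilds from this point on.
COPY . .
CMD ["python3", "main.py"]
```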
[00:24:03] Unknown:
And you can reuse all the cached layers that have all the dependencies and all the system packages and so on. Another element that plays into this whole layering piece is the idea of the base image that you're working from. And I know that there can be issues as far as consistency in terms of the way people tag those base images, where the latest tag might actually not point to the same container, or even the specific versions might not point to the same container, depending on how the original creator decides to construct their workflow. And there are also issues with security problems contained in those base images that don't necessarily get addressed or that aren't necessarily obvious. And I'm wondering what your opinions and advice are as far as how to address that overall problem of where to start in terms of building your containers: is there a particular distribution that works well, or just general best practices around what to do for the base image that you're then constructing on top of? As a starting point,
[00:25:10] Unknown:
and there's always specialized cases where the situation is different. But for most people, you want as a starting point to have an image that's based on an operating system that has some level of stability, so the libraries it ships aren't going to change radically from 1 week to the next, because that just increases your maintenance burden. So Ubuntu long term support releases are a good base image in that sense, whereas Ubuntu's non long term support releases are less so, because those get updated every 6 months and suddenly all your libraries change, whereas with the long term support ones, you'll have stability for a couple of years. So you wanna start with something like that. Stable Debian releases also get supported for 5 years, I think, for the latest 1. So you wanna start with an image that's based on an operating system that's very stable.
And then you also usually want a particular version of Python. Say you're using the current long term support Ubuntu, which is Ubuntu 18.04; that ships with Python 3.6. If you want Python 3.7, it's not installed and you have to get it from somewhere else. And so to solve that problem, there's an official, quote unquote, Docker image for Python, which is basically Debian Stable, and specifically they now have Debian Stable Buster, which was released in the middle of July 2019. So it actually has very up to date packages, and it's gonna have security updates for the next 5 years.
So right now it's quite up to date, but you also can trust it to be stable. And then they install particular Python versions on top of that. So you can get, like, Python 3.5 or 3.6 or 3.7, and the way you select that is that each Docker image has the name of the image and then tags, and tags can point at different underlying images. And so when you build your image, you can refer to it in a bunch of different ways. And 1 way is you can say, I want Python 3.7 off of Buster, which is the latest Debian release. So 3.7-buster. Or if you want a slightly smaller image, 3.7-slim-buster.
We can link to this particular image in the show notes so you can see the different options. And if you link to 3.7, that means at some point there will be a new point release of 3.7, so it will go from where it is now, 3.7.4, to 3.7.5, and you'll automatically switch from 3.7.4 to 3.7.5. If you're not too worried about breakage from that minor version release, then you can do that. Or you can say, I'm gonna specifically build on 3.7.4, and then I know exactly what Python version I'm getting, with the cost that you have to notice when 3.7.5 comes out and manually change your Docker image to refer to that.
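As a quick illustration of that trade-off (the point release shown is just the one current at the time of the recording):

```dockerfile
# Floating tag: automatically picks up new 3.7.x point releases when you rebuild.
FROM python:3.7-slim-buster

# Pinned tag: stays on one specific point release until you change it yourself.
# FROM python:3.7.4-slim-buster
```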
And this gives you a decent amount of stability. If you really, really wanna be sure that the particular image you're building off of doesn't change, Bitnami also packages Python Docker images in this style. But 1 of the things they have that the official images don't is permanent tags, where they'll say, we guarantee that if you refer to this tag, we will never ever change what it points to. Whereas the 3.7.4-buster tag for the official Docker images could in theory change; in fact, it does get rebuilt every once in a while, whenever there's a new release of pip, for example. So if you're worried about a new version of pip somehow breaking your build, you can use the Bitnami images and link to 1 of their permanent tags as your base image, and then you'll know that it'll never change.
The more stability you have, though, the downside is you're not getting security updates. And so it's always best practice, when you're building the image yourself, that 1 of the commands you run is updating all the system packages. So, if you're using a Debian- or Ubuntu-based system, in addition to apt-get install, you also run an upgrade to the latest packages, just to make sure that you have the latest system packages. And, again, if you're using caching, that will only happen the first time. So it's also best practice to, once a week, or if you're paying attention every time there's a security announcement, or if you're really paranoid once a day, rebuild your images from scratch with no caching. And that way you make sure you always have the latest system packages.
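A hedged sketch of that practice on a Debian-based image (the image tag is illustrative): apply pending security updates as part of the build, and periodically rebuild with the cache disabled so the update step actually reruns.

```dockerfile
FROM python:3.7-slim-buster
# Pull in any security updates published since the base image was last built.
RUN apt-get update && apt-get -y upgrade && rm -rf /var/lib/apt/lists/*
# ...rest of the build...
# On a schedule (weekly, or whenever a security advisory lands), rebuild with
# the layer cache disabled so this step is not served from a stale layer:
#   docker build --no-cache -t myapp .
```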
And just 1 final note about choosing a base image: a lot of people recommend Alpine Linux, because Alpine Linux gives you slightly smaller images. However, I personally think it's a bad idea to use Alpine Linux. Most Linux distributions use glibc as the standard library for C programs, including Python, and Alpine Linux uses musl, which is a different libc. And it's just different. And so there's all kinds of edge cases in there that haven't been fixed, or that are just different and cause different behavior. And so you'll see reports of, like, Python crashes because it has a different, smaller stack size, or string formatting for timestamps is different, or weird edge case DNS differences. And musl has been fixed over the years, and all the known bugs that I've heard people talk about have been fixed. But someone had to notice each problem and report it, and they had to fix it. And so if you're using Alpine Linux, you're more likely to have weird, obscure production problems.
And the benefit is you save, say, 80 or 100 megabytes on your image size. And, personally, I feel that production-impacting problems, especially obscure, hard to debug ones, are a thing I don't want. I'm willing to have another 100 megabytes on my image to not worry about that. 1 of the other things that has been introduced to the overall Docker ecosystem
[00:31:56] Unknown:
in the past few years are alternative build tools that eschew the Dockerfile syntax in favor of some other way of actually constructing the containers. So I'm wondering what your opinion is of that overall trend, and if there are any particular tools that you have experience with and how they compare to the Dockerfile itself.
[00:32:16] Unknown:
It's on my to-do list to do deeper research into these, so consider this more of a preliminary take. But the problem with the way Docker works is that it's a daemon that's running on your computer, and the command line tool you use sends commands to that daemon. And that daemon is running as root. And so that results in a bunch of operational issues. In CI systems, they often restrict the way it can run, or the CI system is already running Docker as its runtime environment, so they have to run Docker-in-Docker, which adds some weirdness, because you have this daemon, which is in a container off to the side, and it can be pretty confusing. In particular, if you're using GitLab CI, you should read the Docker-in-Docker instructions they have very carefully. And CircleCI won't let you run Docker containers the same way. So 1 alternative route people have taken is, I believe, the Podman project. The Podman project basically emulates the Docker command line fairly thoroughly and completely. So it can build Dockerfiles, you can do docker run, you can do all the different commands that you can do in Docker, except it's implemented not as a daemon, but as a command line tool that just runs without a daemon. And so that makes it simpler and easier to run, and you often don't need to run as root either. And it's on my to-do list to see if that makes some of the CI build issues easier.
I believe the downside is that, whereas Docker has solutions for running on Windows and Mac, I don't think you can run Podman on Mac. Again, I have to do the research; don't take my word for it. And then there are different approaches which abandon the concept of a Dockerfile altogether. So 1 option is Nix. Nix is an attempt to make a packaging system where all builds are completely reproducible. So instead of saying, I'm gonna use this Debian package and hopefully it's exactly the same as the Debian package I referred to last time, or this Python package and hopefully it's the Python package I ran with last time, with Nix, you are guaranteed that you're getting identical inputs because of the way it works. And 1 of the things Nix can generate is Docker images. And so if you really care about reproducible builds, Nix seems like the sort of tool that is designed for that from the start and does it all the way from the level of system packages. Basically, you're not using Debian or Ubuntu packages, you're using Nix packages. And so it gives you reproducible builds end to end, and it can generate a Docker image. And I know there are also some efforts, again getting rid of the Dockerfile altogether, which emulate the buildpack concept that Heroku popularized. And so they're trying to build similar buildpack systems for Docker where you don't have to go and do everything yourself.
It sort of knows how to package things for you. And that means it takes away some control, but it also mostly just works. And then there's a 4th approach, something that I've actually tried to build. This is a product I've created, which is a template for Docker packaging for Python applications in production. And the template approach is sort of halfway between what the Dockerfile gives you, which is a huge amount of flexibility but also a huge number of places where things can go wrong and details you have to get right, and a higher level tool that restricts what you can do in some sense, because it's giving you a higher level of abstraction, and there's a reason you have to deal with all these details when you're building things. And so the idea with the template is that, for the majority of people, the majority of the time, you don't have to edit a Dockerfile. You just edit a simplified config file, and that does everything: it does all the necessary security checks and security settings, and it makes sure your images are small, and so on.
But when you hit the edge case it doesn't support, you can fall back to the lower level, more flexible thing and just edit the Dockerfile, edit the code to customize it. And there's a few other tools I'm forgetting about that people are likely to encounter. Basically, what started out as Docker is now a bunch of related semi competing technologies that approach things in slightly different ways. What are the cases where
[00:36:54] Unknown:
Docker itself is the wrong choice for a production environment? And are there any useful alternatives that you have seen that address some of the same problems that Docker is intended to cover? So for
[00:37:09] Unknown:
some applications, you're only running 1 application, so you don't care about the isolation aspects of Docker, and you already have a system with VMs. So you can just build a VM image with a tool like Packer and then run it on a machine in the cloud, like Google's cloud or Amazon's cloud, and get the same sort of benefits of isolation and reproducible builds. And potentially you can then use something like Vagrant to run the image locally. So you can get some of the same benefits without using Docker, and Docker adds some level of performance overhead. It's not that big, it's a few percent, but you basically don't have that, and so it solves that problem. If you consider Docker as a specific instantiation of a container technology, all the things we've been talking about are basically generic; Docker is a specific implementation of the idea of containers. And so there are other implementations of container runtimes, for example. There's rkt. There's 1 other; I don't remember its name.
And, typically, they can run Docker images. And so there might be operational reasons or security reasons you don't wanna run the Docker daemon, but you can run an effectively equivalent technology. And similarly, there are runtime systems that ingest Docker images, like Heroku or, I believe, Elastic Beanstalk. Their runtime is proprietary software, so we don't know what it's running. The runtime is plausibly not Docker; it's just reading the Docker image format. Or you can just build a single binary executable with PyInstaller. If you're distributing command line tools or software people run on the desktop, technically you can use Docker for that, and the idea of containers makes a lot of sense for that, but the user experience in Docker is not really designed well for command line tools.
So you probably don't want to use Docker for that. Using PyInstaller might be a better choice if you just wanna ship an isolated thing people can run, or things like Snap or Flatpak if you're distributing stuff for Linux distributions. And I guess another approach is Conda, where Conda packages its own compiler and comes with pre-compiled C libraries. And so you can basically have a runtime environment that, in terms of available libraries and binaries, is different than your host operating system, but it's not a container. It's not isolated from the host operating system. It just kinda provides a different set of compilers and libraries for you. And so you get some of the benefits of having more control over what specific libraries and packages and so on you're running, without having this isolation that might make things more difficult if you need easier access to your host file
[00:40:38] Unknown:
system. The overall space of Docker and running Python in production is pretty vast, and we've only touched some of the preliminary aspects of it here. But are there any other aspects of this topic
[00:40:52] Unknown:
that we didn't discuss yet that you think we should cover before we close out the show? Yeah. I guess the thing that I want people to pay attention to is that much of the documentation and examples you'll find for this are often not best practices. And to some extent, that's less of an issue if it means, like, your build's a little slower or your image is a little larger. But for security, that can actually be an extremely expensive problem. And so at least for the security aspects of packaging stuff for Docker, you should just make really sure that you're doing it right. Even the official Docker recommendations sort of half-heartedly say you should not run as root, and then they make other recommendations that would force you to run as root.
And you really, really don't want to run as root. It just makes the ability of attackers to escalate their access much higher. And in general, just don't assume that if you found a random Docker tutorial and went through it, you know how to build images that are good enough for production, for the security issues at least. It doesn't take a lot of work to get it right, but you do have to do that work. And the vast majority of tutorials will just
[00:42:16] Unknown:
completely gloss over all the problems you're going to have, like making sure your caching doesn't lead to insecure packages, and not running as root, and so on. Alright. Well, for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. And this week, I'm going to choose the movie Shazam. I just watched that, and it was very entertaining and amusing, and I had a good time watching it with my kids. So with that, I'll pass it to you, Itamar. Do you have any picks this week? I've been watching
[00:42:49] Unknown:
Veronica Mars, the original 3 seasons, and it's sort of interesting seeing a TV show talking about class conflict and income inequality, especially a TV show that at this point I think is 15 years old. And it's also just a fun mystery slash teenage drama combination.
[00:43:12] Unknown:
Well, thank you very much for taking the time today to join me and share your experiences and knowledge of using Docker for running production workloads with Python. It's definitely a very relevant subject and 1 that I think a lot of people will be able to benefit from. So thank you for your efforts and your time, and I hope you enjoy the rest of your day. Thank you. Thanks for having me.
Introduction to Itamar Turner Trauring and Python in Docker
What is Docker and Its Benefits
Docker for Development vs. Production
Common Pain Points and Solutions in Docker
Optimizing Docker Builds and Layers
Choosing the Right Base Image
Alternative Build Tools for Docker
When Docker is Not the Right Choice
Final Thoughts and Security Best Practices