Visit our site to listen to past episodes, support the show, join our community, and sign up for our mailing list.
Summary
As Python developers we have all used pip to install the different libraries and projects that we need for our work, but have you ever wondered about who works on pip and how the package archive we all know and love is maintained? In this episode we interviewed Donald Stufft who is the primary maintainer of pip and the Python Package Index about how he got involved with the projects, what kind of work is involved, and what is on the roadmap. Give it a listen and then give him a big thank you for all of his hard work!
Brief Introduction
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- Subscribe on iTunes, Stitcher, TuneIn or RSS
- Google Play Music just launched support for podcasts, so now you can check us out there and subscribe to the show.
- Follow us on Twitter or Google+
- Give us feedback! Leave a review on iTunes, Tweet to us, send us an email or leave us a message on Google+
- Join our community! Visit discourse.pythonpodcast.com for your opportunity to find out about upcoming guests, suggest questions, and propose show ideas.
- I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. For details on how to support the show you can visit our site at pythonpodcast.com
- Linode is sponsoring us this week. Check them out at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for your next project
- We also have a new sponsor this week. Rollbar is a service for tracking and aggregating your application errors so that you can fix the bugs in your application before your users notice they exist. Use the link rollbar.com/podcastinit to get 90 days and 300,000 errors for free on their bootstrap plan.
- The Open Data Science Conference in Boston is happening on May 21st and 22nd. If you use the code EP during registration you will save 20% off of the ticket price. If you decide to attend then let us know, we’ll see you there!
- Your hosts as usual are Tobias Macey and Chris Patti
- Today we are interviewing Donald Stufft about Pip and the Python Packaging Authority
Interview with Donald Stufft
- Introductions
- How did you get introduced to Python? – Chris
- How did you get involved with the Pip project? – Tobias
- What is the Python Packaging Authority and what does it do? – Tobias
- How is PyPI / the Python Packaging Authority funded? – Chris
- What is your opinion on the current state of Python packaging? Are there lessons from other languages and package managers that you think should be adopted by Python? – Tobias
- What was involved in getting pip into the standard Python distribution? Was there any controversy around this? – Chris
- Can you describe some of the mechanics of Pip and how it differs from the other packaging systems that Python has used in the past? – Tobias
- Does pip interact at all with virtualenv, pyenv and the like? – Chris
- The newest package format for Python is the wheel system. Can you describe what that is and what its benefits are? – Tobias
- What are the biggest challenges that you have encountered while working on Pip? – Tobias
- What does the infrastructure for the Python Package Index look like? – Tobias
- What have been some of the challenges around scaling PyPI’s infrastructure to meet demand? – Chris
- You’re currently working on a replacement for the PyPI site with the Warehouse project. Can you explain your motivation for that and how it improves on the current system? – Tobias
- Where do you see the future of dependency management in Python headed? – Chris
- A few days ago there was a big story about how an NPM library was removed from the index, breaking a large number of dependent projects and applications. Do you think that anything like that could happen in the Python ecosystem? – Tobias
- What’s on the roadmap for Pip? – Tobias
Keep In Touch
- GitHub
- DistUtils Special Interest Group
- @dstufft on Twitter
Picks
- Tobias
- Chris
- Donald
Links
- Bandersnatch
- Wheel
- Warehouse pypa/warehouse
- PyPI Sponsors
- DevPI
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. You can join our community at discourse.pythonpodcast.com for your opportunity to find out about upcoming guests, suggest questions, propose show ideas, and follow up with past guests. I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. For details on how to support the show, you can visit our site at pythonpodcast.com. Linode is sponsoring us this week. Check them out at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for your next project.
We also have a new sponsor this week. Rollbar is a service for tracking and aggregating your application errors so that you can find and fix the bugs in your application before your users notice that they even exist. Use the link rollbar.com/podcastinit to get 90 days and 300,000 errors for free on their bootstrap plan. You can subscribe to our show on iTunes, Stitcher, or TuneIn Radio. And now Google Play Music just launched support for podcasts, so you can check us out there and subscribe to the show. Please give us feedback. You can leave a review on iTunes or Google Play Music to help other people find the show.
And don't forget that we have 2 tickets to the Open Data Science Conference in Boston, which is happening on May 21st and 22nd, to give away. In order to enter, just sign up for our newsletter at pythonpodcast.com. Also, if you use the code EP, you will save 20% off of the ticket price. If you decide to attend, then let us know. We'll see you there. Your hosts as usual are Tobias Macey and Chris Patti. And today, we're interviewing Donald Stufft about pip and the Python Packaging Authority.
[00:01:49] Unknown:
So, Donald, could you please introduce yourself?
[00:01:51] Unknown:
Yeah. So I'm a software engineer for Hewlett Packard Enterprise, where I work full time on the Python packaging ecosystem. That spans from PyPI to pip and other related libraries.
[00:02:07] Unknown:
That's really cool that HP is funding your work on this. It's always really nice to see when companies realize that they're getting enough out of the open source ecosystem, as you said, to actually put their money where their mouth is. So thank you, HP Enterprise. And how did you get introduced to Python?
[00:02:27] Unknown:
Yeah. So back in, like, 2007 or so, I was using Drupal to try to make web applications. And, you know, it's more of a CMS rather than a web framework. Someone had mentioned to me that there was this new framework on the scene called Django, and Django, as you know, is written in Python. So I went and bought a book, and I read through it and taught myself Python so that I could use Django. That's how I got introduced to Python: through Django.
[00:02:58] Unknown:
And how did you get involved with the pip project and the Python Packaging Authority?
[00:03:03] Unknown:
Sort of a roundabout way. I care a lot about security. And back then, PyPI was not served with a trusted HTTPS certificate. It was using CAcert, which isn't trusted in hardly any browsers. And I sorta got angry about that and yelled a lot on the mailing lists. And I decided I was gonna start up my own PyPI, called crate.io, which just sort of sucked the data down from PyPI and had HTTPS and a bunch of other things. And then, to support features I was going to put on that, I started sending pull requests and stuff to pip. And eventually they gave me a commit bit. And since then, crate.io has been shut down.
I started working on PyPI directly, but, you know, it all sort of got started there from me getting angry about the security there.
[00:03:53] Unknown:
And can you explain a bit about what the Python Packaging Authority is and what it does?
[00:03:59] Unknown:
Yeah. So it's essentially an informal organization of just people working on packaging in Python. You know, it's more or less just to make it easier to say things like "the people who work on Python packaging," which is sort of a mouthful. You know, we don't have any sort of, like, officers or, you know, a whole lot of policy around it. It's just sort of: if you're working on something that's related to Python packaging and you wanna be part of the organization, we add your project, and then you're part of the PyPA.
[00:04:30] Unknown:
Other than pip, what are some of the other projects that fall under that umbrella?
[00:04:34] Unknown:
So there's pip, like you said. There's PyPI itself. Setuptools is in there. Bandersnatch, which is a mirroring client for PyPI. There's packaging, a library called packaging, which is sort of a low-level library that pip and setuptools use. Wheel is in there, and Warehouse, the new implementation of PyPI, is in there. It's sort of a catchall. I'm sure I'm missing some, but it's sort of just a catchall of things.
[00:05:04] Unknown:
How are PyPI and the Python Packaging Authority funded? We're only asking because, you know, funding sort of open source infrastructure has been such a hot topic lately, and so many other projects are struggling to figure it out.
[00:05:18] Unknown:
The PyPA itself, the Python Packaging Authority (we are terrible at names; we have all these names that sort of collide in how you pronounce them), basically has no funding directly whatsoever. We're not set up for it. We've talked about trying to do it before, but as of yet, no one's really stepped forward to actually manage funds coming in. PyPI, on the other hand, the service itself that runs on pypi.python.org, is technically owned by the Python Software Foundation, and it gets whatever money it needs. You know, we have to ask permission and such from the PSF board, who then gets funding from donations.
But primarily, the ways things get funded are either volunteers giving their time, or companies like mine, which sponsor development. Hewlett Packard Enterprise also sponsors a couple other people who work, not full time like I do, but part time on some of this stuff as it relates more to OpenStack. And then for PyPI itself, a lot of our things are donated by companies just giving us services, like Fastly CDN, Rackspace VMs; Heroku's in there. You know, there's a whole slew of services that are donated, you know, well over 20 or $30 a month worth of services.
[00:06:41] Unknown:
Wow. That's really awesome. Is there a page somewhere that lists who all the sponsors are that we can put in the show notes? I ask only because, you know, like I say, I think recognizing corporate sponsorship of the infrastructure that we all use and depend on is super important, so that they recognize the value in continuing it.
[00:07:01] Unknown:
Yeah. So there's not currently a real page for doing that. We've sort of half-heartedly tried to do it here or there before. However, on the new PyPI, which is live-ish on warehouse.python.org, you know, the footer has every single company that donates services to PyPI itself, not all of Python, just PyPI. And there's going to be a page there where you can, like, click and get a full-page list, you know, with a paragraph describing them and stuff like that, so that those companies are recognized on every page of PyPI.
[00:07:41] Unknown:
Super cool.
[00:07:42] Unknown:
So I know that there are some languages that don't really have a central packaging and dependency management framework. The first one that comes to mind is Go, where it prefers to go straight out to GitHub repositories. And I'm wondering what your view is on the level of importance of a well managed and centrally managed dependency framework and packaging architecture for a language.
[00:08:05] Unknown:
Yes. So I think it's pretty important. I think that it's the sort of thing that people expect nowadays. I worry about what's gonna happen in something like Go, where they don't have a central repository so much, because what's gonna happen in 10 years when GitHub is no longer the cool place to be and there's some new site, and you have all that code up there that's modified to work directly by pulling code from GitHub that's, you know, no longer going to be there? The nice thing with PyPI is we've been here for over a decade now, and I'm not gonna claim all the code still works, but all the code that has been uploaded is still there, available.
That might not run on the latest version of Python, but it's still there. If you just download it, you know, you'll still get that code regardless. But more than that, I think that one of the things a central repository does is it allows you to abstract your dependencies. If you depend on requests, your requests can come from PyPI, but if you're running a local mirror, it can come from there instead. Whereas if the information includes concrete locations, it makes it much harder to mirror and, you know, run your dependencies from different locations.
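The index abstraction Donald describes is already a setting in pip rather than something baked into each dependency. A minimal sketch of pointing pip at a local mirror; the hostname here is a placeholder, not a real service:

```ini
# ~/.config/pip/pip.conf (pip.ini on Windows); hostname is illustrative
[global]
index-url = https://mirror.example.internal/simple/
```

The same override also works per invocation with `pip install --index-url <url> requests`, so nothing in the package itself has to change when the source does.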
[00:09:20] Unknown:
Yeah. And one topic that occasionally comes up when talking about central package archives is the thought of curation, or at least recognizing sort of officially sponsored packages. I know that in the Python Package Index, when you're searching for a package, there's sort of a ranking. But I'm curious if you have any thoughts on the idea of curation, or sort of reviewing packages to determine, when you're searching for a particular use case, which one you should go with.
[00:09:52] Unknown:
Curation is hard to do. You know, it requires a lot of man hours, and it also makes it hard for a beginner to come in and create a new package that then becomes popular. It heavily incentivizes using packages done by people who you already know. PyPI is not curated, and we probably will never become curated. Now, there is some talk about how we could better enable people to develop sort of a curation around PyPI, so that, you know, people can maybe make package sets or, you know, review packages and whatnot, which, I'm not sure if we're going to go there or not. Some of those things I think would probably be better off as an external service that then just pulls the data from PyPI.
But I think for a community sort of central repository, you really don't wanna get too heavily into curation, because you start then stifling new people coming in and trying to release their packages.
[00:10:51] Unknown:
I agree. I think it's very important for something like PyPI to not slow down the development process. I would much rather have, I mean, we're so lucky with Python. Right? Because there's this amazing wellspring of goodwill that we have, and, on the whole, people who submit to PyPI are doing so for the common good. I mean, you might disagree about whether one package is written a particular way or not, but it's not like the iOS or the Android world where we have to necessarily worry too much about malicious
[00:11:23] Unknown:
Python being submitted to the package system, I wouldn't think. Although perhaps that should be an issue. I'm not sure. But I agree. I think not curating is definitely the default we should stick to. Yeah. And once you start worrying about malicious packages, you know, you start having to talk about threat models and stuff. And primarily, the thing with PyPI, as far as security goes, is we want to make sure that if you request a particular package, you get the bits that the author uploaded, and those are the ones you meant to get. We are probably never going to tell you whether the bits are good bits or not, just whether or not they're the ones the author uploaded and they're the ones you requested.
[00:12:02] Unknown:
Absolutely. And does the Python Package Index have any support for GPG signing of packages, so that, if you have the public key of somebody who you trust to provide those packages, you can actually validate that the downloaded package came from them?
[00:12:18] Unknown:
So the answer is yes. PyPI does allow you to upload a signature, but it's not generally useful. Nothing outside of Debian's uscan utility actually validates it, to my knowledge, other than people manually downloading and running GPG. The problem is just signing packages doesn't buy you anything without a trust model behind it. And GPG's trust model is not sufficient, because it's focused around verifying identity. So you can verify that I am Donald Stufft, but you have no idea whether Donald Stufft is allowed to sign for pip or requests or Django or Twisted. And so you need something to say, okay, this key is trusted for this particular package. GPG is just not there yet. We're likely to introduce package signing at some point. It's unlikely to be based on GPG, certainly.
[00:13:12] Unknown:
So what was involved in getting pip into the standard Python distribution? Was there any controversy around this? I ask because I came from the Ruby community, where the act of introducing gem into the standard Ruby distribution almost caused a civil war.
[00:13:24] Unknown:
Yeah. So the PEP for introducing that was accepted something like 2 and a half years ago, and I'm still dealing with fallout from that. You know, a lot of the Python people, people who work directly on Python, saw the need for it and were totally okay with it. There were arguments about the specific details of how we were gonna do it. In the end, we ended up not including it directly in Python; instead, we included a little installer file that bundled pip, and that was mostly so that you would upgrade pip after you install Python. The real fallout kept coming from redistributors of Python, you know, like Debian Linux, Fedora Linux, FreeBSD, things like that, where they have policies against bundling code, like we bundled pip inside of Python. And there's also then a circular dependency, you know, where Python depends on pip, but pip depends on Python. And there was a lot of back and forth and arguing about how best to handle that. You know, like, Ubuntu 14.04 shipped with a broken venv module because they deleted this code that we added for that. But I think we're getting to a better state now, where a lot of these things have worked out over time. And a lot of that controversy also came from just, like, a decade of sort of pent-up hurt feelings from the 2 sides fighting with each other.
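The "little installer file" Donald mentions shipped in Python 3.4 as the stdlib `ensurepip` module, which bootstraps a bundled pip wheel rather than vendoring pip's code into the standard library proper. A small illustration, assuming a Python 3.4+ interpreter:

```python
import ensurepip

# ensurepip knows which pip wheel ships alongside this CPython build.
# version() just reports it, without installing anything.
print(ensurepip.version())
```

Running `python -m ensurepip --upgrade` performs the actual bootstrap into the current environment.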
[00:14:52] Unknown:
It's kind of interesting, because it always seems like there's this tension between language-specific packaging systems and vendor-specific packaging systems. And in a sense, I get it, right? Because for end users, it's always kind of complicated to know: do I use pip, or do I install from the vendor-supplied RPM-based or deb-based Python package universe? I think maybe some of that just needs to be handled by better marketing. Right? Like, if you're a developer, chances are you're gonna wanna use pip so you have max flexibility. If you're a system integrator or an infrastructure engineer or someone who is looking to build turnkey systems, or systems with configuration management tools like Chef or Ansible or Puppet or whatever the case may be, then you're probably gonna wanna use the vendor's packaging paradigm, because that just makes your job a whole lot easier. So I kinda don't see a problem here, but I recognize why people who are just coming to the table could be very confused by it all.
[00:15:59] Unknown:
Yeah. Well, some of the problem, you know, comes from history and there being bad feelings pent up. Some of it's things like, you know, right now, if you go to your Linux box and you type pip install something in your user account, it's gonna try to install into the root location. And if you're not root or sudo, it's gonna fail. And most people, when they see that permission failure, are gonna sudo up and then install into the global site-packages directory. So there's some work we need to do around making a better default for people, to try and direct them away from modifying their system with pip instead of using their system packages.
And then the flip side of that is we've done that, you know, for so long, and it's broken people's systems. And then they go to Debian and say, hey, my system's broke. What's up? And then Debian gets kind of mad. It's not just Debian; I just pick on Debian because I deal with them a lot. But, you know, then they get upset with us and wanna try to limit people's exposure to things like pip, because they see it as an attractive nuisance towards people. So it's a lot of political fallout we're, you know, trying to work through.
Yeah. We realize this might lead to more people using pip, but we're also gonna try to introduce new concepts here so that people are less likely to mess up their system using pip.
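One of the "better defaults" in this area is pip's per-user scheme (the `--user` flag), which needs no sudo and leaves distribution-managed files alone. A small sketch of inspecting both install locations with the stdlib:

```python
import site
import sysconfig

# Where `pip install --user` would place packages: a per-user directory
# that needs no root privileges.
print(site.getusersitepackages())

# Where a bare `pip install` targets outside a virtualenv: the global
# site-packages, which on Linux is usually owned by the system.
print(sysconfig.get_paths()["purelib"])
```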
[00:17:19] Unknown:
Can you describe some of the mechanics of how pip works and how it differs from some of the other packaging systems that Python has used in the past, such as easy_install or setuptools?
[00:17:30] Unknown:
Yes. Actually, easy_install is part of setuptools. But as far as how they fetch and discover packages, they're pretty similar. You know, PyPI has a page, it's actually an HTML page for historical reasons, that lists all the files, and they pick which one they like best. They download it, and they, you know, run a few commands and install it. The primary thing that pip had over easy_install was that easy_install attempted to do a lot of things at runtime. So it had sort of a built-in concept of something that's sort of like a virtual environment, but it happened completely at runtime, through sys.path munging.
So you wouldn't have a simple, you know, sys.path that Python itself just kinda handled. You'd have this little bit of extra code that would sort of dynamically alter your sys.path to ensure that the versions of whatever packages you wanted were available at runtime. And this sort of didn't work very well and broke down quite often, depending a lot on how you started Python stuff. That really didn't work very well. Pip sort of split that out, and started the virtualenv project, and said: we're just going to install things flatly, without having any of this runtime stuff, into your Python. And if you wanna install different versions, you're gonna need to use something like virtualenv to manage different environments.
Then the other thing pip did: easy_install sort of tried to install things as quickly as it could. So if it needed to download and install 4 things, it would download one thing, install it, go try to download the next thing, install it, and so on. And what would happen oftentimes was you'd get some sort of failure in the middle of installing. You'd be left with a half-uninstalled, half-installed system. Whereas pip tries to do as much of that work upfront as possible, and install at the very end, after it's already done as much of the other work as it could, to reduce the chances of getting into this sort of partially failed mode.
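The difference can be sketched with a toy model (illustrative logic only, not pip's actual code): interleaving download and install leaves a partial system when something fails midway, while front-loading the downloads leaves it untouched.

```python
def interleaved_install(names, download, install):
    # easy_install style: install each package as soon as it downloads.
    for name in names:
        install(download(name))


def two_phase_install(names, download, install):
    # pip style: do the failure-prone downloads up front, install last.
    artifacts = [download(name) for name in names]
    for artifact in artifacts:
        install(artifact)


installed = []


def fake_download(name):
    if name == "c":
        raise IOError("network error fetching " + name)
    return name


def fake_install(artifact):
    installed.append(artifact)


try:
    interleaved_install(["a", "b", "c"], fake_download, fake_install)
except IOError:
    pass
partial = list(installed)
print(partial)  # ['a', 'b']: a half-installed system

installed.clear()
try:
    two_phase_install(["a", "b", "c"], fake_download, fake_install)
except IOError:
    pass
untouched = list(installed)
print(untouched)  # []: nothing was touched before the failure
```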
So speaking of virtualenv and the like, does pip interact at all with virtualenv, pyenv, etcetera? So the answer is yes-ish. Generally, virtualenv itself installs some patches and does some hacky black magic that is really gross. Then pyvenv is a sort of standardized version of that that doesn't need all that black magic stuff. But they sort of modify the Python runtime directly, without pip's involvement. Then when pip installs a package, it just kinda goes to the Python environment and says, hey, where do these kinds of files go? And in that case, you know, if we're not running inside of a virtual environment, you'll get your global site-packages. If you are, then virtualenv will have modified that so that it gives you the new location.
So in that sense, pip doesn't need to know about virtual environments, because it just asks Python and Python tells it. Now, that being said, there are a few corner cases where we do have to know if we're running under a virtual environment or not. So we do know about them, and we do sort of tweak a few things based on that. But, generally, pip is kind of blind to the fact that it's running inside of a virtual environment. It just does the right thing because of the way virtual environments are designed.
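The "pip just asks Python" behavior is observable from the stdlib: the same query returns a different answer inside and outside a virtual environment. A minimal sketch, including one corner-case check of the kind an installer might make:

```python
import sys
import sysconfig

# The install locations an installer would ask Python for. Inside a
# virtualenv/venv these point into the environment; outside, at the
# global site-packages.
paths = sysconfig.get_paths()
print(paths["purelib"])

# Python 3.3+ virtual environments leave sys.base_prefix pointing at
# the original interpreter, which makes them detectable.
in_virtualenv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
print(in_virtualenv)
```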
[00:20:49] Unknown:
That's awesome. You know, it's funny. You're now the second or third person I've heard who's really in the know in the sort of, you know, Python internals, Python packaging, etcetera world who has talked about some of virtualenv's warts. Virtualenv is clearly a much-needed innovation in the Python programming environment, but everybody seems to cite all the problems with it. Like, you mentioned some of the gross stuff it does behind the scenes. I've heard people say, you know, it's really broken on OS X, etcetera, etcetera. Is there any work or any thought going into sort of packaging and virtual environments, the next generation, to make it less gross and make everything, you know, be peace, love, and harmony, working with each other?
[00:21:36] Unknown:
I'm also curious if there has been any thought of moving to maybe more of a functional-style packaging, in the vein of the Nix package management system, where, in order to handle multiple different versions of the same packages, you essentially just use hashes of the packages to determine the path that you resolve, based on the version that you're trying to retrieve?
[00:21:58] Unknown:
Yes. So, actually, since Python 3.3, there's something very much like virtualenv included in the Python standard library. In Python 3.3, it did not install pip or anything. It was just the virtual environment, and you then had to go out and get pip and install it yourself. As of Python 3.4, it will install a version of pip that is bundled with Python into your virtual environment by default. And that particular isolation solution eliminates the gross hacks, because a lot of the gross hacks are things like, yeah, you have to start getting into how the Python interpreter starts up. It looks for an os.py file relative to its binary directory.
And if it finds that, then it assumes that's where site-packages is. So what virtualenv does is it actually puts the os.py there, then puts enough of the standard library in that location that it can bootstrap itself, monkey-patch the standard library, then load the standard library from the real location, and then set up your sys.path the rest of the way. The version that's in Python itself just sort of eliminates that: as part of the runtime startup, it just looks for a pyvenv.cfg file and says, okay, if this is here, we're just going to assume site-packages is relative to this file, as opposed to what virtualenv has to do, which is sort of shuffle around half of the standard library using that os.py sigil.
So the answer to the question is: yes, this all gets a whole lot better in Python 3. Unfortunately, Python 3 is not what everyone's using. So virtualenv is still, you know, very popular, because it works across all the versions of Python, particularly Python 2. And long term, what I would like to do is refactor virtualenv so that it takes advantage of that built-in isolation where it's available, and then just uses the old black-magic hacks for those older versions of Python. I just don't have the time. As far as the functional packaging goes, I've looked at that a little bit.
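The stdlib counterpart Donald describes (pyvenv, i.e. the `venv` module, Python 3.3+) can be driven programmatically, and the `pyvenv.cfg` marker file is the entire startup "sigil". A small sketch:

```python
import os
import tempfile
import venv

# Create a throwaway environment; with_pip=False skips bootstrapping
# pip, which keeps this fast and offline-friendly.
target = os.path.join(tempfile.mkdtemp(), "env")
venv.create(target, with_pip=False)

# The interpreter finds site-packages by spotting this file at
# startup: no os.py shuffling required.
print(os.path.exists(os.path.join(target, "pyvenv.cfg")))  # True
```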
I haven't spent a lot of time on it, primarily because one of the things that we currently have working is that if you type just the command python, then, you know, it sort of just works, and what you have installed there is sort of already there. You know, it's available to import and whatnot. And with a lot of the functional package managers, it's not the greatest for in-development work, where you're sort of installing things, and maybe you're modifying a little bit and tweaking it and then throwing it away. I think we could possibly get there at some point, but it's hard to figure out how to sort of work that in with the 2 decades of legacy understanding of how things work now.
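The Nix-style idea Chris alludes to can be sketched in a few lines (a toy illustration, not any real tool's layout): derive the install path from a hash of the artifact, so different versions of the same package never collide.

```python
import hashlib


def store_path(name, artifact_bytes):
    # Content-addressed location: the path is a function of the bytes,
    # so two builds of the same package land in different directories.
    digest = hashlib.sha256(artifact_bytes).hexdigest()[:12]
    return "/store/{}-{}".format(digest, name)


print(store_path("requests-2.9.1", b"archive bytes v1"))
print(store_path("requests-2.9.1", b"archive bytes v2"))
```

The same input always maps to the same path, which is what lets such systems cache, deduplicate, and roll back installs.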
[00:25:13] Unknown:
The newest package format for Python is the wheel system. Can you describe what that is and what its benefits are?
[00:25:19] Unknown:
Yeah. So the wheel is very similar to an egg, just with some of the sort of mis-features removed from it. I mean, under the covers, it's just a zip file. You know, when you're building or installing a package, one of the things you do is you create the sdist, the source distribution. Once you have the sdist, you download it and you run setup.py install, or pip does that for you. And it does a number of things, like building the .c files into .so files, you know, compiling the .py files into .pyc files.
All that can take a lot of time. In some cases, like NumPy, I think it can take up to 30 minutes. And what wheel sort of does is it steps in between, at the step where you take those .so files and you copy them into the final location. Instead of copying them into the final location, you just stick them in a zip file, and then, you know, that zip file gets distributed. And then when pip downloads it, all it really has to do is unzip it and copy those files into location. There are a few other little fiddly bits in there, but the sort of 10,000-foot view is it steps in right after you've compiled those files and then just ships those artifacts. So, you know, it's very similar to deb packages or such, where they ship compiled bits so that you don't have to do that on the end user's machine.
You know, it gets a little blurry when you're talking about pure Python packages, but, you know, you can think of the step of moving them from your development layout into the install layout as the compile step. The other advantage it has is it has completely static metadata, so there's no need to invoke a setup.py script to find out something like the version number or, you know, the name of the package or whatever other metadata someone might want. It's all in there statically. So it installs a lot faster, even for pure Python, because it's all just static metadata and unzipping and file copies, instead of several subprocesses, like executing setup.py.
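Both properties Donald describes, "just a zip file" and "completely static metadata", are easy to see. Here is a sketch that builds a minimal, hypothetical wheel-like archive in memory and reads the version without running any setup.py (real wheels carry more files, such as WHEEL and RECORD):

```python
import io
import zipfile

# Build a tiny archive shaped like a wheel for a hypothetical
# package "demo". The metadata lives in a plain text file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("demo/__init__.py", "")
    zf.writestr(
        "demo-1.0.dist-info/METADATA",
        "Metadata-Version: 2.1\nName: demo\nVersion: 1.0\n",
    )

# An installer can read the metadata statically: no code executed.
with zipfile.ZipFile(buf) as zf:
    metadata = zf.read("demo-1.0.dist-info/METADATA").decode()

version = dict(
    line.split(": ", 1) for line in metadata.strip().splitlines()
)["Version"]
print(version)  # 1.0
```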
And what are the biggest challenges that you've encountered while working on pip? Yeah. So, not particularly specific to pip, but to all the Python packaging stuff: the biggest challenge is really the weight of the legacy there. You know, not just in the code base that you're working on, but we have half a million packages, I mean, actual individual package files, on PyPI spanning over a decade. And people still use some of those, even the old ones, where, you know, we have to be careful how we progress things. Because if we just try to throw everything away and say, here's your new system, you know, even if it's the most amazing system in the world, people are gonna go, why can't I install, you know, all my 100 dependencies that I already have?
It's no good to me. So you've got to be very careful how you progress things so you still maintain compatibility. As long as the vast bulk of this stuff still installs, you don't have to keep compatibility for everything, you know. As long as most of them still install, you know, you're pretty good. But, you know, it's a constant struggle of figuring out how we're gonna maintain enough compatibility, and then also being careful what we add now. Because if someone comes and starts using something we add now, we're gonna have to support that for the next decade.
[00:28:52] Unknown:
So I'm curious if there are any features that you see in other package managers that you would like to incorporate into pip, or if people have tried to add features to pip that you think don't fit well with it?
[00:29:05] Unknown:
Yeah. So one thing that's not necessarily pip specific but packaging specific: one of the things that I think a lot of other package managers or package ecosystems have got going for them that I'd really like to get into Python is the concept of a static binary, like Go has, like Rust has, where you can just take all your code and all the runtime stuff, put it into a single file, and ship that off to customers. That is an amazing user experience when you're trying to give something to a customer who just might wanna download this thing and run this little simple binary. We have to say, well, first you install Python, then you make sure you have pip installed, then you pip install this. And maybe that doesn't work — maybe they have to figure out why pip wasn't working on their machine, or they didn't have a compiler, etcetera, etcetera.
All of the systems I'm aware of that have static binaries sort of got the benefit of being new, so they didn't have a lot of the legacy stuff that I was talking about. But I think it's something the other languages have that we really should have too. We have some things that sort of try to hack it in, like virtualenvs and isolation tricks, but I think we need a real first-party supported solution here that will give people the ability to ship a single file without too many requirements on what everybody has in place on their system.
[00:30:36] Unknown:
What about from maybe a developer workflow perspective? I know with npm there's the --save argument, and people have occasionally asked about that for pip. Or the idea of a Gemfile, like Ruby has with Bundler. Curious if there's ever been any consideration of adding things like that to pip.
[00:30:58] Unknown:
Yes, absolutely. The --save thing is a little bit hard to translate over. But something like a Gemfile — I do wanna get something like that into pip. We sort of have it with requirements.txt, but requirements.txt still requires you to do a bunch of manual things with it. Either you don't pin, and then dependencies just sort of automatically get upgraded, which isn't great; or you pin, and then you're left stuck managing your entire dependency chain by hand, which is no fun. I really do like the Gemfile and Gemfile.lock concept from Bundler, where that's just sort of automatically handled, and I'd love to get that into pip itself.
The npm --save thing, I wouldn't be opposed to it, but you start getting into tricky questions like, where does it get saved? Does it just get saved to requirements.txt? Is it setup.py? Well, then you start talking about dynamically generating or modifying a Python file, which gets very tricky very quickly. So there are some fiddly bits to figure out there, and I'm not too sure we're gonna be able to do that very well. But certainly the Gemfile stuff — I certainly would like to get that into Python.
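The Gemfile versus Gemfile.lock split described above can be sketched in a few lines: you declare loose, top-level requirements, and a lock step records the exact versions the resolver actually picked. The package names and versions below are hypothetical, and this toy is not how pip works today — it just shows the concept.

```python
# A toy sketch of the Gemfile / Gemfile.lock idea: loose declared
# requirements on one side, a fully pinned lock list on the other.
# Names and versions are made up for illustration.

def lock(requirements, resolved_versions):
    """Pin each loose requirement to the exact version that was resolved."""
    pinned = []
    for name in requirements:
        version = resolved_versions[name]
        pinned.append(f"{name}=={version}")
    return sorted(pinned)

loose = ["requests", "lxml"]                       # requirements.txt-style, unpinned
resolved = {"requests": "2.9.1", "lxml": "3.6.0"}  # what a resolver might pick
print("\n".join(lock(loose, resolved)))
```

The point is that the human only maintains the loose list; the pinned list is regenerated mechanically, so you get reproducible installs without hand-managing the whole dependency chain.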
[00:32:20] Unknown:
And what does the infrastructure for the Python package index look like, you know, in terms of what is its operating environment? What are some of the server infrastructure and traditional services that support it? And also, I guess, what sort of tooling do you use to manage all of that?
[00:32:35] Unknown:
Yes. So that's a bit complicated, because we actually have two environments running at once. We have Warehouse, and then we have what I call the legacy PyPI, which is the PyPI everyone interacts with now. They're running side by side, but they share services too. So as a request comes in, we have Fastly on the outside, which is excellent — Fastly has been a huge help to us. Fastly caches almost all of what pip uses, most of the web UI, etcetera, and we rely on them pretty heavily to scale out to the roughly eight terabytes a day of bandwidth we use.
Then Fastly connects back to an origin server, which for legacy PyPI is a Rackspace VM running CentOS Linux, which is just running the PyPI software. For Warehouse, it goes to a Heroku dyno running Warehouse — except if they're requesting a package file, in which case it hits S3 directly instead of going through our servers. Then there's a PostgreSQL database in there, which is hosted by Heroku, and there are Elasticsearch clusters to handle search for both of them. And then for the statistics, Fastly streams logs into our syslog server, which buckets that out and throws it into Redis.
Then there's an hourly cron job that rolls that up, which is how the download statistics on PyPI get generated.
[00:34:12] Unknown:
So what have been some of the challenges around scaling some of PyPI's, infrastructure to meet demand? I know you talked about Fastly just now, but you also mentioned the PostgreSQL DB in the mix. And I know from my own past experience that caching the static stuff goes a long way, but there's also some fancy dancing required to not massively overload your database servers and the other sort of non static parts of the system.
[00:34:40] Unknown:
Yeah. So I have to say we're incredibly fortunate with the way that things fell into place. PyPI is vastly read-only — well, not read-only, but vastly read-heavy. We see something like 700 new packages a day, and we serve several million downloads a day, so there's a huge imbalance in that ratio. And we sort of lucked into a system where the vast majority of the smarts of what people are actually doing — which is using pip — is on the client side, not the server side. So something like 85 percent of our traffic or more is requesting fully static pages.
That's in terms of request numbers. Something like ninety-some percent of our bandwidth is requesting fully static things, which are served directly from Fastly. Unfortunately, we have some things, like the XML-RPC API, which we can't cache in Fastly. So for those we have some memcached to cache things locally on the servers. And then we have just the one database server. We're actually not too hard on our database server, just because we have the Fastly caching and the memcached caching to try and eliminate where we actually have to hit the database, other than writes.
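The layering described here — check a cache first, fall through to the expensive backing store only on a miss — is the classic read-through pattern. Below is a minimal sketch of it; the in-memory dict stands in for memcached, and the fetch function stands in for a database query. The "metadata for ..." payload is a made-up placeholder.

```python
# A minimal read-through cache sketch: look in the cache first and only
# fall through to the (expensive) backing store on a miss. A plain dict
# stands in for memcached; the fetch callable stands in for a DB query.

class ReadThroughCache:
    def __init__(self, fetch):
        self._fetch = fetch   # called only on a cache miss
        self._store = {}
        self.misses = 0       # track how often we hit the backing store

    def get(self, key):
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._fetch(key)
        return self._store[key]

cache = ReadThroughCache(fetch=lambda pkg: f"metadata for {pkg}")
cache.get("requests")   # miss: hits the backing store
cache.get("requests")   # hit: served from cache
print(cache.misses)
```

With a read-heavy workload like PyPI's, almost every request after the first becomes a cache hit, which is why one database server can be enough behind Fastly and memcached.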
[00:36:07] Unknown:
What sort of monitoring do you have in place? Because I'm sure that you need to have something fairly robust, given that if PyPI goes down, then it affects a large number of developers' and systems administrators' days.
[00:36:20] Unknown:
Yes, we do have Pingdom set up to monitor things. But to be honest, our best monitoring is that whenever things go down, people start pinging me on IRC and Twitter telling me it's broken.
[00:36:36] Unknown:
That's certainly a good way to make it highly available anyway. So as you mentioned, you're currently working on a replacement for the PyPI site with the warehouse project. Can you explain your motivation for that and how it improves on the current system and also some of the technology that you're using for it?
[00:36:52] Unknown:
Yes. So legacy PyPI, which we're running today, was written something like 15 years ago. Just to give an idea of what things were like back then: that's prior to Django, I think WSGI came out right around the same time, and Zope was the primary thing that people were using. It was sort of a dark time in web applications, and it came from that time period, and it's never had any significant investment in updating it since then. So you see a lot of really old, really junky code there. And over time we had people come in and try to hack things in as quickly as they could, without looking at the code as a whole and asking how we really wanna make this code easy to work with. The side effect of that is that it invents a lot of its own custom one-off things that were written the way code was written 15 years ago, and they're just horrible to work with. On top of all of that, there are basically zero tests — there are, like, one or two — and it's really hard to run locally. So testing is basically: we change something, we push it to production, and see if it breaks. Which is obviously horrible. I tried to refactor it in place to make it better, but the code is very spaghetti-like, and any significant refactoring ended up touching the whole thing anyway.
So I just kinda gave up and said, okay, I'm gonna start from scratch, rewrite this in a modern web framework using modern tools, require test coverage — all these sort of good things to have — and say, once we get feature parity, we'll just switch over to this, and then we should be in a better place. So Warehouse itself is written using the Pyramid web framework, it's Python 3, and it uses Gunicorn as the web server. All this tooling just didn't exist back when PyPI was originally written.
Richard Jones wrote it back then, and I believe he's commented more than once that he had fully expected someone to come along and replace it within a couple months. It was sort of a proof of concept that was never supposed to be production, but it evolved and got depended on for 15 years. Yeah — any sufficiently ugly hack will become permanent, and PyPI is a living testament to that. The fact that it's ugly and hackish will not stop the world from depending on it. Yep.
[00:39:27] Unknown:
So where do you see the future of dependency management in Python headed?
[00:39:31] Unknown:
Yeah. So I see us heading more towards static metadata. Not completely static metadata — some things we just can't make static — but moving to static where we can, and trying to get away from a lot of these implementation-defined standards, which have been a problem for us. Like, the standard for what an sdist is is just whatever setuptools and distutils produce, and that has made it hard to advance things. So a lot of what we've been trying to do is taking these ideas that people have had over time and saying, okay, these are decently good ideas — let's standardize them, remove the parts we don't like, and focus on standardizing the format without saying it's just whatever this tool produces.
And with that comes a lot more focus on formats rather than tools, which will hopefully make it so that setuptools itself is no longer a blessed tool, and is just another implementation of what a Python build tool could be. But overall, it's a lot of work on standardizing things and trying to make sure that all these pieces can interoperate with each other without having to know what the other pieces are.
[00:40:43] Unknown:
So a few days ago there was a big story about how an npm library was removed from the index, and that broke a large number of other projects and applications. For anybody who isn't familiar with the story: the library that was removed was an 11-line piece of code that left-pads a string with whitespace. I'm wondering if you think that anything like that could happen in the Python ecosystem, or if there's anything in the general best practices of Python, or the ecosystem itself, or even in the package management, that would prevent something like that from happening. Did you hear about that story, Chris?
[00:41:21] Unknown:
Oh, the left-pad fiasco? Absolutely — you'd have to be in a cave or dead to not hear about it. I don't wanna be knocking people's work, because Node is awesome and there's been a lot of really great innovation going on in that space, and I really admire it for its kind of youthful exuberance, but I think that has a dark side too. Right? They're running into all the same missteps that every newly hatched ecosystem has. This one in particular was particularly hairy because of the incredibly interdependent nature of Node.
Node has chosen to keep its standard library so light that people depend on this ecosystem of npm modules really extensively, and people had no concept that they were gonna be breaking the universe like they did when they unpublished that one. So it's been very interesting to watch, and I hope that people are taking note so that in the future, when the next big thing comes down the pike, we can all avoid making the same mistake again.
[00:42:29] Unknown:
Yeah. So to answer your question, this could absolutely happen to PyPI. We do allow people to delete things off of PyPI if they wish; we do not allow people to reupload something if they've deleted it. So that's sort of like what NPM was like. However, I think it's less likely that something like what happened to NPM would happen here, simply because of our culture and the way we run PyPI. We are very conservative about how we give a name up to someone else. We pretty much treat it as first come, first served, unless the person who has it now is willing to give it up, or unless there's a specific legal requirement to give it up. So, for instance, if someone had the name Google on PyPI, we would require some sort of legal process from Google to give it to them. Otherwise — unless the person who owns it now says, sure, give it to them — they're out of luck. That being said, we are looking into other things that we can do to try and reduce the impact or the damage that someone can do by deleting their projects from PyPI. I saw just recently that NPM published some of the things they're planning on doing, like requiring you to contact support to unpublish something that's more than 24 hours old. I could see us doing something similar to that.
But as it stands right now, we are in a very similar circumstance to what NPM was in when this happened.
[00:43:56] Unknown:
What's on the roadmap for pip? Are there any planned features that you have in the pipeline that people should be keeping an eye out for? And also, maybe list some of the most recent major updates to pip itself.
[00:44:09] Unknown:
Yeah. So some of the biggest things: if you haven't upgraded to at least pip 7.1, I believe it was, you should. You get automatic wheel caching. So the first time you type pip install lxml, it downloads it, builds it, installs it, and caches a wheel, and it reuses that wheel the next time you install it, which saves a tremendous amount of time. As far as what's coming down the pipeline, I talked about a few things, like finally breaking the hard dependency on setuptools so it becomes just another build tool that pip can use — which also would mean you won't have to have it preinstalled.
So you'd only need pip installed. We're also trying to change what the default location is when you type pip install, so that we don't incentivize people to install into the global site-packages, but instead into user site-packages or a virtual environment — maybe something akin to the node_modules directory, if we can swing it. And then there are a lot of things that are going to be internal refactorings, which is gonna be a while, but also things like better support for multiple repositories should be coming down the pipeline. And hopefully just general improvements to make things work better.
[00:45:28] Unknown:
One question that a coworker of mine had when I mentioned that I was going to be speaking with you: he was curious about what would be involved in building a mirror for PyPI, but one that was just a subset. So if there were a specific set of packages that they wanted to mirror locally, any other packages that weren't on that mirror would go out to the canonical PyPI to be fetched. He said he had run into some issues when trying to set up something similar to that, so I was curious if you have any input on that.
[00:46:00] Unknown:
Yes. So there's a package called devpi, which essentially does that, among other things. You don't give it a list of, say, these 15 packages upfront; instead, you instruct your pip to use it instead of PyPI, and as you use it, it acts as a sort of lazy, on-demand cache in front of PyPI — it will go to PyPI the first time, download the package, serve it to you, then save and cache it locally. Outside of that, there is a tool called Bandersnatch, although that's more for creating a full mirror of PyPI. It could possibly be modified to take a list and only synchronize that list of packages.
I'm not aware of any tool that, out of the box, takes a preformatted list and makes that available to you, although it shouldn't be terribly hard to modify Bandersnatch to do that.
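The routing logic behind the subset-mirror idea is simple enough to sketch: packages on a local allow-list are served from the local index, and everything else falls through to the canonical one. The URLs and package names below are hypothetical placeholders, and this is only a sketch of the lookup rule, not a working mirror.

```python
# A sketch of the "subset mirror" routing rule: mirrored packages come
# from a local index, everything else falls through to the canonical
# index. Both URLs and all package names are made-up placeholders.

LOCAL_INDEX = "https://mirror.internal.example/simple"
UPSTREAM_INDEX = "https://pypi.org/simple"

def index_for(package, mirrored):
    """Route a package lookup to the local mirror or the upstream index."""
    if package in mirrored:
        return f"{LOCAL_INDEX}/{package}/"
    return f"{UPSTREAM_INDEX}/{package}/"

mirrored = {"requests", "lxml"}
print(index_for("requests", mirrored))  # served from the local mirror
print(index_for("flask", mirrored))     # falls through to upstream
```

A devpi-style caching proxy goes one step further: on the fall-through case it also saves what it fetched, so the subset grows lazily with use instead of being declared upfront.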
[00:47:04] Unknown:
So are there any questions that we didn't ask that you think we should have or anything else that you'd like to talk about before we move on?
[00:47:12] Unknown:
No. I think it's I think it's pretty good.
[00:47:14] Unknown:
One thing I have, I guess: is there any area or particular aspect of PyPI or Warehouse or the PyPA where you would like to solicit help from our listeners? Like, where could you use a hand?
[00:47:31] Unknown:
Yeah. So if anyone's familiar with web development — or even if they're not — and they wanna help out working on Warehouse, that would be great. The faster we get that done, the faster we can get rid of the legacy PyPI, and the faster we get the nice new shiny PyPI, which has a great new design done by Nicole Harris. So if anyone wants to help out with that, that would be great. We've tried to set it up so that it's very easy to get started, and we're willing to help anyone get set up to work on it.
[00:48:04] Unknown:
Maybe we could list the GitHub repo for that in the show notes.
[00:48:07] Unknown:
Yep. It's pypa/warehouse.
[00:48:11] Unknown:
Cool. So for anybody who wants to keep up to date with what you're doing and the work that you're up to, what would be the best way for them to do that?
[00:48:20] Unknown:
Looking at the PyPA GitHub is a good thing, but I think most of the big things get announced on distutils-sig, and things get discussed there too, if people wanna get involved and participate in discussing these sort of higher-level, ecosystem-affecting changes. Then the various projects, which are almost all on GitHub, are where the rest of the work tends to happen.
[00:48:46] Unknown:
And, what about if they wanted to get in touch with you directly?
[00:48:50] Unknown:
They can email me at donald@stufft.io, or I'm on Twitter as dstufft, or on IRC on Freenode as dstufft. Any of those work — if you find a dstufft somewhere, it's me. Okay.
[00:49:05] Unknown:
So with that, I will move us on to the picks. For my pick this week, I'm going to choose a project called Xiki, spelled X-I-K-I. It's actually a really cool interactive shell that allows you to interact with your shell environment in the same manner as you would with an Emacs buffer. So if you type ls, it will print out an editable list of the files in your local directory, and then you can expand the directory tree. And once you find the file that you want, you can take that and put it back into your shell. It's just a really interesting reimagining of interactivity within a shell environment, so I definitely recommend that people check it out. It's pretty wild the first time you start working with it. And with that, I will pass it to you, Chris.
[00:49:54] Unknown:
Excellent. So my first pick this week is a really unusual — well, from my perspective, anyway — game. It's so stupid simple, but it's a really great game. It's free to play, and it really is actually free to play; you can get quite a lot of enjoyment out of it without dumping any money into it. It's called agar.io, and basically you're a cell running around in this cellular world, trying to eat cells and agar that are left lying around — cells that are smaller than you — while trying not to get eaten by cells that are larger than you. And it's kind of massively multiplayer. It's one of those things where you describe it and people say, okay, fine, but you actually play it and it is amazingly compelling. It does require a good net connection, however, but it's a great game. I heard about it on season 2 of House of Cards, which I'm also very much enjoying. My next pick is going to be Culprit. It appears to be a UK-based electronica band, and they are really fantastic — really great sort of textural, melodic stuff with a great deal of complexity. You should definitely give it a listen. And my last pick is a book: TCP/IP Illustrated, Volume 1: The Protocols. I've been getting back to first principles lately and really studying up on some of my infrastructure lore. I've been revolving around the Internet and its related protocols for many years — decades at this point — and it's been really fascinating. This book is really accessible, and if you're interested in understanding how IP networking works, how subnetting actually works, how the protocols layer onto each other — there's a wealth of really interesting detail in this book. It's a great place to geek out and learn all about the nuts and bolts of the Internet. And that's it for me.
Donald, what picks do you have for us?
[00:51:45] Unknown:
Yes. One thing I'm really excited about — it was just announced today — is that you can now run byte-for-byte Linux binaries on Windows 10. It's really got me considering going back to Windows for the first time in six years. So I'm really excited about that; it was announced today at their Build conference.
[00:52:09] Unknown:
Okay. And do you have any other picks? Just that one. Okay, great. I'd like to thank you very much for taking time out of your evening to join us tonight and talk to us about pip and the Python Packaging Authority and the Python Package Index. It's been very interesting, and I appreciate your time. Yep, thank you. Thank you, and good night. See you.
Introduction and Sponsors
Interview with Donald Stufft
Donald’s Introduction to Python
Involvement with PIP and Python Packaging Authority
Funding of PyPI and Python Packaging Authority
Importance of Central Packaging and Dependency Management
Curation and Security in PyPI
PIP in Standard Python Distribution
Mechanics of PIP and Virtual Environments
Virtualenv and Future Packaging Innovations
Wheel System and Its Benefits
Challenges in Python Packaging
Features from Other Package Managers
Developer Workflow and PIP Enhancements
Infrastructure of PyPI
Challenges in Scaling PyPI
Monitoring PyPI
Warehouse Project
Future of Dependency Management in Python
NPM Left-Pad Incident and Python Ecosystem
Roadmap for PIP
Creating a Subset Mirror for PyPI
Call for Contributions
Picks and Recommendations