Summary
Pandas is a Swiss Army knife for data processing in Python, but it has long been difficult to customize. The latest release adds an extension interface for registering custom data types with namespaced APIs, which allows for building and combining domain specific use cases and alternative storage mechanisms. In this episode Tom Augspurger describes how the new ExtensionArray works, how it came to be, and how you can start building your own extensions today.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 200Gbit network, all controlled by a brand new API you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
- To get worry-free releases download GoCD, the open source continuous delivery server built by Thoughtworks. You can use their pipeline modeling and value stream map to build, control and monitor every step from commit to deployment in one place. And with their new Kubernetes integration it's even easier to deploy and scale your build agents. Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com.
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Your host as usual is Tobias Macey and today I’m interviewing Tom Augspurger about the extension interface for Pandas data frames and the use cases that it enables
Interview
- Introductions
- How did you get introduced to Python?
- Most people are familiar with Pandas, but can you describe at a high level the new extension interface?
- What is the story behind the implementation of this functionality?
- Prior to this interface, what were the options for anyone who wanted to extend Pandas?
- What are some of the new data types that are available as external packages?
- What are some of the unique use cases that they enable?
- How is the new interface implemented within Pandas?
- What were the most challenging or difficult aspects of building this new functionality?
- What are some of the more interesting possibilities that you are aware of for new extension types?
- What are the limitations of the interface for libraries that add new array functionality?
- What is the next major change or improvement that you would like to add in Pandas?
Keep In Touch
- tomaugspurger on GitHub
- @TomAugspurger on Twitter
Picks
- Tobias
- Tom
Links
- Pandas
- ExtensionArray
- Original IP Address proposal
- Mid-implementation blog post
- Dataframe
- Numpy
- Cyberpandas
- Geopandas
- GIS
- Arrow
- CuPy
- JQ
- Wes McKinney
- Array ufunc
- Matplotlib
- Altair
- Seaborn
- Bokeh
- Dask
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 200 gigabit network, all controlled by a brand new API, you've got everything you need to scale. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute. To get worry-free releases, download GoCD, the open source continuous delivery server built by Thoughtworks. You can use their pipeline modeling and value stream map to build, control, and monitor every step from commit to deployment in one place. And with their new Kubernetes integration, it's even easier to deploy and scale your build agents.
Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons. And visit the site at podcastinit.com to subscribe to the show, sign up for the newsletter, and read the show notes. Your host as usual is Tobias Macey, and today I'm interviewing Tom Augspurger about the extension interface for Pandas data frames and the use cases that it enables. So, Tom, could you start by introducing yourself?
[00:01:18] Unknown:
Yeah. So I have a somewhat common background, I think, at least for the scientific people. I was in economics grad school and needed to pick up something for doing data analysis, some kind of programming language. They started us with MATLAB, which was not my favorite. And eventually, I made my way over to Python, learned Pandas, started contributing to Pandas. And that's essentially how I got to where I am today, which is data scientist slash software engineer for Anaconda, working on open source tools for data analysis. And do you remember how you first got introduced to Python? I think so. I had a friend who was into programming languages just in general. You know, like wrote his own iOS apps. He was in the accounting program. I was in economics. He guided me towards either R or Python.
And in the end, he recommended Python because R's assignment character is 2 characters.
[00:02:25] Unknown:
That's an interesting criterion to base your language choice on. It's an extra character. It's gonna take a lot of time over the years. Yeah. Yeah. Yeah. I think, like, RStudio has a keyboard shortcut to make it a single keystroke anyway. But yeah. Yep. And so most people are fairly familiar with the pandas library and some of the capabilities that it brings along. But can you give a high level overview of the new extension interface that you've added?
[00:02:52] Unknown:
Sure. So for some background, pandas has a data frame. It's a container for tabular data like you might find in an Excel spreadsheet or a database table. And internally, Pandas uses NumPy to actually store the data. So there's a whole bunch of NumPy arrays behind a Pandas data frame. The key difference between NumPy and Pandas is that NumPy provides n-dimensional, homogeneous arrays, whereas Pandas is really focused on the 1- and 2-dimensional case. But the trade-off there is that Pandas is able to support heterogeneous data. A NumPy array has a single data type for everything, whereas a pandas data frame has a single data type per column. So with a pandas data frame, you can have a column of ints, a column of floats, a column of strings, date times, you know, whatever. Right? So internally, you can have NumPy arrays, which provide the basic types like floats, ints, booleans, date times, those sorts of things.
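A quick illustration of that per-column typing (the data here is made up):

```python
import pandas as pd

# One data frame, a different dtype in each column.
df = pd.DataFrame({
    "ints": [1, 2],
    "floats": [0.5, 1.5],
    "strings": ["a", "b"],
    "dates": pd.to_datetime(["2018-01-01", "2018-02-01"]),
})
print(df.dtypes)  # int64, float64, object, datetime64[ns]
```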
But pandas users have a few different use cases than NumPy users. NumPy came, you know, more from the physical sciences. It's very popular in astronomy. So you're dealing with, like, a lot of floating point data, these sorts of things. Pandas users often want to deal with a richer set of data types. So we've made some extensions to the NumPy type system internally. One of the most notable ones is the categorical dtype, where your array's values can only come from a fixed set of categories. So if you think about survey data, this is, you know, do you strongly agree, all the way down the scale to strongly disagree. There might be 5 categories that your values can come from. So we wanted to support these kinds of uses within pandas, but it was too difficult, or NumPy didn't want these types of extension dtypes that we've been making. So internally to Pandas, we hacked in a few places these extension dtypes like categorical, or date times with time zones, or intervals for, you know, is this data in this interval, like you might find in Postgres.
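A minimal sketch of the categorical dtype, using the survey-scale example Tom gives (the response data is invented):

```python
import pandas as pd

# Responses may only come from a fixed, ordered set of 5 categories.
scale = pd.CategoricalDtype(
    categories=["Strongly disagree", "Disagree", "Neutral",
                "Agree", "Strongly agree"],
    ordered=True,
)
responses = pd.Series(["Agree", "Neutral", "Strongly agree"], dtype=scale)
print(responses.cat.categories)  # the fixed set of allowed values
print(responses.min())           # ordering is meaningful: "Neutral"
```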
So we had these internal extension types, and we wanted to essentially make that easier for people to do outside of pandas to basically say, hey, I have this array type. It's not a simple NumPy array, but it's similar enough that it can be stored in a data frame. And so my understanding is that for the
[00:05:18] Unknown:
impetus for creating the new extension interface, there was a customer who wanted you to add it directly into pandas. And when you brought that proposal to the community, it was rejected in favor of this new API. So I'm wondering if you can just give a bit of the background behind how this new interface was started, and maybe some of the challenges associated with building it.
[00:05:42] Unknown:
Yeah. So Anaconda does some consulting and a lot of open source development too. So they're kinda in this nice position to tie together the open source community and businesses and research institutions that have these use cases that maybe they can't share, or that don't have the connections within the open source community to propose directly. So they came to Anaconda with this need where they had IP address data in a tabular data set. So one of the columns in their dataset contained IP addresses. I wrote up a proposal to Pandas, tried to do some research on how prevalent IP addresses and this kind of networking data is within the community, whether people were asking around for it, and there was some interest for it on Stack Overflow and various other issues. But eventually, we decided that it was too niche, too type specific, serving too small a community. So we didn't wanna include it in pandas proper. But like I said, we had these extension dtypes internal to pandas that we had kind of hacked together in a few places. So we decided we wanted to support these use cases outside of pandas by providing the necessary hooks.
[00:07:02] Unknown:
So prior to this interface, what were the options for anyone who wanted to extend the functionality of Pandas or add new data types? Yeah. So you always have the fallback of storing your
[00:07:13] Unknown:
objects, your arrays, as Python objects. So NumPy can store anything. It just won't be as fast when it's storing Python objects. So if you were dealing, in this case, with IP addresses, you might store them as either an array of strings representing the IP addresses or perhaps an array of IP address objects. So Python 3 has an ipaddress module that provides IPv4 and IPv6 objects for the scalar types. And if you wanted to store an array of them, well, maybe you would just store an array of those objects. The downside to that is that, first of all, it's gonna be a lot slower. Once you have Python objects, you're not really able to lay them out in a nice memory-efficient layout like NumPy is able to do with the basic types like ints and floats and so on. So it's gonna be slower, and it's also gonna be error prone when you do something like, give me all of the columns that contain IP addresses.
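A rough sketch of that object-dtype fallback (the addresses here are made up):

```python
import ipaddress
import pandas as pd

# Rich scalars stored as plain Python objects in an object-dtype column.
df = pd.DataFrame({
    "addr": [ipaddress.ip_address("192.168.0.1"),
             ipaddress.ip_address("2001:db8::1")],
    "count": [10, 20],
})
print(df.dtypes)  # "addr" is just "object": nothing says "IP addresses",
                  # and element-wise operations fall back to slow Python.
```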
Pandas has a select_dtypes method where you say, I want all the floating point columns or all the integer columns. You can't do that with object columns, because there's no way to know that the object columns are actually containing IP addresses. You'd potentially be mixing in other columns there. And then you also have to do all your own logic for processing those values. So you would be having to reinvent a lot of the stuff that Pandas already knows how to do on array-like objects. And there's another project that's based off of Pandas, I believe it's called xarray, that allows for multidimensional
[00:08:47] Unknown:
data frames, essentially. And with this new extension array interface, will that also propagate to that other library for being able to do these n-dimensional tables and being able to
[00:08:59] Unknown:
use these new data types within them? I don't think so. I haven't used xarray greatly, but as far as I know, with xarray the actual data themselves need to be homogeneous, n-dimensional values. So the data inside an xarray dataset is an actual NumPy array, which is still gonna be limited to those basic types, and it does have to be homogeneous, a single data type, even if you have multiple dimensions. What xarray will benefit from is that, in the future, we'll be able to have these kinds of extension index objects. So that hasn't been implemented yet, but Pandas has the idea of the columns of a data frame, where you can store a NumPy array or now this extension array type, or an index, which is used for, like, looking up values. So xarray uses pandas indexes directly. And then in a future version of pandas, once we're able to support these kinds of extension array types in the indexes,
[00:09:57] Unknown:
xarray will gain that as well. I know that when you were first building this, there was a particular data type that you had in mind, being able to support IP addresses natively. But what are some of the other new data types that are available as external packages?
[00:10:11] Unknown:
Yeah. So the IP address use case is now being served by a library called Cyberpandas, which I think is a pretty great name. So that's gonna be providing the IP array, and then they also wanted to work with MAC addresses. And I should say, you know, upfront, just to give a more concrete use case here, the reason we can't store IP addresses just as integers is because our client was interested in working with both IPv4 and IPv6 addresses. And IPv6 addresses require 128 bits to store, and NumPy does not provide a 128-bit integer type. So the way we're doing this internally in Cyberpandas is storing a NumPy structured array, which is gonna have 2 fields, one for the lower 64 bits and then a second field for the upper 64 bits. So we're essentially tricking Pandas into thinking that this 2-d object with these 2 fields is actually a 1-dimensional thing. So that's Cyberpandas. It's doing the IP address and MAC array specific things.
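A minimal sketch of that structured-array trick (field names are invented; see Cyberpandas for the real implementation):

```python
import ipaddress
import numpy as np

# NumPy has no 128-bit integer dtype, so each address is split across
# two unsigned 64-bit fields of a one-dimensional structured array.
ip_dtype = np.dtype([("hi", np.uint64), ("lo", np.uint64)])

def pack(addr):
    """Split a (possibly IPv6) address into its upper and lower 64 bits."""
    n = int(ipaddress.ip_address(addr))
    return (n >> 64, n & (2**64 - 1))

values = np.array([pack("192.168.0.1"), pack("2001:db8::1")], dtype=ip_dtype)
print(values["hi"])  # upper 64 bits of each address
print(values["lo"])  # lower 64 bits of each address
```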
There's also GeoPandas, which is, again, a pandas-like library. It actually subclasses a pandas data frame right now, but one or more of the columns contain geographic objects. So these might be things like lines or segments or boxes, things you might find in a GIS library. And GeoPandas master has a refactor to do a whole bunch of these geographic operations in Cython, which will speed things up quite a bit. The way it's doing that is, instead of storing Python objects with the geometries, it's gonna be storing pointers to C structures that have all the geometry information. So it's just an integer that represents some pointer. In GeoPandas master, they had to do some pretty bad hacks into Pandas' internals to try and get their, what is actually an array of integers, to be treated as a special column. But with this extension array interface, it's gonna be quite a bit cleaner for them. Those are the only 2 public ones I know of. There's a contributor to Pandas who's doing something internally for their company, but it is quite new, so I'm hoping that we'll have quite a few of these packages popping up for domain specific uses. And one of the other nice things about having this unified interface is that rather than having to create these various subclasses that won't necessarily play nicely together,
[00:12:43] Unknown:
you just have this one interface, so that you can, for instance, have an IP field along with the geographic information, so that you can actually join the IP address with, maybe, the return value of a geo-IP lookup within a single data frame, rather than having these conflicting implementations and trying to choose one versus the other, or, you know, maybe instantiating them separately and then having to do a join across them or something like that. So GeoPandas currently uses a subclass of the data frame, which is, you know, fragile. In our documentation, we warn against it because it's hard to do properly. And like you said, being able to combine these different kinds of extensions
[00:13:22] Unknown:
just doesn't work when you're subclassing. But now that we have the interface within pandas, we also have an extension accessor API. Pandas users know about this already with, like, categorical: when you have a categorical column, you can type .cat, and then under that namespace there's a whole bunch of categorical specific properties. So we'll be able to do the same thing here. Cyberpandas registers the .ip namespace on the data frame. GeoPandas does .geo. So these various packages will be able to work together much better.
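A toy sketch of registering such an accessor; the "units" namespace here is invented purely for illustration:

```python
import pandas as pd

@pd.api.extensions.register_series_accessor("units")
class UnitsAccessor:
    """Adds a .units namespace to every Series, like .cat or .str."""

    def __init__(self, series):
        self._series = series

    def to_km(self):
        # Pretend the series holds distances in metres.
        return self._series / 1000.0

s = pd.Series([1500.0, 42000.0])
print(s.units.to_km())  # the accessor is available on any Series
```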
[00:13:54] Unknown:
And are there any other unique use cases that this extension interface will enable, or that any of the current or upcoming implementations will allow for, beyond just what the core Pandas project provides?
[00:14:08] Unknown:
Yeah. So we kind of always had the fallback of storing things as Python objects. So you could usually achieve what you want to do by just storing your scalar values as Python objects. But like I said, that's gonna be quite slow compared to a more optimized implementation. So you'll be able to analyze larger datasets more quickly by having an optimized extension array implementation. And I think combined with the new extension accessor API, where you can say, you know, .ip or .geo, it's gonna be quite natural for Pandas users to be able to, you know, install Cyberpandas and then just start analyzing IP address data like they would any other data type. Does this implementation
[00:14:55] Unknown:
work nicely with the Arrow library, for being able to translate pandas data frames back and forth with R using the Arrow serialization format? Yeah. So that's actually one of the use cases that we're gonna be prototyping next. There's
[00:15:12] Unknown:
an Arrow and Pandas contributor who's interested in prototyping storing strings as Apache Arrow arrays, instead of the NumPy array of objects that pandas currently uses for storing strings. So this is going to be really helpful for prototyping these sorts of things within Pandas. And then there are, you know, broader visions of an Apache Arrow backed Pandas data frame. I'm not sure exactly what that's going to end up looking like, but being able to drop these pretty simple
[00:15:43] Unknown:
implementations in as extension arrays will certainly be helpful for testing those out. And you've discussed a bit the fact that the new extension array allows for storing these values with alternative implementations beyond just NumPy. So I'm wondering if you can dig a bit more into the implementation details of how you built the interface, and what have been some of the most difficult aspects of that, particularly with needing to maintain the same API for the pandas data frames?
[00:16:16] Unknown:
Sure. So for an extension array author, there are 2 base classes that we provide, an ExtensionDtype and an ExtensionArray. They're abstract base classes, though we don't actually inherit from ABCMeta, because that makes isinstance checks really slow, so we had to remove that. But there are these base classes with some abstract methods. As an extension array author, you provide information on what your scalar type is. So for Cyberpandas, my scalar type is an IP address object, either an IPv4 or IPv6 object. You say what your missing value sentinel is, if you have one. So, like, the IP address 0 is gonna be treated as missing in Pandas operations like isna or fillna. And then there are various other array-like things: how do I compute your length, how do I slice you, how do I copy you. These are required of the implementation author.
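An abridged sketch of what such an author-facing pair might look like (the "angle" type is invented, and several required methods, such as take and _from_factorized, are omitted; see the pandas extending docs for the complete interface):

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray, ExtensionDtype

class AngleDtype(ExtensionDtype):
    name = "angle"        # what shows up in df.dtypes
    type = float          # the scalar type of individual elements
    na_value = np.nan     # the missing-value sentinel

    @classmethod
    def construct_array_type(cls):
        return AngleArray

class AngleArray(ExtensionArray):
    """Stores angles; here backed by a plain float64 NumPy array."""

    def __init__(self, values):
        self._data = np.asarray(values, dtype="float64")

    # Construction hooks that pandas calls.
    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        return cls(scalars)

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls(np.concatenate([arr._data for arr in to_concat]))

    # The basic "array-like things": length, slicing, copying, missingness.
    def __len__(self):
        return len(self._data)

    def __getitem__(self, item):
        result = self._data[item]
        return result if np.isscalar(result) else type(self)(result)

    def copy(self):
        return type(self)(self._data.copy())

    def isna(self):
        return np.isnan(self._data)

    @property
    def dtype(self):
        return AngleDtype()

    @property
    def nbytes(self):
        return self._data.nbytes
```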
We also provide some default implementations, for things like fillna and finding the unique values. It's assumed, or required, that your extension array can be converted to a NumPy array, even if it isn't actually a NumPy array natively. So by converting from your specific storage type to a NumPy array, we can provide default implementations of things like fillna and unique and factorize. It's just probably gonna be slower than what you could do on your own. A good example of this is if you have a GPU backed array. I think when we met at PyCon, I was prototyping with Sue, who also works at Anaconda, on getting a CuPy backed array inside of a Pandas data frame. As I understand it, GPUs are able to sort data extremely quickly. So you might not want to rely on the default sorting implementation, which would convert your CuPy array to a NumPy array, then sort, and then convert it back, since there's a bottleneck transferring from the GPU to the CPU and back again. But if you're able to override the method for sorting, then you can use the fast CUDA sorting kernels and things will be super quick. As far as some of the challenges, you really hit on it. Pandas is a large library. It has some crufty corners, I guess.
Part of the motivation here for implementing this extension array interface was to clean up our own internals. I think we had essentially reinvented this idea of sticking a not-quite-NumPy thing inside of a data frame or a series, like, 3 or 4 times, slightly differently each time. So we were able to clean up quite a bit by having this single clean implementation of what it means to have a NumPy thing, or a non-NumPy array-like thing, inside of a data frame. The hardest part was definitely preserving backwards compatibility and avoiding edge cases around that. Fortunately, we have a lot of tests in the Pandas codebase. Other than that, it was tricky from an API design standpoint.
We're being quite deliberate about not inventing a new array type. We're not replacing NumPy here. The idea is to be able to take array-like things without imposing any restriction on the actual storage of the data. So trying to design an API around how to do array operations on this kind of abstract notion of an array, decoupled from any storage, was an interesting challenge. And then another kind of interesting, fun one was being able to design tests that could be reused in other libraries. So we made these, essentially, base classes for tests, and you provide pytest fixtures for inserting your own data into them. So Cyberpandas provides a fixture to generate some specific IP address data that is then run against the tests that we use to verify correctness, and whether you are conforming to the expectations of the interface.
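A sketch of that reuse, assuming the hypothetical AngleArray from the earlier sketch lives in a package called my_extension (some shared test classes need further fixtures, such as data_missing):

```python
import pytest
from pandas.tests.extension import base

from my_extension import AngleArray, AngleDtype  # hypothetical package

@pytest.fixture
def dtype():
    return AngleDtype()

@pytest.fixture
def data():
    # The shared tests expect a length-100 array.
    return AngleArray(range(100))

# Inherit the shared pandas tests; they run against the fixtures above.
class TestConstructors(base.BaseConstructorsTests):
    pass

class TestInterface(base.BaseInterfaceTests):
    pass
```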
[00:20:27] Unknown:
And what are some of the limitations imposed by the new interface for any libraries that want to add new functionality? And have you had anyone come up against those limitations and want to add new capabilities, and how may that conversation have gone? Yeah. There are definitely some,
[00:20:46] Unknown:
you know, limitations; this is the minimum viable implementation, I think. We were overdue for a release, so we had to get things together and cut things off at some point. So some things, like operations, just don't work. If you compare 2 extension arrays, do, like, ==, I hope it throws an error. I'm actually not sure what happens right now, because it's just not implemented. So operations are definitely not done, and that's being worked on right now, actually. And then there's a whole bunch of other things; you know, pandas data frames have a ton of methods. We did the common ones. So, again, thanks to Anaconda, we had a concrete use case in mind. Our customer had use cases that they needed served. So we had a good example, and GeoPandas provided a secondary thing to check: okay, what all needs to be implemented for a reasonable set of analyses to be possible? But other methods just haven't been dispatched to yet. So things like isin currently will not do the correct thing: checking whether the values of the array are in some other array is currently not doing the correct thing. So there will be a whole long tail of methods that need to be added to the interface. But, you know, the most basic ones are served right now, I think. And you've touched briefly on some of the next changes that you want to be able to bring in for the extension array. But are there any other major changes or improvements that you'd like to see added in Pandas to make it easier to use and extend?
Yeah. So the biggest one is that Jeff Reback, who's one of the maintainers of Pandas, has a pull request open right now adding integer missing value support to pandas. So, a bit of background on missing values. Currently, pandas uses NumPy's NaN, which is a floating point value, and we use that to represent missing data even for integers. The unfortunate consequence of this is that if you have an array of integers, but even one of the values is missing, then your integer column suddenly becomes a float, and this causes all sorts of headaches. So we have a PR open right now adding integer missing value support. In the next version of Pandas it'll just be a thing that you can opt into, and we'll have to figure out a path forward for getting that to be the default behavior when missing values show up in integer columns. That's gonna solve one of the longest standing pain points within pandas.
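As that support eventually landed, after this recording, as the opt-in "Int64" dtype in pandas 0.24, the pain point and the fix look roughly like this:

```python
import pandas as pd

# The default behavior Tom describes: one missing value silently turns
# an integer column into floats.
s = pd.Series([1, 2, None])
print(s.dtype)  # float64

# The opt-in nullable integer dtype; note the capital "I".
s = pd.Series([1, 2, None], dtype="Int64")
print(s.dtype)  # Int64: still integers, with a real missing value
```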
Also, I think I mentioned earlier that pandas uses object dtype for storing strings. So strings are currently stored as a NumPy array of objects. That's painful for several reasons, but the biggest one is speed. So being able to have a dedicated string type will be really nice for the speed improvements, and for all the nice ergonomics that come with having a specific type for some kind of data. Another one, and this might be too in the weeds, is that we have a categorical dtype that currently serves kind of 2 purposes. The original intent is for data that can only come from a fixed set of categories. So you have this scale from strongly disagree to strongly agree: where are you on that scale? It's also serving a secondary purpose as a memory efficient container for data that has low cardinality.
So this is, you know, if you have, like, string data where there are relatively few unique observations. If you have the population of the United States and you're storing the state as, you know, the 2-letter code for each state, then that's gonna take a ton of memory stored as strings. But if you store it as a categorical, it's gonna be quite a bit more memory efficient. We're going to be adding a dedicated type for this kind of interned string that will have the memory benefits without the fixed-set-of-categories semantics attached to categorical. So that's within pandas.
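A rough illustration of that memory win (the data is random and the exact numbers will vary):

```python
import numpy as np
import pandas as pd

# Low-cardinality string data: many rows, few distinct values.
states = pd.Series(np.random.choice(["IA", "MN", "WI"], size=1_000_000))

as_object = states.memory_usage(deep=True)
as_category = states.astype("category").memory_usage(deep=True)
print(f"object:   {as_object / 1e6:.1f} MB")
print(f"category: {as_category / 1e6:.1f} MB")  # far smaller
```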
External to pandas, like I said, there's Cyberpandas providing the IP types. GeoPandas is gonna get a really nice speed improvement there. I spoke to a person who's doing medical research. They have MRI data, so they're dealing with these tabular datasets, but one of the fields is these actual images, and they wanna be able to tie their analysis to that image. So, essentially, treating an image as a single column in the data frame, but being able to do some image specific operations on that image data without necessarily loading it into memory ahead of time. You know, it'd be loaded in on demand. That'd be a pretty cool use case, I think. One really cool one to see would be better nested data support. So where a column is either an array or some nested thing, like a dictionary, or dictionaries of dictionaries. Being able to better support that, both, you know, storing it in a more memory efficient format than what pandas would currently do, just as objects, and also an API for working with that data that, you know, might look like jq, which is a command line tool for processing JSON objects. Being able to better support nested data would be really nice. I'm not sure if that belongs in Pandas or outside Pandas, but we'll at least be able to see it. And so
[00:26:09] Unknown:
Pandas was originally inspired by some of the data frame interface available in R. And since then, there have also been data frames added to Spark. So I don't know what the current landscape looks like as far as having a unified definition of what a data frame is and does between these different languages and runtimes. Yeah. So 2 things there, I think. So first of all, Apache Arrow is kind of trying to be the in-memory format for
[00:26:36] Unknown:
how to store data, tabular data anyway, across these various systems. So being able to have a column that is backed by an Apache Arrow array, and then serializing that and getting it to the JVM for use in Spark is, like, a major goal for Apache Arrow. I should say Apache Arrow is being worked on by Wes McKinney, who is the original author of Pandas, so he cares a lot about this. The extension array interface is gonna allow us to really quickly prototype what that would look like from the Pandas side of things. So it'll definitely help out with being able to have these alternative memory storage formats within Pandas.
So that kind of leads to the second idea, which is these kinds of foundational libraries, like NumPy and Pandas, being more about the nice APIs that they provide and less about the actual memory implementation. So there's an ongoing discussion within NumPy about how you handle this. You have various array implementations, NumPy's being the first one, but also things like CuPy providing a NumPy-like implementation on the GPU. So how do you write code? If I want to take, like, the logarithm of every element in an array, I would do that personally by writing np.log and then passing it the array. But if I want to write a library that supports both NumPy arrays and CuPy arrays, I suddenly have to add in this compatibility layer to be able to support both.
It'd be nice if your code, like np.sum or np.log, is able to dispatch to the correct implementation for that actual physical array. So this has been going on; the __array_ufunc__ protocol helps out with a lot of it, but we're kind of pushing that forward both on the NumPy side, and then I'd like to see it happening on the pandas side as well.
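A minimal sketch of that protocol (available since NumPy 1.13; the wrapper type here is invented purely for illustration):

```python
import numpy as np

class WrappedArray:
    """A thin duck array that intercepts NumPy ufuncs like np.log."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unwrap any WrappedArray inputs, run the ufunc on the raw data,
        # and re-wrap the result so the type is preserved.
        raw = [x._data if isinstance(x, WrappedArray) else x for x in inputs]
        return WrappedArray(getattr(ufunc, method)(*raw, **kwargs))

    def __repr__(self):
        return f"WrappedArray({self._data!r})"

arr = WrappedArray([1.0, np.e, np.e ** 2])
print(np.log(arr))  # dispatches to WrappedArray.__array_ufunc__
```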
[00:28:33] Unknown:
And do you see any movement in the other language communities to try and work with Pandas, or work with each other, to create a sort of common denominator implementation or API, so that it's easier to translate analytics code between various runtimes and languages?
[00:28:46] Unknown:
That one I'm not sure about. I know Apache Arrow wants to provide the memory format, and I think that Wes is interested in writing kernels. So, you know, these kinds of math operations, like how you take the sum over an axis of an array. He's interested in doing that and then being able to share it amongst various runtimes. That's a major goal of Wes's: to avoid all this kind of wasted effort of, you know, each language community having to do their own implementations of everything, which is kind of unfortunate, I guess. So I'm not sure where things are at on that.
[00:29:27] Unknown:
And are there any other areas of discussion related to this topic that you think we should cover before we start to close out the show? I think,
[00:29:35] Unknown:
kind of in the theme of making these libraries more about their APIs and less about their actual implementation, I would like to, when we find the time, make pandas plotting essentially configurable to use various backends. So Python is fortunate to have tons of libraries for visualization, matplotlib being the most notable. And Pandas plotting is currently built on matplotlib; it constructs these matplotlib objects. But we also have libraries like Bokeh, and now Altair, and I'm sure others that I'm forgetting, that provide other plotting backends. So it'd be nice if we could take the same pandas plotting API, which is great for quick, you know, throwaway plots to quickly visualize your data, but have it show up in your backend of choice, so that you can do further adjustments afterwards. We've got a small little prototype of that, but I'd like to flesh that out and make it an actual thing.
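For reference, this idea eventually shipped well after this recording as the plotting.backend option in pandas 0.25; a hedged sketch of how it looks:

```python
import pandas as pd

df = pd.DataFrame({"x": range(10), "y": range(10)})

# The familiar quick-plot API, rendered by the default matplotlib backend.
df.plot(x="x", y="y")

# With a third-party backend package installed, the same call can render
# elsewhere; the backend name depends on the package, for example:
# pd.set_option("plotting.backend", "pandas_bokeh")
# df.plot(x="x", y="y")
```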
[00:30:29] Unknown:
So for anybody who wants to follow the work that you're up to and keep in touch, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. For my pick this week, I'm going to choose the Black Panther movie. I watched that recently, and, you know, I'd been hearing a lot of good things about it, and I was happy to see that it lived up to most of them. So it's definitely a fun movie, worth it if you're into any of the Marvel superhero films. And so with that, I'll pass it to you, Tom. Do you have any picks this week? Yeah. Kind of a selfish one, but I'm gonna be picking a
[00:31:04] Unknown:
library called dask-ml. So this is a Python library for doing scalable machine learning. And it's a bit selfish since I'm working on it right now, along with some others, just trying to explore what it means to do large scale machine learning within Python. So things like fitting a scikit-learn model on a cluster. If that sounds interesting to you, check out Dask, which is the parent project that this falls under, and then dask-ml for the machine learning specific things. Dask is definitely an interesting project, and that reminds me that one of the other things I wanted to follow up on with this discussion is whether the extension interface for pandas is going to be reflected within Dask, for being able to parallelize operations on these data frames? Yeah. So we're discussing that right now. Basic things do work without any changes to Dask. There are a few other things that we think, you know, we could guess at on the Dask side of things, but we think it would be better for the extension array authors to explicitly opt into them. So, for example, Dask sometimes needs a small sample data set of your array. We could guess at that by looking at the first few values within Dask, but we think it'd be better for the library authors to opt into that, essentially.
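A small illustration of that sample-data machinery; note that _meta is a Dask internal, shown here only for illustration:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": range(100)})
ddf = dd.from_pandas(pdf, npartitions=4)

# The "small sample data set" Tom mentions: Dask keeps an empty frame
# with the right dtypes to plan operations without computing anything.
print(ddf._meta)
print(ddf.x.sum().compute())  # basic operations run partition by partition
```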
So things work basically well, at least for Cyberpandas. And for GeoPandas, we've played around with a scalable version of GeoPandas built on top of Dask as well. Those both work, since I think they're using NumPy arrays under the hood, but being able to properly support them will take a bit more work. Alright. Well, thank you for taking the time today to join me and discuss the work that you've been doing, and for
[00:32:48] Unknown:
adding this new functionality to Pandas. I'm definitely interested to see where it takes us. And so I wanna thank you for that, and I hope you enjoy the rest of your day. Thanks. Nice talking to you.
Introduction and Guest Introduction
Tom Augspurger's Background and Journey to Python
Overview of Pandas and the New Extension Interface
Challenges and Development of the Extension Interface
Use Cases and Applications of the Extension Interface
Integration with Other Libraries and Future Plans
Upcoming Features and Improvements in Pandas
Unified Data Frame Definitions Across Languages
Future Enhancements and Closing Remarks