Summary
Pandas is one of the most versatile and widely used tools for data manipulation and analysis in the Python ecosystem. This week Jeff Reback explains why that is, how you can use it to make your life easier, and what you can look forward to in the months to come.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable.
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for running your awesome app.
- When you’re writing Python you need a powerful editor to automate routine tasks, maintain effective development practices, and simplify challenging things like refactoring. Our sponsor JetBrains delivers the perfect solution for you in the form of PyCharm, providing a complete set of tools for productive Python, Web, Data Analysis and Scientific development, available in 2 editions. The free and open-source PyCharm Community Edition is perfect for pure Python coding. PyCharm Professional Edition is a full-fledged tool, designed for professional Python, Web and Data Analysis developers. Today JetBrains is offering a 3-month free PyCharm Professional Edition individual subscription. Don’t miss this chance to use the best-in-class tool with intelligent code completion, automated testing, and integration with modern tools like Docker – go to <www.pythonpodcast.com/pycharm?utm_source=rss&utm_medium=rss> and use the promo code podcastinit during checkout.
- Visit the site to subscribe to our show, sign up for our newsletter, read the show notes, and get in touch.
- To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers.
- Your host as usual is Tobias Macey and today I’m interviewing Jeff Reback about Pandas, the Swiss Army knife of data analysis in Python.
Interview
- Introductions
- How did you get introduced to Python?
- To start off, what is Pandas and what is its origin story?
- How did you get involved in the project’s development?
- For someone who is just getting started with Pandas what are the fundamental ideas and abstractions in the library that are necessary to understand how to use it for working with data?
- Pandas has quite an extensive API and I noticed that the most recent release includes a nice cheat sheet. How do you balance the power and flexibility of such an expressive API with the usability issues that can be introduced by having so many options of how to manipulate the data?
- There is a strong focus for use in science and data analytics, but there are a number of other areas where Pandas is useful as well. What are some of the most interesting or unexpected uses that you have seen or heard of?
- What are some of the biggest challenges that you have encountered while working on Pandas?
- Do you find the constraint of only supporting two-dimensional arrays to be limiting, or has it proven to be beneficial for the success of pandas?
- What’s coming for pandas? Pandas 2.0!
Keep In Touch
Picks
- Tobias
- Jeff
Links
- Continuum Analytics
- Myths Programmers Believe About Time
- Jupyter Notebook
- XArray
- Dask
- NumFocus
- PyLint Interview
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. I'd like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for running your next app or experimenting with something you hear about on the show. When you're writing Python, you need a powerful editor to automate routine tasks, maintain effective development practices, and simplify challenging things like refactoring. Our sponsor this week, JetBrains, delivers the perfect solution for you in the form of PyCharm, providing a complete set of tools for productive Python, web, data analysis, and scientific development, available in 2 editions. The free and open source PyCharm Community Edition is perfect for pure Python coding.
PyCharm Professional Edition is a full-fledged tool designed for professional Python, web, and data analysis developers. Today, JetBrains is offering a 3-month free PyCharm Professional Edition subscription for individuals. Don't miss this chance to use the best-in-class tool with intelligent code completion, automated testing, and integration with modern tools like Docker. Go to www.podcastinit.com/pycharm and use the promo code podcastinit during checkout. You can also visit the site at www.podcastinit.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. And to help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. Your host as usual is Tobias Macey, and today I'm interviewing Jeff Reback about pandas, the Swiss Army knife of data analysis in Python. And, Jeff, could you please introduce yourself?
[00:01:42] Unknown:
Sure. So my name is Jeff Reback. I graduated from MIT with a degree in computer science, but for about 20 years I actually worked at Deutsche Bank as a quant trader. It's interesting: I started off using Perl, of all things, which is kind of like the Swiss Army knife of data parsing languages. I did that for quite a long time, up through about 2009, when I switched to using Python. And for the last couple of years I've worked at Continuum Analytics, basically doing consulting for people using Python, pandas, and big data type of things.
[00:02:22] Unknown:
And how did you first get introduced to Python?
[00:02:25] Unknown:
So I basically decided in 2009 that Perl was a dead language. I really liked it, actually, but the world was moving on, so I decided to just switch over from Perl to Python. Ironically, I actually hated Python at first. I was like, oh, whitespace? That's just terrible. But after just a little bit of usage I was really, really happy with it. That lasted about a year or so before I started using pandas. This was way back, gosh, it must have been pandas 0.1 or 0.2 or something like that. I was working at a hedge fund at the time, and I started exploring with pandas. I needed something to manipulate my data with. It just became pretty easy, and so I continued on.
[00:03:09] Unknown:
Yeah, it's interesting how a number of people coming from Perl either go to Python because it's pretty significantly different, or they go to Ruby because of how much of Ruby was influenced by Perl to begin with. So it's interesting to see where people lie on that continuum.
[00:03:25] Unknown:
Absolutely. Ironically, I didn't discover Ruby till much later, and by then I was already hooked on Python. So, just continuing my little story: I started using pandas, and I was one of the first users out in the real world. I just liked it so much that I started doing all my stuff in it, and I thought, okay, hey, this is kind of interesting. I had never contributed to any open source project before; keep in mind, this was back in 2010. My first contribution to pandas was actually some integration with the PyTables library. I needed a way to store lots of data; I had my data frame in memory and I needed to store it on disk. So I said, oh, hey, I'll write this little interface. Looking back, that code was not the best in the world, but it got the job done. I eventually took this code and pushed it up, and Wes, who's the original author of pandas, and we'll talk about that in a second, and I actually talked a little bit. I was like, oh, this is kind of interesting, contributing to open source. Let me push some code up, and it works, and people can use it. Great. That went on for a couple of years, and eventually I started contributing more and more in my spare time. I'd have a few spare cycles while I was running some model or doing some analysis, and I'd say, oh, well, I'll just go look at some bug and go fix that bug. And it became more and more interesting over time to do that.
I'm just gonna digress a second and give you a little bit of the background of pandas. It was originally started in 2008 by Wes McKinney, who was at AQR at the time, a big hedge fund. He needed a way to marry the disparate systems that existed. As anybody who has to deal with messy data knows, some data you have in CSV files, some data you have in SQL, and some data is in a binary format sitting out there. What you want to do is read it all in and marry this data together, and then, the key thing, make this data line up. This is, of course, not a finance-specific problem, but a lot of the things that you actually want to do with the data, namely time series type operations, are finance specific. Pandas came out of this whole scenario where you could take some of your existing code, because Python connects pretty readily to your existing C or C++ code, read in your data, use your own libraries to do analysis on it, read from SQL and from your CSV files, put it all in memory, align it, and then just easily do your manipulations with it. So this was the rationale for the creation of pandas, and over the next 3 or 4 years, as the library was maturing, it became easier and easier to use, more flexible, and just more friendly. That's why I view pandas as having become successful, actually. A lot of people were essentially forced to create their own version of pandas to do this type of manipulation, and over time more and more people started to see, oh, hey, there's this nice open source library out there that allows me, in a very easy way, using the easy language Python, in a very readable way, to read my data in and manipulate it with just a small number of commands.
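The workflow Jeff describes, reading data from disparate sources and marrying it on a common key, might look something like this minimal sketch (the file contents and column names here are invented for illustration):

```python
import io
import pandas as pd

# Price data as it might arrive in a CSV export (one source).
csv_data = io.StringIO(
    "date,price\n"
    "2017-01-03,101.5\n"
    "2017-01-04,102.0\n"
)
prices = pd.read_csv(csv_data, parse_dates=["date"])

# Position data as it might come back from a SQL query (a second source),
# stood in for here by an in-memory record set.
positions = pd.DataFrame(
    {"date": pd.to_datetime(["2017-01-03", "2017-01-04"]), "shares": [10, 15]}
)

# Marry the two sources on their common key so the rows line up,
# then compute with the aligned columns.
merged = prices.merge(positions, on="date")
merged["value"] = merged["price"] * merged["shares"]
print(merged)
```

Once the sources are merged into one frame, all further manipulation happens through the same data frame interface regardless of where each column originally came from.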
And so this became very popular with lots of people over the years. So that's how pandas sort of grew up. The next stage of pandas was, I'll say, from maybe 2012 through now, where it was being picked up not only by finance companies and people who work at finance companies, but also by many, many other disciplines. We have had communication, through bug fixes and bug reports, from so many different disciplines: everywhere from physical science people, to neuroscience people, to people who do consumer products, to advertising people who are taking all this massive advertising data and want to parse it. Pandas is a very large library, and people take from it and use certain parts of it. This is why I think pandas has risen in the last few years to, I'll say, almost dominate basic data wrangling in Python: people can find what they need in it. It's just a giant toolbox. You mentioned at the very beginning of this podcast that it's the Swiss Army knife. Absolutely. I think it's more than the Swiss Army knife: it's the hammer and the screwdriver, all of the tools rolled into one, and people can take from the toolbox the tools that they need and want. One other thing I'll mention is that the very fact that it is open source means it's very, very high quality. Pandas is very well tested by lots of different people. We have something like 15,000 tests now that run on every single commit, and all of these tests came from bug reports from people who were running the code, doing something, found some edge case or whatever, and contributed back.
You know, at the very least they contributed back an issue, or maybe they went one step further and contributed back some actual code to test that issue. So I think pandas is actually very fortunate to be used by lots and lots of different disciplines. It gives it a lot of flexibility, and it makes it very robust. This is a great example of software that has grown in an almost grassroots fashion. There was a central vision, but that can only take you so far; there's been a lot of, we'll say, backfilling of features from other languages. Pandas had a lot of influence originally from R and even from things like Excel. Excel is actually a very interesting use case, because tons and tons of people obviously use Excel, and at some point they're going to hit the limits of Excel. They're going to be like, oh, let me look at this pandas thing. It can do a lot of what Excel does in a very natural way. It's different, of course, but it's also very complementary. And this is just a good thing. Pandas has definitely become quite ubiquitous in a lot of areas, and it's become almost the de facto introduction to doing data analysis in Python: the first step is to import your data into a pandas data frame.

[00:09:40] Unknown:
And I've actually spoken with a few people who have had the experience where they've come into Python specifically to learn pandas. In some ways, they don't even realize that Python itself is its own independent language, because they just say, oh, I write pandas. It's similar to the way that I've spoken with a lot of people who come into the Ruby language by way of Ruby on Rails and say, oh, I'm a Ruby on Rails developer, not even really realizing that Ruby is a language separate from the overarching framework. And then once they do realize that Python is a language apart from the pandas library, they can go and explore all the rest of the Python ecosystem, and the power of the language and the libraries that are available to them, to then tie into the work that they've been doing with pandas for their data analysis.
[00:10:20] Unknown:
Absolutely. Interestingly enough, this brings up something: pandas has sort of grown up now, and in fact we're approaching our 1.0 release. I'm not going to give a time frame, because we don't actually have one yet. And this is the interesting point. Pandas now, as you say, is one of the first experiences of a lot of people coming into the Python ecosystem; it's being taught at the college level and the high school level, and any data scientist now will want to learn pandas. That's actually a responsibility of pandas now: to make sure that the API it presents is very Pythonic. In other words, it's very readable, it's very concise, and at the same time it does not completely obliterate any Python rules that exist. There are some exceptions, of course, because pandas by definition is a vectorized way of talking about things. It's array oriented, built upon NumPy, and doing things in an array oriented way is just simply different from plain Python. So pandas really does take this responsibility to heart: number one, making sure that versions are as backward compatible as possible, and of course providing flexibility and things like that, but it really is important, I think, to show the best parts of Python in the API.
[00:11:42] Unknown:
As you mentioned, one of the original inspirations for pandas was the data frame from R, and pandas has actually come to surpass what is available in R. A lot of people hold it up as the example of what a proper data frame implementation looks like. And I think that that unifying abstraction is probably a big part of what has led to the success of pandas, because, while there are a number of different inputs and outputs and analytical capabilities built into the library, they all tie into that one unifying core abstraction of the data frame and the series. So I'm wondering if you can talk a bit more about the principle of what a data frame is. And for somebody who is just starting to come into the pandas library, what are some of the most fundamental ideas and abstractions that they should understand when they first start using it to work with data?
[00:12:40] Unknown:
The core abstraction of pandas is the data frame. A data frame is, think of it as, a collection of 1-dimensional objects, which we'll call series. They are ordered, and they are possibly heterogeneous in data type. In other words, you could have a couple of integer columns, a float column, a string column, a datetime column, whatever, and this is very important. Naively, it looks like a columnar database, and in fact it wears the hat of a database to some extent. You can do things like joins, and you can do projections, in other words select out certain columns, but it provides even more on top of this. One of the key ideas behind pandas is the idea of alignment of data. In other words, if I do an operation between my data frame, which, remember, is basically a table-like structure, and, say, a series, pandas is going to align both of these on what are called indexes. In other words, the labels on the left-hand side, kind of like the dates, will be aligned. And when I say aligned, what I mean is you don't have to worry about whether you're off by one in the dates or not, because pandas will naturally account for missing values in both datasets when doing the operations. So this is another big idea: pandas has first-class missing value support. Not 100%, and we'll talk about that a little bit later because of some of the implementation details, but the idea is that you can have missing values, and missing values will propagate naturally. Adding a value plus a missing value clearly gets you a missing value, and this is very, very powerful. It allows you to work with data that is not completely aligned. Furthermore, pandas provides a very interactive experience, and nowadays this has become so important:
the ability to just type a few keystrokes and introspect your data, or change it, or really just look at parts of it. Pandas provides really first-class support for indexing and the ability to view your data in a very nice way. Another key feature of pandas is the ability to work with various input and output formats. Everybody has CSV-type data, great, but maybe I want to export to Excel, or read from Excel, or from SQL, or from a binary format, and pandas supports all of these formats natively. In fact, some people use pandas simply to move data from, say, Excel to SQL, and it works great for that.
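A small sketch of the alignment, missing-value, and indexing ideas described above (the labels and values here are made up for illustration):

```python
import pandas as pd

# Two series whose labels only partially overlap.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Arithmetic aligns on the labels first; "x" has no partner in b,
# so the result there is a missing value (NaN), which propagates.
total = a + b
print(total)
# x     NaN
# y    12.0
# z    23.0

# Label-based vs. positional indexing on a data frame:
df = pd.DataFrame({"price": [1.5, 2.5], "qty": [10, 20]},
                  index=["AAPL", "MSFT"])
assert df.loc["MSFT", "qty"] == 20   # look up by label
assert df.iloc[1, 1] == 20           # look up by position
```

Note that no explicit loop or bounds checking was needed: the index labels, not the physical row positions, decide which values are combined.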
Just as I was mentioning database functionality before: a very important part of databases and of data analysis is the ability to group your data and perform aggregations on it. A traditional group-by, where I form various groups and then, say, do a sum on those groups. A reduction of my dataset down to some sort of aggregates is a very important capability when working with databases or with pandas data frames. And the last feature I'll mention: I talked about input and output of data, but the ability to easily visualize your data is also very important. Now, pandas does not have all the plotting libraries baked in, but what it does have is the ability to very simply call DataFrame.plot, and you will get a plot of your data right away. It'll just work. If I had to name the number one feature of pandas, it's intuitive data analysis.
The function names are very intuitive. The actions are intuitive, so it's natural. It's readable.
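The group-by aggregation and input/output round trip described above can be sketched like so (the column names and values are invented for illustration):

```python
import io
import pandas as pd

df = pd.DataFrame(
    {
        "sector": ["tech", "tech", "energy", "energy"],
        "price": [10.0, 20.0, 5.0, 15.0],
    }
)

# Form groups by sector and reduce each group to an aggregate.
sums = df.groupby("sector")["price"].sum()
print(sums)
# energy    20.0
# tech      30.0

# Round-trip through CSV, one of the many formats pandas reads and writes.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
assert pd.read_csv(buf).equals(df)
```

The same `groupby` object also supports other reductions (`mean`, `count`, custom functions), which is the "reduction to aggregates" capability Jeff describes.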
[00:16:24] Unknown:
And as you mentioned, there is quite an extensive API that does quite a good job of being easy to understand and easy to guess at. I also noticed that one of the most recent releases includes a nice cheat sheet for getting a high-level view of some of the different capabilities, with shortcuts straight to what you need. So I'm wondering: as a library developer and maintainer, what are some of the ways that you balance the power and flexibility of such an expressive API against the usability issues that can be introduced by having so many options for how to manipulate the data, and how do you expose the different layers that are available to users of pandas in an incremental fashion?
[00:17:01] Unknown:
It's funny. When you're managing an API, you always want a new function name so people can say, oh, you know, I want to do xyz. Okay, so we'll give you xyz. But at the same time, if you look at the number of methods that are available on the data frame, it's over 100. So already, this is too many. I think in the last few years we've been very, very careful about adding other top-level methods, and at the same time we've been really, really working hard to make the existing methods consistent in terms of the naming of the arguments, certainly the ordering of the arguments, and the basic functionality.
What it says it does is what it should do, and for the most part that was the case. However, as a library grows and has code contributed through various different sources, sometimes consistency can slip. So, in fact, over the last year at least, one of the big deals has been to really lock down the API, and to make things that are available on group-by also available on the main API for data frames. As an example, in the next release I've been working on adding .aggregate functionality directly to data frames. Right now you can say DataFrame.groupby(...).aggregate(...), give it a function or a list of functions, and get a summarized view. Up until now, this has not existed on data frames themselves. Could you do it? Sure, you could do it, but not in the same natural way. And that's the key point here: unlike Perl, where there's more than one way to do it, in Python there really should be just one way to do it, and for the most part pandas follows that idiom. One area in which pandas does have, I'll say, somewhat of a confusion is indexing, actually. Over the years there's been a lot of, we'll say, accommodation of users, because you want to provide users with all kinds of capabilities, but sometimes these can actually lead to confusion. So another thing we've done recently is to deprecate the .ix indexer, and here's an example for people who don't really know about this. So you have 2 axes to a data frame.
There are the rows, and there are the columns. Sometimes people want to do what's called label-based indexing; in fact, you want to do this most of the time. You want to look up a label on the rows and the columns and select those elements out. But there are times when you actually want so-called positional indexing, where you simply count from 0 on either axis. People have tended to use shortcuts that mix both types of indexing, and this leads to lots of confusion about what's going on. So several years ago we introduced separate so-called indexers for this, namely .loc, which is for label-based indexing, and .iloc, which is for positional indexing. Part of this whole cleanup over the years has been to really make sure there's just one way to do things. There's not 3 ways to do it; there's just one way. So we'll get there, and I think once we get there, we're going to have a 1.0 release. It's going to take a little bit more time, but I think that the API is really very important, and if we have to take a little bit more time to get to a really solid API that we're not then going to change, that's worth waiting for.

[00:20:28] Unknown:
As you mentioned before, pandas is strongly focused on being used in data science, data analytics, and the physical sciences, but there are also a number of other areas where pandas is useful. So I'm wondering: what are some of the most interesting or unexpected uses that you've seen or heard of?
[00:20:41] Unknown:
So it's funny. One of the biggest uses of pandas out there, I suspect, is things like log parsing. And here's the funny thing about this; I always say this when I'm doing a talk on pandas. We have lots of discussions on issues on our tracker, and there are always people who want to comment on certain kinds of things, and repeatedly they come back and talk about those specific issues. The funniest case I've ever seen is this: pandas has really first-class support not only for timestamps that are naive, in other words without a time zone, but also for timestamps that have a time zone. And time zone arithmetic itself is actually fairly complicated; there are a lot of edge cases and things like that. So over the years we have had, I'll say, 3 or 4 contributors whose sole purpose is to comment on issues related to daylight saving time transitions, which is a very funny thing, because we have 2 of them per year in the US, of course, but there are transitions in other time zones and other countries and so forth. The very funny thing is that this is a nontrivial problem, because in a transition you have times that are either repeated or skipped. You especially have to handle this when you have log data, for example advertising log data, which of course runs through these times. Financial data rarely does, but other types of data do. So it's very interesting where you have applications in very, very specific areas that are not physical science and not financial applications.
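To make the daylight-saving edge case concrete, here is a small illustration using pandas' time-zone-aware timestamps (the dates are chosen around the 2017 US spring-forward transition purely for illustration):

```python
import pandas as pd

# In US/Eastern, clocks jumped from 2:00 to 3:00 AM on 2017-03-12,
# so the hour between 2:00 and 3:00 simply does not exist.
before = pd.Timestamp("2017-03-12 01:30", tz="US/Eastern")
after = pd.Timestamp("2017-03-12 03:30", tz="US/Eastern")

# Two hours apart on the wall clock, but only one hour of real time elapsed.
print(after - before)  # 0 days 01:00:00

# Localizing a naive timestamp that falls inside the skipped hour raises,
# exactly the kind of edge case log-parsing users run into.
try:
    pd.Timestamp("2017-03-12 02:30").tz_localize("US/Eastern")
except Exception as err:
    print(type(err).__name__)
```

`tz_localize` also takes `ambiguous` and `nonexistent` arguments for choosing how repeated or skipped wall-clock times should be resolved instead of raising.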
[00:22:09] Unknown:
So that was just one point. I'm not sure if you've come across this, but there's actually quite a long and humorous list of contradictory beliefs that programmers have about time. And, having had to deal with time zones and timestamps in quite a number of applications, I can sympathize with the people who have to keep educating others on the challenges that are posed by that.
[00:22:30] Unknown:
Yes, absolutely. There are a lot of great things about the Python ecosystem, and the first thing I always point to is the great breadth and depth of its libraries, which is just amazing. There's a package for everything you can think of, and that is so great, actually. Whatever problem you have, the very first thing you do is look for a package that solves exactly that problem, and 9 times out of 10 you'll be able to find one, which is great. In fact, this is another, I'll say, hard problem: finding the right package. You have packages which form the core of the Python ecosystem, and pandas is one of those in the PyData stack, but at the same time it is useful from a user perspective to have lots and lots of functionality. As an example, pandas could split off its time series functionality or something, and that would be fine; you could make it so that pandas users can still use it, of course, and nowadays installation of packages is not so hard. But I really think that having a common and curated API for a large part of the data analysis stack is really, really key. Conformity and uniformity are not things you would think are very important, but, and I'll just give my one dig about R here, in the R ecosystem there are packages for everything, and you don't know the quality of them, how well they're maintained, or how well they work together. So it actually behooves Python, and I'm talking collectively here, to spend a lot of effort on having high-quality core packages, and it does.
[00:24:12] Unknown:
Yeah, it's definitely a challenge, particularly coming in as a new user to a particular ecosystem or set of tools, to have that uncertainty about the level of quality of what's available, or to not really understand what the canonical packages are for any particular capability. If you do have a well-curated list, or at least reference material that somebody can look to as a launching point to see which packages they would use to get started with a particular set of capabilities, that's immensely valuable. And pandas and NumPy are such strong cores of a huge portion of the Python ecosystem, even beyond just data analysis. I've seen them reach into areas including web development and back-end systems engineering, because numerical capability is such a far-reaching need within any program. The fact that we have such powerful packages, well respected even beyond the Python ecosystem itself, speaks volumes about what people can do with Python and what people have done with Python.
[00:25:18] Unknown:
Absolutely. And actually, that brings up a very good point. It's interesting. Python is actually really well known, not necessarily for data analysis. Obviously, that's, you know, a key point nowadays, but web development is huge. And the interesting point about that is that now I I see more and more web developers actually starting to reach for pandas to do their, you know, hey, they want to throw up a table, or they want to do some simple stuff in in their UIs. They reach for Pandas to do this. And in fact, we are also having a lot of interest from, actually, the front end developers. In other words, things like, like the Jupyter developers to integrate more with Pandas because everybody wants to have their data frame and to have it and of course what they really want is to have an interactive data frame, you know, that that scales and they can flip the columns around. And so this is all coming. This is the next thing that's gonna happen is in the notebook, starting in the notebook, you will have really first class data frame support for manipulation of your data frames, and it's going to go both ways too. So, you know, making changes in the front end will reflect back in in the back end. So this is really coming, and it's exciting because, you know, this is I I think, you know, taking your data and putting up a graph that, you know, that's that's been going on a while, and you can do, you know, a lot of there's a lot of great packages for this, but really first class table support is, you know, that's really, really well integrated in the Python ecosystem. And I'm not saying go and writing all this custom, JavaScript, but I'm saying really first class in the Python ecosystem
[00:26:47] Unknown:
is coming. Yeah. I think another aspect of Python, particularly in the data analysis space, is the fact that we as a community are very outward looking. We don't try to lock people into doing everything within Python. We actually try to provide a lot of outlets for people to use it as a waypoint, or even to migrate from Python to other languages, because we know that by locking people in, people are less likely to even start using Python. One of the more notable recent developments on that front is the Feather library, which wraps Apache Arrow and allows easy interoperability of data frames between Python and R. And I think that's one of the really powerful lessons that the pandas community in particular really carries
[00:27:34] Unknown:
out. Absolutely. And just commenting on that, this is going to go even further. Wes is currently hard at work on reengineering some of the, we'll say, base levels, not even of pandas at this point. His new project is called Arrow, or PyArrow, and the goal is essentially to provide a very convenient format for exchanging data in memory between various things, namely things like Spark and pandas. So this will facilitate movement of data in memory. Feather is actually just an offshoot of this, and it's going to be rolled back into Arrow soon, so that there will effectively be a standard for exchanging data in memory and on disk, not necessarily between various formats, but between various memory processes. As an example, it'll be very convenient for you to use PySpark and then push your data to pandas in a very performant way. And that's the key point here: pandas is great, and it's very performant itself, but when you put it in a so called big data framework, there are some translation barriers that exist right now. So Wes is hard at work reengineering some of this low level basic infrastructure in order to facilitate an even more performant data frame. And this leads me to the next generation of pandas. We're going to call it pandas 2.0, even though we've not officially blessed pandas 1.0.
There is a notion of pandas 2.0, where we're going to reengineer, we'll call it, the back end of pandas. The idea being that 95% of the user facing API is going to be unchanged. Only specific edge cases, which right now are in fact not even possible, will change. Here's an example. Right now, it is not possible to represent NA for integers. Pandas simply converts them to floats, because the implementation of pandas is in NumPy, and it's just not supported there. Pandas 2.0 will have first class support for missing values in all data types, including integers. That's an example of the small percentage of the API which is going to break to support this type of functionality. However, we're going to be able to do almost this entire, we'll call it, back end implementation change, where things will be faster, more flexible, and better supported than they are now, without changing the user experience.
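To make the integer-NA limitation concrete, here is a small sketch of the current behavior described above (assuming only that pandas is installed; the variable names are illustrative):

```python
import pandas as pd

# an all-integer Series keeps an integer dtype
s = pd.Series([1, 2, 3], dtype="int64")
print(s.dtype)  # int64

# introducing a missing value silently upcasts the whole column to float,
# because NumPy-backed integer arrays have no way to represent NA
s2 = s.reindex([0, 1, 2, 3])
print(s2.dtype)    # float64
print(s2.iloc[3])  # nan
```

This silent float upcast is exactly the coercion Jeff describes, and first class missing-value support for every dtype is one of the headline goals of the pandas 2.0 effort.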
The hope is to have something very close to a drop in replacement. So give us some time, but I think this is an exciting development for pandas, because pandas will really be able to take its place as a first class user experience. It already is first class, but this makes it even more so, with the ability to interact with the rest of the world at large in a better way. And what are some of the biggest challenges that you've encountered while working on and maintaining pandas? Well, the biggest challenge, really: so we have a fair number of contributors. We have, of course, the drive by contributors, but we also have some people who've done patches that are well thought out and fairly invasive. The biggest problem is that pandas is a fairly large and sprawling codebase. For people, that's sort of a double edged sword. They can work on a portion of the code, in other words, something that interests them. Maybe they're interested in a bug in Excel. Well, our Excel interface is not that complicated, and they can dive in and figure it out. So that's great for new contributors, actually. The biggest challenge, I would say, is having a group of people who can make, I'll say, somewhat far reaching changes.
In order to really make significant impacts nowadays, you have to understand a fairly large part of the code base. Yes, there are a lot of peripheral parts which are sort of self contained, but to make some far reaching changes, in other words anything related to, say, indexing, you have to really dive in, and that's been the challenge. It's having people who can do that and, correspondingly, other people who can review that code. And so up till now, it's been a fairly
[00:31:41] Unknown:
small group of folks there. So I would say that's probably the biggest challenge. And what does the internal software architecture and design look like that enables such broad functionality and allows multiple people to work on the code base without stepping on each other?
[00:31:58] Unknown:
So pandas has actually been really quite accessible to people, I think. It's right now about, I don't know, 75% Python code, maybe 20% Cython code, and a small percentage of pure C code. Interestingly, in our new version this will flip flop, and you're going to have a large percentage of the code base in C++, with Python becoming much more of a thin wrapper. So, ironically, this will actually make it a little bit harder to contribute to the code base, because it's very easy to tinker in Python or even Cython code.
[00:32:30] Unknown:
And is it a fairly modular code base where it uses sort of a plug in architecture for being able to load in the different import and export interfaces?
[00:32:39] Unknown:
So the architecture is, we'll say, somewhat modular. As far as the input output layer, yes, that is very, very modular. Adding or chopping off a module is fairly easy. In fact, over the years, we had some folks who added, for example, an interface to Google's BigQuery that was written directly in the pandas code base. We're actually going to split that off into its own separate package so that those folks can work on it independently, yet it will still work with pandas directly. So things like input output adapters are very modular. Fortunately or unfortunately, other parts of the code, while fairly well designed, very well tested, and organized, are not super modular, I'll say. Although, over the years, we have become a lot more friendly to things like subclassability.
So there are folks out there who have wanted the ability to subclass Series and DataFrame, and it is certainly possible nowadays. I normally don't recommend it, but there are a few cases where it's useful. So pandas, along with becoming more consistent in the naming of parameters and method names and things like that, has become, I'll say, more covered in terms of inheriting from common base classes and things like that, where it is at least approaching
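As a hedged sketch of what that subclassing looks like in practice (the `_constructor` hook and `_metadata` list are pandas' documented extension points; the `LabeledFrame` class and its `label` attribute are invented for illustration):

```python
import pandas as pd

class LabeledFrame(pd.DataFrame):
    # names of custom attributes pandas should try to propagate
    _metadata = ["label"]

    @property
    def _constructor(self):
        # make pandas operations return LabeledFrame rather than DataFrame
        return LabeledFrame

df = LabeledFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
df.label = "experiment-1"

# slicing goes through _constructor, so the subclass type survives
subset = df[df["a"] > 1]
print(type(subset).__name__)  # LabeledFrame
```

As Jeff says, subclassing is rarely the first tool to reach for; simply wrapping a DataFrame in your own class is often the simpler design.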
[00:34:02] Unknown:
well, pretty well designed code. And jumping back out to the higher layer, do you find the constraint of only supporting two-dimensional arrays to be in any way limiting? Or do you think that has proven to be beneficial for the success of pandas, because of the constraint placed on the way in which the data is interacted with?
[00:34:21] Unknown:
So, pandas is named for panel data. A panel, in pandas land, is a three-dimensional array, but we've actually been moving away from this, and in fact, we're going to deprecate Panel pretty soon. I guess some people will be sad, but three-dimensional data is far less common than, say, two-dimensional or one-dimensional data. In fact, you can certainly represent three-dimensional data as two-dimensional data with a MultiIndex, and up till now this has been a very common way of doing things. Also, there's what we'll call a sister package called xarray, which is built on top of pandas and designed for handling n-dimensional, we'll call it, data frames. These are the folks who do weather data, which is naturally four or five or six dimensions, and they designed it for this reason. So we are very happy to sort of spin off, as you call it, panels, because the implementation just becomes that much more complex, indexing becomes that much more complex, and the vast majority of things people are doing are really in two dimensions or one dimension. Going to three certainly happens, but there are a lot of ways to represent data. So to answer your question, I don't think it's really a problem. So are there any other topics or questions that we haven't covered that you think we should touch upon? Sure. I just want to briefly talk about Dask. So pandas is primarily an in memory vehicle for doing data manipulations, just as NumPy is an in memory system for array manipulation. There's another package called Dask, which is, in effect, a distributed data frame and also a distributed NumPy array. So when people come to me and say, oh, I have lots of data, or I want to work on multiple cores, what do I do? Well, the very first thing I will say to them is to use Dask.
That is, if you really do have multiple cores and you want to process your data simultaneously across multiple cores on your machine, or even across a cluster. The idea here is that Dask is built upon pandas and NumPy, and effectively it is a scheduler or coordinator of data frames that exist on each core. And so this is a very powerful way of thinking about what you're doing. It allows you to use the natural pandas syntax.
In fact, Dask mimics and mirrors a lot, not all, but a lot of the pandas syntax for this exact reason. So it's an API in itself now that allows you to perform operations across data. With a single data frame, of course, we have vectorization: we can do df.sum(). But if I have a Dask data frame, I can do the same thing, and now that operation runs across many cores. So this is a very powerful way of thinking about and working on my data. It gives a feel like Spark or Hadoop, where I can actually control the operations on my data frame, and the data doesn't move around; that's the whole idea here. So this is a very powerful abstraction, and
[00:37:21] Unknown:
this is the way, I'll tell you, this is the single way that people can extend the processing power of pandas. Yes, Python has threading and multiprocessing, but this is a very powerful and native way, all pure Python code, to work on distributed data frames. And I think this is a really powerful way to move forward. Yeah. And for people who are interested in learning more about that, I actually did a really interesting interview with Matthew Rocklin, the maintainer and one of the primary contributors to the project, on the Data Engineering Podcast, so I'll add a link to that in the show notes as well. It was definitely interesting hearing about how Dask does a good job of sitting in between projects such as Spark and Airflow in terms of its capabilities and use cases, being able to do directed task graphs and distributed in memory computations of data, and being able to do it purely in Python and tie into the broader Python ecosystem as well. As we've touched on a few times, pandas is a fairly integral part of the overall Python data ecosystem.
And as such, it's very important to a number of people for being able to do their day to day work. I know that NumFOCUS is the fiscal sponsor for PyData, so I'm wondering if you can talk a bit about the long term prospects of pandas as a component of the day to day work of that large number of people, and how they can be sure that it will continue to be supported, maintained, and developed so they can keep using it? Yes. So pandas became NumFOCUS fiscally sponsored
[00:38:55] Unknown:
about a year and a half ago, and this allowed us to accept donations, and we've had a number of donations to date. In fact, we're actually embarking on a round of fundraising now to fund some efforts to support pandas 2.0 and, of course, maintenance costs. This is the big thing about open source. Generally, a lot of the big packages have had a single developer, or a couple of developers, driving the initial creation and expansion of the package, and it's not any different with pandas. And that's great until they change focus and move on, or something else happens. So the longevity and long term health of a project is actually hard, and there are different stages to these projects. Pandas has been in the grassroots massive expansion phase for quite some time. I think we're getting to a phase where there are not going to be major new features added, simply because we already have a lot of the features that are required. It's now a time to fix bugs, improve the already excellent documentation, and improve testing. These are almost boring things, but they're very necessary for a widely used project. So it's actually very important that not only pandas, but things like NumPy and a lot of other PyData projects, are part of NumFOCUS, so that we can receive funding for things like maintenance, or even earmarked toward new features. Well, it's definitely a very useful package, one that I have used a number of times, even for just small ad hoc queries, because it's easier than trying to dump the data into even a SQLite database. So I'm definitely glad that I can expect it to be here along into the future. Pandas is my first tool for everything. It is just really convenient to be able to work with data in a very natural way.
I mean, it's just built for slicing and dicing your data. One thing I compare it to: I've seen lots of people say, I have my data sitting in some SQL tables, and it's in a nice form, so let me just do a giant query there. You send the query off, and at some point later you get some results back, and it's just not interactive, necessarily. So I like to use it in concert. Generally, I like to simply pull the data into memory. Now, generally we're not talking about ginormous amounts of data here, so it's pretty reasonable to do this, and then I have the data at my fingertips.
I'm sitting in my notebook, and I can slice and dice it how I want, play with it, and then if I really want to do some sort of aggregation on the SQL server, I can do that. But doing anything more than very simple reductions is just much easier to do in memory, and so I find that it just flows. It makes it really easy, the ability to chain expressions. In other words, I can take my data, slice it, do a group by, then maybe a couple of aggregations, and, oh, by the way, then I want to plot the data, or maybe push it out to another package. It's all very easy, and it's nice to read. That really is one of the key things. So I sort of see two uses of pandas. One is actually taking quote unquote clean data and then performing various transformations and aggregations on it. But a very important use of pandas is cleaning up your data in the first place. What ends up happening is you have data source XYZ, and oh, it's a CSV file, but I have to do some transformations on the headers, and then my dates are all screwed up; okay, let's go fix that. But that's the point. I can write a little wrapper function, read in this god awful messy CSV, clean it up, drop the bad rows, and get it into some nice format, and then, poof, out of that function I have a nice clean data frame that I can then do my aggregations and transformations on. It provides a nice barrier between the various operations here. So I think it really behooves people to understand the different parts of data manipulation. One of pandas' hallmarks is the so called ETL, or extract, transform, load. This comes from not even the big data world.
I'll say it's the big iron world, where you have your data sitting in some big data lake and you need to extract it, then process it, then push it somewhere else. So pandas can serve you in various capacities as part of this. What I say to people is: your goal is to set up a pipeline. The pipeline takes your data in its raw form and transforms it to another form, whether it's on disk or in memory, or it's a graph, whatever it is, and you will have a pipeline of steps to do this. And I'm convinced, and I hope you are too, that pandas can help at virtually every one of these steps. It's just a way of taking data and transforming it in various ways without writing loops and without doing all kinds of crazy manipulations, and it allows you to think about your data; it allows you to box your problems. Say box one is taking this raw messy data, reading it, and transforming it. Box two is taking it and aggregating or reducing it, whatever. Box three is a graph. It allows you to attack these problems in various stages without having to worry about all the other problems at the same time. It allows you to really focus on what's important.
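As a hedged sketch of that boxed-pipeline idea (the messy CSV, column names, and cleaning rules are all invented for illustration; only `read_csv`, `to_datetime`, and `dropna` are standard pandas calls):

```python
import io
import pandas as pd

# stand-in for a messy CSV: padded headers, a bad date, a blank score
RAW = """\
 First Name ,Signup Date,score
alice,2017-01-05,3
bob,not-a-date,7
carol,2017-02-10,
"""

def load_clean(buf):
    """Box one: read the raw data and hand back a tidy DataFrame."""
    df = pd.read_csv(buf)
    # normalize ugly headers: strip whitespace, lowercase, underscores
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    # coerce dates; unparseable values become NaT instead of raising
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
    # drop rows missing a usable date or score
    return df.dropna(subset=["signup_date", "score"])

# box two: aggregate the clean frame with chained expressions
clean = load_clean(io.StringIO(RAW))
print(clean.groupby("first_name")["score"].sum())
```

Everything downstream of `load_clean` gets to assume tidy data, which is exactly the barrier between boxes that Jeff describes.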
[00:44:31] Unknown:
For anybody who wants to follow you and the pandas project and keep in touch with what's going on, I'll ask you to just send me the preferred methods of contact and information sources, and I'll put those in the show notes. And with that, I'll move us to the picks. My pick this week is a bit of documentation from a game studio called Mousepaw Games that I found to have a lot of good ideas in terms of code style. I think the most interesting and useful aspect of their standards is something they call Commenting Showing Intent, which they abbreviate as CSI. It's a way of making sure, as you're working on your projects, that the comments in the source code are informative enough that you could take all of the code out, keep just the comments, and then be able to reimplement the original program without a whole lot of difficulty. The purpose behind that is that it makes it easier for somebody new to the code base to come in and actually understand what's going on, without necessarily having to look through reams of documentation if they're only focused on one particular portion of the code. So I definitely recommend people take a look at that and see if there are any ideas they can pull out of it for their own day to day work. And with that, I'll pass it to you, Jeff. Do you have any picks for us? I'll just say this was something I was working on today. This is not even a package. So when you're developing your own packages, it's of course very important to test things. And,
[00:45:52] Unknown:
again, this comes down to almost boring things, but boring things are really important when you're building an actual real project. So I will just quickly mention the testing services, actually more than three of them, that pandas uses. We use these so called continuous integration services, one of which is Travis CI. All of these work sort of similarly: for lack of a better way of describing it, they have a control file, and on every single commit all of your code gets pushed to these services, which build it with a certain configuration and then run all your tests. We use Travis CI for testing on Linux and Mac OS, and we use something called AppVeyor for testing on Windows.
And today, actually, I wrote an interface to another Linux testing service called CircleCI. In fact, the reason I was doing it is that we actually test so many different configurations; I think we have something like 15 different configurations that we test. In other words, we test on Python 2.7, 3.5, and 3.6, and we of course have various different combinations of libraries, or dependencies, that we use. We have so many different combinations, because we try to hit everything, that it's just taking so long to run these builds, something like four hours for every single combination. So we're trying to offload some of the capacity to other services so that we can then ensure testing compatibility. The one other thing I'll mention here, and this goes along with the library that you just pointed to: about a year ago, pandas implemented a linting service. One of Python's big selling points is the fact that you can look at code and instantly understand the line breaks, where functions start and end, and all those kinds of things, because it's a formatted language. But it's not exactly the same formatted language between two different people, even. So, about a year ago, we implemented a linter, and a linter is simply a style checker. It's something that makes sure your lines have 80 characters or fewer, and things like that. It's sort of a silly, boring thing, and people hate when they break it. But you know what? It really makes your project consistent. It makes it easily understandable and digestible from file to file, because the files are all the same exact format. In a little project, maybe this doesn't matter, but as your projects grow, this is extremely important, and I think that you should just do it. Yeah.
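As a toy illustration of what a line-length lint pass does (real projects use tools like flake8; this minimal checker is invented just to show the idea):

```python
def overlong_lines(source, limit=80):
    """Return (line_number, length) for every line longer than limit."""
    return [
        (lineno, len(line))
        for lineno, line in enumerate(source.splitlines(), start=1)
        if len(line) > limit
    ]

# line 1 is fine; line 2 is deliberately over the 80-character limit
code = "short = 1\n" + "x = " + "1 + " * 25 + "1\n"
print(overlong_lines(code))  # flags line 2
```

A real linter layers many such checks (naming, imports, whitespace) and runs in CI, so every commit is held to the same style, which is the consistency Jeff is advocating.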
I could definitely agree with that. Having consistent
[00:48:15] Unknown:
conventions within a project, and having them enforced, definitely makes it easier for people to come in and contribute. Because if you have five different styles within a single file, it can be really jarring and difficult to gain a cohesive picture of what's going on. And for people who are interested in learning more, I'll point them to the interview that I did with the folks behind the Pylint library, and I'll add a link to that in the show notes. Well, I really appreciate you taking the time to join me today and share your work on pandas and all the wonderful things it can be used for. It's definitely a useful package, and it was a very informative conversation. So thank you for taking the time. Thank you very much. I had a nice time doing this.
Introduction and Sponsors
Interview with Jeff Reback
Jeff Reback's Background
Transition from Perl to Python
First Contributions to Pandas
Origins and Evolution of Pandas
Core Abstractions of Pandas
Balancing API Power and Usability
Challenges in Maintaining Pandas
Pandas in Web Development
Future of Pandas: Pandas 2.0
Dask and Distributed Data Frames
Long-term Prospects and NumFOCUS
Pandas for Data Cleaning and ETL
Contact Information and Picks