Summary
Pandas has grown to be a ubiquitous tool for working with data at every stage. It has become so well known that many people learn Python solely for the purpose of using Pandas. With all of this activity and the long history of the project, it can be easy to find misleading or outdated information about how to use it. In this episode Matt Harrison shares his work on the book "Effective Pandas" and some of the best practices and potential pitfalls that you should know for applying Pandas in your own work.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Matt Harrison about best practices for using Pandas for data exploration, manipulation, and analysis
Interview
- Introductions
- How did you get introduced to Python?
- What motivated you to write a book about Pandas?
- There are a number of books available that cover some aspect of the Pandas framework or its application. What was missing from the available literature?
- Who is your target audience for this book?
- What are some of the most surprising things that you have learned about Pandas while working on this book?
- What are the sharp edges that you see newcomers to pandas run into most frequently?
- It is easy to use Pandas in a naive manner and get things done. What are some of the bad habits that you have seen people form in their work with Pandas?
- How and when do those habits become harmful?
- What are the most interesting, innovative, or unexpected ways that you have seen Pandas used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on this book?
- What are some of the projects that you are planning to work on in the near/medium term?
Keep In Touch
- Website
- @__mharrison__ on Twitter
- Blog
- mattharrison on GitHub
Picks
- Tobias
- Matt
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Effective Pandas Book (affiliate link with 20% discount code applied)
- Discount code INIT
- TCL
- Perl
- Pandas
- Pandas Extension Arrays
- Koalas
- Dask
- Modin
The intro and outro music is from "Requiem for a Fish" by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Matt Harrison about best practices for using pandas for data exploration, manipulation, and analysis. So, Matt, can you start by introducing yourself?
[00:01:08] Unknown:
Yeah. Sure. Thanks for having me on. I'm a corporate trainer these days, and I do some consulting. My background is that I have a computer science degree. Worked in the Bay Area for a while at various start ups, then relocated. I live in Utah right now and been using Python most of my career. But after, I would say, writing a book and doing a bunch of conferences, I got pulled into the training gig world and pretty happy here. And so I spend a lot of my time selling snake oil and telling people how to tell lies with data. I mean, I teach Python for a living and also teach data science machine learning. You mentioned that you've
[00:01:55] Unknown:
got your degree in computer science. And I'm wondering what was your kind of entry point into the data science side of things, and what about it keeps your interest and keeps you motivated to stay up to date with that space?
[00:02:08] Unknown:
Well, I'll definitely say that staying up to date is relatively difficult these days just with a number of, I guess, niches and papers and whatnot that are coming out. But when I graduated from school, I went to work for a search company. And so, actually, how I learned Python was working with a linguistics PhD writing what we called a term suggester, so pulling out the important terms. We did that with Python. He was a Tcl person. I was a Perl person at that time. When I was in school, I was told if I learned Perl, I could get a job, and it proved to be true.
However, at that point, neither 1 of us wanted to sort of cross the bridge: I wasn't particularly interested in Tcl, and he wasn't interested in Perl. And so Python was a compromise. And within 3 days, we had a working prototype, and I think neither of us looked back. And so that was my background into sort of the data world. In a later life, I did a lot of BI reporting. So I had a company that was manipulating data and creating reports using Python. And then later, I worked for a storage company, and was doing statistical models of failure rates and that sort of thing. So that's sort of my background into the data science side. So I come at it from a software engineering side, which might be a little different than a lot of, I guess, the data scientists you see these days. From my experience doing training, a lot of them are coming from PhD programs, physics, mechanical engineering, whatnot, to realize that there are a lot more job opportunities doing data science than there are in academia. And so they sort of cross over that way.
[00:04:00] Unknown:
And so you mentioned that your entry point into Python was as a compromise so that you didn't have to learn Tcl. And I'm wondering sort of as you said, you never really looked back. And I'm curious, what about the Python language and community has kind of held your interest? And what motivated you to turn your career into teaching other people to kind of make that same jump?
[00:04:22] Unknown:
That's a great question. So as I said, I was told to learn Perl just because it was like, hey, this is pragmatic, you can get a job. And that proved to be the case. However, I thought, you know, Perl was great coming from, like, school where we're writing a lot of C and Java. Perl was just a little less ceremony than that. However, as I started using Perl more, their mantra of there's more than 1 way to do it proved to be problematic in that I would look at different resources, and there would be 3 different ways to do something with different characters. And it got really confusing. Even so much that, like, I would come back to my own code, and it'd be hard to understand what's going on. So the thing that I appreciated about Python was the focus on readability and understanding. And even today, when I'm, like, teaching a course to students, I'll emphasize, like, for me, the most important thing is readability.
And do what you can to make it readable because you're gonna come back to your code, your colleagues are gonna come back to your code, and you wanna be able to understand it in a day, in a week, in a month. And I found that Perl really didn't fulfill that promise for me. It made it really easy to kick out stuff, but coming back and changing and updating proved to be a little bit more problematic.
[00:05:46] Unknown:
Yeah. 1 of the jokes that I've heard is that most code is write once, read many, and Perl code is write once, read never.
[00:05:54] Unknown:
Yeah. And I don't mean to, like, pick on Perl. I knew a lot of people who did Perl and did great stuff with that. And it, for me, it was my entry point into Python. So I'm grateful that I had the chance to sort of see those differences, and it really helped me appreciate Python more by learning Perl first.
[00:06:12] Unknown:
Now you've been working in Python for a long time. You teach people how to use it, and you've written a few books. And the most recent 1 is Effective Pandas, digging into the pandas library and framework and how to apply it to different data manipulation and analysis tasks. And it's definitely the Swiss army knife of data and has come to be 1 of the predominant tools where a number of people come into Python entirely because of pandas, where, you know, maybe 10 years ago, it would be people came into Python because of Django or something like that. I'm wondering, what was your motivation for deciding to write a book about the Pandas framework and sort of the main
[00:06:54] Unknown:
focus and goals of the book as you've written it? Sadly or interestingly, this is not my first Pandas book. So I did write a Pandas book 5 years ago as I was using pandas quite a bit. And then I also did the 2nd edition of the Pandas Cookbook. However, in the meantime, after doing both of those, I have taught a lot of pandas to thousands of students, big companies and small companies. And I do a course at Stanford a couple times a year, a data analysis course aimed at people who want to start doing data analysis without using things like Excel. And so I've seen a lot of people using Pandas. I read a lot of Pandas content.
Over that time, I've come to have some stronger opinions about, I guess, the proper care and maintaining of your Pandas code base. And I wasn't seeing a lot of material that aligned with my opinions, or maybe I should say like a lot of medium blog posts that frankly were pushing out bad advice. And so I took that as a chance to, okay, let's maybe revisit this. And my book, Effective Pandas, started out as sort of just updating my old Pandas book that was 5 years old, but it essentially turned into a rewrite. I guess another thing that I wanted to do that I didn't see was just the ability to use color in the book.
Not that you couldn't use color before, but I've written a couple of books before, and most every 1 of the physical versions has been in black and white. And so this 1, I was like, okay. Well, I'm gonna make it so the digital and the physical are gonna be in color because I think there are a lot of things that you can convey in pandas visually. My book has over a 100 images talking about some of these manipulation constructs, and I think that having compelling images that really explain what's going on can be useful for that. So like all things, I think, Tobias, you know, when you asked what was pushing me to do this, I think a lot of developers and software people, they like to bike shed or scratch their own itch. And so this was, I think, an itch that was very itchy for me after sort of seeing what was out there, and I was compelled to scratch it. And as you mentioned, there are definitely
[00:09:25] Unknown:
numerous blog posts out there that are talking about different applications of pandas or plugins for pandas or using pandas as 1 element of a broader workflow. And there are also a number of book length treatises on pandas in different forms, including 1 by Wes McKinney that is talking about sort of data analysis with Python, where Pandas plays a big part. Wondering what was missing from the kind of available literature or maybe just not properly up to date in the available literature and some of the gaps that you are aiming to fill in your work with this effective pandas book? I don't want this to be like a huge, I guess, rant on Pandas.
[00:10:07] Unknown:
I guess, full disclosure, like I said previously, I make my money from Pandas and Python. However, I don't think it does me any service as someone who's gonna help people learn it to be sort of like, in my class, oh, this is awesome, it's all peaches and cream. And then, you know, everything in the labs and the course work out great, and then they get out in the real world, and it's like, oh, they run into these issues, and they don't know what to do. And it's sort of like, oh, here's a wart, and why wasn't I exposed to this? So I sort of feel it's my duty to show, like, good and bad and talk about both of those things. I mean, if you want to, I have sort of a list of, like, rough things in pandas that pandas might bite you with.
I've read a lot of pandas books, and I think a lot of them are, like, great for, like, here's an analysis I did or wanted to do, and here's how I did it. But they're a little bit lighter on, like, here's the things to watch out for. And so I wanted to make sure that I sort of covered both of those, and that was 1 of the goals of the book. You want some examples or
[00:11:21] Unknown:
you wanna dive into some of that? I think we could dive into that in a minute. But before we go that direction, another thing that's worth exploring is that anytime you're writing a book, there's the challenge of figuring out, particularly a programming book, figuring out what is the entry point. Like, where are the people who are reading this coming from? Where a number of books that might be covering pandas are starting with, you have no idea what programming is. You just wanna be able to do something with data. So I'm going to give you the, you know, 30 second overview of Python, and then we're gonna dig into these calls with pandas. And we're just gonna cover this all at a surface level so that you can go from 0 to having something working in as short of time as possible.
Or you might be assuming, you know, this is somebody who's coming from a data science perspective, and so we're going to teach them some of the software engineering practices around how to use Pandas. And I'm just curious, what was the kind of target persona that you had in mind as you were writing this book and the main goals that you wanted to have them walk away with as far as, you know, I've read this book and so now I can feel confident doing x.
[00:12:25] Unknown:
The target persona is anyone who wants to improve their Pandas code. And I know that's a little bit vague, but I do believe that there are some gaps and some issues that aren't covered by a lot of the material out there. And so that's sort of the vague target persona. To be a little bit more specific, this is not like an intro to Python tome. So I'm assuming that there is some basic Python knowledge. I think we've all seen that someone who has a background in programming can go out and, in 90 minutes, sort of look at some Python code and sort of teach themselves Python or, like, go out to Stack Overflow and start copying and pasting and be relatively successful with it.
There are some, I would say, like, minimal Python skill sets that would be useful for someone who wants to dive into pandas. A few of those that are not, I guess, maybe intuitive to the 90 minute Python learner might be lambda functions, list comprehensions, and slicing. So I do sort of teach those in the book, but they are, I would say, used all over the place in pandas and might be something that someone who's fresh to Python might want to review. And not just for my book, but in general, if they're thinking about learning pandas and using it, they might wanna make sure that they're pretty caught up on slicing, lambdas, and comprehension constructs.
[00:13:58] Unknown:
So now digging into some of the kind of Pandas specific aspects of it, there are definitely a number of sharp edges in the design of the framework because of just the early days of getting something working, and they are kind of core to how it's all pieced together. And so there have been some, you know, improvements in the API, but the old original implementations are still there. And it's like, oh, you know, there's loc and there's iloc. Which 1 do I use? And I'm wondering as far as some of those sharp edges or some of the potential bad habits that people can develop when they are just trying to hack something together quickly. What are some of the main things that you see newcomers run into when they're first getting started with pandas and some of the ways that you're trying to correct those habits in the book?
[00:14:49] Unknown:
Yeah. Again, I don't want this to be like, this is all bad about pandas. There's actually a lot of great things about pandas. I think 1 of the things that a lot of people don't realize is at its core, pandas is a wrapper around NumPy. And basically, NumPy is a wrapper giving you the ability to do vectorization on numbers as, what I like to say, C numbers, not Python objects. And so by using NumPy, and in essence pandas because it's on top of that, you basically get very close to C-level speeds for many operations. And if you understand that, that can sort of push you into not using certain operations, or trying to do things in, what I would say, a vectorized way.
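A rough sketch of that loop-versus-vectorized contrast; the Series here is made up for illustration:

```python
import pandas as pd

# A made-up numeric column; any Series of numbers behaves the same way.
prices = pd.Series([10.0, 12.5, 9.75, 14.0])

# Loop style: pulls each value back up into Python, one object at a time.
doubled_loop = pd.Series([val * 2 for val in prices])

# Vectorized style: one multiplication that runs down at the NumPy/C level.
doubled_vec = prices * 2

# Both give the same answer; the vectorized form is the idiomatic, fast one.
assert doubled_loop.equals(doubled_vec)
```

On a Series this small the difference is invisible, but on millions of rows the vectorized version avoids creating millions of intermediate Python objects.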
That's, I think, a basic sort of principle that, especially for people coming from a Python background but without a NumPy background, can be a little bit different. And if you look at it just from, like, a pure Python side, that can be confusing. And you might wonder, like, what's the benefit here if I'm just writing for loops with pandas? So, generally, if you're writing a for loop with pandas, it's sort of a code smell or indication that you could be doing something a better way. There are other things that can be overwhelming for people coming to Pandas. And 1 of those is if you, like, inspect the number of attributes on the data frame or series, which are the core data types in Pandas, there's over 400 different attributes on those.
Contrast that with, like, a list in Python. I think a list has, like, 40 attributes or something like that. So there's a large amount of API, which I think when you present someone with that, they're like, you just have to learn 2 things, data frame and series. But both of those have 400 different attributes. That can be a little bit overwhelming, and people put up their guards, like, I don't think that I can do that. So my take on that is, I guess, going with the Pareto principle. Yeah. There are a lot of things. It's a rich API. However, you don't need to learn and memorize everything. In fact, I think it's probably humanly impossible to memorize the whole pandas API just because it is so large. I mean, you also have things like the read_csv function, which is a function in Pandas for ingesting CSV (comma separated value) files. And you might think, oh, that's a simple thing. Right? You pass in a file. Yeah. Yeah. Yeah. You can pass in a file, but this function has, like, 50 different parameters in it.
And, you know, from a software engineering point of view, if you look at, like, API design, API design is like, hey, people have limited mental capacity, and most people's brains can only hold 7 plus or minus 2 things. And so when you overload them with 50 things, that can be overwhelming, let alone if you look at a lot of the options for those 50 different parameters in there, like parse_dates. parse_dates is a parameter for telling Pandas which columns you want to treat as dates. There's, like, 5 different ways of specifying dates in there as well. So if you were to look at, like, someone applying modern Python, like type signatures, on top of pandas, the signature for that function would be multiple pages long just to get that right, if it is even possible to get that right. So that can be a little bit overwhelming for people who are, you know, coming from Python.
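A sketch of that parse_dates usage; the CSV content and column names are invented, and an in-memory buffer stands in for a real file:

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the column names are made up.
csv_text = io.StringIO(
    "order_id,ordered_at,amount\n"
    "1,2021-03-01,9.99\n"
    "2,2021-03-02,24.50\n"
)

# Two of the handful of read_csv parameters that cover most everyday use:
# parse_dates converts the named columns to datetimes at load time,
# and index_col chooses which column becomes the index.
df = pd.read_csv(csv_text, parse_dates=["ordered_at"], index_col="order_id")
```

After this, `df["ordered_at"]` has a datetime64 dtype, so the date manipulation discussed later works on it directly.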
However, I do think, again, you know, if you're of the mentality that I need to, like, read through all the documentation and understand everything, that's probably gonna overwhelm you. But if you sort of take it step by step, you know, there are 50 parameters to the read_csv function. However, like, in practice over the years that I've used Pandas, I've only used, like, 5 or 6 of those. So a lot of these things, for better or for worse, you can sort of ignore and not worry about them unless you really need to. Some other things that are a little bit tricky or maybe confusing is if you have a data frame, which is a 2 dimensional structure similar to, like, a database table or a spreadsheet, and you want to access a column in that, there are kind of 2 main ways to do that, but there are other ways as well. So there are multiple ways to do things, which a lot of people think, Python should only have 1 way to do things, which in practice really isn't true, but it's a nice goal, I think, to strive for. But the 2 main ways to do things are attribute access, which is with the period, and index access, which is with the square brackets.
And there are sort of pros and cons to both of those ways. And so you get people professing to do things 1 way or people professing to do things another way. And for someone who's new, it can be confusing. In fact, my sort of take on this after teaching things to a lot of people is that, you know, I would probably prefer not even to have either attribute access or index access. If I were designing it, I would probably just make it a method to pull off columns and rows just because, like you said, Tobias, a lot of people who are coming to use pandas are using it as a tool, and they're not necessarily Python programmers. So having to understand those differences can be a little bit confusing or overwhelming. Why is there different syntax to do the same thing?
That sort of thing?
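The two access styles can be put side by side; the column names here are made up:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Lima"], "temp": [4, 19]})

# Index access always works, even for column names that contain spaces
# or that shadow DataFrame methods.
by_index = df["city"]

# Attribute access is terser, but breaks for names like "count" or "class",
# and it cannot be used to create a new column.
by_attr = df.city

# Both return the same Series.
assert by_index.equals(by_attr)
```

This is the tradeoff described above: the bracket form is the safer default, while the dot form is a convenience that fails in edge cases.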
[00:20:16] Unknown:
As to the Pandas API, as you mentioned, it's vast, and, you know, just the core data types have a huge number of operations that you can do on them. And I'm curious, as you have been working through this book and trying to enumerate your own knowledge and learn more about the specifics to make sure that you're getting things right, what are some of the most surprising things that you've learned about the Pandas API or its capabilities in the process of actually trying to cement these best practices and understand, you know, these are the 5 different ways I can do this thing, and this is the way that I want to recommend.
[00:20:55] Unknown:
A lot of this I was, you know, pretty familiar with, having taught this and having written books previously. I guess 2 of the things that caused me the most delay in this was diving into time zones a little bit more. So I wanted to have pretty good coverage of date manipulation and time zones. And I looked back at my analyses that I've done over the years, and most of them really didn't use time zones. So I started to mess around with time zones a little bit more in preparation for the book. And something that surprised me that, again, I hadn't really come across was in Pandas, if you convert a column to a date column, you get this nice little accessor, the dt accessor, which allows you to easily pull off, like, the year or the month or the day of the week. And then it has some nice things where you can, like, convert the month name or the day of the week name to a different locale or language if you want to. So that's really handy. And, certainly, you could imagine doing these sorts of things, you know, just with a string and doing, like, regex or something and pulling off parts and converting them. Once you convert it to a date, you get the ability to manipulate that sort of for free. And so I was playing with a dataset that I had and trying to convert these times that I had in here to a time zone, and it was just really slow, taking, like, 15 minutes to convert this dataset to a time zone.
And I was trying to, like, look on Stack Overflow and all these places and not really getting too far, and coverage of that was a little bit hard to find. And then once I finally converted this column to a date that had multiple time zones, I tried to do, like, date manipulation as I'd been doing it with the dt accessor, and it didn't work. And that's because in pandas, if you have a column that has 2 different time zones, it doesn't treat it as a datetime type. It converts it to this timestamp type, which doesn't have the dt accessor. That was something that I just wasn't aware of before. Like I said, I really hadn't done a lot with converting stuff to time zones. I just sort of had wall time and said, okay, it's wall time, it's good. That was something that was surprising to me.
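A small sketch of the dt accessor and the timezone behavior described above; the dates are arbitrary, and passing utc=True is one common way to keep a proper datetime dtype when the input has mixed offsets:

```python
import pandas as pd

# Plain date strings become a datetime64 column.
s = pd.to_datetime(pd.Series(["2021-01-15", "2021-06-30"]))

# Once it's a datetime column, the dt accessor comes along for free.
years = s.dt.year       # 2021, 2021
days = s.dt.day_name()  # "Friday", "Wednesday"

# Single-zone data: localize, then convert.
aware = s.dt.tz_localize("UTC").dt.tz_convert("America/Denver")

# Mixed offsets are where things get awkward; utc=True normalizes everything
# to UTC so you keep a datetime dtype (and the dt accessor) instead of an
# object column of individual Timestamps.
mixed = pd.to_datetime(
    pd.Series(["2021-01-15 00:00:00+00:00", "2021-01-15 00:00:00+05:00"]),
    utc=True,
)
```

Whether mixed offsets come back as an object column or raise an error varies by pandas version, which is part of why this corner is so easy to trip over.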
Another thing that was surprising, and this might be a little bit esoteric as well, is I sort of dove into grouping and all the grouping operations you can do. And so it might seem like, oh, those are just group by, but there's not just group by. There's group by, and then there's pivot table, and then there's crosstab. And those are all sort of built on top of each other. These are ways of pivoting your data, in Excel terms, if you're familiar with those. In addition to that, there's a really cool method in pandas, which is resample. And so if you have a date and you stick the date in the index, you can do resampling. You say, I want to aggregate these by month or by day. Or you can say, I wanna aggregate them by every 2 months, 3 days, and 2 hours. Cool stuff like that. So I was sort of deep diving into that, and I actually didn't include this in the book, but I was like, okay, I'm gonna look at, like, every combination of doing these group bys with applies and transforms and filters, and then also doing those with resamples and whatnot.
It turns out that there are some sort of corner cases where doing a group by is different than doing a resample, which I wouldn't have expected, because I always thought they were somewhat equivalent. But there are some corner cases where that behavior is different. So those are some examples, I think, Tobias, of, you know, if you really want to learn something well, you should teach it or, alternatively, write a book about it. And, you know, even after having used Pandas for many years, yeah, there are insights and stuff to learn by digging into them.
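In the straightforward case, a resample and a date-based group by do line up; the frequency and data below are invented for illustration:

```python
import pandas as pd

# Made-up daily values with the dates in the index.
idx = pd.date_range("2021-01-01", periods=6, freq="D")
sales = pd.DataFrame({"amount": [1, 2, 3, 4, 5, 6]}, index=idx)

# Aggregate by month (the "MS" alias labels bins by month start) with resample...
by_resample = sales.resample("MS")["amount"].sum()

# ...or with a groupby over a Grouper; in this simple case they agree.
by_groupby = sales.groupby(pd.Grouper(freq="MS"))["amount"].sum()

assert by_resample.equals(by_groupby)
```

The corner cases Matt mentions, where the two diverge, tend to show up with empty bins, offsets, and the apply/transform/filter variants rather than a plain sum like this.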
[00:24:50] Unknown:
As far as the work of just using Pandas to complete a task, you know, it's definitely fairly straightforward to take it off the shelf and say, okay, these are the 5 methods I need. I'm going to read CSV. I'm going to drop NA. I'm going to aggregate across this column, and then I'm going to write that out to a table in my database. And so it's, you know, pretty straightforward to say, okay, these are the things I need to do. And as you get confident using some of these high level aspects of the framework, you might get comfortable with certain practices that are useful to get up and running, but might be considered bad practice if you're using it in a professional setting or trying to use it in a team environment. And I'm curious what you've seen as far as some of the practices that are useful at the start but end up becoming either limiting, because you don't realize the actual full power of what you can be doing with it, or are potentially antipatterns as you start to scale up your usage of pandas, and just some of the ways that you've been able to work with some of your clients to help to kind of correct those crutches in terms of how they're applying Pandas in their work?
[00:26:06] Unknown:
1 of those would be people who are coming to Pandas from a Python background, and they're thinking about pandas as a sequence of data. So they wanna use for loops everywhere. Once you sort of drop into a for loop, you're losing 1 of the main benefits of Pandas, which again is that notion that it's storing data in blocks as C would, and not really storing Python objects under the covers. Once you use a for loop, you're sort of telling Pandas that you want to pull the data from C back into the Python level and manipulate it at the Python level, which can work. And, obviously, a lot of people do that. And if you look at a lot of code, you'll see a lot of code that has for loops. But generally, a for loop is what I would call a code smell with pandas, indicating that there's probably a faster way to do things. A related version of that would be using the apply method.
A lot of people use the apply method, which is basically doing a for loop, but wrapping it in a method that's called apply. So either of those is maybe naive and works, but maybe not the most effective. Another thing that I see people doing all over the place is using the inplace argument. So there is an inplace argument on a lot of operations in Pandas, and the compelling reason supposedly for using inplace is that if you do an operation in place, it doesn't copy the data, but rather it just updates the data in place, so you have this memory savings. Well, it turns out that that's sort of a lie, and there's even an open issue in Pandas to deprecate and remove inplace completely, because most of the calls that use inplace actually make a copy under the covers. So a lot of people are like, use inplace, you're gonna save all this time and memory, and it's actually not the case. You don't save time and memory. In fact, from my point of view, using inplace is an anti pattern, because if you use inplace, you're not gonna be able to chain. And I think chaining those operations is going to lead you to write better code. And so that would probably be my next thing that people do: they don't chain.
And what is chaining? Well, most operations in Pandas will return a new data frame or series. And so what I tend to tell people is, put parentheses around your operations, and then take your first operation, and it's going to return something. Instead of storing that as a variable, just do your next operation on the next line right after that. And if you follow that pattern of what I would call chaining, sometimes people call this flow oriented programming, it's gonna force you to sort of think about the logical steps in your process, but it's also gonna make your code look like a recipe, and you can read through it line by line. This line is going to filter out the columns. The next line is gonna filter out the rows. This line is going to add some new columns. This line is gonna do an aggregation.
And you can read that, and it's very clear. So a lot of people don't know about chaining or don't use that. Another issue that is a common wart or gotcha is when people treat pandas as a dictionary, or a data frame or a series as a dictionary, and they start doing operations on it, doing, like, index assignment. And so, invariably, if you do that, you're going to get this SettingWithCopyWarning that most people who use Pandas will see. And you'll start searching all over the place to find a good explanation of it. It's a little bit hard, and I don't know that I've seen a really good explanation of it, like on Stack Overflow or other places. But invariably, what most people do, just to get around that warning, after they do some operation they actually copy their data frame, which is kind of funny, because these are the same people who say, like, use inplace, don't copy. And then they have these warnings, and they put copy to get around the warnings. Well, I found that, like, my whole book, all of the code in my book and in my previous book, the Pandas Cookbook, doesn't have any of those errors at all, because if you learn to use the assign method, which very few people seem to know about, you won't have those issues. So the assign method is a method in Pandas that, rather than treating Pandas as a dictionary and saying for this column, overwrite this column with this, you say, I want to call assign, and you use a keyword argument, with the name of the parameter being the column and the value being the value of that new column, and that will return a new data frame with that updated column in there. And if you use that, you sort of sidestep that issue completely.
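A minimal sketch of assign combined with chaining; the column names and the filter threshold are made up:

```python
import pandas as pd

df = pd.DataFrame({"price": [100, 250], "qty": [3, 1]})

# Dictionary-style assignment on a slice is what tends to trigger
# SettingWithCopyWarning:
#   subset = df[df.qty > 0]
#   subset["total"] = subset.price * subset.qty   # warning
#
# assign returns a new frame instead, and it chains cleanly:
result = (
    df
    .assign(total=lambda d: d.price * d.qty)  # add a derived column
    .query("total > 250")                     # then filter the rows
)

# The original frame is untouched; the chain produced a new one.
assert list(df.columns) == ["price", "qty"]
```

Each line of the chain does one step, so the whole expression reads like the recipe described above.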
And the nice thing about assign, in contrast with treating pandas objects as a dictionary, is that assign works very well with this chaining that I just talked about. So you can write code that, again, when you have an assign method in there, it's just gonna be, here are the methods and here are the columns I'm gonna add to my data frame. And you can read through that and you can see line by line what are the columns you're going to add. I think another maybe anti pattern, or thing that people do that they shouldn't, is they load up their data, and then they look at it, and they see that the combination of Jupyter and Pandas, which is generally what they're using, limits the amount of data they can see. So I think currently, on most modern versions of Jupyter and Pandas, it will show you 20 columns and 10 rows of data.
You know, a lot of people have more data than that. Their immediate thought as hackers is, how do I break the system, or how do I view more data than that? And I like to tell students: if you have that urge or thought, maybe reconsider it, and maybe it should be your spidey sense telling you that there's a better way to do things. Humans aren't particularly adept at looking at large tables of data. I sort of realized this when I was, again, in my prior life doing business intelligence reporting. You know, we didn't evolve over the years to take this huge table with a thousand rows, or a million rows, and a thousand columns in it, scroll down through that, and find the data that looks interesting. Right? So that's a really ineffective way of dealing with your data. However, computers are really good at filtering your data and finding things that might be important. So if you feel that urge to view more of your data, it might be a hint: hey, use the computer to do this. Alternatively, humans are also good at visualizing things.
And so that might also be a hint to you to visualize your data rather than to start scrolling through large amounts of data. Those are a few examples of places where people maybe get confused or could do things a little bit better, Tobias.
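The "let the computer do the filtering" advice might look like this in practice; the sensor data here is randomly generated purely for illustration:

```python
import numpy as np
import pandas as pd

# 1,000 fake sensor readings; the data is randomly generated.
rng = np.random.default_rng(42)
df = pd.DataFrame({"sensor": rng.normal(loc=50, scale=5, size=1_000)})

# Instead of raising display.max_rows and scrolling through a
# thousand rows, let the computer surface the rows that might
# actually matter (here: readings more than ~2 sigma from the mean):
outliers = df.query("sensor > 60 or sensor < 40")
print(f"{len(outliers)} of {len(df)} rows look interesting")
```

A quick `df.sensor.plot.hist()` would be the visualization-flavored alternative to the same urge.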
[00:32:37] Unknown:
1 of the interesting things about pandas is that it has grown to be popular in a number of different kind of styles of application where many people will use it for data engineering workflows, where they're just doing some manipulations to clean up the data, like drop nulls or, you know, impute values that are missing or maybe, you know, adding derived columns or removing certain columns and then writing that back out to some other source. Some people are using it just for data exploration purposes to see here's a big CSV file or here's a table from the database. What do I even have in here? Maybe do some sort of, like, regression analysis on it. Other people are using it as a core piece of their machine learning workflows. And I'm wondering if there have been any different kind of stylistic aspects of how you apply pandas in those different scenarios and any recommendations that you have for folks who might be using it for 1 versus the other as to lessons that might be learned from different use cases for the Pandas framework?
[00:33:42] Unknown:
I think you sort of nailed it on the head. Pandas is so popular these days because it can be used all over the place. Right? So if you're doing, like, structured machine learning on tabular data, Pandas is sort of the de facto go-to for tweaking your data before that. And then you have a lot of analysts who want to come over to the Python world. They're used to maybe SQL or Excel and want to start doing things with a programming language. Sometimes the goal is just to, like, learn a programming language, but sometimes the goal is to start automating these things. Right? I mean, you certainly can automate Excel, but that's somewhere you probably don't want to go.
I think 1 of the things that might be problematic, it's sort of a pro and a con, is Jupyter. Jupyter is a great environment for exploratory type stuff and trying things out very quickly. However, Jupyter doesn't really lend itself well to writing production code. And so you can get quick feedback, but you're probably violating a lot of software engineering best practices, like making global variables and not using encapsulation, like functions or methods, that sort of thing. I guess I'm pretty aware of this because I teach programming, but also because I have a software engineering background. So I worked in software for a long time, and I've been through code reviews and cleaning things up. You know, in all of what you just talked about, Tobias, like machine learning or analysis or just pulling something out, a lot of the code that you see in Jupyter is really not good production quality code.
And so, again, my advice, what I tell my students is to sort of embrace that chaining philosophy of writing your steps 1 by 1. Because once you do that and you have sort of like, here's my raw data, and I also tell people work with your raw data because invariably your boss is gonna come back to you and say, I want you to explain this prediction or I want you to explain this number. If you're not working with the raw data, that makes it a lot more difficult. Also, you know, if you are working with the raw data, you have all of the transforms there. If you're working with someone's Excel file or, you know, data that they've hand tweaked, you don't know what they've changed potentially. So I tell people, work with the raw data and then make this chain of operations that you're gonna do to sort of clean up the data. You know, doing your data janitorial work or getting the data ready for exploratory data analysis or machine learning. And then once you have this chain, it's pretty trivial to just take that chain and indent it and make it into a function.
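Turning a chain of cleanup steps into a function, as described here, might look like this sketch; the `tweak_sales` name and the data are invented for illustration:

```python
import pandas as pd

def tweak_sales(raw: pd.DataFrame) -> pd.DataFrame:
    # The whole data-janitorial chain, indented into one function
    # that always starts from the raw data.
    return (
        raw
        .rename(columns=str.lower)                  # normalize column names
        .assign(total=lambda d: d.units * d.price)  # derived column
        .query("total > 0")                         # drop empty orders
    )

# Hypothetical raw export, as it might arrive from upstream.
raw = pd.DataFrame({"Units": [2, 0, 5], "Price": [3.0, 9.0, 1.5]})
clean = tweak_sales(raw)
print(clean)
```

Because every transform lives inside the function and the raw frame is never mutated, explaining any downstream number means re-reading one chain, and moving this into a library or a pipeline is a copy-paste away.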
And so if you follow that style of programming, going from sort of naive bad practices to making something that's ready to be put in a library or put into production is really trivial and easy. So that would be something that I would suggest. Just start learning how to do chaining, and I think that will make a huge change for a lot of people's code and make it a lot more readable and easy to understand.
[00:36:43] Unknown:
And another interesting development in the core of pandas that's happened in the past, at this point it's maybe a couple of years old, is the extension arrays capability, where you can actually create additional data types that can be understood by the pandas data frame. So I think the initial work was done by the folks at Anaconda to add support for geographical objects, so being able to do geo representations of objects in pandas, or being able to treat IP addresses as a native object without it just being of type object and having to do generic operations and then casting it back and forth. And I'm wondering if you factored in any of those capabilities, or if there are any new evolutions in the Pandas ecosystem that you brought into the book, or if you decided to just keep it core to what's out of the box when you just pip install Pandas and nothing else.
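As a small taste of the extension-array machinery mentioned here, the nullable integer dtype that ships with pandas itself is built on it; a minimal sketch:

```python
import pandas as pd

# The nullable "Int64" dtype is implemented as an extension array:
# missing values no longer force an upcast to float64, the way the
# plain NumPy-backed int64 dtype does.
s = pd.Series([1, 2, None], dtype="Int64")
print(s.dtype)   # Int64, not float64
print(s.isna())  # the None is stored as pd.NA, and the ints stay ints
```

Third-party extension dtypes (geometries, IP addresses, and so on) plug into the same interface, which is why they can live inside an ordinary data frame column.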
[00:37:36] Unknown:
To be honest, Tobias, I have never used those extension capabilities, nor have my clients asked for them. That's not to say that they can't be useful, but I didn't put that in my book. I will say this as well. I mean, I'm actually teaching a pandas course this week. And in general, Pandas, you know, for a while there it was, like, 0.18, and then it was, like, 0.24, and it sort of sat there at this weird version level. And a lot of people are like, well, we can't use that because it's not 1.0 or whatnot. And then it came to 1.0, and I think we're now at, like, 1.3.4.
You know, things have changed in pandas, but also a lot of things haven't changed. And a lot of what the book is, again, comes back to this Pareto principle. There are these core fundamentals, and if you understand these core fundamentals, you can do sort of 80, 90, 95% of what you need to do using very limited pandas. So extension arrays might be interesting, but I haven't done it. And I would imagine that, like, 95 plus percent of my book would probably run with, like, Pandas 0.24. Right? It is, like you said, focused on the core, and on what I've seen most of my clients and students need, the functionality to get their job done.
So, yeah, will there be, like, a second edition, and maybe will I put, like, extension types in there? I mean, yeah, hopefully there will be a second edition. Right? And with feedback from users and, you know, as I continue working with clients and find out more best practices or features or things that are super compelling, I would certainly like to include those in a future edition.
[00:39:16] Unknown:
In terms of your work with Pandas and working with your clients, what are some of the most interesting or innovative ways that you've seen it applied?
[00:39:25] Unknown:
I think as someone who spends a lot of time with their clients, the most interesting is probably when a client has, like, a major insight. And I can't really speak to the specifics of a lot of clients, but I do remember 1 who I was teaching pandas and visualization as well. I often teach those sort of hand in hand. And instead of using sort of canned data, in the class we actually used their data to teach pandas. So as we're going through the class, we're just like, okay, how would you do this operation? And you're doing it on your data. Right? So it's, like, super relevant to them. And in the middle of the class, 1 of the attendees was like, oh, look at this visualization I did. And it was a super cool visualization. It basically was a little chart that showed CPU usage through the week. It was a sort of circular path where, like, you know, during the weekend, for this particular client, CPU usage went up for certain reasons, and then it came back down and then sort of built back up. And after, you know, learning some Pandas and learning some visualization tools, they were able to make super compelling and relevant insights with very little code, just by understanding these basics of data manipulation.
I mean, another case that was just rewarding, I guess, as an instructor, but maybe sad for the client, was when I was teaching Pandas and I had a client who, like, in the middle of class, hit themselves on the head. I'm like, what's wrong? Did I, like, say something wrong? And they're like, no. This code that you just showed us, which was basically 1 line of code, we just spent, like, the last 3 weeks implementing for our business intelligence reporting insight. So there is a lot of power in Pandas. And, you know, if you can master that, it can open doors to insight into your data. So those would be a couple of examples.
[00:41:17] Unknown:
In the work of writing this book and exploring the kind of detailed elements of pandas that you might not otherwise pay much attention to in your day to day, what are some of the most interesting or unexpected or challenging lessons that you learned in the process?
[00:41:32] Unknown:
I mean, it goes back to your classic, you know, to learn something, you need to teach it, or alternatively write a book. And I think that's kind of true, especially in the days of Stack Overflow. You can just sort of go out to Stack Overflow and start copying and pasting things and jamming them in where you want to, and a lot of it will just work without you even understanding what's going on. I would also say that, you know, if someone's considering writing a book, books are probably gonna take longer than you expect, no matter how good you think your estimating skills are.
At this point, I've written enough books that, like, book writing and editing and those sorts of things aren't particularly difficult. It's just a matter of putting in the time and getting that work done. For this book, probably what took the most time for me was focusing on examples and imagery that would convey the message and the ideas of what I'm trying to get at. So it can be a little bit of a challenge. I would say, and maybe this is sort of a meta commentary on a lot of material out there, but, you know, when you're doing an example, you want your example to show how to do something. Right? And so there's a fine line between, like, here are some of the warts that you might come across, and, like, here's just the example as is. So navigating that path can be a little bit of a challenge.
I'm happy with how it turned out. Hopefully, it's useful for others, not just for telling them how to do things, but also giving them some warnings and some ideas of, when things go south, how to recover from that.
[00:43:10] Unknown:
As you continue to work with clients and as you are working on releasing the book, now that you've got that out the door, what are some of the projects that you're planning to work on in the near to medium term or any kind of new areas of Python or Pandas or data science that you're looking to explore?
[00:43:29] Unknown:
I guess I'm in a weird situation where my focus is, like, on providing value to my clients. And so in the last year, some of the places where they've asked for additional training are testing their code, and some of them have asked for more modern Python. Modern Python is starting to use things like type annotations and some of the Python 3 specific features. And so I did do a lot of training on testing, basic testing, but also more advanced testing, leveraging advanced features of pytest or using tools like Hypothesis. My experience is that a lot of people who are using Python, as we've mentioned, you know, are coming to it not necessarily with a programming background. So being able to do testing is important, and they just don't have that background.
I'm also working on a new initiative to sort of, I guess, help spread Python knowledge, and I don't really have more details about that right now. But I would say, like, follow me on Twitter, and hopefully in the coming weeks, I should be able to shed a little bit more light on that.
[00:44:34] Unknown:
Are there any other aspects of the work that you did on this book or your experience with pandas or lessons that people might take away from reading the book that we didn't discuss yet that you'd like to cover before we close out the show? I just wanna sort of make it clear that, hopefully, this wasn't super negative, and my goal here is not to rail on pandas. Like I said, I do make a good deal of my income from it.
[00:45:00] Unknown:
I think pandas is a super powerful tool, you know, but you've got to take it in with warts and all. I'd say 1 more thing, Tobias, that might be sort of a meta commentary that's important for people to understand, or they might want to consider if they're thinking about learning pandas. What we're seeing now, in addition to, like, pandas, is a proliferation of libraries that are basically using pandas not as an engine, but more as an API. And so you'll see things like Spark has Koalas. There's Dask. There's a lot of other ones that are maybe fringe right now, but I could see a world where, a couple years down the road, my Pandas code really doesn't change other than an import. But now instead of running on the CPU, it's running on the GPU or maybe some tensor processing unit or something.
But it's basically using an API very similar to what Pandas provides us, but the underlying implementation is completely different. Yeah. It's definitely an
[00:46:06] Unknown:
interesting aspect of how some of these early tools have become so ingrained in terms of how people are working with them, that people are just adopting the API for these different applications. And I know that there's also some work being done to build a set of API standards for working with data in Python so that you can abstract over the specific library that's being used, whether it's NumPy or Pandas or if you're working with Dask or Modin or what have you, so that you can just write your code and then, as you said, just change the import at the top. Yeah. Yeah. I think it's probably gonna be interesting because
[00:46:42] Unknown:
up till now, I think we're sort of at the 1.0 days for data analysis here. But if you look at, like, the progression of databases, where databases start out simple, but now they have these really complicated query planners and those sorts of things, I think we'll probably see, you know, things go down that route. You know? And especially if you leverage, like, a chain style like I'm talking about, that would be something that would enable
[00:47:05] Unknown:
a query planner on an analysis engine to do things like predicate push-down and some of these more advanced operations that a database can do, because it has the whole query there and it knows what to do, rather than just a single step of it. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing or pick up a copy of the book, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose snowshoes as a way to get out and be active in the winter and enjoy the scenery. I recently picked up a pair of snowshoes from MSR. I'll have a link to them in the show notes. They're a good manufacturer. They've got a decent lineup of snowshoes,
[00:47:47] Unknown:
across a range of budgets, so definitely worth checking out. And so with that, I'll pass it to you, Matt. What do you have for picks this week? Awesome. Yeah. My kids have some MSR snowshoes, and they're definitely cool. So I guess on the snow theme, I live in Salt Lake, and I like to ski. So this might be a little bit geeky, but I like to do what's called telemark skiing, which is this weird style where you, like, drop your knees when you're going. But for the past couple years, I've been using a binding from a manufacturer called 22 Designs, which is the Lynx binding. So pretty esoteric, but if you're into skiing, telemark skiing, I definitely recommend that. 1 more thing I'll mention, Tobias, is let's put a promo code out there for your listeners. So with the discount code INIT, all caps, I'll put a discount on a Pandas book course bundle. So go to store.metasnake.com and use promo code INIT for a discount on that.
[00:48:44] Unknown:
Alright. Well, thank you very much for that offer for the listeners, and thank you for taking the time today to join me and share the work that you've been doing on the Effective Pandas book and helping to teach pandas and effective practices to people in the community. It's definitely a very necessary skill to have, particularly as data pervades everything that we're doing these days. So I appreciate all of the time and effort you've put into that, and I hope you enjoy the rest of your day. Thanks, Tobias. It's been a pleasure. Thanks for having me on. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast, at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Matt Harrison's Background and Entry into Data Science
Python's Appeal and Transition to Teaching
Writing Effective Pandas and Motivation
Challenges and Best Practices in Pandas
Surprising Insights and Advanced Pandas Features
Common Mistakes and Anti-Patterns in Pandas
Different Use Cases for Pandas
Extension Arrays and Future of Pandas
Final Thoughts and Future Projects