Summary
Pandas has grown to be a ubiquitous tool for working with data at every stage. It has become so well known that many people learn Python solely for the purpose of using Pandas. With all of this activity and the long history of the project, it can be easy to find misleading or outdated information about how to use it. In this episode Matt Harrison shares his work on the book "Effective Pandas" and some of the best practices and potential pitfalls that you should know for applying Pandas in your own work.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Matt Harrison about best practices for using Pandas for data exploration, manipulation, and analysis
Interview
- Introductions
- How did you get introduced to Python?
- What motivated you to write a book about Pandas?
- There are a number of books available that cover some aspect of the Pandas framework or its application. What was missing from the available literature?
- Who is your target audience for this book?
- What are some of the most surprising things that you have learned about Pandas while working on this book?
- What are the sharp edges that you see newcomers to pandas run into most frequently?
- It is easy to use Pandas in a naive manner and get things done. What are some of the bad habits that you have seen people form in their work with Pandas?
- How and when do those habits become harmful?
- What are the most interesting, innovative, or unexpected ways that you have seen Pandas used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on this book?
- What are some of the projects that you are planning to work on in the near/medium term?
Keep In Touch
- Website
- @__mharrison__ on Twitter
- Blog
- mattharrison on GitHub
Picks
- Tobias
- Matt
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Effective Pandas Book (affiliate link with 20% discount code applied)
- Discount code INIT
- TCL
- Perl
- Pandas
- Pandas Extension Arrays
- Koalas
- Dask
- Modin
The intro and outro music is from "Requiem for a Fish" by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Matt Harrison about best practices for using pandas for data exploration, manipulation, and analysis. So, Matt, can you start by introducing yourself?
[00:01:08] Unknown:
Yeah. Sure. Thanks for having me on. I'm a corporate trainer these days, and I do some consulting. My background is that I have a computer science degree. Worked in the Bay Area for a while at various start ups, then relocated. I live in Utah right now and been using Python most of my career. But after, I would say, writing a book and doing a bunch of conferences, I got pulled into the training gig world and pretty happy here. And so I spend a lot of my time selling snake oil and telling people how to tell lies with data. I mean, I teach Python for a living and also teach data science machine learning. You mentioned that you've
[00:01:55] Unknown:
got your degree in computer science. And I'm wondering what was your kind of entry point into the data science side of things, and what about it keeps your interest and keeps you motivated to stay up to date with that space?
[00:02:08] Unknown:
Well, I'll definitely say that staying up to date is relatively difficult these days just with a number of, I guess, niches and papers and whatnot that are coming out. But when I graduated from school, I went to work for a search company. And so, actually, how I learned Python was working with a linguistics PhD writing what we called a term suggester, so pulling out the important terms. We did that with Python. He was a Tcl person. I was a Perl person at that time. When I was in school, I was told if I learned Perl, I could get a job, and it proved to be true.
However, at that point, neither 1 of us wanted to sort of cross the bridge: I wasn't particularly interested in Tcl, and he wasn't interested in Perl. And so Python was a compromise. And within 3 days, we had a working prototype, and I think neither of us looked back. And so that was my background into sort of the data world. In a later life, I did a lot of BI reporting. So I had a company that was manipulating data and creating reports using Python. And then later, I worked for a storage company, and was doing statistical models of failure rates and that sort of thing. So that's sort of my background into the data science side. So I come at it from a software engineering side, which might be a little different than a lot of, I guess, the data scientists you see these days. From my experience doing training, a lot of them are coming from PhD programs, physics, mechanical engineering, whatnot, to realize that there are a lot more job opportunities doing data science than there are in academia. And so they sort of cross over that way.
[00:04:00] Unknown:
And so you mentioned that your entry point into Python was as a compromise so that you didn't have to learn Tcl. And I'm wondering sort of as you said, you never really looked back. And I'm curious, what about the Python language and community has kind of held your interest? And what motivated you to turn your career into teaching other people to kind of make that same jump?
[00:04:22] Unknown:
That's a great question. So as I said, I was told to learn Perl just because it was like, hey, this is pragmatic, you can get a job. And that proved to be the case. However, I thought, you know, Perl was great coming from, like, school where we're writing a lot of C and Java. Perl was just a little less ceremony than that. However, as I started using Perl more, their mantra of there's more than 1 way to do it proved to be problematic in that I would look at different resources, and there would be 3 different ways to do something with different characters. And it got really confusing. Even so much that, like, I would come back to my own code, and it'd be hard to understand what's going on. So the thing that I appreciated about Python was the focus on readability and understanding. And even today, when I'm, like, teaching a course to students, I'll emphasize, like, for me, the most important thing is readability.
And do what you can to make it readable because you're gonna come back to your code, your colleagues are gonna come back to your code, and you wanna be able to understand it in a day, in a week, in a month. And I found that Perl really didn't fulfill that promise for me. It made it really easy to kick out stuff, but coming back and changing and updating proved to be a little bit more problematic.
[00:05:46] Unknown:
Yeah. 1 of the jokes that I've heard is that most code is write once, read many, and Perl code is write once, read never.
[00:05:54] Unknown:
Yeah. And I don't mean to, like, pick on Perl. I knew a lot of people who did Perl and did great stuff with that. And it, for me, it was my entry point into Python. So I'm grateful that I had the chance to sort of see those differences, and it really helped me appreciate Python more by learning Perl first.
[00:06:12] Unknown:
Now you've been working in Python for a long time. You teach people how to use it, and you've written a few books. And the most recent 1 is Effective Pandas, digging into the pandas library and framework and how to apply it to different data manipulation and analysis tasks. And it's definitely the Swiss army knife of data and has come to be 1 of the predominant tools where a number of people come into Python entirely because of pandas, where, you know, maybe 10 years ago, it would be people came into Python because of Django or something like that. I'm wondering, what was your motivation for deciding to write a book about the Pandas framework and sort of the main
[00:06:54] Unknown:
focus and goals of the book as you've written it? Sadly or interestingly, this is not my first Pandas book. So I did write a Pandas book 5 years ago as I was using pandas quite a bit. And then I also did the 2nd edition of the Pandas Cookbook. However, in the meantime, after doing both of those, I have taught a lot of pandas to thousands of students, big companies and small companies. And I do a course at Stanford a couple times a year, a data analysis course aimed at people who want to start doing data analysis without using things like Excel. And so I've seen a lot of people using Pandas. I read a lot of Pandas content.
Over that time, I've come to have some stronger opinions about, I guess, the proper care and maintaining of your Pandas code base. And I wasn't seeing a lot of material that aligned with my opinions, or maybe I should say like a lot of medium blog posts that frankly were pushing out bad advice. And so I took that as a chance to, okay, let's maybe revisit this. And my book, Effective Pandas, started out as sort of just updating my old Pandas book that was 5 years old, but it essentially turned into a rewrite. I guess another thing that I wanted to do that I didn't see was just the ability to use color in the book.
Not that you couldn't use color before, but I've written a couple of books before, and most every 1 of the physical versions has been in black and white. And so this 1, I was like, okay. Well, I'm gonna make it so the digital and the physical are gonna be in color because I think there are a lot of things that you can convey in pandas visually. My book has over a 100 images talking about some of these manipulation constructs, and I think that having compelling images that really explain what's going on can be useful for that. So like all things, I think, Tobias, you know, when you asked what was pushing me to do this, I think a lot of developers and software people, they like to bike shed or scratch their own itch. And so this was, I think, an itch that was very itchy for me after sort of seeing what was out there, and I was compelled to scratch it. And as you mentioned, there are definitely
[00:09:25] Unknown:
numerous blog posts out there that are talking about different applications of pandas or plugins for pandas or using pandas as 1 element of a broader workflow. And there are also a number of book length treatises on pandas in different forms, including 1 by Wes McKinney that is talking about sort of data analysis with Python, where Pandas plays a big part. Wondering what was missing from the kind of available literature or maybe just not properly up to date in the available literature and some of the gaps that you are aiming to fill in your work with this effective pandas book? I don't want this to be like a huge, I guess, rant on Pandas.
[00:10:07] Unknown:
I guess, full disclosure, like I said previously, I make my money from Pandas and Python. However, I don't think it does me any service as someone who's gonna help people learn it to be sort of like, in my class, oh, this is awesome, it's all peaches and cream. And then, you know, everything in the labs and the course work out great, and then they get out in the real world, and it's like, oh, they run into these issues, and they don't know what to do. And it's sort of like, oh, here's a wart, and why wasn't I exposed to this? So I sort of feel it's my duty to show, like, good and bad and talk about both of those things. I mean, if you want to, I have sort of a list of, like, rough things in pandas that pandas might bite you with.
I've read a lot of pandas books, and I think a lot of them are, like, great for, like, here's an analysis I did or wanted to do, and here's how I did it. But they're a little bit lighter on, like, here's the things to watch out for. And so I wanted to make sure that I sort of covered both of those, and that was 1 of the goals of the book. You want some examples or
[00:11:21] Unknown:
you wanna dive into some of that? I think we could dive into that in a minute. But before we go that direction, another thing that's worth exploring is that anytime you're writing a book, there's the challenge of figuring out, particularly a programming book, figuring out what is the entry point. Like, where are the people who are reading this coming from? Where a number of books that might be covering pandas are starting with, you have no idea what programming is. You just wanna be able to do something with data. So I'm going to give you the, you know, 30 second overview of Python, and then we're gonna dig into these calls with pandas. And we're just gonna cover this all at a surface level so that you can go from 0 to having something working in as short of time as possible.
Or you might be assuming, you know, this is somebody who's coming from a data science perspective, and so we're going to teach them some of the software engineering practices around how to use Pandas. And I'm just curious, what was the kind of target persona that you had in mind as you were writing this book and the main goals that you wanted to have them walk away with as far as, you know, I've read this book and so now I can feel confident doing x.
[00:12:25] Unknown:
The target persona is anyone who wants to improve their Pandas code. And I know that's a little bit vague, but I do believe that there are some gaps and some issues that aren't covered by a lot of the material out there. And so that's sort of the vague target persona. To be a little bit more specific, this is not like an intro to Python tome. So I'm assuming that there is some basic Python knowledge. I think we've all seen that someone who has a background in programming can go out and, in 90 minutes, sort of look at some Python code and sort of teach themselves Python or, like, go out to Stack Overflow and start copying and pasting and be relatively successful with it.
There are some, I would say, like, minimal Python skill sets that would be useful for someone who wants to dive into pandas. A few of those that are not, I guess, maybe intuitive to the 90 minute Python learner might be lambda functions, list comprehensions, and slicing. So I do sort of teach those in the book, but they are, I would say, used all over the place in pandas and might be something that someone who's fresh to Python might want to review. And not just for my book, but in general, if they're thinking about learning pandas and using it, they might wanna make sure that they're pretty caught up on slicing, lambdas, and comprehension constructs.
[00:13:58] Unknown:
So now digging into some of the kind of Pandas specific aspects of it, there are definitely a number of sharp edges in the design of the framework because of just the early days of getting something working, and they are kind of core to how it's all pieced together. And so there have been some, you know, improvements in the API, but the old original implementations are still there. And it's like, oh, you know, there's loc and there's iloc. Which 1 do I use? And I'm wondering as far as some of those sharp edges or some of the potential bad habits that people can develop when they are just trying to hack something together quickly. What are some of the main things that you see newcomers run into when they're first getting started with pandas and some of the ways that you're trying to correct those habits in the book?
[00:14:49] Unknown:
Yeah. Again, I don't want this to be like, this is all bad about pandas. There's actually a lot of great things about pandas. I think 1 of the things that a lot of people don't realize is at its core, pandas is a wrapper around NumPy. And basically, NumPy is a wrapper giving you the ability to do vectorization on numbers as, what I like to say, C numbers, not Python objects. And so by using NumPy, and in essence pandas because it's on top of that, you basically get very close to C-level speeds for many operations. And if you understand that, that can sort of push you into not using certain operations, or trying to do things in, what I would say, a vectorized way.
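A rough sketch of that loop-versus-vectorized contrast; the Series here is made up for illustration:

```python
import pandas as pd

# A made-up numeric column; any Series of numbers behaves the same way.
prices = pd.Series([10.0, 12.5, 9.75, 14.0])

# Loop style: pulls each value back up into Python, one object at a time.
doubled_loop = pd.Series([val * 2 for val in prices])

# Vectorized style: one multiplication that runs down at the NumPy/C level.
doubled_vec = prices * 2

# Both give the same answer; the vectorized form is the idiomatic, fast one.
assert doubled_loop.equals(doubled_vec)
```

On a Series this small the difference is invisible, but on millions of rows the vectorized version avoids creating millions of intermediate Python objects.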
That's, I think, a basic sort of principle that, especially for people coming from a Python background but without a NumPy background, can be a little bit different. And if you look at it just from, like, a pure Python side, that can be confusing. And you might wonder, like, what's the benefit here if I'm just writing for loops with pandas? So, generally, if you're writing a for loop with pandas, it's sort of a code smell or indication that you could be doing something a better way. There are other things that can be overwhelming for people coming to Pandas. And 1 of those is if you, like, inspect the number of attributes on the data frame or series, which are the core data types in Pandas, there's over 400 different attributes on those.
Contrast that with, like, a list in Python. I think a list has, like, 40 attributes or something like that. So there's a large amount of API, which I think when you present someone with that, they're like, you just have to learn 2 things, data frame and series. But both of those have 400 different attributes. That can be a little bit overwhelming, and people put up their guards, like, I don't think that I can do that. So my take on that is, I guess, going with the Pareto principle. Yeah. There are a lot of things. It's a rich API. However, you don't need to learn and memorize everything. In fact, I think it's probably humanly impossible to memorize the whole pandas API just because it is so large. I mean, you also have things like the read_csv function, which is a function in Pandas for ingesting CSV (comma separated value) files. And you might think, oh, that's a simple thing. Right? You pass in a file. Yeah. Yeah. Yeah. You can pass in a file, but this function has, like, 50 different parameters in it.
And, you know, from a software engineering point of view, if you look at, like, API design, API design is like, hey, people have limited mental capacity, and most people's brains can only hold 7 plus or minus 2 things. And so when you overload them with 50 things, that can be overwhelming, let alone if you look at a lot of the options for those 50 different parameters in there, like parse_dates. parse_dates is a parameter for telling Pandas which columns you want to treat as dates. There's, like, 5 different ways of specifying dates in there as well. So if you were to look at, like, someone applying modern Python, like type signatures, on top of pandas, the signature for that function would be multiple pages long just to get that right, if it is even possible to get that right. So that can be a little bit overwhelming for people who are, you know, coming from Python.
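A sketch of that parse_dates usage; the CSV content and column names are invented, and an in-memory buffer stands in for a real file:

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the column names are made up.
csv_text = io.StringIO(
    "order_id,ordered_at,amount\n"
    "1,2021-03-01,9.99\n"
    "2,2021-03-02,24.50\n"
)

# Two of the handful of read_csv parameters that cover most everyday use:
# parse_dates converts the named columns to datetimes at load time,
# and index_col chooses which column becomes the index.
df = pd.read_csv(csv_text, parse_dates=["ordered_at"], index_col="order_id")
```

After this, `df["ordered_at"]` has a datetime64 dtype, so the date manipulation discussed later works on it directly.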
However, I do think, again, you know, if you're of the mentality that I need to, like, read through all the documentation and understand everything, that's probably gonna overwhelm you. But if you sort of take it step by step, you know, there are 50 parameters to the read_csv function. However, like, in practice over the years that I've used Pandas, I've only used, like, 5 or 6 of those. So a lot of these things, for better or for worse, you can sort of ignore and not worry about them unless you really need to. Some other things that are a little bit tricky or maybe confusing is if you have a data frame, which is a 2 dimensional structure similar to, like, a database table or a spreadsheet, and you want to access a column in that, there are kind of 2 main ways to do that, but there are other ways as well. So there are multiple ways to do things, which a lot of people think, Python should only have 1 way to do things, which in practice really isn't true, but it's a nice goal, I think, to strive for. But the 2 main ways to do things are attribute access, which is with the period, and index access, which is with the square brackets.
And there are sort of pros and cons to both of those ways. And so you get people professing to do things 1 way or people professing to do things another way. And for someone who's new, it can be confusing. In fact, my sort of take on this after teaching things to a lot of people is that, you know, I would probably prefer not even to have either attribute access or index access. If I were designing it, I would probably just make it a method to pull off columns and rows just because, like you said, Tobias, a lot of people who are coming to use pandas are using it as a tool, and they're not necessarily Python programmers. So having to understand those differences can be a little bit confusing or overwhelming. Why is there different syntax to do the same thing?
That sort of thing?
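The two access styles can be put side by side; the column names here are made up:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Lima"], "temp": [4, 19]})

# Index access always works, even for column names that contain spaces
# or that shadow DataFrame methods.
by_index = df["city"]

# Attribute access is terser, but breaks for names like "count" or "class",
# and it cannot be used to create a new column.
by_attr = df.city

# Both return the same Series.
assert by_index.equals(by_attr)
```

This is the tradeoff described above: the bracket form is the safer default, while the dot form is a convenience that fails in edge cases.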
[00:20:16] Unknown:
As to the Pandas API, as you mentioned, it's vast, and, you know, just the core data types have a huge number of operations that you can do on them. And I'm curious, as you have been working through this book and trying to enumerate your own knowledge and learn more about the specifics to make sure that you're getting things right, what are some of the most surprising things that you've learned about the Pandas API or its capabilities in the process of actually trying to cement these best practices and understand, you know, these are the 5 different ways I can do this thing, and this is the way that I want to recommend.
[00:20:55] Unknown:
A lot of this I was, you know, pretty familiar with, having taught this and having written books previously. I guess 2 of the things that caused me the most delay in this was diving into time zones a little bit more. So I wanted to have pretty good coverage of date manipulation and time zones. And I looked back at my analyses that I've done over the years, and most of them really didn't use time zones. So I started to mess around with time zones a little bit more in preparation for the book. And something that surprised me that, again, I hadn't really come across was in Pandas, if you convert a column to a date column, you get this nice little accessor, the dt accessor, which allows you to easily pull off, like, the year or the month or the day of the week. And then it has some nice things where you can, like, convert the month name or the day of the week name to a different locale or language if you want to. So that's really handy. And, certainly, you could imagine doing these sorts of things, you know, just with a string and doing, like, regex or something and pulling off parts and converting them. Once you convert it to a date, you get the ability to manipulate that sort of for free. And so I was playing with a dataset that I had and trying to convert these times that I had in here to a time zone, and it was just really slow, taking, like, 15 minutes to convert this dataset to a time zone.
And I was trying to, like, look on Stack Overflow and all these places and not really getting too far, and coverage of that was a little bit hard to find. And then once I finally converted this column to a date that had multiple time zones, I tried to do, like, date manipulation as I'd been doing it with the dt accessor, and it didn't work. And that's because in pandas, if you have a column that has 2 different time zones, it doesn't treat it as a datetime type. It converts it to this timestamp type, which doesn't have the dt accessor. That was something that I just wasn't aware of before. Like I said, I really hadn't done a lot with converting stuff to time zones. I just sort of had wall time and said, okay, it's wall time, it's good. That was something that was surprising to me.
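A small sketch of the dt accessor and the timezone behavior described above; the dates are arbitrary, and passing utc=True is one common way to keep a proper datetime dtype when the input has mixed offsets:

```python
import pandas as pd

# Plain date strings become a datetime64 column.
s = pd.to_datetime(pd.Series(["2021-01-15", "2021-06-30"]))

# Once it's a datetime column, the dt accessor comes along for free.
years = s.dt.year       # 2021, 2021
days = s.dt.day_name()  # "Friday", "Wednesday"

# Single-zone data: localize, then convert.
aware = s.dt.tz_localize("UTC").dt.tz_convert("America/Denver")

# Mixed offsets are where things get awkward; utc=True normalizes everything
# to UTC so you keep a datetime dtype (and the dt accessor) instead of an
# object column of individual Timestamps.
mixed = pd.to_datetime(
    pd.Series(["2021-01-15 00:00:00+00:00", "2021-01-15 00:00:00+05:00"]),
    utc=True,
)
```

Whether mixed offsets come back as an object column or raise an error varies by pandas version, which is part of why this corner is so easy to trip over.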
Another thing that was surprising, and this might be a little bit esoteric as well, is I sort of dove into grouping and all the grouping operations you can do. And so it might seem like, oh, those are just group by, but there's not just group by. There's group by, and then there's pivot table, and then there's crosstab. And those are all sort of built on top of each other. These are ways of pivoting your data, in Excel terms, if you're familiar with those. In addition to that, there's a really cool method in pandas, which is resample. And so if you have a date and you stick the date in the index, you can do resampling. You say, I want to aggregate these by month or by day. Or you can say, I wanna aggregate them by every 2 months, 3 days, and 2 hours. Cool stuff like that. So I was sort of deep diving into that, and I actually didn't include this in the book, but I was like, okay, I'm gonna look at, like, every combination of doing these group bys with applies and transforms and filters, and then also doing those with resamples and whatnot.
It turns out that there are some sort of corner cases where doing a group by is different than doing a resample, which I wouldn't have expected, because I always thought they were somewhat equivalent. But there are some corner cases where that behavior is different. So those are some examples, I think, Tobias, of, you know, if you really want to learn something well, you should teach it or, alternatively, write a book about it. And, you know, even after having used Pandas for many years, yeah, there are insights and stuff to learn by digging into them.
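In the straightforward case, a resample and a date-based group by do line up; the frequency and data below are invented for illustration:

```python
import pandas as pd

# Made-up daily values with the dates in the index.
idx = pd.date_range("2021-01-01", periods=6, freq="D")
sales = pd.DataFrame({"amount": [1, 2, 3, 4, 5, 6]}, index=idx)

# Aggregate by month (the "MS" alias labels bins by month start) with resample...
by_resample = sales.resample("MS")["amount"].sum()

# ...or with a groupby over a Grouper; in this simple case they agree.
by_groupby = sales.groupby(pd.Grouper(freq="MS"))["amount"].sum()

assert by_resample.equals(by_groupby)
```

The corner cases Matt mentions, where the two diverge, tend to show up with empty bins, offsets, and the apply/transform/filter variants rather than a plain sum like this.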
[00:24:50] Unknown:
As far as the work of just using Pandas to complete a task, you know, it's definitely fairly straightforward to take it off the shelf and say, okay, these are the 5 methods I need. I'm going to read CSV. I'm going to drop NA. I'm going to aggregate across this column, and then I'm going to write that out to a table in my database. And so it's, you know, pretty straightforward to say, okay, these are the things I need to do. And as you get confident using some of these high level aspects of the framework, you might get comfortable with certain practices that are useful to get up and running, but might be considered bad practice if you're using it in a professional setting or trying to use it in a team environment. And I'm curious what you've seen as far as some of the practices that are useful at the start but end up becoming either limiting, because you don't realize the actual full power of what you can be doing with it, or are potentially antipatterns as you start to scale up your usage of pandas, and just some of the ways that you've been able to work with some of your clients to help to kind of correct those crutches in terms of how they're applying Pandas in their work?
[00:26:06] Unknown:
1 of those would be people who are coming to Pandas from a Python background, and they're thinking about pandas as a sequence of data. So they wanna use for loops everywhere. Once you sort of drop into a for loop, you're losing 1 of the main benefits of Pandas, which again is that notion that it's storing data in blocks as C would, and not really storing Python objects under the covers. Once you use a for loop, you're sort of telling Pandas that you want to pull the data from C back into the Python level and manipulate it at the Python level, which can work. And, obviously, a lot of people do that. And if you look at a lot of code, you'll see a lot of code that has for loops. But generally, a for loop is what I would call a code smell with pandas, indicating that there's probably a faster way to do things. A related version of that would be using the apply method.
A lot of people use the apply method, which is basically doing a for loop, but wrapping it in a method that's called apply. So either of those is maybe naive and works, but maybe not the most effective. Another thing that I see people doing all over the place is using the inplace argument. So there is an inplace argument on a lot of operations in Pandas, and the compelling reason supposedly for using inplace is that if you do an operation in place, it doesn't copy the data, but rather it just updates the data in place, so you have this memory savings. Well, it turns out that that's sort of a lie, and there's even an open issue in Pandas to deprecate and remove inplace completely, because most of the calls that use inplace actually make a copy under the covers. So a lot of people are like, use inplace, you're gonna save all this time and memory, and it's actually not the case. You don't save time and memory. In fact, from my point of view, using inplace is an anti pattern, because if you use inplace, you're not gonna be able to chain. And I think chaining those operations is going to lead you to write better code. And so that would probably be my next thing that people do: they don't chain.
And what is chaining? Well, most operations in Pandas will return a new data frame or series. And so what I tend to tell people is, put parentheses around your operations, and then take your first operation, and it's going to return something. Instead of storing that as a variable, just do your next operation on the next line right after that. And if you follow that pattern of what I would call chaining, sometimes people call this flow oriented programming, it's gonna force you to sort of think about the logical steps in your process, but it's also gonna make your code look like a recipe, and you can read through it line by line. This line is going to filter out the columns. The next line is gonna filter out the rows. This line is going to add some new columns. This line is gonna do an aggregation.
And you can read that, and it's very clear. So a lot of people don't know about chaining or don't use that. Another issue that is a common wart or gotcha is when people treat pandas as a dictionary, or a data frame or a series as a dictionary, and they start doing operations on it, doing, like, index assignment. And so, invariably, if you do that, you're going to get this SettingWithCopyWarning that most people who use Pandas will see. And you'll start searching all over the place to find a good explanation of it. It's a little bit hard, and I don't know that I've seen a really good explanation of it, like on Stack Overflow or other places. But invariably, what most people do, just to get around that warning, after they do some operation they actually copy their data frame, which is kind of funny, because these are the same people who say, like, use inplace, don't copy. And then they have these warnings, and they put copy to get around the warnings. Well, I found that, like, my whole book, all of the code in my book and in my previous book, the Pandas Cookbook, doesn't have any of those errors at all, because if you learn to use the assign method, which very few people seem to know about, you won't have those issues. So the assign method is a method in Pandas that, rather than treating Pandas as a dictionary and saying for this column, overwrite this column with this, you say, I want to call assign, and you use a keyword argument, with the name of the parameter being the column and the value being the value of that new column, and that will return a new data frame with that updated column in there. And if you use that, you sort of sidestep that issue completely.
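A minimal sketch of assign combined with chaining; the column names and the filter threshold are made up:

```python
import pandas as pd

df = pd.DataFrame({"price": [100, 250], "qty": [3, 1]})

# Dictionary-style assignment on a slice is what tends to trigger
# SettingWithCopyWarning:
#   subset = df[df.qty > 0]
#   subset["total"] = subset.price * subset.qty   # warning
#
# assign returns a new frame instead, and it chains cleanly:
result = (
    df
    .assign(total=lambda d: d.price * d.qty)  # add a derived column
    .query("total > 250")                     # then filter the rows
)

# The original frame is untouched; the chain produced a new one.
assert list(df.columns) == ["price", "qty"]
```

Each line of the chain does one step, so the whole expression reads like the recipe described above.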
And the nice thing about assign, in contrast with treating pandas objects as a dictionary, is that assign works very well with this chaining that I just talked about. So you can write code that, again, when you have an assign method in there, it's just gonna be, here are the methods and here are the columns I'm gonna add to my data frame. And you can read through that and you can see line by line what are the columns you're going to add. I think another maybe anti pattern, or thing that people do that they shouldn't, is they load up their data, and then they look at it, and they see that the combination of Jupyter and Pandas, which is generally what they're using, limits the amount of data they can see. So I think currently, on most modern versions of Jupyter and Pandas, it will show you 20 columns and 10 rows of data.
You know, a lot of people have more data than that. Their immediate thought as hackers is, how do I break the system, or how do I view more data than that? And I like to tell students: if you have that urge or thought, maybe reconsider it, and maybe it should be your spidey sense telling you that there's a better way to do things. Humans aren't particularly adept at looking at large tables of data. I sort of realized this when I was, again, in my prior life doing business intelligence reporting. You know, we didn't evolve over the years to take this huge table with a thousand rows, or a million rows, and a thousand columns in it, scroll down through that, and find the data that looks interesting. Right? So that's a really ineffective way of dealing with your data. However, computers are really good at filtering your data and finding things that might be important. So if you feel that urge to view more of your data, it might be a hint: hey, use the computer to do this. Alternatively, humans are also good at visualizing things.
And so that might also be a hint to you to visualize your data rather than to start scrolling through large amounts of data. Those are a few examples of places where people maybe get confused or could do things a little bit better, Tobias.
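The "let the computer do the filtering" advice might look like this in practice; the sensor data here is randomly generated purely for illustration:

```python
import numpy as np
import pandas as pd

# 1,000 fake sensor readings; the data is randomly generated.
rng = np.random.default_rng(42)
df = pd.DataFrame({"sensor": rng.normal(loc=50, scale=5, size=1_000)})

# Instead of raising display.max_rows and scrolling through a
# thousand rows, let the computer surface the rows that might
# actually matter (here: readings more than ~2 sigma from the mean):
outliers = df.query("sensor > 60 or sensor < 40")
print(f"{len(outliers)} of {len(df)} rows look interesting")
```

A quick `df.sensor.plot.hist()` would be the visualization-flavored alternative to the same urge.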
[00:32:37] Unknown:
1 of the interesting things about pandas is that it has grown to be popular in a number of different kind of styles of application where many people will use it for data engineering workflows, where they're just doing some manipulations to clean up the data, like drop nulls or, you know, impute values that are missing or maybe, you know, adding derived columns or removing certain columns and then writing that back out to some other source. Some people are using it just for data exploration purposes to see here's a big CSV file or here's a table from the database. What do I even have in here? Maybe do some sort of, like, regression analysis on it. Other people are using it as a core piece of their machine learning workflows. And I'm wondering if there have been any different kind of stylistic aspects of how you apply pandas in those different scenarios and any recommendations that you have for folks who might be using it for 1 versus the other as to lessons that might be learned from different use cases for the Pandas framework?
[00:33:42] Unknown:
I think you sort of nailed it on the head. Pandas is so popular these days because it can be used all over the place. Right? So if you're doing, like, structured machine learning on tabular data, Pandas is sort of the de facto go-to for tweaking your data before that. And then you have a lot of analysts who want to come over to the Python world. They're used to maybe SQL or Excel and want to start doing things with a programming language. Sometimes the goal is just to, like, learn a programming language, but sometimes the goal is to start automating these things. Right? I mean, you certainly can automate Excel, but that's somewhere you probably don't want to go.
I think 1 of the things that might be problematic, it's sort of a pro and a con, is Jupyter. Jupyter is a great environment for exploratory type stuff and trying things out very quickly. However, Jupyter doesn't really lend itself well to writing production code. And so you can get quick feedback, but you're probably violating a lot of software engineering best practices, like making global variables and not using encapsulation, like functions or methods, that sort of thing. I guess I'm pretty aware of this because I teach programming, but also because I have a software engineering background. So I worked in software for a long time, and I've been through code reviews and cleaning things up. You know, in all of what you just talked about, Tobias, like machine learning or analysis or just pulling something out, a lot of the code that you see in Jupyter is really not good production quality code.
And so, again, my advice, what I tell my students is to sort of embrace that chaining philosophy of writing your steps 1 by 1. Because once you do that and you have sort of like, here's my raw data, and I also tell people work with your raw data because invariably your boss is gonna come back to you and say, I want you to explain this prediction or I want you to explain this number. If you're not working with the raw data, that makes it a lot more difficult. Also, you know, if you are working with the raw data, you have all of the transforms there. If you're working with someone's Excel file or, you know, data that they've hand tweaked, you don't know what they've changed potentially. So I tell people, work with the raw data and then make this chain of operations that you're gonna do to sort of clean up the data. You know, doing your data janitorial work or getting the data ready for exploratory data analysis or machine learning. And then once you have this chain, it's pretty trivial to just take that chain and indent it and make it into a function.
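Turning a chain of cleanup steps into a function, as described here, might look like this sketch; the `tweak_sales` name and the data are invented for illustration:

```python
import pandas as pd

def tweak_sales(raw: pd.DataFrame) -> pd.DataFrame:
    # The whole data-janitorial chain, indented into one function
    # that always starts from the raw data.
    return (
        raw
        .rename(columns=str.lower)                  # normalize column names
        .assign(total=lambda d: d.units * d.price)  # derived column
        .query("total > 0")                         # drop empty orders
    )

# Hypothetical raw export, as it might arrive from upstream.
raw = pd.DataFrame({"Units": [2, 0, 5], "Price": [3.0, 9.0, 1.5]})
clean = tweak_sales(raw)
print(clean)
```

Because every transform lives inside the function and the raw frame is never mutated, explaining any downstream number means re-reading one chain, and moving this into a library or a pipeline is a copy-paste away.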
And so if you follow that style of programming, going from sort of naive bad practices to making something that's ready to be put in a library or put into production is really trivial and easy. So that would be something that I would suggest. Just start learning how to do chaining, and I think that will make a huge change for a lot of people's code and make it a lot more readable and easy to understand.
[00:36:43] Unknown:
And another interesting development in the core of pandas that's happened in the past, at this point it's maybe a couple of years old, is the extension arrays capability, where you can actually create additional data types that can be understood by the pandas data frame. So I think the initial work was done by the folks at Anaconda to add support for geographical objects, so being able to do geo representations of objects in pandas, or being able to treat IP addresses as a native object without it just being of type object and having to do generic operations and then casting it back and forth. And I'm wondering if you factored in any of those capabilities, or if there are any new evolutions in the Pandas ecosystem that you brought into the book, or if you decided to just keep it core to what's out of the box when you just pip install Pandas and nothing else.
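As a small taste of the extension-array machinery mentioned here, the nullable integer dtype that ships with pandas itself is built on it; a minimal sketch:

```python
import pandas as pd

# The nullable "Int64" dtype is implemented as an extension array:
# missing values no longer force an upcast to float64, the way the
# plain NumPy-backed int64 dtype does.
s = pd.Series([1, 2, None], dtype="Int64")
print(s.dtype)   # Int64, not float64
print(s.isna())  # the None is stored as pd.NA, and the ints stay ints
```

Third-party extension dtypes (geometries, IP addresses, and so on) plug into the same interface, which is why they can live inside an ordinary data frame column.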
[00:37:36] Unknown:
To be honest, Tobias, I have never used those extension capabilities, nor have my clients asked for them. That's not to say that they can't be useful, but I didn't put that in my book. I will say this as well. I mean, I'm actually teaching a pandas course this week. And in general, Pandas, you know, for a while there it was, like, 0.18, and then it was, like, 0.24, and it sort of sat there at this weird version level. And a lot of people are like, well, we can't use that because it's not 1.0 or whatnot. And then it came to 1.0, and I think we're now at, like, 1.3.4.
You know, things have changed in pandas, but also a lot of things haven't changed. And a lot of what the book is, again, comes back to this Pareto principle. There are these core fundamentals, and if you understand these core fundamentals, you can do sort of 80, 90, 95% of what you need to do using very limited pandas. So extension arrays might be interesting, but I haven't done it. And I would imagine that, like, 95 plus percent of my book would probably run with, like, Pandas 0.24. Right? It is, like you said, focused on the core, and on what I've seen most of my clients and students need, the functionality to get their job done.
So, yeah, will there be, like, a second edition, and maybe will I put, like, extension types in there? I mean, yeah, hopefully there will be a second edition. Right? And with feedback from users and, you know, as I continue working with clients and find out more best practices or features or things that are super compelling, I would certainly like to include those in a future edition.
[00:39:16] Unknown:
In terms of your work with Pandas and working with your clients, what are some of the most interesting or innovative ways that you've seen it applied?
[00:39:25] Unknown:
I think as someone who spends a lot of time with their clients, the most interesting is probably when a client has, like, a major insight. And I can't really speak to the specifics of a lot of clients, but I do remember 1 who I was teaching pandas and visualization as well. I often teach those sort of hand in hand. And instead of using sort of canned data, in the class we actually used their data to teach pandas. So as we're going through the class, we're just like, okay, how would you do this operation? And you're doing it on your data. Right? So it's, like, super relevant to them. And in the middle of the class, 1 of the attendees was like, oh, look at this visualization I did. And it was a super cool visualization. It basically was a little chart that showed CPU usage through the week. It was a sort of circular path where, like, you know, during the weekend, for this particular client, CPU usage went up for certain reasons, and then it came back down and then sort of built back up. And after, you know, learning some Pandas and learning some visualization tools, they were able to make super compelling and relevant insights with very little code, just by understanding these basics of data manipulation.
I mean, another case that was just rewarding, I guess, as an instructor, but maybe sad for the client, was when I was teaching Pandas and I had a client who, like, in the middle of class, hit themselves on the head. I'm like, what's wrong? Did I, like, say something wrong? And they're like, no. This code that you just showed us, which was basically 1 line of code, we just spent, like, the last 3 weeks implementing for our business intelligence reporting insight. So there is a lot of power in Pandas. And, you know, if you can master that, it can open doors to insight into your data. So those would be a couple of examples.
[00:41:17] Unknown:
In the work of writing this book and exploring the kind of detailed elements of pandas that you might not otherwise pay much attention to in your day to day, what are some of the most interesting or unexpected or challenging lessons that you learned in the process?
[00:41:32] Unknown:
I mean, it goes back to your classic, you know, to learn something, you need to teach it, or alternatively write a book. And I think that's kind of true, especially in the days of Stack Overflow. You can just sort of go out to Stack Overflow and start copying and pasting things and jamming them in where you want to, and a lot of it will just work without you even understanding what's going on. I would also say that, you know, if someone's considering writing a book, books are probably gonna take longer than you expect, no matter how good you think your estimating skills are.
At this point, I've written enough books that, like, book writing and editing and those sorts of things aren't particularly difficult. It's just a matter of putting in the time and getting that work done. For this book, probably what took the most time for me was focusing on examples and imagery that would convey the message and the ideas of what I'm trying to get at. So it can be a little bit of a challenge. I would say, and maybe this is sort of a meta commentary on a lot of material out there, but, you know, when you're doing an example, you want your example to show how to do something. Right? And so there's a fine line between, like, here are some of the warts that you might come across, and, like, here's just the example as is. So navigating that path can be a little bit of a challenge.
I'm happy with how it turned out. Hopefully, it's useful for others, not just for telling them how to do things, but also giving them some warnings and some ideas of, when things go south, how to recover from that.
[00:43:10] Unknown:
As you continue to work with clients and as you are working on releasing the book, now that you've got that out the door, what are some of the projects that you're planning to work on in the near to medium term or any kind of new areas of Python or Pandas or data science that you're looking to explore?
[00:43:29] Unknown:
I guess I'm in a weird situation where my focus is, like, on providing value to my clients. And so in the last year, some of the places where they've asked for additional training are testing their code, and some of them have asked for more modern Python. Modern Python is starting to use things like type annotations and some of the Python 3 specific features. And so I did do a lot of training on testing, basic testing, but also more advanced testing, leveraging advanced features of pytest or using tools like Hypothesis. My experience is that a lot of people who are using Python, as we've mentioned, you know, are coming to it not necessarily with a programming background. So being able to do testing is important, and they just don't have that background.
I'm also working on a new initiative to sort of, I guess, help spread Python knowledge, and I don't really have more details about that right now. But I would say, like, follow me on Twitter, and hopefully in the coming weeks, I should be able to shed a little bit more light on that.
[00:44:34] Unknown:
Are there any other aspects of the work that you did on this book or your experience with pandas or lessons that people might take away from reading the book that we didn't discuss yet that you'd like to cover before we close out the show? I just wanna sort of make it clear that, hopefully, this wasn't super negative, and my goal here is not to rail on pandas. Like I said, I do make a good deal of my income from it.
[00:45:00] Unknown:
I think pandas is a super powerful tool, you know, but you've got to take it in with warts and all. I'd say 1 more thing, Tobias, that might be sort of a meta commentary that's important for people to understand, or they might want to consider if they're thinking about learning pandas. What we're seeing now, in addition to, like, pandas, is a proliferation of libraries that are basically using pandas not as an engine, but more as an API. And so you'll see things like Spark has Koalas. There's Dask. There's a lot of other ones that are maybe fringe right now, but I could see a world where, a couple years down the road, my Pandas code really doesn't change other than an import. But now instead of running on the CPU, it's running on the GPU or maybe some tensor processing unit or something.
But it's basically using an API very similar to what Pandas provides us, but the underlying implementation is completely different. Yeah. It's definitely an
[00:46:06] Unknown:
interesting aspect of how some of these early tools have become so ingrained in terms of how people are working with them, that people are just adopting the API for these different applications. And I know that there's also some work being done to build a set of API standards for working with data in Python so that you can abstract over the specific library that's being used, whether it's NumPy or Pandas or if you're working with Dask or Modin or what have you, so that you can just write your code and then, as you said, just change the import at the top. Yeah. Yeah. I think it's probably gonna be interesting because
[00:46:42] Unknown:
up till now, I think we're sort of at the 1.0 days for data analysis here. But if you look at, like, the progression of databases, where databases start out simple, but now they have these really complicated query planners and those sorts of things, I think we'll probably see, you know, things go down that route. You know? And especially if you leverage, like, a chain style like I'm talking about, that would be something that would enable
[00:47:05] Unknown:
a query planner on an analysis engine to do things like predicate push-down and some of these more advanced operations that a database can do, because it has the whole query there and it knows what to do, rather than just a single step of it. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing or pick up a copy of the book, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose snowshoes as a way to get out and be active in the winter and enjoy the scenery. I recently picked up a pair of snowshoes from MSR. I'll have a link to them in the show notes. They're a good manufacturer. They've got a decent lineup of snowshoes,
[00:47:47] Unknown:
across a range of budgets, so definitely worth checking out. And so with that, I'll pass it to you, Matt. What do you have for picks this week? Awesome. Yeah. My kids have some MSR snowshoes, and they're definitely cool. So I guess on the snow theme, I live in Salt Lake, and I like to ski. So this might be a little bit geeky, but I like to do what's called telemark skiing, which is this weird style where you, like, drop your knees when you're going. But for the past couple years, I've been using a binding from a manufacturer called 22 Designs, which is the Lynx binding. So pretty esoteric, but if you're into skiing, telemark skiing, I definitely recommend that. 1 more thing I'll mention, Tobias, is let's put a promo code out there for your listeners. So with the discount code INIT, all caps, I'll put a discount on a Pandas book course bundle. So go to store.metasnake.com and use promo code INIT for a discount on that.
[00:48:44] Unknown:
Alright. Well, thank you very much for that offer for the listeners, and thank you for taking the time today to join me and share the work that you've been doing on the Effective Pandas book and helping to teach pandas and effective practices to people in the community. It's definitely a very necessary skill to have, particularly as data pervades everything that we're doing these days. So I appreciate all of the time and effort you've put into that, and I hope you enjoy the rest of your day. Thanks, Tobias. It's been a pleasure. Thanks for having me on. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast, at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Matt Harrison's Background and Entry into Data Science
Python's Appeal and Transition to Teaching
Writing Effective Pandas and Motivation
Challenges and Best Practices in Pandas
Surprising Insights and Advanced Pandas Features
Common Mistakes and Anti-Patterns in Pandas
Different Use Cases for Pandas
Extension Arrays and Future of Pandas
Final Thoughts and Future Projects