Summary
Python has become one of the dominant languages for data science and data analysis. Wes McKinney has been working for a decade to make tools that are easy and powerful, starting with the creation of Pandas, and eventually leading to his current work on Apache Arrow. In this episode he discusses his motivation for this work, what he sees as the current challenges to be overcome, and his hopes for the future of the industry.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
- Check out the Practical AI podcast from our friends at Changelog Media to learn and stay up to date with what’s happening in AI
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with O’Reilly Media for the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th. Here in Boston, starting on March 17th, you still have time to grab a ticket to the Enterprise Data World, and from April 30th to May 3rd is the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Your host as usual is Tobias Macey and today I’m interviewing Wes McKinney about his contributions to the Python community and his current projects to make data analytics easier for everyone
Interview
- Introductions
- How did you get introduced to Python?
- You have spent a large portion of your career on building tools for data science and analytics in the Python ecosystem. What is your motivation for focusing on this problem domain?
- Having been an open source author and contributor for many years now, what are your current thoughts on paths to sustainability?
- What are some of the common challenges pertaining to data analysis that you have experienced in the various work environments and software projects that you have been involved in?
- What area(s) of data science and analytics do you find are not receiving the attention that they deserve?
- Recently there has been a lot of focus and excitement around the capabilities of neural networks and deep learning. In your experience, what are some of the shortcomings or blind spots to that class of approach that would be better served by other classes of solution?
- Your most recent work is focused on the Arrow project for improving interoperability across languages. What are some of the cases where a Python developer would want to incorporate capabilities from other runtimes?
- Do you think that we should be working to replicate some of those capabilities into the Python language and ecosystem, or is that wasted effort that would be better spent elsewhere?
- Now that Pandas has been in active use for over a decade and you have had the opportunity to get some space from it, what are your thoughts on its success?
- With the perspective that you have gained in that time, what would you do differently if you were starting over today?
- You are best known for being the creator of Pandas, but can you list some of the other achievements that you are most proud of?
- What projects are you most excited to be working on in the near to medium future?
- What are your grand ambitions for the future of the data science community, both in and outside of the Python ecosystem?
- Do you have any parting advice for active or aspiring data scientists, or resources that you would like to recommend?
Keep In Touch
- wesm on GitHub
- Website
- @wesmckinn on Twitter
Picks
- Tobias
- Wes
- The Soul of a New Machine by Tracy Kidder
Links
- Ursa Labs
- Pandas
- Podcast Interview with Jeff Reback
- Pandas Extension Arrays Interview with Tom Augspurger
- AQR Capital Management
- Distributed Computing
- SQL
- Excel
- Duke University
- AppNexus
- Chang She
- Ibis
- Open Source Governance
- Apache Software Foundation
- Paul Graham
- Schlep Blindness
- Big Data File Formats
- Apache Arrow
- Hadoop
- Spark
- Apache Impala
- R Language
- Ruby
- Rust
- Pandas 2.0 Design Docs
- Apache Arrow and the 10 Things I Hate About Pandas
- GeoPandas
- Statsmodels
- Python For Data Analysis by Wes McKinney
- Two Sigma
- RStudio
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Today, I'm interviewing Wes McKinney about his contributions to the Python community and his current projects to make data analytics easier for everyone. Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API, you've got everything you need to scale. And for those tasks that need fast computation, such as training machine learning models or building your deployment pipeline, they just launched dedicated CPU instances.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, to get a $20 credit today and launch a new server in under a minute. And don't forget to say thanks for their continued support of the show. And don't forget to visit the site at pythonpodcast.com to subscribe to the show, sign up for the newsletter, and read the show notes. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. And to keep the conversation going, go to pythonpodcast.com/chat. To learn and stay up to date with what's happening in artificial intelligence, check out this podcast from our friends over at the Changelog.
[00:01:26] Unknown:
Practical AI is a show hosted by Daniel Whitenack and Chris Benson about making artificial intelligence practical, productive, and accessible to everyone. You'll hear from AI influencers and practitioners, and they'll keep you up to date with the latest news and resources
[00:01:40] Unknown:
so you can cut through all the hype. As you were at the Thanksgiving table with your friends and family, were you talking about the fear of AI? Well, I wasn't at the Thanksgiving table because my wife has forbidden me from doing so. Oh, it's off limits for me, lest I drive
[00:02:08] Unknown:
podcasts. You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with O'Reilly Media for the Strata Conference in San Francisco on March 25th and the Artificial Intelligence Conference in New York City on April 15th. In Boston, starting on March 17th, you still have time to grab a ticket to the Enterprise Data World. And from April 30th to May 3rd is the Open Data Science Conference.
Go to pythonpodcast.com/conferences
[00:02:47] Unknown:
to learn more and to take advantage of our partner discounts when you register. Your host as usual is Tobias Macey. And today I'm interviewing Wes McKinney about his contributions to the Python community and his current projects to make data analytics easier for everyone. So, Wes, could you start by introducing yourself? Sure. I'm Wes McKinney. Most people know me as the creator of the Python Pandas project. So I've been doing
[00:03:10] Unknown:
development work for data analysis tools for a little over 10 years, and I'm very interested in open source software, in funding models for doing more open source software, and in making data processing systems
[00:03:29] Unknown:
more powerful and more accessible to normal people. And do you remember how you first got introduced to Python?
[00:03:34] Unknown:
I do. The very first time I heard of Python was when I was an undergrad at MIT. I was taking a class on algorithms. It was a computer science class. I wasn't a computer science major, but we were studying dynamic programming, and it was the only part of the course where we actually needed to write code. I had done a bit of Java, and I wrote my solution to the dynamic programming problem in Java, and it was maybe 150 lines of code or something. And a friend of mine, someone named Christine Corbett, told me that she was going to write her solution in Python. You know, she expected it would only be 20 or 30 lines, and I was like, how could that be? How could it be so short?
And she was right. And so that was the first time I became aware of Python as a programming language. That was in, like, 2005. But I didn't really start programming in Python until a couple of years later, in 2007, when I was working at AQR. A colleague of mine named Michael Wong had written some rudimentary distributed computing tools in Python. I was intrigued by the language from my prior experience with it, and I sort of went down the rabbit hole and started rewriting some legacy Perl code into Python. That was November or December 2007, something like that. And at this point, you have spent a large portion of your career on building tools and platforms for data science and data analytics, largely in the Python ecosystem.
[00:05:15] Unknown:
So I'm wondering what your overall motivation is for focusing on that particular problem domain.
[00:05:21] Unknown:
Yeah. Well, you know, I have a mathematics background, so I didn't do a lot of programming in the past. People talk about how they learned to program when they were, like, 10 or 11 years old or something like that, but I really didn't start programming until much later in life, because I realized at some point that I needed to be able to program in order to be useful. In my first job, working at AQR Capital Management up in Greenwich, Connecticut, I was struck by how difficult it was to go about very basic data analysis problems. I saw how much time people were spending using Microsoft Excel. I was learning SQL, and I felt that for a lot of basic data wrangling, SQL felt very clumsy. We were working with a lot of time series data.
Also, that felt extremely clumsy. And just in general, it felt like there was a disconnect between the ability of your mind to think about what you want to do with the data and the actual tools to carry out your analysis. And so I got really interested in building tools for myself to be more productive, to speed up the whole process of translating "I have an idea about what I want to do with the data" into "I have working code to do that." And as I've expanded, I've been more focused on the needs of other people, looking at how people are working with data and how to make them more productive. I find it to be a kind of virtuous cycle: you build tools, you get feedback from people, you see if it makes their work or their lives better, and then you incorporate that feedback into your process.
And as time has gone on, I've gone deeper and deeper into the underlying systems problems that are the underpinnings of tools like pandas and data analysis tools in general. Because after spending several years building very user-centric data analysis tools, things that normal people can pick up and use, like somebody who might be a Microsoft Excel user who wants to start writing some Python code instead, I found myself pretty limited in terms of what could be built in those user-facing tools, and the constraints lie in the systems domain. Some people have asked me, why are you working on all these systems problems? It's because they directly impact the kinds of tools that can be built for end users.
[00:08:06] Unknown:
And a large variety of the tools and projects that you've worked on are open source, and at various points you've worked for different businesses or run your own. And I'm wondering what your experience has been overall in terms of the differences as far as sustainability and levels of sophistication that are possible based on different environments and funding models and sustainability
[00:08:31] Unknown:
models for the different projects that you've worked on? Yeah. So I've been through pretty much all of the different standard funding models for building open source software. Nowadays, if you consider where most of the funding for open source is coming from, it's largely coming from corporations that are allowing their employees to work on open source projects, either as their full-time job or spending anywhere from 20% to 80% of their time working on one or more open source projects. In a sense, that was my first exposure. I worked for an investment manager, a financial firm, AQR.
That was 2007 through 2010. They effectively were the initial funding for Pandas because they paid my salary. After that, I moved into the next kind of environment, where funding comes from academia. I spent a year as a PhD student at Duke University and continued to do some open source development, but I found myself basically split between doing research and doing software development, and I found that a bit problematic. I've also done consulting, and sometimes open source gets funded through consulting work. After I dropped out of grad school, I moved back to New York and did some consulting work with AppNexus, which is an ad tech company, and some other organizations. Basically, I was looking for people that were trying to do more data analysis work in Python, and I used that work with them to influence the development road map for pandas. I also started a company with one of the early pandas developers, Chang She. He and I started DataPad, a venture-backed startup, and we were building a visual analytics tool whose back-end technology was all the Python data stack, including pandas and a number of other things. We ended up shutting down the project in 2014, but part of our objective in starting the company was to direct some of our R&D budget and engineering time back into supporting the underlying open source projects. So there are challenges with all of those funding models: consulting, academia, startup, single-company funding.
And, just taking working for a company as an example, one of the challenges that open source developers have is that they may run into a conflict of priorities with their parent company. They may feel at liberty to work on the problems that are directly relevant to their company's business, but for maintenance, building features, and fixing bugs that do not directly impact the business or don't have, quote, unquote, ROI, return on investment, they may find it more difficult to prioritize those things.
And a lot of the work in making open source software projects successful is very unglamorous and falls into this category of things that are, let's call it, hard to explain to your boss in terms of how you're spending your time. Doing code reviews and fixing esoteric bugs that people report may not, on paper, seem like a high priority, but that really is the core stuff that makes open source software projects successful: that grind of taking care of all of the little things and making sure that the project as a whole is healthy, and not just your little corner of the project
[00:12:19] Unknown:
that's immediately relevant to the applications that you're working on. And another issue that has been coming up more recently, as a larger number of businesses are starting to become comfortable with open source, is the idea of corporate-driven open source, with projects such as TensorFlow being top of mind, where the needs of the organization potentially outweigh the needs of the community. And so there can be some conflict of interest in terms of how the project is progressing, or who the primary developers and maintainers are on the project, that might not necessarily be conducive to the long-term health and sustainability of the project.
[00:12:59] Unknown:
Right. Yeah. So you're talking about governance, and a project's governance is certainly affected a lot by where the money is coming from. And this is part of why I've become such a big fan of projects in the Apache Software Foundation, because it forces a community-centric governance model on projects that may be largely corporate funded. There are some bad patterns that you see in corporate-driven open source projects, like, let's just say, private discussions and throwing code over the wall.
So sometimes you'll see companies where a project will be, quote, unquote, open source, but maybe there's, like, a monthly or quarterly code dump, all of the code reviews are private, many of the developer discussions are private, and there may not be that much of an opportunity for people in the community to give their feedback and to be involved in the process. So, really, I think that the process that produces open source software is just as important as the code itself, and you do see those struggles in corporate-driven projects. You mentioned TensorFlow, and I think that was a criticism early on. In recent times, TensorFlow has instituted a formalized process for collecting community feedback, essentially to help get design discussions and new initiatives out in the open so that there's an opportunity for people to feel like they're involved in the process. But if you go on the TensorFlow GitHub and you look at the contributors, you can see that there's a huge number of contributions coming from the TensorFlow Gardener account, which effectively means contributions from internal Googlers.
And so, whether or not it's easy to participate in the process that produced those, I'm just looking at it now: there are something like 13,700 commits from the TensorFlow Gardener. It may or may not be easy for individuals, somebody who does not work at Google, to be a full-fledged participant in that process.
[00:15:29] Unknown:
And going back to the overall topic of data analysis and some of the challenges that are present in that overall problem domain, what have you found to be some of the common issues that practitioners in that area are dealing with
[00:15:45] Unknown:
in terms of the overall work environments and software projects that you've been involved with? Yeah. So the challenges that I've been most interested in, or that I've gravitated towards, are usability and accessibility: API design and the user experience of working with data. One of the reasons why Pandas is so popular is not only that it has a ton of features, all of the kinds of data manipulations that you need to work with real-world datasets, but that the ergonomics, the usability of the library, is good. With relatively concise and easy-to-read code, you can perform quite complex data analysis tasks. Related to these things, I've also been very interested in performance and scalability.
On performance: because pandas is a tool for relatively small data, single-digit gigabytes and down is kind of the recommended size for pandas, a lot of the performance work there has been around making the software more interactive. The difference between something taking 5 seconds and 500 milliseconds, a 10x improvement, can be pretty huge, and so we've done a lot of work to expand the sweet spot for what interactivity means. I've also been interested a lot in interoperability. You may not be doing 100% of your work in Python, and all of your data may not fit into memory in Python, so we need to be able to take advantage of SQL-based systems and distributed processing frameworks like Apache Spark and Apache Hadoop. But there are a lot of challenges associated with having a workflow that involves multiple storage systems or processing environments: supporting different file formats, interacting with different processing frameworks, and just going about your day-to-day work. If you sit down and try to use these systems to get something done, you run into a lot of rough edges. My process is basically: I try to do things, and if something seems hard or seems like a rough edge, I'll take note of that. If you're very diligent about keeping notes and tracking why something is hard, you ask: can we make this easier? Can we make this faster?
Can we make this use less memory? Can we make this more interoperable? You accumulate a pretty long list of things that are imperfect or could be made better. The solution to fixing those things may not always be obvious, but if you perceive something to be suboptimal or imperfect, it's at least good to make yourself aware of that and try to figure out a way to make things better. If the status quo has something we don't like, how can we try to change that? And maybe you don't like something and you find that it's just as good as it can get. I find that when I ask people about a tool that works a certain way and maybe doesn't work that well, a lot of people have become somewhat dull to the difficulties that they experience. Paul Graham, not that I'm a big Paul Graham supporter, has this essay called "Schlep Blindness," and it's the idea that after a while, people often stop seeing the tedium and the difficulties that they experience in their work. They just accept that a certain level of unpleasantness is endemic to the process. A classic example of this is Microsoft Excel; it's sort of a Stockholm syndrome kind of thing where, after a while, you may stop asking yourself: what can we do better than this?
[00:19:47] Unknown:
And are there any other areas of data science and data analytics that you think are not receiving the attention that they deserve or the support or funding that we should be providing to be able to bring all of our capabilities forward and improve the types of systems that we're able to build, whether in terms of tooling or just general research or awareness?
[00:20:14] Unknown:
Yeah. Well, I guess my answer is a little bit biased. But if you look objectively at where a lot of the money is going and where a lot of the hype and marketing is going, it's a bit skewed towards machine learning and, quote, unquote, AI: machine learning frameworks, deep learning. There's a huge amount of money being invested in that. And comparatively, a lot less money and effort and attention is being given to some of the more fundamental problems in data access and interoperability.
So just really basic things like reading data files. If you consider the public cloud providers, Google, Amazon, and Microsoft, they support five primary open file formats for data warehousing: CSV, JSON, Avro, Parquet, and ORC. And if you look at the quality of libraries for dealing with those file formats, and dealing with them in the context of using a cloud platform, the software is really not very good. It sort of leads you to scratch your head: isn't the problem of reading and writing datasets pretty fundamental, and why don't more people work on that? It really perplexes me a lot.
That happens to be what I'm working on in large part. Part of the reason why I'm working on it is that I feel it is underattended, and it is something that deeply impacts the productivity of users. When there's a lot of friction in really basic data access and data manipulation, it tends to close things off: people will choose not to pursue different development approaches to problems because they run into these roadblocks just dealing with the data in a really basic way.
So I'm really interested in having robust, reusable, high-quality solutions to data access and to dealing with datasets in a multi-language setting. So not just for Python, but for all programming languages.
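As a small illustration of the uneven state of those file formats (a sketch with made-up sample data, not production code): using only the Python standard library, the text formats CSV and JSON round-trip easily, but CSV silently loses type information, and the schema-carrying columnar formats (Avro, Parquet, ORC) aren't covered at all without third-party libraries such as fastavro or pyarrow.

```python
import csv
import io
import json

# A tiny, hypothetical table; real datasets would stream from disk or object storage.
rows = [
    {"symbol": "AAPL", "price": 190.5},
    {"symbol": "MSFT", "price": 410.2},
]

# CSV: stdlib support is ubiquitous, but types are lost (every value comes back as text).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["symbol", "price"])
writer.writeheader()
writer.writerows(rows)
round_tripped = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(type(round_tripped[0]["price"]))  # <class 'str'> -- the float became a string

# JSON lines: preserves basic types, but carries no schema and no columnar layout.
jsonl_text = "\n".join(json.dumps(r) for r in rows)
recovered = [json.loads(line) for line in jsonl_text.splitlines()]
print(type(recovered[1]["price"]))  # <class 'float'>

# Avro, Parquet, and ORC add schemas and (for the latter two) columnar layouts,
# but reading them requires third-party libraries -- the library-quality gap
# discussed above.
```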
[00:22:55] Unknown:
And on that front, you've been dedicating a lot of your time and attention to the Arrow project, which I know initially started off as a way of being able to share data frames in memory between Python and R, and has now expanded into the realms that you were discussing of data access and interoperability with different data formats. So I'm wondering if you can talk a bit about some of the cases where a Python developer would be interested in leveraging Arrow, and in using its capability for incorporating capabilities from other runtimes, such as a particular analysis suite in R or something in Julia, versus reimplementing it in Python or finding a different Python library that does some measure of the same things.
[00:23:42] Unknown:
Sure. So the Apache Arrow project, just to give a very brief history of how things got started: I had been building the DataPad company with Chang She and our team in 2013 and 2014. And we felt like we were boiling the ocean in a number of ways, working on a lot of systems problems around low-latency and high-performance analytics in the cloud, and we developed a bunch of column-based, or columnar, analytics tools to power the DataPad application. We decided to join Cloudera at the end of 2014 to spend more of our time working on systems problems for data science. And my initial appointment when I landed there was to come up with a plan to make Python more of a first-class citizen in the big data world, that is, in the Hadoop and Spark ecosystem. One of the things that struck me pretty much right out of the gate was how fragmented the technology was. And this is just a function of things being open source and there being lots of different corporate players. So even though hundreds of millions of dollars have been invested in open source big data projects, the level of interoperability was still not very good in terms of sharing data and using multiple computing frameworks in a single application.
It was also very Java-centric; a lot of these systems were written in Java, or they were more black-boxy in a sense. I was working with the Apache Impala team, it was still Cloudera Impala back then, and I was interested in plugging Python into Impala, and found that for really basic issues, like how do we move data between Impala and Python, there was no off-the-shelf technology to do that in a standardized way. So I spent a large part of 2015 gathering a group of open source developers to see if there was interest in defining a standardized data representation, basically a standard data frame that was language independent. It could be used in Java, in Python, in C++, in R, really in any language: a technology that we could use to essentially tie the room together. So that was the rationale for creating the project. And the reason that having this common data format is so important is that it gives you something to collaborate on. Traditionally, if you consider two systems, they write libraries to read and write datasets.
They write algorithms to perform analytics on the datasets. They write messaging layers to move datasets around in a distributed system, around on a network. All of the code and the libraries that people write are, in general, specialized to the way that the data is represented in memory in the runtime environment. By defining this standardized representation, it allows us to create reusable libraries and to use libraries written in different programming languages in process, without any overhead. So now, a few years later, we can use C++ and LLVM to process data that originates in the JVM without any serialization.
And so it's the kind of thing that we always dreamed of, but it's just extremely difficult, because you have to define all of these standards and ways of communicating large datasets in a language-agnostic way. And as we've built out the project, our goal has expanded beyond just defining an open standard for tabular datasets, a.k.a. data frames, to building essentially a polyglot development platform for building data processing applications. If you're working with tabular datasets and you're working in Python or C++, there are building blocks: all of the building blocks that you need to do analytics, to read and write datasets, to send datasets around in a distributed system. These are the basic pieces that you could use to build a data frame library like Pandas, or that you could use to build something more sophisticated like an analytic database.
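The payoff of a standardized columnar representation can be sketched in plain Python, with the caveat that this is a deliberately simplified model (Arrow itself specifies exact bit-level layouts, validity bitmaps for nulls, and metadata, none of which appear here): each column lives in one contiguous, typed buffer, so a column scan avoids per-row objects, and another runtime could read that buffer in place through a pointer rather than by copying and converting.

```python
from array import array

# Row-oriented: each record is a dict; scanning one column touches every row object.
rows = [
    {"id": 1, "value": 10.0},
    {"id": 2, "value": 12.5},
    {"id": 3, "value": 9.75},
]
row_total = sum(r["value"] for r in rows)

# Column-oriented: each column is a single contiguous, typed buffer.
columns = {
    "id": array("q", [1, 2, 3]),              # 64-bit signed integers
    "value": array("d", [10.0, 12.5, 9.75]),  # 64-bit floats
}
col_total = sum(columns["value"])
assert row_total == col_total  # same data, different memory layout

# Because the column is one buffer, another runtime (C++, R, Rust, ...) could
# read it in place; memoryview gives a zero-copy window onto that buffer.
view = memoryview(columns["value"])
print(view.nbytes)  # 24: three float64 values, no per-object overhead
```

The standardized part that Arrow adds on top of this idea is agreement, across languages, on exactly what those buffers look like, which is what makes the in-process, serialization-free sharing described above possible.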
And so for me, the interest is partially in improving interoperability broadly across the big data world, but I'm also interested in consolidating efforts within the broader data science community. So that's the Python world, the R world. We have a pretty significant contingent of Ruby developers who are really interested in having data science tools for Ruby. We're building almost all of the core computational system software in C++, and then we build relatively thin bindings to those libraries that we can use in Python and R and Ruby.
And in other languages as well; we also have MATLAB bindings to the C++ libraries. So it's really cool that we can implement a feature once, keep improving that implementation, and have that code immediately reusable in all of these places. I believe that as time goes on, we'll start seeing more and more data processing systems that are formed from heterogeneous computational components. You might see a system that includes some Rust and some C++ and maybe even some Java, and that's possible because we have this unifying technology at the data level. So I'm really very excited about that. But as you can imagine, it's one of these frighteningly ambitious projects that hasn't really happened in the past, and part of the reason it hasn't happened is that it's so difficult. And I'm sure that one of the biggest challenges that you're facing in the process
[00:30:00] Unknown:
of working on and helping to shepherd the Arrow project is identifying which concerns belong to Arrow specifically and which should be relegated to those other language ecosystems: whether you should rely on Arrow purely as a method of data interchange, or include some lightweight analytical capabilities there to allow for push-downs from things like R or Python. And any challenges as far as replicated or wasted effort between those different language communities, and raising awareness of which capabilities exist where and how to incorporate them into a larger system.
[00:30:43] Unknown:
Right. We definitely have to strike a balance in terms of what problems we're taking on in the Apache Arrow project, and we've been pretty deliberate about where we draw the line. Where we've mostly been drawing the line, in terms of what is an Arrow problem and what is a downstream consumer problem, is largely on the user interface side of things. We don't want to be prescriptive about how the libraries are consumed by end users. Some people have asked, are we going to expect a pandas-like library to exist inside the Apache Arrow project?
And the answer is probably not, but that is the type of thing, like a next-generation pandas, a.k.a. what we'd call pandas 2, that would exist as a separate open source project that utilizes the Arrow runtime libraries to create its implementation. We want people to build many different kinds of front-end interfaces to the technology that we're building in the Arrow project. We don't want to say you have to write SQL, or you have to use this one particular data-frame-like API; there's flexibility as far as user interface. But the key thing is that the components in the project are reusable and have clean public APIs. If you just need to read and write datasets in Arrow format, you can do that. If you're mostly concerned with in-memory query processing, in-memory analytics on in-memory data, you can use just those libraries, and maybe you have your own serialization or data access layer that is proprietary to your application.
You can still do that. You don't have to accept a particular storage system. You don't have to store things as Parquet files in order to make use of the query engine components that we're building. But one of the big pushes over the next couple of years is creating a full-fledged query engine: basically, parallel execution of analytical queries against in-memory and on-disk datasets. That could be used to execute SQL, but also to evaluate pandas-type data frame operations: group-by aggregates, filtering, column expressions.
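[Editor's note: a toy, pure-Python illustration of the kinds of kernels such a query engine composes, here a filter followed by a group-by aggregate over columnar data. This is not Arrow code; all names are invented for the example.]

```python
from collections import defaultdict

# Two columns of a tiny table, stored column-wise.
keys = ["a", "b", "a", "b", "a"]
values = [1, 2, 3, 4, 5]

# Filter kernel: a boolean mask over the value column (keep value > 1).
mask = [v > 1 for v in values]

# Group-by aggregate kernel: sum the surviving values per key.
sums = defaultdict(int)
for keep, k, v in zip(mask, keys, values):
    if keep:
        sums[k] += v

print(dict(sums))  # {'a': 8, 'b': 6}
```

A real engine would run kernels like these in parallel over large chunks of columnar memory; the composition of filter, projection, and aggregation is the same idea.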
So right now we're laying down the building blocks of an end-to-end in-memory and out-of-core query engine. That's the major growth area for the project for the next two to three years. But you did mention the multi-language aspect, and it's notable that there is a parallel effort to build a native query engine for Arrow in Rust, and I'm happy to encourage that. If we end up with native query engines written in multiple languages in the project, I actually think that's a really great thing, because we can learn from each other, not only about implementation strategies but also about design. What is idiomatic Arrow code in C++ may translate well to Rust, or maybe not, and maybe vice versa. There are things to be learned about how to make the problem more tractable or build better software in Rust or in Go or in Java, and that can inform development in the other subcomponents of the project.
[00:34:16] Unknown:
And you were mentioning some plans for the yet-to-be-determined pandas 2.0 release. I know that you have spent a lot of your career working with and on pandas, being its initial author, and that it has now grown to be a larger community beyond yourself. I'm wondering, now that you've had some time with it and some space from it, what your current thoughts are on its overall success, and with that perspective, any thoughts on what you would have done differently if you were to start it over today? Yeah. So one, I guess, one
[00:34:54] Unknown:
elephant in the room that people always ask me about is the quote-unquote pandas 2 effort. At the end of 2015, I think it was, I started a discussion with the pandas core team, which you can find on the pandas-dev mailing list, about what kinds of changes or improvements we would like to make to the core library. Later, I wrote a long set of documents about a hypothetical pandas 2 project: what do we want, what problems do we want to solve, what problems do we not want to solve? As we've discussed as a community, the working plan is that the existing pandas project will live on effectively in perpetuity, given the millions of users and the years of code that depend on pandas, so development there is focused on stability and on having a mature, consistent, and reliable code base. If you look at the pandas 2 design documents that I wrote, along with a blog post called "Apache Arrow and the 10 Things I Hate About pandas", which is based on a talk I gave about five or six years ago on internal design issues in pandas, we've been laboring diligently in the Arrow project to address those systems problems that have affected the performance and scalability of pandas.
Our plan, effectively, is to create a sibling project, intended to be used as a complementary tool to pandas, that uses all of the Arrow technology under the hood and will be geared toward doing analytics on massive datasets: a little bit less flexible in its API, but designed to work in a scalable way with much larger datasets. But going back to your original question, my reflection on pandas: pandas will be 11 years old this April, so it has been around for a long time. I can't say that I would have done very much differently, thinking back on it. If anything, there are probably some things that I would have said no to, but when you're building a brand new open source project and you're excited to have people join the community and become users, you have a tendency to say yes to everything. So now there are some things in pandas that are being deprecated, maybe removed: things that seemed exciting once but haven't gotten as much maintenance love or use over the years. When you go back a decade, things were a lot different. The center of gravity in the ecosystem was in scientific computing and HPC, and people used NumPy for a lot of things. So it was in pandas' interest to have really tight interoperability with NumPy; if you didn't have that, your project would be more or less dead on arrival, because there would be too much friction for people to pick up and use the tool if they were already using NumPy.
Fast forward a decade, and some of that close relationship with NumPy has created problems, in the sense that a lot of the implementation details of pandas are exposed publicly, and that makes changing things really difficult. But the community, you know, Jeff Reback and Joris Van den Bossche and Tom Augspurger, they've done a really great job growing the community and also judiciously adding new features to make the project more extensible. Things like extension arrays landed in the project relatively recently, and that's opened up a lot of interesting opportunities to expand beyond the horizons of what's possible with vanilla NumPy.
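[Editor's note: a short concrete example of the extension-array feature mentioned above, assuming a pandas version with nullable dtypes (1.0+). The "Int64" dtype, capital I, lets an integer column hold missing values without being silently upcast to float64, which is what plain NumPy-backed columns do.]

```python
import pandas as pd

# A NumPy-backed column is forced to float64 to represent the missing value.
float_backed = pd.Series([1, None, 3])
print(float_backed.dtype)  # float64

# The nullable integer extension dtype keeps integers and uses pd.NA.
nullable = pd.array([1, None, 3], dtype="Int64")
print(nullable.dtype)   # Int64
print(nullable[1])      # <NA>
```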
Having null values in integer columns is something that's been enabled by extension arrays. There's also GeoPandas, which provides geographic data structures. So there's still a lot of exciting work happening in pandas, and I think the project has a lot of growth and work ahead of it. I don't see the Arrow work as being in conflict with pandas at all. I see a symbiotic relationship between the work we're doing in the Arrow project and pandas, in the sense that we want to make sure that if somebody has 100 gigabytes, or hundreds of gigabytes, of Parquet files, they're able to perform standard analytical queries on those datasets. And if at some point you need to cross over into pandas and build pivot tables and do classic pandas stuff, it's straightforward for you to do that. But the presumption is that at that point you're going to be working with smaller amounts of data. Pandas is more of a Swiss army knife than a chainsaw, if that makes sense. And
[00:40:02] Unknown:
a lot of your overall recognition in the Python community in particular, but also in data science at large, is related to pandas, and most recently you've been very involved with the Arrow project. But I'm wondering if there are any other achievements that you're particularly proud of that you'd like to call out, that people might not necessarily be as familiar with?
[00:40:23] Unknown:
Yeah. One project that we haven't talked about at all on the podcast yet is the Ibis project, I-b-i-s. It's a project that I started when I was at Cloudera, and the idea was to develop a fairly rich DSL, a domain-specific language, for interacting with, initially, SQL-based systems. I wanted to build something that was very similar in its API to pandas and could be adopted by a pandas user, but was lazy: you would create strongly typed, well-typed expressions, and you could essentially take any SQL query, rewrite it as an Ibis expression, and write it with Python code.
I contend that Ibis is probably one of the most underappreciated things that I've worked on, because its users are big fans of it: it makes writing really complex SQL a lot more tractable, and it brings a level of code reuse to SQL analytics that you can't really get in SQL itself, where the way you reuse code is by copying and pasting. It's something I'm quite proud of, and I think the work in designing the expression language of Ibis is also influencing the work we're doing in the Arrow project, because we also need a DSL for writing down deferred analytical expressions in C++. The fact that I have a fully formed DSL that is a superset of SQL and can express even very complex SQL concepts, like correlated subqueries, things that traditionally are very hard to even think about in pandas, gives us a head start on thinking about how to map declarative SQL analytics onto the imperative, functional approach of Python: essentially, composability and function calls.
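[Editor's note: a toy sketch of the deferred-expression idea described above, not the real Ibis API. Expressions are built lazily as Python objects and only rendered to SQL text when asked; every class and method name here is invented for illustration.]

```python
class Column:
    """A reference to a table column; comparisons build predicate strings lazily."""
    def __init__(self, name):
        self.name = name

    def __gt__(self, value):
        return f"{self.name} > {value}"

class Table:
    """A deferred table expression that renders to SQL only when asked."""
    def __init__(self, name):
        self.name = name

    def filter(self, predicate):
        return f"SELECT * FROM {self.name} WHERE {predicate}"

# Build the expression with ordinary Python syntax, then render it.
t = Table("events")
sql = t.filter(Column("amount") > 100)
print(sql)  # SELECT * FROM events WHERE amount > 100
```

Real Ibis expressions are typed and composable rather than plain strings, which is what enables code reuse and rewriting across SQL back ends.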
So I think it's an interesting research project and something I hope more people take a look at. It's been one of these sleeper projects that has been growing and developing over time; it has expanded to support a lot of different SQL systems and now has an in-memory pandas back end. Phillip Cloud, from the pandas project, has done a lot of amazing work on that. Krisztián Szűcs, who is also getting involved in Apache Arrow as a PMC member, has been working with me on the project as well; he's done a lot of work on Ibis. So it's a cool project. But I've tended to concentrate my development work in a few areas.
[00:43:16] Unknown:
So pandas and Ibis and the statsmodels project, outside of the Arrow project, are where most of my development work has gone. And it's also worth at least an honorable mention that you wrote the book Python for Data Analysis. If anybody hasn't read that, it's probably worth taking a look at as a means of getting introduced to the ecosystem.
[00:43:39] Unknown:
That's true. Yeah. I wrote my book, Python for Data Analysis, essentially concurrently with the development of pandas, which was a bit risky, thinking back on it. This was 2012, when I was mostly writing the book, and I did a second edition a couple of years ago to update it for Python 3 and for the latest version of pandas. So if you're learning pandas, or you want to learn more about data analysis and Python, the book is definitely a good resource for that.
[00:44:13] Unknown:
And looking forward, what do you have in terms of grand ambitions for the future of the data science community, both inside and outside of the Python ecosystem? And within that, any projects that you are particularly excited to be working on in the near to medium term?
[00:44:31] Unknown:
Yeah. So I helped start the Arrow project while I was at Cloudera in 2015. In 2016, I moved across the country and spent a couple of years working with Two Sigma on systems for data science, and they're really gracious supporters of the Apache Arrow work. We developed integration between Spark and Arrow to make Python on Spark faster, and we collaborated with IBM on that work. That was a really exciting collaboration to see come together and be successful.
And last year, in order to scale up the work in Apache Arrow, I partnered with RStudio and Two Sigma to create a new organization called Ursa Labs, a not-for-profit group that enables me to put people to work full time on the Arrow project. The mission for Ursa Labs is to develop the Arrow platform as shared computational infrastructure for data science. My grand ambition for all of this is to have a really powerful computational runtime for data science, for analytics, data access, and feature engineering for machine learning and statistical applications, that is uniformly accessible from R and Python and Ruby and the different languages that people want to use for data science. It's the kind of thing that hasn't really been possible in the past because of the points of friction that made it difficult to share code between these programming languages.
Also, part of that ambition is to foster collaboration between the data science world and the database systems community, because part of what's missing from data science is the level of computer science and computational systems work that has been done in the analytic database world for the last 20 or 25 years. Very little of that systems work has made its way into the hands of everyday data scientists. So part of the goal of the Arrow project is to create reusable libraries that provide the level of performance and scalability that you have in modern analytic databases, but put those at the fingertips of everyday data scientists, and to essentially liberate individuals from being tied to a particular programming language: being able to use multiple programming languages without having to be too concerned about whether things are going to run ten times slower in one language versus another. I think things are on their way toward that goal, and setting up the Ursa Labs organization helps provide some scalability in terms of building relationships with more corporations that wish to fund the Arrow work, so that we can build a bigger and bigger team as time goes on. So if anyone listening has an interest in funding this work or helping us move faster, you can definitely reach out to me on Twitter or any of the standard communication channels. And for anybody who does want to get in touch, either to offer funding or support or because they're interested
[00:48:00] Unknown:
in working with you on the Ursa Labs mission, I'll have you add your preferred contact information to the show notes. And to close out the show, is there anything else that we didn't discuss yet that you think we should cover, or any parting advice that you have for active or aspiring data scientists, or any resources that you'd like to recommend?
[00:48:21] Unknown:
I don't think so. But in terms of where we are on our historical timeline, if you think about what life might be like in 2050 and where we are right now, it still feels like we're a bit in the wild west in terms of the development of systems and tools for doing data science. I think we have a long way to go. The more open dialogue we have around these problems, and the more we collect ideas and coalesce efforts in building the open source software and tools and systems, the more progress we'll make. I like to tell people that I want the future to get here faster. I don't want time to pass more quickly, but if we could advance human progress on these kinds of tools by a few years here and there, I think that would be pretty great, because it means we'll be able to do more interesting science and make things in the world just a little bit better, which I think is something we should all care about. Alright. Well, with that, I'll move us on into the picks. This week, I'm going to choose the author Roald Dahl. He has written a number of great books for all ages, and I have enjoyed them for many years. If you have never read any of his books, I definitely highly recommend them.
[00:49:57] Unknown:
The BFG is a great one. James and the Giant Peach. Pretty much anything he's ever written is good fun. So with that, I'll pass it to you, Wes. Do you have any picks this week?
[00:50:06] Unknown:
Well, after having it recommended to me several times over the years, I just finished reading The Soul of a New Machine by Tracy Kidder. It's a classic book about computer engineering from the early 1980s. The book is a bit dated, but I think for anyone who's in software or computer engineering, it's a good read; it helps you understand the motivation that drives engineers to build things. I found it to be a pretty enlightening profile of some people who worked very hard, over a very short timeline, to build a new computer system. So it gets a high recommendation from me. Well, thank you very much for that recommendation. I'll have to take a look at it, and I appreciate you taking the time today to join me and share your
[00:51:01] Unknown:
experiences working with and building tools for the data science and data analytics communities. I have used the outputs of your labors a number of times, and I'm sure a number of other people have as well. So thank you for that, and I hope you enjoy the rest of your day. Great. Thank you. Thanks for the conversation.
Introduction and Overview
Wes McKinney's Background and Python Journey
Motivations for Data Science Tools
Open Source Funding Models
Corporate Driven Open Source
Challenges in Data Analysis
Underfunded Areas in Data Science
Apache Arrow Project
Balancing Arrow's Scope
Future of Pandas and Pandas 2.0
Other Notable Projects
Grand Ambitions for Data Science
Closing Thoughts and Advice