Summary
Python has become one of the dominant languages for data science and data analysis. Wes McKinney has been working for a decade to make tools that are easy and powerful, starting with the creation of Pandas, and eventually leading to his current work on Apache Arrow. In this episode he discusses his motivation for this work, what he sees as the current challenges to be overcome, and his hopes for the future of the industry.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
- Check out the Practical AI podcast from our friends at Changelog Media to learn and stay up to date with what’s happening in AI
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with O’Reilly Media for the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th. Here in Boston, starting on March 17th, you still have time to grab a ticket to the Enterprise Data World, and from April 30th to May 3rd is the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Your host as usual is Tobias Macey and today I’m interviewing Wes McKinney about his contributions to the Python community and his current projects to make data analytics easier for everyone
Interview
- Introductions
- How did you get introduced to Python?
- You have spent a large portion of your career on building tools for data science and analytics in the Python ecosystem. What is your motivation for focusing on this problem domain?
- Having been an open source author and contributor for many years now, what are your current thoughts on paths to sustainability?
- What are some of the common challenges pertaining to data analysis that you have experienced in the various work environments and software projects that you have been involved in?
- What area(s) of data science and analytics do you find are not receiving the attention that they deserve?
- Recently there has been a lot of focus and excitement around the capabilities of neural networks and deep learning. In your experience, what are some of the shortcomings or blind spots to that class of approach that would be better served by other classes of solution?
- Your most recent work is focused on the Arrow project for improving interoperability across languages. What are some of the cases where a Python developer would want to incorporate capabilities from other runtimes?
- Do you think that we should be working to replicate some of those capabilities into the Python language and ecosystem, or is that wasted effort that would be better spent elsewhere?
- Now that Pandas has been in active use for over a decade and you have had the opportunity to get some space from it, what are your thoughts on its success?
- With the perspective that you have gained in that time, what would you do differently if you were starting over today?
- You are best known for being the creator of Pandas, but can you list some of the other achievements that you are most proud of?
- What projects are you most excited to be working on in the near to medium future?
- What are your grand ambitions for the future of the data science community, both in and outside of the Python ecosystem?
- Do you have any parting advice for active or aspiring data scientists, or resources that you would like to recommend?
Keep In Touch
- wesm on GitHub
- Website
- @wesmckinn on Twitter
Picks
- Tobias
- Wes
- The Soul of a New Machine by Tracy Kidder
Links
- Ursa Labs
- Pandas
- Podcast Interview with Jeff Reback
- Pandas Extension Arrays Interview with Tom Augspurger
- AQR Capital Management
- Distributed Computing
- SQL
- Excel
- Duke University
- AppNexus
- Chang She
- Ibis
- Open Source Governance
- Apache Software Foundation
- Paul Graham
- Schlep Blindness
- Big Data File Formats
- Apache Arrow
- Hadoop
- Spark
- Apache Impala
- R Language
- Ruby
- Rust
- Pandas 2.0 Design Docs
- Apache Arrow and the 10 Things I Hate About Pandas
- GeoPandas
- Statsmodels
- Python For Data Analysis by Wes McKinney
- Two Sigma
- RStudio
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Today, I'm interviewing Wes McKinney about his contributions to the Python community and his current projects to make data analytics easier for everyone. Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API, you've got everything you need to scale. And for those tasks that need fast computation, such as training machine learning models or building your deployment pipeline, they just launched dedicated CPU instances.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, to get a $20 credit today and launch a new server in under a minute. And don't forget to say thanks for their continued support of the show. And don't forget to visit the site at pythonpodcast.com to subscribe to the show, sign up for the newsletter, and read the show notes. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. And to keep the conversation going, go to pythonpodcast.com/chat. To learn and stay up to date with what's happening in artificial intelligence, check out this podcast from our friends over at the Changelog.
[00:01:26] Unknown:
Practical AI is a show hosted by Daniel Whitenack and Chris Benson about making artificial intelligence practical, productive, and accessible to everyone. You'll hear from AI influencers and practitioners, and they'll keep you up to date with the latest news and resources
[00:01:40] Unknown:
so you can cut through all the hype. As you were at the Thanksgiving table with your friends and family, were you talking about the fear of AI? Well, I wasn't at the Thanksgiving table because my wife has forbidden me from doing so. Oh, it's off limits for me, lest I drive
[00:02:08] Unknown:
podcasts. You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with O'Reilly Media for the Strata Conference in San Francisco on March 25th and the Artificial Intelligence Conference in New York City on April 15th. In Boston, starting on March 17th, you still have time to grab a ticket to the Enterprise Data World. And from April 30th to May 3rd is the Open Data Science Conference.
Go to pythonpodcast.com/conferences
[00:02:47] Unknown:
to learn more and to take advantage of our partner discounts when you register. Your host as usual is Tobias Macey. And today I'm interviewing Wes McKinney about his contributions to the Python community and his current projects to make data analytics easier for everyone. So, Wes, could you start by introducing yourself? Sure. I'm Wes McKinney. Most people know me as the creator of the Python Pandas project. So I've been doing
[00:03:10] Unknown:
development work for data analysis tools for a little over 10 years, and I'm very interested in open source software, in funding models for doing more open source software, and in making data processing systems
[00:03:29] Unknown:
more powerful and more accessible to normal people. And do you remember how you first got introduced to Python?
[00:03:34] Unknown:
I do. The very first time I heard of Python was when I was an undergrad at MIT. I was taking a class on algorithms. It was a computer science class. I wasn't a computer science major, but we were studying dynamic programming, and it was the only part of the course where we actually needed to write code. I had done a bit of Java, and I wrote my solution to the dynamic programming problem in Java, and it was maybe 150 lines of code or something. And a friend of mine, someone named Christine Corbett, told me that she was going to write her solution in Python. You know, she expected it would only be 20 or 30 lines, and I was like, how could that be? How could it be so short?
And she was right. And so that was the first time I became aware of Python as a programming language. That was in, like, 2005. But I didn't really start programming in Python until a couple of years later, in 2007, when I was working at AQR. A colleague of mine named Michael Wong had written some rudimentary distributed computing tools in Python. I was intrigued by the language from my prior experience with it, and I sort of went down the rabbit hole and started rewriting some legacy Perl code into Python. That was November or December 2007, something like that. And at this point, you have spent a large portion of your career on building tools and platforms for data science and data analytics, largely in the Python ecosystem.
[00:05:15] Unknown:
So I'm wondering what your overall motivation is for focusing on that particular problem domain.
[00:05:21] Unknown:
Yeah. Well, you know, I have a mathematics background, so I didn't do a lot of programming in the past. People talk about how they learned to program when they were, like, 10 or 11 years old or something like that, but I really didn't start programming until much later in life, because I realized at some point that I needed to be able to program in order to be useful. In my first job, working at AQR Capital Management up in Greenwich, Connecticut, I was struck by how difficult it was to go about very basic data analysis problems. I saw how much time people were spending using Microsoft Excel. I was learning SQL, and I felt that for a lot of basic data wrangling, SQL felt very clumsy. We were working with a lot of time series data.
Also, that felt extremely clumsy. And just in general, it felt like there was a disconnect between the ability of your mind to think about what you want to do with the data and the actual tools to carry out your analysis. And so I got really interested in building tools for myself to be more productive, to speed up the whole process of translating "I have an idea about what I want to do with the data" into "I have working code to do that." And as I've expanded, I've been more focused on the needs of other people, looking at how people are working with data and how to make them more productive. I find it to be a kind of virtuous cycle: you build tools, you get feedback from people, you see if it makes their work or their lives better, and then you incorporate that feedback into your process.
And as time has gone on, I've gone deeper and deeper into the underlying systems problems that are the underpinnings of tools like pandas and data analysis tools in general. Because after spending several years building very user-centric data analysis tools, things that normal people can pick up and use, like somebody who might be a Microsoft Excel user who wants to start writing some Python code instead, I found myself pretty limited in terms of what could be built in those user-facing tools, and the constraints lie in the systems domain. Some people have asked me, why are you working on all these systems problems? It's because they directly impact the kinds of tools that can be built for end users.
[00:08:06] Unknown:
And a large variety of the tools and projects that you've worked on are open source, and at various points you've worked for different businesses or run your own. And I'm wondering what your experience has been overall in terms of the differences as far as sustainability and levels of sophistication that are possible based on different environments and funding models and sustainability
[00:08:31] Unknown:
models for the different projects that you've worked on? Yeah. So I've been through pretty much all of the different standard funding models for building open source software. Nowadays, if you consider where most of the funding for open source is coming from, it's largely coming from corporations that are allowing their employees to work on open source projects, either as their full-time job or spending anywhere from 20% to 80% of their time working on one or more open source projects. In a sense, that was my first exposure. I worked for an investment manager, a financial firm, AQR.
That was 2007 through 2010. They effectively were the initial funding for Pandas because they paid my salary. After that, I moved into the next kind of environment, where funding comes from academia. I spent a year as a PhD student at Duke University and continued to do some open source development, but I found myself basically split between doing research and doing software development, and I found that a bit problematic. I've also done consulting, and sometimes open source gets funded through consulting work. After I dropped out of grad school, I moved back to New York and did some consulting work with AppNexus, which is an ad tech company, and some other organizations. Basically, I was looking for people that were trying to do more data analysis work in Python, and I used that work with them to influence the development road map for pandas. I also started a company with one of the early pandas developers, Chang She. He and I started DataPad, a venture-backed startup, and we were building a visual analytics tool whose back-end technology was all the Python data stack, including pandas and a number of other things. We ended up shutting down the project in 2014, but part of our objective in starting the company was to direct some of our R&D budget and engineering time back into supporting the underlying open source projects. So there are challenges with all of those funding models: consulting, academia, startup, single-company funding.
And, just taking working for a company as an example, one of the challenges that open source developers have is that they may run into a conflict of priorities with their parent company. They may feel at liberty to work on the problems that are directly relevant to their company's business, but for maintenance, building features, and fixing bugs that do not directly impact the business or don't have, quote, unquote, ROI, return on investment, they may find it more difficult to prioritize those things.
And a lot of the work in making open source software projects successful is very unglamorous and falls into this category of things that are, let's call it, hard to explain to your boss in terms of how you're spending your time. Doing code reviews and fixing esoteric bugs that people report may not, on paper, seem like a high priority, but that really is the core stuff that makes open source software projects successful: that grind of taking care of all of the little things and making sure that the project as a whole is healthy, and not just your little corner of the project
[00:12:19] Unknown:
that's immediately relevant to the applications that you're working on. And another issue that has been coming up more recently, as a larger number of businesses are starting to become comfortable with open source, is the idea of corporate-driven open source, with projects such as TensorFlow being top of mind, where the needs of the organization potentially outweigh the needs of the community. And so there can be some conflict of interest in terms of how the project is progressing, or who the primary developers and maintainers are on the project, that might not necessarily be conducive to the long-term health and sustainability of the project.
[00:12:59] Unknown:
Right. Yeah. So you're talking about governance, and a project's governance is certainly affected a lot by where the money is coming from. And this is part of why I've become such a big fan of projects in the Apache Software Foundation, because it forces a community-centric governance model on projects that may be largely corporate funded. There are some bad patterns that you see in corporate-driven open source projects, like, let's just say, private discussions and throwing code over the wall.
So sometimes you'll see companies where a project will be, quote, unquote, open source, but maybe there's, like, a monthly or quarterly code dump, all of the code reviews are private, many of the developer discussions are private, and there may not be that much of an opportunity for people in the community to give their feedback and to be involved in the process. So, really, I think that the process that produces open source software is just as important as the code itself, and you do see those struggles in corporate-driven projects. You mentioned TensorFlow, and I think that was a criticism early on. In recent times, TensorFlow has instituted a formalized process for collecting community feedback, essentially to help get design discussions and new initiatives out in the open so that there's an opportunity for people to feel like they're involved in the process. But if you go on the TensorFlow GitHub and you look at the contributors, you can see that there's a huge number of contributions coming from the TensorFlow Gardener account, which effectively means contributions from internal Googlers.
And so, whether or not it's easy to participate in the process that produced those, I'm just looking at it now: there are something like 13,700 commits from the TensorFlow Gardener. It may or may not be easy for individuals, somebody who does not work at Google, to be a full-fledged participant in that process.
[00:15:29] Unknown:
And going back to the overall topic of data analysis and some of the challenges that are present in that overall problem domain, what have you found to be some of the common issues that practitioners in that area are dealing with
[00:15:45] Unknown:
in terms of the overall work environments and software projects that you've been involved with? Yeah. So the challenges that I've been most interested in, or that I've gravitated towards, are usability and accessibility: API design and the user experience of working with data. One of the reasons why Pandas is so popular is not only that it has a ton of features, all of the kinds of data manipulations that you need to work with real-world datasets, but that the ergonomics, the usability of the library, is good. With relatively concise and easy-to-read code, you can perform quite complex data analysis tasks. Related to these things, I've also been very interested in performance and scalability.
On performance: because pandas is a tool for relatively small data, single-digit gigabytes and down is kind of the recommended size for pandas, a lot of the performance work there has been around making the software more interactive. The difference between something taking 5 seconds and 500 milliseconds, a 10x improvement, can be pretty huge, and so we've done a lot of work to expand the sweet spot for what interactivity means. I've also been interested a lot in interoperability. You may not be doing 100% of your work in Python, and all of your data may not fit into memory in Python, so we need to be able to take advantage of SQL-based systems and distributed processing frameworks like Apache Spark and Apache Hadoop. But there are a lot of challenges associated with having a workflow that involves multiple storage systems or processing environments: supporting different file formats, interacting with different processing frameworks, and just going about your day-to-day work. If you sit down and try to use these systems to get something done, you run into a lot of rough edges. My process is basically: I try to do things, and if something seems hard or seems like a rough edge, I'll take note of that. If you're very diligent about keeping notes and tracking why something is hard, you ask: can we make this easier? Can we make this faster?
Can we make this use less memory? Can we make this more interoperable? You accumulate a pretty long list of things that are imperfect or could be made better. The solution to fixing those things may not always be obvious, but if you perceive something to be suboptimal or imperfect, it's at least good to make yourself aware of that and try to figure out a way to make things better. If the status quo has something we don't like, how can we try to change that? And maybe you don't like something and you find that it's just as good as it can get. I find that when I ask people about a tool that works a certain way and maybe doesn't work that well, a lot of people have become somewhat dull to the difficulties that they experience. Paul Graham, not that I'm a big Paul Graham supporter, has this essay called "Schlep Blindness," and it's the idea that after a while, people often stop seeing the tedium and the difficulties that they experience in their work. They just accept that a certain level of unpleasantness is endemic to the process. A classic example of this is Microsoft Excel; it's sort of a Stockholm syndrome kind of thing where, after a while, you may stop asking yourself: what can we do better than this?
[00:19:47] Unknown:
And are there any other areas of data science and data analytics that you think are not receiving the attention that they deserve or the support or funding that we should be providing to be able to bring all of our capabilities forward and improve the types of systems that we're able to build, whether in terms of tooling or just general research or awareness?
[00:20:14] Unknown:
Yeah. Well, I guess my answer is a little bit biased. But if you look objectively at where a lot of the money is going and where a lot of the hype and marketing is going, it's a bit skewed towards machine learning and, quote, unquote, AI: machine learning frameworks, deep learning. There's a huge amount of money being invested in that. And comparatively, a lot less money and effort and attention is being given to some of the more fundamental problems in data access and interoperability.
So just really basic things like reading data files. If you consider the public cloud providers, Google, Amazon, and Microsoft, they support five primary open file formats for data warehousing: CSV, JSON, Avro, Parquet, and ORC. And if you look at the quality of libraries for dealing with those file formats, and dealing with them in the context of using a cloud platform, the software is really not very good. It sort of leads you to scratch your head: isn't the problem of reading and writing datasets pretty fundamental, and why don't more people work on that? It really perplexes me a lot.
That happens to be what I'm working on in large part. Part of the reason why I'm working on it is that I feel it is underattended, and it is something that deeply impacts the productivity of users. When there's a lot of friction in really basic data access and data manipulation, it tends to close things off: people will choose not to pursue different development approaches to problems because they run into these roadblocks just dealing with the data in a really basic way.
So I'm really interested in having robust, reusable, high-quality solutions to data access and to dealing with datasets in a multi-language setting. So not just for Python, but for all programming languages.
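As a small illustration of the uneven state of those file formats (a sketch with made-up sample data, not production code): using only the Python standard library, the text formats CSV and JSON round-trip easily, but CSV silently loses type information, and the schema-carrying columnar formats (Avro, Parquet, ORC) aren't covered at all without third-party libraries such as fastavro or pyarrow.

```python
import csv
import io
import json

# A tiny, hypothetical table; real datasets would stream from disk or object storage.
rows = [
    {"symbol": "AAPL", "price": 190.5},
    {"symbol": "MSFT", "price": 410.2},
]

# CSV: stdlib support is ubiquitous, but types are lost (every value comes back as text).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["symbol", "price"])
writer.writeheader()
writer.writerows(rows)
round_tripped = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(type(round_tripped[0]["price"]))  # <class 'str'> -- the float became a string

# JSON lines: preserves basic types, but carries no schema and no columnar layout.
jsonl_text = "\n".join(json.dumps(r) for r in rows)
recovered = [json.loads(line) for line in jsonl_text.splitlines()]
print(type(recovered[1]["price"]))  # <class 'float'>

# Avro, Parquet, and ORC add schemas and (for the latter two) columnar layouts,
# but reading them requires third-party libraries -- the library-quality gap
# discussed above.
```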
[00:22:55] Unknown:
And on that front, you've been dedicating a lot of your time and attention to the Arrow project, which I know initially started off as a way of being able to share data frames in memory between Python and R, and has now expanded into the realms that you were discussing of data access and interoperability with different data formats. So I'm wondering if you can talk a bit about some of the cases where a Python developer would be interested in leveraging Arrow, and in using its capability for incorporating capabilities from other runtimes, such as a particular analysis suite in R or something in Julia, versus reimplementing it in Python or finding a different Python library that does some measure of the same things.
[00:23:42] Unknown:
Sure. So the Apache Arrow project, just to give a very brief history of how things got started: I had been building the DataPad company with Chang She and our team in 2013 and 2014. And we felt like we were boiling the ocean in a number of ways, working on a lot of systems problems around low-latency and high-performance analytics in the cloud, and we developed a bunch of column-based, or columnar, analytics tools to power the DataPad application. We decided to join Cloudera at the end of 2014 to spend more of our time working on systems problems for data science. And my initial appointment when I landed there was to come up with a plan to make Python more of a first-class citizen in the big data world, that is, in the Hadoop and Spark ecosystem. One of the things that struck me pretty much right out of the gate was how fragmented the technology was. And this is just a function of things being open source and there being lots of different corporate players. So even though hundreds of millions of dollars have been invested in open source big data projects, the level of interoperability was still not very good in terms of sharing data and using multiple computing frameworks in a single application.
It was also very Java-centric; a lot of these systems were written in Java, or they were more black-boxy in a sense. I was working with the Apache Impala team, it was still Cloudera Impala back then, and I was interested in plugging Python into Impala, and found that for really basic issues, like how do we move data between Impala and Python, there was no off-the-shelf technology to do that in a standardized way. So I spent a large part of 2015 gathering a group of open source developers to see if there was interest in defining a standardized data representation, basically a standard data frame that was language independent. It could be used in Java, in Python, in C++, in R, really in any language: a technology that we could use to essentially tie the room together. So that was the rationale for creating the project. And the reason that having this common data format is so important is that it gives you something to collaborate on. Traditionally, if you consider two systems, they write libraries to read and write datasets.
They write algorithms to perform analytics on the datasets. They write messaging layers to move datasets around in a distributed system, around on a network. All of the code and the libraries that people write are, in general, specialized to the way that the data is represented in memory in the runtime environment. By defining this standardized representation, it allows us to create reusable libraries and to use libraries written in different programming languages in process, without any overhead. So now, a few years later, we can use C++ and LLVM to process data that originates in the JVM without any serialization.
And so it's the kind of thing that we always dreamed of, but it's just extremely difficult, because you have to define all of these standards and ways of communicating large datasets in a language-agnostic way. And as we've built out the project, our goal has expanded beyond just defining an open standard for tabular datasets, a.k.a. data frames, to building essentially a polyglot development platform for building data processing applications. If you're working with tabular datasets and you're working in Python or C++, there are building blocks: all of the building blocks that you need to do analytics, to read and write datasets, to send datasets around in a distributed system. These are the basic pieces that you could use to build a data frame library like Pandas, or that you could use to build something more sophisticated like an analytic database.
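The payoff of a standardized columnar representation can be sketched in plain Python, with the caveat that this is a deliberately simplified model (Arrow itself specifies exact bit-level layouts, validity bitmaps for nulls, and metadata, none of which appear here): each column lives in one contiguous, typed buffer, so a column scan avoids per-row objects, and another runtime could read that buffer in place through a pointer rather than by copying and converting.

```python
from array import array

# Row-oriented: each record is a dict; scanning one column touches every row object.
rows = [
    {"id": 1, "value": 10.0},
    {"id": 2, "value": 12.5},
    {"id": 3, "value": 9.75},
]
row_total = sum(r["value"] for r in rows)

# Column-oriented: each column is a single contiguous, typed buffer.
columns = {
    "id": array("q", [1, 2, 3]),              # 64-bit signed integers
    "value": array("d", [10.0, 12.5, 9.75]),  # 64-bit floats
}
col_total = sum(columns["value"])
assert row_total == col_total  # same data, different memory layout

# Because the column is one buffer, another runtime (C++, R, Rust, ...) could
# read it in place; memoryview gives a zero-copy window onto that buffer.
view = memoryview(columns["value"])
print(view.nbytes)  # 24: three float64 values, no per-object overhead
```

The standardized part that Arrow adds on top of this idea is agreement, across languages, on exactly what those buffers look like, which is what makes the in-process, serialization-free sharing described above possible.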
And so for me, the interest is partially in improving interoperability broadly across the big data world, but I'm also interested in consolidating efforts within the broader data science community. So that's the Python world, the R world. We have a pretty significant contingent of Ruby developers who are really interested in having data science tools for Ruby. We're building almost all of the core computational system software in C++, and then we build relatively thin bindings to those libraries that we can use in Python and R and Ruby.
And in other languages as well; we also have MATLAB bindings to the C++ libraries. So it's really cool that we can implement a feature once, keep improving that implementation, and have that code immediately reusable in all of these places. I believe that as time goes on, we'll start seeing more and more data processing systems that are formed from heterogeneous computational components. You might see a system that includes some Rust and some C++ and maybe even some Java, and that's possible because we have this unifying technology at the data level. So I'm really very excited about that. But as you can imagine, it's one of these frighteningly ambitious projects that hasn't really happened in the past, and part of the reason it hasn't happened is that it's so difficult. And I'm sure that one of the biggest challenges that you're facing in the process
[00:30:00] Unknown:
of working on and helping to shepherd the Arrow project is identifying which concerns belong to Arrow specifically and which should be relegated to those other language ecosystems: whether you should rely on Arrow purely as a method of data interchange, or include some lightweight analytical capabilities there to allow for push-downs from things like R or Python. And any challenges as far as replicated or wasted effort between those different language communities, and raising awareness of which capabilities exist where and how to incorporate them into a larger system.
[00:30:43] Unknown:
Right. We definitely have to strike a balance in terms of what problems we're taking on in the Apache Arrow project, and we've been pretty deliberate about where we draw the line. Where we've mostly been drawing the line, in terms of what is an Arrow problem and what is a downstream consumer problem, is largely on the user interface side of things. We don't want to be prescriptive about how the libraries are consumed by end users. Some people have asked, are we going to expect a pandas-like library to exist inside the Apache Arrow project?
And the answer is probably not, but that is the type of thing, like a next-generation pandas, a.k.a. what we'd call pandas 2, that would exist as a separate open source project that utilizes the Arrow runtime libraries to create its implementation. We want people to build many different kinds of front-end interfaces to the technology that we're building in the Arrow project. We don't want to say you have to write SQL, or you have to use this one particular data-frame-like API; there's flexibility as far as user interface. But the key thing is that the components in the project are reusable and have clean public APIs. If you just need to read and write datasets in Arrow format, you can do that. If you're mostly concerned with in-memory query processing, in-memory analytics on in-memory data, you can use just those libraries, and maybe you have your own serialization or data access layer that is proprietary to your application.
You can still do that. You don't have to accept a particular storage system. You don't have to store things as Parquet files in order to make use of the query engine components that we're building. But one of the big pushes over the next couple of years is creating a full-fledged query engine: basically, parallel execution of analytical queries against in-memory and on-disk datasets. That could be used to execute SQL, but also to evaluate pandas-type data frame operations: group-by aggregates, filtering, column expressions.
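[Editor's note: a toy, pure-Python illustration of the kinds of kernels such a query engine composes, here a filter followed by a group-by aggregate over columnar data. This is not Arrow code; all names are invented for the example.]

```python
from collections import defaultdict

# Two columns of a tiny table, stored column-wise.
keys = ["a", "b", "a", "b", "a"]
values = [1, 2, 3, 4, 5]

# Filter kernel: a boolean mask over the value column (keep value > 1).
mask = [v > 1 for v in values]

# Group-by aggregate kernel: sum the surviving values per key.
sums = defaultdict(int)
for keep, k, v in zip(mask, keys, values):
    if keep:
        sums[k] += v

print(dict(sums))  # {'a': 8, 'b': 6}
```

A real engine would run kernels like these in parallel over large chunks of columnar memory; the composition of filter, projection, and aggregation is the same idea.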
So right now we're laying down the building blocks of an end-to-end in-memory and out-of-core query engine. That's the major growth area for the project for the next two to three years. But you did mention the multi-language aspect, and it's notable that there is a parallel effort to build a native query engine for Arrow in Rust, and I'm happy to encourage that. If we end up with native query engines written in multiple languages in the project, I actually think that's a really great thing, because we can learn from each other, not only about implementation strategies but also about design. What is idiomatic Arrow code in C++ may translate well to Rust, or maybe not, and maybe vice versa. There are things to be learned about how to make the problem more tractable or build better software in Rust or in Go or in Java, and that can inform development in the other subcomponents of the project.
[00:34:16] Unknown:
And you were mentioning some plans for the yet-to-be-determined pandas 2.0 release. I know that you have spent a lot of your career working with and on pandas, being its initial author, and that it has now grown to be a larger community beyond yourself. I'm wondering, now that you've had some time with it and some space from it, what your current thoughts are on its overall success, and with that perspective, any thoughts on what you would have done differently if you were to start it over today? Yeah. So one, I guess, one
[00:34:54] Unknown:
elephant in the room that people always ask me about is the quote-unquote pandas 2 effort. At the end of 2015, I think it was, I started a discussion with the pandas core team, which you can find on the pandas-dev mailing list, about what kinds of changes or improvements we would like to make to the core library. Later, I wrote a long set of documents about a hypothetical pandas 2 project: what do we want, what problems do we want to solve, what problems do we not want to solve? As we've discussed as a community, the working plan is that the existing pandas project will live on effectively in perpetuity, given the millions of users and the years of code that depend on pandas, so development there is focused on stability and on having a mature, consistent, and reliable code base. If you look at the pandas 2 design documents that I wrote, along with a blog post called "Apache Arrow and the 10 Things I Hate About pandas", which is based on a talk I gave about five or six years ago on internal design issues in pandas, we've been laboring diligently in the Arrow project to address those systems problems that have affected the performance and scalability of pandas.
Our plan, effectively, is to create a sibling project, intended to be used as a complementary tool to pandas, that uses all of the Arrow technology under the hood and will be geared toward doing analytics on massive datasets: a little bit less flexible in its API, but designed to work in a scalable way with much larger datasets. But going back to your original question, my reflection on pandas: pandas will be 11 years old this April, so it has been around for a long time. I can't say that I would have done very much differently, thinking back on it. If anything, there are probably some things that I would have said no to, but when you're building a brand new open source project and you're excited to have people join the community and become users, you have a tendency to say yes to everything. So now there are some things in pandas that are being deprecated, maybe removed: things that seemed exciting once but haven't gotten as much maintenance love or use over the years. When you go back a decade, things were a lot different. The center of gravity in the ecosystem was in scientific computing and HPC, and people used NumPy for a lot of things. So it was in pandas' interest to have really tight interoperability with NumPy; if you didn't have that, your project would be more or less dead on arrival, because there would be too much friction for people to pick up and use the tool if they were already using NumPy.
Fast forward a decade, and some of that close relationship with NumPy has created problems, in the sense that a lot of the implementation details of pandas are exposed publicly, and that makes changing things really difficult. But the community, you know, Jeff Reback and Joris Van den Bossche and Tom Augspurger, they've done a really great job growing the community and also judiciously adding new features to make the project more extensible. Things like extension arrays landed in the project relatively recently, and that's opened up a lot of interesting opportunities to expand beyond the horizons of what's possible with vanilla NumPy.
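[Editor's note: a short concrete example of the extension-array feature mentioned above, assuming a pandas version with nullable dtypes (1.0+). The "Int64" dtype, capital I, lets an integer column hold missing values without being silently upcast to float64, which is what plain NumPy-backed columns do.]

```python
import pandas as pd

# A NumPy-backed column is forced to float64 to represent the missing value.
float_backed = pd.Series([1, None, 3])
print(float_backed.dtype)  # float64

# The nullable integer extension dtype keeps integers and uses pd.NA.
nullable = pd.array([1, None, 3], dtype="Int64")
print(nullable.dtype)   # Int64
print(nullable[1])      # <NA>
```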
Having null values in integer columns is something that's been enabled by extension arrays. There's also GeoPandas, which provides geographic data structures. So there's still a lot of exciting work happening in pandas, and I think the project has a lot of growth and work ahead of it. I don't see the Arrow work as being in conflict with pandas at all. I see a symbiotic relationship between the work we're doing in the Arrow project and pandas, in the sense that we want to make sure that if somebody has 100 gigabytes, or hundreds of gigabytes, of Parquet files, they're able to perform standard analytical queries on those datasets. And if at some point you need to cross over into pandas and build pivot tables and do classic pandas stuff, it's straightforward for you to do that. But the presumption is that at that point you're going to be working with smaller amounts of data. Pandas is more of a Swiss army knife than a chainsaw, if that makes sense. And
[00:40:02] Unknown:
a lot of your overall recognition in the Python community in particular, but also in data science at large, is related to pandas, and most recently you've been very involved with the Arrow project. But I'm wondering if there are any other achievements that you're particularly proud of that you'd like to call out, that people might not necessarily be as familiar with?
[00:40:23] Unknown:
Yeah. One project that we haven't talked about at all on the podcast yet is the Ibis project, I-b-i-s. It's a project that I started when I was at Cloudera, and the idea was to develop a fairly rich DSL, a domain-specific language, for interacting with, initially, SQL-based systems. I wanted to build something that was very similar in its API to pandas and could be adopted by a pandas user, but was lazy: you would create strongly typed, well-typed expressions, and you could essentially take any SQL query, rewrite it as an Ibis expression, and write it with Python code.
I contend that Ibis is probably one of the most underappreciated things that I've worked on, because its users are big fans of it: it makes writing really complex SQL a lot more tractable, and it brings a level of code reuse to SQL analytics that you can't really get in SQL itself, where the way you reuse code is by copying and pasting. It's something I'm quite proud of, and I think the work in designing the expression language of Ibis is also influencing the work we're doing in the Arrow project, because we also need a DSL for writing down deferred analytical expressions in C++. The fact that I have a fully formed DSL that is a superset of SQL and can express even very complex SQL concepts, like correlated subqueries, things that traditionally are very hard to even think about in pandas, gives us a head start on thinking about how to map declarative SQL analytics onto the imperative, functional approach of Python: essentially, composability and function calls.
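[Editor's note: a toy sketch of the deferred-expression idea described above, not the real Ibis API. Expressions are built lazily as Python objects and only rendered to SQL text when asked; every class and method name here is invented for illustration.]

```python
class Column:
    """A reference to a table column; comparisons build predicate strings lazily."""
    def __init__(self, name):
        self.name = name

    def __gt__(self, value):
        return f"{self.name} > {value}"

class Table:
    """A deferred table expression that renders to SQL only when asked."""
    def __init__(self, name):
        self.name = name

    def filter(self, predicate):
        return f"SELECT * FROM {self.name} WHERE {predicate}"

# Build the expression with ordinary Python syntax, then render it.
t = Table("events")
sql = t.filter(Column("amount") > 100)
print(sql)  # SELECT * FROM events WHERE amount > 100
```

Real Ibis expressions are typed and composable rather than plain strings, which is what enables code reuse and rewriting across SQL back ends.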
So I think it's an interesting research project and something I hope more people take a look at. It's been one of these sleeper projects that has been growing and developing over time; it has expanded to support a lot of different SQL systems and now has an in-memory pandas back end. Phillip Cloud, from the pandas project, has done a lot of amazing work on that. Krisztián Szűcs, who is also getting involved in Apache Arrow as a PMC member, has been working with me on the project as well; he's done a lot of work on Ibis. So it's a cool project. But I've tended to concentrate my development work in a few areas.
[00:43:16] Unknown:
So pandas and Ibis and the statsmodels project, outside of the Arrow project, are where most of my development work has gone. And it's also worth at least an honorable mention that you wrote the book Python for Data Analysis. If anybody hasn't read that, it's probably worth taking a look at as a means of getting introduced to the ecosystem.
[00:43:39] Unknown:
That's true. Yeah. I wrote my book, Python for Data Analysis, essentially concurrently with the development of pandas, which was a bit risky, thinking back on it. This was 2012, when I was mostly writing the book, and I did a second edition a couple of years ago to update it for Python 3 and for the latest version of pandas. So if you're learning pandas, or you want to learn more about data analysis and Python, the book is definitely a good resource for that.
[00:44:13] Unknown:
And looking forward, what do you have in terms of grand ambitions for the future of the data science community, both inside and outside of the Python ecosystem? And within that, any projects that you are particularly excited to be working on in the near to medium term?
[00:44:31] Unknown:
Yeah. So I helped start the Arrow project while I was at Cloudera in 2015. In 2016, I moved across the country and spent a couple of years working with Two Sigma on systems for data science, and they're really gracious supporters of the Apache Arrow work. We developed integration between Spark and Arrow to make Python on Spark faster, and we collaborated with IBM on that work. That was a really exciting collaboration to see come together and be successful.
And last year, in order to scale up the work in Apache Arrow, I partnered with RStudio and Two Sigma to create a new organization called Ursa Labs, a not-for-profit group that enables me to put people to work full time on the Arrow project. The mission for Ursa Labs is to develop the Arrow platform as shared computational infrastructure for data science. My grand ambition for all of this is to have a really powerful computational runtime for data science, for analytics, data access, and feature engineering for machine learning and statistical applications, that is uniformly accessible from R and Python and Ruby and the different languages that people want to use for data science. It's the kind of thing that hasn't really been possible in the past because of the points of friction that made it difficult to share code between these programming languages.
Also, part of that ambition is to foster collaboration between the data science world and the database systems community, because part of what's missing from data science is the level of computer science and computational systems work that has been done in the analytic database world for the last 20 or 25 years. Very little of that systems work has made its way into the hands of everyday data scientists. So part of the goal of the Arrow project is to create reusable libraries that provide the level of performance and scalability that you have in modern analytic databases, but put those at the fingertips of everyday data scientists, and to essentially liberate individuals from being tied to a particular programming language: being able to use multiple programming languages without having to be too concerned about whether things are going to run ten times slower in one language versus another. I think things are on their way toward that goal, and setting up the Ursa Labs organization helps provide some scalability in terms of building relationships with more corporations that wish to fund the Arrow work, so that we can build a bigger and bigger team as time goes on. So if anyone listening has an interest in funding this work or helping us move faster, you can definitely reach out to me on Twitter or any of the standard communication channels. And for anybody who does want to get in touch, either to offer funding or support or because they're interested
[00:48:00] Unknown:
in working with you on the Ursa Labs mission, I'll have you add your preferred contact information to the show notes. And to close out the show, is there anything else that we didn't discuss yet that you think we should cover, or any parting advice that you have for active or aspiring data scientists, or any resources that you'd like to recommend?
[00:48:21] Unknown:
I don't think so. But in terms of where we are on our historical timeline, if you think about what life might be like in 2050 and where we are right now, it still feels like we're a bit in the wild west in terms of the development of systems and tools for doing data science. I think we have a long way to go. The more open dialogue we have around these problems, and the more we collect ideas and coalesce efforts in building the open source software and tools and systems, the more progress we'll make. I like to tell people that I want the future to get here faster. I don't want time to pass more quickly, but if we could advance human progress on these kinds of tools by a few years here and there, I think that would be pretty great, because it means we'll be able to do more interesting science and make things in the world just a little bit better, which I think is something we should all care about. Alright. Well, with that, I'll move us on into the picks. This week, I'm going to choose the author Roald Dahl. He has written a number of great books for all ages, and I have enjoyed them for many years. If you have never read any of his books, I definitely highly recommend them.
[00:49:57] Unknown:
The BFG is a great one. James and the Giant Peach. Pretty much anything he's ever written is good fun. So with that, I'll pass it to you, Wes. Do you have any picks this week?
[00:50:06] Unknown:
Well, after having it recommended to me several times over the years, I just finished reading The Soul of a New Machine by Tracy Kidder. It's a classic book about computer engineering from the early 1980s. The book is a bit dated, but I think for anyone who's in software or computer engineering, it's a good read; it helps you understand the motivation that drives engineers to build things. I found it to be a pretty enlightening profile of some people who worked very hard, over a very short timeline, to build a new computer system. So it gets a high recommendation from me. Well, thank you very much for that recommendation. I'll have to take a look at it, and I appreciate you taking the time today to join me and share your
[00:51:01] Unknown:
experiences working with and building tools for the data science and data analytics communities. I have used the outputs of your labors a number of times, and I'm sure a number of other people have as well. So thank you for that, and I hope you enjoy the rest of your day. Great. Thank you. Thanks for the conversation.
Introduction and Overview
Wes McKinney's Background and Python Journey
Motivations for Data Science Tools
Open Source Funding Models
Corporate Driven Open Source
Challenges in Data Analysis
Underfunded Areas in Data Science
Apache Arrow Project
Balancing Arrow's Scope
Future of Pandas and Pandas 2.0
Other Notable Projects
Grand Ambitions for Data Science
Closing Thoughts and Advice