Summary
When you start working on a data project there are always a variety of unknown factors that you have to explore. One of those is the total volume of data that you will eventually need to handle, and the speed and scale at which it will need to be processed. If you optimize for scale too early then the complexities of distributed systems add a high barrier to entry, but if you defer that engineering investment then it can be challenging to refactor for scale later. Modin is a project that aims to remove that decision by letting you seamlessly replace your existing Pandas code and scale across CPU cores or across a cluster of machines. In this episode Devin Petersohn explains why he started working on solving this problem, how Modin is architected to allow for a smooth escalation from small to large volumes of data and compute, and how you can start using it today to accelerate your Pandas workflows.
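To make the drop-in claim concrete, here is a minimal sketch of the swap that Modin documents (the file and column names are hypothetical placeholders):

```python
# Before: import pandas as pd
import modin.pandas as pd  # same API, but operations fan out across cores

df = pd.read_csv("example.csv")         # hypothetical file, now read in parallel
print(df.groupby("category").sum())     # hypothetical column name
```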
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Devin Petersohn about Modin, a Pandas-compatible dataframe library for datasets from 1MB to 1TB+
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what Modin is and the story behind it?
- Why study dataframes?
- How do dataframes compare to databases?
- What can you do in a dataframe that you couldn’t in a database?
- What are your overall goals for the Modin project?
- Who are the target users of Modin and how does that influence your prioritization of features?
- What are some of the API inconsistencies that you have had to abstract and work around between Pandas, Ray, and Dask to give users a seamless experience?
- What are some of the considerations in terms of capabilities or user experience that will influence whether to use Ray or Dask as the execution engine?
- Can you describe how Modin is implemented?
- How has the constraint of replicating the Pandas API influenced your architectural choices?
- What are the most complex or challenging Pandas APIs to replicate in Modin?
- In addition to the core Pandas API you have also added experimental features such as SQL support and a spreadsheet interface. How have those capabilities affected the range of potential use cases and end users?
- What are some of the complexities that come from acting as a middleware between the Pandas API and the Ray and Dask frameworks?
- What are some of the initial ideas or assumptions that you had about the design or utility of Modin that have been challenged as you worked through building and releasing it?
- What are the most interesting, innovative, or unexpected ways that you have seen Modin used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Modin?
- When is Modin the wrong choice?
- What do you have planned for the future of Modin?
Keep In Touch
- devin-petersohn on GitHub
Picks
- Tobias
- Devin
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Devin Petersohn about Modin, a Pandas-compatible data frame library for datasets ranging from 1 megabyte to 1 terabyte plus. So, Devin, can you start by introducing yourself? Yeah. Hi. I'm Devin Petersohn. I'm a sixth-year PhD student at UC Berkeley in the RISE Lab.
[00:01:16] Unknown:
And I'm advised by Anthony Joseph. And do you remember how you first got introduced to Python? Yeah. So it was in undergrad actually, when I was working on scripting for a genomics project. So early on in my undergrad, I got really interested in genomics and trying to deal with, you know, genomic data, and Python really came in handy. It was introduced to me by a bioinformatician, actually. And so, yeah, that's kind of how my journey in Python started.
[00:01:50] Unknown:
And so that has led you now to building Modin. Can you give a bit of a description about what it is that you're building, some of the story behind why you decided to try and tackle this problem, and how you came to this particular solution for it? So
[00:02:07] Unknown:
as I mentioned, I've been interested in genomics problems, and I started my PhD working on genomics problems, actually. And I started to notice more general problems in science as a whole, where tools to deal with large amounts of data really require a lot of expertise in distributed computing. And one of the things that I really wanted to tackle is this problem of accessibility in big data tools and tools that deal with large amounts of data. And so I started with pandas because, you know, pandas is everywhere. It's what everyone uses. And it also has a lot of nice features. I mean, there are a lot of things to dislike about pandas for sure. And any of the maintainers would share with you a long list of things that they don't like about pandas.
But there are also a lot of reasons that people use it a lot. And so I started with pandas and started trying to tackle this problem. I thought it was gonna be a lot easier than it was. It's a very hard problem, it turns out. And I've learned a lot along the way, not only about pandas, but about data frames as a whole and distributed processing. I mean, it's really nice to be able to do this as a PhD student because I get the freedom to explore the problem as deeply as I'd like.
[00:03:20] Unknown:
And what is it about the overall space of data frames that draws you to it and wanting to study it and dig deep into this problem?
[00:03:31] Unknown:
As a part of my PhD program, like, the really nice thing about being a student at Berkeley is that I get a lot of freedom. You know, my adviser, Anthony, has given me a lot of freedom to go in whatever direction. Like I said, I started in genomics and now I'm working on a lot more general systems problems. And data frames as a whole, looking at pandas and trying to figure out, how do we scale it? I mean, it took some theory, in truth. What we did is we looked at the data model and we tried to say, how is this different from databases? And we dug basically as deep as possible, and our group was kind of the first to put forth a real formalism that could underlie data frame systems. And so now that we have that formalism, that formalism is what Modin is built on: the idea that data frames are unique. They're not quite the same thing as databases. They're not quite the same thing as arrays. They're different from spreadsheets, of course. And so they sit somewhere in the middle of all of those. Yeah. I mean, why data frames is a good question because, really, tables are how people like to think about data. Right? And it's much easier to think about data in two dimensions.
And so that's how a lot of people think about and reason about solving their data problems: by putting them in a table or a table-like structure. And data frames are a lot more permissive in what they allow for various stages of data cleaning.
[00:05:10] Unknown:
As you mentioned, two dimensions is kind of the limit of what we can comfortably work with as a mental model. But there are also a number of problem domains, particularly in the sciences, where you're going to be dealing with multidimensional matrices or n-dimensional arrays. And I know that there is the xarray project that is built on top of pandas for being able to address some of those multidimensional spaces. And I'm curious if that's something that you have considered exploring with the work that you're doing with Modin, or if, because of the fact that you are hewing to the Pandas API,
[00:05:44] Unknown:
xarray could just natively sit on top of what you're building there? The short answer is it's the latter. Like, xarray is a really, really interesting project. And, you know, what they've done with it in terms of being able to take something that we can think about in two dimensions and basically turn that into an n-dimensional array library, I think that's phenomenal. They support Dask. And so Modin also supports Dask. So Modin could kind of sit underneath xarray.
[00:06:10] Unknown:
And so you mentioned briefly the juxtaposition of data frames and databases, in that they're both working with tabular structures, but they have different strengths and areas of focus. I'm wondering if you can dig a bit more into some of the things that you can do in a data frame that are either difficult or impossible in a database?
[00:06:37] Unknown:
One example is that databases don't allow you to have multiple different types in the same column. This is really important for data that's at different stages in the cleaning process. So when data comes in from the wild, we don't know if the data is clean or gonna fit really nicely into our schema. And databases really require that you declare the schema upfront. You must declare the schema at the time of the table's creation, and then all the data that you ingest must fit into that schema. And with a data frame, you actually don't need to declare a schema upfront. The schema can be inferred at runtime.
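As a minimal illustration of that difference (plain pandas, nothing Modin-specific):

```python
import pandas as pd

# A single column holding an int, a string, a float, and a null value:
# a SQL table with a declared column type would reject this outright.
messy = pd.DataFrame({"value": [1, "two", 3.0, None]})

print(messy["value"].dtype)  # object: the type is inferred at runtime, not declared
```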
I mean, it's very natural that Python would have a data frame, and a very popular data frame at that, because of Python's dynamic typing and these things that database people don't like. Database people tend to like, you know, static typing and strict structures. But the real world is not so clean. Right? The real world actually requires us to deal with messy data. And so data frames are really good at letting me do analysis on data that is not necessarily in a perfect schema, or that doesn't necessarily fit into the traditional concept of a schema.
[00:07:58] Unknown:
And as far as the Modin project, I'm wondering what you have as far as overall goals. You mentioned wanting to be able to support the accessibility of distributed computing for scientific endeavors. But just more broadly, what is it that you are hoping to be able to achieve with the Modin project?
[00:08:16] Unknown:
At first, the goal was basically to just get a formalism around data frames and basically have an implementation of pandas. At this point, Modin supports a variety of APIs. It also supports a variety of distributed engines underneath. So it didn't start out like that. It started out with a very simple purpose, and it's kind of grown into what it is today. So my goals with it, I guess, are that I'd like to basically allow people to interact with data in the way that they're most comfortable with, on whatever they have access to. That's a very broad statement. Right? But what I mean by that is, for example, SQL is sometimes more convenient to use for certain operations than pandas is. Take join, for example. A lot of people who are familiar with SQL joins don't like pandas-style joins because you have to use pd.merge or df.merge or something. And, you know, it doesn't feel as intuitive as writing a join in SQL.
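For readers following along, the contrast looks roughly like this (standard pandas; the table and column names are made up):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [10, 20, 30]})

# The pandas-style join Devin describes:
joined = left.merge(right, on="id", how="inner")

# The SQL equivalent that many people find more intuitive:
#   SELECT * FROM left JOIN right ON left.id = right.id;
print(joined)
```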
So it would be nice if we could just switch between Pandas and SQL, for example, whenever we wanted to in the midst of our notebook, and it'll run on the same underlying execution. And with that, you know, we also want the underlying execution to be abstracted in a way that users can use whatever hardware or whatever cluster they have available. Right? We don't always get control over what we have access to at a job or, you know, even at home. And so Modin kind of abstracts away the execution engine in a way that allows us to use whatever you have. If you have access to a Dask cluster, then Modin can run on the Dask cluster. If you have access to a Ray cluster, Modin can run on the Ray cluster. And we're working on adding a few others.
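As a sketch of what selecting an engine looks like in practice (Modin documents an engine config and a MODIN_ENGINE environment variable, though exact spellings may vary between versions):

```python
import modin.config as cfg

cfg.Engine.put("dask")  # or "ray", depending on which cluster you have access to

import modin.pandas as pd

df = pd.read_csv("example.csv")  # hypothetical file; the notebook code is unchanged
```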
But the idea here is that I, as a user, don't need to think about where things are running, or what things are running, or changing my notebook for a given compute engine, or anything like that, because it's all just kind of there and it all just kind of works. It's really centered around the end user's productivity and letting them choose how they interact with data at any given point in time for any given task. Right? In general, that is the goal. It's pretty ambitious, I would say, because there are a lot of challenges in abstracting away so many details from the user. And there's a lot of machinery underneath Modin that enables us to make sure that data is where it needs to be, and that it's partitioned in the right way, and all of these things, so that users don't have to really think about distributed computing even though they're using it.
[00:11:02] Unknown:
You mentioned the support for both Dask and Ray as these underlying substrates for being able to distribute the computation. And I'm wondering what you have had to build in order to be able to abstract across those different systems for being able to handle the computation and spread it out, and the architectural aspects of Modin that allow you to parallelize based on these different underlying substrates while still being able to provide a shared API that targets pandas on top of the whole thing?
[00:11:42] Unknown:
Modin has kind of a core implementation that implements a set of data frame algebraic operators that we defined as part of this theoretical data frame work that I've done for my PhD. These algebraic operators number roughly 12 to 15. We're still in the process of getting everything completely formalized and implemented. But there are about 15 operators at this time that allow you to operate on data and metadata. And all of these are implemented as a part of this Modin core. This Modin core is then abstracted on top of these two really good task-parallel distributed systems, Ray and Dask.
We use these two basically as task schedulers. So Dask has a data frame, and a lot of people get Modin's usage confused with Dask DataFrame itself. But Modin is distinct from Dask DataFrame. It doesn't use any of the library code from Dask DataFrame. Dask DataFrame sits at the same level as Modin does. For Modin itself, this Modin core, there are a lot of challenges, mostly in dealing with the metadata. Right? There's a lot that you can do at the data level to make sure that things are nicely parallelized. Metadata, and keeping that metadata close to the user, is actually one of the biggest challenges.
Data frame users are used to being able to do something like a loc or an iloc, and it's supposed to be really fast. And making these fast in a distributed system is really hard, because the data might be on another machine. So how do we make sure that we're giving the user these immediate response times, or very close to immediate response times, in a distributed system? But this is part of my research, and we've done a lot of work to make sure that it is fast, and that we're not copying when we do a loc, and that we're not making memory copies all the time. Right? Which is another problem that Pandas has generally.
But Ray and Dask tend to be used basically as our compute engines for this Modin core implementation. So Modin as a whole is about 40,000 lines of code, I think. And the difference between the Ray and the Dask code is about 1,500 lines. So it's a very small difference in terms of how we use Ray and Dask. They're relatively similar, and there's a lot of shared code between the engines because they're both Python distributed task frameworks. Right? So we can submit Python tasks. We can submit the same Python tasks to the two of them, if that makes sense.
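To illustrate why the engine-specific code can stay so small, here is a hedged sketch (not Modin's actual internals) of submitting the identical Python function through each framework's public API:

```python
import ray
from dask.distributed import Client

def partition_sum(values):
    # A plain Python function; either engine can ship it to a worker.
    return sum(values)

data = list(range(1_000))

# Ray: wrap the function as a remote task and fetch the result.
ray.init(ignore_reinit_error=True)
ray_future = ray.remote(partition_sum).remote(data)
print(ray.get(ray_future))

# Dask: submit the very same function through a distributed client.
client = Client(processes=False)
dask_future = client.submit(partition_sum, data)
print(dask_future.result())
```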
[00:14:25] Unknown:
As far as the difference in user experience, of why somebody might choose to use Ray versus Dask, is it just a matter of what they have already chosen to use based on the requirements within their organization and what they already have running, or are there other variances in terms of capability or user experience that come about when somebody is using Modin and deciding which of these engines to build on top of?
[00:14:48] Unknown:
So, generally, what you have access to is the first question to answer. Right? Because if you only have access to Dask or only have access to Ray, then I recommend you use what you have access to. Don't necessarily try to add a new dependency or something like that. Beyond that, the Ray implementation is more mature. That's where it started. Ray is actually another RISE Lab project. And so I personally have a closer relationship with the Ray team and with the Ray folks. You know, it'll be a bit more stable on Ray. Now, there's not a lot of differences. Like I said, it's only 1,500 lines of code.
So there's not a lot of differences between Ray and Dask. But if it's all the same, then I generally recommend using Ray, because we've been working with them a lot longer, and, you know, it's another RISE Lab project. So Modin itself tries not to be opinionated on what you should use. I personally am opinionated because, you know, I've worked with these folks for a long time.
[00:15:56] Unknown:
Digging more into the actual implementation of Modin, I know that at the core level, as you said, you've developed a formalism for how a data frame should operate. But I'm wondering how you actually manage to parallelize these array structures in the compute grid in a way that you are able to get these responsive, interactive speeds across a distributed structure, whether that's across multiple cores or across multiple physical machines?
[00:16:15] Unknown:
This question has a really interesting answer, or at least it's interesting to the nerd in me. The distributed parts of Modin are really interesting and unique when you compare them to a lot of traditional distributed systems. Modin's partitioning, in particular, is very general, in that the partitioning can be anything. It's not a strict column store. It's not a row store. It's not a block store. It can be any of these at any point in time during a data frame program.
And so this flexibility lets us resolve these loc and iloc calls extremely quickly, because, first of all, they don't copy data. Right? So it's kind of a special case where we can just point to the partition where the key lives, in the case of loc. We can point to the partition where that row lives, or that set of rows, and then just be done with it. Copy-on-write is effectively what we do. So if you make a modification, that's when we do the copy. In the case of other operations, for example, data frame operations allow you to intermix column operations and row operations, things that databases also couldn't do.
In those cases, Modin can flexibly move between different forms of partitioning. And all of this is done without any user intervention. Right? So the user doesn't know at any time what the partitioning is, because it's all taken care of by the Modin core system, the core component of the system, which manages the metadata and also manages the partitioning.
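A toy sketch of the metadata-only lookup and copy-on-write behavior described above (purely illustrative; all names are hypothetical and this is not Modin's actual code):

```python
class PartitionedFrame:
    # Rows split into fixed-size blocks that conceptually live on many machines.
    def __init__(self, rows, block_size=2):
        self.blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
        self.block_size = block_size

    def loc(self, row):
        # Metadata-only: compute which block owns the row; no data is copied.
        return RowView(self.blocks[row // self.block_size], row % self.block_size)


class RowView:
    def __init__(self, block, offset):
        self.block, self.offset = block, offset

    def get(self):
        # Reads go straight to the shared block.
        return self.block[self.offset]

    def set(self, value):
        # Copy-on-write: only a mutation triggers a private copy of the block.
        self.block = list(self.block)
        self.block[self.offset] = value
```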
[00:18:02] Unknown:
In terms of the motivations for some of the architectural decisions that you made, I'm curious how the constraint of targeting the Pandas API has influenced the direction of your thinking about how to build Modin, given that it is a bit more of a general purpose framework for being able to do this compute across array structures on distributed systems?
[00:18:29] Unknown:
So Modin actually started as, like, a weekend hack, kind of. Right? So I basically wanted to try to implement the Pandas API with a simple, traditional database approach, which is row partitioning. That didn't work. Basically, at that point, I decided to boil down the pandas implementation and look at other traditional data frames, like R, for example, and then S, which comes before R, and go back to the scientific literature and see what I could find about these systems. And I couldn't find much, really, about the formalism that underlies traditional data frames. And so at that point, it wasn't just me. There's a big group of us at Berkeley who did this.
We boiled down the data frame operations into this algebra that I've been talking about, and we also formalized the data model around it, basically so that we could prove that we can do everything that pandas can do, and maybe more, actually. The formalism turned out to be more general than Pandas and what it allows. And so Pandas has over 600 operators. 15 operators doesn't sound like it could cover that many, but it does, because of user-defined functions. A lot of pandas is very repetitive. And so user-defined functions actually give us a lot of power if we support them in their completeness, in the same way that pandas does.
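A small illustration of that repetitiveness in plain pandas: several distinct operators behave like one generic map with a different user-defined function plugged in.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [-1, 2, -3]})

# df.abs() and df.isna() each act like a generic "map" operator
# parameterized by a different UDF:
assert df.abs().equals(df.apply(np.abs))
assert df.isna().equals(df.apply(pd.isna))
```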
And so the challenge was that we couldn't use traditional styles, traditional methods from databases, to tackle the problem of pandas. We had to actually come up with something novel that would allow us to do everything that we needed to do.
[00:20:13] Unknown:
Because the structure that you've built is a bit more flexible and general purpose than Pandas, you've actually built a couple of other experimental interfaces on top of it. I was noticing that you have the capability of executing SQL statements on the data frame, as well as a spreadsheet API for being able to do more of an Excel-type experience. And I'm wondering if you can talk through some of the motivations for going down these experimental paths and some of the thoughts that you have on being able to make the data frame a bit more of a general purpose structure and not tied specifically to a Python-oriented API.
[00:20:55] Unknown:
So the whole point of Modin has kind of transformed into this vision that a group of us at Berkeley have around meeting users where they are and allowing them to interact with data in a way that's most comfortable to them. The goal is to basically take Modin, this kind of middle layer that we have, the Modin core, and allow users to use the APIs that they are familiar with. And so that involves basically translating the spreadsheet API, for example, and those interactions down into the Modin core algebra. And since pandas and data frames are so general in what they allow, these other interfaces end up being subsets of the pandas API. So with SQL, for example, everything you could do in SQL, you could also do in pandas.
The behavior isn't always the same, though. In SQL, for example, operations aren't ordered. In pandas, they are. Right? Part of the formalism we have is that there's an output order for every input order and operation. So in SQL, since there's not necessarily an output order, users might assume that certain operations are going to be fast, because unordered operations are just generally faster than ordered operations. If you don't have to worry about order, then you can do things in whatever order is the fastest. So there are challenges like this that we haven't tackled yet that are related to meeting the expectations of users who are using different APIs.
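A concrete example of the ordering difference (plain pandas; the SQL is shown as a comment, since the semantics are the point):

```python
import pandas as pd

df = pd.DataFrame({"x": [3, -1, 2]}, index=["a", "b", "c"])

# pandas guarantees an output order for every input order:
# this always yields row "a" followed by row "c".
print(df[df["x"] > 0])

# The SQL equivalent makes no such promise without an ORDER BY:
#   SELECT * FROM df WHERE x > 0;
```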
But as a whole, what we want to tackle is this problem that we have of systems that force users to change their behavior in order to gain their benefits. Right? So learning a new API is expensive in terms of human time. It takes a long time to learn a new API. And, yes, there are better APIs than pandas out there. I think a lot of people would agree. But so many people know pandas that I think we would be doing the world a disservice if we just tried to get everybody to shift away from the existing APIs that we have. Right? I mean, if you imagine that it takes several months to get familiar with a new API and get to the point where you're productive with it, that's a lot of time. Right? That's a lot of time that you've lost, in effect.
And the performance is probably not worth it. In general, chasing the fastest tool is just generally not gonna be worth it. Right? We need tools that give users superpowers, effectively, without requiring that they change a lot of their behavior.
[00:23:46] Unknown:
Because you do have this flexible interface, and you're not forcing people into a particular language or syntax to work through the problems that they're trying to solve, I'm curious how that affects the use cases and the target users that you have for Modin as it compares to something like Pandas.
[00:24:11] Unknown:
The general use case of Modin has so far been users who are hitting the wall that you hit with Pandas performance. These other features, I haven't seen a lot of use of them in production use cases. There are people who are experimenting with them a bit, but they're very, very new, maybe a few months old. So the use cases around the SQL API or the spreadsheet API, those aren't really in production just yet. The way that the pandas API is being used is generally as a way to put off having to rewrite things into Spark, for example, or another distributed system. So there are production use cases at large companies where the user just writes a notebook that uses Modin, and then they're able to basically take that and run it as a production job.
And they don't have to rewrite it into Spark in order to make use of the entire dataset. Right? The user is able to actually just write Modin code and experiment with Modin using the pandas API that they're familiar with. And they don't have to go through this costly phase of translating that into Spark. And a lot of companies have this problem where they're taking Pandas code, or some code that's written on a single node, and they have to productionize it in some way. A lot of times that looks like translating the code into Spark or something like that. Right? Translating Pandas into Spark is a common use case. And, you know, data engineers typically do that. A lot of companies don't let data scientists write production code.
But what Modin allows is for the data engineer to not have to be so hands-on with how they productionize the code. Right? They are able to scale their efforts a bit better, because it takes a long time to rewrite these. Distributed applications just take a long time to write in general. So that's a really interesting use case that I actually wasn't anticipating when I first started. I was trying to solve the data science productivity problem, and it ended up actually helping out data engineers quite a bit as well, which is very interesting.
[00:26:29] Unknown:
So you mentioned that where you are with Modin today isn't where you set out to be from the start. And I'm curious, as you look back to when you were first thinking about what you were going to build, what were some of the initial ideas or assumptions that you had about the particular needs of data scientists and how a distributed computation system would be able to alleviate some of those? How have some of those ideas been challenged or invalidated as you went along the journey of actually building Modin? And what are some of the lessons that you wish you had known earlier on as you were starting down this path that would have saved you a significant amount of time and heartache?
[00:27:11] Unknown:
So one of the things that I wish I had learned earlier was that data scientists, or at least the data scientists that I've talked to, don't really care about Modin's internals. In the early days, when I would talk about Modin, I would talk about this algebra and how it's very elegant and things like that. But a lot of the data scientists that I was speaking with didn't care. And it's not that they didn't care; it's that it didn't align with what their problems are. Or rather, it didn't align with what they thought their problems were. So I wish, in those early days, that I hadn't focused so much on the deep technical parts of Modin when I would talk to data scientists.
I wish that I had focused more on the problems that it solves, allowing users to think about how Modin solves their problems, rather than: oh, Modin has this deep technical algebra; am I gonna have to learn how to use this algebra, or what does this algebra have to do with my problem? And so it comes down to a communication lesson, really. There's also the lesson that I wish I had learned earlier, which was that pandas is really hard and it's really big. You know, I knew pandas when I started, but now I think back to those days and I realize I didn't know pandas at all when I first started, compared to how much I know today.
And it's been a really interesting journey and a really interesting progression, to be able to start out with some prototype that worked for a few APIs, not a lot, and then grow it into something that today is actually relatively useful, I would say. But, you know, in general, the biggest lessons that I've had are that data frames are unique, and they're actually really, really useful and really, really interesting structures. And yeah, it's been a very fun journey to unwrap this complexity and drill it down into a formal structure, something that we can actually reason about from a theoretical standpoint.
[00:29:34] Unknown:
In terms of the actual uses and projects that have been built using Modin, I'm wondering what are some of the most interesting or innovative or unexpected ways that you've seen it used.
[00:29:44] Unknown:
In terms of most unexpected, absolutely, it would be the delaying or replacing of Spark in that way. Right? I mean, Spark came out of AMPLab, which is the predecessor to RISE Lab. So, you know, I know a lot of the people who have worked on Spark. It was never my intention to replace Spark, if that makes sense. But having that actually happen, that's been the most surprising thing to me, really. Because, like I said, I wanted to solve this problem of individual data scientist productivity. And what that's turned into is: you solve a bunch of other problems for free if you focus on the data scientists first. And I think that's actually a really good lesson, too. Speaking of lessons, data scientists have a lot of challenges in their day to day.
And I don't know if it's just me, but lately I've been reading a lot of things on various blog posts about how data scientists are gonna go extinct and how data engineers are gonna take over data science. And, you know, coming from genomics, I don't think that's the case. Because in genomics, on the bioinformatics side, they've tried for years to teach computer scientists biology. It's just too hard. You just have to know too much biology in order to be useful at looking at biological data. And the same thing is true with data science, I believe. And so I feel like I'm advocating for data science at this point, but I do believe that data scientists are extremely important and that we should focus on their productivity as a first-class citizen. And then all the free stuff that we get is gonna be great. Right? It's gonna just make the field of data science and data engineering better, just because the individual data scientists will be better at their jobs and more productive, and not tied to the same tools that they've been using for the last decade.
[00:31:43] Unknown:
And so for people who are excited about the capabilities of what Modin is offering, and are thinking that they're just going to replace Pandas with Modin everywhere they're using it, what are some of the cases where Modin is the wrong choice and somebody might be better served just using Pandas out of the box?
[00:31:59] Unknown:
There are several cases where I would suggest not to use Modin. The first is if you're happy with what you're using. If you don't have any problems with what you're using, don't change it. I feel like we're often pressured to change things when we don't necessarily need to, or when we're not forced to. If you're happy with your tools, just keep using them. I'm not trying to convince anybody who's happy with their tools to switch to Modin. And that includes, you know, Spark and Dask and all these other tools. Right? There are a lot of reasons to use Modin if you have problems with other systems.
But if you're happy with them, then don't change. I think that's the simplest answer. Now, there are also cases where it's hard to make things fast in a distributed system. For example, if you do a lot of looping over your data, if you prefer to operate on your data in loops, Modin in its current implementation would not be a good choice. Because looping over data in a distributed system means that you're touching every row. And if you're modifying every row, or doing something with the data at every row, then that means that you're pulling all of the data, effectively line by line, into the driver process, into your current environment.
It makes your memory kind of explode a little bit. It's not gonna be fast. So the best way to use Modin is to let it do its thing. Right? Not try to performance hack or anything like that. I get a lot of notebooks where people tell me that it's slower than Pandas, and most of these notebooks actually involve performance hacking, using a loop, or iloc in a loop, or something like that. And performance hacking in Modin just doesn't work. If you performance hack for pandas, it's actually gonna have the opposite effect, oftentimes. Sometimes it won't, but oftentimes performance hacking involves doing weird things that don't give Modin enough information about what is happening, if that makes sense.
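A sketch of the anti-pattern versus the intended usage (the column name is a placeholder; the same code also runs with plain pandas):

```python
import modin.pandas as pd

df = pd.DataFrame({"amount": range(1_000)})

# Anti-pattern: row-at-a-time access drags every row back to the driver process.
total = 0
for i in range(len(df)):
    total += df.iloc[i]["amount"]

# Letting Modin do its thing: one declarative, partition-parallel reduction.
total = df["amount"].sum()
print(total)
```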
[00:33:56] Unknown:
As you continue to work with Modin and do more research and progress through your PhD program, I'm curious what you have planned for the near-to-medium-term future of the project, either in terms of architectural changes, additional experimental interfaces, increasing the API coverage, or maybe community building?
[00:34:16] Unknown:
In the middle term, we're hoping to have a stable release, hopefully within the next year. A stable 1.0 release, at which time the pandas API coverage will be 100%, or very, very close to 100%. Right now, we're at, like, 93% or something like that. There's also improving the stability of the underlying system. Metadata, like I said, is a real challenge on the implementation side: managing the metadata and keeping the metadata management fast. That's a really hard problem in data frames as a whole. And improving these experimental APIs, actually turning them into non-experimental, stable APIs, I think that's a really big goal that we have moving forward.
And then, in the shorter term, we are constantly improving performance, fixing bugs, and things like that. It's been great to see how Modin has been used. I really appreciate it when people come and give detailed bug reports; that is super helpful. I mean, I'm a PhD student. Right? So I spend a lot of time on research, but I also spend a lot of time on the open source development. And I actually spend the majority of my time on the open source development. If you look at my publication history, that'll be pretty evident, I think. But in general, we want it to be more stable in the short term. That's our focus. More stable, faster.
And in the medium term, we're looking at a stable 1.0 release and,
[00:35:50] Unknown:
you know, all of those things that I mentioned that come with that. For anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so, with that, I'll move us into the picks. And this week, I'm going to choose a tool that I just recently found out about called xxh. That is a Python utility for being able to replicate your shell configuration and shell environment on servers that you SSH into. I just started playing around with that recently, because having to move from my local customized shell, where I'm using the Xonsh shell, to SSHing in and having Bash, and dealing with switching back and forth between them, being able to just have it the same everywhere, I think, would be nice. So I've been playing around with that, and it's definitely worth checking out. So with that, I'll pass it to you, Devin. Do you have any picks this week? So something I've been playing around with recently is
[00:36:42] Unknown:
a project called Lux. Lux is a visualization recommendation system. It's also by a student out of the RISE Lab, coincidentally. But it's really interesting, because I had a chance recently to talk with Doris Lee, who is the author. And Doris mentioned that, and this is far outside of my area, visualization recommendation is actually a really hard problem. Right? Recommending an interesting visualization is actually a really hard problem. Lux does a really good job of filtering out all of the unnecessary stuff and only showing you things that might interest you, only showing you visualizations that might interest you and might be worth exploring. So when we're exploring data, it's often hard to know what to do next. And I've been playing around with Lux a bit myself. And it's been really interesting, because there are datasets that I use in the Modin context, that I've been using for years, where Lux will show me something and I'll be like, oh, I never thought of doing a correlation between these columns, for example. So
[00:37:43] Unknown:
that's my pick for the week. Yeah. Definitely an interesting tool. And I actually had Doris on the podcast a little while ago to talk about her work there, so I'll add a link to that as well. So thank you very much for taking the time today to join me and share the work that you've been doing on Modin. It's a very interesting project and one that I look forward to tracking as you continue to progress through it and build out more capabilities. So thank you for all the time and energy that you've put into that, and I hope you enjoy the rest of your day. Thank
[00:38:10] Unknown:
you.
[00:38:12] Unknown:
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to the Guest and Topic
Devin's Journey with Python and Genomics
Challenges and Insights in Data Frames and Pandas
Data Frames vs. Databases
Goals and Vision for Modin
Abstracting Distributed Systems with Modin
Choosing Between Ray and Dask
Parallelizing Array Structures
Experimental Interfaces and Flexibility
Use Cases and Target Users
Lessons Learned and Challenges Faced
Interesting Uses of Modin
When Not to Use Modin
Future Plans for Modin
Closing Remarks and Picks