Summary
When you start working on a data project there are always a variety of unknown factors that you have to explore. One of those is the total volume of data that you will eventually need to handle, and the speed and scale at which it will need to be processed. If you optimize for scale too early then the complexities of distributed systems add a high barrier to entry, but if you defer that engineering investment then it can be challenging to refactor for scale later. Modin is a project that aims to remove that decision by letting you seamlessly replace your existing Pandas code and scale across CPU cores or across a cluster of machines. In this episode Devin Petersohn explains why he started working on solving this problem, how Modin is architected to allow for a smooth escalation from small to large volumes of data and compute, and how you can start using it today to accelerate your Pandas workflows.
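To make the drop-in claim concrete, here is a minimal sketch of the swap that Modin documents (the file and column names are hypothetical placeholders):

```python
# Before: import pandas as pd
import modin.pandas as pd  # same API, but operations fan out across cores

df = pd.read_csv("example.csv")         # hypothetical file, now read in parallel
print(df.groupby("category").sum())     # hypothetical column name
```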
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Devin Petersohn about Modin, a Pandas-compatible dataframe library for datasets from 1MB to 1TB+
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what Modin is and the story behind it?
- Why study dataframes?
- How do dataframes compare to databases?
- What can you do in a dataframe that you couldn’t in a database?
- What are your overall goals for the Modin project?
- Who are the target users of Modin and how does that influence your prioritization of features?
- What are some of the API inconsistencies that you have had to abstract and work around between Pandas, Ray, and Dask to give users a seamless experience?
- What are some of the considerations in terms of capabilities or user experience that will influence whether to use Ray or Dask as the execution engine?
- Can you describe how Modin is implemented?
- How has the constraint of replicating the Pandas API influenced your architectural choices?
- What are the most complex or challenging Pandas APIs to replicate in Modin?
- In addition to the core Pandas API you have also added experimental features such as SQL support and a spreadsheet interface. How have those capabilities affected the range of potential use cases and end users?
- What are some of the complexities that come from acting as a middleware between the Pandas API and the Ray and Dask frameworks?
- What are some of the initial ideas or assumptions that you had about the design or utility of Modin that have been challenged as you worked through building and releasing it?
- What are the most interesting, innovative, or unexpected ways that you have seen Modin used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Modin?
- When is Modin the wrong choice?
- What do you have planned for the future of Modin?
Keep In Touch
- devin-petersohn on GitHub
Picks
- Tobias
- Devin
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Devin Petersohn about Modin, a Pandas-compatible data frame library for datasets ranging from 1 megabyte to 1 terabyte plus. So, Devin, can you start by introducing yourself? Yeah. Hi. I'm Devin Petersohn. I'm a sixth-year PhD student at UC Berkeley in the RISE Lab.
[00:01:16] Unknown:
And I'm advised by Anthony Joseph. And do you remember how you first got introduced to Python? Yeah. So it was in undergrad actually, when I was working on scripting for a genomics project. So early on in my undergrad, I got really interested in genomics and trying to deal with, you know, genomic data, and Python really came in handy. It was introduced to me by a bioinformatician, actually. And so, yeah, that's kind of how my journey in Python started.
[00:01:50] Unknown:
And so that has led you now to building Modin. Can you give a bit of a description about what it is that you're building, some of the story behind why you decided to try and tackle this problem, and how you came to this particular solution for it? So
[00:02:07] Unknown:
as I mentioned, I've been interested in genomics problems, and I started my PhD working on genomics problems, actually. And I started to notice more general problems in science as a whole, where tools to deal with large amounts of data really require a lot of expertise in distributed computing. And one of the things that I really wanted to tackle is this problem of accessibility in big data tools and tools that deal with large amounts of data. And so I started with pandas because, you know, pandas is everywhere. It's what everyone uses. And it also has a lot of nice features. I mean, there are a lot of things to dislike about pandas for sure. And any of the maintainers would share with you a long list of things that they don't like about pandas.
But there are also a lot of reasons that people use it a lot. And so I started with pandas and started trying to tackle this problem. I thought it was gonna be a lot easier than it was. It's a very hard problem, it turns out. And I've learned a lot along the way, not only about pandas, but about data frames as a whole and distributed processing. I mean, it's really nice to be able to do this as a PhD student because I get the freedom to explore the problem as deeply as I'd like.
[00:03:20] Unknown:
And what is it about the overall space of data frames that draws you to it and wanting to study it and dig deep into this problem?
[00:03:31] Unknown:
As a part of my PhD program, like, the really nice thing about being a student at Berkeley is that I get a lot of freedom. You know, my adviser, Anthony, has given me a lot of freedom to go in whatever direction. Like I said, I started in genomics and now I'm working on a lot more general systems problems. And data frames as a whole, looking at pandas and trying to figure out, how do we scale it? I mean, it took some theory, in truth. What we did is we looked at the data model and we tried to say, how is this different from databases? And we dug basically as deep as possible, and our group was kind of the first to put forth a real formalism that could underlie data frame systems. And so now that we have that formalism, that formalism is what Modin is built on: the idea that data frames are unique. They're not quite the same thing as databases. They're not quite the same thing as arrays. They're different from spreadsheets, of course. And so they sit somewhere in the middle of all of those. Yeah. I mean, why data frames is a good question because, really, tables are how people like to think about data. Right? And it's much easier to think about data in two dimensions.
And so that's how a lot of people think about and reason about solving their data problems: by putting them in a table or a table-like structure. And data frames are a lot more permissive in what they allow for various stages of data cleaning.
[00:05:10] Unknown:
As you mentioned, two dimensions is kind of the limit of what we can comfortably work with as a mental model. But there are also a number of problem domains, particularly in the sciences, where you're going to be dealing with multidimensional matrices or n-dimensional arrays. And I know that there is the xarray project that is built on top of pandas for being able to address some of those multidimensional spaces. And I'm curious if that's something that you have considered exploring with the work that you're doing with Modin, or if, because of the fact that you are hewing to the Pandas API,
[00:05:44] Unknown:
xarray could just natively sit on top of what you're building there? The short answer is it's the latter. Like, xarray is a really, really interesting project. And, you know, what they've done with it in terms of being able to take something that we can think about in two dimensions and basically turn that into an n-dimensional array library, I think that's phenomenal. They support Dask. And so Modin also supports Dask. So Modin could kind of sit underneath xarray.
[00:06:10] Unknown:
And so you mentioned briefly the juxtaposition of data frames and databases, in that they're both working with tabular structures, but they have different strengths and areas of focus. I'm wondering if you can dig a bit more into some of the things that you can do in a data frame that are either difficult or impossible in a database?
[00:06:37] Unknown:
One example is that databases don't allow you to have multiple different types in the same column. This is really important for data that's at different stages in the cleaning process. So when data comes in from the wild, we don't know if the data is clean or gonna fit really nicely into our schema. And databases really require that you declare the schema upfront. You must declare the schema at the time of the table's creation, and then all the data that you ingest must fit into that schema. And with a data frame, you actually don't need to declare a schema upfront. The schema can be inferred at runtime.
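As a minimal illustration of that difference (plain pandas, nothing Modin-specific):

```python
import pandas as pd

# A single column holding an int, a string, a float, and a null value:
# a SQL table with a declared column type would reject this outright.
messy = pd.DataFrame({"value": [1, "two", 3.0, None]})

print(messy["value"].dtype)  # object: the type is inferred at runtime, not declared
```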
I mean, it's very natural that Python would have a data frame, and a very popular data frame at that, because of Python's dynamic typing and these things that database people don't like. Database people tend to like, you know, static typing and strict structures. But the real world is not so clean. Right? The real world actually requires us to deal with messy data. And so data frames are really good at letting me do analysis on data that is not necessarily in a perfect schema, or that doesn't necessarily fit into the traditional concept of a schema.
[00:07:58] Unknown:
And as far as the Modin project, I'm wondering what you have as far as overall goals. You mentioned wanting to be able to support the accessibility of distributed computing for scientific endeavors. But just more broadly, what is it that you are hoping to be able to achieve with the Modin project?
[00:08:16] Unknown:
At first, the goal was basically to just get a formalism around data frames and basically have an implementation of pandas. At this point, Modin supports a variety of APIs. It also supports a variety of distributed engines underneath. So it didn't start out like that. It started out with a very simple purpose, and it's kind of grown into what it is today. So my goals with it, I guess, are that I'd like to basically allow people to interact with data in the way that they're most comfortable with, on whatever they have access to. That's a very broad statement. Right? But what I mean by that is, for example, SQL is sometimes more convenient to use for certain operations than pandas is. Take join, for example. A lot of people who are familiar with SQL joins don't like pandas-style joins because you have to use pd.merge or df.merge or something. And, you know, it doesn't feel as intuitive as writing a join in SQL.
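For readers following along, the contrast looks roughly like this (standard pandas; the table and column names are made up):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [10, 20, 30]})

# The pandas-style join Devin describes:
joined = left.merge(right, on="id", how="inner")

# The SQL equivalent that many people find more intuitive:
#   SELECT * FROM left JOIN right ON left.id = right.id;
print(joined)
```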
So it would be nice if we could just switch between Pandas and SQL, for example, whenever we wanted to in the midst of our notebook, and it'll run on the same underlying execution. And with that, you know, we also want the underlying execution to be abstracted in a way that users can use whatever hardware or whatever cluster they have available. Right? We don't always get control over what we have access to at a job or, you know, even at home. And so Modin kind of abstracts away the execution engine in a way that allows us to use whatever you have. If you have access to a Dask cluster, then Modin can run on the Dask cluster. If you have access to a Ray cluster, Modin can run on the Ray cluster. And we're working on adding a few others.
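As a sketch of what selecting an engine looks like in practice (Modin documents an engine config and a MODIN_ENGINE environment variable, though exact spellings may vary between versions):

```python
import modin.config as cfg

cfg.Engine.put("dask")  # or "ray", depending on which cluster you have access to

import modin.pandas as pd

df = pd.read_csv("example.csv")  # hypothetical file; the notebook code is unchanged
```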
But the idea here is that I, as a user, don't need to think about where things are running, or what things are running, or changing my notebook for a given compute engine, or anything like that, because it's all just kind of there and it all just kind of works. It's really centered around the end user's productivity and letting them choose how they interact with data at any given point in time for any given task. Right? In general, that is the goal. It's pretty ambitious, I would say, because there are a lot of challenges in abstracting away so many details from the user. And there's a lot of machinery underneath Modin that enables us to make sure that data is where it needs to be, and that it's partitioned in the right way, and all of these things, so that users don't have to really think about distributed computing even though they're using it.
[00:11:02] Unknown:
You mentioned the support for both Dask and Ray as these underlying substrates for being able to distribute the computation. And I'm wondering what you have had to build in order to be able to abstract across those different systems for being able to handle the computation and spread it out, and the architectural aspects of Modin that allow you to parallelize based on these different underlying substrates while still being able to provide a shared API that targets pandas on top of the whole thing?
[00:11:42] Unknown:
Modin has kind of a core implementation that implements a set of data frame algebraic operators that we defined as part of this theoretical data frame work that I've done for my PhD. These algebraic operators number roughly 12 to 15. We're still in the process of getting everything completely formalized and implemented. But there are about 15 operators at this time that allow you to operate on data and metadata. And all of these are implemented as a part of this Modin core. This Modin core is then abstracted on top of these two really good task-parallel distributed systems, Ray and Dask.
We use these two basically as task schedulers. So Dask has a data frame, and a lot of people get Modin's usage confused with Dask DataFrame itself. But Modin is distinct from Dask DataFrame. It doesn't use any of the library code from Dask DataFrame. Dask DataFrame sits at the same level as Modin does. For Modin itself, this Modin core, there are a lot of challenges, mostly in dealing with the metadata. Right? There's a lot that you can do at the data level to make sure that things are nicely parallelized. Metadata, and keeping that metadata close to the user, is actually one of the biggest challenges.
Data frame users are used to being able to do something like a loc or an iloc, and it's supposed to be really fast. And making these fast in a distributed system is really hard, because the data might be on another machine. So how do we make sure that we're giving the user these immediate response times, or very close to immediate response times, in a distributed system? But this is part of my research, and we've done a lot of work to make sure that it is fast, and that we're not copying when we do a loc, and that we're not making memory copies all the time. Right? Which is another problem that Pandas has generally.
But Ray and Dask tend to be used basically as our compute engines for this Modin core implementation. So Modin as a whole is about 40,000 lines of code, I think. And the difference between the Ray and the Dask code is about 1,500 lines. So it's a very small difference in terms of how we use Ray and Dask. They're relatively similar, and there's a lot of shared code between the engines because they're both Python distributed task frameworks. Right? So we can submit Python tasks. We can submit the same Python tasks to the two of them, if that makes sense.
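To illustrate why the engine-specific code can stay so small, here is a hedged sketch (not Modin's actual internals) of submitting the identical Python function through each framework's public API:

```python
import ray
from dask.distributed import Client

def partition_sum(values):
    # A plain Python function; either engine can ship it to a worker.
    return sum(values)

data = list(range(1_000))

# Ray: wrap the function as a remote task and fetch the result.
ray.init(ignore_reinit_error=True)
ray_future = ray.remote(partition_sum).remote(data)
print(ray.get(ray_future))

# Dask: submit the very same function through a distributed client.
client = Client(processes=False)
dask_future = client.submit(partition_sum, data)
print(dask_future.result())
```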
[00:14:25] Unknown:
As far as the difference in user experience, of why somebody might choose to use Ray versus Dask, is it just a matter of what they have already chosen to use based on the requirements within their organization and what they already have running, or are there other variances in terms of capability or user experience that come about when somebody is using Modin and deciding which of these engines to build on top of?
[00:14:48] Unknown:
So, generally, what you have access to is the first question to answer. Right? Because if you only have access to Dask or only have access to Ray, then I recommend you use what you have access to. Don't necessarily try to add a new dependency or something like that. Beyond that, the Ray implementation is more mature. That's where it started. Ray is actually another RISE Lab project. And so I personally have a closer relationship with the Ray team and with the Ray folks. You know, it'll be a bit more stable on Ray. Now, there's not a lot of differences. Like I said, it's only 1,500 lines of code.
So there's not a lot of differences between Ray and Dask. But if it's all the same, then I generally recommend using Ray, because we've been working with them a lot longer, and, you know, it's another RISE Lab project. So Modin itself tries not to be opinionated on what you should use. I personally am opinionated because, you know, I've worked with these folks for a long time.
[00:15:56] Unknown:
Digging more into the actual implementation of Modin, I know that at the core level, as you said, you've developed a formalism for how a data frame should operate. But I'm wondering how you actually manage to parallelize these array structures in the compute grid in a way that you are able to get these responsive, interactive speeds across a distributed structure, whether that's across multiple cores or across multiple physical machines?
[00:16:15] Unknown:
This question has a really interesting answer, or at least it's interesting to the nerd in me. The distributed parts of Modin are really interesting and unique when you compare them to a lot of traditional distributed systems. Modin's partitioning, in particular, is very general, in that the partitioning can be anything. It's not a strict column store. It's not a row store. It's not a block store. It can be any of these at any point in time during a data frame program.
And so this flexibility lets us resolve these loc and iloc calls extremely quickly, because, first of all, they don't copy data. Right? So it's kind of a special case where we can just point to the partition where the key lives, in the case of loc. We can point to the partition where that row lives, or that set of rows, and then just be done with it. Copy-on-write is effectively what we do. So if you make a modification, that's when we do the copy. In the case of other operations, for example, data frame operations allow you to intermix column operations and row operations, things that databases also couldn't do.
In those cases, Modin can flexibly move between different forms of partitioning. And all of this is done without any user intervention. Right? So the user doesn't know at any time what the partitioning is, because it's all taken care of by the Modin core system, the core component of the system, which manages the metadata and also manages the partitioning.
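A toy sketch of the metadata-only lookup and copy-on-write behavior described above (purely illustrative; all names are hypothetical and this is not Modin's actual code):

```python
class PartitionedFrame:
    # Rows split into fixed-size blocks that conceptually live on many machines.
    def __init__(self, rows, block_size=2):
        self.blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
        self.block_size = block_size

    def loc(self, row):
        # Metadata-only: compute which block owns the row; no data is copied.
        return RowView(self.blocks[row // self.block_size], row % self.block_size)


class RowView:
    def __init__(self, block, offset):
        self.block, self.offset = block, offset

    def get(self):
        # Reads go straight to the shared block.
        return self.block[self.offset]

    def set(self, value):
        # Copy-on-write: only a mutation triggers a private copy of the block.
        self.block = list(self.block)
        self.block[self.offset] = value
```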
[00:18:02] Unknown:
In terms of the motivations for some of the architectural decisions that you made, I'm curious how the constraint of targeting the Pandas API has influenced the direction of your thinking about how to build Modin, given that it is a bit more of a general purpose framework for being able to do this compute across array structures on distributed systems?
[00:18:29] Unknown:
So Modin actually started as, like, a weekend hack, kind of. Right? So I basically wanted to try to implement the Pandas API with a simple, traditional database approach, which is row partitioning. That didn't work. Basically, at that point, I decided to boil down the pandas implementation and look at other traditional data frames, like R, for example, and then S, which comes before R, and go back to the scientific literature and see what I could find about these systems. And I couldn't find much, really, about the formalism that underlies traditional data frames. And so at that point, it wasn't just me. There's a big group of us at Berkeley who did this.
We boiled down the data frame operations into this algebra that I've been talking about, and we also formalized the data model around it, basically so that we could prove that we can do everything that pandas can do, and maybe more, actually. The formalism turned out to be more general than Pandas and what it allows. And so Pandas has over 600 operators. 15 operators doesn't sound like it could cover that many, but it does, because of user-defined functions. A lot of pandas is very repetitive. And so user-defined functions actually give us a lot of power if we support them in their completeness, in the same way that pandas does.
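A small illustration of that repetitiveness in plain pandas: several distinct operators behave like one generic map with a different user-defined function plugged in.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [-1, 2, -3]})

# df.abs() and df.isna() each act like a generic "map" operator
# parameterized by a different UDF:
assert df.abs().equals(df.apply(np.abs))
assert df.isna().equals(df.apply(pd.isna))
```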
And so the challenge was that we couldn't use traditional styles, traditional methods from databases, to tackle the problem of pandas. We had to actually come up with something novel that would allow us to do everything that we needed to do.
[00:20:13] Unknown:
Because the structure that you've built is a bit more flexible and general purpose than Pandas, you've actually built a couple of other experimental interfaces on top of it. I was noticing that you have the capability of executing SQL statements on the data frame, as well as a spreadsheet API for being able to do more of an Excel-type experience. And I'm wondering if you can talk through some of the motivations for going down these experimental paths and some of the thoughts that you have on being able to make the data frame a bit more of a general purpose structure and not tied specifically to a Python-oriented API.
[00:20:55] Unknown:
So the whole point of Modin has kind of transformed into this vision that a group of us at Berkeley have around meeting users where they are and allowing them to interact with data in a way that's most comfortable to them. The goal is to basically take Modin, this kind of middle layer that we have, the Modin core, and allow users to use the APIs that they are familiar with. And so that involves basically translating the spreadsheet API, for example, and those interactions down into the Modin core algebra. And since pandas and data frames are so general in what they allow, these other interfaces end up being subsets of the pandas API. So with SQL, for example, everything you could do in SQL, you could also do in pandas.
The behavior isn't always the same, though. In SQL, for example, operations aren't ordered. In pandas, they are. Right? Part of the formalism we have is that there's an output order for every input order and operation. So in SQL, since there's not necessarily an output order, users might assume that certain operations are going to be fast, because unordered operations are just generally faster than ordered operations. If you don't have to worry about order, then you can do things in whatever order is the fastest. So there are challenges like this that we haven't tackled yet that are related to meeting the expectations of users who are using different APIs.
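A concrete example of the ordering difference (plain pandas; the SQL is shown as a comment, since the semantics are the point):

```python
import pandas as pd

df = pd.DataFrame({"x": [3, -1, 2]}, index=["a", "b", "c"])

# pandas guarantees an output order for every input order:
# this always yields row "a" followed by row "c".
print(df[df["x"] > 0])

# The SQL equivalent makes no such promise without an ORDER BY:
#   SELECT * FROM df WHERE x > 0;
```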
But as a whole, what we want to tackle is this problem that we have of systems that force users to change their behavior in order to gain their benefits. Right? So learning a new API is expensive in terms of human time. It takes a long time to learn a new API. And, yes, there are better APIs than pandas out there. I think a lot of people would agree. But so many people know pandas that I think we would be doing the world a disservice if we just tried to get everybody to shift away from the existing APIs that we have. Right? I mean, if you imagine that it takes several months to get familiar with a new API and get to the point where you're productive with it, that's a lot of time. Right? That's a lot of time that you've lost, in effect.
And the performance is probably not worth it. In general, chasing the fastest tool is just generally not gonna be worth it. Right? We need tools that give users superpowers, effectively, without requiring that they change a lot of their behavior.
[00:23:46] Unknown:
Because you do have this flexible interface, and you're not forcing people into a particular language or syntax to work through the problems that they're trying to solve, I'm curious how that affects the use cases and the target users that you have for Modin as it compares to something like Pandas.
[00:24:11] Unknown:
The general use case of Modin has so far been users who are hitting the wall that you hit with Pandas performance. These other features, I haven't seen a lot of use of them in production use cases. There are people who are experimenting with them a bit, but they're very, very new, maybe a few months old. So the use cases around the SQL API or the spreadsheet API, those aren't really in production just yet. The way that the pandas API is being used is generally as a way to put off having to rewrite things into Spark, for example, or another distributed system. So there are production use cases at large companies where the user just writes a notebook that uses Modin, and then they're able to basically take that and run it as a production job.
And they don't have to rewrite it into Spark in order to make use of the entire dataset. Right? The user is able to actually just write Modin code and experiment with Modin using the pandas API that they're familiar with. And they don't have to go through this costly phase of translating that into Spark. And a lot of companies have this problem where they're taking Pandas code, or some code that's written on a single node, and they have to productionize it in some way. A lot of times that looks like translating the code into Spark or something like that. Right? Translating Pandas into Spark is a common use case. And, you know, data engineers typically do that. A lot of companies don't let data scientists write production code.
But what Modin allows is for the data engineer to not have to be so hands-on with how they productionize the code. Right? They are able to scale their efforts a bit better, because it takes a long time to rewrite these. Distributed applications just take a long time to write in general. So that's a really interesting use case that I actually wasn't anticipating when I first started. I was trying to solve the data science productivity problem, and it ended up actually helping out data engineers quite a bit as well, which is very interesting.
[00:26:29] Unknown:
So you mentioned that where you are with Modin today isn't where you set out to be from the start. And I'm curious, as you look back to when you were first thinking about what you were going to build, what were some of the initial ideas or assumptions that you had about the particular needs of data scientists and how a distributed computation system would be able to alleviate some of those? How have some of those ideas been challenged or invalidated as you went along the journey of actually building Modin? And what are some of the lessons that you wish you had known earlier on as you were starting down this path that would have saved you a significant amount of time and heartache?
[00:27:11] Unknown:
So one of the things that I wish I had learned earlier was that data scientists, or at least the data scientists that I've talked to, don't really care about Modin's internals. In the early days, when I would talk about Modin, I would talk about this algebra and how it's very elegant and things like that. But a lot of the data scientists that I was speaking with didn't care. And it's not that they didn't care; it's that it didn't align with what their problems are. Or rather, it didn't align with what they thought their problems were. So I wish, in those early days, that I hadn't focused so much on the deep technical parts of Modin when I would talk to data scientists.
I wish that I had focused more on the problems that it solves, allowing users to think about how Modin solves their problems, rather than: oh, Modin has this deep technical algebra; am I gonna have to learn how to use this algebra, or what does this algebra have to do with my problem? And so it comes down to a communication lesson, really. There's also the lesson that I wish I had learned earlier, which was that pandas is really hard and it's really big. You know, I knew pandas when I started, but now I think back to those days and I realize I didn't know pandas at all when I first started, compared to how much I know today.
And it's been a really interesting journey and a really interesting progression, to be able to start out with some prototype that worked for a few APIs, not a lot, and then grow it into something that today is actually relatively useful, I would say. But, you know, in general, the biggest lessons that I've had are that data frames are unique, and they're actually really, really useful and really, really interesting structures. And yeah, it's been a very fun journey to unwrap this complexity and drill it down into a formal structure, something that we can actually reason about from a theoretical standpoint.
[00:29:34] Unknown:
In terms of the actual uses and projects that have been built using Modin, I'm wondering what are some of the most interesting or innovative or unexpected ways that you've seen it used.
[00:29:44] Unknown:
In terms of most unexpected, absolutely, it would be the delaying or replacing of Spark in that way. Right? I mean, Spark came out of AMPLab, which is the predecessor to RISE Lab. So, you know, I know a lot of the people who have worked on Spark. It was never my intention to replace Spark, if that makes sense. But having that actually happen, that's been the most surprising thing to me, really. Because, like I said, I wanted to solve this problem of individual data scientist productivity. And what that's turned into is: you solve a bunch of other problems for free if you focus on the data scientists first. And I think that's actually a really good lesson, too. Speaking of lessons, data scientists have a lot of challenges in their day to day.
And I don't know if it's just me, but lately I've been reading a lot of things on various blog posts about how data scientists are gonna go extinct and how data engineers are gonna take over data science. And, you know, coming from genomics, I don't think that's the case. Because in genomics, on the bioinformatics side, they've tried for years to teach computer scientists biology. It's just too hard. You just have to know too much biology in order to be useful at looking at biological data. And the same thing is true with data science, I believe. And so I feel like I'm advocating for data science at this point, but I do believe that data scientists are extremely important and that we should focus on their productivity as a first-class citizen. And then all the free stuff that we get is gonna be great. Right? It's gonna just make the field of data science and data engineering better, just because the individual data scientists will be better at their jobs and more productive, and not tied to the same tools that they've been using for the last decade.
[00:31:43] Unknown:
And so for people who are excited about the capabilities of what Modin is offering, and are thinking that they're just going to replace Pandas with Modin everywhere they're using it, what are some of the cases where Modin is the wrong choice and somebody might be better served just using Pandas out of the box?
[00:31:59] Unknown:
There are several cases where I would suggest not to use Modin. The first is if you're happy with what you're using. If you don't have any problems with what you're using, don't change it. I feel like we're often pressured to change things when we don't necessarily need to, or when we're not forced to. If you're happy with your tools, just keep using them. I'm not trying to convince anybody who's happy with their tools to switch to Modin. And that includes, you know, Spark and Dask and all these other tools. Right? There are a lot of reasons to use Modin if you have problems with other systems.
But if you're happy with them, then don't change. I think that's the simplest answer. Now, there are also cases where it's hard to make things fast in a distributed system. For example, if you do a lot of looping over your data, if you prefer to operate on your data in loops, Modin in its current implementation would not be a good choice. Because looping over data in a distributed system means that you're touching every row. And if you're modifying every row, or doing something with the data at every row, then that means that you're pulling all of the data, effectively line by line, into the driver process, into your current environment.
It makes your memory kind of explode a little bit. It's not gonna be fast. So the best way to use Modin is to let it do its thing. Right? Not try to performance hack or anything like that. I get a lot of notebooks where people tell me that it's slower than Pandas, and most of these notebooks actually involve performance hacking, using a loop, or iloc in a loop, or something like that. And performance hacking in Modin just doesn't work. If you performance hack for pandas, it's actually gonna have the opposite effect, oftentimes. Sometimes it won't, but oftentimes performance hacking involves doing weird things that don't give Modin enough information about what is happening, if that makes sense.
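A sketch of the anti-pattern versus the intended usage (the column name is a placeholder; the same code also runs with plain pandas):

```python
import modin.pandas as pd

df = pd.DataFrame({"amount": range(1_000)})

# Anti-pattern: row-at-a-time access drags every row back to the driver process.
total = 0
for i in range(len(df)):
    total += df.iloc[i]["amount"]

# Letting Modin do its thing: one declarative, partition-parallel reduction.
total = df["amount"].sum()
print(total)
```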
[00:33:56] Unknown:
As you continue to work with Modin and do more research and progress through your PhD program, I'm curious what you have planned for the near-to-medium-term future of the project, either in terms of architectural changes, additional experimental interfaces, increasing the API coverage, or maybe community building?
[00:34:16] Unknown:
In the middle term, we're hoping to have a stable release, hopefully within the next year. A stable 1.0 release, at which time the pandas API coverage will be 100%, or very, very close to 100%. Right now, we're at, like, 93% or something like that. There's also improving the stability of the underlying system. Metadata, like I said, is a real challenge on the implementation side: managing the metadata and keeping the metadata management fast. That's a really hard problem in data frames as a whole. And improving these experimental APIs, actually turning them into non-experimental, stable APIs, I think that's a really big goal that we have moving forward.
And then, in the shorter term, we are constantly improving performance, fixing bugs, and things like that. It's been great to see how Modin has been used. I really appreciate it when people come and give detailed bug reports; that is super helpful. I mean, I'm a PhD student. Right? So I spend a lot of time on research, but I also spend a lot of time on the open source development. And I actually spend the majority of my time on the open source development. If you look at my publication history, that'll be pretty evident, I think. But in general, we want it to be more stable in the short term. That's our focus. More stable, faster.
And in the medium term, we're looking at a stable 1.0 release and,
[00:35:50] Unknown:
you know, all of those things that I mentioned that come with that. For anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so, with that, I'll move us into the picks. And this week, I'm going to choose a tool that I just recently found out about called xxh. That is a Python utility for being able to replicate your shell configuration and shell environment on servers that you SSH into. I just started playing around with that recently, because having to move from my local customized shell, where I'm using the Xonsh shell, to SSHing in and having Bash, and dealing with switching back and forth between them, being able to just have it the same everywhere, I think, would be nice. So I've been playing around with that, and it's definitely worth checking out. So with that, I'll pass it to you, Devin. Do you have any picks this week? So something I've been playing around with recently is
[00:36:42] Unknown:
a project called Lux. Lux is a visualization recommendation system. It's also by a student out of the RISE Lab, coincidentally. But it's really interesting, because I had a chance recently to talk with Doris Lee, who is the author. And Doris mentioned that, and this is far outside of my area, visualization recommendation is actually a really hard problem. Right? Recommending an interesting visualization is actually a really hard problem. Lux does a really good job of filtering out all of the unnecessary stuff and only showing you things that might interest you, only showing you visualizations that might interest you and might be worth exploring. So when we're exploring data, it's often hard to know what to do next. And I've been playing around with Lux a bit myself. And it's been really interesting, because there are datasets that I use in the Modin context, that I've been using for years, where Lux will show me something and I'll be like, oh, I never thought of doing a correlation between these columns, for example. So
[00:37:43] Unknown:
that's my pick for the week. Yeah. Definitely an interesting tool. And I actually had Doris on the podcast a little while ago to talk about her work there, so I'll add a link to that as well. So thank you very much for taking the time today to join me and share the work that you've been doing on Modin. It's a very interesting project and one that I look forward to tracking as you continue to progress through it and build out more capabilities. So thank you for all the time and energy that you've put into that, and I hope you enjoy the rest of your day. Thank
[00:38:10] Unknown:
you.
[00:38:12] Unknown:
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to the Guest and Topic
Devin's Journey with Python and Genomics
Challenges and Insights in Data Frames and Pandas
Data Frames vs. Databases
Goals and Vision for Modin
Abstracting Distributed Systems with Modin
Choosing Between Ray and Dask
Parallelizing Array Structures
Experimental Interfaces and Flexibility
Use Cases and Target Users
Lessons Learned and Challenges Faced
Interesting Uses of Modin
When Not to Use Modin
Future Plans for Modin
Closing Remarks and Picks