Summary
Data exploration is an important step in any analysis or machine learning project. Visualizing the data that you are working with makes that exploration faster and more effective, but having to remember and write all of the code to build a scatter plot or histogram is tedious and time consuming. In order to eliminate that friction Doris Lee helped create the Lux project, which wraps your Pandas data frame and automatically generates a set of visualizations without you having to lift a finger. In this episode she explains how Lux works under the hood, what inspired her to create it in the first place, and how it can help you create a better end result. The Lux project is a valuable addition to the toolbox of anyone who is doing data wrangling with Pandas.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to pythonpodcast.com/census today to get a free 14-day trial.
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at pythonpodcast.com/hightouch.
- Your host as usual is Tobias Macey and today I’m interviewing Doris Lee about Lux, a Python library that facilitates fast and easy data exploration by automating the visualization and data analysis process
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what Lux is and how the project got started?
- What is the role of visualization in a data science workflow?
- What are the challenges that data scientists face in the exploratory phase of their analysis?
- There are a wide variety of data visualization tools in the Python ecosystem with differing areas of focus. What is the role of Lux in that ecosystem?
- How does Lux compare to tools such as scikit-yb?
- What is the workflow for someone using Lux in their analysis and what problems does it solve for them?
- Can you talk through how Lux is architected?
- How have the goals and design of Lux changed or evolved since you first began working on it?
- Data visualization is a broad field. How do you determine which kinds of charts or plots are best suited to a particular data set or exploration?
- What are some of the capabilities of Lux that are often overlooked or underutilized?
- How has Lux impacted your own work in data analysis/data science?
- What are some of the other gaps that you see in the available tooling for data science?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Lux used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on and with Lux?
- When is Lux the wrong choice?
- What do you have planned for the future of the project?
Keep In Touch
Picks
- Tobias
- Pirates of the Caribbean movies
- Doris
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Lux
- UC Berkeley
- RISE Lab
- School of Information
- Pandas
- Bokeh
- Seaborn
- Altair
- Matplotlib
- Grammar of Graphics
- Plotly
- Scikit YellowBrick
- D3.js
- Vega
- Numpy
- xarray
- Tensorflow
- Jupyter Widget
- Choropleth Map
- G10 Countries
- Ray
- Modin
- Dask
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's l i n o d e, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at pythonpodcast.com/hightouch.
Your host as usual is Tobias Macey. And today, I'm interviewing Doris Lee about Lux, a Python library that facilitates fast and easy data exploration by automating the visualization and data analysis process. So Doris, can you start by introducing yourself?
[00:01:52] Unknown:
Hi, everyone. My name is Doris Lee, and I'm a PhD student at UC Berkeley at the School of Information and at RISE Lab. Most of my work is around building tools to help enable data scientists to more easily explore their data or visualize their data. And recently, I've been working on a Python library called Lux, which is a tool that helps you visualize your data frames within your Jupyter Notebooks.
[00:02:10] Unknown:
And do you remember how you first got introduced to Python?
[00:02:13] Unknown:
Yeah. It's actually an interesting story. In high school, I was really interested in physics, so I reached out to a particle physics professor asking him if there was any kind of research or something that I could do to start learning about physics. And one of the first things that he asked me to do was learn Python, because in these large scale particle physics collaborations, a lot of the work is actually computational. That was kind of my first introduction to Python. I was actually handed this book called Snake Wrangling for Kids. That's how I got introduced to Python. And then during undergrad, I did a lot of work in astronomy and physics in terms of research.
And a lot of the time, we had to handle large amounts of data to discover some sort of insights in the data. And that was when I realized that there was this lack of tools around making it easy for people to visualize and understand their data, especially for domain experts like astronomers or material scientists, or even domain experts in, let's say, health care or other domains. In those cases, the lack of tools kind of prevents these researchers or domain experts from getting to their insights. And so I set out in my PhD to build these systems to help people more easily access their data and be able to understand what is going on with their data and how to discover useful insights from it.
[00:03:37] Unknown:
And digging into that, we'll get to Lux in a moment, but you mentioned that your overarching goal was to be able to improve the overall ecosystem of tooling to make it easier for these different scientists in specific problem domains to be able to interpret their data and be able to do analysis. And so from that sort of broad vision, what was your process for deciding on at least one of the main problems that needed to be addressed? And from that, maybe discuss a bit about what it is that you're building with Lux.
[00:04:15] Unknown:
I think beyond even scientists, because data science is touching so many different fields now, from health care to businesses and commercial use cases, all these organizations have large amounts of data that they need to somehow process and handle. And, specifically, one of the areas that I'm interested in is around visual analytics and building tools to help people visualize their data. And this is kind of where the human analyst comes in and brings in some of their own domain knowledge. Let's say they have some specific domain knowledge about what they want to look for in their data, or they have some sort of hypothesis. Visualization is kind of the right medium to facilitate this type of investigation, just because you have the human in the loop doing the analysis.
And so getting back to your question, which is around how we got started on Lux, I think that was the motivation for why we were interested in this visualization aspect, because it has the human in the loop, and it brings together the domain knowledge as well as the computational insights that you can generate from the data.
[00:05:29] Unknown:
For Lux specifically, as I was going through the documentation, it definitely has a great user experience, it looks like, where you have your pandas data frame and you wanna be able to get some just quick analysis to see, you know, what are some of the interesting pieces of information here. So you just have the wrapper around it where you can just say, you know, df dot, I forget the specific syntax, and it will spit out a set of visualizations. So I'm just wondering if you can talk a bit more about what that enables and what you have actually built with Lux and how it fits into the workflow of somebody who's trying to do an analysis. Like, where does it sit in the life cycle of going from, I have a problem, to, you know, I've built a machine learning model or I've built a statistical model for being able to solve the problem? Does Lux fit in one spot or throughout the workflow of going from problem to solution?
[00:06:25] Unknown:
I think one of the things that we were really interested in when we started building Lux came from some of our previous visualization projects, which were more of these web browser tools. You open up a page, and there's a point and click interface where you could click through, and then the system shows you some sort of interesting insights or useful visualizations for your work. And when we started working with a lot of these domain experts on these GUI or graphical tools, one of their main complaints was, all my data cleaning and modeling code is in my Jupyter Notebook, and then I have to export everything and put it into this GUI tool and then have it export some sort of insight. And so there's a lot of friction in that process. Right? You have to import and export things to and from your Jupyter Notebook. So we started thinking, hey, how can we bring some of these useful visual analytics systems and techniques into the notebook itself? And in particular, a lot of our work was around this idea of a visualization recommendation system. So how can you recommend interesting insights to the users as they're exploring their data? And it naturally fits in very well with a data science workflow where you're doing a bit of exploratory work. You know, maybe you've just received the dataset, and you wanna do some sort of data cleaning as well as getting some understanding of whether the data is clean enough or well formatted enough for any sort of downstream modeling or processing.
And so this is where Lux can be useful. You can load in your data frame just like you would when using Pandas. All the Pandas commands are the same. And when you print out the data frame, Lux recommends a set of interesting visualizations to you. You know, we might be showing you correlations in your data, skewness in your distributions, and so on. So we display a set of visualizations in this inline widget within your Jupyter notebook whenever you print the data frame, and you can toggle through the different tabs and interact with the visualizations. And one of the key parts of that is we always display the original Pandas data frame table to you. That was a design decision that we made very early on to ensure that people could still do everything that they could do in Pandas, and they could see everything that they expect to see from Pandas. And then when you click on a button that toggles back and forth between the grid view, which is the Pandas view, and the Lux view, you now get all sorts of recommendations and visualizations shown to you. So part of the motivation is that we wanted this tool to be always on. We wanted to always be able to afford visual insights to users as they're exploring the data. Even if they intended to just print out the data frame to inspect the data itself, maybe sometimes when they toggle to the Lux view, they learn something interesting.
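As a rough illustration of the workflow described above, a notebook cell might look like the following. This is a minimal sketch rather than an excerpt from the Lux documentation: the sales data is made up, the `lux` import is guarded because the library may not be installed, and the widget only renders inside Jupyter.

```python
import pandas as pd

try:
    import lux  # importing Lux hooks its widget display into pandas DataFrames
    LUX_AVAILABLE = True
except ImportError:
    LUX_AVAILABLE = False  # plain pandas keeps working exactly as before

# Made-up sales data for illustration; in practice this might come from pd.read_csv(...)
df = pd.DataFrame(
    {
        "region": ["East", "West", "East", "South", "West", "South"],
        "price": [12.5, 8.0, 15.0, 9.5, 11.0, 7.25],
        "units": [3, 10, 2, 7, 5, 12],
    }
)

# The usual pandas commands are unchanged.
cleaned = df.dropna()

# In a Jupyter notebook, leaving the frame as the last expression of a cell
# displays the pandas table, with a button that toggles to the Lux view.
cleaned
```

Nothing beyond the import is needed to get the recommendations, which is the "always on" behavior Doris describes.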
[00:09:28] Unknown:
So kind of lowering that barrier to exploring your data by providing some sort of always on visual display to the user. And your point about the friction that these different domain scientists were experiencing as they were going back and forth from their notebooks to these different domain oriented graphical analysis tools is interesting. And I also know that there are a large and growing number of options for being able to do data visualization within the Python ecosystem, and the majority of those actually work well or natively within Jupyter. And so for those types of libraries, I'm thinking things like Matplotlib, Altair, Bokeh.
I'm sure there are a number of others that I'm maligning. What are some of the points of friction that people run into even within that native Python ecosystem and within the Jupyter Notebook ecosystem that Lux helps to overcome just by having this, you know, native automated wrapper around the Pandas data frame?
[00:10:25] Unknown:
Yeah. That's a really good point. There's a lot of these existing libraries, including Matplotlib, Seaborn, and Altair, some of the ones that you mentioned, which you can think of as visualization libraries. So they fully implement the grammar around how you build a visualization. But to go from your data frame, which is the input to your visualization, to a final visualization that you've designed, let's say a bar chart, you still have to write quite a bit of code to first do some sort of data processing. Let's give an example. Let's say that you're trying to generate a bar chart. First of all, you have to do a group by and an aggregation with Pandas. And then you have to take that resulting data frame and map it onto a graphical encoding. So you have to figure out, hey, do I want to use a bar chart or a scatter plot, or do I want to use another chart encoding? What color do I wanna pick? So there's all these design decisions as well as data decisions that you have to make in order to get that visualization.
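The bar chart example above can be sketched by hand with pandas and Matplotlib. The data, column names, color, and labels are all illustrative assumptions. The point is only to show how many separate data and design decisions sit between a data frame and one chart.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["East", "West", "East", "South", "West", "South"],
        "price": [12.5, 8.0, 15.0, 9.5, 11.0, 7.25],
    }
)

# Data decisions: which column to group by and which aggregate to compute.
mean_price = df.groupby("region", as_index=False)["price"].mean()

# Design decisions: chart type, axis mapping, color, labels.
fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(mean_price["region"], mean_price["price"], color="steelblue")
ax.set_xlabel("region")
ax.set_ylabel("mean price")
ax.set_title("Mean price by region")
fig.tight_layout()
```

Each of those choices has to be repeated for every chart one might want to look at during exploration.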
And so even though a lot of these libraries are extremely flexible, and you can create these beautiful visualizations within these libraries, it's a lot of upfront work to even be able to write the code to generate a single visualization. And what we see is that the process of having to write that code discourages people from even exploring their data. I mean, exploration is one of those things where you could do it, you could also not do it. When you do it, sometimes you find something interesting, sometimes you don't. And so it's very hard to measure the value of whether it's worth it to plot this visualization, and a lot of the time, people don't. And so by making the recommendations that we're displaying always on and attached to your data frame, so that it doesn't require any additional work to visualize, it makes it a lot easier for people to want to look at the visual views. That being said, I think Altair and Matplotlib and Plotly are still very powerful tools, and Lux itself uses those libraries underneath the hood to generate our visualizations.
So Lux has the capability of generating visualizations in Altair and Matplotlib. And for a lot of more fine grained tasks, it's still more useful to use those libraries. For example, if you wanted to change your scatter plot marks to green, we don't natively support that in Lux. To be able to do that, you would essentially use Lux as a way of getting to a templated visualization and then edit the Altair or Matplotlib code that's generated to make those changes. So I think Lux and all these other visualization tools sit in pretty different areas. In some situations, Lux is more useful. In other situations, it's much more useful to use Matplotlib and Altair, when you know exactly what you're looking for and what visualization you're plotting.
[00:13:32] Unknown:
Yeah. There's definitely a lot to be said for having something that does all of the work for you so you don't have to think about it. Because even if you are an expert in, you know, Altair or Matplotlib, you still are at some point going to have to go back and say, okay, what was that specific parameter, and what are the valid ranges for it? So not having to even think about that to get something to start with is a huge amount of value for, as you said, very little effort. Another tool that comes to mind as I'm looking at Lux that seems to have a similar type of use case is the Scikit Yellowbrick project, which is more oriented around integrating with the scikit-learn ecosystem and being able to do some sort of, like, dimensionality reduction and visualization of trained machine learning models, being able to validate your assumptions about what the model is actually doing. And I'm wondering if you have any thoughts on the comparison as to the utility or use cases or maybe crossover between Lux and Scikit Yellowbrick.
[00:14:33] Unknown:
Yeah. I think Scikit Yellowbrick seems like a library that is really great for displaying these machine learning visualizations specific to, you know, maybe you're doing some sort of diagnostic or you're trying to understand your model a little bit more. And I think with Scikit Yellowbrick, the fact that it packages everything into a visualizer for you, and then you can just call these high level visualizers without having to write specific code, makes it much easier for you to get at these visualizations and be able to understand your machine learning model. In that sense, it's very similarly motivated to Lux. I think Scikit Yellowbrick is a package that is a lot more mature, so I think it's been around for a long time. It seems like there's a lot of adoption in this space. So it's interesting to see this idea of a high level visualizer that is able to package some of these details away. And for this case, even being able to help you understand your machine learning model, we're seeing adoption of some of these. And like what we talked about, we see that the visualization ecosystem has these high level things similar to Yellowbrick and Lux, as well as things like Matplotlib and Altair. Even those are considered declarative libraries for visualizations.
And then we get to things like D3 and Vega, which is a level that's even lower than that. So there are these different stratified layers for interacting with your data. And I think part of the design of Lux is trying to understand how you interoperate with these tools and these things that are in people's existing ecosystem. Like, I don't think the solution is, hey, everyone should use Lux for all of their data analysis. It's definitely using Lux to get at something that is useful and then going into your favorite visualization library, you know, Seaborn, Matplotlib, to really be able to fine tune and present that data effectively.
[00:16:33] Unknown:
To your point about ecosystems, it's worth digging into the fact that Lux is tied, at least currently, to pandas in terms of being able to generate visualizations from these data frames, whereas Scikit Yellowbrick is tied into the scikit-learn ecosystem. And I'm just wondering if you can talk to the benefits and trade offs and the considerations that you went through as you were deciding how closely to tie yourself to the pandas ecosystem. For instance, if somebody's trying to use NumPy with some n dimensional array and wants to be able to get some of the same benefits as Lux, how did you think about the cost benefit analysis in terms of which ecosystems to focus on, at least initially?
[00:17:18] Unknown:
Part of the motivation in plugging into the Pandas ecosystem is that the Pandas data frame itself comes with so many rich functionalities and APIs for data cleaning, analysis, and even some plotting capabilities that are tied in with Matplotlib. And so based on at least what we've seen, most of the data scientists who are in the Python ecosystem are using Pandas for exploratory data analysis and even across data cleaning and other types of workloads. And so we found that to be a very nice entry point to be able to say, hey, what would it look like if we added an always on visualization display on top of the Pandas data frame? That being said, I think your example of the NumPy array is definitely an interesting one. We haven't thought about it too much, but I know that NumPy matrices and arrays are something that machine learning users, for example, use pretty often. I think for now, we've definitely focused more on the use case where, you know, maybe you're loading in a CSV or you're connecting to some sort of database, and then you're pulling in some sort of semi structured data. And I think there's a lot of opportunities in the future to look at how Lux could potentially be used in the machine learning workflow.
[00:18:36] Unknown:
And another library that comes to mind as we're talking about some of these other types of data that we're working with is xarray, which I know is based at least partially off of pandas, but adds additional dimensionality to it. And I'm wondering what your thoughts are on the viability of being able to use Lux with something like xarray and the potential complexities that get added because of those additional dimensions, where you then have to figure out how to map it down into lower dimensionalities or, you know, expand your out of the box visualizations.
[00:19:07] Unknown:
Yeah. I'm not too familiar with xarray itself, but going back to the NumPy array, or I guess xarray as well, in terms of how we recommend the visualizations, that is not particularly tied to the data frame itself. I think we use the data frame as a medium to deliver these visualizations. But if we did have, let's say, a matrix that is filled with quantitative values, then we do have a way of visualizing it. Now, being able to visualize it doesn't necessarily mean that it makes sense for the users who are using those tools. That obviously requires a different kind of use case and a little bit more research into how that is being done. So thinking about how that fits in with, let's say, an xarray or maybe even TensorFlow objects, that is definitely something that's not currently our focus, but we could look into that in the future.
[00:20:11] Unknown:
Pandas is definitely a viable target because, as you said, it's a huge ecosystem. There's a large community of people who use it, and it fits probably, you know, 80% of the data analysis and data manipulation workflows that people are doing. And it's just a much more natural paradigm to think about data in 2 dimensional arrays versus, you know, these multidimensional planes that you get into as you go into things like neural networks or vector space. But digging more into Lux itself, can you talk a bit about how it's architected and just some of the ways that the overall goals and design of the project have evolved since you first began working on it?
[00:20:47] Unknown:
So Lux is completely written in Python for the back end. And then in the front end, we are using this thing called Jupyter widgets. So Jupyter widgets is this library which allows you to add things like sliders and buttons and all these interactive widgets, which can actually talk to your front end and back end code. So it avoids this layer of having to do the, you know, IPython kernel communication yourself. It kinda handles it for you. So our front end is then completely written in TypeScript and JavaScript. So a very standard kind of web app that's based on React. And a lot of the core of the work that we're doing is mostly on the back end, being able to build these recommendation modules that recommend interesting insights to the users as they're exploring their data. Another key aspect of the recommendation is that we have to maintain some sort of metadata around the data itself. There's actually some really interesting problems here. For example, in the typical case where you're trying to visualize data, your data isn't evolving that much. Let's say you're in a BI tool or in one of these interactive interfaces.
You're not really modifying your data all that much. It's more like you're reading the data and then maybe plotting some sort of visualizations. But in the Pandas workflow case, you're actually modifying your data a lot. You might be dropping your null values. You might be renaming a column. You might be doing all these different transformations to your data. And the structure of your data could look very different from time to time. And so being able to generate visualizations on top of that requires us to maintain some sort of metadata. And so a lot of the work that we've been doing in Lux is trying to make sure that the overhead cost of us generating those visualizations is not significant, so that it doesn't slow down the user's data science workflow. There's some interesting optimizations that we've applied in order to make that happen.
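One way to picture the metadata problem described above is a wrapper that computes column metadata lazily and throws it away whenever the data is transformed. This is a hypothetical sketch, not Lux's actual implementation: `TrackedFrame`, `update`, and the metadata fields are invented names for illustration.

```python
import pandas as pd


class TrackedFrame:
    """Hypothetical wrapper that caches per-column metadata until the data changes."""

    def __init__(self, df: pd.DataFrame):
        self._df = df
        self._metadata = None  # computed lazily, only when something needs it

    def update(self, transform):
        """Apply a pandas transformation and expire the cached metadata."""
        self._df = transform(self._df)
        self._metadata = None
        return self

    @property
    def metadata(self):
        if self._metadata is None:  # recompute only after a change
            self._metadata = {
                col: {
                    "dtype": str(self._df[col].dtype),
                    "cardinality": int(self._df[col].nunique()),
                }
                for col in self._df.columns
            }
        return self._metadata


tf = TrackedFrame(pd.DataFrame({"price": [1.0, None, 3.0], "city": ["a", "b", "a"]}))
before = tf.metadata
tf.update(lambda d: d.dropna().rename(columns={"city": "town"}))
after = tf.metadata  # rebuilt once, then reused until the next update
```

Deferring the computation until display time is one plausible way to keep the overhead low while the user drops nulls, renames columns, and otherwise reshapes the frame.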
[00:22:48] Unknown:
Digging more into some of the specifics of the actual visualizations themselves, I'm curious what your process has been to understand sort of what are the best sort of, at least, first pass visualizations to use. You know, do you use a bar chart versus a histogram or a pie chart or a line graph? And just some of the potential challenges or points of confusion that might present themselves as you're using these different types of analyses or, you know, in the case if you have geographic data embedded in the data frame, you know, bringing up a choropleth map or time series data and just some of the complexities and how you're handling those variations in data types and the representations that you generate out of the box?
[00:23:32] Unknown:
As I mentioned earlier, Lux is built on this series of work that we've been doing on visualization recommendation systems. And as part of that work, we've looked at how to actually recommend interesting or useful visualizations to the user and what types of encoding we should use for a given visualization type. Over the past decade, there's been a lot of interesting work in the visualization community that has specifically done research on this. So people have actually done studies where they look at what the best practices are for encoding, let's say, your multidimensional data, where maybe you have two nominal variables as well as a categorical variable. For all these combinations of data, how do you actually best encode your data? There's a rich body of work that the community has done in order to figure out what visualization we should show. And so Lux kind of borrows some of that research.
But for data science use cases, there are also some really interesting and unique challenges. Not necessarily challenges, but for the data science use case, there are some very interesting and unique visualizations that you would actually show that haven't really been well studied. So one example of this is when we do a group by and then an aggregation, the data frame itself becomes kind of this indexed thing. So the grouping column itself becomes the index column. And in that case, it doesn't really make sense to visualize the entire data frame. It makes sense to visualize what you've already computed as the aggregate. So Lux has a way of being able to look at these data frames that are already preaggregated.
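The preaggregated case can be seen directly in pandas. The sketch below uses made-up data and only shows what the frame looks like after a group by and aggregation. It does not reproduce how Lux detects or visualizes such frames.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["East", "West", "East", "South", "West", "South"],
        "price": [12.5, 8.0, 15.0, 9.5, 11.0, 7.25],
    }
)

# After a group by and aggregation, the grouping column moves into the index.
agg = df.groupby("region").agg(mean_price=("price", "mean"))
print(agg.index.name)     # the grouping key now lives in the index
print(list(agg.columns))  # only the computed aggregate remains as a column

# To chart the aggregate directly, the index has to be treated as a plottable field.
plottable = agg.reset_index()
```

A tool that only looked at the columns of `agg` would miss the category axis entirely, which is why the aggregate needs special handling.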
Similarly, you might be doing things like pivots or, for example, selecting a particular column to look at the 1-D series. For all of that, Lux tries to display a visualization that is natural to the user. And going back to the example that you gave, which is this geographic example, this is one of our latest features in Lux, where we started thinking, hey, if I just had a column of data that is all countries, that's not super useful for me as an analyst. I might be able to read 5 of these rows, and then I know what countries are there. But wouldn't it be great if I could plot that on a map and show a different perspective on the data? And so a lot of these recommendation types that we started to work on, in addition to the body of work that the existing visualization literature has already been looking at, are about what makes sense in a data science workflow.
We've looked at notebooks on Kaggle and GitHub to see what the typical visualizations are that people have been plotting and how they are associated with these Pandas commands. And so being able to bridge those two together and build a solution that makes sense is something that we've done with Lux. That being said, I think there are still a lot of opportunities to improve Lux by improving those recommendations. We don't necessarily always pick the best view that the user wants. In exploratory analysis, there are a lot of possible objectives and tasks that people are interested in, and it's very difficult to actually recommend the single best one. I don't think that's the goal of what Lux is trying to do. So always being able to tune and improve what we're doing with the recommendations is an important aspect of what we're working on.
[00:27:15] Unknown:
We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV file via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud data warehouse to SaaS applications like Salesforce, Marketo, HubSpot, and many more. Go to pythonpodcast.com/census today to get a free 14-day trial and make your life a lot easier.
Digging more into the actual workflow of using Lux and interacting with it, I know that as I was going through the documentation, it has this concept of being able to specify your intent, so that helps to scope the specifics of the visualization or tune the way that the axes might be represented or choose between a histogram versus a bar chart for determining, you know, this is the initial pass of data that I'm gonna look at, or these are the axis labels that I'm going to add to the scatterplot. I'm just wondering if you can talk through a bit more of some of the capabilities that that intent based API introduces and just the overall workflow of actually starting from, I have a data frame, to maybe digging more into interacting with Lux and having that help to drive
[00:28:46] Unknown:
specify, like, you know, hey, this is the column that I'm interested in in the data frame, and could you show me something interesting related to that? And so what a user would do is, when you print out the data frame, you get your basic visualizations. We show kind of an overview visualization. And then maybe, let's say that you're looking at the sales dataset and you're looking at all these different columns, and maybe you're interested in the column price. So what you would actually do is you would do df.intent, and then you would feed in a list of variables or values that you're interested in. So here, you would essentially attach the intent, which is maybe I'm interested in price, and I would attach that to the data frame. And then upon doing that, when you print out your data frame again, what we would actually show you is the visualization that is related to price. So maybe this would be a histogram of price values. And then we would also show you a set of recommendations that are related to price. For example, price plus sales or price plus geography.
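To make the idea of an intent a little more concrete, here is a rough pure-Python sketch of a list of columns and values being turned into a selection over the data. It is only an illustration of the concept described above, not how Lux implements it: the function names `parse_intent` and `apply_intent`, the `column=value` string form, and the sales rows are all assumptions made up for this example.

```python
# Toy illustration of an "intent": columns of interest plus optional filters.
# This is NOT Lux's implementation; all names and data are hypothetical.

def parse_intent(intent):
    """Split intent entries into columns of interest and equality filters."""
    columns, filters = [], {}
    for entry in intent:
        if "=" in entry:
            column, value = entry.split("=", 1)
            filters[column.strip()] = value.strip()
        else:
            columns.append(entry.strip())
    return columns, filters


def apply_intent(rows, intent):
    """Return the values of the intended columns for rows passing the filters."""
    columns, filters = parse_intent(intent)
    selected = [
        row for row in rows
        if all(str(row.get(col)) == val for col, val in filters.items())
    ]
    return {col: [row[col] for row in selected] for col in columns}


# Made-up sales rows, purely for demonstration.
sales = [
    {"country": "US", "price": 10.0, "units": 3},
    {"country": "US", "price": 12.5, "units": 1},
    {"country": "CA", "price": 9.0, "units": 7},
]

# "I'm interested in price, restricted to the subset where country is US."
result = apply_intent(sales, ["price", "country=US"])
```

A real tool would go on from here to choose a chart for the selected values, such as a histogram of price, and to rank related columns to recommend alongside it.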
And so now you can actually take the thing that you're interested in and all the ones that are also related to that. So this is very similar to, let's say, you're on Netflix and you're interested in some movie, and you type in that movie. But then Netflix also shows you all of these other useful, potentially relevant movies that are related to the query that you typed in. So it's very similar to that. And the idea is being able to specify these intents in a way that is high level, based on what intuitively makes sense to the user. So in terms of columns of the data or values, values meaning something like, let's say,
[00:30:42] Unknown:
country equals US. That is a subset of data, so that is also an entity that you could specify. And then once you've specified those, the system will then find the recommendations that are relevant to that and display them to you.
In terms of your own experience of working on Lux and using it for your data analysis, I'm curious how it has impacted the way that you think about approaching a problem or the specifics as to any steps that you take, or how it has changed some of the potential outcomes. Maybe you had done a project once and then you revisited it using Lux to help with the data exploration aspect, and maybe you went a different route in terms of the conclusions or some downstream analyses that you performed?
[00:31:24] Unknown:
I wish that I got to do more data science these days, but unfortunately, most of my time is spent building systems and building Lux. But a lot of the debugging that we actually do is just taking a dataset, a public dataset or a dataset that 1 of our users is using, and then doing some sort of exploratory analysis on our own and trying all these different ways of visualizing the data and asking ourselves, does this visualization actually make sense? And a lot of the time, it might not make sense. And so we work backwards to figure out, like, hey, if I was the user and I wanted to see that visualization, how could Lux meet me in the middle and show that to me? 1 of the really interesting ones these days is, we'd never really thought about this, but 1 of the tutorials that we went through had a dataset which only had 2 columns. And Lux is designed to work with data frames with a large number of columns. And then when it's just 2 columns, Lux gets kinda confused and tries to just show each 1 of them separately.
And so in this very specific narrow case where there's only 2 columns, maybe it makes sense to just show the single visualization that represents that. And so that's an example where we can take a use case or a dataset that is already there and then work our way backwards and say, hey, does this make sense? If not, how can we build Lux to better support that use case?
[00:32:59] Unknown:
Digging more into the discovery of this particular edge case and just using Lux in an analysis environment, I'm curious what your approach has been as to being able to get it in front of people who are using it in their day to day work and elicit feedback, and how that has helped to drive your direction for the project or how the particular end user personas have influenced the way that you think about the interface and feature capabilities.
[00:33:28] Unknown:
Yeah. Definitely. We've done a lot of interview studies as well as working with particular sets of users to better understand how they're using the tool. So there are these in-depth interviews with existing users, and some of the stuff that I did earlier on was giving talks where I would have a Binder instance that is open and public for everyone. So maybe I would give a talk to 15 or 20 people in the audience, and they would all be able to click on that link, and it brings them to a Jupyter Notebook where Lux is already installed and all wrapped up. This was especially useful in the earlier days when Lux was not as stable. And so being able to have that cloud environment where the users can play around with Lux and give us instant feedback was super useful because we didn't have to install it on their machines.
And so once we had that, we gave people surveys and listened to their feedback. And 1 of the things that was really useful for was, earlier on, the intent language that I mentioned. We had to do a lot of refinement based on user feedback to make that language a little bit more intuitive and provide better warning or guidance messages. Lux works by itself when you just print the df alongside all the usual Pandas commands, but the intent language is a different API or language that Lux is introducing. So there's an initial barrier to being able to compose those intents.
So a lot of the user feedback has helped with improving that language quite significantly.
[00:35:07] Unknown:
As you have been working with developing the project yourself and working through these end user studies and working with the community around it, what are some of the most overlooked or underutilized features of Lux that you think would benefit from being given more attention or that people would be able to gain more value from if they were aware of them and, you know, maybe expose them a bit more readily?
[00:35:31] Unknown:
1 of the features that I'm really excited about is this idea that in Lux, you can actually write a small function with a particular input and output, and it would create a tab of recommendations for you. And so instead of the default tabs that we're showing you, maybe correlations or histograms and temporal trends and geo features, if there's a particular recommendation where you're like, hey, that kinda makes sense, I wanna see this whenever I'm printing out a data frame, you can actually write your own user defined function where you can use the intent language that we provide to iterate through a large collection of visualizations and display some resulting visualization based on that. So 1 of the examples that we gave in the tutorial was, let's say you're looking at a world countries dataset, and you're interested in differences between G10 countries and non-G10 countries. So you will always wanna look at these bar charts of whether a country is G10 or not G10. So you could write a Python function to make that into a custom recommendation.
And whatever you do to the dataset, you could filter it, you could drop your null values, you could do anything that you want, we would still be able to show you how that recommendation changes for your data as you're working with it. So that's a feature that I'm excited about because a lot of our users tell us that, like, hey, Lux is great for these basic analyses, but I also want x, y, z. And then my solution was that, hey, I exposed this user defined function capability for the user. But unfortunately, based on my understanding, that feature is not being used as frequently as it could potentially be. And I think part of that is really around improving that intent language and improving how easy it is to write that function and the constructs around that. So that's an example where we think that it's useful and we hear that the users really want something like this, but the solution that we have currently is probably not exactly what the user wants.
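The shape of that idea, a small user-written function that becomes a named tab and is recomputed as the data changes, can be sketched in plain Python. This is a toy under stated assumptions and not Lux's real registration API: the `register` decorator, the `recommend` helper, the `G10_SAMPLE` set (an illustrative subset, not the full membership), and the rows with their made-up values are all hypothetical.

```python
# Toy registry of user-defined recommendations; hypothetical, not Lux's API.
from collections import Counter

RECOMMENDATIONS = {}

# Illustrative subset only, not the complete list of member countries.
G10_SAMPLE = {"Canada", "Japan", "France"}


def register(name):
    """Decorator that files a function under a tab name."""
    def wrap(func):
        RECOMMENDATIONS[name] = func
        return func
    return wrap


@register("G10")
def g10_counts(rows):
    """Bar-chart data: how many rows are G10 versus non-G10 countries."""
    return dict(Counter(
        "G10" if row["country"] in G10_SAMPLE else "non-G10" for row in rows
    ))


def recommend(rows):
    """Recompute every registered tab for the current state of the data."""
    return {name: func(rows) for name, func in RECOMMENDATIONS.items()}


# Made-up rows; the "value" column exists only to have something to drop.
countries = [
    {"country": "Canada", "value": 2.1},
    {"country": "Japan", "value": 4.2},
    {"country": "Brazil", "value": None},
    {"country": "Kenya", "value": 0.1},
]

before = recommend(countries)
# Dropping rows with missing values changes the data, and the tab follows.
after = recommend([row for row in countries if row["value"] is not None])
```

Because the function only takes the current rows as input, the same tab can be rebuilt after any filtering or cleaning step without the user asking for it again.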
[00:37:44] Unknown:
And in terms of the actual ways that Lux is being used, I'm wondering what are some of the most interesting or unexpected or innovative ways that you've seen it employed?
[00:37:53] Unknown:
I think 1 of the use cases that I wasn't expecting was that several users that we were working with were working with survey data, and survey data are well known to be very categorical. Maybe you're doing a survey and there's 100 questions, plus you have your demographics and all of this. And that was 1 particular example. We didn't design Lux to work well with survey data, but it turns out that a lot of our users used it to print out the data frame on survey data just because it's so multifaceted.
There are so many different columns to look at, as well as the values. And a lot of times, people are interested in, like, hey, for question 21 in my survey, what is the frequency of counts in each 1 of the answers? And that's exactly 1 of the basic recommendations that we show to the users. And so that's an example of these very wide data frames with low to medium cardinality, so maybe you only have 4 or 5 distinct values. That is an example where Lux works really well with that particular data. So that was a surprising finding after talking to our users a little bit more, because the nature of the data happens to work really well with the recommendation types that we've designed.
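The kind of summary described here, frequency counts for every low-cardinality column in a wide table, is easy to sketch in plain Python. This is an illustration of the general technique rather than Lux's code; the function name `low_cardinality_counts`, the `max_distinct` threshold, and the survey rows are assumptions invented for the example.

```python
# Count answers for each column that has only a few distinct values.
# Illustrative sketch; names, threshold, and data are hypothetical.
from collections import Counter


def low_cardinality_counts(rows, max_distinct=5):
    """Return value counts for columns with at most `max_distinct` values."""
    columns = rows[0].keys() if rows else []
    counts = {}
    for column in columns:
        values = [row[column] for row in rows]
        if len(set(values)) <= max_distinct:
            counts[column] = dict(Counter(values))
    return counts


# Made-up survey responses.
survey = [
    {"respondent": 1, "q21": "agree", "q22": "yes"},
    {"respondent": 2, "q21": "agree", "q22": "no"},
    {"respondent": 3, "q21": "neutral", "q22": "yes"},
    {"respondent": 4, "q21": "disagree", "q22": "yes"},
]

# The respondent ID has 4 distinct values, so it is skipped at this threshold.
summary = low_cardinality_counts(survey, max_distinct=3)
```

Each entry in `summary` is exactly the data a bar chart of answer frequencies would need, which is why wide categorical tables suit this style of recommendation.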
[00:39:18] Unknown:
In your own experience of working on Lux and building it out and helping to grow the community and evolve the project, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:39:29] Unknown:
I definitely learned a lot about the Jupyter notebook ecosystem and how to structure projects around Jupyter, and just best practices around Git, Python, and building open source tools. It's not something that I've done a whole lot in the past. I've worked on projects, but I've never done it at this scale where I have to care about, let's say, releases or uploading my Python wheels to PyPI or packages to npm. And so that's all kind of new to me, and we learned through the process of multiple releases. I think the initial versions that we released were probably not the best ones. People had trouble installing them, and over time, there were a lot of reports like, hey, I can't build this on my machine, especially around Jupyter widgets and getting it to work with conda versus pip and all of this. And so being able to understand what the best practices are and what is a good standard for an open source project was a learning experience for the entire team.
[00:40:39] Unknown:
For people who are looking to add visualization to their data analysis and data exploration workflow, what are the cases where Lux is the wrong choice and they might be better suited to go to 1 of the more underlying libraries or go a different approach?
[00:40:55] Unknown:
Yeah. That's a really good point. I think for a lot of presentational use cases, for example, maybe you're a journalist and you know exactly the beautiful visualization that you want to plot for the New York Times, and there's all these different hovering and interactions and all these capabilities that you could add. That is a use case that Lux does not support, and I don't think it would ever get to the point of supporting a use case like that. Lux is really intended for exploration and for these quick and dirty ways to experiment with your data. The visualization itself is not really the final goal. It's just a way of getting to these insights.
So that's a use case where it doesn't make sense.
[00:41:40] Unknown:
And as you continue to work on Lux and evolve it, what are some of the things that you have planned for the near to medium term, and what are some of the ways that listeners can help to contribute to the project and help drive it forward?
[00:41:51] Unknown:
So Lux is kind of at this point where, you know, most of the basic features are out there, and we're looking in kind of 2 directions. 1 is improving the recommendations as well as the visualization encodings that we're showing. So that is the part that I talked about earlier, which is really working with users and trying to understand, like, hey, does this recommendation make sense to you? Are there better things that we can show you? Does this encoding even make sense? And that's a really challenging problem because even if you give 2 users the exact same dataset and the exact same series of steps that they've taken to get there, they might have 2 very different goals. And so it's impossible for the recommendation to get at anything perfect. The only thing that the recommendation could do is, show you as many things as possible so that you can kind of filter and very easily switch to these different alternatives.
So that's 1 direction that we're working towards. The other direction, and we briefly touched on this, is the idea of scaling up Lux to larger and larger datasets. So right now, Lux incurs an overhead on top of pandas, and the goal is to minimize that and say, can we print the data frame just as fast as we do in pandas, but also be able to get these rich recommendations, making sure that the recommendations don't slow down the user's workflow in any way. And we recently wrote a paper around this, developing several interesting optimizations to make this process faster. And 1 of the ones that I'm really excited about is the idea of being able to stream in the visualizations as they're being computed.
So this is something that we're currently working on. Obviously, the recommendations can take a really long time to compute. And if you could show the users, let's say, the pandas table itself and then maybe the first few tabs, maybe that's good enough to keep them occupied for the first 10 or 20 seconds, and that buys us some time to compute things in the background. And so a lot of optimizations like this could dramatically improve the interactive performance of Lux and make it faster for the users to get to something that they can work with. That's 1 particular optimization that I'm really excited about.
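As a loose sketch of that streaming idea, the snippet below shows the table right away and then hands over each tab as soon as its background computation finishes, using Python's standard `concurrent.futures` module. It is an assumption-laden illustration, not the optimization from the paper: `compute_tab`, `stream_display`, the tab names, and the artificial delays are all invented here.

```python
# Show the table immediately, then stream tabs in as they finish computing.
# Hypothetical sketch; the sleep stands in for an expensive recommendation.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def compute_tab(name, delay):
    """Pretend to compute one tab of recommendations."""
    time.sleep(delay)
    return name, f"{name} charts ready"


def stream_display(table, tabs):
    """Yield the table first, then each tab as soon as it is available."""
    with ThreadPoolExecutor() as pool:
        # Start every tab in the background before showing anything.
        futures = [pool.submit(compute_tab, name, delay) for name, delay in tabs]
        yield ("table", table)
        # as_completed hands back results in finishing order, not submit order.
        for future in as_completed(futures):
            yield future.result()


events = list(stream_display(
    "rows 0-4",
    [("Correlation", 0.05), ("Temporal", 0.02), ("Geographic", 0.01)],
))
```

The consumer of the generator can render each event as it arrives, so the user sees something useful before the slowest tab is done.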
[00:44:13] Unknown:
And it's definitely interesting how speed can so drastically impact the overall user experience of a project or a tool and how that can drive, you know, whether somebody wants to use it at all or if they only use it for special cases. And I think that, yeah, to your point of, you know, we want to be able to lazily evaluate and optimistically display these visualizations rather than block everything on computing everything all at once. It's definitely a good approach. So 1 last thing I'll ask is you had mentioned that you came into this project with the overarching goal of improving data analysis and data science for domain data scientists and, you know, identified Lux as 1 of the key components to that. I'm wondering if there are any other types of tools that you have that you would like to be able to start or that you'd like to see other people build to help expand the overall capabilities and accelerate people's ability to do these types of analyses?
[00:45:05] Unknown:
Yeah. We're actually already seeing that there are more and more of these tools for being able to work with your data at different scales, but using the same native environment. So for example, 1 of the goals for Lux is that you could do all your data processing with Python. You don't have to change your lines of code. We would essentially be able to visualize your data for you without changing too much around what the data scientist is doing. They don't have to learn a new language, and they don't have to learn new APIs. And part of the goal of thinking about data science as a whole and in the future is thinking about different scales of data. Maybe you have data with 10,000 rows as well as data with 100,000 rows or 1,000,000 rows.
Can we use the same framework to be able to visualize your data with pandas and matplotlib and other systems? So taking the existing kernel of the tools that people are using and thinking about, like, what if we increase the scale of the data, or what if we increase the number of people who are collaborating on the same project? These are trends that we're seeing more and more in data science teams and how data scientists work in industry. And so if you take those different changes in the ecosystem, can we still have the same kernel and the API that everyone loves and uses? I think that's a grand challenge, and we'll be seeing more and more of these tools in the future.
[00:46:47] Unknown:
I know that, as you mentioned, you're working at the RISE Lab. I'm curious if you have any particular thoughts on the overall impact of tools such as Ray on the data science ecosystem and being able to expand the existing APIs into more of these scale out use cases?
[00:47:02] Unknown:
Yeah. I think Ray is definitely a great example along the direction of what I just mentioned, which is being able to run your Python computation with 10 rows of data or 10,000,000 rows of data. 1 of my colleagues, Devin Petersohn, is working on an interesting project called Modin, which is a Pandas data frame API that runs on top of Ray and Dask. Dask is another parallel computing library similar to Ray. The idea there is also that you simply have to do something like import modin.pandas as pd, I think, and then you're able to use the same exact Pandas code that you're using to run on small data as well as large data. And it automatically, underneath the hood, does all the distributed computing work for you, which is great because a lot of data scientists are focused on insights. They don't wanna write visualization code. They don't wanna write distributed computing code. And so being able to abstract that away and have a common API is, I think, again, the aim of some of the projects that we're also working on in the RISE Lab and beyond.
[00:48:11] Unknown:
For those scale out environments where you're running on top of Dask or Ray or Modin, does Lux also function across those distributed systems? Or does that bring in some additional challenges for being able to compute the visualizations across the sharded and parallelized data frames?
[00:48:27] Unknown:
I think it's definitely a direction that we're exploring along the scalability direction. And Lux itself is fortunately a very parallelizable application, just because the recommendations that we're computing are more or less embarrassingly parallel. They're kind of orthogonal to each other. There are very logical units that they can be separated into. So that's definitely a direction in the future that we are thinking about in terms of making these visualization recommendations a little bit more parallelizable and faster to compute.
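To illustrate why independent recommendations are "embarrassingly parallel," here is a small sketch in which each column pair is scored on its own and the pairs are farmed out to a worker pool. This is a generic example of the pattern, not Lux's scheduler: the `covariance` scoring function, the table, and the use of a thread pool are assumptions chosen to keep the example self-contained (CPU-heavy work would more likely use processes or a distributed framework).

```python
# Each column pair is scored independently, so the work splits cleanly.
# Illustrative only; the data and the choice of score are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def covariance(pair, table):
    """Population covariance of two columns; depends on no other pair."""
    x_name, y_name = pair
    xs, ys = table[x_name], table[y_name]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)
    return pair, cov


# Made-up columns.
table = {
    "price": [1.0, 2.0, 3.0],
    "units": [2.0, 4.0, 6.0],
    "rating": [3.0, 2.0, 1.0],
}
pairs = list(combinations(table, 2))

# Tasks share no state, so they can run in any order on any worker.
with ThreadPoolExecutor() as pool:
    scores = dict(pool.map(lambda pair: covariance(pair, table), pairs))

# The sequential version produces the same answer, just one pair at a time.
sequential = dict(covariance(pair, table) for pair in pairs)
```

Since no task reads another task's result, the pool can be swapped for a larger or distributed one without changing `covariance` at all.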
[00:48:59] Unknown:
Alright. Well, for anybody who wants to follow along with the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And so with that, I'll move this into the picks. And this week, I'm going to choose the Pirates of the Caribbean movies. I started revisiting those recently with my kids, so that's been entertaining. They're just good, fun adventure movies to watch for something to keep you occupied for a couple hours. So I definitely recommend that if you're looking to keep yourself entertained over the weekend. And so with that, I'll pass it to you, Doris. Do you have any picks this week?
Cool. I think I briefly brought this up earlier on when you asked me the question about
[00:49:33] Unknown:
how did I get started in Python. And so there's this amazing book out there called Snake Wrangling for Kids. It's a great beginner's guide to Python, and if you want a weekend activity with your kids learning Python, that's also a great resource. So definitely check it out, Snake Wrangling for Kids.
[00:49:51] Unknown:
Alright. I'll definitely have to take a look at that. So thank you again for taking the time today to join me and share the work that you're doing on Lux. It's definitely a very interesting project, and it's something that is obviously valuable for people who are trying to do data analysis, to reduce the barriers to entry for being able to get visualizations and help them with their exploratory tasks for understanding the underlying datasets. So I appreciate all the time and energy you've put into that, and I hope you enjoy the rest of your day.
Thank you, Tobias, for inviting me to the podcast. I really enjoyed our conversation.
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host at podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Doris Lee and Lux
Motivation and Goals Behind Lux
Lux in Data Science Workflow
Comparison with Other Visualization Tools
Lux's Architecture and Design
Visualization Recommendations in Lux
Intent-Based API in Lux
User Feedback and Community Involvement
Underutilized Features of Lux
Lessons Learned and Future Directions
Scaling Lux for Larger Datasets
Impact of Tools Like Ray on Data Science