Summary
Building a machine learning model is a process that requires a lot of iteration and trial and error. For certain classes of problems, a large portion of the searching and tuning can be automated, which allows data scientists to focus their time on more complex or valuable projects and opens the door for non-specialists to experiment with machine learning. Frustrated with some of the awkward or difficult-to-use tools for AutoML, Angela Lin and Jeremy Shih helped to create the EvalML framework. In this episode they share the use cases for automated machine learning, how they have designed the EvalML project to be approachable, and how you can use it for building and training your own models.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Angela Lin and Jeremy Shih about EvalML, an AutoML library which builds, optimizes, and evaluates machine learning pipelines
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what EvalML is and the story behind it?
- What do we mean by the term AutoML?
- What are the kinds of problems that are best suited to applications of automated ML?
- What does the landscape for AutoML tools look like?
- What was missing in the available offerings that motivated you and your team to create EvalML?
- Who is the target audience for EvalML?
- How is the EvalML project implemented?
- How has the project changed or evolved since you first began working on it?
- What is the workflow for building a model with EvalML?
- Can you describe the preprocessing steps that are necessary and the input formats that it is expecting?
- What are the supported algorithms/model architectures?
- How does EvalML explore the search space for an optimal model?
- What decision functions does it employ to determine an appropriate stopping point?
- What is involved in operationalizing an AutoML pipeline?
- What are some challenges or edge cases that you see users of EvalML run into?
- What are the most interesting, innovative, or unexpected ways that you have seen EvalML used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on EvalML?
- When is EvalML the wrong choice?
- When is auto ML the wrong approach?
- What do you have planned for the future of EvalML?
Keep In Touch
- Angela
- angela97lin on GitHub
- Jeremy
- jeremyliweishih on GitHub
Picks
- Tobias
- Angela
- Sarma mediterranean restaurant
- Jeremy
- Crucial Conversations by Kerry Patterson, Joseph Grenny, Ron McMillan, and Al Switzler (affiliate link)
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- EvalML
- FeatureLabs
- Alteryx
- Scheme
- NetLogo
- Flask
- AutoML
- Woodwork
- FeatureTools
- Compose
- Random Forest
- XGBoost
- Prophet
- Greykite
- Shap
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Angela Lin and Jeremy Shih about EvalML, an AutoML library which builds, optimizes, and evaluates machine learning pipelines. So, Angela, can you start by introducing yourself?
[00:01:10] Unknown:
Sure thing. Hi. I'm Angela. I graduated from my undergrad at MIT around 2 years ago, and I joined the team working on EvalML at Feature Labs as an intern. And then after Feature Labs was acquired by Alteryx, I joined them full time as a software engineer, and I've been there since. And Jeremy, how about yourself? Hi, everyone. I'm Jeremy.
[00:01:33] Unknown:
I pretty much have the same timeline as Angela, but I finished my master's at Tufts University in 2019 as well, and right after, I joined Feature Labs, and I was basically there for the inception of the EvalML project, and I've been working on it since, and have since joined Alteryx as well. And going back to you, Angela, do you remember how you first got introduced to Python?
[00:01:52] Unknown:
Yeah. I was very lucky in that I attended a high school that had a pretty, like, well thought out CS curriculum. So we started the 1st semester with more functional programming, so Scheme and NetLogo. And then the 2nd semester was in Python, where we were able to just, like, code up small, simple algorithms,
[00:02:13] Unknown:
small, like, Flask web apps, that kind of stuff. And Jeremy, do you remember how you first got introduced to Python? Yeah. So I started a little later, probably like sophomore year of college where I took, like, this web engineering course, and that's when we were introduced to Django and Flask. So I kinda got my start there, and then as I progressed down more of the machine learning track at Tufts, I was, like, naturally introduced to more and more Python things. Can you start out a bit by describing
[00:02:39] Unknown:
what the EvalML project is and some of the story behind how it got started and what the reason was for creating it in the first place? Sure. I can talk a little about that. So EvalML, as you mentioned, is an automated machine learning library.
[00:02:53] Unknown:
And so maybe to talk a little bit about what that means, we can talk about what the workflow for machine learning usually is. I think, traditionally, with machine learning, it refers to, like, being able to select data and then models and then the model parameters, and that's quite difficult to do by hand. I think, usually, it requires a lot of brute force or specific knowledge, whether about data science or also just the field. It's what I think my manager likes to refer to as "trial by grad student." Basically, just a lot of, like, brute force trial and error. And so AutoML and EvalML aim to simplify that process.
So the user, instead, only has to select the data that they care about that describes the system of interest, and then AutoML is supposed to help you select both the type of model and also the parameters for your model. And a little bit about, like, how that came to be. So I can speak personally, at least. Like, I came from a background where I had used machine learning in some of my courses, but I wasn't very familiar with it. And in the courses where I used machine learning, it was exactly what I described, where I had to use specific models that maybe my professors had talked about or we went over in class. I didn't really understand how to tune the models or exactly how to, like, get better performance except try different parameters.
So there was a lot of frustration on my part where I was just trying a lot of different things and, like, seeing the scores go up and down and not really, like, understanding how to approach it well and spending a lot of time on that. So I think because we saw a lot of frustrations out there, not just in modeling, but in the data science process in general, EvalML was created through that process.
[00:04:48] Unknown:
Yeah. Like Angela said, I think, basically, we really wanted to work more on AutoML just to cover, like, the pain points of data scientists out there. Me personally, I worked 1 summer as a data science intern, and I just kind of got to see that, like Angela said, it was like a very iterative process where there was a lot of trying new things and going back and trying more things, so you kind of really needed some infrastructure around you to kind of facilitate that process, and not many companies out there actually have that kind of infrastructure or can invest in that kind of infrastructure. So when I was talking to Max, Max Kanter, who was a co-founder at Feature Labs, that was something that really inspired me to join, so that's kinda, like, why we started the project as well. To mention Feature Labs, in house, what we already had was Feature Tools, which is a library
[00:05:34] Unknown:
used for automated feature engineering, but that's only 1 step of the pipeline, right? And since then, because we want to build, like, quality tools for data scientists, we've kind of expanded out our repertoire to include different steps of the pipeline. So now we have, in the open source world, Compose, which helps with automatic data labeling; Woodwork, which helps with data typing; Feature Tools, which is for automated feature engineering; and now EvalML, which helps with the last step of that process, automated machine learning.
[00:06:10] Unknown:
In terms of AutoML, my understanding is that it's basically just enumerating and exploring a particular search space for the machine learning outcome that you're trying to solve for, determining the appropriate set of inputs and models and the tuning parameters thereof, and then giving you the output of this is the best result. This is the model that was used. Here are a couple of other options. Wondering if you can just describe some of the process of determining that search space and how you're able to constrain it so that you don't end up wasting a lot of time and compute resources going down potential dead ends or sort of determining early in the cycle which paths might end up being suboptimal for the problem at hand. So I think I'd kind of like to
[00:06:57] Unknown:
describe the whole AutoML process kind of through the lens of a data scientist. So at the start of any problem, a data scientist would probably, let's just say for like a customer churn problem, try to formulate a hypothesis and figure out what kind of data they want. So at the beginning, there's this problem formulation phase, and next you kind of build features or grab data to try to see if the data that you grab can be converted into good models. And the model search there is kind of, I guess, on a higher level, what you described about how AutoML works, right? So maybe a traditional data scientist has good domain knowledge on a customer churn problem, and they kind of know what model is out there or what kind of data out there is good for something like that. So maybe for some specific models, they can take like a Random Forest model or like an XGBoost model and kind of build out the grid search of possible parameters for these models and kind of just work off that. So there it's kind of like an iterative process, and for us, we try to do the same. At the very beginning of our project, we kind of just had a very simple AutoML algorithm, and it was kind of exactly as you described, right? Given these data and the potential models out there, kind of iterate through these models, and for each of these models, just try to figure out which hyperparameters or what configurations for these models are the best for this specific problem.
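To make concrete what that hand-rolled loop looks like before it is automated, here is a rough, generic sketch in plain scikit-learn (not EvalML code); X_train and y_train are placeholder variables:

```python
# A hand-rolled version of the "trial by grad student" loop that AutoML automates:
# try a few model families, grid search each one's hyperparameters, keep the best.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

candidates = [
    (RandomForestClassifier(), {"n_estimators": [100, 300], "max_depth": [None, 10]}),
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
]

best_score, best_model = float("-inf"), None
for estimator, param_grid in candidates:
    search = GridSearchCV(estimator, param_grid, scoring="f1", cv=5)
    search.fit(X_train, y_train)  # placeholder training data
    if search.best_score_ > best_score:
        best_score, best_model = search.best_score_, search.best_estimator_

print(best_model, best_score)
```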
Recently, we've been doing a lot more development on the AutoML algorithm, as we call it, and that's something I've been working on. And basically, it's just adding more and more heuristics or more automation into the AutoML process. So an example of that would be to try to see if adding on feature engineering or feature selection is, like, a possible avenue for us to optimize on, and if it's not, we can move forward. What else? Like, try to see at what steps we should do ensembling, and stuff like that. But like I said, like, at the very beginning, our AutoML algorithm was fairly simple, but we're still in the works of trying to improve that. In terms of the
[00:09:04] Unknown:
audience for the EvalML project, who is sort of your target persona for the EvalML toolkit and the types of problems that it is well suited for being applied to? I would say, at the end of the day, I would want
[00:09:19] Unknown:
anyone who has, like, any knowledge in Python and wants to use machine learning but maybe doesn't have as much knowledge about machine learning; that's who I want the persona of EvalML to be. I think it's not quite there yet. Right now, we still, I think, gear more towards a data science persona simply because we're not as opinionated as we want to be about, like, the choices that the algorithm makes. So at this point, we still require the user to know at least some knowledge about tweaking specific things or understanding, like, the problem type. But at the end of the day, I think we are trying to get to a place where we are more opinionated, and we do a lot more of the automation so that, again, all the user has to do is provide us with data. We'll tell you if the data is faulty or things that you can even clean up with your data, and then you can just call our library, build the model that you need, and then go off and solve your business objective. Another interesting thing to dig into is the
[00:10:26] Unknown:
overall space of AutoML as a problem domain and some of the tools and systems that are available for people who are looking to take advantage of the capabilities of machine learning without necessarily either having all of the domain expertise or understanding of how to actually build and apply these models or who are looking to iterate faster than if they were to do all that exploration on their own. And so I'm wondering if you can just talk a bit about sort of what are the available systems and tools in the ecosystem, whether, you know, in Python or outside, and what was missing in the available offerings that motivated you to create the EvalML project and sort of what it brings to the space that is unique? I think in general, there's, like, a couple of good tools out there in the Python open source community
[00:11:19] Unknown:
for automated machine learning. A couple examples of them would be H2O AutoML, auto-sklearn, I think Databricks has their own library, Auto-WEKA is another 1, and I think in general, all these packages do CASH optimization, which is combined algorithm selection and hyperparameter optimization. At the base level, that's what we all do, and it's kind of like what you described earlier on about what AutoML is like in general, basically taking the process of getting data and then figuring out what the best models are out there and trying to figure out the best parameters for those models. It kind of felt like we could improve on the user experience for AutoML, and that's kind of what we focused on to begin with. Like Angela explained earlier on, at the very beginning, we wanted a lot of customizability or a lot of user input in our AutoML library. An example of that was adding custom objectives to EvalML, and custom objectives, on a high level, is basically taking the real objective of your business and trying to optimize your models on that instead of these more obscure loss functions or these optimization objectives like you see in machine learning. So it's kind of like transforming the machine learning objective into something that you can see have more of an impact on your business. An example of that would be, for customer churn, maybe a certain gain or loss, depending on if your customer were to come back or not. Yeah, so in general, we just wanted to focus a lot on the user side of things.
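As a rough illustration of that custom objective idea, here is a hedged sketch of a churn gain/loss objective. It assumes EvalML's BinaryClassificationObjective base class and objective_function hook as documented around this time; the class name, dollar amounts, and extra class attributes are made up for the example:

```python
import numpy as np
from evalml.objectives import BinaryClassificationObjective

class ChurnValue(BinaryClassificationObjective):
    """Hypothetical objective: average dollars gained or lost per retention decision."""
    name = "Churn Value"
    greater_is_better = True
    score_needs_proba = False
    perfect_score = 50.0            # assumed: every churner caught, no wasted offers
    is_bounded_like_percentage = False
    expected_range = [-10.0, 50.0]

    def objective_function(self, y_true, y_predicted, X=None, **kwargs):
        y_true = np.asarray(y_true)
        y_predicted = np.asarray(y_predicted)
        # +$50 for each churner we correctly target, -$10 for each retention offer
        # wasted on a customer who would have stayed anyway (made-up numbers).
        saved = ((y_predicted == 1) & (y_true == 1)).sum()
        wasted = ((y_predicted == 1) & (y_true == 0)).sum()
        return (50.0 * saved - 10.0 * wasted) / len(y_true)
```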
And now that our package has progressed a little more, like Angela said, we kind of are trying to focus more on giving more opinions, or having more of an opinionated approach to AutoML. We can also, not move necessarily, but, like, support both data scientists that can take advantage of all these things that we allow them to configure, but also give it to users who have less domain knowledge or less machine learning expertise, and also figure out their problems as well. I think for me, for sure, the first thing that comes to mind is being very user centric, and I think that's pretty evident in that
[00:13:26] Unknown:
for our landing page or for when someone wants to use EvalML, there's 1 method that they really need to think about, which is our search method, and that runs the AutoML process for you. There and done. So I think that makes it a very, like, easy access for beginner users because there's just 1 method that you really need to focus on. But then if you are a more expert user, then, sure, you can dive more into the nitty gritty and change things up as you will. I think another important thing is that we're 1 of the open source libraries that combines other popular libraries via a unified API. So Jeremy had mentioned auto-sklearn and some other ones out there. We use models from popular libraries, such as scikit-learn or LightGBM, CatBoost, XGBoost, and we provide all those models under our API, which I think is pretty useful if people just wanna try out different things and not have to learn how each of those libraries work individually because they do tend to have, like, slightly different APIs.
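For readers following along, a minimal sketch of that single entry point, assuming the AutoMLSearch API as documented around the time of this episode (the demo dataset and objective names are just illustrative):

```python
import evalml
from evalml.automl import AutoMLSearch

# Any tabular X / y will do; the bundled demo dataset keeps the sketch self-contained.
X, y = evalml.demos.load_breast_cancer()
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(
    X, y, problem_type="binary"
)

automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type="binary")
automl.search()                 # the one method most users need to think about

print(automl.rankings.head())   # leaderboard of the pipelines that were tried
best = automl.best_pipeline     # typically already fitted on the training data
print(best.score(X_test, y_test, objectives=["f1", "log loss binary"]))
```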
Something else that I was able to work on a while ago, kind of expanding beyond the AutoML model building process, is something that we call data checks, and data checks are basically heuristics that you can apply onto your data to tell you if there's something wrong or something that potentially would cause a model to not perform as well. So some simple examples might be if you have a lot of NaN values in your data. Usually, a lot of ML models out there, they don't really know what to do with NaN values or they'll just error out. So we built a whole collection of, and we're continuing to build more and more, data checks, which check for different types of errors or potential issues with your problems, such as NaN values, which is probably the 1 that comes to mind right now.
But in building the data checks, we hope that we're able to give users clear feedback about how to update their problem configuration and their data so that they get
[00:15:39] Unknown:
better model performance at the end of the day. Digging into the project itself, can you talk through some of the software design that has gone into it and some of the sort of libraries that you're able to use from the ecosystem and just some of the overall design considerations that have gone into how you've built the project and how you have aimed to make it accessible to end users?
[00:16:02] Unknown:
So I think the way that our project is architected is probably similar to a way that a user might want to interface with it top down. So at the very top, we have what I mentioned before, which is the AutoML search object or the search method. That's a primary interface that a user would interface with if all they cared about was creating or using AutoML. And so that breaks down into all of our other, like, smaller components. So we have objectives, which are what metrics the model should be optimizing for. AutoML search creates pipelines, which represent a series of operations that should be applied to the data. And each of these operations are either a data transformation or an actual ML modeling algorithm.
And then we have component graphs, which fall underneath our pipelines, and they simply encode the transformations that are applied to data. But unlike pipelines, they don't necessarily have to be a linear sequence, so that you're able to have different input pathways for different types of features, or you're able to combine different, like, pathways into 1 final algorithm, basically. And kind of going even 1 step below that, what our component graphs are comprised of are components, which are exactly what we talked about, the data transformations or the modeling algorithms. And they're the lowest building blocks of EvalML, and they just represent, like, a fundamental operation that should be applied to the data. So I guess going back to the user side of things, again, if a user just wants to come into our library and run AutoML, they can just go from the top: AutoML search, call search, done. If they care more about what pipelines are created, then they can go and create their own pipelines. They can create their own component graphs or components. And so there's, like, different levels of control that they could have depending on their expertise.
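For that pipeline level of control, a hedged sketch of assembling a pipeline directly from named components; it assumes the instantiable pipeline API (string component names plus a parameters dict) that EvalML exposed around this time, so exact constructor details may differ by version:

```python
from evalml.pipelines import BinaryClassificationPipeline

# A linear component graph built by name; parameters are keyed by component name.
pipeline = BinaryClassificationPipeline(
    component_graph=["Imputer", "One Hot Encoder", "Random Forest Classifier"],
    parameters={"Random Forest Classifier": {"n_estimators": 200}},
)
pipeline.fit(X_train, y_train)   # training data from the earlier sketch
print(pipeline.score(X_test, y_test, objectives=["auc"]))
```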
[00:18:06] Unknown:
Yeah. The sort of incremental reveal of complexity is always a useful attribute of different software projects where if all you care about is I just need to get something done fast, there's a way to do that. But then as you spend more time with it and want to, you know, pull back the covers more and gain more control, there is a way to do that without having to hack apart the project and, you know, pull out the guts to put it back together the way you want it to.
[00:18:30] Unknown:
Right. I think in some ways, that's kind of representative of how we've, like, progressed as well, where we had AutoML, and it had pipelines which were hard coded at first because we didn't have this abstraction of components really baked in. And then we realized, well, for our more complex problems, we might not always want to use the same components or, like, fundamental operations. Right? So that's how we broke it down to the level of components, and now we're able to, on our side as well, be able to combine these components in different ways for a smarter algorithm.
[00:19:06] Unknown:
In terms of the overall design and evolution of the project, you mentioned that you have added in some of these features that make it a bit more flexible. But what are some of the assumptions that you had early on about the way that the project was going to be used or the feature set that you wanted it to be able to have that have been sort of changed or updated as you worked through it and as you had other people using it and giving you feedback?
[00:19:32] Unknown:
So I think at the beginning of the project, like Angela said, a lot of what we were doing was just building out the abstractions, and I guess I kind of referred to it as, like, the platform for everything, so kind of just building out the pipelines and components. I think that's actually the very first big project we both worked on together back in October of 2019, and this was just very near the inception of the project. And still, like, EvalML is still quite a young project, just going on basically 2 years old. So like you said, we've been evolving a lot. Now, a big part of that is expanding our team. Like, when we started, there was just the 2 of us and Max who was leading us, and then eventually, we had a couple more interns join on board and more and more team members, and now we have around 10 full time team members, and basically that has just allowed us to expand our output by a lot. But I don't think, in general, we really deviated far from our original goals. I think maybe certain things came in when we felt that it was necessary to help with the AutoML process. Like, for example, with a lot of our model understanding tools, we were missing this final piece at the end of the AutoML process that allowed users to gain more from using EvalML. So we put in a lot of investment into model understanding tools, things like partial dependence, and just kind of things so that people could look at the pipelines or models that we output from AutoML and see how it relates to their data beforehand, so I think that's 1 thing that was added. Yeah, Angela, do you remember anything else that we kind of added as we went along?
[00:21:05] Unknown:
Sure. I think that's a really good call out, and to what you talked about before, Tobias, about AutoML libraries just being libraries that automate the building process of pipelines, I think we've tried to expand beyond that so that that's not all that we do. Jeremy mentioned, like, model understanding. So once you get your pipelines, like, how can you actually understand or interpret the model? There's also, like I mentioned before, the data checks. So, like, even before you build your models or run AutoML, like, can you better understand your data or, like, clean that up?
I think another big push for us that's still in beta, but something that we've been working towards is time series. And that, I think, as Jeremy mentioned, has just been, like, because we've been able to expand our team, we've been able to kind of dive into
[00:21:58] Unknown:
other realms and kind of broaden our scope a lot more. The overall space of time series is 1 that's definitely been seeing a lot of attention lately, where there's the Prophet library that's been gaining a lot of attention coming from Facebook. LinkedIn recently released Greykite. I know that Zillow has a project for being able to do some anomaly detection on time series data. And so I'm curious what you've seen as some of the interesting problems to solve in that domain and how you can apply AutoML techniques to it and some of the complexities that that brings to the problem when you are dealing with this dimension of time in the dataset.
[00:22:37] Unknown:
Well, I think it's funny that you mentioned Prophet because 1 of our team members had just merged in Prophet integration so that now you can use the Prophet library within EvalML. And, honestly, I can't say I've been too involved with the time series work. I've, like, listened to some discussions, and I think at the end of the day, like, not a lot of AutoML libraries out there have time series, because working with classification or regression problems versus time series problems can be, like, a whole different set of challenges. I think 1 of the biggest things that I think about is, like, for time series, because, well, time is important, you can't simply just do CV or, like, cross validation splits of the data, because you can't shuffle the data around without messing up the ordering.
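To make that cross validation point concrete: with time series data the folds have to respect chronological order instead of being shuffled, which is what scikit-learn's TimeSeriesSplit does (a generic illustration, not EvalML's internal splitter):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

series = np.arange(12)  # 12 ordered observations standing in for a time series
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(series):
    # every training window ends before its test window begins; nothing is shuffled
    print("train:", train_idx, "test:", test_idx)
```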
[00:23:25] Unknown:
Yeah. I guess just to add a little more, like, it's just a little more than just expanding EvalML to accept time series as a problem type. It's just that time series is so different from traditional, like, classification or just regression problems that, like, we were thinking about how we might potentially need to change our user interface or our user experience to deal with such problems. And another thing that Angela alluded to was how we build our features for time series problems. Right? Like, it's not as simple as just taking everything in and then just running it through a model. We kind of need to go through the process of maybe building windows or building time gaps or creating time buckets, or stuff like that. There's many, many different varieties of things you can do to time series problems, and I think it's still very much a work in progress for us to figure that out, and how we can potentially leverage Feature Tools to do some of that for us. So I think right now, that's like a big thing for us, and we're still trying to figure it out. You've mentioned a few of these other libraries and components that you have to help support the EvalML
[00:24:28] Unknown:
project in terms of the other stages of the overall process. I'm wondering if you can just give a bit of an overview about the capabilities that those tools provide and some of the benefits that you've been able to lean on of having this componentized approach to the overall end to end process of building and deploying or building and training and using these machine learning models?
[00:24:49] Unknown:
Sure. So within EvalML, I think the 2 biggest ones that we lean on and are trying to integrate with are Woodwork, which we use for our data typing. I think it builds on top of pandas, which is a very popular library for holding and storing data. So this was an iterative process where, originally, we just worked with pandas and NumPy, but we realized over time that that wasn't enough, because with pandas, you can store a lot of different types of data under the same, like, physical data type, but it was very difficult for us to understand, like, how to parse that data and use it for modeling. So to give you an example, like, let's say you might have string data, and in some cases, that string data might refer to a categorical type or, like, different categories.
But in other cases, it might be, like, natural language, just text. And so how can you get a model to understand how exactly to use that data? I think we ran into limitations with pandas there, and we decided, well, if we build our own library, which enables us to differentiate that through what are called logical types, which refer to how the data should be used, then we're able to use that in EvalML and handle those cases separately, even if they might both be string physical types at the end of the day. So that's 1 library that we integrate closely with, and the other 1 is Feature Tools, which we use for automated feature engineering. Jeremy, I don't know if you wanna talk about your ongoing work with that. So in the past,
[00:26:28] Unknown:
I guess, like, our integration with feature tools more so, we recommended users to run feature engineering as a step before running AutoML, and I guess that kind of required users to have more expertise or knowledge on the whole feature engineering process in general. But a current project of mine right now, it's kind of related to what I talked about earlier about the AutoML algorithm, is trying to add feature engineering as a step into our AutoML algorithm, and also utilize or leverage more feature selection to build better models.
So I guess that's 1 thing I've been working on recently, and it's still TBD when we're gonna be done with that, but that is, like, my main focus right now. And I guess to add a little more, like, these are all Alteryx open source tools. So the only other 1 that Angela mentioned was Compose. Right? So that's, like, even earlier in the, I guess, like, the data science life cycle or pipeline, where label creation is at the very beginning. So I think these are the couple tools that Alteryx open source has, but within EvalML, we still try to integrate as many useful machine learning libraries out there as we can, and, like, a big 1 that we integrate closely with is scikit-learn, and not only do we draw upon their capabilities for some of the machine learning algorithms, but we try to actively support, kind of, like, the scikit-learn API, and that kind of makes it easier for new users to come to EvalML and use our components or our pipelines and stuff like that, so I think that was a main focus of ours at the very beginning of the project. I guess some other libraries include CatBoost or XGBoost or more specific
[00:28:02] Unknown:
machine learning algorithms out there. Going back to the overall search space problem and finding the optimal stopping point in particular, I'm wondering what you have found to be some of the useful heuristics for understanding when you have either reached a sort of local maximum for the problem, and this is a good point to stop the search and say this is the answer,
[00:28:26] Unknown:
or knowing when to sort of prune certain branches of the overall search space to say early on, this is not going to be a fruitful pursuit. We're going to, you know, drop this model or this set of parameters because it is leading to a local minimum. So I guess to begin with, like, what we have within EvalML isn't that sophisticated yet, so a lot of what we have kind of draws upon user configuration to end our search. So an example of that would be a certain amount of time has passed, or if a user expects a certain amount of batches out of our AutoML search. I think that was what we started with, so just basically, yeah, user defined endpoints.
We've kind of expanded on that by adding early stopping. I guess this is, like, fairly simple early stopping, where if the scores stop improving to a certain degree, we'll end the search. So we have that early stopping parameter and, like, an associated tolerance parameter that helps configure that. So I think that's what we have right now. And I guess, like, more of what you alluded to is, like, what we want to move into, and I guess that will come with the iterations of our AutoML algorithm. So right now I'm working on this algorithm that takes in, I guess, like, more machine learning heuristics, but is less about stopping.
But as we move forward, I'm sure that's something that we'll try to figure out eventually.
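The user-configured endpoints and simple early stopping described here map onto AutoMLSearch constructor arguments. A hedged sketch, with parameter names (max_time, max_batches, patience, tolerance) taken from the EvalML docs of this period; they may have shifted in later releases:

```python
from evalml.automl import AutoMLSearch

automl = AutoMLSearch(
    X_train=X_train, y_train=y_train, problem_type="binary",  # data as before
    max_time=600,     # stop after roughly 10 minutes of searching
    max_batches=5,    # or after 5 batches of pipelines, whichever comes first
    patience=3,       # simple early stopping: 3 iterations without improvement
    tolerance=0.01,   # where "improvement" means at least a 1% better score
)
automl.search()
```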
[00:29:45] Unknown:
In terms of the problem of dealing with AutoML and working with all of the different types of data that people are going to bring to the problem, what are some of the challenges or edge cases that you have run into yourself and that you have seen users of the library run up against and some of the ways that you either try to smooth over those edges or provide sort of early warning signs to the users that they're about to, you know, end up in a place that they don't want to be? Yeah. I think that's exactly what data checks tries to warn users about.
[00:30:20] Unknown:
And so, basically, with data checks, again, the user provides the data to the data checks, and we have a whole bunch of different heuristics about, like, you have too many NaN values, or your distribution of data. Like, you might have class imbalance, and that might not perform well. I think that's kind of how we try to have users be aware of issues before they run AutoML search. We have also considered and talked about creating data check actions. So first, sure, you might have these data checks that warn you or tell you of, like, errors that might happen, but taking that a step further and saying, well, how can we, as an AutoML library, automatically update and transform your data so that those issues that you ran into are no longer issues?
So, for example, with the simplest 1, like, always being NaN, having NaN values, how can we automatically impute your features such that
[00:31:22] Unknown:
to the end algorithm, you don't see those NaN values and you don't run into errors. To add a little onto that, even with something like NaN, there's, like, all sorts of different types of NaNs out there, and we always have to try to figure out, like, just NaN in general. I think within the Python ecosystem, pandas uses their own NaN, there's a NumPy NaN, so that's something that came up, I guess, like, as edge cases for us, and, like, even more so than that, we needed to figure out, like, how to handle NaNs for all these different data types as well, right? Like, let's just say even for string data types, we had to figure out how should we handle NaNs for what we call natural language or text data, or how should we handle NaNs for categorical data, and there's some intricacies on not only how we should handle them so that models run, but so that models run better as well. So I think for us, it's definitely an iterative process where users come to us with, like, situations that we haven't seen before, and we're always constantly trying to improve in that regard. Yeah. I think we see the
[00:32:25] Unknown:
issues that are in the datasets that users give us, and then we take that back to the drawing board and we say, alright, we see that this user ran into this error because their data was malformed in some way. How can we generalize that more so that next time users have that kind of issue, we can warn them beforehand with a data check.
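A hedged sketch of running those checks by hand before searching, assuming the DefaultDataChecks helper and its validate() method as documented around this time (the exact shape of the returned warnings has changed between versions):

```python
from evalml.data_checks import DefaultDataChecks

# Assumed constructor arguments: the problem type and the objective being optimized.
checks = DefaultDataChecks(problem_type="binary", objective="log loss binary")
results = checks.validate(X_train, y_train)

# The results flag issues such as highly null columns, class imbalance, or ID-like
# columns so they can be addressed before AutoML search is run.
print(results)
```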
[00:32:48] Unknown:
In terms of the capabilities of the EvalML project, what are some of the either overlooked or underutilized features that you think are worth calling out?
[00:32:59] Unknown:
I know I talk a lot about the data checks, but, well, I guess I care a lot about the data checks, and I think that is something that I really wish was used more. We've played around with having the data checks run as part of the AutoML search process. Because right now, the way that it works is users have to go out of their way to call these data checks, and we provide a default set of data checks that they can use, which, like, will cover some general cases that we've seen. But it still requires a user to manually do that, and I think that's something that could help users with a lot of the issues that might come up later in the modeling process that they could have avoided if they
[00:33:48] Unknown:
had, like, basically sanity checked their data. Yeah. No. I definitely agree with that. I don't know if I can give a specific number, but I guess, like, I would say maybe, like, 80% of your, I don't know, like, data science journey or, like, data science process is done before any modeling, or before even thinking of modeling, and a lot of it is, like Angela was saying, about cleaning your data or augmenting your data such that not only does it actually work with machine learning models, but it also performs better as well. So, yeah, I definitely agree that data checks is something people should leverage more, but I guess, like, another point to call out is some of our model understanding tools, so, shout out to what we did for partial dependence earlier, and I think we're in the works of adding more model understanding tools. But I guess, like, in general, it's just kind of overlooked, because people kind of see that, like, oh, once you've got these models and they perform well, they kind of think it's the end of the process, but there's so much more out there, right? Like, you can take these models and kind of see how your earlier features related to your end result through partial dependence, or see how important these features are, and you can, I guess, drive more analysis than just pure prediction out there, right? We don't actively support image data, but an example of that would be, like, taking, like, a neural network and then seeing what specific features are associated with what, then you can go back to, like, the beginning of the process. Let's just say if you're working with breast cancer data or something like that, then you can focus on certain things without even utilizing machine learning at the end of the day. Yeah. And that goes to your point about the explainability
[00:35:18] Unknown:
problem and how that is an important aspect of the overall process and the importance that it has. And then your point of not having support for image data right now, I guess, brings up the question of sort of what are the types of datasets and the types of sort of industry or problem domains that EvalML is currently focused on supporting and some of the upcoming capabilities that you're looking to add to it? So at the end of the day, what EvalML,
[00:36:05] Unknown:
I guess, expects is tabular data. And I don't think that's limited to any specific industry per se. As long as you can give us data in a tabular form, then, hopefully, we'll be able to run EvalML with that. I will say that EvalML currently powers Alteryx Machine Learning, which is, like, an enterprise solution. And, like, with Designer and Alteryx, like, that's already being used in a lot of different industries. So, again, I don't think it should be limited in terms of, like, where it should be used. I think anywhere where
[00:36:39] Unknown:
machine learning applies, then hopefully EvalML applies as well. And in terms of the uses of the project and some of the experiences that you've had building it and working with the community, what are some of the most interesting or unexpected or innovative ways that you've seen it used? So EvalML is still fairly new, especially in the open source world. I don't think it's gotten a lot of traction yet,
[00:37:03] Unknown:
but we did come across a blog post by someone. I think his name was Mike Casales, who had wanted to use EvalML as part of the process of confirming or denying whether or not South Florida real estate prices were taking a hit from sea level rise or urban climate risks. So he asked the question of, like, is the difference in property valuation versus similar properties nearby high? And if so, can it be explained by a metric called the flood risk factor? And he actually used not only EvalML, but also Feature Tools. So he used Feature Tools to clean up some data, and then he used EvalML to generate models. And then later in the pipeline, he used SHAP to, like, better understand the model that was created by EvalML.
And so there, I think it was a pretty cool use case, especially in that, I mean, this might seem, like, counterintuitive in some sense, but EvalML got maybe, like, 1 or 2 sentences, which was, "and then I used the EvalML package to, like, generate a model," period. But I thought that was actually pretty indicative of how I imagine EvalML to be used, because most of the blog post was talking about all the different ways that he needed to clean up the data. And once he got the data into the 2 d tabular form, he ran EvalML's search method, got the model, and then he was able to do a whole lot of model understanding and, like, more insight driven work using SHAP and other products.
So the takeaway, I think, he got from doing that was that, yeah, the flood factor, flood factor info, I think it was. I don't remember the exact feature, but 1 of the features that he was looking at was 1 of the most important features that, like, popped up at the top. So it was pretty important in predicting the differences in real estate valuation.
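For readers who want that same kind of post-hoc inspection without leaving EvalML, here is a loose sketch using the model understanding helpers mentioned earlier in the conversation; the function names are assumed from the evalml.model_understanding module of this era, and the feature name is purely hypothetical:

```python
from evalml.model_understanding import (
    calculate_permutation_importance,
    partial_dependence,
)

# "best" is a fitted pipeline, e.g. the best_pipeline from the earlier AutoML sketch.
importance = calculate_permutation_importance(best, X_test, y_test, "auc")
print(importance.head())        # which inputs matter most under the chosen objective

# How the prediction moves as one (hypothetical) feature varies.
pd_results = partial_dependence(best, X_test, "flood_factor")
print(pd_results.head())
```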
[00:39:00] Unknown:
In terms of your own experience of working on the evalML project and using it for your own problems and helping the community to take advantage of its capabilities, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:39:15] Unknown:
Well, I think the biggest 1 for me is how we should serve users from all sorts of backgrounds. Like you said, we kind of targeted, I guess, people with more data science or machine learning knowledge, and tried to give as much information or as much customizability to our users as we could, right? But that definitely created a lot of confusion for people that don't necessarily have that kind of knowledge or that kind of expertise, so I guess there's 2 routes you could go with that. 1 is to try to upskill your users by providing a lot of education on that, or the latter, which is kind of what we took, was to try to give more opinionated advice on certain things. So, like, for example, for the data checks, we started out with just having data checks and basically telling users where they've gone wrong or where they could have improved, but we're trying to move to having data check actions, where, basically, we're saying, just given your data, we'll fix it up, and we'll pass that into AutoML.
And I think the blog post we talked about, like, the real estate blog post, was, like, a very good example of what we want to encompass within EvalML. As I mentioned, the guy that wrote the blog post used SHAP, and that's, like, something we've actually added to EvalML as part of our model understanding tools. So I guess it's more and more trying to figure out what our users need in the whole data science or machine learning process
[00:40:36] Unknown:
and trying to address those needs within our library as well. So I guess it's, like, always a continuous learning process for us as well. I guess for me, the challenge, the ongoing challenge, and this is maybe, like, less about EvalML and more, again, personal: I didn't come from, like, a machine learning background. It was never really a focus for me. I came in as, I guess, what the EvalML user should be, which is someone who wanted to use machine learning, but wasn't very well versed in it. So I always take, like, that perspective when building the library, and I'd like to say that, like, I offer a different, maybe, perspective from other people on the team because of that. But that also means, as someone who's helping build and maintain the library, with some of the machine learning concepts, trying to boil down something that might seem foreign to someone who has, like, no understanding of machine learning, how do you boil that down into something that they understand and therefore can use, is always a challenge that I think everyone on the team is trying to face.
[00:41:40] Unknown:
For people who are interested in exploring EvalML and the overall space of AutoML, what are some of the cases where either AutoML as a general approach or EvalML
[00:41:53] Unknown:
specifically are the wrong choice? I think it's a wrong choice when machine learning is a wrong choice. I know machine learning is, like, a buzzword, and everyone wants to use machine learning. I've definitely been in companies where they've told me, like, I don't know how I want to use machine learning, but we want to use machine learning. Here's some data. Go. And I was like, well, I don't know what problem you want me to solve. So I think that's pretty key, like, understanding exactly what problem you're trying to solve, like, what objective it is that you're trying to solve for, is very important. And another thing, I guess, is if you don't have the data that, like, correctly represents or accurately represents the system that you're trying to develop, then I'm not sure that machine learning or EvalML can help you there, because, and it's a little cliche as well, but it's, like, if your data is garbage, then it doesn't matter what the machine learning model predicts. Right? Like, that will also be garbage too, or it'll give you insights about garbage. Those are 2 things. I guess, like, on EvalML specifically, there's a lot we're trying to tackle, but there's also a lot we're not trying to tackle, especially with, like, deep learning or architecture search.
Jeremy mentioned before, like, we don't handle image, video, audio, or higher dimensional data. I don't think we have plans to right now. Maybe I'll bite my words in a few years, but
[00:43:23] Unknown:
yeah. Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose a band that I just came across on Spotify that is interesting and amusing and, you know, fun to listen to, called Gloryhammer. It's a sort of power metal band that is a throwback to the sort of metal ballads of the eighties, and each album is sort of its own epic story played out as metal.
It's just hilarious and entertaining, so definitely worth taking a look at if you're looking for something to keep you occupied for a little while. So with that, I'll pass it to you, Angela. Do you have any picks this week? My pick has to be a restaurant that I've been craving in Cambridge,
[00:44:12] Unknown:
and that's Sarma. It's a Mediterranean restaurant, and I think it's really, really good. Alright. And Jeremy, how about you? For me, I've been reading this book called
[00:44:21] Unknown:
Crucial Conversations. I know it's not as cool as your topics, but this book called Crucial Conversations by Stephen Covey, and I feel like it tries to provide a framework for people to tackle exactly what it says, like, crucial conversations, which are difficult conversations where people have different opinions or have different interests, and I think it applies to, you know, not only just, like, the workplace, but, like, you know, just individual relationships or stuff like that. So I've been enjoying the book a lot recently.
[00:44:50] Unknown:
Alright. Well, thank you both very much for taking the time today to join me and share the work that you've been doing on EvalML and some of the related tooling. It's definitely a very interesting problem and an interesting project. So I appreciate the time and energy you've both put into that, and I hope you enjoy the rest of your day. Thank you. Thank you for setting this up. Yeah. Thank you for having us, Tobias. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Message
Interview with Angela Lin and Jeremy Shih
Overview of EvalML and AutoML
Determining the Search Space in AutoML
Target Audience for EvalML
AutoML Tools and Ecosystem
Software Design and Architecture of EvalML
Evolution and Assumptions of EvalML
Challenges in Time Series Data
Supporting Tools and Libraries
Heuristics for Optimal Stopping in AutoML
Handling Different Types of Data
Underutilized Features of EvalML
Supported Data Types and Domains
Interesting Use Cases of EvalML
Lessons Learned from Building EvalML
When Not to Use AutoML or EvalML
Contact Information and Picks