Summary
Machine learning models are often inscrutable, and it can be difficult to know whether you are making progress. To improve feedback and speed up iteration cycles, Benjamin Bengfort and Rebecca Bilbro built Yellowbrick to easily generate visualizations of model performance. In this episode they explain how to use Yellowbrick in the process of building a machine learning project, how it aids in understanding how different parameters impact the outcome, and the improved understanding among teammates that it creates. They also explain how it integrates with the scikit-learn API, the difficulty of producing effective visualizations, and future plans for improvement and new features.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
- To get worry-free releases download GoCD, the open source continuous delivery server built by Thoughtworks. You can use their pipeline modeling and value stream map to build, control and monitor every step from commit to deployment in one place. And with their new Kubernetes integration it’s even easier to deploy and scale your build agents. Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Your host as usual is Tobias Macey and today I’m interviewing Rebecca Bilbro and Benjamin Bengfort about Yellowbrick, a scikit extension to use visualizations for assisting with model selection in your data science projects.
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe the use case for Yellowbrick and how the project got started?
- What is involved in visualizing scikit-learn models?
- What kinds of information do the visualizations convey?
- How do they aid in understanding what is happening in the models?
- How much direction does Yellowbrick provide in terms of knowing which visualizations will be helpful in various circumstances?
- What does the workflow look like for someone using Yellowbrick while iterating on a data science project?
- What are some of the common points of confusion that your students encounter when learning data science and how has Yellowbrick assisted in achieving understanding?
- How is Yellowbrick implemented and how has the design changed over the lifetime of the project?
- What would be required to integrate with other visualization libraries and what benefits (if any) might that provide?
- What about other ML frameworks?
- What are some of the most challenging or unexpected aspects of building and maintaining Yellowbrick?
- What are the limitations or edge cases for Yellowbrick?
- What do you have planned for the future of Yellowbrick?
- Beyond visualization, what are some of the other areas that you would like to see innovation in how data science is taught and/or conducted to make it more accessible?
Keep In Touch
Picks
- Tobias
- Rebecca
- Benjamin
Links
- Hadoop
- Natural Language Processing
- Machine Learning
- scikit-learn
- Model Selection Triple
- the machine learning workflow
- scikit-yb
- Yellowbrick
- Visualizer API
- Visual Tests
- Jupyter
- Matplotlib
- Tensorflow
- Hyperparameter
- Parallel Coordinates
- RadViz
- Rank2D
- Prediction Error Plot
- Residuals Plot
- Validation Curves
- Alpha Selection
- Frequency Distribution Plot
- Bayes Theorem
- Seaborn
- Stop Words
- N-gram
- Craig – Bias and Fairness of Algorithms
- Shiny
- Bokeh
- Keras
- StatsModels
- Tensorboard
- PyTorch
- NumPy
- Voxel
- Wizard of Oz
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to scale. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute. To get worry-free releases, download GoCD, the open source continuous delivery server built by Thoughtworks. You can use their pipeline modeling and value stream map to build, control, and monitor every step from commit to deployment in one place. And with their new Kubernetes integration, it's even easier to deploy and scale your build agents. Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons, and visit the site at podcastinit.com to subscribe to the show, sign up for the newsletter, and read the show notes. Your host as usual is Tobias Macey, and today I'm interviewing Rebecca Bilbro and Benjamin Bengfort about Yellowbrick, a scikit extension to use visualizations for assisting with model selection in your data science projects. So, Rebecca, could you start by introducing yourself?
[00:01:18] Unknown:
Hi, Tobias. I'm Rebecca. I'm a data scientist and Python programmer in Washington DC.
[00:01:25] Unknown:
And Ben, how about yourself?
[00:01:27] Unknown:
Hi guys. Yeah. I'm Ben. I'm a computer scientist. I guess I'm also sometimes called a data scientist, but certainly more software engineering focused. And I also live in Washington, DC. And Rebecca again, do you remember how you first got introduced to Python?
[00:01:40] Unknown:
Yes. So I came to Python somewhat recently, maybe 5 or 6 years ago, after I had already finished graduate school and I had a job, a post-academic job, analyzing what was pretty messy data and looking for practical tools that I could use to sort of augment the research skills that I developed while I was doing my dissertation. I learned Maple in high school and then in college, some Java and some Perl. So I found Python just a really natural way to express my thoughts, and it was just when data science was starting to happen, and I was really excited to see all of the open source Python packages that were sort of rising up to meet that work. And it became clear that something sort of special about Python is that it felt like a really good way to build software and also build experiments.
And then, kind of as I was getting into Python, you know, I felt a real affinity for the Python community too.
[00:02:37] Unknown:
And Ben, how about yourself? Do you remember how you got introduced to Python?
[00:02:40] Unknown:
I do. I've actually been a software engineer for 12 years or so. I guess a really long time. And so when I started doing software development, it was a variety of systems and scripting languages like Java and C, and I remember that 1 of my first bosses introduced me to Python as a prototyping language for some embedded C projects that we were doing. So we would program in Python, and then once we had everything sorted out and tested and designed, we would reimplement that in C. But we quickly started using it for web development and system scripting.
And as my job evolved, I started doing more cluster computing with Hadoop, and my job was focused on natural language processing. So, you know, dealing with large amounts of text. And so I found a lot of reasons to use Python with Hadoop, using Hadoop streaming and some other packages for NLP, and I was quite relieved when Spark came along and gave, like, a full-on Python interface to that sort of distributed computing. So I've been using it for a long time, but I don't know if I started in the same place that many other people did.
[00:03:52] Unknown:
And so the 2 of you have gotten involved in building and working with the Yellowbrick project. So I'm wondering if you can describe a bit about the use case for it and how the project got started.
[00:04:04] Unknown:
Absolutely. So Yellowbrick is about visual diagnostics for machine learning. And where we see it fitting in is, you know, how do you decide what's the best model or what are the right features. And there are sort of a host of mathematical tools for doing that. But, you know, in my career and teaching, I've noticed that you just can't stare at a bunch of numbers and really come to meaningful conclusions about progress. And so Yellowbrick entered as this way to see sort of more clearly how you were doing, where you were going, if you were going backwards or forwards during the machine learning workflow.
[00:04:48] Unknown:
Ben and I met because I was enrolled in a data science certificate program, and some of this sort of just came out of the experience of me being a student, and sort of talking to my teacher and trying to understand what was happening, and wanting desperately to understand as much as I could, as quickly as possible, and spin up that intuition, you know, not being satisfied to just sort of deploy scikit-learn without understanding what was really happening. And so this came out of, like, a very close use case for me.
[00:05:32] Unknown:
And so in order to be able to build these visualizations in an easy and somewhat automated fashion, you've hooked into scikit-learn and extended some of its capabilities and API. So I'm wondering if you can describe a bit about what's involved in creating these visualizations and what's actually involved in visualizing a machine learning model?
[00:05:55] Unknown:
So like you say, you know, this is really modeled on the scikit-learn API which, you know, for folks who haven't explored it very deeply, is really kind of an amazing thing. You know, it represents sort of the consolidated research and domain expertise from a whole lot of machine learning practitioners and researchers, and it's all sort of coalesced under this unified API that makes it really almost trivial to deploy and test, you know, any number of models. In order to visualize the models, you have to have some understanding of that API and know where to kind of hook in. So in scikit-learn, you know, there are estimators and transformers, and estimators have a fit method and a predict method, and transformers have a fit method and a transform method.
And so, you know, part of what you have to decide with visualization is, you know, am I trying to visualize a model, you know, a fit-and-predict kind of sequence, or am I trying to visualize a transformation? So, you know, my data was this and I performed some transformation and now it is this other thing and I'm trying to kind of visually compare. So, you know, we kind of think about the machine learning workflow in terms of something called the model selection triple, which originally comes from a databases paper that we read 5 or 6 years ago now, that's sort of, you know, trying to imagine what next generation databases will be like, ones that are sort of built to anticipate machine learning rather than just having machine learning performed on them after the fact. And so the authors sort of talk about the model selection triple, which just turned out to be a really good way that we found to describe how machine learning actually happens in practice. You've got these sort of, you know, these phases where you have a feature analysis, feature engineering phase, and then you move to a model selection, model comparison phase, and then into a hyperparameter tuning phase, and it's iterative and you kind of go through cycles.
But, you know, part of really kind of leveraging the visualizations is figuring out, you know, where to inject visualizations into that workflow.
[00:08:25] Unknown:
And, you know, I just wanted to add here too that that workflow often looks like research code. I already have these sort of Jupyter notebooks just full of all this, like, matplotlib and maybe experimental stuff. And, you know, that sort of annoyed me, I guess. It wasn't a very strong feeling, but it annoyed me as a software engineer, that you didn't have repeatable, DRY code, that you didn't have a mechanism for using these tools to actually see what you were doing. And so we really wanted to create a tool that got out of the way of the machine learning process, got out of the way of the model selection triple process, which I think is not what most visualization inside of a notebook does. It's just, like, this big block of code, and it really is. I mean, to generate a matplotlib visualization you're talking about 15 lines of code at least for a simple visualization.
And so with that in mind, we created Yellowbrick to map to scikit-learn's API. So we have this idea of a visualizer, which in itself is an estimator. So we like to think that visualizers learn from data in order to draw something, or to draw a visual interpretation of that data. And so they have the same API, they wrap other estimators, they wrap transformers, they are themselves transformers, you can stick them into pipelines. And then in the end, you know, using Yellowbrick is as simple as 3 lines of code, just the same 3 lines of code you'd use to do machine learning with scikit-learn. Right? Import the visualizer, instantiate it, and then fit it, or score it. Although we do have 1 special method called poof.
So poof is our extra method that you have to call in order to actually make the visualization happen, to finalize it with, you know, axes and setting the limits and titles and all of that sort of stuff.
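For reference, that pattern looks roughly like this; the classifier and dataset are illustrative rather than from the episode, and in releases after this recording poof() was renamed show():

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ClassificationReport

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Import the visualizer, instantiate it around an estimator, then fit/score.
viz = ClassificationReport(LogisticRegression())
viz.fit(X_train, y_train)   # fits the wrapped estimator
viz.score(X_test, y_test)   # scores it and draws the report
viz.poof()                  # finalizes axes, limits, titles and renders
```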
[00:10:17] Unknown:
And machine learning models have been typically very difficult to interpret, and to understand exactly what's happening, particularly when you get into deep learning and neural networks. So what kinds of information are you able to capture and convey in the visualizations, and how does that assist in
[00:10:35] Unknown:
creating an understanding of what's actually happening within that model? You know, it's funny you mentioned the thing about deep learning. I was reading on Twitter yesterday and Neil Conway posted a really funny tweet. It was something like, anyone who wants to write an alarmist op-ed about the dangers of AI should be forced to spend 48 hours using TensorFlow to solve some non-trivial problem. You know, I think that this work that we're doing is a lot harder than it seems. It's not just hard to explain it. It's also hard to just kind of systematically go through the process. So in the context of Yellowbrick, if we're thinking about that workflow, that model selection triple workflow, you know, in the beginning for feature analysis and feature engineering you might use something like the parallel coordinates visualizer, or RadViz, the radial visualization, or something like Rank2D. In the case of parallel coordinates and RadViz, you can use those to look for class separability in your data, to see if the data are well suited for classification. In the case of Rank2D, you know, you might be looking for pairwise relationships between features, and potentially looking for things like covariance, you know, things that may complicate the modeling process downstream.
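As a sketch of that feature-analysis step, here is Rank2D ranking pairwise feature correlations; the dataset is illustrative:

```python
from sklearn.datasets import load_diabetes
from yellowbrick.features import Rank2D

X, y = load_diabetes(return_X_y=True)

# Rank pairwise feature relationships; Pearson correlation surfaces the
# kind of covariance that can complicate modeling downstream.
viz = Rank2D(algorithm="pearson")
viz.fit(X, y)
viz.transform(X)  # feature visualizers follow the transformer API
viz.poof()
```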
For the, you know, model selection and model comparison phase, if you're doing regression you might be using a prediction error plot or a residuals plot to kind of inspect how effectively different regressors are fitting the data, and kind of where the error is happening. If you're noticing heteroskedasticity, you know, distribution errors, those are very, very easy to see quickly and intuitively with those kinds of plots. And then for the hyperparameter tuning phase, you might use something like validation curves to understand, you know, as you introduce more training data, what's happening to the trade-off between bias and variance.
Are you getting to overfit, you know, at some point, where you might use something like the alpha selection visualizer to see how well regularization is working? Maybe you're using, like, L1 regularization, L2 regularization to smooth out your data ahead of, you know, some linear model, and you're kind of wondering if that is actually working, if it is kind of helping to reduce that noise. But generally, for all of the visualizations in Yellowbrick, we're aiming to expose as much information as possible. So in the same way that scikit-learn really provides a lot of access to model attributes, you know, we're trying to do the same thing, make sure that you have access to the scores, you know, the timings for how long it took to fit or transform, you know, annotations and those kinds of things, to make sure we're conveying as much information as possible.
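Picking one of the regression plots just mentioned, a hedged sketch of the residuals plot; the estimator and data are illustrative:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from yellowbrick.regressor import ResidualsPlot

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Structure in the residuals (rather than random scatter about zero)
# points at heteroskedasticity or a poorly specified model.
viz = ResidualsPlot(Ridge())
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.poof()
```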
[00:13:37] Unknown:
And can you dig a bit more into how the visualizations will assist in gaining understanding about what's happening within the various models that people are experimenting with?
[00:13:47] Unknown:
Yeah. So this is a place that we really think that Yellowbrick shines, and that's this idea of steering, or visual steering, of the model selection process. The term steering itself comes from the HPC world, or the high performance computing world. And if you rephrase the model selection triple in terms of a search problem, right, what is the best combination of features, algorithm, data, and hyperparameters that leads to a model that's operational? Something that we can use in our systems to actually make decisions. If you phrase it that way, you see that you're in a really just massive search space. There's a lot of different options and, you know, that's actually the benefit of machine learning. Right? I think 1 of the joys of having generalized machine learning frameworks like scikit-learn available and so easily accessible by a wide variety of people is that they can combine these models and data without necessarily having a PhD or a deep understanding. They can engage in the search process to find novel and unique occurrences of good fits. And, you know, if you're doing that on purpose, that can be very daunting. Right? Where do you look? Where do you go? How do you correct your course if you're sort of heading down a wrong path? You know, should we be using a probabilistic method or a Bayesian method? Is that most well suited for this? Should we go down a completely different path and use something nonparametric like gradient boosting or random forests? And I think that's what Yellowbrick is designed to do in terms of understanding the models itself: trying to compare 2 different visualizations of the same model, tweak it experimentally, and then you can see what changes happen in the visualization. So it turns out not to be maybe an interpretation of a single visualization.
Although that's very common once you get to that good search space. Once you start honing in on it, you might start to think, okay, well, what does this mean? How do I interpret the behavior of this model in this data? Do I trust this data? What amount of risk do I take on? But in the initial stages, it's really about comparing different models, comparing these different instantiations of these triples, and looking for progress as you're making these changes and moving forward. That is certainly my experience of machine learning, and using Yellowbrick in a couple of different projects and contexts. And as I talk to more data scientists, more people who are doing machine learning on a daily basis, I think that they're really starting to buy into this idea of steering, to this idea of visual analytics, where there's a combination of sort of this human domain knowledge with this machine ability to generate models very rapidly, and how can you combine those things in a meaningful way to find the best model. And Yellowbrick is a start at that. It's certainly not a comprehensive tool for visual analytics, but it does provide that interpretability that I think is lacking from just simple reports or numeric scores. So in a way, it helps with
[00:16:52] Unknown:
going through sort of a human scale Bayesian process where you're iterating on your priors to gain direction in terms of where to go next to be able to come to an effective outcome. Yeah. That's absolutely
[00:17:06] Unknown:
a great metaphor for it. You know, what do you understand about what you were doing before, and how does that affect what happens after? And importantly, it also informs your team. Right? It's a good way of saying, you know, here's where I came from and here's where I ended up, and now you can take the baton and go from there as well. So I think the Bayesian metaphor is very apropos.
[00:17:26] Unknown:
And in a way, the visualizations also can help to serve as documentation of your process, so that when you're going back later to try and understand what it is that you were doing, you have some way of quickly being able to scan through and say, oh, these are the things that I attempted, this is where I ended up, and then not have to re-explore some of that space. And also for somebody else who comes to the project, they have that same benefit of being able to just scan through the visualizations and say, these are the models that were attempted, this is why they went 1 direction or the
[00:17:56] Unknown:
the other. Yeah. Absolutely. It's absolutely critical for me personally. You know, I constantly think about future me, and what's future me going to think about what I'm doing right now, and having that documentation, having that trail, having that sort of very quick getting up to speed with an experimental process. And, you know, also understanding failure, to a certain extent. Right? Like, you know, a lot of times if you think about this experimental process, you're gonna fail a lot. And the Yellowbrick visualizations serve as documentation of the failure, so that you can sort of quickly go back and build from that base and figure out what went wrong and why, without having to maybe retread a rocky path. Absolutely. And is there any way to easily colocate some of the parameters that were used along with the visualization that was produced, so that you don't necessarily have to retain all of the intermediate code, but be able to just quickly see, okay, that was this model, these hyperparameters, and this was the outcome? So the answer to that is no. There's not an easy way to do that. If you want to test yourself a little bit, if you take a scikit-learn model and call get_params on it and you try to JSON-encode that thing, you will find that that is a nightmare.
And it's very difficult to do because you have all of these data types that can't be serialized, and it's nested and there's references and objects and it's kind of a mess. And that's 1 of the reasons, you know, Rebecca mentioned that we try to include as much information as possible in the visualizations. We wanna make them as rich as possible. And part of that is for that self-documenting reason. Right? What parameters were in this model? For example, all of our titles usually have the name of the estimator in the title, just so at least you have that. For me personally though, you know, I do have to do a little bit of extra work to coordinate. I end up pickling my estimators even, you know, I just have, like, this directory full of a mess of estimators, trying to coordinate them with visualizers, usually in the form of, like, markdown files and things.
That's definitely something we'd like to see Yellowbrick do better at. Maybe include meta information in the image itself. And it's definitely on the list of ideas as we're trying to explore how to coordinate this workflow a little bit better. Or possibly even in some of the EXIF data, so that it's not necessarily immediately visible but you can still extract it from the image and colocate the data. That's not a bad idea.
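The serialization headache is easy to reproduce, and the pickle-plus-figure workaround described here is sketchable; the file names below are illustrative:

```python
import json
import pickle

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

try:
    json.dumps(pipe.get_params())
except TypeError as err:
    # Nested estimator objects are not JSON-serializable.
    print("get_params() is not JSON-friendly:", err)

# The workaround described above: pickle the estimator next to the saved
# figure, with matching names tracked in a notes/markdown file.
with open("model.pkl", "wb") as fh:
    pickle.dump(pipe, fh)
# viz.poof(outpath="model.png")  # save the corresponding visualization
```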
[00:20:18] Unknown:
And visualizations can be difficult to get right in terms of figuring out what style of visualization to use, color scales, labeling, you know, how big to make the image, etcetera, you know, the distribution of the axes as far as how many units to display. So does Yellowbrick provide any assistance in terms of automatically selecting some of those settings, or providing guidance on what types of visualizations to use for different types of models or different use cases?
[00:20:46] Unknown:
Yeah. That's a good question. So Yellowbrick doesn't do a lot stylistically. I think that if we have a guiding motivation, it's for correct interpretation all the time. 1 example of this is we had a contributor say to us, you know, this sort of looks strange, this figure, it looks a little bit warped. And what we realized is we had a 45 degree line, but we didn't have a square image. And so even though it was a 45 degree line, it looked like it was, you know, at a weird slope. And so, you know, for us it's all about correctness more than anything else.
Although, you know, we do tend to use, like, Seaborn styles and things to make it look pretty. And, you know, matplotlib 2.0 is actually very good looking, but we definitely focus maybe on the utility of the visualizers.
[00:21:35] Unknown:
And for somebody who is just getting started with Yellowbrick and starting with a greenfield project, what would the workflow look like for getting started and iterating through using Yellowbrick in that process?
[00:21:48] Unknown:
So that's a really good question. I think that, you know, we've been really focused on kind of building up a base of contributors over the last couple of years. And 1 of the things that we are sort of excited to do, you know, this coming year is to kind of get a better sense of the users. Ideally, what we'd like to do is sort of capture the workflows of professional practitioners and try to encapsulate some of that, you know, and map some of that onto Yellowbrick to make it as natural as possible to do what they're already doing. We conducted a kind of preliminary usability study last year, but we're looking to launch something more formal in the next couple of months. So something that includes, you know, user interviews with people who've been doing this for a while, and maybe focus groups and A/B testing. But for the most part, you know, we're most frequently talking to students about, you know, building data science products and working on data science projects. And so in a lot of cases for them, the questions that they're asking are things like, which features do I use and which model do I use? Which model is best?
And we always tell them, just try them all and compare them visually, and then try to use those visualizations to understand why they're performing differently. You know, we frequently get asked, why isn't my model working? And it's usually something like a class imbalance problem or, you know, strangely distributed data, like really sparse data. But those are things that you can visualize very quickly with Yellowbrick. But I think for the students, the main thing that we try to communicate is that, whatever your specific use case, your specific workflow is, it's important that it's not a random walk, that your pipeline needs to be purposeful.
It can be, you know, cyclic and iterative, but those iterations need to be associated with hypotheses. You need to be, you know, kind of thinking about hypothesis-driven development, conducting experiments, and, you know, recording. And as you were just talking about documentation, you know, documentation is really important so you can understand and compare your results. But, you know, the flip side is that you need to be able to move quickly. You know, so if you're doing a data science project for a class, you know, it's due. If you're doing it for work, you know, the client is expecting it. So there's no time to write out, you know, 30 lines of custom code each time, or worse, to export the results of your modeling and then try to visualize them in some, you know, proprietary tool. So even though, you know, Python isn't always students' go-to for visualization, we really emphasize matplotlib and Yellowbrick because, you know, being able to do everything in Python using open source tools really reduces the entropy and makes that workflow more smooth and iterative
[00:24:58] Unknown:
so that you can do better science. And, you know, just along those lines, I can at least tell 2 stories of how maybe Yellowbrick is used, and, you know, what we're finding is there's a lot of different ways, like Rebecca said, and there's a lot of different workflows. But, you know, maybe just more specifically, if it's interesting: you know, I had a project. It was a regression-based project, and I was sort of coming from the world where I was very used to the modeling effort taking a lot of time, big classifiers or training big neural models or something like that. But this new regression project, you know, I maybe had only 5,000 instances, and I was building about 30 models at a time for these different segments that I was trying to investigate. And they were trained sort of near instantly. And so, in that type of scenario and that type of workflow, I was using Yellowbrick to really manage visually the changes between not just all of these sort of individual models, but then the aggregate model as a whole, to get a better sense of, you know, these sort of very small changes that I would make. Did they have a large impact or a small one? And try to categorize them just from this broad level, like, what was the impact, using things like prediction error or the residuals chart. And, you know, it's actually amazing. Like, once you've run enough of these things, you actually get an intuition.
You get a sense just by comparing, you know, training and test data residuals. You can really start to feel like, am I overfitting my model? Am I underfitting my model? Do I need to make my model more complex? Do I need to reduce the complexity of my model? And it's actually it's sort of hard to describe in words, you know, because it is a visual tool. You need to sort of see it. You sort of see these patterns emerge, whether it's areas of very high density in the residuals, whether it's along sort of 1 part of the target. You'll notice that different models, like have different shapes. So the parametric models often have very hard lines in the residuals and you can sort of see, like, stratification.
Same with ensemble models. Ensemble models will, like, partition different parts of the target, and you can sort of see if 1 part of your ensemble is weaker than another part of your ensemble, and what effect train/test splits and cross-validation have. I really wish I could explain this. I mean, it would be better to show, obviously, this kind of thing. But you start to get that intuition. You can start to get this feel like, oh, I'm going in the right direction. Or no, I'm not going in the right direction. And once you start to hone in, you know, you might be using just, like, a handful of Yellowbrick tools. Like, I was just using prediction error and the residuals plot. But once I started to hone in, then I sort of came back and started doing these, you know, deeper feature analyses with Rank2D. Do I have covariance? Do I have different correlations between my features? Are these things affecting my regularization decisions inside of these larger models?
And once you start to explore it, you know, 1 thing that I found was you start to get, like, maybe a deeper understanding of the underlying data. And you start to maybe get that intuition or that sense of what the models are doing, and particularly different model families, like, how are they interacting with your data? And it's very specific. I don't know if I could take the intuition I gained in that project and apply it to a different regression. But in that project itself, you know, things were very clear for me. And I was able to sort of tell my teammates, like, this is what I'm seeing, this is what I'm thinking based on these results. And they would look at it and they would sort of start to develop that intuition too, even though they weren't in the thick of it during the entire process. I think it's almost like every unique ML exercise ends up with a sort of unique Yellowbrick or visual experience.
And I'm hoping it's cumulative, that as you go on to more projects that intuition, you know, becomes more nuanced, so that you can just see what you're doing and what impact it has.
[00:28:45] Unknown:
And in your experience, are there any other tools in either Python or other language environments that provide any sort of similar experience of being able to have this visually iterative cycle of building and testing these machine learning models? Or do you find that it's largely unique within the problem domain of data science? You know, Yellowbrick is such a domain-specific thing. It's really tied
[00:29:11] Unknown:
to machine learning. There is another library that is a competitor of ours, I guess, that has sort of a similar thing. But I will say that, you know, when you're using Seaborn for exploratory data analysis, like, I get it's different, but I maybe have that same experience of, you know, I'm understanding the data in a deeper way, understanding distributions, and the visualizations that Seaborn's producing are enhancing my understanding of the underlying data. And, I mean, it's completely different. Right? That's for exploratory data analysis. That's for answering questions, you know, with your data. It's not about machine learning. But I could see how that would be a similar experience in Seaborn for someone who's doing machine learning with Yellowbrick. And you mentioned a bit about
[00:29:57] Unknown:
how Yellowbrick wraps these estimators and transformers in scikit learn, but I'm wondering if you can discuss a bit more about the details of how it's constructed and how the design of the library has changed and evolved over the lifetime of the project.
[00:30:13] Unknown:
Sure. So, you know, kind of in the same way that scikit-learn has estimators and transformers, Yellowbrick has tools that are specifically kind of anticipating some kind of scoring. You know, so in the case of modeling, they're sort of anticipating that you have a machine learning model that's being fit, and that you can use, you know, the training and test data, for example, to score how well the model performed. And so it's sort of hooking into the estimator part of the API there, and, you know, there's other visualizers that are more kind of anticipating, you know, data transformations.
So 1 of the ones that I like a lot is the frequency distribution visualizer, which I use a lot because I work a whole lot with text data. And so a lot of times it's for me a way to visualize how the corpus is changed or transformed as I perform different kinds of operations on it, like stop words removal or certain kinds of, you know, key phrase or n-gram analysis. But in terms of how it's changed over time, you know, we're lucky because we were basing it on a very well established API. The scikit-learn API, you know, gave us a really good design to begin with. But I will say, you know, we've had a lot of contributors. We have, I think, maybe 35, maybe 40 contributors now. And every time somebody contributes something, you know, the project changes a little bit. Everybody kind of injects some sort of, you know, creative thing that changes things a little bit. But probably, from now on out, it's mostly gonna be about adding polish and adding new visualizers that sort of capture the best practices that are being used already by folks who are doing it kind of manually and who, you know, wanna be able to do it a little bit more easily. I think, you know, the sort of journey that we took was kind of first hardening the API, and then, you know, every time matplotlib or scikit-learn changes, we sort of have to be prepared to keep up with those changes and provide, you know, compatibility.
But 1 of the more interesting things recently is that we had to figure out how to do unit testing for plots. And so the visual tests were tricky, and that was something that, you know, we hadn't really thought about from the beginning, and then had to kind of retroactively implement for all of the visualizers that we had created. Luckily we have a core contributor, Nathan, who really dug in his heels and, you know, rolled up his sleeves, and he figured out how to do the visual tests, and he sort of set the model for how we'll all do that moving forward.
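Yellowbrick's actual harness lives in its test suite; as a hedged sketch, the general baseline-image pattern can be built on matplotlib's own comparison helper (tmp_path is pytest's temporary-directory fixture, and baseline.png is an assumed checked-in baseline):

```python
import matplotlib
matplotlib.use("Agg")  # headless, deterministic rendering for CI

import matplotlib.pyplot as plt
from matplotlib.testing.compare import compare_images

def test_plot_matches_baseline(tmp_path):
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])
    actual = str(tmp_path / "actual.png")
    fig.savefig(actual)

    # compare_images returns None when the images match within the RMS
    # tolerance, otherwise a message describing the difference.
    assert compare_images("baseline.png", actual, tol=0.01) is None
```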
[00:33:09] Unknown:
And currently, Yellowbrick is tied to matplotlib for doing the visualizations, but do you foresee adding in support for any other visualization toolchains, or is it sort of hard coded to matplotlib and that's just going to be your main focus going forward? You know, that's sort of a topic of a lot of debate.
[00:33:29] Unknown:
I mean, certainly, we get a lot of advantage from using matplotlib directly. And, you know, I will say that it does play well with other libraries that are also matplotlib focused. You know, so Seaborn, if you change the styles. You know, I constantly use Seaborn and Yellowbrick side by side. You know, 1 of the questions I'm constantly asked is, can you modify or manipulate the visualization? And the answer is yes. You can access the axes on the visualizer and use any sort of matplotlib tool associated with it. And so, you know, I think that there's a few new libraries that are coming out to try to, you know, extend matplotlib into other domains and web domains and things like that. And so if people do that, then Yellowbrick will also, you know, be able to take advantage of those things. You know, however, I will say that we're not against other visualization libraries.
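A sketch of that escape hatch, with the RadViz visualizer and the iris data standing in as an example:

```python
from sklearn.datasets import load_iris
from yellowbrick.features import RadViz

X, y = load_iris(return_X_y=True)

viz = RadViz(classes=["setosa", "versicolor", "virginica"])
viz.fit(X, y)
viz.transform(X)

# The visualizer exposes its matplotlib Axes, so any matplotlib call works.
viz.ax.set_title("RadViz of the iris dataset")
viz.ax.grid(True, alpha=0.3)
viz.poof()
```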
And, you know, we just started a contrib module to sort of try to think about these things formally. For example, another 1 of our contributors, Craig, just recently wrote a blog post on districtdatalabs.com, where he talks about using R and Python along with matplotlib, and how to sort of coordinate these things. And his post is actually really interesting. It's on, you know, bias and fairness inside of algorithms. So we don't think that it necessarily has to be mutually exclusive. But, you know, when I look at sort of the visualization landscape, to me, it's either something like matplotlib where you're rendering sort of raster images.
And there is a little bit of interactivity inside of a notebook that you can add to matplotlib. But then the other side of things is, you know, tools for visualizing inside of a web application, something like Shiny or Bokeh or something like that. And, you know, we do ask ourselves that question, you know, would Yellowbrick work well inside of a web context? And we're not sure, because, you know, Yellowbrick is sort of a single-seat user kind of tool. Right? It's not meant for a general audience. It's meant specifically for the data scientist who's in the chair, who's actually working and interacting with the modeling process, and then for that person to communicate their discoveries. I don't know if you made a Yellowbrick visualization just globally available on a website whether or not that would be meaningful or interpretable without all the sort of sweat equity that goes into developing that intuition.
That said, we do have this feeling that interactivity is gonna become a very big part of Yellowbrick in the future. You know, Yellowbrick, at its heart, is a high dimensional visualization tool. And, you know, I like to say there's really only 6 visual dimensions that you can encode, you know, size, color, shape, all these kinds of things. And time is sort of the 7th. Right? If we can have a slider where we can drag things, or if we can do this sort of meaningful interaction where we have an overview first, and zoom and filter, and get details on demand, we think that that provides sort of a richer visual experience. And so, like I said when I started, this is sort of a question, you know, is matplotlib gonna take us to that place of interactivity?
Are we gonna have to look at other tools that are maybe more web-driven? Is there a happy medium? We're not a 100% sure. Although, I mean, maybe there'd be a fork, you know, Yellowbrick web or something like that, and we'd be very interested in seeing tools like that. And what about any other machine learning frameworks? Have you looked at the possibilities
[00:37:03] Unknown:
of integrating with things like TensorFlow or Keras or any of the others in the ecosystem?
[00:37:08] Unknown:
Yeah. So that is actually the whole reason that the contrib module exists, so that we could sort of think formally about engaging with other things. So statsmodels was the first, where we had a contributor say, hey, I wanna use statsmodels. There's nothing stopping me from using statsmodels except that I have to, you know, create an API, like a scikit-learn API or a wrapper for this thing. And so they did. Right? They created this sort of fake estimator where, you know, it used statsmodels under the hood, and then the visualizations worked off the bat. Craig, whose post we mentioned before, actually had historical data, and so he created something called the identity estimator, which just acted like a scikit-learn estimator but fed information from a file on disk. And that allowed him to create, you know, the classification report, ROC AUC curves, and threshold things, you know, to actually use the Yellowbrick visualizers.
And so at PyCon, actually, several of us had a big discussion: well, how can we open this up beyond scikit-learn? And, you know, Keras has a scikit-learn interface. We suspect that it would not be that difficult to include that. You know, the question is really not if, but when and how do we hook into these things. Is it gonna have to hook in from some sort of underlying pickle? You know, how do we sort of distinguish Yellowbrick from something like TensorBoard? How do we provide sort of a meaningful augmentation to the tools that already exist for things like TensorFlow and Keras and PyTorch and things like that. So, there is a path in place.
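The identity-estimator trick is easy to sketch. This is a guess at the shape of it, not Craig's actual code: an object that satisfies just enough of the scikit-learn API to replay precomputed predictions through a visualizer.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class IdentityEstimator(BaseEstimator, ClassifierMixin):
    """Quacks like a fitted classifier, but replays stored predictions,
    e.g. historical results loaded from a file on disk."""

    def __init__(self, predictions, classes):
        self.predictions = np.asarray(predictions)
        self.classes_ = np.asarray(classes)

    def fit(self, X, y=None):
        return self  # nothing to learn

    def predict(self, X):
        return self.predictions
```

A score visualizer wrapping this object can then draw its report from the stored predictions against the true labels, without ever refitting a model.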
Yellowbrick is at 0.7 right now. There might be some initial prototyping of that kind of tooling in 0.8, but I would say that definitely around 0.9, whose time frame is about the end of this year, we're gonna start to see significant support for Keras at least,
[00:39:02] Unknown:
moving forward. And what have been some of the most challenging or unexpected aspects of building and maintaining and growing Yellowbrick and its community? Originally, when we started out working on it, it was just me and Ben. And
[00:39:16] Unknown:
at that point, it was kind of surprisingly easy to just prototype out, you know, a single visualizer from beginning to end. Now that is, you know, a little less straightforward. You know, the idea part, the prototype, is still kind of fun and exciting, and then there's sort of the work of writing the tests and the visual tests and doing the documentation and all of that, and it's 1 of the things that we do think about. We wanna make sure that people feel like they can contribute and enjoy the fun part, and not feel too burdened by the rest of it. So we offer, you know, a significant amount of support to contributors, you know, to kind of get them to the finish line. And so 1 of the things that we work really hard at is striking the right tone in the dialogue of our issues and with pull requests and code reviews, even with each other. You know, if it's just core contributors communicating with each other, you know, those things are visible to everybody, and it's important to set the right tone so that people know that we're welcoming and, you know, we're all excited and we respect each other and admire each other and, you know, are interested in other people's sort of creative ideas for implementation. You know, some of the challenging things are support for platforms, especially Windows, which has been a little challenging. I guess we'll all see what happens now with GitHub kind of being bought.
You know, potentially, we're looking forward to a future of, you know, slightly fewer challenges, potentially looking for silver linings there. Python 2.7 support is an ongoing struggle. You know, we just kind of put out a survey to our community of users asking whether or not they might be open to not having 2.7 support anymore, because it is kind of a pain in the neck.
[00:41:19] Unknown:
And in the process of working with Yellowbrick, what have you found to be some of the limitations
[00:41:26] Unknown:
or edge cases that it doesn't cover yet, or that you possibly don't intend to have it cover? So, yeah, this is a great question. You know, 1 of the things that is most striking perhaps is the fact that Yellowbrick has a smaller data cap than scikit-learn does. You know, you can only draw so many things, and, you know, whereas scikit-learn is implemented with a lot of C support, Yellowbrick is mostly pure Python. So performance can also become a very, you know, real issue, especially if you're trying to deeply integrate Yellowbrick with scikit-learn using pipelines, or some other toolkit.
So, you know, we recently had this with parallel coordinates. We noticed that parallel coordinates was getting slower, and we didn't really know why. We just knew that we had these sort of benchmarks that were just starting to crawl. And we knew that we couldn't let that affect the machine learning process. Right? We want these visualizations to be an aid, not something that you have to do, and not something that gets in the way of the process. And so, you know, our first tack at that was coming from the approach of this, you know, data cap. Well, maybe we can add in some sampling, right, and just, you know, take a uniform random sample or stratified sample of the instances and only display those things. And that definitely eased the problem.
And, you know, it wasn't a 100% solution. 1 thing that did come of that is we noticed that you can actually use different sampling methods and compare different sampling methods. And that's sort of like an idea for an interactive type thing that's sort of similar to brushing in parallel coordinates. But, you know, as we dug deeper into the problem, we started realizing, well, and this is 1 of our contributors, Kyle, at PyCon, sort of noticed this: we're visualizing, or we're drawing, you know, 1 thing at a time, which means that as you add more data, the slower this thing is gonna become. So instead of drawing everything 1 at a time, every instance 1 at a time, why don't we just draw the class as a whole and insert NaNs, you know, in the middle so that the lines looked correct? And boy howdy, Kyle's implementation was something like 245 times faster. We went from, you know, 30-second fit times down into the millisecond range. Like, we were just stunned by how fast this new implementation was. And we started to think to ourselves, like, oh, man, what have we been doing? Do we need to review all of our visualizations and make sure that we're not doing things wrong? But as we continued through this process, we realized, oh, actually, this comes at a cost. When you're drawing an instance at a time and you're giving them an opacity, the opacity is additive, 1 instance at a time. And you can start to see these dense braids of instances, which is what you're looking for. Right? You're looking for clusters or groupings or any sort of coordination across features, and you're looking for those braids. And that additive opacity thing only happens when drawing 1 instance at a time; it does not happen when you draw a class at a time.
And instead you only get that sort of darkening effect between classes. So, you know, in the end we said, okay, well, I guess we'll provide both of these options. You know, so there's, like, a fast parallel coordinates and a regular parallel coordinates, I guess, so they allow the user to choose how they want to tackle this visualization problem. But that is certainly a challenge that we're going to continue to face. You know, visualization just has limitations, both in terms of time and in terms of space that you have on the image. More and more, we've been trying to ensure that the underlying implementations are using NumPy, that we're taking advantage of NumPy's performance wherever possible.
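A toy reconstruction of the class-at-a-time idea, not Yellowbrick's actual code: NaN breaks in a single NumPy array let matplotlib draw one Line2D per class instead of one per instance. The trade-off described above applies, since self-overlaps within a single line no longer accumulate opacity.

```python
import numpy as np
import matplotlib.pyplot as plt

def fast_parallel_coordinates(X, y, alpha=0.25):
    """Draw each class as a single Line2D instead of one line per instance.

    A NaN is appended after every instance; matplotlib breaks the line at
    NaNs, so one long per-class series renders as many separate segments.
    """
    _, d = X.shape
    ax = plt.gca()
    for label in np.unique(y):
        rows = X[y == label]
        xs = np.tile(np.append(np.arange(d), np.nan), len(rows))
        ys = np.hstack([rows, np.full((len(rows), 1), np.nan)]).ravel()
        ax.plot(xs, ys, alpha=alpha, label=str(label))
    ax.set_xticks(np.arange(d))
    ax.legend()
    return ax
```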
We try to avoid, you know, Python objects where we can, especially since, you know, we're in that SciPy world, just to ensure that we're sort of following the same types of steps. But this also leads to this sort of other interesting problem for us: we sit on top of both matplotlib and scikit-learn, and when either of them chokes, like, raises an exception, that bubbles up through Yellowbrick and you get these very unintelligible tracebacks, and bug finding becomes an issue. And so, you know, if there's 1 thing I wanted to say, you know, these limitations are certainly there, but the fact that they aren't as noticeable is because of the extremely hard work of our contributors, to go and find these things and capture exceptions and write better traceback messages and give advice in the documentation on how to handle things. It really has been impressive how much work they're doing. In the end, you know, the biggest thing is that, in terms of edge cases, visual meaning depends on the data and the model that you're using. And the edge case, I think, is possibly disappointment.
And this is something maybe I feel a lot more than other people. So 1 of the new visualizers I just recently implemented was a manifold visualizer, using t-SNE or Isomap or other embedding techniques to try to do an embedding, instead of a projection, of high dimensional space into 2 dimensional space. And I thought, oh, man, this is gonna be great. This is gonna provide a lot of functionality, and you're gonna be able to see a lot of different groupings, and it's gonna be a tool that's gonna be widely used. And those tools are performance intensive, and they can take, you know, hundreds of seconds to fit and draw.
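Invoking it looks roughly like this, going by the Manifold visualizer as it appears in the project's docs; the exact name and availability depend on the release:

```python
from sklearn.datasets import load_iris
from yellowbrick.features import Manifold

X, y = load_iris(return_X_y=True)

# Embed (rather than linearly project) the feature space into two
# dimensions; t-SNE and friends can take a long time on real data.
viz = Manifold(manifold="tsne")
viz.fit_transform(X, y)
viz.poof()
```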
And, you know, the first real world, quote unquote, data set that I tried this on, I ended up with, like, 1 point. Like, it just embedded everything right on top of each other. And so, you know, it was just the data I had. It was just, you know, the tools I had available. And so I think that that's maybe the biggest, I don't know if you'd call it an edge case, but the sort of biggest limitation: it's both, you know, your data and your model and what that ends up being in terms of visualization, because it will be unique. It will be a different experience. And looking forward, what are some of the things that you have planned for the future of Yellowbrick?
[00:47:28] Unknown:
So we're really interested in kind of capturing more users and, you know, figuring out more of the domain practices that are already being used, so that we can create visualizers, hopefully in collaboration with those practitioners, that correspond to use cases beyond the ones that we're the most familiar with. You know, machine learning is a really big space, and it gets used in a lot of different disciplines, and there's a lot that, you know, I'm sure is out there that people are kind of doing manually now. And that's sort of, you know, that's what makes scikit-learn so great. You know, it sort of became this place where all of these different practitioners came to contribute.
So we would really like to have something like that in Yellowbrick. So we're hoping that we can get more people to find the project, you know, on GitHub and star it and check out the docs, reach out to us through the issues and kind of share their ideas or share case studies from things that they're working on, so we can get kind of a sense of what else we could be doing.
[00:48:41] Unknown:
And then from the features side, we're looking at doing visual optimizations. I think that's kind of the next big thing for me. Thinking about how can we minimize or maximize white space, opacity, overlap. How can we actually use the sort of modeling process inside of the visualizations themselves? A lot of the visualizations depend on the order of the features, or the order that you draw things. And so can we apply maybe a more rigorous method in selecting the best possible visualization for the model? This also leads to sort of another, maybe thinner idea that we have, and that's, is there such a thing as visual correlations? Something that exists inside the visual space that doesn't necessarily exist inside of the statistical space. So, you know, 1 thing that we're thinking about is using voxel-based approaches, where we sort of draw a voxel over any dimensional space that's been plotted by a visualization.
And we look at the density of the points or the lines or the drawing, the colors that are inside that space, and use that number as a representation of the goodness of that visualization compared to the goodness of another visualization. Coincidentally, that's more or less how we do the visual testing, to see if a baseline image has changed from our test image. I mentioned this before: interactivity is certainly the next phase of features, along with the contrib library and Keras support and better statsmodels support. And then the last thing is better reports and organization. The 1 thing is, you know, I've noticed that you're never doing just 1 visualization per model. You usually have a handful or a deck of visualizations per model. And so the question is, how can we group these things together? Can we come up with, you know, like a flip book model? So, you know, like, inside of a notebook you can flip through the visualizations 1 at a time. Is a visual grid better, where we just create sort of a report and create a grid? Should we be creating, you know, HTML templates with Jinja, where we're writing Base64-encoded images into an HTML file that you can open up inside of a web browser? Maybe something kind of similar to TensorBoard, but more Yellowbrick focused. So those are the types of features that we're thinking about as we're moving forward.
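One toy reading of that voxel idea, under the stated assumption that "goodness" tracks how widely the ink spreads across grid cells:

```python
import numpy as np

def ink_density_score(x, y, bins=32):
    """Grid the plotted area and measure how widely the points spread.

    A figure where everything collapses into a handful of cells (like the
    one-point t-SNE embedding described earlier) scores near zero; one
    with structure spread across the canvas scores higher.
    """
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    return np.count_nonzero(counts) / counts.size
```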
[00:51:02] Unknown:
And beyond visualization, what are some of the other areas that you would like to see innovation or additional tooling come forth in terms of how data science is taught or conducted to make it more accessible and understandable for more people? So 1 of the things that I
[00:51:19] Unknown:
observe is that, you know, in practice, I'm seeing data science sort of being used in 1 of 2 ways. So 1 is sort of business analytics 2.0, where data scientists, you know, are maybe in some kind of standalone office in their organization, and their job is to, you know, use machine learning, use visualization, do analysis that kind of goes into some sort of business reporting that goes to management, and then, you know, decisions are sort of made based on that. And then the other model is where data scientists, you know, occupy maybe a somewhat less glamorous position, but are actually embedded in the engineering team, so they are building features that kind of just go into the software. And 1 of the biggest challenges that I face, and I think Ben would agree with this, is that a lot of the problems you face are sort of more mundane. It's not, like, is my model working, or what's the best TensorFlow architecture to eke out, you know, the most performance, but things like, how does the experimentation that I do fit into the agile workflow?
Or how do I get, you know, these back end developers who don't know very much about machine learning to approve my pull request, so that I can actually get my stuff integrated into the software? Because ultimately, you know, the machine learning stuff really lives in that sort of development environment. But my sense is that the skills that really need to be taught to data scientists, kind of as we move forward into, you know, the next generation of data products, is really software development skills. So things like testing, configuration management, security, microservice architecture, so that you know how to actually deploy the stuff that you're building. We're also sort of looking forward to a new generation of project managers, product managers, who are really, you know, more savvy about data product construction, so that they can really help mitigate the risk and kind of plan around it and support data science development throughout kind of that model selection triple workflow
[00:53:43] Unknown:
that we talked about. And are there any other aspects of Yellowbrick, or any other things that we touched on today, that you think we should discuss further before we sort of close out the show? I mean, we definitely want to hear from Yellowbrick users.
[00:53:58] Unknown:
I'd say that's the number 1 thing. We have a lot of stories that we can tell about Yellowbrick and how it's used. I think you've heard a lot of them, both from the user side and the development side. It's been a very important project to both Rebecca and I. And we want to make the project more widely used, and to do that we want to incorporate feedback from others, so that we know sort of what features are most meaningful and how we can make the library as general
[00:54:26] Unknown:
as possible. And so with that, I'll have you each add your preferred contact information to the show notes. And I'll thank you both for taking the time to join me today and discuss the work that you're doing with Yellowbrick. It's definitely a very interesting project and 1 that seems like it's providing a lot of value. So thank you for that, and I hope you enjoy the rest of your day. Hey. Thank you so much for having us.
Introduction to Guests and Yellowbrick
Origins and Use Cases of Yellowbrick
Integrating Yellowbrick with Scikit-Learn
Understanding Machine Learning Models with Visualizations
Iterative Model Selection and Visual Steering
Practical Applications and User Stories
Technical Construction and Evolution of Yellowbrick
Challenges and Community Contributions
Limitations and Future Directions
Innovation in Data Science Education and Tooling