Summary
Machine learning models are often inscrutable, and it can be difficult to know whether you are making progress. To improve feedback and speed up iteration cycles, Benjamin Bengfort and Rebecca Bilbro built Yellowbrick to easily generate visualizations of model performance. In this episode they explain how to use Yellowbrick in the process of building a machine learning project, how it aids in understanding how different parameters impact the outcome, and the improved understanding among teammates that it creates. They also explain how it integrates with the scikit-learn API, the difficulty of producing effective visualizations, and future plans for improvement and new features.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
- To get worry-free releases download GoCD, the open source continuous delivery server built by Thoughtworks. You can use their pipeline modeling and value stream map to build, control and monitor every step from commit to deployment in one place. And with their new Kubernetes integration it’s even easier to deploy and scale your build agents. Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Your host as usual is Tobias Macey and today I’m interviewing Rebecca Bilbro and Benjamin Bengfort about Yellowbrick, a scikit extension to use visualizations for assisting with model selection in your data science projects.
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe the use case for Yellowbrick and how the project got started?
- What is involved in visualizing scikit-learn models?
- What kinds of information do the visualizations convey?
- How do they aid in understanding what is happening in the models?
- How much direction does Yellowbrick provide in terms of knowing which visualizations will be helpful in various circumstances?
- What does the workflow look like for someone using Yellowbrick while iterating on a data science project?
- What are some of the common points of confusion that your students encounter when learning data science and how has Yellowbrick assisted in achieving understanding?
- How is Yellowbrick implemented and how has the design changed over the lifetime of the project?
- What would be required to integrate with other visualization libraries and what benefits (if any) might that provide?
- What about other ML frameworks?
- What are some of the most challenging or unexpected aspects of building and maintaining Yellowbrick?
- What are the limitations or edge cases for Yellowbrick?
- What do you have planned for the future of Yellowbrick?
- Beyond visualization, what are some of the other areas that you would like to see innovation in how data science is taught and/or conducted to make it more accessible?
Keep In Touch
Picks
- Tobias
- Rebecca
- Benjamin
Links
- Hadoop
- Natural Language Processing
- Machine Learning
- scikit-learn
- Model Selection Triple
- the machine learning workflow
- scikit-yb
- Yellowbrick
- Visualizer API
- Visual Tests
- Jupyter
- Matplotlib
- Tensorflow
- Hyperparameter
- Parallel Coordinates
- RadViz
- Rank2D
- Prediction Error Plot
- Residuals Plot
- Validation Curves
- Alpha Selection
- Frequency Distribution Plot
- Bayes Theorem
- Seaborn
- Stop Words
- N-gram
- Craig – Bias and Fairness of Algorithms
- Shiny
- Bokeh
- Keras
- StatsModels
- Tensorboard
- PyTorch
- NumPy
- Voxel
- Wizard of Oz
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to scale. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute. To get worry-free releases, download GoCD, the open source continuous delivery server built by Thoughtworks. You can use their pipeline modeling and value stream map to build, control, and monitor every step from commit to deployment in one place. And with their new Kubernetes integration, it's even easier to deploy and scale your build agents. Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons, and visit the site at podcastinit.com to subscribe to the show, sign up for the newsletter, and read the show notes. Your host as usual is Tobias Macey, and today I'm interviewing Rebecca Bilbro and Benjamin Bengfort about Yellowbrick, a scikit extension to use visualizations for assisting with model selection in your data science projects. So, Rebecca, could you start by introducing yourself?
[00:01:18] Unknown:
Hi, Tobias. I'm Rebecca. I'm a data scientist and Python programmer in Washington DC.
[00:01:25] Unknown:
And Ben, how about yourself?
[00:01:27] Unknown:
Hi guys. Yeah. I'm Ben. I'm a computer scientist. I guess I'm also sometimes called a data scientist, but certainly more software engineering focused. And I also live in Washington, DC. And Rebecca again, do you remember how you first got introduced to Python?
[00:01:40] Unknown:
Yes. So I came to Python somewhat recently, maybe 5 or 6 years ago, after I had already finished graduate school and I had a job, a post-academic job, analyzing what was pretty messy data and looking for practical tools that I could use to sort of augment the research skills that I developed while I was doing my dissertation. I learned Maple in high school and then in college, some Java and some Perl. So I found Python just a really natural way to express my thoughts, and it was just when data science was starting to happen, and I was really excited to see all of the open source Python packages that were sort of rising up to meet that work. And it became clear that something sort of special about Python is that it felt like a really good way to build software and also build experiments.
And then, kind of as I was getting into Python, you know, I felt a real affinity for the Python community too.
[00:02:37] Unknown:
And Ben, how about yourself? Do you remember how you got introduced to Python?
[00:02:40] Unknown:
I do. I've actually been a software engineer for 12 years or so. I guess a really long time. And so when I started doing software development, it was a variety of systems and scripting languages like Java and C, and I remember that 1 of my first bosses introduced me to Python as a prototyping language for some embedded C projects that we were doing. So we would program in Python, and then once we had everything sorted out and tested and designed, we would reimplement that in C. But we quickly started using it for web development and system scripting.
And as my job evolved, I started doing more cluster computing with Hadoop, and my job was focused on natural language processing. So, you know, dealing with large amounts of text. And so I found a lot of reasons to use Python with Hadoop, using Hadoop streaming and some other packages for NLP, and I was quite relieved when Spark came along and gave, like, a full-on Python interface to that sort of distributed computing. So I've been using it for a long time, but I don't know if I started in the same place that many other people did.
[00:03:52] Unknown:
And so the 2 of you have gotten involved in building and working with the Yellowbrick project. So I'm wondering if you can describe a bit about the use case for it and how the project got started.
[00:04:04] Unknown:
Absolutely. So Yellowbrick is about visual diagnostics for machine learning. And where we see it fitting in is, you know, how do you decide what's the best model or what are the right features. And there are sort of a host of mathematical tools for doing that. But, you know, in my career and teaching, I've noticed that you just can't stare at a bunch of numbers and really come to meaningful conclusions about progress. And so Yellowbrick entered as this way to see sort of more clearly how you were doing, where you were going, if you were going backwards or forwards during the machine learning workflow.
[00:04:48] Unknown:
Ben and I met because I was enrolled in a data science certificate program, and some of this sort of just came out of the experience of me being a student, and sort of talking to my teacher and trying to understand what was happening, and wanting desperately to understand as much as I could, as quickly as possible, and spin up that intuition, you know, not being satisfied to just sort of deploy scikit-learn without understanding what was really happening. And so this came out of, like, a very close use case for me.
[00:05:32] Unknown:
And so in order to be able to build these visualizations in an easy and somewhat automated fashion, you've hooked into scikit-learn and extended some of its capabilities and API. So I'm wondering if you can describe a bit about what's involved in creating these visualizations and what's actually involved in visualizing a machine learning model?
[00:05:55] Unknown:
So like you say, you know, this is really modeled on the scikit-learn API which, you know, for folks who haven't explored it very deeply, is really kind of an amazing thing. You know, it represents sort of the consolidated research and domain expertise from a whole lot of machine learning practitioners and researchers, and it's all sort of coalesced under this unified API that makes it really almost trivial to deploy and test, you know, any number of models. In order to visualize the models, you have to have some understanding of that API and know where to kind of hook in. So in scikit-learn, you know, there are estimators and transformers, and estimators have a fit method and a predict method, and transformers have a fit method and a transform method.
And so, you know, part of what you have to decide with visualization is, you know, am I trying to visualize a model, you know, a fit-and-predict kind of sequence, or am I trying to visualize a transformation? So, you know, my data was this and I performed some transformation and now it is this other thing and I'm trying to kind of visually compare. So, you know, we kind of think about the machine learning workflow in terms of something called the model selection triple, which originally comes from a databases paper that we read 5 or 6 years ago now, that's sort of, you know, trying to imagine what next generation databases will be like, ones that are sort of built to anticipate machine learning rather than just having machine learning performed on them after the fact. And so the authors sort of talk about the model selection triple, which just turned out to be a really good way that we found to describe how machine learning actually happens in practice. You've got these sort of, you know, these phases where you have a feature analysis, feature engineering phase, and then you move to a model selection, model comparison phase, and then into a hyperparameter tuning phase, and it's iterative and you kind of go through cycles.
But, you know, part of really kind of leveraging the visualizations is figuring out, you know, where to inject visualizations into that workflow.
[00:08:25] Unknown:
And, you know, I just wanted to add here too that that workflow often looks like research code. I already have these sort of Jupyter notebooks just full of all this, like, matplotlib and maybe experimental stuff. And, you know, that sort of annoyed me, I guess. It wasn't a very strong feeling, but it annoyed me as a software engineer, that you didn't have repeatable, DRY code, that you didn't have a mechanism for using these tools to actually see what you were doing. And so we really wanted to create a tool that got out of the way of the machine learning process, got out of the way of the model selection triple process, which I think is not what most visualization inside of a notebook does. It's just, like, this big block of code, and it really is. I mean, to generate a matplotlib visualization you're talking about 15 lines of code at least for a simple visualization.
And so with that in mind, we created Yellowbrick to map to scikit-learn's API. So we have this idea of a visualizer, which in itself is an estimator. So we like to think that visualizers learn from data in order to draw something, or to draw a visual interpretation of that data. And so they have the same API, they wrap other estimators, they wrap transformers, they are themselves transformers, you can stick them into pipelines. And then in the end, you know, using Yellowbrick is as simple as 3 lines of code, just the same 3 lines of code you'd use to do machine learning with scikit-learn. Right? Import the visualizer, instantiate it, and then fit it, or score it. Although we do have 1 special method called poof.
So poof is our extra method that you have to call in order to actually make the visualization happen, to finalize it with, you know, axes and setting the limits and titles and all of that sort of stuff.
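For reference, that pattern looks roughly like this; the classifier and dataset are illustrative rather than from the episode, and in releases after this recording poof() was renamed show():

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ClassificationReport

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Import the visualizer, instantiate it around an estimator, then fit/score.
viz = ClassificationReport(LogisticRegression())
viz.fit(X_train, y_train)   # fits the wrapped estimator
viz.score(X_test, y_test)   # scores it and draws the report
viz.poof()                  # finalizes axes, limits, titles and renders
```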
[00:10:17] Unknown:
And machine learning models have been typically very difficult to interpret, and to understand exactly what's happening, particularly when you get into deep learning and neural networks. So what kinds of information are you able to capture and convey in the visualizations, and how does that assist in
[00:10:35] Unknown:
creating an understanding of what's actually happening within that model? You know, it's funny you mentioned the thing about deep learning. I was reading on Twitter yesterday and Neil Conway posted a really funny tweet. It was something like, anyone who wants to write an alarmist op-ed about the dangers of AI should be forced to spend 48 hours using TensorFlow to solve some non-trivial problem. You know, I think that this work that we're doing is a lot harder than it seems. It's not just hard to explain it. It's also hard to just kind of systematically go through the process. So in the context of Yellowbrick, if we're thinking about that workflow, that model selection triple workflow, you know, in the beginning for feature analysis and feature engineering you might use something like the parallel coordinates visualizer, or RadViz, the radial visualization, or something like Rank2D. In the case of parallel coordinates and RadViz, you can use those to look for class separability in your data, to see if the data are well suited for classification. In the case of Rank2D, you know, you might be looking for pairwise relationships between features, and potentially looking for things like covariance, you know, things that may complicate the modeling process downstream.
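As a sketch of that feature-analysis step, here is Rank2D ranking pairwise feature correlations; the dataset is illustrative:

```python
from sklearn.datasets import load_diabetes
from yellowbrick.features import Rank2D

X, y = load_diabetes(return_X_y=True)

# Rank pairwise feature relationships; Pearson correlation surfaces the
# kind of covariance that can complicate modeling downstream.
viz = Rank2D(algorithm="pearson")
viz.fit(X, y)
viz.transform(X)  # feature visualizers follow the transformer API
viz.poof()
```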
For the, you know, model selection and model comparison phase, if you're doing regression you might be using a prediction error plot or a residuals plot to kind of inspect how effectively different regressors are fitting the data, and kind of where the error is happening. If you're noticing heteroskedasticity, you know, distribution errors, those are very, very easy to see quickly and intuitively with those kinds of plots. And then for the hyperparameter tuning phase, you might use something like validation curves to understand, you know, as you introduce more training data, what's happening to the trade-off between bias and variance.
Are you getting to overfit, you know, at some point, where you might use something like the alpha selection visualizer to see how well regularization is working? Maybe you're using, like, L1 regularization, L2 regularization to smooth out your data ahead of, you know, some linear model, and you're kind of wondering if that is actually working, if it is kind of helping to reduce that noise. But generally, for all of the visualizations in Yellowbrick, we're aiming to expose as much information as possible. So in the same way that scikit-learn really provides a lot of access to model attributes, you know, we're trying to do the same thing, make sure that you have access to the scores, you know, the timings for how long it took to fit or transform, you know, annotations and those kinds of things, to make sure we're conveying as much information as possible.
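Picking one of the regression plots just mentioned, a hedged sketch of the residuals plot; the estimator and data are illustrative:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from yellowbrick.regressor import ResidualsPlot

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Structure in the residuals (rather than random scatter about zero)
# points at heteroskedasticity or a poorly specified model.
viz = ResidualsPlot(Ridge())
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.poof()
```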
[00:13:37] Unknown:
And can you dig a bit more into how the visualizations will assist in gaining understanding about what's happening within the various models that people are experimenting with?
[00:13:47] Unknown:
Yeah. So this is a place that we really think that Yellowbrick shines, and that's this idea of steering, or visual steering, of the model selection process. The term steering itself comes from the HPC world, or the high performance computing world. And if you rephrase the model selection triple in terms of a search problem, right, what is the best combination of features, algorithm, data, and hyperparameters that leads to a model that's operational? Something that we can use in our systems to actually make decisions. If you phrase it that way, you see that you're in a really just massive search space. There's a lot of different options and, you know, that's actually the benefit of machine learning. Right? I think 1 of the joys of having generalized machine learning frameworks like scikit-learn available and so easily accessible by a wide variety of people is that they can combine these models and data without necessarily having a PhD or a deep understanding. They can engage in the search process to find novel and unique occurrences of good fits. And, you know, if you're doing that on purpose, that can be very daunting. Right? Where do you look? Where do you go? How do you correct your course if you're sort of heading down a wrong path? You know, should we be using a probabilistic method or a Bayesian method? Is that most well suited for this? Should we go down a completely different path and use something nonparametric like gradient boosting or random forests? And I think that's what Yellowbrick is designed to do in terms of understanding the models itself: trying to compare 2 different visualizations of the same model, tweak it experimentally, and then you can see what changes happen in the visualization. So it turns out not to be maybe an interpretation of a single visualization.
Although that's very common once you get to that good search space. Once you start honing in on it, you might start to think, okay, well, what does this mean? How do I interpret the behavior of this model in this data? Do I trust this data? What amount of risk do I take on? But in the initial stages, it's really about comparing different models, comparing these different instantiations of these triples, and looking for progress as you're making these changes and moving forward. That is certainly my experience of machine learning, and using Yellowbrick in a couple of different projects and contexts. And as I talk to more data scientists, more people who are doing machine learning on a daily basis, I think that they're really starting to buy into this idea of steering, to this idea of visual analytics, where there's a combination of sort of this human domain knowledge with this machine ability to generate models very rapidly, and how can you combine those things in a meaningful way to find the best model. And Yellowbrick is a start at that. It's certainly not a comprehensive tool for visual analytics, but it does provide that interpretability that I think is lacking from just simple reports or numeric scores. So in a way, it helps with
[00:16:52] Unknown:
going through sort of a human scale Bayesian process where you're iterating on your priors to gain direction in terms of where to go next to be able to come to an effective outcome. Yeah. That's absolutely
[00:17:06] Unknown:
a great metaphor for it. You know, what do you understand about what you were doing before, and how does that affect what happens after? And importantly, it also informs your team. Right? It's a good way of saying, you know, here's where I came from and here's where I ended up, and now you can take the baton and go from there as well. So I think the Bayesian metaphor is very apropos.
[00:17:26] Unknown:
And in a way, the visualizations also can help to serve as documentation of your process, so that when you're going back later to try and understand what it is that you were doing, you have some way of quickly being able to scan through and say, oh, these are the things that I attempted, this is where I ended up, and then not have to re-explore some of that space. And also for somebody else who comes to the project, they have that same benefit of being able to just scan through the visualizations and say, these are the models that were attempted, this is why they went 1 direction or the
[00:17:56] Unknown:
the other. Yeah. Absolutely. It's absolutely critical for me personally. You know, I constantly think about future me, and what's future me going to think about what I'm doing right now, and having that documentation, having that trail, having that sort of very quick getting up to speed with an experimental process. And, you know, also understanding failure, to a certain extent. Right? Like, you know, a lot of times if you think about this experimental process, you're gonna fail a lot. And the Yellowbrick visualizations serve as documentation of the failure, so that you can sort of quickly go back and build from that base and figure out what went wrong and why, without having to maybe retread a rocky path. Absolutely. And is there any way to easily colocate some of the parameters that were used along with the visualization that was produced, so that you don't necessarily have to retain all of the intermediate code, but be able to just quickly see, okay, that was this model, these hyperparameters, and this was the outcome? So the answer to that is no. There's not an easy way to do that. If you want to test yourself a little bit, if you take a scikit-learn model and call get_params on it and you try to JSON-encode that thing, you will find that that is a nightmare.
And it's very difficult to do because you have all of these data types that can't be serialized, and it's nested and there's references and objects and it's kind of a mess. And that's 1 of the reasons, you know, Rebecca mentioned that we try to include as much information as possible in the visualizations. We wanna make them as rich as possible. And part of that is for that self-documenting reason. Right? What parameters were in this model? For example, all of our titles usually have the name of the estimator in the title, just so at least you have that. For me personally though, you know, I do have to do a little bit of extra work to coordinate. I end up pickling my estimators even, you know, I just have, like, this directory full of a mess of estimators, trying to coordinate them with visualizers, usually in the form of, like, markdown files and things.
That's definitely something we'd like to see Yellowbrick do better at. Maybe include meta information in the image itself. And it's definitely on the list of ideas as we're trying to explore how to coordinate this workflow a little bit better. Or possibly even in some of the EXIF data, so that it's not necessarily immediately visible but you can still extract it from the image and colocate the data. That's not a bad idea.
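The serialization headache is easy to reproduce, and the pickle-plus-figure workaround described here is sketchable; the file names below are illustrative:

```python
import json
import pickle

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

try:
    json.dumps(pipe.get_params())
except TypeError as err:
    # Nested estimator objects are not JSON-serializable.
    print("get_params() is not JSON-friendly:", err)

# The workaround described above: pickle the estimator next to the saved
# figure, with matching names tracked in a notes/markdown file.
with open("model.pkl", "wb") as fh:
    pickle.dump(pipe, fh)
# viz.poof(outpath="model.png")  # save the corresponding visualization
```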
[00:20:18] Unknown:
And visualizations can be difficult to get right in terms of figuring out what style of visualization to use, color scales, labeling, you know, how big to make the image, etcetera, you know, the distribution of the axes as far as how many units to display. So does Yellowbrick provide any assistance in terms of automatically selecting some of those settings, or providing guidance on what types of visualizations to use for different types of models or different use cases?
[00:20:46] Unknown:
Yeah. That's a good question. So Yellowbrick doesn't do a lot stylistically. I think that if we have a guiding motivation, it's for correct interpretation all the time. 1 example of this is we had a contributor say to us, you know, this sort of looks strange, this figure, it looks a little bit warped. And what we realized is we had a 45 degree line, but we didn't have a square image. And so even though it was a 45 degree line, it looked like it was, you know, at a weird slope. And so, you know, for us it's all about correctness more than anything else.
Although, you know, we do tend to use, like, Seaborn styles and things to make it look pretty. And, you know, matplotlib 2.0 is actually very good looking, but we definitely focus maybe on the utility of the visualizers.
[00:21:35] Unknown:
And for somebody who is just getting started with Yellowbrick and starting with a greenfield project, what would the workflow look like for getting started and iterating through using Yellowbrick in that process?
[00:21:48] Unknown:
So that's a really good question. I think that, you know, we've been really focused on kind of building up a base of contributors over the last couple of years. And 1 of the things that we are sort of excited to do, you know, this coming year is to kind of get a better sense of the users. Ideally, what we'd like to do is sort of capture the workflows of professional practitioners and try to encapsulate some of that, you know, and map some of that onto Yellowbrick to make it as natural as possible to do what they're already doing. We conducted a kind of preliminary usability study last year, but we're looking to launch something more formal in the next couple of months. So something that includes, you know, user interviews with people who've been doing this for a while, and maybe focus groups and A/B testing. But for the most part, you know, we're most frequently talking to students about, you know, building data science products and working on data science projects. And so in a lot of cases for them, the questions that they're asking are things like, which features do I use and which model do I use? Which model is best?
And we always tell them, just try them all and compare them visually, and then try to use those visualizations to understand why they're performing differently. You know, we frequently get asked, why isn't my model working? And it's usually something like a class imbalance problem or, you know, strangely distributed data, like really sparse data. But those are things that you can visualize very quickly with Yellowbrick. But I think for the students, the main thing that we try to communicate is that, whatever your specific use case, your specific workflow is, it's important that it's not a random walk, that your pipeline needs to be purposeful.
It can be, you know, cyclic and iterative, but those iterations need to be associated with hypotheses. You need to be, you know, kind of thinking about hypothesis-driven development, conducting experiments, and, you know, recording. And as you were just talking about documentation, you know, documentation is really important so you can understand and compare your results. But, you know, the flip side is that you need to be able to move quickly. You know, so if you're doing a data science project for a class, you know, it's due. If you're doing it for work, you know, the client is expecting it. So there's no time to write out, you know, 30 lines of custom code each time, or worse, to export the results of your modeling and then try to visualize them in some, you know, proprietary tool. So even though, you know, Python isn't always students' go-to for visualization, we really emphasize matplotlib and Yellowbrick because, you know, being able to do everything in Python using open source tools really reduces the entropy and makes that workflow more smooth and iterative
[00:24:58] Unknown:
so that you can do better science. And, you know, just along those lines, I can at least tell 2 stories of how maybe Yellowbrick is used, and, you know, what we're finding is there's a lot of different ways, like Rebecca said, and there's a lot of different workflows. But, you know, maybe just more specifically, if it's interesting: you know, I had a project. It was a regression-based project, and I was sort of coming from the world where I was very used to the modeling effort taking a lot of time, big classifiers or training big neural models or something like that. But this new regression project, you know, I maybe had only 5,000 instances, and I was building about 30 models at a time for these different segments that I was trying to investigate. And they were trained sort of near instantly. And so, in that type of scenario and that type of workflow, I was using Yellowbrick to really manage visually the changes between not just all of these sort of individual models, but then the aggregate model as a whole, to get a better sense of, you know, these sort of very small changes that I would make. Did they have a large impact or a small one? And try to categorize them just from this broad level, like, what was the impact, using things like prediction error or the residuals chart. And, you know, it's actually amazing. Like, once you've run enough of these things, you actually get an intuition.
You get a sense just by comparing, you know, training and test data residuals. You can really start to feel like, am I overfitting my model? Am I underfitting my model? Do I need to make my model more complex? Do I need to reduce the complexity of my model? And it's actually it's sort of hard to describe in words, you know, because it is a visual tool. You need to sort of see it. You sort of see these patterns emerge, whether it's areas of very high density in the residuals, whether it's along sort of 1 part of the target. You'll notice that different models, like have different shapes. So the parametric models often have very hard lines in the residuals and you can sort of see, like, stratification.
Same with ensemble models. Ensemble models will, like, partition different parts of the target, and you can sort of see if 1 part of your ensemble is weaker than another part of your ensemble, and what effect train/test splits and cross-validation have. I really wish I could explain this. I mean, it would be better to show, obviously, this kind of thing. But you start to get that intuition. You can start to get this feel like, oh, I'm going in the right direction. Or no, I'm not going in the right direction. And once you start to hone in, you know, you might be using just, like, a handful of Yellowbrick tools. Like, I was just using prediction error and the residuals plot. But once I started to hone in, then I sort of came back and started doing these, you know, deeper feature analyses with Rank2D. Do I have covariance? Do I have different correlations between my features? Are these things affecting my regularization decisions inside of these larger models?
And once you start to explore it, you know, 1 thing that I found was you start to get, like, maybe a deeper understanding of the underlying data. And you start to maybe get that intuition or that sense of what the models are doing, and particularly different model families, like, how are they interacting with your data? And it's very specific. I don't know if I could take the intuition I gained in that project and apply it to a different regression. But in that project itself, you know, things were very clear for me. And I was able to sort of tell my teammates, like, this is what I'm seeing, this is what I'm thinking based on these results. And they would look at it and they would sort of start to develop that intuition too, even though they weren't in the thick of it during the entire process. I think it's almost like every unique ML exercise ends up with a sort of unique Yellowbrick or visual experience.
And I'm hoping it's cumulative, that as you go on to more projects that intuition, you know, becomes more nuanced, so that you can just see what you're doing and what impact it has.
[00:28:45] Unknown:
And in your experience, are there any other tools in either Python or other language environments that provide any sort of similar experience of being able to have this visually iterative cycle of building and testing these machine learning models? Or do you find that it's largely unique within the problem domain of data science? You know, Yellowbrick is such a domain-specific thing. It's really tied
[00:29:11] Unknown:
to machine learning. There is another library that is a competitor of ours, I guess, that has sort of a similar thing. But I will say that, you know, when you're using Seaborn for exploratory data analysis, like, I get it's different, but I maybe have that same experience of, you know, I'm understanding the data in a deeper way, understanding distributions, and the visualizations that Seaborn's producing are enhancing my understanding of the underlying data. And, I mean, it's completely different. Right? That's for exploratory data analysis. That's for answering questions, you know, with your data. It's not about machine learning. But I could see how that would be a similar experience in Seaborn for someone who's doing machine learning with Yellowbrick. And you mentioned a bit about
[00:29:57] Unknown:
how Yellowbrick wraps these estimators and transformers in scikit learn, but I'm wondering if you can discuss a bit more about the details of how it's constructed and how the design of the library has changed and evolved over the lifetime of the project.
[00:30:13] Unknown:
Sure. So, you know, kind of in the same way that scikit-learn has estimators and transformers, Yellowbrick has tools that are specifically kind of anticipating some kind of scoring. You know, so in the case of modeling, they're sort of anticipating that you have a machine learning model that's being fit, and that you can use, you know, the training and test data, for example, to score how well the model performed. And so it's sort of hooking into the estimator part of the API there, and, you know, there's other visualizers that are more kind of anticipating, you know, data transformations.
So 1 of the ones that I like a lot is the frequency distribution visualizer, which I use a lot because I work a whole lot with text data. And so a lot of times it's for me a way to visualize how the corpus is changed or transformed as I perform different kinds of operations on it, like stop words removal or certain kinds of, you know, key phrase or n-gram analysis. But in terms of how it's changed over time, you know, we're lucky because we were basing it on a very well established API. The scikit-learn API, you know, gave us a really good design to begin with. But I will say, you know, we've had a lot of contributors. We have, I think, maybe 35, maybe 40 contributors now. And every time somebody contributes something, you know, the project changes a little bit. Everybody kind of injects some sort of, you know, creative thing that changes things a little bit. But probably, from now on out, it's mostly gonna be about adding polish and adding new visualizers that sort of capture the best practices that are being used already by folks who are doing it kind of manually and who, you know, wanna be able to do it a little bit more easily. I think, you know, the sort of journey that we took was kind of first hardening the API, and then, you know, every time matplotlib or scikit-learn changes, we sort of have to be prepared to keep up with those changes and provide, you know, compatibility.
But 1 of the more interesting things recently is that we had to figure out how to do unit testing for plots. And so the visual tests were tricky, and that was something that, you know, we hadn't really thought about from the beginning, and then had to kind of retroactively implement for all of the visualizers that we had created. Luckily we have a core contributor, Nathan, who really dug in his heels and, you know, rolled up his sleeves, and he figured out how to do the visual tests, and he sort of set the model for how we'll all do that moving forward.
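Yellowbrick's actual harness lives in its test suite; as a hedged sketch, the general baseline-image pattern can be built on matplotlib's own comparison helper (tmp_path is pytest's temporary-directory fixture, and baseline.png is an assumed checked-in baseline):

```python
import matplotlib
matplotlib.use("Agg")  # headless, deterministic rendering for CI

import matplotlib.pyplot as plt
from matplotlib.testing.compare import compare_images

def test_plot_matches_baseline(tmp_path):
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])
    actual = str(tmp_path / "actual.png")
    fig.savefig(actual)

    # compare_images returns None when the images match within the RMS
    # tolerance, otherwise a message describing the difference.
    assert compare_images("baseline.png", actual, tol=0.01) is None
```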
[00:33:09] Unknown:
And currently, Yellowbrick is tied to matplotlib for doing the visualizations, but do you foresee adding in support for any other visualization toolchains, or is it sort of hard coded to matplotlib and that's just going to be your main focus going forward? You know, that's sort of a topic of a lot of debate.
[00:33:29] Unknown:
I mean, certainly, we get a lot of advantage from using matplotlib directly. And, you know, I will say that it does play well with other libraries that are also matplotlib focused. You know, so Seaborn, if you change the styles. You know, I constantly use Seaborn and Yellowbrick side by side. You know, 1 of the questions I'm constantly asked is, can you modify or manipulate the visualization? And the answer is yes. You can access the axes on the visualizer and use any sort of matplotlib tool associated with it. And so, you know, I think that there's a few new libraries that are coming out to try to, you know, extend matplotlib into other domains and web domains and things like that. And so if people do that, then Yellowbrick will also, you know, be able to take advantage of those things. You know, however, I will say that we're not against other visualization libraries.
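A sketch of that escape hatch, with the RadViz visualizer and the iris data standing in as an example:

```python
from sklearn.datasets import load_iris
from yellowbrick.features import RadViz

X, y = load_iris(return_X_y=True)

viz = RadViz(classes=["setosa", "versicolor", "virginica"])
viz.fit(X, y)
viz.transform(X)

# The visualizer exposes its matplotlib Axes, so any matplotlib call works.
viz.ax.set_title("RadViz of the iris dataset")
viz.ax.grid(True, alpha=0.3)
viz.poof()
```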
And, you know, we just started a contrib module to sort of try to think about these things formally. For example, another 1 of our contributors, Craig, just recently wrote a blog post on districtdatalabs.com, where he talks about using R and Python along with matplotlib, and how to sort of coordinate these things. And his post is actually really interesting. It's on, you know, bias and fairness inside of algorithms. So we don't think that it necessarily has to be mutually exclusive. But, you know, when I look at sort of the visualization landscape, to me, it's either something like matplotlib where you're rendering sort of raster images.
And there is a little bit of interactivity inside of a notebook that you can add to matplotlib. But then the other side of things is, you know, tools for visualizing inside of a web application, something like Shiny or Bokeh or something like that. And, you know, we do ask ourselves that question, you know, would Yellowbrick work well inside of a web context? And we're not sure, because, you know, Yellowbrick is sort of a single-seat user kind of tool. Right? It's not meant for a general audience. It's meant specifically for the data scientist who's in the chair, who's actually working and interacting with the modeling process, and then for that person to communicate their discoveries. I don't know if you made a Yellowbrick visualization just globally available on a website whether or not that would be meaningful or interpretable without all the sort of sweat equity that goes into developing that intuition.
That said, we do have this feeling that interactivity is gonna become a very big part of Yellowbrick in the future. You know, Yellowbrick, at its heart, is a high dimensional visualization tool. And, you know, I like to say there's really only 6 visual dimensions that you can encode, you know, size, color, shape, all these kinds of things. And time is sort of the 7th. Right? If we can have a slider where we can drag things, or if we can do this sort of meaningful interaction where we have an overview first, and zoom and filter, and get details on demand, we think that that provides sort of a richer visual experience. And so, like I said when I started, this is sort of a question, you know, is matplotlib gonna take us to that place of interactivity?
Are we gonna have to look at other tools that are maybe more web-driven? Is there a happy medium? We're not a 100% sure. Although, I mean, maybe there'd be a fork, you know, Yellowbrick web or something like that, and we'd be very interested in seeing tools like that. And what about any other machine learning frameworks? Have you looked at the possibilities
[00:37:03] Unknown:
of integrating with things like TensorFlow or Keras or any of the others in the ecosystem?
[00:37:08] Unknown:
Yeah. So that is actually the whole reason that the contrib module exists, so that we could sort of think formally about engaging with other things. So statsmodels was the first, where we had a contributor say, hey, I wanna use statsmodels. There's nothing stopping me from using statsmodels except that I have to, you know, create an API, like a scikit-learn API or a wrapper for this thing. And so they did. Right? They created this sort of fake estimator where, you know, it used statsmodels under the hood, and then the visualizations worked off the bat. Craig, whose post we mentioned before, actually had historical data, and so he created something called the identity estimator, which just acted like a scikit-learn estimator but fed information from a file on disk. And that allowed him to create, you know, the classification report, ROC AUC curves, and threshold things, you know, to actually use the Yellowbrick visualizers.
And so at PyCon, actually, several of us had a big discussion: well, how can we open this up beyond scikit-learn? And, you know, Keras has a scikit-learn interface. We suspect that it would not be that difficult to include that. You know, the question is really not if, but when and how do we hook into these things. Is it gonna have to hook in from some sort of underlying pickle? You know, how do we sort of distinguish Yellowbrick from something like TensorBoard? How do we provide sort of a meaningful augmentation to the tools that already exist for things like TensorFlow and Keras and PyTorch and things like that. So, there is a path in place.
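The identity-estimator trick is easy to sketch. This is a guess at the shape of it, not Craig's actual code: an object that satisfies just enough of the scikit-learn API to replay precomputed predictions through a visualizer.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class IdentityEstimator(BaseEstimator, ClassifierMixin):
    """Quacks like a fitted classifier, but replays stored predictions,
    e.g. historical results loaded from a file on disk."""

    def __init__(self, predictions, classes):
        self.predictions = np.asarray(predictions)
        self.classes_ = np.asarray(classes)

    def fit(self, X, y=None):
        return self  # nothing to learn

    def predict(self, X):
        return self.predictions
```

A score visualizer wrapping this object can then draw its report from the stored predictions against the true labels, without ever refitting a model.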
Yellowbrick is at 0.7 right now. There might be some initial prototyping of that kind of tooling in 0.8, but I would say that definitely around 0.9, whose time frame is about the end of this year, we're gonna start to see significant support for Keras at least,
[00:39:02] Unknown:
moving forward. And what have been some of the most challenging or unexpected aspects of building and maintaining and growing Yellowbrick and its community? Originally, when we started out working on it, it was just me and Ben. And
[00:39:16] Unknown:
at that point, it was kind of surprisingly easy to just prototype out, you know, a single visualizer from beginning to end. Now that is, you know, a little less straightforward. You know, the idea part, the prototype, is still kind of fun and exciting, and then there's sort of the work of writing the tests and the visual tests and doing the documentation and all of that, and it's 1 of the things that we do think about. We wanna make sure that people feel like they can contribute and enjoy the fun part, and not feel too burdened by the rest of it. So we offer, you know, a significant amount of support to contributors, you know, to kind of get them to the finish line. And so 1 of the things that we work really hard at is striking the right tone in the dialogue of our issues and with pull requests and code reviews, even with each other. You know, if it's just core contributors communicating with each other, you know, those things are visible to everybody, and it's important to set the right tone so that people know that we're welcoming and, you know, we're all excited and we respect each other and admire each other and, you know, are interested in other people's sort of creative ideas for implementation. You know, some of the challenging things are support for platforms, especially Windows, which has been a little challenging. I guess we'll all see what happens now with GitHub kind of being bought.
You know, potentially, we're looking forward to a future of, you know, slightly fewer challenges, potentially looking for silver linings there. Python 2.7 support is an ongoing struggle. You know, we just kind of put out a survey to our community of users asking whether or not they might be open to not having 2.7 support anymore, because it is kind of a pain in the neck.
[00:41:19] Unknown:
And in the process of working with Yellowbrick, what have you found to be some of the limitations
[00:41:26] Unknown:
or edge cases that it doesn't cover yet, or that you possibly don't intend to have it cover? So, yeah, this is a great question. You know, 1 of the things that is most striking perhaps is the fact that Yellowbrick has a smaller data cap than scikit-learn does. You know, you can only draw so many things, and, you know, whereas scikit-learn is implemented with a lot of C support, Yellowbrick is mostly pure Python. So performance can also become a very, you know, real issue, especially if you're trying to deeply integrate Yellowbrick with scikit-learn using pipelines, or some other toolkit.
So, you know, we recently had this with parallel coordinates. We noticed that parallel coordinates was getting slower, and we didn't really know why. We just knew that we had these sort of benchmarks that were just starting to crawl. And we knew that we couldn't let that affect the machine learning process. Right? We want these visualizations to be an aid, not something that you have to do, and not something that gets in the way of the process. And so, you know, our first tack at that was coming from the approach of this, you know, data cap. Well, maybe we can add in some sampling, right, and just, you know, take a uniform random sample or stratified sample of the instances and only display those things. And that definitely eased the problem.
And, you know, it wasn't a 100% solution. 1 thing that did come of that is we noticed that you can actually use different sampling methods and compare different sampling methods. And that's sort of like an idea for an interactive type thing that's sort of similar to brushing in parallel coordinates. But, you know, as we dug deeper into the problem, we started realizing, well, and this is 1 of our contributors, Kyle, at PyCon, sort of noticed this: we're visualizing, or we're drawing, you know, 1 thing at a time, which means that as you add more data, the slower this thing is gonna become. So instead of drawing everything 1 at a time, every instance 1 at a time, why don't we just draw the class as a whole and insert NaNs, you know, in the middle so that the lines looked correct? And boy howdy, Kyle's implementation was something like 245 times faster. We went from, you know, 30-second fit times down into the millisecond range. Like, we were just stunned by how fast this new implementation was. And we started to think to ourselves, like, oh, man, what have we been doing? Do we need to review all of our visualizations and make sure that we're not doing things wrong? But as we continued through this process, we realized, oh, actually, this comes at a cost. When you're drawing an instance at a time and you're giving them an opacity, the opacity is additive, 1 instance at a time. And you can start to see these dense braids of instances, which is what you're looking for. Right? You're looking for clusters or groupings or any sort of coordination across features, and you're looking for those braids. And that additive opacity thing only happens when drawing 1 instance at a time; it does not happen when you draw a class at a time.
And instead you only get that sort of darkening effect between classes. So, you know, in the end we said, okay, well, I guess we'll provide both of these options. You know, so there's, like, a fast parallel coordinates and a regular parallel coordinates, I guess, so they allow the user to choose how they want to tackle this visualization problem. But that is certainly a challenge that we're going to continue to face. You know, visualization just has limitations, both in terms of time and in terms of space that you have on the image. More and more, we've been trying to ensure that the underlying implementations are using NumPy, that we're taking advantage of NumPy's performance wherever possible.
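A toy reconstruction of the class-at-a-time idea, not Yellowbrick's actual code: NaN breaks in a single NumPy array let matplotlib draw one Line2D per class instead of one per instance. The trade-off described above applies, since self-overlaps within a single line no longer accumulate opacity.

```python
import numpy as np
import matplotlib.pyplot as plt

def fast_parallel_coordinates(X, y, alpha=0.25):
    """Draw each class as a single Line2D instead of one line per instance.

    A NaN is appended after every instance; matplotlib breaks the line at
    NaNs, so one long per-class series renders as many separate segments.
    """
    _, d = X.shape
    ax = plt.gca()
    for label in np.unique(y):
        rows = X[y == label]
        xs = np.tile(np.append(np.arange(d), np.nan), len(rows))
        ys = np.hstack([rows, np.full((len(rows), 1), np.nan)]).ravel()
        ax.plot(xs, ys, alpha=alpha, label=str(label))
    ax.set_xticks(np.arange(d))
    ax.legend()
    return ax
```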
We try to avoid, you know, Python objects where we can, especially since, you know, we're in that SciPy world, just to ensure that we're sort of following the same types of steps. But this also leads to this sort of other interesting problem for us: we sit on top of both matplotlib and scikit-learn, and when either of them chokes, like, raises an exception, that bubbles up through Yellowbrick and you get these very unintelligible tracebacks, and bug finding becomes an issue. And so, you know, if there's 1 thing I wanted to say, you know, these limitations are certainly there, but the fact that they aren't as noticeable is because of the extremely hard work of our contributors, to go and find these things and capture exceptions and write better traceback messages and give advice in the documentation on how to handle things. It really has been impressive how much work they're doing. In the end, you know, the biggest thing is that, in terms of edge cases, visual meaning depends on the data and the model that you're using. And the edge case, I think, is possibly disappointment.
And this is something maybe I feel a lot more than other people. So 1 of the new visualizers I just recently implemented was a manifold visualizer, using t-SNE or Isomap or other embedding techniques to try to do an embedding, instead of a projection, of high dimensional space into 2 dimensional space. And I thought, oh, man, this is gonna be great. This is gonna provide a lot of functionality, and you're gonna be able to see a lot of different groupings, and it's gonna be a tool that's gonna be widely used. And those tools are performance intensive, and they can take, you know, hundreds of seconds to fit and draw.
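Invoking it looks roughly like this, going by the Manifold visualizer as it appears in the project's docs; the exact name and availability depend on the release:

```python
from sklearn.datasets import load_iris
from yellowbrick.features import Manifold

X, y = load_iris(return_X_y=True)

# Embed (rather than linearly project) the feature space into two
# dimensions; t-SNE and friends can take a long time on real data.
viz = Manifold(manifold="tsne")
viz.fit_transform(X, y)
viz.poof()
```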
And, you know, the first real world, quote unquote, data set that I tried this on, I ended up with, like, 1 point. Like, it just embedded everything right on top of each other. And so, you know, it was just the data I had. It was just, you know, the tools I had available. And so I think that that's maybe the biggest, I don't know if you'd call it an edge case, but the sort of biggest limitation: it's both, you know, your data and your model and what that ends up being in terms of visualization, because it will be unique. It will be a different experience. And looking forward, what are some of the things that you have planned for the future of Yellowbrick?
[00:47:28] Unknown:
So we're really interested in kind of capturing more users and, you know, figuring out more of the domain practices that are already being used, so that we can create visualizers, hopefully in collaboration with those practitioners, that correspond to use cases beyond the ones that we're the most familiar with. You know, machine learning is a really big space, and it gets used in a lot of different disciplines, and there's a lot that, you know, I'm sure is out there that people are kind of doing manually now. And that's sort of, you know, that's what makes scikit-learn so great. You know, it sort of became this place where all of these different practitioners came to contribute.
So we would really like to have something like that in Yellowbrick. So we're hoping that we can get more people to find the project, you know, on GitHub and star it and check out the docs, reach out to us through the issues and kind of share their ideas or share case studies from things that they're working on, so we can get kind of a sense of what else we could be doing.
[00:48:41] Unknown:
And then from the features side, we're looking at doing visual optimizations. I think that's kind of the next big thing for me. Thinking about how can we minimize or maximize white space, opacity, overlap. How can we actually use the sort of modeling process inside of the visualizations themselves? A lot of the visualizations depend on the order of the features, or the order that you draw things. And so can we apply maybe a more rigorous method in selecting the best possible visualization for the model? This also leads to sort of another, maybe thinner idea that we have, and that's, is there such a thing as visual correlations? Something that exists inside the visual space that doesn't necessarily exist inside of the statistical space. So, you know, 1 thing that we're thinking about is using voxel-based approaches, where we sort of draw a voxel over any dimensional space that's been plotted by a visualization.
And we look at the density of the points or the lines or the drawing, the colors that are inside that space, and use that number as a representation of the goodness of that visualization compared to the goodness of another visualization. Coincidentally, that's more or less how we do the visual testing, to see if a baseline image has changed from our test image. I mentioned this before: interactivity is certainly the next phase of features, along with the contrib library and Keras support and better statsmodels support. And then the last thing is better reports and organization. The 1 thing is, you know, I've noticed that you're never doing just 1 visualization per model. You usually have a handful or a deck of visualizations per model. And so the question is, how can we group these things together? Can we come up with, you know, like a flip book model? So, you know, like, inside of a notebook you can flip through the visualizations 1 at a time. Is a visual grid better, where we just create sort of a report and create a grid? Should we be creating, you know, HTML templates with Jinja, where we're writing Base64-encoded images into an HTML file that you can open up inside of a web browser? Maybe something kind of similar to TensorBoard, but more Yellowbrick focused. So those are the types of features that we're thinking about as we're moving forward.
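One toy reading of that voxel idea, under the stated assumption that "goodness" tracks how widely the ink spreads across grid cells:

```python
import numpy as np

def ink_density_score(x, y, bins=32):
    """Grid the plotted area and measure how widely the points spread.

    A figure where everything collapses into a handful of cells (like the
    one-point t-SNE embedding described earlier) scores near zero; one
    with structure spread across the canvas scores higher.
    """
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    return np.count_nonzero(counts) / counts.size
```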
[00:51:02] Unknown:
And beyond visualization, what are some of the other areas that you would like to see innovation or additional tooling come forth in terms of how data science is taught or conducted to make it more accessible and understandable for more people? So 1 of the things that I
[00:51:19] Unknown:
observe is that, you know, in practice, I'm seeing data science sort of being used in 1 of 2 ways. So 1 is sort of business analytics 2.0, where data scientists, you know, are maybe in some kind of standalone office in their organization, and their job is to, you know, use machine learning, use visualization, do analysis that kind of goes into some sort of business reporting that goes to management, and then, you know, decisions are sort of made based on that. And then the other model is where data scientists, you know, occupy maybe a somewhat less glamorous position, but are actually embedded in the engineering team, so they are building features that kind of just go into the software. And 1 of the biggest challenges that I face, and I think Ben would agree with this, is that a lot of the problems you face are sort of more mundane. It's not, like, is my model working, or what's the best TensorFlow architecture to eke out, you know, the most performance, but things like, how does the experimentation that I do fit into the agile workflow?
Or how do I get, you know, these back end developers who don't know very much about machine learning to approve my pull request, so that I can actually get my stuff integrated into the software? Because ultimately, you know, the machine learning stuff really lives in that sort of development environment. But my sense is that the skills that really need to be taught to data scientists, kind of as we move forward into, you know, the next generation of data products, is really software development skills. So things like testing, configuration management, security, microservice architecture, so that you know how to actually deploy the stuff that you're building. We're also sort of looking forward to a new generation of project managers, product managers, who are really, you know, more savvy about data product construction, so that they can really help mitigate the risk and kind of plan around it and support data science development throughout kind of that model selection triple workflow
[00:53:43] Unknown:
that we talked about. And are there any other aspects of Yellowbrick, or any other things that we touched on today, that you think we should discuss further before we sort of close out the show? I mean, we definitely want to hear from Yellowbrick users.
[00:53:58] Unknown:
I'd say that's the number 1 thing. We have a lot of stories that we can tell about Yellowbrick and how it's used. I think you've heard a lot of them, both from the user side and the development side. It's been a very important project to both Rebecca and I. And we want to make the project more widely used, and to do that we want to incorporate feedback from others, so that we know sort of what features are most meaningful and how we can make the library as general
[00:54:26] Unknown:
as possible. And so with that, I'll have you each add your preferred contact information to the show notes. And I'll thank you both for taking the time to join me today and discuss the work that you're doing with Yellowbrick. It's definitely a very interesting project and 1 that seems like it's providing a lot of value. So thank you for that, and I hope you enjoy the rest of your day. Hey. Thank you so much for having us.
Introduction to Guests and Yellowbrick
Origins and Use Cases of Yellowbrick
Integrating Yellowbrick with Scikit-Learn
Understanding Machine Learning Models with Visualizations
Iterative Model Selection and Visual Steering
Practical Applications and User Stories
Technical Construction and Evolution of Yellowbrick
Challenges and Community Contributions
Limitations and Future Directions
Innovation in Data Science Education and Tooling