Summary
You’ve got a machine learning model trained and running in production, but that’s only half of the battle. Are you certain that it is still serving the predictions that you tested? Are the inputs within the range of tolerance that you designed? Monitoring machine learning products is an essential step of the story so that you know when it needs to be retrained against new data, or parameters need to be adjusted. In this episode Emeli Dral shares the work that she and her team at Evidently are doing to build an open source system for tracking and alerting on the health of your ML products in production. She discusses the ways that model drift can occur, the types of metrics that you need to track, and what to do when the health of your system is suffering. This is an important and complex aspect of the machine learning lifecycle, so give it a listen and then try out Evidently for your own projects.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Emeli Dral about monitoring machine learning models in production with Evidently
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what Evidently is and the story behind it?
- What are the metrics that are useful for determining the performance and health of a machine learning model?
- What are the questions that you are trying to answer with those metrics?
- How does monitoring of machine learning models compare to monitoring of infrastructure or "traditional" software projects?
- What are the failure modes for a model?
- Can you describe the design and implementation of Evidently?
- How has the architecture changed or evolved since you started working on it?
- What categories of model is Evidently designed to work with?
- What are some strategies for making models conducive to monitoring?
- What is involved in monitoring a model on a continuous basis?
- What are some considerations when establishing useful thresholds for metrics to alert on?
- Once an alert has been triggered what is the process for resolving it?
- If the training process takes a long time, how can you mitigate the impact of a model failure until the new/updated version is deployed?
- What are the most interesting, innovative, or unexpected ways that you have seen Evidently used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Evidently?
- When is Evidently the wrong choice?
- What do you have planned for the future of Evidently?
Keep In Touch
- @EmeliDral on Twitter
- emeli-dral on GitHub
Picks
- Tobias
- Emeli
Links
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Emeli Dral about monitoring machine learning models in production with Evidently. So, Emeli, can you start by introducing yourself? First of all, thank you so much for the invitation. I'm really enjoying it. My name is Emeli. I'm CTO and cofounder of Evidently AI. This is the startup where we build a tool to analyze and validate machine learning models in production.
[00:01:19] Unknown:
And prior to it, I worked for 7 years helping companies from different industries, from banking and telecommunications to industries like manufacturing,
[00:01:28] Unknown:
building data products and launching them into production. And do you remember how you first got introduced to Python?
[00:01:35] Unknown:
Yeah, I guess I remember. It happened when I was studying at the Yandex School of Data Analysis. Yandex is something like the Google of Russia. And, yes, at the time I was coding in C++, so I did not really enjoy Python at the time. But later on, I got the idea and the power and the beauty of Python, and, yes, now I'm not complaining about
[00:01:55] Unknown:
dynamic typing and things like this. And yes, now I really enjoy the language. And you mentioned that you're currently running the Evidently AI business and that you helped to found the company. I'm wondering if you can just give a bit of an overview about what it is that you're building there and some of the story behind how it came about and what it is about machine learning monitoring that made you want to focus your time and energy on that? Prior to Evidently, I worked at Yandex Data Factory, which is a Yandex department,
[00:02:23] Unknown:
which helped companies build data products. I mean, there we helped to build prediction services, some recommender systems, and things like this. And it was quite interesting and nice to build the services. And when you move on and start to deploy the service into production, you actually need to guarantee very nice quality and SLAs and things like this. And for many companies, it was a closed perimeter, so we didn't really have access to the services we moved into production. And, yes, models started to degrade, the error increased, the quality sometimes got much worse. And at that time, we started thinking about monitoring because, well, we understood that there are a lot of different things and issues that can happen with a model once we move it into production.
And at that time, we just started thinking about monitoring machine learning models and about some specific metrics and statistics that we should add to make sure that models work really well. And later on, together with Elena, who also worked at Yandex Data Factory for many years, we decided to start our own startup helping manufacturing companies build data products. And, yeah, there we figured out that it's even more important because, well, you optimize pretty complicated processes, and they almost always have this closed perimeter, so we felt this problem ourselves, that we really need to have some tools to monitor and analyze and make sure that everything works.
And after that, we decided to build a tool to monitor machine learning models in open source. In terms of monitoring machine learning,
[00:04:00] Unknown:
there's, I know, a number of different aspects that go into that. And I'm wondering if you can just give a bit of an overview about the types of metrics that you're focusing on when you're monitoring machine learning and how that might compare to sort of traditional application monitoring that people might be familiar with from deploying web applications or things of that nature? First of all, I kind of agree that if we are talking about a data-based service,
[00:04:23] Unknown:
it's a service in the first place, right? So we definitely should start from service health monitoring metrics, like response time, memory usage, CPU usage, and things like this that we are used to in any production service. But together with these metrics, when it comes to data products, there are a lot of things related to data. And I believe there are four most important things that we need to monitor. The first is data quality because, well, everybody knows that if you have garbage in, you have garbage out. So it's important to make sure that the input is correct and everything is within the expected ranges, in the expected formats, and things like this.
The second, I believe, is that we need to make sure that the environment our model communicates with is still the same compared to the data we trained the model on. And here we have data drift and concept drift. Those are the metrics that help us compare the distributions of features, and the distributions of the target (or targets, if we have several), with the data we had during training, and make sure that the model continues to operate in the same environment with the same distributions and dependencies and patterns, so we can continue with the model. It's especially important where we do not have feedback, or we have feedback with a delay, so we cannot calculate the error straightforwardly. In this case, data drift and concept drift should serve as a proxy for error monitoring.
And from this point of view, it's a pretty important metric to monitor. And, yeah, basically, the last one is the performance. In most cases, this is error based. And if we have a feedback loop, if we can get the ground truth data with a small delay, then definitely we need to calculate model errors to make sure that the model does not tend to, for example, overestimate or underestimate our target. Yeah, I believe these are the most important things. And, of course, yes, for each case there are some specifics, some specific metrics that are pretty relevant. For example, if you're talking about health care or social services, there we can monitor things like quality by segments, quality for some specific or underrepresented segments, because, well, it's very important, and we can't just analyze the model's performance in general; we need to make sure that we account for the needs of all segments of our audience.
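To make that kind of distribution comparison concrete, here is a minimal sketch of a per-feature drift check using a two-sample Kolmogorov-Smirnov test from SciPy. It illustrates the general technique rather than Evidently's own implementation; the file names, the numeric-only column selection, and the 0.05 significance level are assumptions for the example.

```python
import pandas as pd
from scipy import stats

# Reference data the model was trained on, and the latest production batch.
# File names are placeholders for the example.
reference = pd.read_csv("training_data.csv")
current = pd.read_csv("latest_batch.csv")

def feature_drifted(ref_col: pd.Series, cur_col: pd.Series, alpha: float = 0.05) -> bool:
    """Flag drift for a single numeric feature with a two-sample KS test."""
    _statistic, p_value = stats.ks_2samp(ref_col.dropna(), cur_col.dropna())
    return p_value < alpha  # a small p-value means the two distributions differ

numeric_features = reference.select_dtypes("number").columns
drifted = [col for col in numeric_features if feature_drifted(reference[col], current[col])]
print(f"{len(drifted)} of {len(numeric_features)} numeric features look drifted: {drifted}")
```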
Or maybe sometimes we need to specifically focus on anomaly detection and outliers, especially when it's a critical service. And we can, for example, send some objects to manual review if we cannot guarantee the high quality of our model for these objects. Or maybe some metrics related to biases, shifts, explanations, or things like this. So, yeah, for some cases we definitely can add a lot of specifics. But technically, we need to have some pragmatic approach and first start thinking about service criticality. And if it's not that critical, then maybe limit ourselves to service health and data quality and data drift monitoring. If the service is more critical, then add more and more depending on the case. Another interesting aspect of monitoring for machine learning models is that, especially if you're using them for predictive purposes, there's some measure of subjectivity
[00:07:39] Unknown:
and built in potential for error, whereas with, you know, calculating the amount of memory used by a web service, it's very objective. There's no question about what the number is going to actually mean. And so I'm wondering how you actually account for determining what the error rate is for a prediction that might not have happened yet. Yeah. This is a very nice question. And honestly, we
[00:08:01] Unknown:
spent a lot of time defining our thresholds and sensitivity and things like this because we have a lot of defaults in our tool, right? So, for example, one can run the report without specifying any metrics and thresholds and it will be calculated as well. So I believe the best way to determine all these thresholds is to analyze historical data and try to connect the model's errors with the failure costs. For example, if your model overpredicts, you spend or lose this amount of money; if it underpredicts, then that amount of money. And try to make sure that you penalize your model, or maybe send an alarm, only if it really makes sense. So I believe this connection between the business costs and model errors is very important. It's not always that easy to create this connection, but in most cases, if an analyst sits together with the manager and they discuss how the model's errors actually influence the business processes, what decisions are made on top of the model's output, and how it actually influences the whole quality metrics, then I believe it's possible.
Yes, we always have historical data, so we can simulate things. We can simulate higher errors, lower errors, things like this, and see how they influence the process and decide on the thresholds. Yeah, I believe that's the way it can be done.
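As a toy illustration of connecting model error to business cost on historical data, the sketch below assigns invented per-unit costs to over- and under-prediction and picks an alert threshold from the historical distribution; the file name, column names, and dollar figures are all hypothetical.

```python
import pandas as pd

# Historical predictions with known outcomes; columns assumed: date, y_true, y_pred.
history = pd.read_csv("historical_predictions.csv", parse_dates=["date"])

# Invented costs: over-prediction costs $2 per unit (excess stock),
# under-prediction costs $5 per unit (lost sales).
error = history["y_pred"] - history["y_true"]
cost = error.clip(lower=0) * 2.0 + (-error).clip(lower=0) * 5.0

# Aggregate to a business-facing weekly cost series instead of a raw error metric.
weekly_cost = cost.groupby(history["date"].dt.to_period("W")).sum()

# Pick an alert threshold from what was historically "normal", e.g. the 95th percentile,
# and see which past weeks would have triggered an alarm.
threshold = weekly_cost.quantile(0.95)
print(weekly_cost[weekly_cost > threshold])
```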
[00:09:22] Unknown:
To your point of bringing in the business users to determine what some of these thresholds are, that also introduces another point of divergence from traditional server monitoring where that was largely the domain of the operations people or maybe developers. And the business users might only care about the, you know, the time that it takes for a page to load. Whereas with machine learning, there is a lot of business logic and specifics to the organization's needs and wants that goes into the creation of that model and the whole reason that the model is being deployed. And so I'm wondering how that influences the types of metrics that you're collecting, the ways that you present that, and just some of the overall design and accessibility of the tool that you're building for being able to monitor these models in production.
[00:10:08] Unknown:
There is a huge difference between the metrics that you just measure and metrics that you first need to design how to create, right? Because, well, for all the statistical tests, we need to select the window for calculation. We need to select some sensitivity and the actual test, right? Because there are a lot of different tests that are suitable, for example, for comparing numeric distributions. So I believe that one should invest into historical data analysis a lot because, well, it's free, right? I mean, if you made some mistakes or came up with an unsuitable window length or something like this, you can always try other variables and parameters until you are satisfied with the results.
So I think that the importance of historical data is a bit underestimated, and we even wrote a blog post about it: how we can decide on the monitoring settings on top of the historical data. And, yes, how I suggest to tackle it is, first, analyze your historical data and select all the events or issues or moments where you wanted to get an alarm, right? So just label your events as potentially dangerous or dangerous, and then create the monitoring in a way that for all these events you would get the alarm. Something like this. I mean, historical data is a very nice tool to analyze the quality of your monitoring schema, make it maybe more strict or less strict, and check if it suits. Because, well, if your monitoring does not catch the things that you would like to catch on the historical data, the same will happen in production. So this is the nice way we can make sure that we do not overload our engineers with the wrong alarms and that we catch the really important things. So I believe the schema should be tested on the historical data. And when it comes to different kinds of metrics, I mean business metrics, engineering metrics, and data science metrics, again, there I suggest being pragmatic and making sure that you have something like rules for how we are going to react on each alarm or each issue.
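A sketch of that backtesting idea: label the historical moments where you would have wanted an alarm, then check how a candidate alerting rule would have scored against those labels. The metrics log, the rule, and the thresholds below are invented for illustration.

```python
import pandas as pd

# Daily monitoring metrics plus a manual label of the days that were real incidents.
# Columns are hypothetical: date, share_drifted_features, mae, incident (0 or 1).
log = pd.read_csv("monitoring_history.csv", parse_dates=["date"])

# Candidate rule: alert if over 30% of features drifted, or MAE rose 50% above a
# baseline computed from the first 30 days. Both thresholds are placeholders to tune.
baseline_mae = log["mae"].iloc[:30].mean()
alerts = (log["share_drifted_features"] > 0.3) | (log["mae"] > 1.5 * baseline_mae)

# Score the rule against the labeled incidents: alarms caught, missed, or raised falsely.
caught = (alerts & (log["incident"] == 1)).sum()
missed = (~alerts & (log["incident"] == 1)).sum()
false_alarms = (alerts & (log["incident"] == 0)).sum()
print(f"caught {caught}, missed {missed}, false alarms {false_alarms}")
```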
[00:12:19] Unknown:
Right? And really create metrics and set up alarms only for things that we know how to deal with or how to react on. Yeah, something like this. And then you mentioned a little bit some of the types of error that might occur. But I'm wondering, what are some of the kinds of failure modes for a model? Where with a web application, you can see, is it up or is it down? Is it returning a 503 error or a 200? But with a model, it might not be so clean-cut, where it might be returning a response, and it might be within the error bounds. And, you know, maybe occasionally it's returning outside of the error bounds that you've defined as your thresholds. And I'm just wondering if you can enumerate some of the types of failure modes and how they manifest and how you can identify them to be able to know when to take some sort of corrective action?
[00:13:07] Unknown:
So first, I believe the most popular error type is when the data is the wrong quality. So, yeah, I believe there might be many different types of errors, but the main 3 types are the following. First are the cases where we have the wrong input data. For example, some physical sensors might be broken when we talk about manufacturing, or maybe some changes in the CRM system, or just some failures because our data pipelines were implemented incorrectly. And in this case, definitely, our model will get something wrong as input data, and the output will also be wrong. So this is why we need to monitor data quality, right, and this is the most common error. In most cases, when something is wrong with the model's output, that's because the input data is wrong. So this is the first type of error I suggest to monitor. It's more of a data engineering thing, but it's very important, and it covers, like, more than half of the issues with models. The second big segment of errors is related to cases where the world has changed. For example, sales are increasing, or the audience of the service is changing, or the world just changed overnight, like when there is a pandemic crisis. A lot of models just became irrelevant in a moment. For these things, you can monitor data drift and concept drift, and, yeah, there are no specific ways to differentiate between, maybe, gradual data drift and more drastic data drift, right? But it makes a lot of sense to analyze these distributions and make sure that we still have more or less the same data points.
It's also not a straightforward moment because, well, when you have a model that operates on top of, like, a hundred or a thousand weak features, and it's really a pretty popular design, then if, out of a few thousand features, 20 or even 50 have drifted, it's sometimes tricky because maybe everything is still okay, maybe not. And this is why, together with the data drift, it makes a lot of sense to analyze model performance, right, and make sure that the quality, that the error rate, is still within the expected ranges. So this is the third type of metrics that makes sense to analyze. And then when it comes to model errors, right, you mentioned there are some cases where an object can have a higher error, maybe others can have a lower error, and it's really complicated to make sure that, yes, this is the moment where our model became irrelevant. But what we definitely can do there is try to separate the data drift from the anomalies and outliers.
Yeah, these are completely different things because an outlier or anomaly is, like, a single event, right? So it might happen for various reasons. Again, maybe some mistakes in the data or maybe some broken sensors or something like this. But in most cases, this is something related to maybe one or a couple or three objects, and it doesn't necessarily mean that there is something wrong with our model generally. So in most cases, we can just solve it on the object level, right, and not change anything in our model. But when it comes to data drift, then we definitely need to think about the model's quality. Maybe we need to perform some calibration, maybe a retraining. Maybe the model should be completely rebuilt using a different training schema, maybe using less historical data to make sure that the model does not train on top of the irrelevant old patterns and accounts for the new changes in the data. And, yes, that's how we can choose a different strategy on top of the results of the analysis.
So, yeah, I would suggest to measure these 3 things: data quality, data and concept drift, and model performance.
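Since bad input data accounts for the majority of production issues in her experience, a minimal data quality gate might look something like the sketch below; the expected columns, ranges, and tolerances are made up for the example.

```python
import pandas as pd

batch = pd.read_csv("latest_batch.csv")  # placeholder file name

# Expected schema and ranges, taken from the training data; values here are invented.
expected_ranges = {"temperature": (-50.0, 60.0), "pressure": (0.0, 300.0)}

problems = []
for column, (low, high) in expected_ranges.items():
    if column not in batch.columns:
        problems.append(f"missing column: {column}")
        continue
    null_share = batch[column].isna().mean()
    out_of_range_share = (~batch[column].between(low, high)).mean()
    if null_share > 0.05:  # tolerance is a placeholder
        problems.append(f"{column}: {null_share:.1%} missing values")
    if out_of_range_share > 0.01:  # tolerance is a placeholder
        problems.append(f"{column}: {out_of_range_share:.1%} values outside [{low}, {high}]")

if problems:
    # In production this would raise an alarm instead of just printing.
    print("data quality check failed:", problems)
```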
[00:16:44] Unknown:
In terms of the tool that you're building, I'm wondering if you can give a bit of an overview about the design and implementation of the Evidently project and some of the ways that the architecture has changed or evolved since you first began working on it. Yeah. Sure. Actually, we are pretty young, so
[00:17:02] Unknown:
this is why I believe we have not yet reached our final architecture. And, yeah, I believe many projects can say the same. But it started from a more monolithic schema; at that time we had only the dashboard. So, yeah, I need to say that our tool helps analysts and engineers build reports related to data quality and model quality. For data quality, we have data drift and target drift, so-called concept drift. And for model quality, we have a lot of reports suitable for different kinds of models, like binary classification, multi-class classification, and regression. From the very beginning, we had only a dashboard, which is an HTML dashboard available inside a Jupyter notebook, which a lot of data scientists like. So this is why it's convenient. And it's also possible to export this HTML report as a separate file to send it over to colleagues or managers or someone who is also interested in the model's quality. And later on, after we spoke to a lot of our potential users who just tried our tool, we figured out that it's quite interesting to see and analyze such reports, but for production usage, we need something more suitable to be integrated with some databases and monitoring tools like Zabbix or Grafana.
And this is why, later, we decided to create the JSON-based profiles. And this was the exact moment where we rethought our architecture and figured out that we need to create more modules and maybe add an extra layer to make sure that we can split the data representation and the, like, visual representation of our reports. So now we have more layers. Basically, we have, like, 3 different layers. First is the analyzers. These are the classes that compute some statistical metrics and quality metrics, and analyze and compare pandas data frames. The second is tabs, or report parts. This is where we combine the results from different analyzers together and create the meaningful parts of reports.
And the last layer is the layer of representation. There we have classes that generate dashboards, I mean HTML dashboards, or JSON profiles. So now it's become more modular, and now it's much easier to change something, to add things, to add new analyzers, or to change the structure of the reports. And, yes, definitely, I wish we had it from the very beginning, but I believe it's very rare that you start from the right architecture. And, yeah, now we're thinking more about adding maybe more layers or some modules to add some scalability.
I mean, right now our tool cannot run in a distributed mode, but it turns out that this is pretty important for many of our users because they have huge datasets. Even if we talk about tabular data, they are still pretty huge. So right now it's not possible to run our tool, for example, on Hadoop or using MapReduce. And now we are thinking about maybe some redesign or adding some functionality to make it possible.
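For orientation, generating the HTML dashboard and the JSON profile she describes looks roughly like the following. The import paths, class names, and method signatures are recalled from early Evidently releases and may not match the current API exactly, so treat this as a sketch rather than documentation.

```python
import pandas as pd

# Import paths and class names as remembered from early Evidently releases;
# check the current documentation before relying on them.
from evidently.dashboard import Dashboard
from evidently.tabs import DataDriftTab
from evidently.model_profile import Profile
from evidently.profile_sections import DataDriftProfileSection

reference = pd.read_csv("training_data.csv")   # data the model was trained on
current = pd.read_csv("latest_batch.csv")      # recent production data

# Visual layer: an interactive HTML dashboard to inspect in Jupyter or share as a file.
# Older releases accepted the tab class itself; newer ones may expect an instance.
dashboard = Dashboard(tabs=[DataDriftTab])
dashboard.calculate(reference, current)
dashboard.save("data_drift_report.html")

# Machine-readable layer: a JSON profile with the same statistics, for pipelines and databases.
profile = Profile(sections=[DataDriftProfileSection])
profile.calculate(reference, current)
drift_json = profile.json()
```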
[00:20:06] Unknown:
In terms of actually running Evidently and integrating it with your deployment system and being able to do continuous monitoring of a machine learning model as it interacts with the real world, I'm wondering if you can talk through some of the setup that's involved and some of the supporting systems that are necessary for being able to collect and analyze those metrics over time? Sure.
[00:20:30] Unknown:
I would probably start from data because it turns out that not every company that uses machine learning actually logs enough data about it. So first, I would suggest to think about the amount of events and data you are logging and make sure that you log all the input data that your model uses as the input features, right, and the model's output. This will be essential just to be able to create and calculate all these metrics, like drift analysis and model quality. So, yeah, this is the most important thing. And, yeah, from the very beginning, it makes sense to invest in this part because, well, it's historical data, and it's impossible to, like, recalculate it back, right? If you haven't collected it before, then there's no chance to, like, redo it later. So, yeah, this is the first thing. And, of course, if you have some ground truth or feedback, collect this as well. So not just the input data and the model's output, but also some feedback from the systems that use the model's output as an input. This would be the first step, and, yeah, make sure that we keep all this data for a long period of time. Our tool now works much better with batch models, just because we are in the development process right now and the tool is report based, right? And when it comes to reports, it's much easier to apply them to batch models.
So there are many ways we can call and create the report using Evidently. First, of course, in a Jupyter Notebook; this is something that data scientists like to do a lot, but it's not really the best way when it comes to monitoring, right? But, yeah, I know that some companies have got Jupyter Notebooks working in production, but it's, I believe, very, very few companies. And then we have the command line interface, which also helps to create the reports. So I believe for building a production-grade system for batch models, it's better to use it this way. Or if you're familiar with some workflow managers like Airflow, and I personally am a huge fan of Airflow now, then you can either use a Python operator and call our reports from Python, or just from Bash using the command line interface. So yeah.
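As one way to schedule the batch setup she describes, an Airflow DAG could regenerate the report on each run; the DAG id, schedule, and the build_monitoring_report helper below are hypothetical placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path


def build_monitoring_report():
    """Hypothetical helper: load the reference data and the latest batch, run the
    Evidently report/profile generation, and write the JSON output somewhere a
    monitoring system can pick it up."""
    ...


with DAG(
    dag_id="ml_model_monitoring",     # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",       # match your data granularity and collection delays
    catchup=False,
) as dag:
    generate_report = PythonOperator(
        task_id="generate_evidently_report",
        python_callable=build_monitoring_report,
    )
```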
First is data. Second, from the command line interface, or together with some workflow manager, schedule the creation of the reports. For example, once a day or once a week or every hour; it depends on the data granularity and delays in data collection, right? It can also be scheduled by some events. For example, when a new bunch of data comes in, then we can start creating the reports. A workflow manager will help a lot to calculate it at the right moment. Then decide what should happen after we create the report, because there are many scenarios which are possible. The first is to just use Evidently to calculate metrics and create output in the JSON format, right? And having this JSON format, we can plug this data into some database, for example Prometheus or a SQL database. Any database is suitable for the project. And then, having this data in a database, it's fairly easy to use some monitoring system, again like Grafana, because it's very popular, or maybe Zabbix, which is also a very nice tool. So this is one way of integration.
Another way, which I also like, is to combine the JSON monitoring with the HTML reports from Evidently, because the HTML reports are interactive. You can analyze each plot. You can zoom in, zoom out, select some objects, analyze segments of low performance. So it's pretty nice to generate such reports when you see that there is something wrong with the data and you need to, like, dig deeper to figure out what has happened. Do we tend to underestimate or overestimate? Because, well, we have the error distribution, some statistical tests, so many things that can help to figure out what is actually happening with the model. So I would suggest to use the JSON output to monitor things and create some triggers.
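One way to wire up the trigger she goes on to describe: read the JSON profile produced by the scheduled job and alert when a drift statistic crosses a threshold. The key path and the 30% threshold are guesses for illustration, not Evidently's actual JSON schema.

```python
import json

# Load the JSON profile produced by the scheduled monitoring job.
with open("latest_profile.json") as f:
    profile = json.load(f)

# The key path below is a placeholder; inspect your own profile to find
# where the drift statistics actually live.
share_drifted = profile["data_drift"]["share_drifted_features"]

if share_drifted > 0.3:  # threshold chosen from historical analysis
    # On alarm: store the metric, notify the on-call analyst, and regenerate the
    # interactive HTML report so they can dig into the details.
    print(f"ALERT: {share_drifted:.0%} of features drifted, sending HTML report")
```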
And if some trigger fires, then generate the HTML report and send it over by mail, for example, to an analyst to help him or her figure out what is happening. So I believe this would be the best architecture. And when it comes to tools, just select the workflow manager that you like. For example, I like Airflow, but there are a bunch of other workflow managers, starting from something as simple as cron or something like this, and schedule the creation of Evidently reports. Then select the database where you want to store all this data related to the quality of the model, and then finish with some monitoring tool like Grafana or Zabbix to make sure that you can visualize all the results, send alarms,
[00:25:20] Unknown:
and send the reports to someone who can fix the model. In terms of actually building your model, one of the things that you mentioned is that, in getting it deployed, most people don't collect enough information about the inputs that are going into it and the types of outputs it's producing to be able to effectively determine what its performance is and whether it's, you know, operating as it's intended to. And I'm wondering if there are any other aspects of the model building or model deployment process that you think can and should be improved to make it easier to monitor the models in production and identify how they're performing or when they might be hitting error cases?
[00:26:00] Unknown:
I think when it comes to the data that we collect, some logging, the information about the input data and output data should be more or less enough. But in case the model uses some external data, for example data about the weather or maybe some news data or something like this, it also makes sense to collect this information. So this is the information about the environment or some external data that you use as an input. And maybe, if the model communicates with a lot of external services, then also some information about how those services process the output of the model. And let me expand this thought because I believe it sounded a bit too general. For example, I worked on a system that helped a factory optimize the usage of raw materials to produce the metal, the steel, and we created a recommender system that recommended the amount of raw material to use for each steel smelting. But sometimes these recommendations were just ignored and not used. So I believe that if the output of your model interacts with some other systems, it's important to make sure that the system actually used the output of your model. Maybe the system somehow post-processed this output, or maybe there were some fallbacks that kicked in and, instead of your model's output, the system you work with used some other system, like just statistics or a first-principles model or something like this. Because for machine learning models, when it comes to production services, very often machine learning models are combined with first-principles models or some statistics to make sure that the whole service works reliably and has some fallbacks in case, for example, we do not have enough data to generate a reliable output with the help of the machine learning model. So in this case, it's about not only logging the input data and output data, but also logging which backend answered.
Was there any post-processing from the next-level system? And what was the, maybe, result or recommendation or final forecast or something like this? Because very often it helps to figure out the whole service quality much better and see some segments where, for example, the machine learning model works worse than first-principles models or vice versa, or maybe detect some segments of low performance where we need to maybe substitute our model with another model, or maybe retrain it for this specific segment, or collect more data, label more data, and, yeah, again retrain the model to make sure that we catch all the patterns related to the specific segment, right? So such things are not that important during general monitoring, but when something goes wrong and we need to figure out what exactly and why, it really helps to analyze and more easily find the reasons and connections between the final business process quality and the model's output.
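A minimal sketch of that kind of logging: record the input features, the model's raw output, which backend actually answered (the ML model or a fallback), and the value the downstream system finally used. The field names and example values are invented.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("prediction_log")


def log_prediction(features: dict, model_output: float, served_by: str, final_value: float) -> None:
    """Write one structured record per request so quality can be analyzed later.

    served_by distinguishes the ML model from fallbacks such as a rule-based or
    first-principles model; final_value is what the downstream system actually used
    after any post-processing or overrides.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "features": features,
        "model_output": model_output,
        "served_by": served_by,
        "final_value": final_value,
    }
    logger.info(json.dumps(record))


# Example: the recommendation was adjusted downstream by an operator.
log_prediction({"scrap_weight": 12.4}, model_output=3.1, served_by="ml_model", final_value=2.8)
```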
[00:28:53] Unknown:
Once you identify that there is an issue with the model, what is the process for actually remediating that problem and getting a new version built and deployed into production? And then specifically for the case where the training process might take a long time and there is some critical issue with how the model is currently operating, what are some steps for being able to remediate those problems
[00:29:15] Unknown:
until you're able to get the new version out? That's really, I believe, a huge issue nowadays because, well, more and more models are starting to work in production to optimize production-critical processes, right? And, yes, it's pretty hard to come up with an ideal scheme for how to retrain these models and make sure that they operate really well. So I believe that it really makes a lot of sense to figure out at the very beginning of the project whether this model is a part of a very critical service or not, because if it is, then I would suggest to invest a lot of time into automation. I mean, first of all, of course, setting up the monitoring system, then writing some code for automatic model retraining, I mean collecting the new dataset on the go and setting up some triggers to detect the moment when we need to retrain our model. And this is actually not that straightforward because, well, if, for example, model quality decreased, right, it might seem like this is the right moment to retrain the model. But if you collected only a small amount of data, for example only a hundred new lines of data, this might make no sense because, well, it can have a very small influence on the model training process, right? So there are a lot of different aspects that should be taken into account for the retraining.
These are the model's performance, the amount of new data you have, whether you have the labels for this data or not, things like this. And I believe if we're talking about a critical service, then we need to make sure that we implemented the data collection system correctly, that we implemented the retraining process, and that we can actually automatically analyze the model's quality, for example using some golden dataset or maybe having some expectations on the model's output distribution or stuff like this. So some metrics with the expected values for them, and a system to switch between the models, but these are more technical issues; I believe software engineers can help with this. Yeah, so for critical services, we need something like this.
But when it really takes a lot of time to retrain the model, sometimes it's not that easy to do it automatically in real time, because in most cases, when it takes a lot of time, it means that the dataset structure and the amount of data we use for training are pretty huge. And sometimes we really need to think before starting this retraining process, because sometimes it makes sense to completely rebuild the model, maybe running some experiments, trying different algorithms, maybe a different preprocessing scheme and different feature engineering, or even some hybrid models where, for example, we combine different algorithms, like maybe linear models and gradient boosting or something like this.
So I would invest the biggest amount of time into setting up the monitoring that helps you figure out why exactly the model's quality decreased. Is this a data issue? Is this just some outliers that came into our system, and maybe an hour later everything will be fine? Or is there still something wrong with the model? For example, some pipeline was broken or something like this. Because when you see not only the decrease in accuracy, or maybe the increase in the error rate, but you understand what might be the reason for it, you can be more flexible in deciding what to do and how to react. So to be able to optimize some critical processes with the help of machine learning, we definitely need to have this solid monitoring schema, maybe some tools to retrain the model, and some fallbacks, because, well, when the model starts to behave somehow weirdly and starts to generate some unreliable outputs, it makes sense to switch to another system, and not another machine learning model. Often it's better for it to be some system that was built on other principles, like maybe some analytical models, maybe a human-based system, right? So some rule-based system, something like this. And these fallbacks are super helpful, especially at the very beginning when you do not know what to expect. Right?
Yeah, I think it would be monitoring, a fallback system, a retraining schema, and maybe some validation schema to make sure that your newly trained model is good enough to be automatically deployed into production.
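Putting those pieces together, a retraining and deployment gate might look roughly like this sketch: retrain only when quality has dropped and enough new labeled data has accumulated, and promote the new model only if it beats the old one on a fixed golden dataset. The thresholds and function signatures are hypothetical.

```python
from sklearn.metrics import mean_absolute_error


def should_retrain(current_mae: float, baseline_mae: float, n_new_labeled_rows: int) -> bool:
    """Hypothetical trigger: quality degraded noticeably AND we have enough new data."""
    return current_mae > 1.2 * baseline_mae and n_new_labeled_rows >= 10_000


def safe_to_deploy(new_model, old_model, golden_X, golden_y) -> bool:
    """Promote the retrained model only if it is at least as good on a fixed golden dataset.

    If this check fails, keep serving the old model (or a non-ML fallback) instead.
    """
    new_mae = mean_absolute_error(golden_y, new_model.predict(golden_X))
    old_mae = mean_absolute_error(golden_y, old_model.predict(golden_X))
    return new_mae <= old_mae
```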
[00:33:44] Unknown:
And then in terms of the alerting based on the monitoring that you're generating from Evidently, I'm wondering what you have seen as some of the strategies for the urgency of the alert, who's responsible for receiving and acting on those alerts, and just some of the process of actually operationalizing and treating these models as a production grade asset and the necessity for having them be able to operate sort of on a 24/7 basis?
[00:34:12] Unknown:
Unfortunately, for now it's really hard to answer this because, first of all, we are an open source tool, and we do not really know about all the cases of usage, right? We cannot really, like, make someone tell us how and when they're using the tool. And the second thing is that our tool now really suits batch models much more, because, well, for real-time calculations, reports are not really the best way to analyze quality. It's much better to be able to create a lot of metrics really fast and push them to some visualization or monitoring tool, right? So it's really hard to say. For now, I can say that there are a lot of data scientists and analysts who, surprisingly, use our tool to run their experiments and compare models with each other. We actually did not expect a lot of such usage because, well, we were more about comparing the reference data with the current data, right, and making sure that the error is not increasing or something like this. But it turns out that, together with this, if you use one model as the reference and another model as the current, you can do a kind of side-by-side comparison.
And it turns out that a lot of data scientists, or at least some of them, are using the tool exactly in this scenario.
[00:35:31] Unknown:
And then another interesting aspect of this problem space is the categories of models that you're able to generate these metrics for and monitor for in terms of things like a more statistical style model where you're building something maybe with scikit learn or something of that nature versus a deep learning model and what you see as being some of the category of metrics or the overall complexity that's involved in monitoring those sort of different categories of models and where you're at currently with Evidently and whether you're looking
[00:36:06] Unknown:
to support just one of those categories or bridge across those styles of model. Currently, we have a pretty simple integration. I mean, to use Evidently to create the reports or the JSON profile, you only need to come up with a CSV file with the features, and sometimes the target and prediction if you want to build some performance report. And when you use a Python notebook, or Python, then it needs to be read in Python and converted to a pandas data frame, right? And from this point of view, it's not important which tool you used to train this model. We just need to be able to generate this CSV file or build this data frame. And if you can do this, then you can run Evidently. The only small thing there is the size of the data frame or CSV file. If it's a CSV file, then in the command line interface we kind of recently released the sampling feature. So we can choose the sampling strategy; it can be some strategy with randomization, and we can select which part of the data in the CSV file we need to use to calculate these reports. And for huge datasets, it actually makes a lot of sense because, well, from a statistical point of view, the results will be more or less the same, right? I mean, the statistical tests will not change that much if you, like, use half of your data, if you already have enough to compare distributions.
Yes, it's like this. If you have tabular data, with any tool or library used to train the model, then you can use Evidently. We have some limitations in terms of the use case or problem statement because now it works only for supervised learning. So we cannot do a lot with unsupervised models, like maybe user segmentation or outlier detection or something like this. But, yeah, for maybe segmentation or some unsupervised models, someone can still try to run target drift reports to maybe compare the results of different models or something like this. But in most cases, yes, it's for supervised models, and it's either for classification, binary, multi-class, probabilistic, or just labels, and for regression. We do not have any specific reports now for ranking problems or for time series analysis.
But for time series, it's possible to run the regression performance report. It covers really a mass of metrics and most quality issues, but we do not have specific time series things yet. But we are going to add some. And for now, our tool works for tabular data, but we are going to expand it to different kinds of data later on. So it will be text and, later, pictures. But before we move in this direction, we need to solve our issues with scaling because, for tabular data, there are just more cases where your data has smaller sizes, but when it comes to pictures and text, in most cases it's huge datasets that need to be processed in a distributed way.
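If you are calling Evidently from Python rather than the CLI, the same sampling idea is easy to approximate in pandas by comparing a random sample of each dataset instead of the full frames; the 100,000-row cap below is an arbitrary choice, not a recommendation from the project.

```python
import pandas as pd

SAMPLE_ROWS = 100_000  # arbitrary cap; enough rows for stable distribution comparisons


def sample_for_report(df: pd.DataFrame, n: int = SAMPLE_ROWS, seed: int = 42) -> pd.DataFrame:
    """Return a reproducible random sample, or the whole frame if it is already small."""
    return df if len(df) <= n else df.sample(n=n, random_state=seed)


reference_sample = sample_for_report(pd.read_csv("training_data.csv"))
current_sample = sample_for_report(pd.read_csv("latest_batch.csv"))
# Pass the samples, not the full frames, to the report generation step.
```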
[00:39:10] Unknown:
And, yeah, there we need to update our package quite a lot. In terms of the business aspects of what you're doing at Evidently, you mentioned that sort of the core project right now is this open source framework for being able to analyze and report on these model metrics. I'm wondering what your overall sort of business strategy is in terms of being able to build on top of that library or some of the additional capabilities that you're looking to provide. I have several hypotheses there. We still don't know which 1 will be the best
[00:39:43] Unknown:
one, but we talk to our users a lot, and this really helps us to develop some view and some vision about it. And the first thing that we figured out is that we have a lot of users who are actually interested in adding some, let's say, business-user features, like changing the colors of the reports, adding a logo, single sign-on, and some specific business metrics. So our first hypothesis is that these features will be paid, but all the technical features related to the quality assessments will be open source. We really want to create a tool that will be useful and helpful as open source, so open source is not just a small part of some huge solution; it's more of the core resource. So I believe that the most important things will always be free and open, but additional features for business users might be paid.
And then I have a second hypothesis, which is related to deployment. We figured out that using the tool as a Python library or as a command line tool is not the best way for some, again, business users or some companies who do not want to invest a lot of effort into growing their own internal data science team. So sometimes they prefer to outsource as much as they can. And for these companies, we are thinking about building a cloud solution where they just need to load their data, and we will create some services that will handle the analysis, the monitoring, and the alarms. So, yeah, it will be a service-based solution with an Evidently client side. In terms of the
[00:41:21] Unknown:
ways that the Evidently project is being used, I'm wondering what you have seen as some of the most interesting or innovative or unexpected applications of it. Yeah. Recently, we created a Discord channel, and, yeah, now a lot of users sometimes
[00:41:35] Unknown:
write us letters, and the most unexpected thing to me was when a user just wrote that he wanted to use our tool to create a blog because he really likes the visuals, and he asked about some specific pictures. I was really surprised that a monitoring tool might be used as a way to create visuals for a blog. The most unexpected thing to me. And, yeah, I believe my cofounder and CEO, Elena, would answer differently, but, yeah, I believe this was unexpected. The most creative way is the side-by-side model comparison, because we even created a tutorial on that after we figured out that our tool can be used in this way.
[00:42:16] Unknown:
And in your experience of building the Evidently project and building a business around it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:42:26] Unknown:
This is a tough question because we are still learning; I believe you learn every day. But the toughest lesson for me was building open source, because this is the first project I've built in open source, and we were kind of used to being able to first build something on our own and test it, let's say, make sure that we like everything, and only after that publish it to the whole community and hear their feedback. But when you build something open source, you, like, share it from the very beginning. And, yeah, at the very beginning, it was tough to show something that was far from a final product, with a lot of issues and mistakes and not all scenarios covered.
We still have some issues, I guess, with generating reports on Windows because, well, we don't have a big engineering team; it's literally me together with one engineer helping us out. But, yeah, there are a lot of things that should be implemented and fixed. And at the very beginning, I was thinking that we would have a lot of problems and complaints from the community regarding the product's quality. But it turns out that it actually works the other way around, because there are a lot of people who are thinking about the same issues, who were thinking about how to monitor their production models and who were looking for some open source solutions.
And they are super open and happy to share their thoughts about what to monitor, how to monitor, which metrics are most important, and which are less. And it's just very, very nice when you can think about a problem that seems very important to you together with someone who actually shares the same vision, and you can really benefit a lot from the whole community's knowledge. Because, well, I worked on some cases, for example in manufacturing, other people worked in banking, some worked in telecommunications, and to combine all the issues and things that can happen and integrate all this experience and all this knowledge into one tool, I believe it's just super amazing. And this is the most important lesson: instead of maybe being afraid of showing some imperfect product, it makes sense to show it earlier, get more feedback, and build something better and faster.
[00:44:39] Unknown:
For people who are looking to be able to understand the performance of their models in production, what are some of the cases where Evidently might be the wrong choice, at least currently?
[00:44:49] Unknown:
Yeah, there are actually a lot, unfortunately. But, yeah, I would start from the not-that-production-critical cases. So if it's just some pet project or maybe not that critical a process, then I would suggest to go for service health monitoring and just 1 or 2 metrics, without all the statistical analysis and many, many things inside, because, well, I believe it would be overkill for not super critical cases. Again, for experiments, for models that are used more for fun, let's say, I would limit the amount of metrics because, in Evidently, we have a lot of statistical tests and a lot of visualizations, and it can just overload the engineers and system administrators who need to react to these alarms. So I would, like, limit the amount of output for non-critical cases. Another thing is the different data types because, well, right now we are just working with tabular data. Later on, it will be possible to use Evidently for text and pictures, but currently it's not the best choice, of course, for these data types.
And the second is high-load systems. Of course, we are not optimized for this, but this is definitely something that we are working on, and we will invest a lot of effort a bit later, after we fix all the
[00:46:13] Unknown:
current issues. So, yeah, if there is a high load system, I would wait, like, for a couple, then come back and test us again. In terms of your plans for the near to medium term, you mentioned a few features that you're working on building. But what are some of the projects that you're particularly interested in digging into?
[00:46:30] Unknown:
Yeah, there are a lot, but the first I need to mention is that now we have a lot of nice defaults and a predefined report structure, and for now it's kind of complicated to change something, for example to add metrics or customize the report structure, and this is something that users actually complain a lot about. So our first step will be to create a very intuitive and simple way to configure the report, maybe using some JSON or YAML configs or something like this. So we are going to think about it and maybe even create some way for users to write some Python code, I mean some specific metrics computed in some specific ways, or maybe metrics related to the domain areas where they're working, maybe to add some business metrics to reports. So first, we are going to invest a lot of effort into customization.
We believe that having nice defaults is very important for users who are just starting out with the monitoring process and for those users who do not really want to invest a lot of time into this customization, but we definitely have a lot of users who want to customize their monitoring process and reports, and for them, yes, we are going to add a lot of features to help with this. Second, we'll be adding more statistical metrics and tests, because we have a pretty huge backlog and plans about it, and this is our, like, strength, I mean the metrics, statistics, and the analysis of model performance, and we are going to integrate all this knowledge into our reports.
And we are going to increase the number of reports. Our next one, probably, will be a report that analyzes the dataset without the model, without target functions and model output, to generate some feedback on data quality, on potential issues, problems, and things like this. Yeah, so this is if we speak about the algorithms and statistics, and we have a bunch of technical problems to solve, too; it's scaling and distribution.
[00:48:25] Unknown:
And at some point, we are going to build a service. We believe that it will happen soon. And, yes, this is a big step for us. For anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose the most recent Suicide Squad movie. I just watched that over the weekend, and it was bizarre and a little over the top, but very entertaining. So worth a shot if you're interested in any of the sort of superhero genre.
[00:48:55] Unknown:
It was good fun. And so with that, I'll pass it to you, Emeli. Do you have any picks this week? Actually, I'm impressed. But, yeah, for this week, especially for the weekend, good question. Oh, actually, I have a lot of very nice lectures and videos related to how to configure and set up crazy complicated workflows in Airflow, because I want to create some very nice tutorials for our users. So, yeah, my pick for this week is Airflow, the workflow manager, and a lot of tutorials around it. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on helping people understand
[00:49:35] Unknown:
and address the issues that they will run into when they put their models into production. It's definitely a very interesting problem space and one that is still too little understood. So thank you for all the time and energy that you're putting into helping people address those situations. So thank you again, and I hope you enjoy the rest of your day. Thank you so much. I really enjoyed the discussion, and I hope
[00:49:57] Unknown:
someone will find it maybe useful. It might be just interesting because this is a hot topic, I believe. Nowadays, a lot of models move into production, so it's just interesting to hear what people think about it. Right? And thank you so much for running the discussion. It was very nice. Thank you for your time and for your energy.
[00:50:17] Unknown:
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Emeli Dral's Background and Evidently AI
Challenges in Machine Learning Monitoring
Key Metrics for Monitoring Machine Learning Models
Determining Error Rates and Thresholds
Business Logic and Metrics Design
Types of Model Failures
Evidently AI: Design and Implementation
Integrating Evidently AI with Deployment Systems
Improving Model Building and Deployment
Remediating Model Issues
Operationalizing Model Monitoring
Categories of Models and Metrics
Business Strategy and Future Plans
When Evidently AI Might Not Be Suitable
Future Projects and Customization
Closing Remarks and Picks