Preamble
This is a cross-over episode from our new show The Machine Learning Podcast, the show about going from idea to production with machine learning.
Summary
Machine learning has the potential to transform industries and revolutionize business capabilities, but only if the models are reliable and robust. Because of the fundamental probabilistic nature of machine learning techniques it can be challenging to test and validate the generated models. The team at Deepchecks understands the widespread need to easily and repeatably check and verify the outputs of machine learning models and the complexity involved in making it a reality. In this episode Shir Chorev and Philip Tannor explain how they are addressing the problem with their open source deepchecks library and how you can start using it today to build trust in your machine learning applications.
Announcements
- Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery.
- Do you wish you could use artificial intelligence to drive your business the way Big Tech does, but don’t have a money printer? Graft is a cloud-native platform that aims to make the AI of the 1% accessible to the 99%. Wield the most advanced techniques for unlocking the value of data, including text, images, video, audio, and graphs. No machine learning skills required, no team to hire, and no infrastructure to build or maintain. For more information on Graft or to schedule a demo, visit themachinelearningpodcast.com/graft today and tell them Tobias sent you.
- Predibase is a low-code ML platform without low-code limits. Built on top of our open source foundations of Ludwig and Horovod, our platform allows you to train state-of-the-art ML and deep learning models on your datasets at scale. Our platform works on text, images, tabular, audio and multi-modal data using our novel compositional model architecture. We allow users to operationalize models on top of the modern data stack, through REST and PQL – an extension of SQL that puts predictive power in the hands of data practitioners. Go to themachinelearningpodcast.com/predibase today to learn more and try it out!
- Data powers machine learning, but poor data quality is the largest impediment to effective ML today. Galileo is a collaborative data bench for data scientists building Natural Language Processing (NLP) models to programmatically inspect, fix and track their data across the ML workflow (pre-training, post-training and post-production) – no more Excel sheets or ad-hoc Python scripts. Get meaningful gains in your model performance fast, dramatically reduce data labeling and procurement costs, while seeing 10x faster ML iterations. Galileo is offering listeners a free 30 day trial and a 30% discount on the product thereafter. This offer is available until Aug 31, so go to themachinelearningpodcast.com/galileo and request a demo today!
- Your host is Tobias Macey and today I’m interviewing Shir Chorev and Philip Tannor about Deepchecks, a Python package for comprehensively validating your machine learning models and data with minimal effort.
Interview
- Introduction
- How did you get involved in machine learning?
- Can you describe what Deepchecks is and the story behind it?
- Who is the target audience for the project?
- What are the biggest challenges that these users face in bringing ML models from concept to production and how does Deepchecks address those problems?
- In the absence of Deepchecks how are practitioners solving the problems of model validation and comparison across iterations?
- What are some of the other tools in this ecosystem and what are the differentiating features of Deepchecks?
- What are some examples of the kinds of tests that are useful for understanding the "correctness" of models?
- What are the methods by which ML engineers/data scientists/domain experts can define what "correctness" means in a given model or subject area?
- In software engineering the categories of tests are tiered as unit -> integration -> end-to-end. What are the relevant categories of tests that need to be built for validating the behavior of machine learning models?
- How do model monitoring utilities overlap with the kinds of tests that you are building with deepchecks?
- Can you describe how the Deepchecks package is implemented?
- How have the design and goals of the project changed or evolved from when you started working on it?
- What are the assumptions that you have built up from your own experiences that have been challenged by your early users and design partners?
- Can you describe the workflow for an individual or team using Deepchecks as part of their model training and deployment lifecycle?
- Test engineering is a deep discipline in its own right. How have you approached the user experience and API design to reduce the overhead for ML practitioners to adopt good practices?
- What are the interfaces available for creating reusable tests and composing test suites together?
- What are the additional services/capabilities that you are providing in your commercial offering?
- How are you managing the governance and sustainability of the OSS project and balancing that against the needs/priorities of the business?
- What are the most interesting, innovative, or unexpected ways that you have seen Deepchecks used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Deepchecks?
- When is Deepchecks the wrong choice?
- What do you have planned for the future of Deepchecks?
Contact Info
- Shir
- Philip
- @philiptannor on Twitter
Parting Question
- From your perspective, what is the biggest barrier to adoption of machine learning today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from Hitman’s Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, and dedicated CPU and GPU instances. And now you can launch a managed MySQL, Postgres, or MongoDB database cluster in minutes to keep your critical data safe with automated backups and failover.
Go to pythonpodcast.com/linode today to get a $100 credit to try out their new database service, and don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey, and this month, I'm running a series about Python's use in machine learning. If you enjoy this episode, you can explore further on my new show, The Machine Learning Podcast, which helps you go from idea to production with machine learning. To find out more, you can go to themachinelearningpodcast.com.
[00:01:14] Unknown:
Your host is Tobias Macey, and today I'm interviewing Shir Chorev and Philip Tannor about Deepchecks, a Python package for comprehensively validating your machine learning models and data with minimal effort. So, Philip, can you start by introducing yourself? My name is Philip Tannor. Born in the US, but I lived most of my life here in Israel. I got a lot of my professional background actually from the Talpiot program. It's an excellence program for technological leadership in the defense system in Israel. I kinda started off with operations research. It's kind of like, you know, consulting, you know, BCG, McKinsey type things within the military. And then at some point, before, like, I ever heard about machine learning, there was this kind of competitive Kaggle-like group that presented a project to me of, like, using random forest for something, like, incredible.
And I was, like, knocked off my chair. I said I have to move over to machine learning. I'm, you know, I'm not doing anything else aside from this, and I was, you know, a big fan of machine learning ever since. So I started off in the other group, you know, kind of starting off as a researcher and then at some point moving over to manage 1 of a few groups that were there. And Shir, how about yourself? I'm Shir. So I also started my army career and also a lot of my deeper dive into tech with the Talpiot program, where Philip and I were together in the same year group. I started initially in low level cyber research in Unit 8200
[00:02:34] Unknown:
and later on transitioned into machine learning where I worked on various anti terror related use cases, which was cool. And aside from being the CTO of Deepchecks, I, in the little spare time, love also doing various types of sports and riding my electric unicycle with a helmet on to stay safe.
[00:02:53] Unknown:
You've both mentioned how you got into machine learning. I'm wondering if you can talk through what it is about the machine learning space that led you to building the Deepchecks project and some of the story behind it and sort of what the main focuses and capabilities are that you're offering with it?
[00:03:09] Unknown:
So what was really interesting about the roles I had in the, like, machine learning research group I mentioned was that we got to see a lot of projects and we got to, like, deal with the interesting parts while collaborating with, like, many different parts within the, let's call it, the wider kind of defense ecosystem, which was pretty much ahead of most of the civilian world. I know it's counterintuitive, but there were a lot of things that were pretty new there. So I really felt like a potential customer for what I'm doing now. We just couldn't find any solutions that were kind of making sure that machine learning models are behaving as expected. And, you know, there wasn't really anything out there and that's a big deal because, you know, we kind of felt that what we're seeing
[00:03:47] Unknown:
in our domain was probably what's gonna happen in a lot of different companies in a few years. So at least for me, it was a very clear decision that it was interesting to work on and, you know, Shir had a similar experience, and we did a lot of ideation together once we had that, like, initial seed. A lot of it from my side. 1 thing I didn't note before is that the reason I, or 1 of the main reasons I initially transitioned to machine learning, was that Philip approached me and said, okay. You love research and you love algorithms, and look at what a cool space is kind of building up here. And really a few months later, I found myself kind of starting to dive into the field. And I can say that also, well, we were in different places, but tackling challenging problems with new, exciting tools. And 1 of the challenges that we faced is that I felt that while the algorithms are developing and really kind of starting to prove themselves, many of the challenges are how do you both convince yourself and also the various users that you're now introducing these new tools to, that they are actually working properly and that they're going to continue working properly over time. When you try to introduce a new technology and make people adopt it, that was a major factor as a challenge and also in the establishment of Deepchecks.
[00:04:55] Unknown:
And I know that 1 of the major issues in the machine learning space is the question of explainability beyond just being able to validate that the model does what you want it to do. And I'm familiar with projects such as SHAP, which focuses on Shapley values to be able to generate some explainability metrics for you to be able to kind of peek inside the black box of what the model is doing. I'm curious if you can talk to some of the ways that Deepchecks is maybe treading the same ground and some of the ways that building unit tests around these probabilistic models is a different or kind of a step beyond just the explainability question.
[00:05:35] Unknown:
I would call this kind of space that we're dealing with continuous validation. You're trying to make sure the machine learning model is doing what's expected, but also having control over the machine learning throughout, like, a whole bunch of different phases. So that includes, you know, training machine learning models, CICD, monitoring, auditing. And I think explainability is definitely 1 component, and a very important component, of many of these aspects. So I think explainability in general, I'd say, has, like, 2 different aspects. There may be more, but the 2 that I would address are: 1 of them is to help the developers or the, you know, machine learning practitioners make a better model or make it more robust and so forth. And then the other is for, like, a customer and end user, or, you know, a 3rd party, to understand why a certain prediction was made, like, looking back. So I would say all these are, you know, a part of the continuous validation. It's just usually when you have needs in explainability, that's not the only thing you need. You might need some other things that have to do with just, you know, performance analysis, performance monitoring, some of the other things we're working on. The testing part of Deepchecks, I mean, I know that's, like, you know, most of what we're known for and part of the open source, but for us, you know, as a company, it's only part of our vision and the space that we're operating in. And, yeah, explainability, I think, you know, there's a lot of work being done in the academic world. I think it's a great kind of mutual discussion between companies like ourselves and the academic research that is being done. And in terms of the audience that you're targeting, both with the open source
[00:07:00] Unknown:
Python package and the business that you're layering on top of that, I'm curious if you can just talk to some of the ways that you think about who that's serving and the problems that they have and some of the ways that they might be addressing it now that Deepchecks can either augment or potentially replace in terms of their sort of tool chain and workflow?
[00:07:19] Unknown:
So for the open source, it's really focused on machine learning practitioners, as we sometimes call them. We mean, basically, anyone doing ML, many times called data scientists. It's really kind of intertwined in the workflow, from training your model to deployment
[00:07:35] Unknown:
for overall monitoring and verifying your model over time and also seeing it in different aspects. We feel that that is something that is broader, and there are other stakeholders in the organization that want to ensure that everything is working properly. I'd say that that's part of the challenge of machine learning models in production, not solely what you would see in the open source package, but it's kind of like a no man's land today. I don't think it will stay like that. But, you know, you have, you know, a machine learning engineer in some cases, not always a data scientist, data engineer, and IT, DevOps. Sometimes there's, you know, business analytics. A lot of times, the problem, like, everyone suffers from it, but there's not, like, a clear division of responsibility and authority, and even when there is, the person responsible can't always get all the work done just from, like, the vision of where the data is and the tools and so forth.
[00:08:20] Unknown:
For people who are trying to figure out how do I actually make sure that my model is doing what it's supposed to do and that it keeps doing what it's supposed to do, what are some of the tools and workflows that they've built up in the absence of Deepchecks?
[00:08:34] Unknown:
I think we really have to separate the different phases, but let's start off by talking about testing a bit. So within testing, I think there are basically 2 different types of workflows. 1 of them is creating a lot of in-house kind of checks and tools which are typically kind of redone for every group of similar projects. Pretty time consuming and also, you know, not always as thorough. Some organizations do a very thorough job. You know, then there are other places where it's just not really done. So, like, they know that there's something wrong with the mechanism, and they're happy to have a package like Deepchecks helping. For the machine learning monitoring part of it, I think it's really use case dependent. It's a huge difference if it's a high risk model and what the implications would be of all of a sudden having, like, you know, much lower accuracy or things like that. I think you'll find that in cases where there is high risk, no matter what the cost is, they always had some sort of monitoring in place. And then once they have monitoring in place for the high risk models, they'll also implement the same solutions for the other models. And if from the beginning everything's low risk, then there's a higher variance, and it depends much more on the culture of the specific personas and maybe where they came from before that. A bit like in software, it's the mistakes that really matter at the end. I think it's just that in machine learning, the methodology is much less developed. So it's kind of now gradually evolving.
[00:09:45] Unknown:
As far as the testing aspect of it, there are a lot of resources available for people who are working in the software engineering space about how to think about testing, how to structure their code to make it more testable, and that's definitely valuable. But I know that when you move into machine learning, a lot of those ideas and lessons aren't directly applicable. So you can definitely apply those to the actual code that you're writing to be able to build the models, being able to say, you know, make sure that you write small functions that are composable, that have, you know, deterministic inputs and outputs so that you can test the function as a unit and then build that up into integration tests to be able to see how it behaves when it's connected up to other pieces. But I know that once you start throwing data and probability into the mix, then there's no way to guarantee to say, I have this function. I'm going to, you know, put in the numbers 4 and 5. I'm going to get out 9 because I'm doing an addition. When you have a machine learning model, you say, I'm going to put in these 2 images, and maybe it tells me there's a cat.
But, you know, depending on how the model has been trained, what other, you know, data or inputs it's processing, particularly when you get into the space of deep learning, there's no way to guarantee that you're going to get a specific value. You have to say, I'm going to accept this range of values, and maybe that range shifts up and down over time. Or, you know, you wanna maintain a certain confidence interval and you wanna maybe ratchet that down. I'm curious how you think about the overall space of testing for machine learning models and how it is distinct from the testing that people who have worked in software engineering might be familiar with.
[00:11:25] Unknown:
I think you covered lots of the major differences between these 2 worlds. And, really, unlike in software, where we're kind of looking at components and gradually zooming out and trying to test the system as a whole, in machine learning, it's a bit more throughout the workflow. So initially you have the data, before it even starts its processing. And that's the first step: you wanna check it and make sure that everything is okay. So that's why we look at the phases. So it's first the data, then you have the processing and the splitting. So you have, like, a train test split, and you wanna make sure that the test actually represents what you think it does. And, obviously, when you have a model, you want to make sure both that the model is trained as you think it is and also that it represents what you want. So, instead of, like, looking from bottom up throughout the system, we look throughout the workflow. So, I'd say that's, like, 1 difference, and also examining the different components along the way. The types of tests themselves are very much affected by that because instead of doing things like coverage testing, where we really want to check all the branches in software, which is something that we can't really do in machine learning, we really wanna look, for example, at the worst case scenario and see what is actually happening there. And is it reasonable? Or trying to segment everything and see if there are specific areas which behave in a weirder manner. Or in general, trying to get, like, a statistical analysis of the space and distributions and see if there are any outliers. So the way we actually look at it is, let's say, probably a bit more algorithmic and statistically based, like the space itself. I know this goes without saying, but I mean, since it's open source, it's really cool. You can just, like, run a few lines of code and see everything Shir just talked about now as, like, an output in a document real quick.
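To make that phase-by-phase idea concrete, here is a minimal sketch using the deepchecks tabular API. The suite names (`data_integrity`, `train_test_validation`, `model_evaluation`) follow the releases available around the time of this conversation and may differ in other versions; the dataframe, label column, and model below are toy placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from deepchecks.tabular import Dataset
from deepchecks.tabular.suites import data_integrity, train_test_validation, model_evaluation

# Toy stand-in data; in practice this is your own dataframe and label column.
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])
df["target"] = y
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)

train_ds = Dataset(train_df, label="target")
test_ds = Dataset(test_df, label="target")
model = RandomForestClassifier(random_state=0).fit(
    train_df.drop(columns="target"), train_df["target"]
)

# Phase 1: the raw data, before any modeling (duplicates, mixed types, and so on).
data_integrity().run(train_ds).save_as_html("integrity_report.html")

# Phase 2: the split. Does the test set represent what you think it does?
train_test_validation().run(train_dataset=train_ds, test_dataset=test_ds).save_as_html("split_report.html")

# Phase 3: the trained model: overall performance, weak segments, comparison to baselines.
model_evaluation().run(train_dataset=train_ds, test_dataset=test_ds, model=model).save_as_html("model_report.html")
```

Each `run` call produces the kind of "output in a document" Philip mentions; in a notebook you can call `.show()` on the result instead of saving HTML.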
[00:13:02] Unknown:
Another interesting aspect of testing is that in sort of traditional software engineering, there's this idea of test driven development where you write your test first to encapsulate the functionality that you're trying to implement, watch it fail, build the logic that makes it pass, and then keep iterating on that. And I'm curious if there is any synonymous approach in building machine learning models and managing experimentation to say, you know, maybe I want to ensure that I'm getting outputs of this type that are within this specified range, and I'm going to build the model, train it, see what the outputs are, see, okay, well, maybe this failed, and then figure out why it failed or it passed. And so now I'm going to go through another iteration cycle of experimenting in a slightly different direction and just, like, how that idea of doing the validation in the process of writing the logic plays out as you're building these machine learning models?
[00:13:59] Unknown:
The main analog in my perspective is looking at the metrics and seeing, first of all, what is the first initial baseline we can achieve? And then that's kind of the 0 assumption. And then does our model improve it? And then what is the minimum we need to achieve in order for this model to be usable? I don't think it's the same process, but I'd say that's the way to approach it, looking kind of from the end. I want this test not to fail. So I want to achieve the minimal metric I need in order to deploy this and then look back. I think in test driven development, what's interesting is when you speak to, like, senior developers or developers that are, you know, very into what they're working on, everyone knows the term. When you go into teams and say, do you develop with, you know, TDD?
[00:14:36] Unknown:
It's not that common to find a team that says this is kinda how we're working. Now in machine learning, I think it's more important actually. It's the bigger part of the process. And you'll see actually machine learning teams, like the successful teams, you know, or at least a large portion of successful teams, you'll see that they don't use that terminology, but they do work like that. You'll see that everyone will sit down together and define the business KPIs and define, you know, which different criteria have to pass for the project to be considered a success. And a lot of these teams have kind of tried working on projects without doing that first and seen, you know, a lot of problems come up. You could be working for, like, you know, 6 months and get to the point where it's a project that then the business is like, okay. I'm not sure we would have launched the project if we knew these would be the results.
So I would say the terminology isn't there, but the ideas definitely are, you know, with some variations like Shir mentioned.
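As a rough illustration of that idea of agreeing on the success criteria first and then checking the trained model against them, here is a sketch in plain scikit-learn (not the Deepchecks API); the threshold value and names are invented for the example.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# 1. Agree on the business KPI and the minimal bar before any modeling starts.
MIN_AUC_TO_DEPLOY = 0.80  # hypothetical number agreed with the stakeholders

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. Establish the "zero assumption" baseline.
baseline = DummyClassifier(strategy="prior").fit(X_train, y_train)
baseline_auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])

# 3. Train the candidate model and check it against both gates.
model = GradientBoostingClassifier().fit(X_train, y_train)
model_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

assert model_auc > baseline_auc, "Model does not beat the trivial baseline"
assert model_auc >= MIN_AUC_TO_DEPLOY, f"AUC {model_auc:.3f} is below the deployment bar"
```

The same asserts can serve as the go/no-go signal in a CI step, which is the "test driven machine learning" flavor Philip describes later in the conversation.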
[00:15:25] Unknown:
Another interesting aspect of validation in the machine learning space is the question of guarding against bias in the ways that the model is interpreting your inputs and the logic that you're building in. And I'm curious if you've come across any kind of best practices or strategies for being able to write some of these validations to provide maybe counterfactuals to say, you know, I'm going to feed in this input. I expect this type of output, but I'm also going to guard against, you know, potential bias. So, you know, some of the notable issues are in computer vision where maybe a model is trained only on people of a particular ethnicity, and so then it gets presented with a different type of imagery than what it was trained on, and it doesn't know how to handle it, or mishandles it completely. Or in the case of the kind of infamous Twitter bot that Microsoft released, they didn't really expect that, oh, the Internet is full of, you know, bigots and, you know, kind of bad behavior, and so they didn't put in guardrails to make sure that it didn't go too far down that, you know, avenue. And so I'm curious what you've seen as far as ways that teams might think about building in some of those validations as part of their kind of general test suite for any machine learning model to be able to catch or report on some of these ways that bias might be exhibited?
[00:16:48] Unknown:
Bias is a really important question. And I think to some people, it feels like bias is just, like, another, you know, interesting kind of lecture type thing. And, you know, for some other people, some other populations, sometimes, you know, these types of things, even without you knowing about them, really have a drastic effect on your life, and the more machine learning is gonna be deployed, the more you might feel it. But from a, like, engineering or professional point of view, I think I wanna separate, you know, 2 different use cases. 1 of them is structured data, the other is unstructured data. When you have structured data, then, you know, there are some challenges. You don't wanna just look at, I don't know, gender or, you know, a zip code or something. You gotta compare it to whatever you don't want it to be correlated to. You also want to look for hidden features. You know, that's all kind of part of a process that, you know, takes enough manual work, and it is kind of clear how to do it, at least for a static dataset. You know, there are challenges in getting it done also kind of continuously over time. But for unstructured data, I think the answer is pretty different. So if you have a language model that's been trained on, you know, problematic corpuses or corpuses with some sort of bias, you might not know about it for a very long time. And these are huge challenges, and it's the same thing with images. The models can contain biases, you know, because they're trained on these huge datasets that don't even, like, fit in the RAM of the machine that's training them. It's really a challenge, and I think you definitely don't want it to cause some of these effects. And sometimes, from the simple examples and, like, all these tutorials, it seems like you just do x and y and z, but I think it's really an ongoing challenge. There is definitely the academic challenge in improving the algorithms and finding better ways to do it, but there's also the challenge of implementing this in the real systems. And the more it's critical, I wouldn't say for your business, but let's call it for, like, the ethical or kind of responsible parts of the business that are associated with the models you're deploying, the importance of being good at it just increases, like, not just towards, you know, your revenue goal, but also towards, you know, the kind of corporate responsibility.
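For the structured-data case Philip describes, comparing model behavior across a protected attribute and hunting for hidden proxy features, a minimal hand-rolled sketch might look like the following (plain pandas and scikit-learn, not a Deepchecks check; the column names, toy values, and the 0.05 gap threshold are all illustrative assumptions).

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Assumed inputs: a dataframe with a protected attribute, true labels,
# and the model's predictions already attached.
df = pd.DataFrame({
    "gender":     ["f", "m", "f", "m", "f", "m", "f", "m"],
    "height_cm":  [165, 180, 160, 178, 170, 182, 168, 175],
    "label":      [1, 0, 1, 1, 0, 0, 1, 0],
    "prediction": [1, 0, 0, 1, 0, 1, 1, 0],
})

# 1. Compare performance across groups of the protected attribute.
per_group = df.groupby("gender").apply(
    lambda g: accuracy_score(g["label"], g["prediction"])
)
print(per_group)
assert per_group.max() - per_group.min() <= 0.05, "Performance gap across groups is too large"

# 2. Look for hidden proxy features: anything strongly correlated with the
#    protected attribute deserves a closer look before it goes into the model.
numeric_features = df.select_dtypes("number").drop(columns=["label", "prediction"])
protected = (df["gender"] == "f").astype(int)
proxy_scores = numeric_features.corrwith(protected).abs().sort_values(ascending=False)
print(proxy_scores)  # height_cm shows up as a likely proxy in this toy example
```

A library suite can automate similar per-segment comparisons; the point here is just the shape of the check that runs continuously rather than once.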
[00:18:44] Unknown:
The other interesting aspect is the question of correctness and what that means in different contexts where if you have a machine learning engineer who's very familiar with the algorithms, but not necessarily a subject matter expert of what they are working with, you know, how they might pair with the existing subject matter experts or the business owners who have a particular objective for this machine learning project to understand, you know, what does correctness mean given the fact that you are dealing with inherent, you know, improbabilities or probabilities, and you need to be able to understand within what parameters is this actually correct, and is this correctness acceptable?
[00:19:23] Unknown:
I think I'd split it into 2. And 1 is, are the things I'm checking for correct? So, for example, I think this also connects to the previous question. Like, in some use cases, we can actually do a performance evaluation and see how the performance is in different demographics. Do we do this or not? This is more of a methodology question, and how comprehensively we actually check the performance of our model. So we can say 1 aspect is kind of: we have specific targets, and how deeply do we check them? I think another aspect, which is very much domain specific, is really what is considered correct in this domain.
So, for example, if I have a healthcare use case, obviously, my restrictions or, like, my boundaries on the metrics I'm trying to achieve would be very, very different in how sensitive I am to mistakes, for example, than if I'm working on a marketing use case in which I'm trying to optimize kind of the overall picture. And it's okay if I have some areas which are performing a bit worse because I really care about the broader picture. It's really an issue of defining in the domain what we need to achieve in order to be correct for us.
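One lightweight way to encode that "what correct means here" decision is a small per-domain spec that the evaluation step checks against, sketched below in plain Python and scikit-learn (the use-case names, metrics, and thresholds are made-up examples, not recommendations).

```python
from dataclasses import dataclass

import numpy as np
from sklearn.metrics import recall_score

@dataclass
class CorrectnessSpec:
    min_overall_recall: float        # bar for the model as a whole
    min_worst_segment_recall: float  # bar for the weakest segment

# Domain experts own these numbers; a healthcare triage model tolerates far
# fewer misses than a marketing model that only optimizes the overall picture.
SPECS = {
    "healthcare_triage": CorrectnessSpec(min_overall_recall=0.95, min_worst_segment_recall=0.90),
    "marketing_uplift":  CorrectnessSpec(min_overall_recall=0.60, min_worst_segment_recall=0.40),
}

def meets_spec(use_case, y_true, y_pred, segments):
    """Return True if the model satisfies the domain's definition of correct."""
    spec = SPECS[use_case]
    overall = recall_score(y_true, y_pred)
    worst = min(
        recall_score(y_true[segments == s], y_pred[segments == s])
        for s in np.unique(segments)
    )
    return overall >= spec.min_overall_recall and worst >= spec.min_worst_segment_recall

# Toy usage with made-up arrays.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 1, 0, 1, 1, 1])
segments = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])
print(meets_spec("marketing_uplift", y_true, y_pred, segments))
```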
[00:20:25] Unknown:
Digging now into the Deepchecks package, I'm wondering if you can talk through some of the ways that it's implemented and some of the architectural decisions that you baked into it and how that plays into the different types of machine learning or the different machine learning frameworks that somebody might be using it with? There are different ways to address this question. I think I'll start with 1 way that connects to the previous question,
[00:20:47] Unknown:
and that is that I think 1 of the challenges is really to make sure we check the model and the data from lots of different scenarios or different perspectives. And 1 of the things that we feel, and it's also the feedback that we've been receiving from our users, is that the fact that they're now able to run, with really minimal effort, so many tests, both on their data and the splits and the model and performance, and performance in different segments and different metrics, and comparing it to a baseline, so, really, lots of different things that are relevant in the context of machine learning, is something that gives a lot of value. That's more about how to address, in general, the notion of testing in machine learning. Talking about the structure of the package itself,
[00:21:25] Unknown:
Maybe before the structure, just in how it's used, I'd say there are, like, 2 main use cases for the package, you know, leaving the online world and so forth out of it for the future. There's working during training, where it's, like, within a notebook or with an IDE, and then there's doing it within the CICD, which, you know, typically can be integrated with a lot of different tools, but, like, pretty much any tool that has a sort of process of saying go, no go. So Airflow, I think, is a great example for that. But I think just in terms of the structure, so we have a few different terms, maybe, you know, let's call it the checks, conditions, and then suites. So the idea is that each individual check is 1 of the things that Shir mentioned earlier, like, you know, let's look for whether there is a big difference between the distribution of training data and test data, which, you know, you can use those names even with something that isn't train-test, but, like, you know, 2 similar datasets.
Then these are united together in a suite, which is very, very much like in software, having test suites. Conditions are kind of, you know, what you get from these checks. A lot of times a check will give, you know, a histogram or a number or something like that, and then the condition is kind of what will make it pass or fail, what turns it into a binary output. The idea was to build a framework that both has quite a lot of logic built in, so without a lot of coding, you can already receive many results, and on the other hand, can be easily both extended or customized
[00:22:43] Unknown:
to really address the different parameters that are relevant for your space. We took it both ways and got inspiration from various testing frameworks for software.
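A rough sketch of those 3 building blocks in the deepchecks tabular API follows. The check classes and the condition method name reflect the releases available around the time of this conversation and are assumptions that vary between versions, so treat this as the shape of the API rather than a reference; `train_df` and `test_df` are assumed to be existing dataframes with a "target" column.

```python
from deepchecks.tabular import Dataset, Suite
from deepchecks.tabular.checks import TrainTestFeatureDrift, TrainTestLabelDrift

# Assumed to already exist: train_df / test_df pandas DataFrames with a "target" column.
train_ds = Dataset(train_df, label="target")
test_ds = Dataset(test_df, label="target")

# A check computes something: a drift score, a histogram, a table, and so on.
feature_drift = TrainTestFeatureDrift()
label_drift = TrainTestLabelDrift()

# A condition turns that result into a binary pass/fail. The exact
# add_condition_* method names differ between deepchecks versions.
feature_drift.add_condition_drift_score_less_than(0.2)
label_drift.add_condition_drift_score_less_than(0.1)

# A suite bundles the checks (with their conditions) for a use case and runs them together.
my_suite = Suite("pre-deployment validation", feature_drift, label_drift)

result = my_suite.run(train_dataset=train_ds, test_dataset=test_ds)
result.save_as_html("suite_report.html")  # full report for humans
# The returned suite result also records which conditions passed or failed,
# which is what a CI step (Airflow, etc.) would key off as its go/no-go signal.
```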
[00:22:52] Unknown:
Given the fact that there is this sort of a dichotomy in the machine learning space between people who are doing, you know, what some might call simple, in air quotes, definitely not simple, but sort of linear regression models versus the folks who are all in on deep learning, you know, massive neural nets, I'm wondering how that breaks down in terms of how you need to think about building a package that can address those different approaches or if at the end of the day, because it's all, you know, code and math, it, you know, boils down to the same types of information that you're trying to collect.
[00:23:27] Unknown:
Yes. I think what's really interesting, like, for us, at least, as, you know, maintainers of the package, is kind of, you know, the API or how you call these different functions and what they look at, but we're doing our best to have the closest thing as possible to 1 single look and feel of our package. We have different submodules. So, you know, we have tabular data, which is in place, and, you know, we're on a 1.0 with that. For computer vision, we're already in beta. We'll have NLP and so forth in the future. We're aiming for the experience to be pretty similar and so forth. Now just in terms of where machine learning is going, I think that's, like, a very interesting question because I think it was pretty standard until a short time ago that, you know, XGBoost is pretty much the best for almost everything tabular. And then, of course, you know, it's CNNs that are best for computer vision, and then, you know, LSTMs for, you know, dealing with NLP, but in some cases, kind of splitting into features. This is kind of changing. I think there are a lot of question marks about a lot of those. We're starting to see, you know, definitely transformers for NLP, but transformers are, you know, at least starting to bite into tabular data use cases and also computer vision use cases. I don't think we know yet what's gonna be, like, the dominant model in each 1 of these use cases in, you know, 3, 4 years. But for the foreseeable future, we think machine learning practitioners will be working with multiple types of datasets, multiple types of models. And we want everyone to be able to use the checks.
[00:24:49] Unknown:
It does affect how we prioritize some of our checks sometimes, because, for example, many times for computer vision related use cases, the labeling is a bigger challenge than for tabular. So understanding things like which images am I performing worst on, and then maybe verifying their labels or relabeling them or maybe kind of labeling new images, that's a challenge that's relevant for improving your model and less relevant, for example, in tabular. So I'd say it has kind of a meta effect on the package, but it's not really a direct 1.
[00:25:20] Unknown:
As you started down this path of building this project and saying, we want to be able to add some rigor to the machine learning process to understand as I progress through from idea to experimentation to implementation in production to retraining and just managing that whole workflow. I'm curious what were some of the ideas or assumptions that you had about what you knew indelibly to be true that have been challenged in the process of exposing it to the open source community and letting people use it and working with some of your early design partners?
[00:25:54] Unknown:
So something I don't know if it's surprising or not surprising. I think surprising for us, you know, developing an open source package for the first time, but probably for people experienced with open source it wasn't that surprising, is how central the README, documentation, and so forth are. It's crazy. Even though, you know, people are smart and dive into the code and so forth, there are a lot of great packages out there that they have to see and kind of choose from and so forth. And it's really make or break for some of our users if they can get exactly what they need very quickly from the documentation. And it makes a lot of our focus be there, not just in, like, you know, building great tech, but in getting to the point where our users have a very quick start, and not just for the main use case, but also for all these different side use cases, like, you know, custom checks or custom suites. So that's very different, I think, than what we thought when we said, okay, let's start to do the open source. You know, there are of course many other things that we've learned while building it. You know, it's a combination of many, many, many, many projects.
[00:26:50] Unknown:
But I think that's the largest 1 for me at least. 1 thing I noticed throughout the process of working on Deepchecks and also working with customers on monitoring use cases is, well, it kind of changes over time. But sometimes, I think initially, we were many times surprised by very, very simple checks giving lots of value, because they just find things that you wouldn't expect. It's kind of trivial that, okay, that was probably tested for. But, no, actually, it wasn't, and it just popped up here. And once you really check it from lots of different angles, you find these new things again from the other side. Of course, once we actually expanded into the more research areas, then the kind of deeper algorithmic insights were also the ones that gave lots of value. So it's kind of some balance that we eventually reached.
[00:27:30] Unknown:
The simpler the test that you failed, the more trouble you're in. If you fail the "your model is not up" test, you're in deep trouble. If you fail the, you know, "model performance is down 40%" test, you're in not as deep trouble, but, you know, also deep trouble. Well, it depends on the case. If you pass all the simple tests and then, you know, you fail some, you know, model confidence in a subsegment type of test, that's great. You should fix it, but you're in, you know, better shape than some other firms. So talking through an example workflow for somebody who is
[00:28:03] Unknown:
adopting Deepchecks for being able to incorporate it into their model development and deployment process, what is the actual process for identifying the types of checks that you want to use, building your own checks, incorporating it into your code, and just the overall steps from, you know, I've recognized that model validation and testing is something that I need to do, to now I've got my model deployment and code and sort of the overall workflow fully instrumented using Deepchecks?
[00:28:35] Unknown:
1 thing is that we see testing as something that is really valid throughout the life cycle of development. So it's actually once you already have your data, you can start validating it. And also once you train your first model and then you train a better model. So it's really an ongoing process. And regarding the adoption of Deepchecks in that process, we try to also build it in a way in which you can initially just run a specific suite for your use case. So for example, for data integrity or for evaluating your model. And then when you both look at the implemented checks and also think about your scenario and what are the additional things that you want to check, that's the phase where you dive in and add a specific condition or customize the check parameters to really tune it after you saw kind of what's relevant and what things are missing. So we give you the template, and from there, you should do the process and really extend it. And in some cases, also probably use it in additional projects in your company. So if you have some domain specific checks that you've developed, and hopefully also contributed to the package, then you can also reuse them in new use cases when you start working on them. The comparable thing I would look at is unit testing. So, like, when do you need to run a unit test? I think it's kind of pretty clear, but it's also pretty clear that it's frequent.
[00:29:45] Unknown:
So the main difference is that people know what unit tests are, and then once you kind of know what the capabilities are and what could be tested there, then there's some sort of common sense in the process of when to run it. The difference about Deepchecks and, in general, you know, machine learning validation or testing is that it's not so common. It's not yet, you know, well known. The terminology isn't there yet. So I would just kind of separate it into 2 phases. 1 of them is we want the first few usages of Deepchecks to kind of show the users what the possibilities are. So we'll prefer to be more thorough and show more options. But then, you know, as you start to live with a healthy machine learning culture and not just, like, the kind of demo type of POC culture, then, you know, we think a lot of the work is kind of much more customizing, changing the checks, changing the suites, and just fitting it to the culture of what you mentioned earlier, kind of like test driven development, but let's call it maybe test driven machine
[00:30:35] Unknown:
learning. Another interesting aspect of building out these test suites and customizing some of the types of checks that you want to perform is thinking about how to extract them into reusable modules to say, you know, for any machine learning model that's dealing with computer vision, I always want to, you know, check that, you know, 1, that's able to process images of these different formats and that I'm able to, you know, retrieve the labels of this particular type and then ensuring that the outputs are, you know, being given confidence intervals. So just, like, curious what the process of building some of these reusable checks looks like and some of the complexities that are involved in understanding how to write them in such a way that they are composable so that you can build out your own sort of suite of checks that you wanna run across your various machine learning projects and the cases where it is very tightly coupled to a specific implementation and sort of understanding what that kind of degree of variance looks like.
[00:31:35] Unknown:
In the case of data and models, I think a major aspect here is what are the tools you're using and what is the structure you're working with. So, for example, also in the package, you can really see it. Tabular data is very different than computer vision. And, also, for us, it's different if you're using PyTorch or TensorFlow or something else. So when trying to build something that's really reproducible over time, what we look at is, first of all, what do we support and what phases in the process do we aim to support? And from thereon, we kind of build a puzzle. In the context of data and model, it's really what are the puzzle pieces that you aim to support and what inside it can you customize. So I think really for data workflows, it's really what is the type of data? What are the tools you're using for processing it? What types of packages are you using for working with it? And we try to implement kind of a very clear interface of what do we actually need. For example, which tabular model do we support? So what is the input and output we need? And same for computer vision. And within that space, you should really dive in and see what are the relevant things you want to test for.
[00:32:39] Unknown:
I think in terms of them writing their own checks and then being able to reuse them and know what fits for each type of use case, So that's exactly what the package is, like, optimal for. So, you know, you create checks, but then you create suites. In our mind, suite
[00:32:55] Unknown:
equals use case. Each suite is optimized for a specific use case, and then you don't have to, like, rethink what's there, know which checks you need, and kind of have, like, some sort of, like, checklist or Jupyter Notebook. You just have kind of a suite object. That's exactly the design of the package. We did feel that many teams are having a challenge of, while they are trying to thoroughly check their model, they're not necessarily transferring the information or the implemented code. And once you have the structure, this enables you to do it in a much more efficient manner. The other interesting avenue to explore is the question of using this in a team context where for software engineering, you'll typically have your tests execute on every commit to the main branch. So you've got your CICD pipeline to make sure that your tests get executed, and that if it fails, it doesn't go into production. And, you know, even in the supposedly deterministic world of a software project, there are the risks for, you know, flapping tests where it, you know, might randomly decide to pass or fail depending on timing issues or, you know, various other types of indeterminism.
And I'm wondering how you've seen folks approach that question of reducing the kind of flapping test case where you wanna make sure that your confidence intervals are dialed in right so that it doesn't randomly pass or fail, but also doesn't pass erroneously. And how do they think about this kind of CICD question of making sure that when I augment this model, it goes through and runs this full test suite, and being able to manage some of the kind of additional cost that's involved in, you know, what can sometimes be a very lengthy training process to make sure that you get a model that is testable and can pass these validations?
[00:34:37] Unknown:
I would split this process into 2 phases. There is everything that I do before the model is deployed for the first time. And, typically, in that phase, I'd also define what exactly needs to happen and what is a pass or fail. And also, eventually, after I do the final retrain, I evaluate the model again before it's deployed. And once it is already deployed, then it depends on the use case, but it's probably retrained at a certain interval. Then everything is already defined. And then, if something pops up, I can possibly redefine the boundaries or manually inspect and see what's going on there. Generally speaking, I would say that the process is really every time I retrain and every time I wanna do something kind of semi autonomously.
[00:35:15] Unknown:
So let the model kind of deploy and trust that it's okay. And now digging into the commercial aspect of what you're building, since we've been speaking about the open source project, I'm wondering what are some of the additional capabilities and services that you're layering on top of the open source project and how you think about the governance and sustainability aspects of Deepchecks, the package, and balance that against the business needs of Deepchecks, the business.
[00:35:44] Unknown:
In general, we're an open core company. That means, you know, our focus is definitely the open source. And what we really care about is giving a quality product to the community. And we realize most people using Deepchecks won't pay, and we're completely fine with that. That's the way we're aiming. We're building the community and the user base in a way that the sheer numbers are enough that even if only some of the users pay, then that's enough for us. So typically, it makes sense for some of these companies to engage with us, whether they're a company that needs more governance and, you know, features that have to do with governance for the models themselves. So that could mean, you know, if you're doing monitoring, then you're not just monitoring, like, the scientific aspect; you wanna be able to have control over, you know, who's in charge of which model and so forth. The other side is companies that are, you know, in production, not just in the training phase. So typically, you know, companies will start and have between 1 and 5 models in production, depending on a few different characteristics.
For those types of companies, then, you know, it's performance monitoring, drift, things like that. But while we're integrating with the different types of databases and have a few different deployment options, it's built in a way that they would expect to use us as an enterprise offering. So it's interesting because it's the same algorithms, the same look and feel, the same API that users are using when they use the open source package. But then typically, where there are requirements from other stakeholders or other users, when you need more UI type features and so forth, then we think it makes sense to move over. So it's a pretty clear motion. Our vision is that Deepchecks is part of what the data scientist kind of looks at and, you know, kind of what they feel data science is, from, like, the first time you step into university, throughout their, you know, more advanced trainings, throughout their first project, until they're leading groups and are using it for everything, including monitoring and at some point auditing.
So that's the vision. We didn't complete all of that, of course, but we're seeing some very promising traction in the meantime. We're getting a lot of requests for the pro version.
[00:37:40] Unknown:
And so as you have been building the Deepchecks project and building the business around it and working with some of the early users and design partners, what are some of the most interesting or innovative or unexpected ways that you've seen it used? I think 1 of the most interesting things is just to see how these integrity
[00:37:58] Unknown:
checks are catching things that we never would have expected. So, you know, things like, you know, uppercase, lowercase, things like having, you know, different schemas. There are things that you wouldn't really imagine happened. I think these teams, if you would ask them, they'd say that these things don't happen. Then, you know, once Deepchecks is just deployed, then you see it there. I think something else that we've been seeing a lot, you know, you could say this is actually predictable, but, you know, anything that has to do with the label is very interesting. Even if it's, like, a small deviation from what they expected. So anything that's around the label, either, you know, the actuals or the predictions or so forth, we see that it's kind of amplified, I don't know, 10 times compared to running whatever check on the features, 'cause that's really the bottom line. Whatever happened before, you know, you could have 2 different problems that cancel each other out and then the label's okay.
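For a concrete sense of the uppercase/lowercase kind of catch Philip describes, here is a small sketch with the deepchecks tabular StringMismatch check (the check name follows the releases around this episode, and the dataframe is a made-up example).

```python
import pandas as pd
from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import StringMismatch

# Toy data with the classic silent bug: the same category spelled several ways.
df = pd.DataFrame({
    "plan": ["Basic", "basic", "BASIC", "Pro", "pro", "Enterprise"],
    "monthly_spend": [10, 12, 9, 50, 55, 400],
})

result = StringMismatch().run(Dataset(df, cat_features=["plan"]))
result.show()  # flags "Basic"/"basic"/"BASIC" and "Pro"/"pro" as variants of the same value
```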
[00:38:46] Unknown:
For me, 1 of the interesting cases I think we faced was quite at the beginning of developing the package, when we were still quite far from releasing it open source. We were kind of releasing it to friends to try it out and get feedback and see how it works. And then we kind of were approached and told, oh yeah, we tried Deepchecks on testing recent problematic production data that we had. Like, they had a big bug and they just downloaded the data and ran Deepchecks on it, and it immediately found the problem, like, in seconds. So it was really cool that for us, it wasn't, like, even production ready, and then it kind of immediately showed them the exact value. So, yeah, I'm not sure if it was unexpected, but it was fun to see. And at least in 1 of the places where we announced the initial release of Deepchecks, so, you know, a vast majority of the comments were very positive, there was someone who was, like, criticizing the package. I can't remember exactly for what. It was, like, 3 lines of criticism and 1 line of, you know, "even though I ran it on my own data and it found all sorts of problems," and then it was, like, 1 more line of criticism. So to me, that was very interesting. But I would say it's like a different use case. I think this is a pretty common use case, but it definitely warmed my heart in a weird way. Yeah. I didn't really like your package. By the way, it found a bug in my code. And, yeah, and you should do it better. Okay. Thank you.
[00:39:54] Unknown:
That's why they didn't like it. It's because they found the bug. My code should be bug free. I don't want anybody telling me different. And in your experience of building that out, what are some of the most interesting or unexpected or challenging lessons that you've learned?
[00:40:09] Unknown:
Definitely 1 of the challenges we've been dealing with all the time is some sort of trade off. It's, like, a product kind of decision, or, you know, realm of decisions, that has to do with, you know, the balance between getting value out of the box and being accurate. So there are some other packages, including some, you know, great other packages like Great Expectations, where the focus is saying, you know, we're gonna make sure that the user has big control over kind of what passes and what doesn't pass, to define everything. A lot of the dynamic is between that and trying to give a lot more value out of the box. And then, you know, worst comes to worst, the defaults don't, like, fit exactly, but then they'll change the parameters later on. So we debated a lot. We interviewed a lot of users. We decided to go for the quick time to value and to get to the point where when you, you know, just run a few lines of code, you get this kind of detailed document. And, you know, there's definitely a price to pay for it, but, you know, we think that's something that's unique about our users, or about the data scientists who we're aiming for. You know, they'll be fine with seeing something that's not 100% accurate. You know, they want to get a quick feel for it. And if they really need it, they'll get the idea pretty quickly and know how to change it. Different user bases will probably have different types of behavior. I think 1 interesting learning is also, in general, how to work open source,
[00:41:18] Unknown:
just because it's so different in so many manners. Like, everything you do is out there. It has many advantages of receiving ongoing feedback, and suddenly you wake up and there's, like, a few new issues. And you're like, oh, cool. Thank you. It really gives a lot of motivation to keep working on it, developing and making it better for various use cases. You never really know, like, what different people are going to use, what maybe people will, I don't know, not care about. So sometimes, I don't know, we, like, developed a check that we didn't know whether it would be useful or not, but we thought we'll put it in, kind of put it, like, as an experimental check and check it out. And then suddenly, we see an issue about it. So we're like, okay, someone found it and is now checking it out. And on the other hand, there are also things that you have to really be careful about, because sometimes you may have something in mind. Like, for example, you want to be able to get really good feedback and a clear understanding of what exactly users do in order to make the package better. But you understand that there are also things like privacy considerations, even though you're not, like, sending any private data, of course. But still, you can't just, like, collect any data from general usage and things like that. So it's something you really have to keep in mind in how to work as an open source company and get what you want or need in order to really improve the package.
[00:42:25] Unknown:
For people who are looking for ways to add guardrails to their machine learning workflow and identify issues in the models as they go through the different stages of training and production and retraining and redeployment, what are the cases where Deepchecks is the wrong choice?
[00:42:41] Unknown:
I'd separate today and the future. First of all, you know, we don't have NLP yet. So, you know, if you do feature extraction from NLP and move it to tabular, Deepchecks is still good to go. But, you know, for NLP, if you have, like, you know, a transformer model that's doing, you know, everything that you need pretty much for the use case, then I'd say Deepchecks isn't yet there. And also for computer vision, you know, we support some of the use cases; with some customization, you can get it to some other use cases, but we're not a 100% there yet. I think for those things, we're aiming to be, I don't know, within a year or so, I'd say, relevant for pretty much every machine learning and data science use case. Of course, there's always a long tail, but I think even those long tail cases will have customizations that you can do to get relevant checks. I think there are still some, you know, similar users that would like Deepchecks to be relevant for them, but it's not optimized for them. So, like, if it's a use case with large amounts of data but without a machine learning model, then even though some of the checks are relevant, it wouldn't be the same thing. I think they'd probably still be better off working with a tool like Great Expectations.
[00:43:42] Unknown:
And you've mentioned a little bit about what you have planned for the future of the project and the product. And I'm wondering if you can just talk to some of the other things that are in flight for the near to medium term or any areas of focus that you're excited to dig into?
[00:43:58] Unknown:
I'd say in general, in terms of, you know, which data we support, that's a huge thing for us. So, you know, starting off with tabular, expanding to, you know, we're already in beta, as I mentioned, for computer vision, and then expanding also to NLP. And later on, there are all sorts of, you know, other use cases like, you know, time series, some of which is supported, but, you know, full support, and then audio and so forth. So in terms of which data and use cases we support, that's definitely part of it. Then we have another significant effort, which is just integrations.
There are all these other great tools around, I think, in general, in MLOps, which I'd say, at least broadly, we're part of the ecosystem. Sometimes it's a mess and there are too many companies or too many packages and users don't always know what to do. So we wanna have a clear way for how to use Deepchecks together with all these different tools. We started working on them and hope to release at least another bunch very soon. And then I'd say the 3rd effort is on expanding from the open source to what we call continuous validation. So it's testing, CICD, monitoring, and auditing
[00:44:54] Unknown:
where they all share the same algorithm core. Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you add preferred contact information to the show notes. And as a final question, I'd like to get your perspectives on what you see as being the biggest barrier to adoption for machine learning today.
[00:45:11] Unknown:
So I always thought that the biggest barrier is overcorrection from management. So if you start working with a machine learning model and it's not good enough, it has some sort of problem, then instead of saying, okay, let's continue to work on it, you just say, you know, I've had it. You know, for our use case, it's better to work without machine learning, and that could be a huge mistake. Because if your competitor or a similar company was more persistent and managed to kind of break the barrier, you could realize, like, you know, 3 years after that, okay, we should probably transfer over, and that's huge. It's moving the whole tech stack. I think definitely, you know, some of the reasons for getting to the point where you have overcorrection have to do with the space that we're dealing with, machine learning validation.
[00:45:52] Unknown:
So I'm not sure if this is the biggest barrier for adoption, but I think that currently 1 challenge we still need to crack is the kind of the balance between how much we really can trust the model and in what aspects and how can we verify it And also like the side of the machine learning and improving it and making sure it's going to work properly
[00:46:10] Unknown:
versus kind of the side of the regulation that is evolving as we go. So there is kind of a, I wouldn't say tug of war, but there is a gradual process. And I think once we crack it and do the essential steps to reach the next step, then adoption will be able to be much more widespread in the long term. Well, thank you both very much for taking the time today to join me and share the work that you're doing on Deepchecks. It's definitely a very interesting project, an interesting business, and a very valuable problem domain to be able to focus on to help support the machine learning community. So I appreciate all of the time and energy that the both of you and your team are putting into that, and I hope you enjoy the rest of your day. Thank you so much, and thank you for everything you've been doing for the community. It's equally important. And thank you for hosting us today.
[00:46:54] Unknown:
Thank you for listening. Don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest on modern data management, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you learned something or tried out a project from the show, then tell us about it. Email hosts@pythonpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Episode Overview
Guest Introductions: Shir Chorev and Philip Tannor
Journey into Machine Learning
Deepchecks: Continuous Validation for Machine Learning
Target Audience and Use Cases
Testing Machine Learning Models
Addressing Bias in Machine Learning
Defining Correctness in Machine Learning
Deepchecks Package Implementation
Assumptions and Learnings from Open Source Community
Adopting Deepchecks in the Workflow
CICD and Model Validation
Commercial Aspects and Governance
Interesting Use Cases and Lessons Learned
When Deepchecks is Not the Right Choice
Future Plans for Deepchecks
Biggest Barrier to Machine Learning Adoption
Closing Remarks