Summary
Version control has become table stakes for any software team, but for machine learning projects there has been no good answer for tracking all of the data that goes into building and training models, along with the output of the models themselves. To address that need Dmitry Petrov built the Data Version Control project, known as DVC. In this episode he explains how it simplifies communication between data scientists, reduces duplicated effort, and eases concerns around reproducing and rebuilding models at different stages of the project's lifecycle. If you work as part of a team that is building machine learning models or doing other data intensive analysis then make sure to give this a listen and then start using DVC today.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Bots and automation are taking over whole categories of online interaction. Discover.bot is an online community designed to ​serve as a platform-agnostic digital space for bot developers and enthusiasts of all skill levels to learn from one another, share their stories, and move the conversation forward together. They regularly publish guides and resources to help you learn about topics such as bot development, using them for business, and the latest in chatbot news. For newcomers to the space they have the Beginners Guide To Bots that will teach you the basics of how bots work, what they can do, and where they are developed and published. To help you choose the right framework and avoid the confusion about which NLU features and platform APIs you will need they have compiled a list of the major options and how they compare. Go to pythonpodcast.com/discoverbot today to get started and thank them for their support of the show.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
- Your host as usual is Tobias Macey and today I’m interviewing Dmitry Petrov about DVC, an open source version control system for machine learning projects
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by explaining what DVC is and how it got started?
- How do the needs of machine learning projects differ from other software applications in terms of version control?
- Can you walk through the workflow of a project that uses DVC?
- What are some of the main ways that it differs from your experience building machine learning projects without DVC?
- In addition to the data that is used for training, the code that generates the model, and the end result there are other aspects such as the feature definitions and hyperparameters that are used. Can you discuss how those factor into the final model and any facilities in DVC to track the values used?
- In addition to version control for software applications, there are a number of other pieces of tooling that are useful for building and maintaining healthy projects such as linting and unit tests. What are some of the adjacent concerns that should be considered when building machine learning projects?
- What types of metrics do you track in DVC and how are they collected?
- Are there specific problem domains or model types that require tracking different metric formats?
- In the documentation it mentions that the data files live outside of git and can be managed in external storage systems. I’m wondering if there are any plans to integrate with systems such as Quilt or Pachyderm that provide versioning of data natively and what would be involved in adding that support?
- What was your motivation for implementing this system in Python?
- If you were to start over today what would you do differently?
- Being a venture backed startup that is producing open source products, what is the value equation that makes it worthwhile for your investors?
- What have been some of the most interesting, challenging, or unexpected aspects of building DVC?
- What do you have planned for the future of DVC?
Keep In Touch
- dmpetrov on GitHub
- Blog
- @fullstackml on Twitter
Picks
- Tobias
- Dmitry
- Go outside and get some fresh air
Links
- DVC
- Iterative.ai
- Linear Regression
- Logistic Regression
- C++
- Perl
- Git
- Version Control System
- Uber Michelangelo
- Domino Data Lab
- Git LFS
- AUC == Area Under Curve metric for evaluating machine learning model performance
- Wes McKinney Interview
- PyTorch
- Tensorflow
- TensorBoard
- MLFlow
- Quilt Data
- Pachyderm
- Apache Airflow
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models or building your CI pipeline, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode.
That's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. Bots and automation are taking over whole categories of online interaction. Discover.bot is an online community designed to serve as a platform agnostic digital space for bot developers and enthusiasts of all skill levels to learn from one another, share their stories, and move the conversation forward together. They regularly publish guides and resources to help you learn about topics such as bot development, using them for business, and the latest in chatbot news. For newcomers to the space, they have a beginner's guide that will teach you the basics of how bots work, what they can do, and where they are developed and published. And to help you choose the right framework and avoid the confusion about which NLU features and platform APIs you will need, they have compiled a list of the major options and how they compare.
Go to pythonpodcast.com/discoverbot today to get started, and thank them for their support of the show. You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the newsletter, and read the show notes. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host as usual is Tobias Macey, and today I'm interviewing Dmitry Petrov about DVC, an open source version control system for machine learning projects. So, Dmitry, could you start by introducing yourself?
[00:02:36] Unknown:
Sure. Hi, Tobias. It is a pleasure to be on the show. I am Dmitry Petrov. I have a mixed background in data science and software engineering. About ten years ago, I worked in academia. And sometimes I say I have worked with machine learning for more than ten years, but you probably know that ten years ago, machine learning was mostly about linear regression and logistic regression, and that is pretty much what I was working on. Then I switched to software engineering and wrote some production code. At that time, data science was not a thing. Around five years ago, I switched back to the quantitative area. I became a data scientist at Microsoft, and I saw what modern data science looks like.
Recently, I'm actually back to software engineering. Today, we are working on DVC, and
[00:03:32] Unknown:
I basically build tools for machine learning. And do you remember how you first got introduced to Python?
[00:03:38] Unknown:
Yeah. It happened in 2004, I believe, and it happened accidentally. I got an internship during my master's program, and I worked on a Python project. It was my first scripting language. I don't count Lisp from my functional programming class, of course. It was Python 2.2 at first, and then we switched to 2.3. I have spent a number of sleepless nights debugging Unicode issues in Python 2. If you have worked with Python 2, you probably know what I'm talking about. And later, I found myself using Python in every single research project that I was working on. We were still using MATLAB for machine learning. It was quite popular ten years ago.
We used plain C for kind of low level software. Let's imagine we need a driver to track some signals. But in general, Python was the primary language in my laboratory. When I worked as a software engineer, my primary language was C++, but I still used Python for ad hoc projects, for automation. At that time, there was a lot of discussion about Python versus Perl, and to me, it was not a question. I used Python all the time. And during my transition back to data science, Python became my primary language because, in data science, it's one of the top languages that people use.
[00:05:02] Unknown:
Today, we build tools for data scientists, and we use Python. And so one of the tools that you've been working on is this project called DVC, or Data Version Control. Can you start by explaining a bit about what it is and how that project got started?
[00:05:17] Unknown:
DVC is basically a version control system for machine learning projects. You can think of DVC as a command line tool. It's kind of a Git for ML. It gives you three basic things. First, it helps you to track large data files. For example, 10 GB data files, 100 GB datasets, ML models. Second, it helps you with your pipelines, machine learning pipelines. It's not the same as data engineering pipelines. It's kind of a lightweight version of pipelines. And third, DVC helps you to organize your experiments and make them self-descriptive, self-documented. Basically, you know everything about how a model was produced, what kind of commands you need to run, what kind of metrics were produced as a result of the experiment.
And you have all this information in your Git history, and DVC helps you to navigate through your experiments. From a technical point of view, we use Git as a foundation. So DVC works on top of Git and a cloud storage. You can use S3. You can use Google Storage or Azure or just a random SSH server where you store data. DVC basically orchestrates Git and cloud storages. And you also asked how DVC got started. That is actually a separate story because, initially, when we started the project, we were not thinking about Git at all. What we were thinking about was the data science experience and what the best practices are, and we were inspired by the ideas of data science platforms.
You have probably heard about Michelangelo, Uber's data science platform. You have probably heard about Domino Data Lab. And today, actually, every large technology company has its own data science platform, because they need to work somehow in large teams. Right? They need to collaborate. And the software engineering toolset is not perfect for ML projects, for data science projects. When we were looking at the platforms, we were thinking: how should the platform of the future look? How can we create a platform which every data scientist can use and which can be widely adopted, highly available? And we came up with this kind of principle for the data science platform of the future: it has to be open source.
That is the way it can be community driven. That is the way you can create a common scenario which every data scientist can use. And it has to be distributed. It's important because sometimes you work on your desktop because you have a GPU, for example. Sometimes you still need your laptop to work, and sometimes you need cloud resources to run a specific model, for example when you need a huge amount of memory. It is important to be everywhere. So with these principles, we came up with the idea: why don't we reuse Git as a foundation for a data science platform, for our tool?
And then we realized it's very close to the idea of versioning. We just need to version datasets. We need to version experiments.
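The idea of reusing Git as a foundation can be pictured in a few lines of code: large files stay out of Git, and only a small metadata record (a content hash) gets committed. The sketch below is a hedged illustration of that concept only; the `CACHE_DIR` name, the `track` function, and the metadata shape are invented for the example, and DVC's real cache layout and `.dvc` file format differ.

```python
# Minimal sketch of content-addressable tracking, the idea behind
# keeping large data files out of Git: the file is copied into a local
# cache under its content hash, and only a small metadata record is
# committed. Illustrative only, not DVC's actual format.
import hashlib
import os
import shutil

CACHE_DIR = ".cache"  # stand-in for a DVC-style cache directory

def file_md5(path):
    """Hash a file in chunks so arbitrarily large files fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def track(path):
    """Store the file content under its hash; return Git-committable metadata."""
    digest = file_md5(path)
    os.makedirs(CACHE_DIR, exist_ok=True)
    cached = os.path.join(CACHE_DIR, digest)
    if not os.path.exists(cached):  # identical content is stored only once
        shutil.copy2(path, cached)
    return {"path": path, "md5": digest}
```

Committing the returned record instead of the data itself is what lets a 10 GB file "live" in Git history while the actual bytes live in a cache or remote storage.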
[00:08:38] Unknown:
And this is how DVC got started. And so you mentioned that the traditional version control systems that are used for regular software projects aren't quite sufficient for machine learning, because of the need to track the datasets along with the code itself and some of the resulting models. So can you walk through the overall workflow of working with DVC, and how it differs from a regular Git workflow for a software engineering project that might be focused on something like a Django application?
[00:09:09] Unknown:
First of all, we need to understand the difference between software engineering and machine learning. Machine learning is driven by experiments. The problem is you have a bunch of experiments. You have dozens, sometimes hundreds of experiments. And you need to communicate the experiments with your colleagues and, basically, with yourself. Because tomorrow, you won't remember what happened today. And in one month, there is no way you will remember what you have done today and why this idea produced such a poor result. You need to track everything. In many cases, you can be in the situation where one of your colleagues comes to your office and says, hey, you know what? I have spent two days trying this idea, and you know what? It didn't work. And you're like, yeah, I tried the same idea two weeks ago, and yeah, I know, it didn't work. It's hard to communicate these ideas because you have dozens of them, sometimes hundreds of them. And this is the difference. You need a framework to communicate a huge number of ideas, a huge number of experiments.
In software engineering, the workflow is different. You have a limited number of ideas. It might be a feature request or a bug fix. I can make one kind of controversial statement: in software engineering, you almost never fail. If you decide to implement some feature or fix a bug, you create a branch, and in nine cases out of ten, the branch will be merged to your mainstream. Right? You can fail in terms of the quality of the software you produced. You can fail in terms of the budget you spent, because you thought it was a one day project and you end up spending two weeks implementing the feature. But, finally, it will be merged. In data science, it's the opposite.
You try ten ideas, one works the best, and the rest didn't work. But you still need to communicate all of those. DVC basically helps you to do this. It helps you to track your experiments. You see a list of your experiments. You see the metrics which were produced. You see the code which was used for this particular experiment, for this particular version of the dataset. And this works way better than, for example, an Excel spreadsheet. Because today, many teams use just an Excel spreadsheet to track their experiments.
So this is the basic difference: self-documented experiments, a clear way of collaborating on experiments and looking at the results. Data versioning is kind of a separate thing, which is important for experiments. It's a must-have for experiments. And sometimes people use DVC just for data versioning. Usually, for example, engineers who work on the deployment side use DVC to deploy the model. Sometimes data engineers or data scientists use DVC just to version their datasets. Sometimes people use DVC as a replacement for Git LFS
[00:12:31] Unknown:
because Git LFS has some limitations, and DVC was built and optimized for the dozens and hundreds of gigabytes range. And for the types of experiments that you're working with, I know that some of the inputs can be in the form of feature engineering, where you're trying to extract different attributes of the data, or changing the hyperparameters that you're tuning to try and get a certain output of the model. So can you discuss how DVC is able to track those, and how it helps you avoid duplicating work when you can't easily identify ahead of time that a particular set of features, hyperparameters, and data inputs has already been tried together? Oh, from
[00:13:18] Unknown:
a data files point of view and a datasets point of view, DVC just tracks them. So for your features, basically, DVC treats every file as just a blob. It doesn't go into the semantics of your data files or the structure of your data files. And the same for models. Models are just binaries. What DVC can understand is that the model was changed. It can understand that this particular version of the binary file was produced at this particular step, and this particular
[00:13:51] Unknown:
input was consumed. But it doesn't go inside the features. It doesn't go into the semantics of the data. And as far as the overall communication pattern: what are some of the things that a data scientist would be looking at, as they're working within a project that's managed through DVC, to make sure that they're not duplicating effort? Any sort of signals that they would be identifying that would otherwise be lost with just using Git by itself?
[00:14:20] Unknown:
Yeah. First of all, the structure of experiments. This is what is super important in data science. I mean, not only in data science; in the engineering case it's pretty much the same. You need to understand what kind of code was used, what exact version of your code, and what exact version of the dataset was used. This is how you can trust an experiment that you did, let's say, three weeks ago. And one more thing is metrics. If you know that this code with this version of the dataset produced that particular value of the metric, it creates trust. You don't need to redo this job again. If you see this result in your Excel spreadsheet, there might be some discrepancy. Right? There can be an error in your process, and you might end up redoing the same stuff. And for documentation, you can use just a Git commit.
When you commit your result, which means you commit a pipeline, a version of your dataset, and the output results, you put a message. And a message in Git is a very simple form of your project documentation. And it is actually a very powerful form. What DVC does is basically make this work for data projects. Because in a regular workflow, you can commit code, but you don't have a connection with data. You don't have a clear connection with your output metrics. And sometimes you don't even have a connection with your pipelines, because a commit might be related to changes in one particular
[00:16:01] Unknown:
part of the pipeline, not the entire pipeline. And as far as the metrics that you're tracking, can you discuss a bit about how those are captured and some of the ways that they factor into the overall evaluation and experimentation cycle, to ensure that the model is producing the types of outputs that you're desiring in the overall project?
[00:16:20] Unknown:
DVC is, first of all, a language agnostic tool and framework agnostic. And DVC is metrics agnostic as well. It means that we track metrics in the simple form of text files. Your metrics file is a CSV file with some header, or a TSV file, or a JSON file, where you just output, say, five metrics with their values. And DVC allows you to look at your entire set of experiments and which metrics were produced. You can find that this idea, for example, which failed because one metric was not perfect, actually produced a better value for another metric, and you probably need to dive into that experiment. So it shows you metrics across your branches, across your commits, and it basically helps you to navigate through your complicated Git tree by metrics. It's kind of a data driven way to navigate through your Git repository.
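Because the metrics are plain text files, comparing experiments can be as simple as parsing them. The sketch below is a hedged illustration of that metrics-driven navigation, assuming each experiment directory holds a `metrics.json` file; the layout and the metric names are hypothetical, not a DVC convention.

```python
# Compare a single metric across experiment directories and find the
# best one. The metrics.json layout is an assumption for the example.
import json
import os

def compare_experiments(exp_dirs, metric):
    """Return (per-experiment metric values, name of the best experiment)."""
    values = {}
    for d in exp_dirs:
        # Each experiment directory is assumed to hold a metrics.json file.
        with open(os.path.join(d, "metrics.json")) as f:
            values[d] = json.load(f)[metric]
    best = max(values, key=values.get)
    return values, best
```

In real use the experiment list would come from Git branches or commits rather than sibling directories, but the comparison step is the same: parse each metrics file and rank.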
And in a data project, you might have thousands of commits and dozens or hundreds of branches, and DVC basically helps you to navigate through this complicated structure. This is why we need metrics. This is why DVC
[00:17:42] Unknown:
has special support for metrics. And so is there built in support for being able to search across those metric values and compare them across branches? Or is that something that you would need to implement on a per project basis, as far as a specific function call that determines whether it's a positive or negative value across those different comparisons?
[00:18:02] Unknown:
Today, we basically show the metrics. If you need to navigate, you probably need to implement something yourself. But I won't be surprised if we implement some metric-specific logic. For example,
[00:18:14] Unknown:
show me the maximum value of this particular metric, or show me the maximum of some combination of those metrics, something like this. And in addition to version control that's typically used for regular software applications, there are also a number of other types of tooling that are useful to ensure that the projects you're building are healthy and that you don't have code regressions. So things along the lines of linting or unit test support. And I'm wondering, what are some of those adjacent concerns that should be considered when you're building machine learning projects, and any ways that the work you're doing, either with DVC or any of your other projects, ties into that? Yeah. It's a good question because, in general, the toolset
[00:18:58] Unknown:
in data projects, it's not at the same status as the toolset for software projects. Right? There was an interview you did with Wes McKinney about a month ago, and he said that in data science we are kind of still in the wild west, and this is actually what is happening. We don't have great support for many scenarios. But from a tools point of view, what I am seeing today is that it is quite mature in terms of algorithms, because we have PyTorch, we have TensorFlow, we have a bunch of other algorithms, like random forests and other tree-based algorithms. And today, there is kind of a race of online monitoring tools.
For example, TensorBoard, where you can report your metrics online while you train and see what is actually going on in the training phase right now. It is especially important for deep learning because the algorithms are still quite slow, I would say. And there are a bunch of commercial products in this space. And MLflow, one of the open source projects which is becoming popular, helps you to track your metrics and visualize the training process. So this is kind of a trend today. Another trend is how to visualize your models, how to understand what is inside your models. And, again, there are a bunch of tools to do this, but the state of this tooling is still not perfect, I would say. In terms of unit tests, you can use just a regular one. Right? Just a regular unit test framework, but I couldn't say it works really well for ML projects specifically. But what I have seen many times is unit tests, or probably not unit tests but functional tests, in the data engineering part. When a new set of data comes into your system, you can easily get basic metrics and make sure there is no drift in the metrics, no big changes in the metrics. So this is how unit tests, or tests at least, can work in the data world. But the toolset in general is kind of still
[00:21:09] Unknown:
the wild west. And moving on to the data versioning that is built into DVC: as I was reading through the documentation, it mentions that the actual data files themselves are stored in external systems such as S3 or Google Cloud Storage or something like NFS. And so I'm wondering if you can talk through how DVC actually manages the different versions of that data, any type of diffing or incremental changes that it is able to track, and any difficulties or challenges that you faced in the process of building that system and integrating it with the version control that you use Git for in the overall project structure? Yeah. Of course, we don't commit
[00:21:52] Unknown:
data to the Git repository. We push data to your servers, to your clouds, usually, and you can configure the cloud. As I said, we treat data as binary blobs. And for each particular commit, we can bring you the actual datasets and all the data artifacts that were in use. We don't do any diffs per file, because you need to understand the semantics of a file in order to do diffs. Right? It's not Git. In Git, every file is text, so it makes sense to diff it. In data science, you need to know what exact format of a data file you use. However, we track directories as a kind of separate type of structure. If you have a directory with, let's imagine, 100,000 files, and then you add a few more files into the directory and commit this as a new version of your dataset, then we understand that only a small part of the directory was changed. If, let's say, 2,000 files were modified and 1,000 were added, then we version only the diff. So you can easily add your labels on a daily or weekly basis without any concern for the size of your directory. We do this kind of optimization.
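The directory optimization described here boils down to hashing each file into a manifest and diffing manifests between versions, so only new content needs to be stored or pushed. The following is a rough, hedged sketch of that idea; the `manifest` and `new_blobs` helpers are invented for illustration and do not reflect DVC's actual directory format.

```python
# Hash every file in a directory into a manifest; diffing two manifests
# shows that adding a few files to a huge dataset only requires storing
# the new blobs. Illustrative only.
import hashlib
import os

def manifest(directory):
    """Map each relative path in the directory to its content hash."""
    entries = {}
    for root, _, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            entries[os.path.relpath(path, directory)] = digest
    return entries

def new_blobs(old, new):
    """Content hashes present only in the new version: the only data
    that actually needs to be uploaded for the new dataset version."""
    return set(new.values()) - set(old.values())
```

Versioning the manifest rather than the directory itself is what makes daily label additions cheap: unchanged files contribute nothing new to storage.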
And another important optimization that we do: when you check out files from the internal structure, it creates a copy in your workspace. However, in the data world, sometimes that just does not make sense, because you don't want to create another copy of, let's say, 100 gigabytes of data. We optimize this process through references, so you are not having a duplication of your dataset. So you can easily work with dozens or hundreds of gigabytes
[00:23:48] Unknown:
without this kind of concern. For somebody who is onboarding onto an existing project and just checking out the state of the repository for the first time, is there any built in capacity for being able to say: I only want to pull the code, I'm not ready to pull down all the data yet; or, I just want to pull down a subset of the data? Because somebody who's working on a multi hundred gigabyte dataset doesn't necessarily want to have all of that located on their laptop as they're building through these experiments. And I'm just curious what that overall workflow looks like as you're training the models when you're working locally, and how it handles interacting with these large data repositories to make sure that it doesn't just completely fill up your local disk. Yeah. This is a good question.
[00:24:28] Unknown:
Yes. We do this granular pull. We kind of optimize this as well. And you as a data scientist can decide what exactly you need. For example, if you'd like to deploy your model, which is probably within 100 megabytes, you probably don't need to waste time pulling the 100 GB dataset which was used in order to produce the model. You can clone your repository first, with the code and the metadata files, and then say: I need just this data file, I need just the model file. And only the model file will be delivered to your production system, to your deployment machine. And the same for datasets.
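The granular pull described here can be pictured as checking out only selected artifacts from a content-addressed cache. The sketch below is a hedged illustration; the manifest shape, the `pull` function, and the file names are all invented for the example and are not DVC's real checkout mechanism.

```python
# Materialize only the requested artifacts (e.g. just the model, not the
# full training set) from a cache keyed by content hash.
import os
import shutil

def pull(manifest, cache_dir, targets, workspace="."):
    """Copy only the listed workspace paths out of the cache."""
    pulled = []
    for path in targets:
        digest = manifest[path]  # KeyError means the artifact is untracked
        src = os.path.join(cache_dir, digest)
        dst = os.path.join(workspace, path)
        parent = os.path.dirname(dst)
        if parent:
            os.makedirs(parent, exist_ok=True)
        shutil.copy2(src, dst)
        pulled.append(path)
    return pulled
```

A deployment machine would call this with only the model's path, leaving the multi-gigabyte training data on the remote.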
[00:25:11] Unknown:
And there are some other projects that I've talked about, and talked with the people who are building them, such as Quilt or Pachyderm, that have built in support for versioning of data. And I'm wondering if you have any plans currently to work on integrating with those systems, or just the overall process of what's involved in adding additional support for the data storage piece of DVC?
[00:25:35] Unknown:
Yeah. For some of the systems, integration can be done easily. For example, Pachyderm: it's a project that's mostly about data engineering. They have a concept of pipelines, kind of data engineering pipelines, and DVC can be used in data engineering pipelines. DVC has a notion of ML pipelines. It's kind of a lightweight concept. It's optimized specifically for machine learning engineers. It doesn't have all the complexity of data engineering pipelines, but it can easily be used as a single step in an engineering pipeline. And we have seen that many times, when people take, for example, a DVC pipeline and put it inside Airflow as a single step. And this kind of design is actually a good design, because you give a lightweight tool to ML engineers and data scientists so they can easily produce a result and iterate faster.
And you have a production system where DVC can easily just be injected inside. And there is a term; I don't remember which company used it. Netflix, probably. They used the term DAG inside a DAG, which means you have a DAG of data pipelines and you have an ML DAG for one particular problem, and you basically inject a number of ML DAGs inside the data engineering DAG. So from this point of view, there is no problem to integrate. There are no issues with integration with Pachyderm or Airflow or other systems. Regarding Quilt Data, they do versioning. They work with S3 as well. Potentially, we can be integrated with that, and we are thinking about this.
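The "DAG inside a DAG" pattern amounts to the outer pipeline treating the whole ML pipeline as one opaque step that shells out to a command. A minimal sketch, with placeholder commands standing in for real stages; in practice the ML step would invoke the ML pipeline tool (for example, a `dvc repro`-style command) rather than a trivial print:

```python
# Run an outer pipeline where one step wraps an entire inner ML
# pipeline behind a single command. Commands below are placeholders.
import subprocess
import sys

def run_pipeline(steps):
    """Run each (name, command) step in order; fail fast on error."""
    completed = []
    for name, cmd in steps:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"step {name!r} failed: {result.stderr}")
        completed.append(name)
    return completed

# An outer data engineering DAG; the whole ML DAG hides behind one step.
outer = [
    ("ingest", [sys.executable, "-c", "print('ingest')"]),
    ("ml_pipeline", [sys.executable, "-c", "print('inner ML pipeline')"]),
    ("publish", [sys.executable, "-c", "print('publish')"]),
]
```

An orchestrator like Airflow would express the same structure with task operators instead of a plain loop, but the boundary is identical: the inner ML DAG is a single node of the outer DAG.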
But we are trying to be consumer driven, customer driven, and the biggest need today is probably integration with MLflow, because MLflow shines really well on the online metrics tracking side. And sometimes people like to use MLflow for tracking metrics online and DVC for their data files. And this is one of the integrations that we are thinking about
[00:27:49] Unknown:
today. And so in terms of the actual implementation of DVC itself, I know that it's primarily written in Python, and you mentioned that that's largely driven by the fact that it is becoming the lingua franca for data scientists. So I'm wondering, now that you have gone a bit further in the overall implementation and maintenance of DVC, if you think that that is still the right choice. And if you were to start over today, what are some of the things that you would do differently, either in terms of language choice or system design or overall project structure?
[00:28:22] Unknown:
Regarding language, I believe Python is a good choice for this kind of project, for two major reasons. One, we are targeting data scientists, and most of them are comfortable with Python, and we expect data scientists to contribute to our code base. If you write this kind of project in, let's say, C or C++ or Golang, you probably won't see a lot of contribution from the community, because the community speaks a different language. For us, it works perfectly. Data scientists are contributing to the code, which is great. And the second reason was programming APIs.
Before, we were thinking about exposing DVC through APIs. It's another option for using DVC. And if you write the code in Python, it kind of comes out of the box. You can reuse DVC as a library and inject it into your project. If you use a different language, it just creates some overhead; you need to think about wrapping it into a nice form. Those were the major reasons. And so far, we are happy with Python, and it works nicely. And you mentioned being able to use
[00:29:43] Unknown:
DVC as a library as well. And so I'm wondering if there are any use cases that you've seen that were particularly interesting or unexpected or novel, either in that library use case or just in the command line oriented
[00:29:57] Unknown:
way that it was originally designed? Oh, sometimes people ask for library support because they need to implement some more advanced scenarios, I would say. For example, people use DVC to build their own platforms, if you wish, data science platforms. They'd like to build a kind of continuous integration framework, where DVC plays the role of the glue between your local experience and the CI experience. They ask for libraries, but we have such a great command line toolset that people just switch back to the command line experience. But one day, I won't be surprised if someone uses DVC purely as a library. And I'm also interested in the idea of wrapping
[00:30:51] Unknown:
a single implementation step for running the model training piece of it. So I'm wondering if you can talk a bit more about that and some of the specific implementations that you've seen. Yeah. Absolutely. This is actually a really good question.
[00:31:06] Unknown:
I believe that data engineers need pipelines, right? And data scientists and machine learning engineers need pipelines. But the fact is, their needs are absolutely different. Data engineers care about a stable system. If it fails, the system needs to do something; it has to recover. This is the primary goal of a data engineering framework. In data science, it works the opposite way. You fail all the time. You come up with some idea, you write code, you run the code, it fails. You fix it, it fails again, etcetera, etcetera. And your goal is to have a framework to check ideas fast, to fail fast. This is the goal of ML engineering. So it is a good practice to separate the two kinds of pipeline frameworks.
One is the stable engineering pipeline, and the second is a fast, lightweight experimentation pipeline, if you wish. And when you separate these two worlds, you simplify the life of ML engineers a lot. They don't need to deal with complicated stuff. They don't need to waste time on understanding how Airflow works, how Luigi works. They just live in their world and produce models. And once the model is ready, they need a clear way to inject this pipeline into the data pipeline. And you can build a very simple tool to do this. I remember when I worked at Microsoft, it took me maybe a couple of hours to productionize my pipeline, because I had a separate workflow, a separate tool for ML pipelines.
And it works nicely. I believe this is the future of engineering: we need to separate these two things. And I'm also interested in the deployment capabilities
[00:33:00] Unknown:
that DVC provides, as far as being able to put models into production or revert the state of the model in the event that it's causing problems for the business, or just some of the overall tooling and workflow that is involved in running machine learning models in production, particularly as far as metrics tracking to know when the model needs to be retrained, and just closing the loop of that overall
[00:33:34] Unknown:
Yeah. Deployment is something we are very interested in, because it's close to the business. And there are a lot of funny stories about ML model deployment. Sometimes it goes like this: a software engineer asks the data science team, can we revert our model, not to the previous one, but to the model from the previous week? And the data scientist says, yeah, sure. We have the code, we have the datasets, I just need, like, five hours to retrain it. And it does not make any sense, right, to spend five hours to revert a model. In the software engineering world, it does not work this way. You need to have everything available, and you need to revert right away, because waiting five hours means wasting money for the business. And DVC basically helps you to organize this process. It basically creates a common language between your data scientists, who produce the model, and the ML engineers, who take the model and deploy it. So next time, with proper data management, you won't even be asking data scientists to give you the previous model. You should have everything in your system, DVC or not DVC, it doesn't matter, but you need to have all the artifacts available. And from the metrics tracking point of view, this is actually a separate question, because when you're talking about metrics tracking in production, it usually means online metrics.
It usually means metrics based on a feedback loop from users, and that is a separate question. So DVC, deployment aside, is mostly about the development phase. Today it basically does nothing with online metrics.
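The revert scenario described above boils down to two commands once every artifact is tracked. A minimal sketch, assuming the model file is tracked by DVC and each release corresponds to a Git tag (the tag names here are illustrative):

```python
def revert_model_commands(release_tag):
    """Commands to restore a previously deployed model without retraining.
    `git checkout` restores the code, parameters, and .dvc pointer files;
    `dvc checkout` then pulls the matching model binary from the DVC cache."""
    return [
        f"git checkout {release_tag}",
        "dvc checkout",
    ]

# Rolling back to last week's release would then be, e.g.:
#   for cmd in revert_model_commands("release-2019-03-18"):
#       subprocess.run(cmd.split(), check=True)
```

The point is that the revert takes seconds rather than five hours, because the model artifact itself is versioned alongside the code that produced it.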
[00:35:24] Unknown:
You are actually building and maintaining this project under the auspices of Iterative.ai, which is a venture backed company. So I'm curious what the overall value equation is for your investors that makes it worthwhile for them to fund your efforts on building and releasing this open source project, and just the overall strategy that you have for the business?
[00:35:49] Unknown:
You would be surprised how interested investors are in open source projects today. Especially last year was super successful for open source projects. Last year, MuleSoft was acquired for billions of dollars. Last year, Elasticsearch went IPO; it's purely an open source company. And when you do open source, it usually means that you are doing IT infrastructure, in many cases, and IT infrastructure is a good area for monetization. Around a successful open source project, there are a bunch of companies monetizing the open source. It's just very important to understand your business model, because with open source there are a few common models.
One is the obvious service model, a kind of consultancy model. The second is the open core model, when you build software and sell a piece of the software with advanced features for money, or a different version of your software as a product for enterprises. And the third model is the ecosystem model, when you build an open source product and create services as a separate product. One example might be Git and GitHub: there is the open source project and the SaaS service, which is an absolutely different product, with a different experience and use cases. And you need to understand which model you fit in.
And for a successful project, yeah, a lot of VCs are interested in these kinds of businesses. Initially, I started this as a pet project, and it was my pet project for about a year. And then I was thinking about how to make something big out of this, how to spend more time on this, how to find more resources in order to do this. And it was clear that if this project is successful, a few businesses will be monetizing this area. And why shouldn't we be the business which builds the product and monetizes it? So it's kind of a natural path for a modern open source project, I would say. And as far as your overall experience
[00:38:04] Unknown:
of building and maintaining the DVC project and the community, I'm wondering what you have found to be some of the most interesting or challenging or unexpected lessons learned in the process.
[00:38:14] Unknown:
One of the lessons that I learned, and I think it's a usual business lesson, actually: when you build your project, you know what to do. You know your roadmap, you know your goal, and you are just building. But one day, users come to you, and they ask for a lot of different stuff. And then you have a kind of tension, a tension between your vision, your plans, and the demand from the user side. And today, we are at the point where every day we have a few requests from users. Sometimes we have, like, ten requests per day, and it's not easy to balance things.
Because if you do everything people ask, you have no time for your roadmap. And, actually, you have no time to fix and implement everything that people ask anyway. So you need to learn how to prioritize things. You need to learn how to sometimes say no to users, or say, we will do this, but probably not right now. And yeah, this is not easy to do. This is what you need to learn during the process. And as I said, it's not something new to open source. It was the same way in the business environment, and I have seen that many times. And looking forward in terms of the work that you have planned for DVC, I'm curious what types of features or improvements you have on the roadmap. Regarding the features for the near future, we are going to release better support for dataset versioning and ML model versioning.
We are introducing two new commands into DVC which simplify your experience. Today, some companies are using a monorepo with a bunch of datasets inside, and we need new commands to version them better, because these datasets evolve at different speeds, and you need to work with one version of one dataset and another version of another dataset. So this is one of the steps we are taking. And another use case for datasets is cross-repository references. Other companies are not using a monorepo; they're using a set of repos. For example, they might have ten or twenty repos with datasets and twenty more with models, and they need to cross-reference the datasets. This is the next command we are going to implement, to support these cross-repository scenarios.
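The command names are not given in the conversation; DVC later shipped `dvc get` and `dvc import` for exactly this cross-repository scenario, so the sketch below uses those names, but treat the exact interface as illustrative for the version you are running.

```python
def cross_repo_reference(repo_url, path, out=None, tracked=True):
    """Build the CLI invocation for pulling a dataset out of another DVC
    repository. A tracked import records the source repo and revision so
    the reference can be updated later; otherwise it is a plain download."""
    cmd = ["dvc", "import" if tracked else "get", repo_url, path]
    if out is not None:
        cmd += ["--out", out]  # where to place the dataset locally
    return cmd
```

So a model repo could declare a tracked dependency on `images` in a separate dataset repo, instead of copying the data into every repository that needs it.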
This is our near future. And in the longer term, the next step would be to implement more features for better experiment support, especially when people deal with scenarios such as hyperparameter tuning, where they need, let's say, a thousand experiments and still need to control them. They don't want to have a thousand branches. This is the experience we need to improve, and we have a clear plan for how to do this. And that's pretty much it for the next maybe half a year. Eventually, we believe that DVC can be a platform where people can work in the same environment in one team and share ideas with each other.
And in the future, we believe we can create a great experience where people can share ideas even between companies, between teams. This is the big
[00:41:46] Unknown:
future of DVC that I believe in. And are there any other aspects of the work that you're doing on DVC, or the overall workflow of machine learning projects, that we didn't discuss yet that you think we should cover before we close out the show? I don't think I have anything to add, but
[00:42:03] Unknown:
what I believe in is that we need to pay more attention to how we organize our work. We need to pay more attention to how we structure our projects, and we need to find the places where we waste our time instead of doing actual work. It is very important to be more organized, more productive as data scientists, because today we are still in the Wild West, and that needs to change as soon as possible. And it is important to pay attention to this. It's important to understand
[00:42:39] Unknown:
this problem set. Alright. Well, for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose a tool that I started using and experimenting with recently called Otter.ai. It's billed as a voice note taking service that will transcribe your meeting notes, or just mental notes to yourself, to text so that they're searchable. And I've been experimenting with using it to try out generating transcriptions for the podcast. So I'm looking forward to using it more frequently and starting to add transcripts to the show. It's definitely worth checking out if you're looking for something that does a pretty good job of generating transcripts automatically
[00:43:27] Unknown:
and at a reasonable price. So with that, I'll pass it to you, Dmitry. Do you have any picks this week? So I thought that the open source part, open source versus venture capital, would be the question we discussed. Actually, I have nothing special to suggest. But today the weather is nice, and spring has just started. I suggest spending more time outside, reading outside and walking around the city or town.
[00:43:54] Unknown:
Alright. Well, thank you very much for taking the time today to join me and discuss the work that you're doing on DVC and adding some better structure to the overall project development of machine learning life cycles. So thank you for that, and I hope you enjoy the rest of your day. Thank you, Tobias. Thank you.
Introduction to Dmitry Petrov and His Background
Overview of DVC (Data Version Control)
Differences Between Software Engineering and Machine Learning Workflows
Tracking Experiments and Metrics with DVC
Tooling and Adjacent Concerns in Machine Learning Projects
Data Versioning and Storage Management in DVC
Integration with Other Systems and Tools
Deployment and Production Considerations for Machine Learning Models
Business Model and Investor Interest in Open Source Projects
Challenges and Lessons Learned in Building DVC
Future Features and Improvements for DVC