Summary
Version control has become table stakes for any software team, but for machine learning projects there has been no good answer for tracking all of the data that goes into building and training models, along with the output of the models themselves. To address that need Dmitry Petrov built the Data Version Control project, known as DVC. In this episode he explains how it simplifies communication between data scientists, reduces duplicated effort, and eases concerns around reproducing and rebuilding models at different stages of the project's lifecycle. If you work as part of a team that is building machine learning models or doing other data intensive analysis then make sure to give this a listen and then start using DVC today.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Bots and automation are taking over whole categories of online interaction. Discover.bot is an online community designed to ​serve as a platform-agnostic digital space for bot developers and enthusiasts of all skill levels to learn from one another, share their stories, and move the conversation forward together. They regularly publish guides and resources to help you learn about topics such as bot development, using them for business, and the latest in chatbot news. For newcomers to the space they have the Beginners Guide To Bots that will teach you the basics of how bots work, what they can do, and where they are developed and published. To help you choose the right framework and avoid the confusion about which NLU features and platform APIs you will need they have compiled a list of the major options and how they compare. Go to pythonpodcast.com/discoverbot today to get started and thank them for their support of the show.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
- Your host as usual is Tobias Macey and today I’m interviewing Dmitry Petrov about DVC, an open source version control system for machine learning projects
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by explaining what DVC is and how it got started?
- How do the needs of machine learning projects differ from other software applications in terms of version control?
- Can you walk through the workflow of a project that uses DVC?
- What are some of the main ways that it differs from your experience building machine learning projects without DVC?
- In addition to the data that is used for training, the code that generates the model, and the end result there are other aspects such as the feature definitions and hyperparameters that are used. Can you discuss how those factor into the final model and any facilities in DVC to track the values used?
- In addition to version control for software applications, there are a number of other pieces of tooling that are useful for building and maintaining healthy projects such as linting and unit tests. What are some of the adjacent concerns that should be considered when building machine learning projects?
- What types of metrics do you track in DVC and how are they collected?
- Are there specific problem domains or model types that require tracking different metric formats?
- In the documentation it mentions that the data files live outside of git and can be managed in external storage systems. I’m wondering if there are any plans to integrate with systems such as Quilt or Pachyderm that provide versioning of data natively and what would be involved in adding that support?
- What was your motivation for implementing this system in Python?
- If you were to start over today what would you do differently?
- Being a venture backed startup that is producing open source products, what is the value equation that makes it worthwhile for your investors?
- What have been some of the most interesting, challenging, or unexpected aspects of building DVC?
- What do you have planned for the future of DVC?
Keep In Touch
- dmpetrov on GitHub
- Blog
- @fullstackml on Twitter
Picks
- Tobias
- Dmitry
- Go outside and get some fresh air
Links
- DVC
- Iterative.ai
- Linear Regression
- Logistic Regression
- C++
- Perl
- Git
- Version Control System
- Uber Michelangelo
- Domino Data Lab
- Git LFS
- AUC == Area Under Curve metric for evaluating machine learning model performance
- Wes McKinney Interview
- PyTorch
- Tensorflow
- TensorBoard
- MLFlow
- Quilt Data
- Pachyderm
- Apache Airflow
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models or building your CI pipeline, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode.
That's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. Bots and automation are taking over whole categories of online interaction. Discover.bot is an online community designed to serve as a platform agnostic digital space for bot developers and enthusiasts of all skill levels to learn from one another, share their stories, and move the conversation forward together. They regularly publish guides and resources to help you learn about topics such as bot development, using them for business, and the latest in chatbot news. For newcomers to the space, they have a beginner's guide that will teach you the basics of how bots work, what they can do, and where they are developed and published. And to help you choose the right framework and avoid the confusion about which NLU features and platform APIs you will need, they have compiled a list of the major options and how they compare.
Go to pythonpodcast.com/discoverbot today to get started, and thank them for their support of the show. You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the newsletter, and read the show notes. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host as usual is Tobias Macey, and today I'm interviewing Dmitry Petrov about DVC, an open source version control system for machine learning projects. So, Dmitry, could you start by introducing yourself?
[00:02:36] Unknown:
Sure. Hi, Tobias. It is a pleasure to be on the show. I am Dmitry Petrov. I have a mixed background in data science and software engineering. About ten years ago, I worked in academia. And sometimes I say I have worked with machine learning for more than ten years, but you probably know that ten years ago, machine learning was mostly about linear regression and logistic regression, and that is pretty much what I was working on. Then I switched to software engineering and wrote some production code. At that time, data science was not a thing. Around five years ago, I switched back to the quantitative area. I became a data scientist at Microsoft, and I saw what modern data science looks like.
Recently, I'm actually back to software engineering. Today, we are working on DVC, and
[00:03:32] Unknown:
I basically build tools for machine learning. And do you remember how you first got introduced to Python?
[00:03:38] Unknown:
Yeah. It happened in 2004, I believe, and it happened accidentally. I got an internship during my master's program, and I worked on a Python project. It was my first scripting language. I don't count Lisp from my functional programming class, of course. It was Python 2.2 at first, and then we switched to 2.3. I have spent a number of sleepless nights debugging Unicode issues in Python 2. If you have worked with Python 2, you probably know what I'm talking about. And later, I found myself using Python in every single research project that I was working on. We were still using MATLAB for machine learning. It was quite popular ten years ago.
We used plain C for kind of low level software. Let's imagine we need a driver to track some signals. But in general, Python was the primary language in my laboratory. When I worked as a software engineer, my primary language was C++, but I still used Python for ad hoc projects, for automation. At that time, there was a lot of discussion about Python versus Perl, and to me, it was not a question. I used Python all the time. And during my transition back to data science, Python became my primary language because, in data science, it's one of the top languages that people use.
[00:05:02] Unknown:
Today, we build tools for data scientists, and we use Python. And so one of the tools that you've been working on is this project called DVC, or Data Version Control. Can you start by explaining a bit about what it is and how that project got started?
[00:05:17] Unknown:
DVC is basically a version control system for machine learning projects. You can think of DVC as a command line tool. It's kind of a Git for ML. It gives you three basic things. First, it helps you to track large data files. For example, 10 GB data files, 100 GB datasets, ML models. Second, it helps you with your pipelines, machine learning pipelines. It's not the same as data engineering pipelines. It's kind of a lightweight version of pipelines. And third, DVC helps you to organize your experiments and make them self-descriptive, self-documented. Basically, you know everything about how a model was produced, what kind of commands you need to run, what kind of metrics were produced as a result of the experiment.
And you have all this information in your Git history, and DVC helps you to navigate through your experiments. From a technical point of view, we use Git as a foundation. So DVC works on top of Git and a cloud storage. You can use S3. You can use Google Storage or Azure or just a random SSH server where you store data. DVC basically orchestrates Git and cloud storages. And you also asked how DVC got started. That is actually a separate story because, initially, when we started the project, we were not thinking about Git at all. What we were thinking about was the data science experience and what the best practices are, and we were inspired by the ideas of data science platforms.
You have probably heard about Michelangelo, Uber's data science platform. You have probably heard about Domino Data Lab. And today, actually, every large technology company has its own data science platform, because they need to work somehow in large teams. Right? They need to collaborate. And the software engineering toolset is not perfect for ML projects, for data science projects. When we were looking at the platforms, we were thinking: how should the platform of the future look? How can we create a platform which every data scientist can use and which can be widely adopted, highly available? And we came up with this kind of principle for the data science platform of the future: it has to be open source.
That is the way it can be community driven. That is the way you can create a common scenario which every data scientist can use. And it has to be distributed. It's important because sometimes you work on your desktop because you have a GPU, for example. Sometimes you still need your laptop to work, and sometimes you need cloud resources to run a specific model, for example when you need a huge amount of memory. It is important to be everywhere. So with these principles, we came up with the idea: why don't we reuse Git as a foundation for a data science platform, for our tool?
And then we realized it's very close to the idea of versioning. We just need to version datasets. We need to version experiments.
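The idea of reusing Git as a foundation can be pictured in a few lines of code: large files stay out of Git, and only a small metadata record (a content hash) gets committed. The sketch below is a hedged illustration of that concept only; the `CACHE_DIR` name, the `track` function, and the metadata shape are invented for the example, and DVC's real cache layout and `.dvc` file format differ.

```python
# Minimal sketch of content-addressable tracking, the idea behind
# keeping large data files out of Git: the file is copied into a local
# cache under its content hash, and only a small metadata record is
# committed. Illustrative only, not DVC's actual format.
import hashlib
import os
import shutil

CACHE_DIR = ".cache"  # stand-in for a DVC-style cache directory

def file_md5(path):
    """Hash a file in chunks so arbitrarily large files fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def track(path):
    """Store the file content under its hash; return Git-committable metadata."""
    digest = file_md5(path)
    os.makedirs(CACHE_DIR, exist_ok=True)
    cached = os.path.join(CACHE_DIR, digest)
    if not os.path.exists(cached):  # identical content is stored only once
        shutil.copy2(path, cached)
    return {"path": path, "md5": digest}
```

Committing the returned record instead of the data itself is what lets a 10 GB file "live" in Git history while the actual bytes live in a cache or remote storage.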
[00:08:38] Unknown:
And this is how DVC got started. And so you mentioned that the traditional version control systems that are used for regular software projects aren't quite sufficient for machine learning, because of the need to track the datasets along with the code itself and some of the resulting models. So can you walk through the overall workflow of working with DVC, and how it differs from a regular Git workflow for a software engineering project that might be focused on something like a Django application?
[00:09:09] Unknown:
First of all, we need to understand the difference between software engineering and machine learning. Machine learning is driven by experiments. The problem is you have a bunch of experiments. You have dozens, sometimes hundreds of experiments. And you need to communicate the experiments with your colleagues and, basically, with yourself. Because tomorrow, you won't remember what happened today. And in one month, there is no way you will remember what you have done today and why this idea produced such a poor result. You need to track everything. In many cases, you can be in the situation where one of your colleagues comes to your office and says, hey, you know what? I have spent two days trying this idea, and you know what? It didn't work. And you're like, yeah, I tried the same idea two weeks ago, and yeah, I know, it didn't work. It's hard to communicate these ideas because you have dozens of them, sometimes hundreds of them. And this is the difference. You need a framework to communicate a huge number of ideas, a huge number of experiments.
In software engineering, the workflow is different. You have a limited number of ideas. It might be a feature request or a bug fix. I can make one kind of controversial statement: in software engineering, you almost never fail. If you decide to implement some feature or fix a bug, you create a branch, and in nine cases out of ten, the branch will be merged to your mainstream. Right? You can fail in terms of the quality of the software you produced. You can fail in terms of the budget you spent, because you thought it was a one day project and you end up spending two weeks implementing the feature. But, finally, it will be merged. In data science, it's the opposite.
You try ten ideas, one works the best, and the rest didn't work. But you still need to communicate all of those. DVC basically helps you to do this. It helps you to track your experiments. You see a list of your experiments. You see the metrics which were produced. You see the code which was used for this particular experiment, for this particular version of the dataset. And this works way better than, for example, an Excel spreadsheet. Because today, many teams use just an Excel spreadsheet to track their experiments.
So this is the basic difference: self-documented experiments, a clear way of collaborating on experiments and looking at the results. Data versioning is kind of a separate thing, which is important for experiments. It's a must-have for experiments. And sometimes people use DVC just for data versioning. Usually, for example, engineers who work on the deployment side use DVC to deploy the model. Sometimes data engineers or data scientists use DVC just to version their datasets. Sometimes people use DVC as a replacement for Git LFS
[00:12:31] Unknown:
because Git LFS has some limitations, and DVC was built and optimized for the dozens and hundreds of gigabytes range. And for the types of experiments that you're working with, I know that some of the inputs can be in the form of feature engineering, where you're trying to extract different attributes of the data, or changing the hyperparameters that you're tuning to try and get a certain output of the model. So can you discuss how DVC is able to track those, and how it helps you avoid duplicating work when you can't easily identify ahead of time that a particular set of features, hyperparameters, and data inputs has already been tried together? Oh, from
[00:13:18] Unknown:
a data files point of view and a datasets point of view, DVC just tracks them. So for your features, basically, DVC treats every file as just a blob. It doesn't go into the semantics of your data files or the structure of your data files. And the same for models. Models are just binaries. What DVC can understand is that the model was changed. It can understand that this particular version of the binary file was produced at this particular step, and this particular
[00:13:51] Unknown:
input was consumed. But it doesn't go inside the features. It doesn't go into the semantics of the data. And as far as the overall communication pattern: what are some of the things that a data scientist would be looking at, as they're working within a project that's managed through DVC, to make sure that they're not duplicating effort? Any sort of signals that they would be identifying that would otherwise be lost with just using Git by itself?
[00:14:20] Unknown:
Yeah. First of all, the structure of experiments. This is what is super important in data science. I mean, not only in data science; in the engineering case it's pretty much the same. You need to understand what kind of code was used, what exact version of your code, and what exact version of the dataset was used. This is how you can trust an experiment that you did, let's say, three weeks ago. And one more thing is metrics. If you know that this code with this version of the dataset produced that particular value of the metric, it creates trust. You don't need to redo this job again. If you see this result in your Excel spreadsheet, there might be some discrepancy. Right? There can be an error in your process, and you might end up redoing the same stuff. And for documentation, you can use just a Git commit.
When you commit your result, which means you commit a pipeline, a version of your dataset, and the output results, you put a message. And a message in Git is a very simple form of your project documentation. And it is actually a very powerful form. What DVC does is basically make this work for data projects. Because in a regular workflow, you can commit code, but you don't have a connection with data. You don't have a clear connection with your output metrics. And sometimes you don't even have a connection with your pipelines, because a commit might be related to changes in one particular
[00:16:01] Unknown:
part of the pipeline, not the entire pipeline. And as far as the metrics that you're tracking, can you discuss a bit about how those are captured and some of the ways that they factor into the overall evaluation and experimentation cycle, to ensure that the model is producing the types of outputs that you're desiring in the overall project?
[00:16:20] Unknown:
DVC is, first of all, a language agnostic tool and framework agnostic. And DVC is metrics agnostic as well. It means that we track metrics in the simple form of text files. Your metrics file is a CSV file with some header, or a TSV file, or a JSON file, where you just output, say, five metrics with their values. And DVC allows you to look at your entire set of experiments and which metrics were produced. You can find that this idea, for example, which failed because one metric was not perfect, actually produced a better value for another metric, and you probably need to dive into that experiment. So it shows you metrics across your branches, across your commits, and it basically helps you to navigate through your complicated Git tree by metrics. It's kind of a data driven way to navigate through your Git repository.
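Because the metrics are plain text files, comparing experiments can be as simple as parsing them. The sketch below is a hedged illustration of that metrics-driven navigation, assuming each experiment directory holds a `metrics.json` file; the layout and the metric names are hypothetical, not a DVC convention.

```python
# Compare a single metric across experiment directories and find the
# best one. The metrics.json layout is an assumption for the example.
import json
import os

def compare_experiments(exp_dirs, metric):
    """Return (per-experiment metric values, name of the best experiment)."""
    values = {}
    for d in exp_dirs:
        # Each experiment directory is assumed to hold a metrics.json file.
        with open(os.path.join(d, "metrics.json")) as f:
            values[d] = json.load(f)[metric]
    best = max(values, key=values.get)
    return values, best
```

In real use the experiment list would come from Git branches or commits rather than sibling directories, but the comparison step is the same: parse each metrics file and rank.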
And in a data project, you might have thousands of commits and dozens or hundreds of branches, and DVC basically helps you to navigate through this complicated structure. This is why we need metrics. This is why DVC
[00:17:42] Unknown:
has special support for metrics. And so is there built in support for being able to search across those metric values and compare them across branches? Or is that something that you would need to implement on a per project basis, as far as a specific function call that determines whether it's a positive or negative value across those different comparisons?
[00:18:02] Unknown:
Today, we basically show the metrics. If you need to navigate, you probably need to implement something yourself. But I won't be surprised if we implement some metric-specific logic. For example,
[00:18:14] Unknown:
show me the maximum value of this particular metric, or show me the maximum of some combination of those metrics, something like this. And in addition to version control that's typically used for regular software applications, there are also a number of other types of tooling that are useful to ensure that the projects you're building are healthy and that you don't have code regressions. So things along the lines of linting or unit test support. And I'm wondering, what are some of those adjacent concerns that should be considered when you're building machine learning projects, and any ways that the work you're doing, either with DVC or any of your other projects, ties into that? Yeah. It's a good question because, in general, the toolset
[00:18:58] Unknown:
in data projects, it's not at the same status as the toolset for software projects. Right? There was an interview you did with Wes McKinney about a month ago, and he said that in data science we are kind of still in the wild west, and this is actually what is happening. We don't have great support for many scenarios. But from a tools point of view, what I am seeing today is that it is quite mature in terms of algorithms, because we have PyTorch, we have TensorFlow, we have a bunch of other algorithms, like random forests and other tree-based algorithms. And today, there is kind of a race of online monitoring tools.
For example, TensorBoard, where you can report your metrics online while you train and see what is actually going on in the training phase right now. It is especially important for deep learning because the algorithms are still quite slow, I would say. And there are a bunch of commercial products in this space. And MLflow, one of the open source projects which is becoming popular, helps you to track your metrics and visualize the training process. So this is kind of a trend today. Another trend is how to visualize your models, how to understand what is inside your models. And, again, there are a bunch of tools to do this, but the state of this tooling is still not perfect, I would say. In terms of unit tests, you can use just a regular one. Right? Just a regular unit test framework, but I couldn't say it works really well for ML projects specifically. But what I have seen many times is unit tests, or probably not unit tests but functional tests, in the data engineering part. When a new set of data comes into your system, you can easily get basic metrics and make sure there is no drift in the metrics, no big changes in the metrics. So this is how unit tests, or tests at least, can work in the data world. But the toolset in general is kind of still
[00:21:09] Unknown:
the wild west. And moving on to the data versioning that is built into DVC: as I was reading through the documentation, it mentions that the actual data files themselves are stored in external systems such as S3 or Google Cloud Storage or something like NFS. And so I'm wondering if you can talk through how DVC actually manages the different versions of that data, any type of diffing or incremental changes that it is able to track, and any difficulties or challenges that you faced in the process of building that system and integrating it with the version control that you use Git for in the overall project structure? Yeah. Of course, we don't commit
[00:21:52] Unknown:
data to the Git repository. We push data to your servers, to your clouds, usually, and you can configure the cloud. As I said, we treat data as binary blobs. And for each particular commit, we can bring you the actual datasets and all the data artifacts that were in use. We don't do any diffs per file, because you need to understand the semantics of a file in order to do diffs. Right? It's not Git. In Git, every file is text, so it makes sense to diff it. In data science, you need to know what exact format of a data file you use. However, we track directories as a kind of separate type of structure. If you have a directory with, let's imagine, 100,000 files, and then you add a few more files into the directory and commit this as a new version of your dataset, then we understand that only a small part of the directory was changed. If, let's say, 2,000 files were modified and 1,000 were added, then we version only the diff. So you can easily add your labels on a daily or weekly basis without any concern for the size of your directory. We do this kind of optimization.
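The directory optimization described here boils down to hashing each file into a manifest and diffing manifests between versions, so only new content needs to be stored or pushed. The following is a rough, hedged sketch of that idea; the `manifest` and `new_blobs` helpers are invented for illustration and do not reflect DVC's actual directory format.

```python
# Hash every file in a directory into a manifest; diffing two manifests
# shows that adding a few files to a huge dataset only requires storing
# the new blobs. Illustrative only.
import hashlib
import os

def manifest(directory):
    """Map each relative path in the directory to its content hash."""
    entries = {}
    for root, _, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            entries[os.path.relpath(path, directory)] = digest
    return entries

def new_blobs(old, new):
    """Content hashes present only in the new version: the only data
    that actually needs to be uploaded for the new dataset version."""
    return set(new.values()) - set(old.values())
```

Versioning the manifest rather than the directory itself is what makes daily label additions cheap: unchanged files contribute nothing new to storage.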
And another important optimization that we do: when you check out files from the internal structure, it creates a copy in your workspace. However, in the data world, sometimes that just does not make sense, because you don't want to create another copy of, let's say, 100 gigabytes of data. We optimize this process through references, so you are not having a duplication of your dataset. So you can easily work with dozens or hundreds of gigabytes
[00:23:48] Unknown:
without this kind of concern. For somebody who is onboarding onto an existing project and just checking out the state of the repository for the first time, is there any built in capacity for being able to say: I only want to pull the code, I'm not ready to pull down all the data yet; or, I just want to pull down a subset of the data? Because somebody who's working on a multi hundred gigabyte dataset doesn't necessarily want to have all of that located on their laptop as they're building through these experiments. And I'm just curious what that overall workflow looks like as you're training the models when you're working locally, and how it handles interacting with these large data repositories to make sure that it doesn't just completely fill up your local disk. Yeah. This is a good question.
[00:24:28] Unknown:
Yes. We do this granular pull. We kind of optimize this as well. And you as a data scientist can decide what exactly you need. For example, if you'd like to deploy your model, which is probably within 100 megabytes, you probably don't need to waste time pulling the 100 GB dataset which was used in order to produce the model. You can clone your repository first, with the code and the metadata files, and then say: I need just this data file, I need just the model file. And only the model file will be delivered to your production system, to your deployment machine. And the same for datasets.
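The granular pull described here can be pictured as checking out only selected artifacts from a content-addressed cache. The sketch below is a hedged illustration; the manifest shape, the `pull` function, and the file names are all invented for the example and are not DVC's real checkout mechanism.

```python
# Materialize only the requested artifacts (e.g. just the model, not the
# full training set) from a cache keyed by content hash.
import os
import shutil

def pull(manifest, cache_dir, targets, workspace="."):
    """Copy only the listed workspace paths out of the cache."""
    pulled = []
    for path in targets:
        digest = manifest[path]  # KeyError means the artifact is untracked
        src = os.path.join(cache_dir, digest)
        dst = os.path.join(workspace, path)
        parent = os.path.dirname(dst)
        if parent:
            os.makedirs(parent, exist_ok=True)
        shutil.copy2(src, dst)
        pulled.append(path)
    return pulled
```

A deployment machine would call this with only the model's path, leaving the multi-gigabyte training data on the remote.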
[00:25:11] Unknown:
And there are some other projects that I've talked about, and talked with the people who are building them, such as Quilt or Pachyderm, that have built in support for versioning of data. And I'm wondering if you have any plans currently to work on integrating with those systems, or just the overall process of what's involved in adding additional support for the data storage piece of DVC?
[00:25:35] Unknown:
Yeah. For some of the systems, integration can be done easily. For example, Pachyderm: it's a project that's mostly about data engineering. They have a concept of pipelines, kind of data engineering pipelines, and DVC can be used in data engineering pipelines. DVC has a notion of ML pipelines. It's kind of a lightweight concept. It's optimized specifically for machine learning engineers. It doesn't have all the complexity of data engineering pipelines, but it can easily be used as a single step in an engineering pipeline. And we have seen that many times, when people take, for example, a DVC pipeline and put it inside Airflow as a single step. And this kind of design is actually a good design, because you give a lightweight tool to ML engineers and data scientists so they can easily produce a result and iterate faster.
And you have a production system where DVC can easily just be injected inside. And there is a term; I don't remember which company used it. Netflix, probably. They used the term DAG inside a DAG, which means you have a DAG of data pipelines and you have an ML DAG for one particular problem, and you basically inject a number of ML DAGs inside the data engineering DAG. So from this point of view, there is no problem to integrate. There are no issues with integration with Pachyderm or Airflow or other systems. Regarding Quilt Data, they do versioning. They work with S3 as well. Potentially, we can be integrated with that, and we are thinking about this.
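The "DAG inside a DAG" pattern amounts to the outer pipeline treating the whole ML pipeline as one opaque step that shells out to a command. A minimal sketch, with placeholder commands standing in for real stages; in practice the ML step would invoke the ML pipeline tool (for example, a `dvc repro`-style command) rather than a trivial print:

```python
# Run an outer pipeline where one step wraps an entire inner ML
# pipeline behind a single command. Commands below are placeholders.
import subprocess
import sys

def run_pipeline(steps):
    """Run each (name, command) step in order; fail fast on error."""
    completed = []
    for name, cmd in steps:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"step {name!r} failed: {result.stderr}")
        completed.append(name)
    return completed

# An outer data engineering DAG; the whole ML DAG hides behind one step.
outer = [
    ("ingest", [sys.executable, "-c", "print('ingest')"]),
    ("ml_pipeline", [sys.executable, "-c", "print('inner ML pipeline')"]),
    ("publish", [sys.executable, "-c", "print('publish')"]),
]
```

An orchestrator like Airflow would express the same structure with task operators instead of a plain loop, but the boundary is identical: the inner ML DAG is a single node of the outer DAG.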
But we are trying to be consumer driven, customer driven, and the biggest need today is probably integration with MLflow, because MLflow shines really well on the online metrics tracking side. And sometimes people like to use MLflow for tracking metrics online and DVC for their data files. And this is one of the integrations that we are thinking about
[00:27:49] Unknown:
today. And so in terms of the actual implementation of DVC itself, I know that it's primarily written in Python, and you mentioned that that's largely driven by the fact that it is becoming the lingua franca for data scientists. So I'm wondering, now that you have gone a bit further in the overall implementation and maintenance of DVC, if you think that that is still the right choice. And if you were to start over today, what are some of the things that you would do differently, either in terms of language choice or system design or overall project structure?
[00:28:22] Unknown:
Regarding language, I believe Python is a good choice for this kind of project, for two major reasons. One, we are targeting data scientists, and most of them are comfortable with Python, and we expect data scientists to contribute to our code base. If you write this kind of project in, let's say, C or C++ or Golang, you probably won't see a lot of contribution from the community, because the community speaks a different language. For us, it works perfectly. Data scientists are contributing to the code, which is great. And the second reason was programming APIs.
Before, we were thinking about exposing DVC through APIs. It's another option for using DVC. And if you write the code in Python, it kind of comes out of the box. You can reuse DVC as a library and inject it into your project. If you use a different language, it just creates some overhead; you need to think about wrapping it into a nice form. Those were the major reasons. And so far, we are happy with Python, and it works nicely. And you mentioned being able to use
[00:29:43] Unknown:
DVC as a library as well. And so I'm wondering if there are any use cases that you've seen that were particularly interesting or unexpected or novel, either in that library use case or just in the command line oriented
[00:29:57] Unknown:
way that it was originally designed? Oh, sometimes people ask for library support because they need to implement some more advanced scenarios, I would say. For example, people use DVC to build their own platforms, if you wish, data science platforms. They'd like to build a kind of continuous integration framework, where DVC plays the role of the glue between your local experience and the CI experience. They ask for libraries, but we have such a great command line toolset that people just switch back to the command line experience. But one day, I won't be surprised if someone uses DVC purely as a library. And I'm also interested in the idea of wrapping
[00:30:51] Unknown:
a single implementation step for running the model training piece of it. So I'm wondering if you can talk a bit more about that and some of the specific implementations that you've seen. Yeah. Absolutely. This is actually a really good question.
[00:31:06] Unknown:
I believe that data engineers need pipelines, right? And data scientists and machine learning engineers need pipelines. But the fact is, their needs are absolutely different. Data engineers care about a stable system. If it fails, the system needs to do something; it has to recover. This is the primary goal of a data engineering framework. In data science, it works the opposite way. You fail all the time. You come up with some idea, you write code, you run the code, it fails. You fix it, it fails again, etcetera, etcetera. And your goal is to have a framework to check ideas fast, to fail fast. This is the goal of ML engineering. So it is a good practice to separate the two kinds of pipeline frameworks.
One is the stable engineering pipeline, and the second is a fast, lightweight experimentation pipeline, if you wish. And when you separate these two worlds, you simplify the life of ML engineers a lot. They don't need to deal with complicated stuff. They don't need to waste time on understanding how Airflow works, how Luigi works. They just live in their world and produce models. And once the model is ready, they need a clear way to inject this pipeline into the data pipeline. And you can build a very simple tool to do this. I remember when I worked at Microsoft, it took me maybe a couple of hours to productionize my pipeline, because I had a separate workflow, a separate tool for ML pipelines.
And it works nicely. I believe this is the future of engineering: we need to separate these two things. And I'm also interested in the deployment capabilities
[00:33:00] Unknown:
that DVC provides, as far as being able to put models into production or revert the state of the model in the event that it's causing problems for the business, or just some of the overall tooling and workflow that is involved in running machine learning models in production, particularly as far as metrics tracking to know when the model needs to be retrained, and just closing the loop of that overall
[00:33:34] Unknown:
Yeah. Deployment is something we are very interested in, because it's close to the business. And there are a lot of funny stories about ML model deployment. Sometimes it goes like this: a software engineer asks the data science team, can we revert our model, not to the previous one, but to the model from the previous week? And the data scientist says, yeah, sure. We have the code, we have the datasets, I just need, like, five hours to retrain it. And it does not make any sense, right, to spend five hours to revert a model. In the software engineering world, it does not work this way. You need to have everything available, and you need to revert right away, because waiting five hours means wasting money for the business. And DVC basically helps you to organize this process. It basically creates a common language between your data scientists, who produce the model, and the ML engineers, who take the model and deploy it. So next time, with proper data management, you won't even be asking data scientists to give you the previous model. You should have everything in your system, DVC or not DVC, it doesn't matter, but you need to have all the artifacts available. And from the metrics tracking point of view, this is actually a separate question, because when you're talking about metrics tracking in production, it usually means online metrics.
It usually means metrics based on a feedback loop from users, and that is a separate question. So DVC, deployment aside, is mostly about the development phase. Today it basically does nothing with online metrics.
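The revert scenario described above boils down to two commands once every artifact is tracked. A minimal sketch, assuming the model file is tracked by DVC and each release corresponds to a Git tag (the tag names here are illustrative):

```python
def revert_model_commands(release_tag):
    """Commands to restore a previously deployed model without retraining.
    `git checkout` restores the code, parameters, and .dvc pointer files;
    `dvc checkout` then pulls the matching model binary from the DVC cache."""
    return [
        f"git checkout {release_tag}",
        "dvc checkout",
    ]

# Rolling back to last week's release would then be, e.g.:
#   for cmd in revert_model_commands("release-2019-03-18"):
#       subprocess.run(cmd.split(), check=True)
```

The point is that the revert takes seconds rather than five hours, because the model artifact itself is versioned alongside the code that produced it.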
[00:35:24] Unknown:
You are actually building and maintaining this project under the auspices of Iterative.ai, which is a venture backed company. So I'm curious what the overall value equation is for your investors that makes it worthwhile for them to fund your efforts on building and releasing this open source project, and just the overall strategy that you have for the business?
[00:35:49] Unknown:
You would be surprised how interested investors are in open source projects today. Especially last year was super successful for open source projects. Last year, MuleSoft was acquired for billions of dollars. Last year, Elasticsearch went IPO; it's purely an open source company. And when you do open source, it usually means that you are doing IT infrastructure, in many cases, and IT infrastructure is a good area for monetization. Around a successful open source project, there are a bunch of companies monetizing the open source. It's just very important to understand your business model, because with open source there are a few common models.
One is the obvious service model, a kind of consultancy model. The second is the open core model, when you build software and sell a piece of the software with advanced features for money, or a different version of your software as a product for enterprises. And the third model is the ecosystem model, when you build an open source product and create services as a separate product. One example might be Git and GitHub: there is the open source project and the SaaS service, which is an absolutely different product, with a different experience and use cases. And you need to understand which model you fit in.
And for a successful project, yeah, a lot of VCs are interested in these kinds of businesses. Initially, I started this as a pet project, and it was my pet project for about a year. And then I was thinking about how to make something big out of this, how to spend more time on this, how to find more resources in order to do this. And it was clear that if this project is successful, a few businesses will be monetizing this area. And why shouldn't we be the business which builds the product and monetizes it? So it's kind of a natural path for a modern open source project, I would say. And as far as your overall experience
[00:38:04] Unknown:
of building and maintaining the DVC project and the community, I'm wondering what you have found to be some of the most interesting or challenging or unexpected lessons learned in the process.
[00:38:14] Unknown:
One of the lessons that I learned, and I think it's a usual business lesson, actually: when you build your project, you know what to do. You know your roadmap, you know your goal, and you are just building. But one day, users come to you, and they ask for a lot of different stuff. And then you have a kind of tension, a tension between your vision, your plans, and the demand from the user side. And today, we are at the point where every day we have a few requests from users. Sometimes we have, like, ten requests per day, and it's not easy to balance things.
Because if you do everything people ask, you have no time for your roadmap. And, actually, you have no time to fix and implement everything that people ask anyway. So you need to learn how to prioritize things. You need to learn how to sometimes say no to users, or say, we will do this, but probably not right now. And yeah, this is not easy to do. This is what you need to learn during the process. And as I said, it's not something new to open source. It was the same way in the business environment, and I have seen that many times. And looking forward in terms of the work that you have planned for DVC, I'm curious what types of features or improvements you have on the roadmap. Regarding the features for the near future, we are going to release better support for dataset versioning and ML model versioning.
We are introducing two new commands into DVC which simplify your experience. Today, some companies are using a monorepo with a bunch of datasets inside, and we need new commands to version them better, because these datasets evolve at different speeds, and you need to work with one version of one dataset and another version of another dataset. So this is one of the steps we are taking. And another use case for datasets is cross-repository references. Other companies are not using a monorepo; they're using a set of repos. For example, they might have ten or twenty repos with datasets and twenty more with models, and they need to cross-reference the datasets. This is the next command we are going to implement, to support these cross-repository scenarios.
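The command names are not given in the conversation; DVC later shipped `dvc get` and `dvc import` for exactly this cross-repository scenario, so the sketch below uses those names, but treat the exact interface as illustrative for the version you are running.

```python
def cross_repo_reference(repo_url, path, out=None, tracked=True):
    """Build the CLI invocation for pulling a dataset out of another DVC
    repository. A tracked import records the source repo and revision so
    the reference can be updated later; otherwise it is a plain download."""
    cmd = ["dvc", "import" if tracked else "get", repo_url, path]
    if out is not None:
        cmd += ["--out", out]  # where to place the dataset locally
    return cmd
```

So a model repo could declare a tracked dependency on `images` in a separate dataset repo, instead of copying the data into every repository that needs it.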
This is our near future. And in the longer term, the next step would be to implement more features for better experiment support, especially when people deal with scenarios such as hyperparameter tuning, where they need, let's say, a thousand experiments and still need to control them. They don't want to have a thousand branches. This is the experience we need to improve, and we have a clear plan for how to do this. And that's pretty much it for the next maybe half a year. Eventually, we believe that DVC can be a platform where people can work in the same environment in one team and share ideas with each other.
And in the future, we believe we can create a great experience where people can share ideas even between companies, between teams. This is the big
[00:41:46] Unknown:
future of DVC that I believe in. And are there any other aspects of the work that you're doing on DVC, or the overall workflow of machine learning projects, that we didn't discuss yet that you think we should cover before we close out the show? I don't think I have anything to add, but
[00:42:03] Unknown:
what I believe in is that we need to pay more attention to how we organize our work. We need to pay more attention to how we structure our projects, and we need to find the places where we waste our time instead of doing actual work. It is very important to be more organized, more productive as data scientists, because today we are still in the Wild West, and that needs to change as soon as possible. And it is important to pay attention to this. It's important to understand
[00:42:39] Unknown:
this problem set. Alright. Well, for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose a tool that I started using and experimenting with recently called Otter.ai. It's billed as a voice note taking service that will transcribe your meeting notes, or just mental notes to yourself, to text so that they're searchable. And I've been experimenting with using it to try out generating transcriptions for the podcast. So I'm looking forward to using it more frequently and starting to add transcripts to the show. It's definitely worth checking out if you're looking for something that does a pretty good job of generating transcripts automatically
[00:43:27] Unknown:
and at a reasonable price. So with that, I'll pass it to you, Dmitry. Do you have any picks this week? So I thought that the open source part, open source versus venture capital, would be the question we discussed. Actually, I have nothing special to suggest. But today the weather is nice, and spring has just started. I suggest spending more time outside, reading outside and walking around the city or town.
[00:43:54] Unknown:
Alright. Well, thank you very much for taking the time today to join me and discuss the work that you're doing on DVC and adding some better structure to the overall project development of machine learning life cycles. So thank you for that, and I hope you enjoy the rest of your day. Thank you, Tobias. Thank you.
Introduction to Dmitry Petrov and His Background
Overview of DVC (Data Version Control)
Differences Between Software Engineering and Machine Learning Workflows
Tracking Experiments and Metrics with DVC
Tooling and Adjacent Concerns in Machine Learning Projects
Data Versioning and Storage Management in DVC
Integration with Other Systems and Tools
Deployment and Production Considerations for Machine Learning Models
Business Model and Investor Interest in Open Source Projects
Challenges and Lessons Learned in Building DVC
Future Features and Improvements for DVC