Summary
Collaborating on software projects is largely a solved problem, with a variety of hosted or self-managed platforms to choose from. For data science projects, collaboration is still an open question. There are a number of projects that aim to bring collaboration to data science, but each solves a different aspect of the problem. Dean Pleban and Guy Smoilovsky created DagsHub to give individuals and teams a place to store and version their code, data, and models. In this episode they explain how DagsHub is designed to make it easier to create and track machine learning experiments, and how it serves as a way to promote collaboration on open source data science projects.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Dean Pleban and Guy Smoilovsky about DagsHub, a platform to track experiments, and version data, models & pipelines for your data science and machine learning projects.
Interview
- Introduction
- How did you first get introduced to Python?
- Can you start by describing what the DagsHub platform is and why you built it?
- There are a number of projects and platforms that aim to support collaboration among data scientists. What are the distinguishing features of DagsHub and how does it compare to the other options in the ecosystem?
- What are the biggest opportunities for improvement that you still see in the space of collaboration on data projects?
- What do you see as the biggest points of friction for building experiments and managing source data collaboratively?
- Can you describe how the DagsHub platform is implemented?
- How have the design and goals of the system changed or evolved since you first began working on it?
- How has your own understanding and practice of working on data science/ML projects changed?
- GitHub has a number of convenience features beyond just storing a git repository. What are the capabilities that you are focusing on to add value to the data science workflow within DagsHub?
- How are you approaching the bootstrapping problem of building a critical mass of users to be able to generate a beneficial network effect?
- Are there any conventions that make it easier or more familiar for newcomers to a given project? (e.g. code layout, data labeling/tagging formats, etc.)
- What are your recommendations for managing ownership/licensing of data assets in public projects?
- What are some of the most interesting, innovative, or unexpected ways that you have seen DagsHub used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building DagsHub?
- When is DagsHub the wrong choice?
- What do you have planned for the future of the platform and business?
Keep In Touch
Follow us on Twitter or LinkedIn, join our Discord, sign up to DAGsHub
Picks
- Tobias
- The Remarkable Journey of Prince Jen by Lloyd Alexander
- Dean
- Quantum Computing Since Democritus by Scott Aaronson
- The Expanse TV Series
- Guy
- Try to consume only the very best of available content, not the things that are coming out right now.
- Applies to textbooks, TV shows, movies
- Less Wrong blog
- Slate Star Codex / Astral Codex Ten
- Avatar: The Last Airbender
- 3 Blue 1 Brown YouTube Channel
- Haskell
- Clojure
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle-tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Dean Pleban and Guy Smoilovsky about DagsHub, a platform to track experiments and version data, models, and pipelines for your data science and machine learning projects. So, Dean, can you start by introducing yourself?
[00:01:11] Unknown:
My name is Dean. And together with Guy, I'm one of the founders of DagsHub. My background is a bit of a combination of physics, computer science, data science, and design. And I am excited about data science and DAGs. I'm Guy, the second half of the team.
[00:01:29] Unknown:
My background is mostly from software, some data engineering and other ML related things. Just wanted to say it's great to be on the podcast. I've been listening to both of your podcasts for a long time, so it's super exciting.
[00:01:45] Unknown:
Yeah. It's definitely great to have the both of you on here. And going back to you, Dean, do you remember how you first got introduced to Python?
[00:01:52] Unknown:
The first time I started writing Python was in my free time, so probably high school. It seemed like a very easy language to start with, so I tried out a few things. I think it was aside from school projects, because we were studying Pascal. So Python was in spare time, and then I had a few more opportunities to work with Python during my undergraduate studies, where I did a bit of machine learning work. That was obviously Python, but also a bit of physics research in quantum optics where we used mostly MATLAB, but I did take every opportunity I could to write in Python because it was so much fun.
[00:02:33] Unknown:
And, Guy, do you remember how you first got introduced to Python? I think it was at a very early age.
[00:02:38] Unknown:
Sometime during probably middle school or high school. So not really. But I always thought, like, I don't know why people didn't catch on earlier that it's just easier to do things in Python. Right now, it seems to be the status quo.
[00:02:55] Unknown:
The both of you have been working on building up this platform in the form of DAGS Hub. I'm wondering if you can just give a bit of an overview about what it is that you've built and your motivation for starting on that journey.
[00:03:06] Unknown:
Basically, I've been a software developer, like, professionally for a bit more than 10 years now, in a variety of fields like mobile, cyber, all kinds of unrelated areas. But my last job, which I did for 3 years, was as one of the developers in a startup called Glassbox Digital, where I came in to do, like, heavy duty data engineering, like how to ingest and handle data rates of, like, gigabits per second, push it into Elasticsearch, and do all kinds of analytics on top of it. That's kind of where I first got my feet wet in data engineering and data science. We had all kinds of projects, mostly around time series anomaly detection.
And while doing that, I also, like, took one day a week to do a master's degree in computer science, where I basically took all of the possible courses in machine learning and deep learning, and there was a year where I basically read every paper on arXiv. I don't know how, but I found the time, and I dove headfirst into it. So during one of these courses, where we were trying, as a group of students, to reproduce some state of the art deep learning model for face detection, I basically became frustrated with how I was working. Like, okay, so I'm a software developer. I really feel strongly about DevOps, about Git. Like, I'm always the Git guy in the company, wherever I was.
And while working in this team, I basically found out that I didn't know how to work together with the other two students who were on my team. Basically, what ended up happening is I told them, okay, listen, I don't know. Like, I have this notebook. I am working on it. Maybe I'll let you know when I'm done, and you can do something on top of it. I don't know how we can work in parallel, how we can split the work. And I found myself doing things which angered me. Like, I caught myself working the way I would before I knew what Git was, which was, for example, I created, as part of the notebook, a script to run an experiment and save it into some folder, with the name of the folder being hyperparameter name_value, and kind of a chain of that. So lr_0.01 and other hyperparameters like this, and then writing scripts to iterate over the folders and try to extract some information from them. So then I noticed that, wait, before I knew what Git was, I would kind of copy code into folders and do version management like that.
And I said, okay, I'm just being silly. Like, someone has solved this, the data scientists in the world who have been working for a long time. So I went out and searched for how I should be doing this. And what I found was some promising solutions, but mostly everyone saying, we don't know. This is how it is, and we don't have an agreed solution. So that's kind of where the idea came from. I especially liked DVC. DVC is Data Version Control, an open source project to kind of extend Git to give it better capabilities for data versioning and other relevant features, and I guess you had an episode about them already.
So I liked that, but what I was missing is, like, the GitHub to close this story, because there's still a lot of tooling needed around that. So that was kind of the origin story: how I was frustrated with the current state of affairs and kind of had a vision of where the world should be. And I had just quit my job due to moving apartments, and Dean had just finished his degree, and Dean and I have been friends since kindergarten, which we didn't mention earlier. So we've been friends
[00:07:03] Unknown:
a couple of decades. Friends for a long time.
[00:07:06] Unknown:
And we've always done projects together. We both like software, and we kind of complete each other, because Dean is like the design guy, and I'm like the back end logic guy. And we basically dove into it, and I kind of quit my degree. So I'm a college dropout, which means I check that checkbox. And that was the origin story.
[00:07:30] Unknown:
Building a company together, particularly after having been friends so long, is definitely a good test to see how well you guys actually get along, because I'm sure that there are points of stress in the process of building a business around this that strain your existing relationship.
[00:07:44] Unknown:
I think we're doing good. Like, it's been 2 plus years.
[00:07:48] Unknown:
I feel like, in a sense, we knew what we were getting into. Like, this is the first startup that we're doing together, but not the first project that we're putting an intensive amount of time into doing together. So in a way, like, when you look at the process of finding cofounders, that's always a challenging issue. But for me, I felt like it was a very straightforward process because it was very clear that Guy and I should do something together at some point. And when we landed on this idea, it was very clear that this is the idea that we should pursue together. I feel like I couldn't have chosen a better cofounder, at least from my end.
[00:08:25] Unknown:
From looking at the website, it seems that it's still just the 2 of you who are building out this platform, which given the amount of features that you've already built into it and how far along you've gotten, it's definitely impressive that you've managed to make that progress with just the 2 of you.
[00:08:39] Unknown:
We do have an additional 2 developers that we need to add to the website and another 2 joining very soon. So we are growing the team, and we haven't built everything on our own. But I think even for that size of the team, we've built quite a few interesting things, and I'm really proud of what the team has been able to create over the past year. As I always like to say, like any good developers, we cheat. You take something that works well, and you use it.
[00:09:07] Unknown:
Yeah. That's the secret.
[00:09:09] Unknown:
It's not cheating. It's called standing on the shoulders of giants. Exactly. And so you mentioned that a big part of the motivation for building DagsHub was your frustration with the existing platforms that aim to support some of the complexities of collaboration for data scientists and machine learning engineers. And I'm wondering if you could just give a bit of an overview of the features of DagsHub that you think make it stand out in the field, and maybe compare it to the other options that are available in the ecosystem and how it fits with them or might compete with some of the existing capabilities?
[00:09:53] Unknown:
Maybe to start, because I didn't exactly explain what DagsHub is right now. It's a platform, a site, where you can collaborate on open source data science projects, like GitHub for machine learning. The analogy is very apt because you can actually git push your data science projects and get features that are more useful specifically for data scientists, versus software 1.0 developers, like data and model hosting, and the ability to contribute not just code changes but the more important parts, like contributions to data and new experiments, review tools, automatic reproducibility, experiment tracking.
That leads maybe to the things that we do differently. Dean? I think that
[00:10:39] Unknown:
specifically, one thing that we have in front of our eyes, which is very important to us, is the whole open source data science collaboration, which is something that not a lot of people are speaking about. But what we mean by that is that the technology, the platform, should enable two people that don't know each other personally, that are not working next to each other or might be on opposite sides of the globe, to actually work together on a data science project. Our sort of vision is that, in the end, open source software has made such a huge impact on the world, and we believe that data science is going to go through that same process. So we want to provide the tooling that enables that to happen and help the people that are working on these projects work more effectively. And I think that with respect to other solutions surrounding collaboration and data science, the term collaboration is challenging.
It's ambiguous, and a lot of people and tools use it but mean very different things. The best example is: you can say that pair programming on a Jupyter Notebook, that's a type of collaboration. But what we mean with DagsHub when we say collaboration is that you have a place where you can see all the meaningful components of your project, like the code, the data, the models, experiments, pipelines, and you can compare their versions and contribute to them in an effective way. And I think that's sort of where we are coming from. I want to explain what we do mean when we say collaboration.
[00:12:08] Unknown:
So if you look at how collaboration on a project looks in reality, it involves several steps that get repeated constantly. As a contributor, I want to discover what someone else did. I want to understand it, reproduce it, change it, and then contribute the changes back. And this is exactly what Git and GitHub enable us to do with code, even if I don't know the creator of the project. I can find some repo on GitHub, read the docs, try to look at the existing open issues and branches, look at the git blame and the history, and kind of understand the context: what was done, when, by whom, and why.
And then when I actually want to work on it, I can git clone, check out, set up the environment, which is maybe the difficult part, but hopefully someone has set up at least a Docker container or something. But let's assume it's not impossible. I start hacking on the code, and when I think I have made a useful contribution, I can commit it, push it, open a PR, CI can run, do linting, testing, whatever. I can get the code review saying what should be changed, and after fixing, finally, the maintainer of the project can click a button and it's merged, and it will be part of the next release. If any of those steps becomes too hard, then although collaboration is still theoretically possible, like, I can email code patches like the Linux kernel used to do in the past.
So it's theoretically possible, but it will be almost nonexistent. So what we set out to do is to try to close this loop that I described for data science. Some other tools say collaboration, but they don't allow you, in the end, to click a button to say, okay, these experiments are good, I like them, after I've discussed them with the maintainer, who maybe asked for more experiments or maybe asked for more changes to the data or model. In the end, being able to click a button and integrate those changes, that's kind of the missing part of the puzzle that we haven't seen solved anywhere else except, like, using Git and GitHub for software, and we're trying to do that for data science.
[00:14:28] Unknown:
As you put it before, Tobias, we are trying to build on the shoulders of giants. So we think that this is important as a platform that looks at itself as community first, which is that we want to only be based on open source tools and open formats wherever it is possible. Today, we are already based on some of the great tools that everyone knows, like Git, DVC, MLflow, and we're adding support for more of these as time goes by. Which also means that one of the issues that we're seeing, it's sort of a meta issue in the ML field, is that there are so many tools and so much reinventing of wheels going on that practitioners have a huge set of tools to evaluate, and they never get to the bottom of it. So it's really hard to decide what to use. And what we're trying to do with DagsHub is to entirely avoid that. The approach we have is basically saying, here are the tools that we support, which are open source, widely adopted tools, and then we're creating a very convenient way to interface with them and collaborate on top of them, which I also think is a very important distinction.
[00:15:37] Unknown:
Maybe to make it more concrete, imagine you have, like, a project without DagsHub. So you have a myriad of tools that log information about your training runs, and you have maybe Git to version your code, if you're being orderly and not just doing everything kind of ad hoc in a notebook. And maybe all that data is stored in some S3 bucket and you have a URL pointing to it in your code files. But in the end, tying together all of this information into one coherent picture that lets you actually review and compare versions, it's hard work. And we've talked to a lot of teams that, due to it being hard work, do it not as often as they should, and they've built a lot of custom tooling and a lot of ad hoc solutions for this. So that's very hard, and that's not even mentioning how hard it is to reproduce someone else's work when it's scattered all over a bunch of tools. And we think it should be one command. Not to mention actually merging it after you've reviewed it, and not to mention actually trying to do all of that with complete strangers from the Internet who don't have access to your S3 or wherever you store your data.
[00:16:46] Unknown:
There are a couple of things that I want to tease out of that. One of them is the existing pain points that are still there for being able to build experiments and manage the source code and the source data and collaborate on projects as a data scientist. And then the other aspect is some of the useful patterns that you've seen grow in terms of making it easier for new contributors or new users of a given data science project or set of experiments to be able to familiarize themselves with the code and with the project and with the data, similar to the conventions that have grown up around different areas of software engineering.
So I guess maybe starting with the pain points that still exist in terms of what are the points of friction that are still around for being able to collaborate on data science.
[00:17:37] Unknown:
In the long term, the challenge is scale. Right now, there's a barrier to entry for serious, quote, unquote, machine learning work in a community setting, since you might need a lot of computing power and huge datasets to do state of the art work. And even in smaller projects, there are still trade-offs that require more thought than traditional software. Like, I can't retrain my model every time someone pushes a new commit to a pull request, the way I could run a code linter or unit tests, which take 3 seconds in normal continuous integration. Also, data transfer costs in most cloud providers are ridiculously expensive, and that requires attention. That means, like, you can't do things without thinking about it. And teams will need to be more selective about these things, like when do I run automation, and it breaks some of the existing abstractions about collaboration and CI, which will require adjustments.
But those are problems that I think will become less of an issue over time, because compute does get cheaper and the methods become more efficient. And besides that, we also believe that there are plenty of interesting things and opportunities to exhaust before we say, let's do an open source GPT-5 or something like that. We have a lot of projects on the platform which are doing very interesting things just with tabular data, in not-huge scenarios. But those are definitely challenging issues which need to be solved somehow.
[00:19:11] Unknown:
Even with the example of GPT-5, we can see that now there are teams that are actually working on open sourcing an implementation of GPT-3. They do need external support because, obviously, that costs more than time, but it's happening. And so I feel like even now, with all of those limitations, these things are happening. So there's room for hope that it will be more common in the future. I think the same objections used to exist about all kinds of software that we now take for granted. Like, you used to need a supercomputer to run
[00:19:43] Unknown:
Node.js or something. And I think the maybe biggest challenge is actually our mindset when we create a project. I don't think there's a magic bullet, like something I can pip install, that causes a project to be collaboratable without the maintainers of the project making a conscious decision and effort to make the trade-off, to constantly fight against entropy. And that doesn't mean that every project has to make this trade-off. Sometimes we create a one-off, something that we want to be quick and dirty, and be creative and get some kind of result that we can learn from, and then pick up the pieces later. I also do that in software. Like, sometimes I open a REPL and I just start doing stuff, and I learn very fast, but then I would take what I learned and try to make code out of it. So what this does mean is that data scientists who do work in collaborative projects need to think of themselves as part of a larger system that needs to interoperate, and not as completely independent research mavericks who do very creative work but later leave the pieces on the ground to be picked up by someone else. So practically speaking, that means, as data scientists, to refactor your code into orderly modules and repeatable functions, kind of use best practices like Git, and that's kind of a cultural thing. I think it's inevitable, like, as people work on specific types of projects. Let's say, for example, if my deliverable is a research paper or a Kaggle competition submission or something, then maybe I don't care. Like, as long as I can make sense of what I did, that it can run once and I get the result, and that's great. But if it's a model that's going to be deployed to production as part of some product, and it's going to be iterated on as we discover problems in the data and the code, then definitely you need to have that mindset of: this is a long term project that I need to organize.
[00:21:46] Unknown:
Going a bit more into the useful patterns to make a project easier to dive into and get familiar with. So you mentioned, for instance, in software projects, maybe having a Docker container that makes it easy to get the environment set up and running. Or if you are used to working on Django projects, there's a particular directory structure that you're familiar with, as far as the different portions of the app are split out into sort of subapplications, and the directory structure reflects that. I'm wondering if there are any useful patterns in terms of the labeling paradigms for the source data or the data structures or the layout of the data that's useful for making projects easier to collaborate on, or the formatting or structure of the code, and some of the useful patterns or abstractions that you've seen grow up as people are starting to collaborate more on these data science projects, and particularly in the context of DagsHub?
[00:22:44] Unknown:
I think definitely there are some things. Like, we support using cookiecutter data science as a project template. Cookiecutter data science is a project that's not related to DagsHub directly. It's like a standard folder structure for data science projects, with, like, clearly defined stages: here is the folder with the raw data, here is the folder with the processed data, and here is the Python file that turns the raw data into processed data. So that's definitely useful. And I think maybe the first feature we developed in DagsHub is the ability to see the data pipeline next to your code. So this is actually the DVC pipeline.
DVC is another of those abstractions which make it easier to dive into a project. Just like, if you know Git, it's much easier to dive into any project and at least see the history, the same can be said of DVC. Any project that uses it, you can now take a look at its data pipeline, see what turns into what and what happens before what. And in DagsHub, we also provide a UI for it and also allow you to kind of browse the data itself and play with it, interact with it. In terms of other abstractions, I think one thing which we are focusing on, which requires some consolidation but is still very flexible, I guess, is review tools.
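(Editor's note: a DVC pipeline like the one Guy describes is declared in a dvc.yaml file. The sketch below is illustrative only; the stage names, scripts, and paths are hypothetical, loosely following the cookiecutter data science layout mentioned above, and assume parameters live in a params.yaml file.)

```
# dvc.yaml -- hypothetical two-stage pipeline
stages:
  process:
    cmd: python src/data/make_dataset.py data/raw data/processed
    deps:
      - src/data/make_dataset.py
      - data/raw
    outs:
      - data/processed
  train:
    cmd: python src/models/train_model.py data/processed
    deps:
      - src/models/train_model.py
      - data/processed
    params:        # read from params.yaml by default
      - lr
      - epochs
    outs:
      - models/model.pkl
    metrics:
      - metrics.csv:
          cache: false
```

Given a file like this, `dvc repro` runs only the stages whose dependencies changed, and tools such as DagsHub can render the stage graph next to the code.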
For example, a real data science project is not one accuracy metric which I submit to Kaggle. Like, in reality, you have a lot of trade-offs. Maybe the model is a bit more accurate, but it takes a hundred times longer to run, for example, or maybe it's biased towards something or other. Then I can't look at one metric; I have a lot of considerations to make. So one pattern which we find very useful is to have, let's say, a notebook which is the output of an experiment, so that you can see side by side all kinds of different aspects that changed, like see different loss curves side by side, see the different metrics side by side. So again, I'm not talking about using it exactly for the source code, but for visualizing what changed, that's very useful, and looking at it as part of the pull request and the review process.
[00:25:12] Unknown:
The first thing that Guy mentioned, about cookiecutter data science, is also interesting, because when we look at projects on DagsHub that aren't using the cookiecutter data science template, many of them converge to that structure anyway. So I feel like it's something of an evolutionary process where it's just a structure that makes sense, so many projects have it. I think another thing which we are also doubling down on is using generic formats wherever possible. So, concretely, we're saving metrics in a CSV format and saving parameters in a YAML format, which is human readable and very easy to understand. So that means it's really intuitive to share it with other people and for them to see what's going on.
Also, it's very portable. You can load it up into a Jupyter Notebook and analyze your results if that's something that you want, and it's very straightforward. And I think that's also a recommendation we have in general. Like, there is a lot of reinventing the wheel going on, and the solution for that is that when you have a problem that you think is entirely novel, you should first think of similar solutions that were done in other places and whether or not you can adopt most of that solution into your use case. So just to, again, double down on the example that Guy gave with the Jupyter Notebooks: everyone uses Jupyter Notebooks to analyze data. It's a great and very flexible tool. And one of the challenges with a lot of these sort of automatic data analysis tools is that they limit you, because you have to have someone that implemented a feature to, for example, normalize a column or something like that. So instead of building that feature yourself, why not let users interface with the data via Jupyter Notebooks, and then they have all the possibilities that they would have in a regular project. They're already familiar with the interface, and that sort of empowers the user while also enabling tools built on top of that, like DagsHub or any other tool that supports Jupyter Notebooks, to present that information in a meaningful and useful way.
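(Editor's note: as a concrete sketch of the "generic formats" idea, here are flat parameters written as human-readable YAML and metrics written as plain CSV, using only the Python standard library. The file contents, keys, and values are illustrative, not DagsHub's exact format.)

```python
import csv
import io

# Hypothetical experiment parameters.
params = {"lr": 0.01, "epochs": 5}

# A flat params file is just "key: value" lines -- readable by any YAML
# parser, or simply by eye in a diff or pull request.
params_yaml = "".join(f"{k}: {v}\n" for k, v in params.items())

# Metrics as CSV: trivially loadable into a notebook, a spreadsheet,
# or pandas for further analysis.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["step", "loss", "accuracy"])
writer.writerows([(1, 0.91, 0.62), (2, 0.54, 0.81)])
metrics_csv = buf.getvalue()

print(params_yaml)
print(metrics_csv)
```

Because both formats are plain text, they diff cleanly in Git, which is what makes committing them alongside the code practical.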
[00:27:12] Unknown:
Yeah. They're very useful for showing what was done. I think the trouble maybe starts when you also treat them as, like, the source code.
[00:27:21] Unknown:
I think this is a good opportunity to dig more into the DagsHub platform itself. Can you describe a bit about how you've built the underlying capabilities, and maybe describe some of the value-add features that you have created on top of those open source toolchains?
[00:27:41] Unknown:
The basic protocol is to git push to DagsHub, and we support any Git project. But in particular, we give you extra nice features if you're using DVC on top of Git. DVC does not replace Git; it's kind of like an add-on. And apart from that, we try to keep things very, very close to the standard protocols. So if you're using DVC, we can show you the pipelines. We can let you edit them through our UI. We can automatically connect to your existing cloud storage and show you the data as it should be, like, next to the code. Let's say, if you do git clone and dvc checkout, it would download your data from the cloud storage and put it in your data/raw folder next to your code. So if you browse it on GitHub, like, you won't see the data folder at all. It won't exist.
If you browse it on DagsHub, then you can actually see everything as you would if you actually checked it out. So I think it's very critical, especially for open source, where you want to give people a frictionless way to see what other people did. Apart from that, we recently launched DagsHub storage, which is now, I think, fair to say, officially the easiest way to create a DVC remote and push and pull data to it. So every project you open on DagsHub automatically gets some free storage that you can use as a DVC remote, and you don't have to set up, like, putting your credit card into a cloud provider and deal with access controls, which is always a huge pain. That just works automatically.
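The layering described here, where Git tracks a small pointer file while the data itself lives in content-addressed storage and is restored on checkout, can be sketched in a few lines. This is a toy illustration of the idea only, not DVC's actual file format or API: the `.ptr` suffix and the `track_file`/`checkout` names are hypothetical (DVC itself writes `.dvc` pointer files and historically uses MD5 content hashes).

```python
import hashlib
import os

def track_file(path: str, cache_dir: str = ".cache") -> str:
    """Toy DVC-style tracking: copy the file's bytes into a content-addressed
    cache and write a tiny pointer file that Git can version instead."""
    with open(path, "rb") as f:
        data = f.read()
    digest = hashlib.md5(data).hexdigest()  # content hash identifies the version
    os.makedirs(cache_dir, exist_ok=True)
    with open(os.path.join(cache_dir, digest), "wb") as f:
        f.write(data)  # the large file lives in the cache (or remote storage)
    with open(path + ".ptr", "w") as f:  # small pointer file goes into Git
        f.write(f"md5: {digest}\npath: {path}\n")
    return digest

def checkout(pointer: str, cache_dir: str = ".cache") -> None:
    """Restore the data file next to the code, like `dvc checkout` does."""
    with open(pointer) as f:
        digest = f.readline().split(": ")[1].strip()
        path = f.readline().split(": ")[1].strip()
    with open(os.path.join(cache_dir, digest), "rb") as src, open(path, "wb") as dst:
        dst.write(src.read())
```

The point of the layering is exactly what is described above: a clone without a checkout shows only the pointer, while running the checkout step materializes the data folder next to the code.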
Apart from that, we have a lot of convenience features. So Dean mentioned that in terms of, let's say, experiment tracking, which we provide as part of every repo, because we think it's, like, an inseparable part of every data science project. So right now, we treat every experiment as directly connected to a Git commit. The way we actually do it is tell you, okay, we don't care if you're using R, if you're using Python, if you're using even Excel. Like, as long as you commit a params YAML file and a metrics CSV file into your Git repo, we will scan your commits and show them to you in an experiments table, which you can now search, filter, order, see the loss curves, like, graphically compare things, and even have a discussion on these experiments as part of pull requests.
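The commit-based convention described above is language-agnostic: anything that can write a params YAML file and a metrics CSV file into the repo can participate in experiment tracking. A minimal sketch, with hypothetical file names and column layout (check DagsHub's documentation for the exact schema its scanner expects):

```python
import csv

def log_experiment(params: dict, metrics: list[dict]) -> None:
    """Write hyperparameters and metric history as plain files, ready to be
    committed to Git alongside the code that produced them."""
    # Flat YAML of hyperparameters (written by hand to stay dependency-free).
    with open("params.yml", "w") as f:
        for key, value in params.items():
            f.write(f"{key}: {value}\n")
    # Metric history (e.g. loss per training step) as a CSV table.
    with open("metrics.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "value", "step"])
        writer.writeheader()
        writer.writerows(metrics)

log_experiment(
    {"learning_rate": 0.01, "epochs": 5},
    [{"name": "loss", "value": 0.7, "step": 1},
     {"name": "loss", "value": 0.4, "step": 2}],
)
```

Committing these two files then makes the experiment a regular Git commit, which is what lets the platform reconstruct an experiments table from history alone.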
One thing which we also offer is data science pull requests, which is, we believe, actually the first time people can contribute to open data science projects. Let's say I go to Papers with Code, and I find a machine learning research paper which has its code on GitHub. Great. I go to GitHub. I can see the code. Usually, I can't access the data. But let's say I can, because they gave me a link and it's public. But now I have an improvement to make. I want to contribute to it. So let's say I found a bug in the data. This is something that actually happened to us. Like, this is not an invention. The first tutorial that we made, oh, sorry, the second tutorial that we made for DagsHub, I downloaded the data from the Stack Overflow API and kind of did a project about predicting whether a question relates to machine learning or not.
And I got good results and, like, published a tutorial, and then I found out after months that the data was completely broken. That was totally my fault. I didn't notice that I didn't sort the results by date. So I got, like, a very weird distribution of results, because the API limits you to a certain number. And also I didn't filter correctly. And the end result was I got garbage data and got good results on garbage data. So now if I go to Papers with Code and find a project on GitHub, even if theoretically all of this is available, I will have a very hard time seeing it.
And even if I do see it and I do fix the data, let's say I rerun the API query and get fixed data, and now I get different results, I don't have any way to contribute that back to the original project. I can maybe email the writer of the original paper and say, here is some new data and my new results on it. Please maybe try to incorporate it. And what we try to do in DagsHub is say, okay, if you fork the project, you have a DVC remote, you can push the new data to it, you can create new experiments on it that are automatically recorded as commits. Now, when you open a pull request to the original repo, the maintainer can see everything that changed in the parts that matter, not just the code diff, but what changed in the data, get the new trained models automatically, be able to compare the newly created experiments, and, with a click, copy all of that new information as part of the new Git history, along with the new data and everything.
And it's now the new master version of the base repo.
[00:32:44] Unknown:
One thing we haven't mentioned is notebooks. Like we said, it's important to use them in some cases and to be sure that you want to use them in others. But we realized that a lot of people really like notebooks, and they're very useful in a lot of data science projects. So we support notebook viewing and notebook diffing in a human readable form. As you probably know, notebooks are JSON files. So if you do a code diff on them, it's horrible and you can't really make sense of anything. So with DagsHub, you can actually see that as integrated into the platform.
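Since a notebook is just JSON, the core of a human-readable notebook diff can be sketched as: parse the JSON, keep only each cell's source, drop the noisy outputs and metadata, and diff what remains. This is a minimal sketch of the idea; DagsHub's actual viewer, and dedicated tools like nbdime, are far more complete:

```python
import difflib
import json

def cell_sources(notebook_json: str) -> list[str]:
    """Extract just the source text of each cell from a .ipynb JSON string,
    ignoring outputs, execution counts, and metadata."""
    nb = json.loads(notebook_json)
    return ["".join(cell["source"]) for cell in nb["cells"]]

def notebook_diff(old_json: str, new_json: str) -> str:
    """Diff two notebooks cell-by-cell instead of as raw JSON."""
    old, new = cell_sources(old_json), cell_sources(new_json)
    return "\n".join(difflib.unified_diff(old, new, "old", "new", lineterm=""))
```

Diffing the raw `.ipynb` files directly would bury the one-character code change under churn in execution counts and base64-encoded outputs; stripping down to cell sources first is what makes the diff readable.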
Also, a lot of projects, especially in deep learning where you're using images or text, have data that is folders full of images or text files, things like that. It's really hard to understand what's going on if you're doing a diff and you're seeing a list of a thousand files that have changed. So we tried to think of a convenient solution for that, and we have something that we call directory diffing, which basically lets you diff the project in the context of the directory structure. This is true also in data science pull requests. And that makes it much easier for our users to sort of make sense of changes in their projects
[00:33:51] Unknown:
and then contribute to them. Maybe the last part is, we are still actively working on this, but to be able to automatically scan your data. Let's say we have a few users who did this: publish a paper where the main point of the paper is, I created a dataset, and I want this dataset to be useful to society, so please check out the repo on DagsHub where the dataset is hosted, and anyone can fork it and change it and whatever. So one thing which we started doubling down on is to give you much richer interaction with data if it's in a standard format. Git or DVC push a CSV file to us? Okay, then we can let you play with that data through our UI to see if this is a project that you're actually interested in. I think that's one of the things that we learned from user interviews: when people are looking at an open source project, they first start by looking at the data, because they want to see, is this garbage? Like, is this actually something that is workable, something that I can create interesting things on top of? So we want to make that learning process as fast and as convenient as possible.
Yeah. Just making more and more ways to show the users what they actually want to see as easily as possible.
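The directory diffing idea mentioned above, collapsing a flat list of a thousand changed files into a view organized by the directory tree, can be illustrated with a small sketch (the function name and output shape here are hypothetical, not DagsHub's implementation):

```python
from collections import Counter
from pathlib import PurePosixPath

def directory_summary(changed_files: list[str]) -> Counter:
    """Collapse a long flat list of changed paths into per-directory counts,
    roughly the idea behind diffing in the context of the directory tree."""
    return Counter(str(PurePosixPath(p).parent) for p in changed_files)

summary = directory_summary([
    "data/train/img_001.png",
    "data/train/img_002.png",
    "data/val/img_101.png",
    "src/train.py",
])
# A reviewer now sees a handful of directory-level counts
# instead of a thousand-line file list.
```

A real implementation would aggregate recursively up the tree and pair counts with per-file diffs on demand, but even this flat grouping shows why the directory view is easier to reason about.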
[00:35:11] Unknown:
In terms of the bootstrapping problem of building up a mass of users and repositories on the platform to make it attractive, and just some of the dynamics that brings up, where GitHub has gained a critical mass of users and has become the default place to go for working on open source code, I'm curious how you're thinking about that problem for DAGS Hub.
[00:35:42] Unknown:
And I think, as sort of data scientists, we really like to talk about our tools. But for this issue, it's much more important what interesting projects exist on the platform. And I think that's what's going to draw data scientists to DAGS Hub. Our approach with respect to GitHub is, again, GitHub is a great platform. It has a lot of advantages, and we are supporting it wherever we can. So that means it's really easy, because both DAGS Hub and GitHub are based on Git, to move a project from GitHub to DAGS Hub and vice versa, or to mirror a project from one to the other. So you can actually get the best of both worlds if that's important. And sort of on the technology side, we're trying to focus on, again, providing the features that will make it more reasonable to do data science work on DAGS Hub while not losing the connection to GitHub.
But our general approach is to create, and to help the community create, interesting projects, and then invite the community to collaborate on them. And I don't think you necessarily need a critical mass of projects. You just need a few meaningful projects that the community cares about, and people will go the extra mile, and we're already seeing that. Like, the interesting projects that are being created on DagsHub are the things that are drawing the most interest in the platform, not necessarily a feature here or there. So I think our plan is to continue to double down on that, which means integrating with other tools to make it easier for users to create their projects and get meaningful advantages as data scientists working on the platform, but also to engage with the community and see where we can help. So that's definitely helping or partnering with organizations that are working on open source data science, that are working on social good or public good data science projects. There are a few really interesting organizations there. One which we partnered with recently is Omdena, which basically does collaborative data science projects for public good causes.
We plan to continue to double down on that because we see that it's actually working.
[00:37:49] Unknown:
One of the interesting aspects of having these projects out in the open, similar to having platforms like GitHub and GitLab where there's source code available for viewing, is that it serves as a way for people to learn more about different patterns or ways of approaching problems. And I'm curious, in terms of your own experience of working on projects out in the open and looking at other people's projects, what are some of the lessons that you've learned about how to approach data science, and how has that impacted your own methods of thinking about problems and working on problems for this particular type of application?
[00:38:26] Unknown:
One of the main things that I learned as we were working on this and talking to people, seeing what they did, is that there are many different personas. Like, this is probably true in any field, but I think it's very true for data science. There are many different personas, and it's hard to try to serve them all at once. So, for example, some people, and I think maybe Jeremy Howard is the champion of this approach, are super into notebooks and will never leave them willingly. I don't want to start a flame war. Not saying I don't agree with that approach, but I'm saying, for those people, any dev tool that isn't directly usable as some notebook magic or Python API is just a nonstarter. Like, there's nothing to discuss, and there's usually not much point in talking about, like, reproducibility, versioning, and stuff like that.
It all comes down to personal goals for the person and the project, and taste. Like, if you're doing quick and dirty experiments and trying to be as creative as possible, that's amazing. If you're working on a longer term project, then you need to invest more in something like DagsHub. Not that one is good and one is bad. It's just different use cases. I think that's the thing that I learned which is very impactful, and true for many other cases. I'm trying to think of a, like, data science technique that I learned which was exciting for me. None that I can think of immediately.
[00:39:52] Unknown:
And in terms of the ways that people are using DAGS Hub, what are some of the most interesting or unexpected or innovative ways that you've seen it exercised?
[00:40:02] Unknown:
This was a while ago, but it's something that stuck in our mind. We had a really enthusiastic user we engaged with on a few different occasions, who really wanted to use DagsHub for version control in video game development. Obviously, video games have very large assets. You have, like, 3D models, if it's a 3D game, textures, music, many things that are very large files. Not exactly data, but they are 100% part of the source code of the project. They require versioning, because you have, like, designers working on these assets, and they're changing all the time. And so he wanted us to add more features to sort of compare 3D models, for example, across branches. And we were obviously very tempted, because we're both gamers.
He also wanted us to go and present DagsHub at a major game company event. But in the end, we decided that, obviously, our initial focus was data science and that we should continue focusing on that. But it was definitely a very interesting use case that we did not think about
[00:41:12] Unknown:
when we started DAGS Hub. In your experience of building out the DAGS Hub platform and using it for your own work and seeing other people using it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:41:23] Unknown:
This one is actually relatively recent. We were speaking with a user about creating automations for training machine learning models. So our approach with DagsHub is, again, integrating with existing solutions to do this. But we realized that today, there are no good solutions for adding automations that actually affect the project. What I mean by that is: tests, you sort of run them on the code that is, for example, being contributed in a pull request, and you get a success or fail result. But what happens if you have an automation that is actually supposed to push code into the repository, and you want to do that for an open source project that accepts contributions from strangers?
So this is actually a challenge that we're now sort of thinking about how to solve. And I think it's unique to a community first platform, but it's definitely very interesting. And there are a few sort of nuances there that we didn't expect. Like, it seems like it's a solved problem, but when you start thinking about it, you realize that it's complicated.
[00:42:28] Unknown:
I think one thing which I never thought about, but after thinking about it more, I realized what a problem it is: doing continuous integration in a closed project with only your collaborators is so much more flexible. Because, for example, I can say that the build script itself is part of the source code. It makes sense. Like, I have a Jenkinsfile or a Makefile or whatever file. And the CI is very thin. It just checks out the pull request version, builds it based on the Dockerfile or whatever file that tells it what to build. And it either passes or it doesn't. But if you're in an open source project, suddenly anyone can create a pull request and change the Dockerfile to do Bitcoin mining for them. Just open a ton of pull requests to do automatic Bitcoin mining at the expense of the open source project maintainer, at the expense of GitHub, or whoever is running the CI system.
Maybe Bitcoin mining is a less good example, because throttling would pretty much solve it. But let's say exporting secrets that the CI system is keeping, which are only usable as part of the build, like in environment variables or something. So those are the kinds of concerns that Dean is talking about, on top of the other concerns, which are like, okay, if I want to actually push back to the pull request, for example, a new experiment, which is a commit, then there are a lot of nuances that we didn't think about before.
[00:44:00] Unknown:
And for people who are looking to be able to collaborate on data science and machine learning projects, whether in the open or just within the confines of their team, what are the cases where DAGS Hub is the wrong choice?
[00:44:12] Unknown:
The easy answer to that is that if you're creating a throwaway project or a short term project, it probably doesn't make sense to use DAGS Hub. It might sound trivial, but a lot of data science work is sort of still in the proof of concept stage. So you just want a notebook that shows some capability of a model to learn, for example, and you don't really care about the details. So you're creating a notebook, you try some different things out, and then you never return to it. So you probably don't want to go through the process of setting everything up if that's what you're going to do. On the other hand, if you plan to share your work with others, or sort of work with collaborators, or maybe your boss or something like that, you'd still probably find DAGS Hub a very good way to show them what you've done, and, obviously, if you get back to that project later. I think another thing here that we stumbled upon is if you're working on projects as, like, a contractor, and you have, like, a one month project, and then you don't know what is going on. You don't care anymore. You tossed it over the fence.
Yeah. So in such a case, that's also going to feel like it's a lot of work to use DagsHub. Obviously, that's not true if you consider the entire sort of pipeline of the project, because now you're going to throw it over the fence and someone else is going to worry about it. But I would say that in such a case, it's probably in the interest of the client to insist on using a platform like DAGS Hub, so that the project is managed in an ordered way. But, yeah, for short term projects it doesn't make sense.
[00:45:45] Unknown:
As you continue to build out the platform and bring on new users and expand the capabilities, what do you have planned for the future of both the platform and the business and just the sustainability of this effort?
[00:45:58] Unknown:
On three fronts. The first is the product front. So our near term plans are to add real time experiment tracking capabilities. That's something that's been coming up from users. Obviously, you can already see your experiments on DAGS Hub, but if you want it in real time, if you're training a heavy deep learning model, you want to see that the loss curve is converging. We plan to add support for such a use case in the near future. We also plan on adding sort of integrations and help with automations, for example, continuous training and deployment, which we mentioned a few questions ago. Again, in all of these cases, we don't plan on building these things from scratch, but integrating with the best solutions that are already in the field, since that offers our users the best options that they could have.
Another aspect that we plan to double down on, which I think we also mentioned, is the sort of ability to interface with data. It's arguably the most important part of a data science project. It has to be easy and really convenient to understand what's going on with the data, interact with it in the environment that you choose, and then review and contribute to it. We're already providing value here, but we definitely plan to expand that. And on the community front, we plan to continue to add, and encourage others to add, open source data science projects, and to continue partnering with organizations that are working on this front. I gave a few examples of collaborative data science project platforms, but there are also things like Papers with Code and others.
And I think that another area which is neglected on this front is sort of examples of projects that combine more than one tool. A lot of times, people are using a tool and they want to add a capability for something, and it's really hard to understand how they should do that. So we want to lead by example and create many of these options, so that if you have a tool combination that you'd like to use, you can find a great example of a project like that and get started as quickly as possible. On the sustainability front, basically, I think it's very straightforward for us. Like, the approach is: DAGS Hub is free and will continue to be free for the community, for open projects, for social good projects, etcetera.
And then if we take the example of open source software, the best workflows, the best tooling, were built around open source software and then sort of permeated into the industry, which usually paid for it. So our approach is similar, in the sense that if a company wants to incorporate DAGS Hub and the workflow that we are building into their organization, then that will be a paid option. Another sort of point here is that many companies have data which they cannot expose, so they would need an on-premises installation, things like that. So that's our approach with respect to monetization.
[00:48:47] Unknown:
Are there any other aspects of the DagsHub platform, or the challenges of collaboration in data science, or doing data science out in the open, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:49:00] Unknown:
Yeah. Definitely. So there is, like, the open question of licensing and privacy. About licensing, I'm not so worried. Like, sometimes it's mentioned as an issue: my god, we don't know what the open source licenses are for data. And maybe I'm being naive, but I think in the end, I see that a lot of people just want to share the data and let anyone do whatever with it, which is like the BSD license or any number of other open source licenses that exist. And maybe you could have variations which are equivalent to GPL, copyleft, things like that.
But I think those are the kinds of issues which will sort themselves out, because people actually want to share data and to work on it openly. Having said that, one thing which, in the far future, I think we would like to address is the issue of supporting maintainers of open source datasets, something that GitHub and other companies are now trying to do, to connect open source developers with money contributions. So that's something very interesting that I think we would very much like to solve at some far future point. The other issue is privacy, which is more sticky. Like, maybe some project would be very useful as an open source project. Face detection is not a good example, because politically it's being misused. But you can think of other examples where you actually use people's information to train something which is very useful in general and can be used for good. Let's say medical companies, who have very useful data which can be used for a lot of drug discovery and things that are beneficial to society, but they just can't release the data. And that's a shame. Maybe things like differential privacy and stuff could solve this, but I'm not sure.
[00:50:56] Unknown:
Yeah. I think that the questions of privacy and bias in particular are interesting ones, particularly for work that's being done out in the open, but also the licensing and ownership challenges around the datasets: how do you determine what data is acceptable to put in an open repository? How much of it needs to be redacted? How do you handle things like, as you mentioned, differential privacy? Definitely interesting problems that don't have any easy answers right now.
[00:51:23] Unknown:
Yeah. I think there's enough of a mass of data which could just be open source to make this a very interesting area of activity anyway before those problems are solved
[00:51:33] Unknown:
sufficiently. I would also add that, putting aside, like, PII and HIPAA regulated data, if we put aside all the data that is problematic with respect to legal issues, I think the issues of bias, which are obviously very important to solve, would be solved faster if the data was accessible to everyone. It means more eyeballs, more working hands contributing to fixing the biases in data, or adding data to sort of counterbalance the problematic data that we already have. And that's definitely something that we would want DAGS Hub to contribute to and lead as much of as possible. When we came up with, like, the value proposition
[00:52:12] Unknown:
originally, that was really our go to example. Like, we would call biases just data bugs. Like, okay, you don't have representative data. If it was open sourced, some contributor from some not well represented community in another part of the world could come in and say, you have a data bug, you don't have enough examples from my area, from my culture, and here, I just contributed them to you. Instead of arguing about it online, we can just fix it.
[00:52:41] Unknown:
Yeah. It's definitely analogous to the security issues that plague software projects, and the fact of it being open meaning that more people who have the necessary expertise can have eyes on it and contribute fixes in the areas you mentioned, for bias in these source datasets or the way that the data is being processed. If there are more people who are able to access it, then, as you said, they can contribute back and incrementally move towards a more equitable result. Yep. Exactly. Well, for anybody who wants to follow along with the work that you're both doing and contribute or try out the platform, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. And this week, I'm going to choose a book that I am revisiting with my family and reading aloud to the kids, called The Remarkable Journey of Prince Jen by Lloyd Alexander.
It's a book that I read a long time ago when I was a kid. It's a very interesting storyline. The way that it's presented and structured is a little bit different than your typical sort of youth or young adult novel. It's a story of a sort of journey of discovery that this young prince goes on, out into his provinces, and just a very interesting way of presenting some of the useful sort of lessons that are necessary to help us all sort of be better people. So definitely a fun book to read and worth checking out. And so with that, I'll pass it to you, Dean. Do you have any picks this week?
[00:54:04] Unknown:
On the book front, I'm going to recommend a book that I'm currently reading, which I got from Guy for my birthday, which is called Quantum Computing Since Democritus. I don't know what you think that book is about, but I think it's really interesting because it covers a lot of scientific topics from math, computer science, physics, and more, and some philosophy. But the nice thing about it is that I've read a lot of sort of pop science books, and I've read a lot of academic content. And this book sort of strikes the balance between not being too pop sciency. He's not afraid of showing complex proofs as part of the story, but also not entirely academic, which would arguably be boring. I find that I'm really enjoying that book. And on the viewing front, I am currently watching the 5th season of The Expanse.
It is a great show. For those that don't know, it's a sci-fi show which tries to be relatively realistic. So, basically, it's not too far in the future, where humans have inhabited the moon, Mars, and the asteroid belt. And there are a lot of sort of politics and dynamics inside, which are interesting. It's really recommended. The fifth season is great, so I recommend it.
[00:55:19] Unknown:
Game of Thrones in Space. Definitely a great show.
[00:55:23] Unknown:
I've been enjoying it myself as well. So, yeah, Guy, what do you have for picks this week? By the way, this one wasn't one of my picks, but it's important. We mentioned
[00:55:31] Unknown:
Papers with Code. We would be remiss not to mention Connected Papers. While we sponsor them, they're a great service and idea which everyone should check out. They let you put in, let's say, an arXiv link, and you get a graph of all the related papers and how impactful each one is, and you can kind of build a reading list on a topic. So it's very useful and very recommended. They're great friends. In terms of picks, I kind of always wanted to prepare a recommendation list for podcasts which I would be interviewed on, because I always listen to podcasts and hear the recommendations, and I get frustrated when the guests say, I don't know, I haven't thought of anything. My recommendation is, like, I thought it appropriate to make it timeless, because I always advise: try to consume only the very best of available content, and not just the things that are coming out right now. The Expanse, although it's new, is very good, so that checks out. And I think it applies to books, TV shows, movies, programming, whatever.
So my picks are kind of the things I think are most useful in the long term, not necessarily new things. So, something which has influenced me a lot in my thinking, while I was, unrelatedly, learning about machine learning, is that I discovered lesswrong.com, which is a blog about rationality, AI, all kinds of topics. I think you might not agree with everything said there, but it really, really can change how you think of things, and maybe sometimes afterwards you would see online discussions and think, why are they even arguing about this? Like, LessWrong has already answered this question about AI philosophy very conclusively. So I would really recommend it.
And also the offshoot blog, Slate Star Codex, or Astral Codex Ten as it's now called. In terms of TV shows, I recently watched Avatar: The Last Airbender, like, very late, and I was shocked. Like, it's such a masterpiece. So I would recommend anyone who skipped it, maybe because they were too old or something: go and watch Avatar: The Last Airbender. It's not a kids' show. It's just a masterpiece. If you haven't already, go see 3Blue1Brown on YouTube. And one last thing, I always recommend to people who are into code: learn these two languages, Haskell and Clojure, or any other Lisp, but Clojure is like a modern, actually usable alternative. Because those are the two languages which completely changed how I think about things permanently.
So those are the kinds of recommendations that I like.
[00:58:11] Unknown:
Well, thank you to the both of you for taking the time today to join me and share your experience of building DAGS Hub and working through the problem of collaboration in data science. Definitely a very interesting and complex problem, and one that is increasingly relevant and necessary to address. So I appreciate all the time and energy that you've put into that, and I hope you two enjoy the rest of your day. Thank you for having us, Tobias. Thank you for the opportunity. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Overview
Meet the Founders: Dean and Guy
The Origin Story of DAGS Hub
Features and Benefits of DAGS Hub
Challenges in Data Science Collaboration
Useful Patterns for Data Science Projects
Building and Enhancing DAGS Hub
Attracting Users to DAGS Hub
Lessons Learned from Open Data Science Projects
When DAGS Hub is Not the Right Choice
Future Plans for DAGS Hub
Addressing Licensing and Privacy Issues
Closing Remarks and Picks