Summary
Collaborating on software projects is largely a solved problem, with a variety of hosted or self-managed platforms to choose from. For data science projects, collaboration is still an open question. There are a number of projects that aim to bring collaboration to data science, but each solves a different aspect of the problem. Dean Pleban and Guy Smoilovsky created DagsHub to give individuals and teams a place to store and version their code, data, and models. In this episode they explain how DagsHub is designed to make it easier to create and track machine learning experiments, and how it serves as a way to promote collaboration on open source data science projects.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Dean Pleban and Guy Smoilovsky about DagsHub, a platform to track experiments, and version data, models & pipelines for your data science and machine learning projects.
Interview
- Introduction
- How did you first get introduced to Python?
- Can you start by describing what the DagsHub platform is and why you built it?
- There are a number of projects and platforms that aim to support collaboration among data scientists. What are the distinguishing features of DagsHub and how does it compare to the other options in the ecosystem?
- What are the biggest opportunities for improvement that you still see in the space of collaboration on data projects?
- What do you see as the biggest points of friction for building experiments and managing source data collaboratively?
- Can you describe how the DagsHub platform is implemented?
- How have the design and goals of the system changed or evolved since you first began working on it?
- How has your own understanding and practice of working on data science/ML projects changed?
- GitHub has a number of convenience features beyond just storing a git repository. What are the capabilities that you are focusing on to add value to the data science workflow within DagsHub?
- How are you approaching the bootstrapping problem of building a critical mass of users to be able to generate a beneficial network effect?
- Are there any conventions that make it easier or more familiar for newcomers to a given project? (e.g. code layout, data labeling/tagging formats, etc.)
- What are your recommendations for managing ownership/licensing of data assets in public projects?
- What are some of the most interesting, innovative, or unexpected ways that you have seen DagsHub used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building DagsHub?
- When is DagsHub the wrong choice?
- What do you have planned for the future of the platform and business?
Keep In Touch
Follow us on Twitter or LinkedIn, join our Discord, sign up to DAGsHub
Picks
- Tobias
- The Remarkable Journey of Prince Jen by Lloyd Alexander
- Dean
- Quantum Computing Since Democritus by Scott Aaronson
- The Expanse TV Series
- Guy
- Try to consume only the very best of available content, not the things that are coming out right now.
- Applies to textbooks, TV shows, movies
- Less Wrong blog
- Slate Star Codex / Astral Codex Ten
- Avatar: The Last Airbender
- 3 Blue 1 Brown YouTube Channel
- Haskell
- Clojure
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle-tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Dean Pleban and Guy Smoilovsky about DagsHub, a platform to track experiments and version data, models, and pipelines for your data science and machine learning projects. So, Dean, can you start by introducing yourself?
[00:01:11] Unknown:
My name is Dean. And together with Guy, I'm one of the founders of DagsHub. My background is a bit of a combination of physics, computer science, data science, and design. And I am excited about data science and DAGs. I'm Guy, the second half of the team.
[00:01:29] Unknown:
My background is mostly from software, some data engineering and other ML related things. Just wanted to say it's great to be on the podcast. I've been listening to both of your podcasts for a long time, so it's super exciting.
[00:01:45] Unknown:
Yeah. It's definitely great to have the both of you on here. And going back to you, Dean, do you remember how you first got introduced to Python?
[00:01:52] Unknown:
The first time I started writing Python was in my free time, so probably high school. It seemed like a very easy language to start with, so I tried out a few things. I think it was aside from school projects, because we were studying Pascal. So Python was in spare time, and then I had a few more opportunities to work with Python during my undergraduate studies, where I did a bit of machine learning work. That was obviously Python, but also a bit of physics research in quantum optics where we used mostly MATLAB, but I did take every opportunity I could to write in Python because it was so much fun.
[00:02:33] Unknown:
And, Guy, do you remember how you first got introduced to Python? I think it was at a very early age.
[00:02:38] Unknown:
Sometime during probably middle school or high school. So not really. But I always thought, like, I don't know why people didn't catch on earlier that it's just easier to do things in Python. Right now, it seems to be the status quo.
[00:02:55] Unknown:
The both of you have been working on building up this platform in the form of DAGS Hub. I'm wondering if you can just give a bit of an overview about what it is that you've built and your motivation for starting on that journey.
[00:03:06] Unknown:
Basically, I've been a software developer, like, professionally for a bit more than 10 years now, in a variety of fields like mobile, cyber, all kinds of unrelated areas. But my last job, which I did for 3 years, was as one of the developers in a startup called Glassbox Digital, where I came in to do, like, heavy duty data engineering, like how to ingest and handle data rates of, like, gigabits per second, push it into Elasticsearch, and do all kinds of analytics on top of it. That's kind of where I first got my feet wet in data engineering and data science. We had all kinds of projects, mostly around time series anomaly detection.
And while doing that, I also, like, took one day a week to do a master's degree in computer science, where I basically took all of the possible courses in machine learning and deep learning, and there was a year where I basically read every paper on arXiv. I don't know how, but I found the time, and I dove headfirst into it. So during one of these courses, where we were trying, as a group of students, to reproduce some state of the art deep learning model for face detection, I basically became frustrated with how I was working. Like, okay, so I'm a software developer. I really feel strongly about DevOps, about Git. Like, I'm always the Git guy in the company, wherever I was.
And while working in this team, I basically found out that I didn't know how to work together with the other two students who were on my team. Basically, what ended up happening is I told them, okay, listen, I don't know. Like, I have this notebook. I am working on it. Maybe I'll let you know when I'm done, and you can do something on top of it. I don't know how we can work in parallel, how we can split the work. And I found myself doing things which angered me. Like, I caught myself working the way I would before I knew what Git was, which was, for example, I created, as part of the notebook, a script to run an experiment and save it into some folder, with the name of the folder being hyperparameter name_value, and kind of a chain of that. So lr_0.01 and other hyperparameters like this, and then writing scripts to iterate over the folders and try to extract some information from them. So then I noticed that, wait, before I knew what Git was, I would kind of copy code into folders and do version management like that.
And I said, okay, I'm just being silly. Like, someone has solved this, the data scientists in the world who have been working for a long time. So I went out and searched for how I should be doing this. And what I found was some promising solutions, but mostly everyone saying, we don't know. This is how it is, and we don't have an agreed solution. So that's kind of where the idea came from. I especially liked DVC. DVC is Data Version Control, an open source project to kind of extend Git to give it better capabilities for data versioning and other relevant features, and I guess you had an episode about them already.
So I liked that, but what I was missing is, like, the GitHub to close this story, because there's still a lot of tooling needed around that. So that was kind of the origin story: how I was frustrated with the current state of affairs and kind of had a vision of where the world should be. And I had just quit my job due to moving apartments, and Dean had just finished his degree, and Dean and I have been friends since kindergarten, which we didn't mention earlier. So we've been friends
[00:07:03] Unknown:
a couple of decades. Friends for a long time.
[00:07:06] Unknown:
And we've always done projects together. We both like software, and we kind of complete each other, because Dean is like the design guy, and I'm like the back end logic guy. And we basically dove into it, and I kind of quit my degree. So I'm a college dropout, which means I check that checkbox. And that was the origin story.
[00:07:30] Unknown:
Building a company together, particularly after having been friends so long, is definitely a good test to see how well you guys actually get along, because I'm sure that there are points of stress in the process of building a business around this that strain your existing relationship.
[00:07:44] Unknown:
I think we're doing good. Like, it's been 2 plus years.
[00:07:48] Unknown:
I feel like, in a sense, we knew what we were getting into. Like, this is the first startup that we're doing together, but not the first project that we're putting an intensive amount of time into doing together. So in a way, like, when you look at the process of finding cofounders, that's always a challenging issue. But for me, I felt like it was a very straightforward process because it was very clear that Guy and I should do something together at some point. And when we landed on this idea, it was very clear that this is the idea that we should pursue together. I feel like I couldn't have chosen a better cofounder, at least from my end.
[00:08:25] Unknown:
From looking at the website, it seems that it's still just the 2 of you who are building out this platform, which given the amount of features that you've already built into it and how far along you've gotten, it's definitely impressive that you've managed to make that progress with just the 2 of you.
[00:08:39] Unknown:
We do have an additional 2 developers that we need to add to the website and another 2 joining very soon. So we are growing the team, and we haven't built everything on our own. But I think even for that size of the team, we've built quite a few interesting things, and I'm really proud of what the team has been able to create over the past year. As I always like to say, like any good developers, we cheat. You take something that works well, and you use it.
[00:09:07] Unknown:
Yeah. That's the secret.
[00:09:09] Unknown:
It's not cheating. It's called standing on the shoulders of giants. Exactly. And so you mentioned that a big part of the motivation for building DagsHub was your frustration with the existing platforms that aim to support some of the complexities of collaboration for data scientists and machine learning engineers. And I'm wondering if you could just give a bit of an overview of the features of DagsHub that you think make it stand out in the field, and maybe compare it to the other options that are available in the ecosystem and how it fits with them or might compete with some of the existing capabilities?
[00:09:53] Unknown:
Maybe to start, because I didn't exactly explain what DagsHub is right now. It's a platform, a site, where you can collaborate on open source data science projects, like GitHub for machine learning. The analogy is very apt because you can actually git push your data science projects and get features that are more useful specifically for data scientists, versus software 1.0 developers, like data and model hosting, and the ability to contribute not just code changes but the more important parts, like contributions to data and new experiments, review tools, automatic reproducibility, experiment tracking.
That leads maybe to the things that we do differently. Dean? I think that
[00:10:39] Unknown:
specifically, one thing that we have in front of our eyes, which is very important to us, is the whole open source data science collaboration, which is something that not a lot of people are speaking about. But what we mean by that is that the technology, the platform, should enable two people that don't know each other personally, that are not working next to each other or might be on opposite sides of the globe, to actually work together on a data science project. Our sort of vision is that, in the end, open source software has made such a huge impact on the world, and we believe that data science is going to go through that same process. So we want to provide the tooling that enables that to happen and help the people that are working on these projects work more effectively. And I think that with respect to other solutions surrounding collaboration and data science, the term collaboration is challenging.
It's ambiguous, and a lot of people and tools use it but mean very different things. The best example is: you can say that pair programming on a Jupyter Notebook, that's a type of collaboration. But what we mean with DagsHub when we say collaboration is that you have a place where you can see all the meaningful components of your project, like the code, the data, the models, experiments, pipelines, and you can compare their versions and contribute to them in an effective way. And I think that's sort of where we are coming from. I want to explain what we do mean when we say collaboration.
[00:12:08] Unknown:
So if you look at how collaboration on a project looks in reality, it involves several steps that get repeated constantly. As a contributor, I want to discover what someone else did. I want to understand it, reproduce it, change it, and then contribute the changes back. And this is exactly what Git and GitHub enable us to do with code, even if I don't know the creator of the project. I can find some repo on GitHub, read the docs, try to look at the existing open issues and branches, look at the git blame and the history, and kind of understand the context: what was done, when, by whom, and why.
And then when I actually want to work on it, I can git clone, check out, set up the environment, which is maybe the difficult part, but hopefully someone has set up at least a Docker container or something. But let's assume it's not impossible. I start hacking on the code, and when I think I have made a useful contribution, I can commit it, push it, open a PR, CI can run, do linting, testing, whatever. I can get the code review saying what should be changed, and after fixing, finally, the maintainer of the project can click a button and it's merged, and it will be part of the next release. If any of those steps becomes too hard, then although collaboration is still theoretically possible, like, I can email code patches like the Linux kernel used to do in the past.
So it's theoretically possible, but it will be almost nonexistent. So what we set out to do is to try to close this loop that I described for data science. Some other tools say collaboration, but they don't allow you, in the end, to click a button to say, okay, these experiments are good, I like them, after I've discussed them with the maintainer, who maybe asked for more experiments or maybe asked for more changes to the data or model. In the end, being able to click a button and integrate those changes, that's kind of the missing part of the puzzle that we haven't seen solved anywhere else except, like, using Git and GitHub for software, and we're trying to do that for data science.
[00:14:28] Unknown:
As you put it before, Tobias, we are trying to build on the shoulders of giants. So we think that this is important as a platform that looks at itself as community first, which is that we want to only be based on open source tools and open formats wherever it is possible. Today, we are already based on some of the great tools that everyone knows, like Git, DVC, MLflow, and we're adding support for more of these as time goes by. Which also means that one of the issues that we're seeing, it's sort of a meta issue in the ML field, is that there are so many tools and so much reinventing of wheels going on that practitioners have a huge set of tools to evaluate, and they never get to the bottom of it. So it's really hard to decide what to use. And what we're trying to do with DagsHub is to entirely avoid that. The approach we have is basically saying, here are the tools that we support, which are open source, widely adopted tools, and then we're creating a very convenient way to interface with them and collaborate on top of them, which I also think is a very important distinction.
[00:15:37] Unknown:
Maybe to make it more concrete, imagine you have, like, a project without DagsHub. So you have a myriad of tools that log information about your training runs, and you have maybe Git to version your code, if you're being orderly and not just doing everything kind of ad hoc in a notebook. And maybe all that data is stored in some S3 bucket and you have a URL pointing to it in your code files. But in the end, tying together all of this information into one coherent picture that lets you actually review and compare versions, it's hard work. And we've talked to a lot of teams that, due to it being hard work, do it not as often as they should, and they've built a lot of custom tooling and a lot of ad hoc solutions for this. So that's very hard, and that's not even mentioning how hard it is to reproduce someone else's work when it's scattered all over a bunch of tools. And we think it should be one command. Not to mention actually merging it after you've reviewed it, and not to mention actually trying to do all of that with complete strangers from the Internet who don't have access to your S3 or wherever you store your data.
[00:16:46] Unknown:
There are a couple of things that I want to tease out of that. One of them is the existing pain points that are still there for being able to build experiments and manage the source code and the source data and collaborate on projects as a data scientist. And then the other aspect is some of the useful patterns that you've seen grow in terms of making it easier for new contributors or new users of a given data science project or set of experiments to be able to familiarize themselves with the code and with the project and with the data, similar to the conventions that have grown up around different areas of software engineering.
So I guess maybe starting with the pain points that still exist in terms of what are the points of friction that are still around for being able to collaborate on data science.
[00:17:37] Unknown:
In the long term, the challenge is scale. Right now, there's a barrier to entry for serious, quote, unquote, machine learning work in a community setting, since you might need a lot of computing power and huge datasets to do state of the art work. And even in smaller projects, there are still trade-offs that require more thought than traditional software. Like, I can't retrain my model every time someone pushes a new commit to a pull request, the way I could run a code linter or unit tests, which take 3 seconds in normal continuous integration. Also, data transfer costs in most cloud providers are ridiculously expensive, and that requires attention. That means, like, you can't do things without thinking about it. And teams will need to be more selective about these things, like when do I run automation, and it breaks some of the existing abstractions about collaboration and CI, which will require adjustments.
But those are problems that I think will become less of an issue over time, because compute does get cheaper and the methods become more efficient. And besides that, we also believe that there are plenty of interesting things and opportunities to exhaust before we say, let's do an open source GPT-5 or something like that. We have a lot of projects on the platform which are doing very interesting things just with tabular data, in not-huge scenarios. But those are definitely challenging issues which need to be solved somehow.
[00:19:11] Unknown:
Even with the example of GPT-5, we can see that now there are teams that are actually working on open sourcing an implementation of GPT-3. They do need external support because, obviously, that costs more than time, but it's happening. And so I feel like even now, with all of those limitations, these things are happening. So there's room for hope that it will be more common in the future. I think the same objections used to exist about all kinds of software that we now take for granted. Like, you used to need a supercomputer to run
[00:19:43] Unknown:
Node.js or something. And I think the maybe biggest challenge is actually our mindset when we create a project. I don't think there's a magic bullet, like something I can pip install, that causes a project to be collaboratable without the maintainers of the project making a conscious decision and effort to make the trade-off, to constantly fight against entropy. And that doesn't mean that every project has to make this trade-off. Sometimes we create a one-off, something that we want to be quick and dirty, and be creative and get some kind of result that we can learn from, and then pick up the pieces later. I also do that in software. Like, sometimes I open a REPL and I just start doing stuff, and I learn very fast, but then I would take what I learned and try to make code out of it. So what this does mean is that data scientists who do work in collaborative projects need to think of themselves as part of a larger system that needs to interoperate, and not as completely independent research mavericks who do very creative work but later leave the pieces on the ground to be picked up by someone else. So practically speaking, that means, as data scientists, to refactor your code into orderly modules and repeatable functions, kind of use best practices like Git, and that's kind of a cultural thing. I think it's inevitable, like, as people work on specific types of projects. Let's say, for example, if my deliverable is a research paper or a Kaggle competition submission or something, then maybe I don't care. Like, as long as I can make sense of what I did, that it can run once and I get the result, and that's great. But if it's a model that's going to be deployed to production as part of some product, and it's going to be iterated on as we discover problems in the data and the code, then definitely you need to have that mindset of: this is a long term project that I need to organize.
[00:21:46] Unknown:
Going a bit more into the useful patterns to make a project easier to dive into and get familiar with. So you mentioned, for instance, in software projects, maybe having a Docker container that makes it easy to get the environment set up and running. Or if you are used to working on Django projects, there's a particular directory structure that you're familiar with, as far as the different portions of the app are split out into sort of subapplications, and the directory structure reflects that. I'm wondering if there are any useful patterns in terms of the labeling paradigms for the source data or the data structures or the layout of the data that's useful for making projects easier to collaborate on, or the formatting or structure of the code, and some of the useful patterns or abstractions that you've seen grow up as people are starting to collaborate more on these data science projects, and particularly in the context of DagsHub?
[00:22:44] Unknown:
I think definitely there are some things. Like, we support using cookiecutter data science as a project template. Cookiecutter data science is a project that's not related to DagsHub directly. It's like a standard folder structure for data science projects, with, like, clearly defined stages: here is the folder with the raw data, here is the folder with the processed data, and here is the Python file that turns the raw data into processed data. So that's definitely useful. And I think maybe the first feature we developed in DagsHub is the ability to see the data pipeline next to your code. So this is actually the DVC pipeline.
DVC is another of those abstractions which make it easier to dive into a project. Just like, if you know Git, it's much easier to dive into any project and at least see the history, the same can be said of DVC. Any project that uses it, you can now take a look at its data pipeline, see what turns into what and what happens before what. And in DagsHub, we also provide a UI for it and also allow you to kind of browse the data itself and play with it, interact with it. In terms of other abstractions, I think one thing which we are focusing on, which requires some consolidation but is still very flexible, I guess, is review tools.
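(Editor's note: a DVC pipeline like the one Guy describes is declared in a dvc.yaml file. The sketch below is illustrative only; the stage names, scripts, and paths are hypothetical, loosely following the cookiecutter data science layout mentioned above, and assume parameters live in a params.yaml file.)

```
# dvc.yaml -- hypothetical two-stage pipeline
stages:
  process:
    cmd: python src/data/make_dataset.py data/raw data/processed
    deps:
      - src/data/make_dataset.py
      - data/raw
    outs:
      - data/processed
  train:
    cmd: python src/models/train_model.py data/processed
    deps:
      - src/models/train_model.py
      - data/processed
    params:        # read from params.yaml by default
      - lr
      - epochs
    outs:
      - models/model.pkl
    metrics:
      - metrics.csv:
          cache: false
```

Given a file like this, `dvc repro` runs only the stages whose dependencies changed, and tools such as DagsHub can render the stage graph next to the code.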
For example, a real data science project is not one accuracy metric which I submit to Kaggle. Like, in reality, you have a lot of trade-offs. Maybe the model is a bit more accurate, but it takes a hundred times longer to run, for example, or maybe it's biased towards something or other. Then I can't look at one metric; I have a lot of considerations to make. So one pattern which we find very useful is to have, let's say, a notebook which is the output of an experiment, so that you can see side by side all kinds of different aspects that changed, like see different loss curves side by side, see the different metrics side by side. So again, I'm not talking about using it exactly for the source code, but for visualizing what changed, that's very useful, and looking at it as part of the pull request and the review process.
[00:25:12] Unknown:
The first thing that Guy mentioned, about cookiecutter data science, is also interesting, because when we look at projects on DagsHub that aren't using the cookiecutter data science template, many of them converge to that structure anyway. So I feel like it's something of an evolutionary process where it's just a structure that makes sense, so many projects have it. I think another thing which we are also doubling down on is using generic formats wherever possible. So, concretely, we're saving metrics in a CSV format and saving parameters in a YAML format, which is human readable and very easy to understand. So that means it's really intuitive to share it with other people and for them to see what's going on.
Also, it's very portable. You can load it up into a Jupyter Notebook and analyze your results if that's something that you want, and it's very straightforward. And I think that's also a recommendation we have in general. Like, there is a lot of reinventing the wheel going on, and the solution for that is that when you have a problem that you think is entirely novel, you should first think of similar solutions that were done in other places and whether or not you can adopt most of that solution into your use case. So just to, again, double down on the example that Guy gave with the Jupyter Notebooks: everyone uses Jupyter Notebooks to analyze data. It's a great and very flexible tool. And one of the challenges with a lot of these sort of automatic data analysis tools is that they limit you, because you have to have someone that implemented a feature to, for example, normalize a column or something like that. So instead of building that feature yourself, why not let users interface with the data via Jupyter Notebooks, and then they have all the possibilities that they would have in a regular project. They're already familiar with the interface, and that sort of empowers the user while also enabling tools built on top of that, like DagsHub or any other tool that supports Jupyter Notebooks, to present that information in a meaningful and useful way.
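(Editor's note: as a concrete sketch of the "generic formats" idea, here are flat parameters written as human-readable YAML and metrics written as plain CSV, using only the Python standard library. The file contents, keys, and values are illustrative, not DagsHub's exact format.)

```python
import csv
import io

# Hypothetical experiment parameters.
params = {"lr": 0.01, "epochs": 5}

# A flat params file is just "key: value" lines -- readable by any YAML
# parser, or simply by eye in a diff or pull request.
params_yaml = "".join(f"{k}: {v}\n" for k, v in params.items())

# Metrics as CSV: trivially loadable into a notebook, a spreadsheet,
# or pandas for further analysis.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["step", "loss", "accuracy"])
writer.writerows([(1, 0.91, 0.62), (2, 0.54, 0.81)])
metrics_csv = buf.getvalue()

print(params_yaml)
print(metrics_csv)
```

Because both formats are plain text, they diff cleanly in Git, which is what makes committing them alongside the code practical.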
[00:27:12] Unknown:
Yeah. They're very useful for showing what was done. I think the trouble maybe starts when you also treat them as, like, the source code.
[00:27:21] Unknown:
I think this is a good opportunity to dig more into the DagsHub platform itself. Can you describe a bit about how you've built the underlying capabilities, and maybe describe some of the value-add features that you have created on top of those open source toolchains?
[00:27:41] Unknown:
The basic protocol is to git push to DagsHub, and we support any Git project. But in particular, we give you extra nice features if you're using DVC on top of Git. DVC does not replace Git; it's kind of like an add-on. And apart from that, we try to keep things very, very close to the standard protocols. So if you're using DVC, we can show you the pipelines. We can let you edit them through our UI. We can automatically connect to your existing cloud storage and show you the data as it should be, like, next to the code. Let's say, if you do git clone and dvc checkout, it would download your data from the cloud storage and put it in your data/raw folder next to your code. So if you browse it on GitHub, like, you won't see the data folder at all. It won't exist.
If you browse it on DagsHub, then you can actually see everything as you would if you actually checked it out. So I think it's very critical, especially for open source, where you want to give people a frictionless way to see what other people did. Apart from that, we recently launched DagsHub storage, which is now, I think, fair to say, officially the easiest way to create a DVC remote and push and pull data to it. So every project you open on DagsHub automatically gets some free storage that you can use as a DVC remote, and you don't have to set up, like, putting your credit card into a cloud provider and deal with access controls, which is always a huge pain. That just works automatically.
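The layering described here, where Git tracks a small pointer file while the data itself lives in content-addressed storage and is restored on checkout, can be sketched in a few lines. This is a toy illustration of the idea only, not DVC's actual file format or API: the `.ptr` suffix and the `track_file`/`checkout` names are hypothetical (DVC itself writes `.dvc` pointer files and historically uses MD5 content hashes).

```python
import hashlib
import os

def track_file(path: str, cache_dir: str = ".cache") -> str:
    """Toy DVC-style tracking: copy the file's bytes into a content-addressed
    cache and write a tiny pointer file that Git can version instead."""
    with open(path, "rb") as f:
        data = f.read()
    digest = hashlib.md5(data).hexdigest()  # content hash identifies the version
    os.makedirs(cache_dir, exist_ok=True)
    with open(os.path.join(cache_dir, digest), "wb") as f:
        f.write(data)  # the large file lives in the cache (or remote storage)
    with open(path + ".ptr", "w") as f:  # small pointer file goes into Git
        f.write(f"md5: {digest}\npath: {path}\n")
    return digest

def checkout(pointer: str, cache_dir: str = ".cache") -> None:
    """Restore the data file next to the code, like `dvc checkout` does."""
    with open(pointer) as f:
        digest = f.readline().split(": ")[1].strip()
        path = f.readline().split(": ")[1].strip()
    with open(os.path.join(cache_dir, digest), "rb") as src, open(path, "wb") as dst:
        dst.write(src.read())
```

The point of the layering is exactly what is described above: a clone without a checkout shows only the pointer, while running the checkout step materializes the data folder next to the code.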
Apart from that, we have a lot of convenience features. So Dean mentioned that in terms of, let's say, experiment tracking, which we provide as part of every repo, because we think it's, like, an inseparable part of every data science project. So right now, we treat every experiment as directly connected to a Git commit. The way we actually do it is tell you, okay, we don't care if you're using R, if you're using Python, if you're using even Excel. Like, as long as you commit a params YAML file and a metrics CSV file into your Git repo, we will scan your commits and show them to you in an experiments table, which you can now search, filter, order, see the loss curves, like, graphically compare things, and even have a discussion on these experiments as part of pull requests.
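The commit-based convention described above is language-agnostic: anything that can write a params YAML file and a metrics CSV file into the repo can participate in experiment tracking. A minimal sketch, with hypothetical file names and column layout (check DagsHub's documentation for the exact schema its scanner expects):

```python
import csv

def log_experiment(params: dict, metrics: list[dict]) -> None:
    """Write hyperparameters and metric history as plain files, ready to be
    committed to Git alongside the code that produced them."""
    # Flat YAML of hyperparameters (written by hand to stay dependency-free).
    with open("params.yml", "w") as f:
        for key, value in params.items():
            f.write(f"{key}: {value}\n")
    # Metric history (e.g. loss per training step) as a CSV table.
    with open("metrics.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "value", "step"])
        writer.writeheader()
        writer.writerows(metrics)

log_experiment(
    {"learning_rate": 0.01, "epochs": 5},
    [{"name": "loss", "value": 0.7, "step": 1},
     {"name": "loss", "value": 0.4, "step": 2}],
)
```

Committing these two files then makes the experiment a regular Git commit, which is what lets the platform reconstruct an experiments table from history alone.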
One thing which we also offer is data science pull requests, which is, we believe, actually the first time people can contribute to open data science projects. Let's say I go to Papers with Code, and I find a machine learning research paper which has its code on GitHub. Great. I go to GitHub. I can see the code. Usually, I can't access the data. But let's say I can, because they gave me a link and it's public. But now I have an improvement to make. I want to contribute to it. So let's say I found a bug in the data. This is something that actually happened to us. Like, this is not an invention. The first tutorial that we made, oh, sorry, the second tutorial that we made for DagsHub, I downloaded the data from the Stack Overflow API and kind of did a project about predicting whether a question relates to machine learning or not.
And I got good results and, like, published a tutorial, and then I found out after months that the data was completely broken. That was totally my fault. I didn't notice that I didn't sort the results by date. So I got, like, a very weird distribution of results, because the API limits you to a certain number. And also I didn't filter correctly. And the end result was I got garbage data and got good results on garbage data. So now if I go to Papers with Code and find a project on GitHub, even if theoretically all of this is available, I will have a very hard time seeing it.
And even if I do see it and I do fix the data, let's say I rerun the API query and get fixed data, and now I get different results, I don't have any way to contribute that back to the original project. I can maybe email the writer of the original paper and say, here is some new data and my new results on it. Please maybe try to incorporate it. And what we try to do in DagsHub is say, okay, if you fork the project, you have a DVC remote, you can push the new data to it, you can create new experiments on it that are automatically recorded as commits. Now, when you open a pull request to the original repo, the maintainer can see everything that changed in the parts that matter, not just the code diff, but what changed in the data, get the new trained models automatically, be able to compare the newly created experiments, and, with a click, copy all of that new information as part of the new Git history, along with the new data and everything.
And it's now the new master version of the base repo.
[00:32:44] Unknown:
One thing we haven't mentioned is notebooks. Like we said, it's important to use them in some cases and to be sure that you want to use them in others. But we realized that a lot of people really like notebooks, and they're very useful in a lot of data science projects. So we support notebook viewing and notebook diffing in a human readable form. As you probably know, notebooks are JSON files. So if you do a code diff on them, it's horrible and you can't really make sense of anything. So with DagsHub, you can actually see that as integrated into the platform.
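Since a notebook is just JSON, the core of a human-readable notebook diff can be sketched as: parse the JSON, keep only each cell's source, drop the noisy outputs and metadata, and diff what remains. This is a minimal sketch of the idea; DagsHub's actual viewer, and dedicated tools like nbdime, are far more complete:

```python
import difflib
import json

def cell_sources(notebook_json: str) -> list[str]:
    """Extract just the source text of each cell from a .ipynb JSON string,
    ignoring outputs, execution counts, and metadata."""
    nb = json.loads(notebook_json)
    return ["".join(cell["source"]) for cell in nb["cells"]]

def notebook_diff(old_json: str, new_json: str) -> str:
    """Diff two notebooks cell-by-cell instead of as raw JSON."""
    old, new = cell_sources(old_json), cell_sources(new_json)
    return "\n".join(difflib.unified_diff(old, new, "old", "new", lineterm=""))
```

Diffing the raw `.ipynb` files directly would bury the one-character code change under churn in execution counts and base64-encoded outputs; stripping down to cell sources first is what makes the diff readable.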
Also, a lot of projects, especially in deep learning where you're using images or text, have data that is folders full of images or text files, things like that. It's really hard to understand what's going on if you're doing a diff and you're seeing a list of a thousand files that have changed. So we tried to think of a convenient solution for that, and we have something that we call directory diffing, which basically lets you diff the project in the context of the directory structure. This is true also in data science pull requests. And that makes it much easier for our users to sort of make sense of changes in their projects
[00:33:51] Unknown:
and then contribute to them. Maybe the last part is, we are still actively working on this, but to be able to automatically scan your data. Let's say we have a few users who did this: publish a paper where the main point of the paper is, I created a dataset, and I want this dataset to be useful to society, so please check out the repo on DagsHub where the dataset is hosted, and anyone can fork it and change it and whatever. So one thing which we started doubling down on is to give you much richer interaction with data if it's in a standard format. Git or DVC push a CSV file to us? Okay, then we can let you play with that data through our UI to see if this is a project that you're actually interested in. I think that's one of the things that we learned from user interviews: when people are looking at an open source project, they first start by looking at the data, because they want to see, is this garbage? Like, is this actually something that is workable, something that I can create interesting things on top of? So we want to make that learning process as fast and as convenient as possible.
Yeah. Just making more and more ways to show the users what they actually want to see as easily as possible.
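The directory diffing idea mentioned above, collapsing a flat list of a thousand changed files into a view organized by the directory tree, can be illustrated with a small sketch (the function name and output shape here are hypothetical, not DagsHub's implementation):

```python
from collections import Counter
from pathlib import PurePosixPath

def directory_summary(changed_files: list[str]) -> Counter:
    """Collapse a long flat list of changed paths into per-directory counts,
    roughly the idea behind diffing in the context of the directory tree."""
    return Counter(str(PurePosixPath(p).parent) for p in changed_files)

summary = directory_summary([
    "data/train/img_001.png",
    "data/train/img_002.png",
    "data/val/img_101.png",
    "src/train.py",
])
# A reviewer now sees a handful of directory-level counts
# instead of a thousand-line file list.
```

A real implementation would aggregate recursively up the tree and pair counts with per-file diffs on demand, but even this flat grouping shows why the directory view is easier to reason about.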
[00:35:11] Unknown:
In terms of the bootstrapping problem of building up a mass of users and repositories on the platform to make it attractive, and just some of the dynamics that brings up, where GitHub has gained a critical mass of users and has become the default place to go for working on open source code, I'm curious how you're thinking about that problem for DAGS Hub.
[00:35:42] Unknown:
And I think, as sort of data scientists, we really like to talk about our tools. But for this issue, it's much more important what interesting projects exist on the platform. And I think that's what's going to draw data scientists to DAGS Hub. Our approach with respect to GitHub is, again, GitHub is a great platform. It has a lot of advantages, and we are supporting it wherever we can. So that means it's really easy, because both DAGS Hub and GitHub are based on Git, to move a project from GitHub to DAGS Hub and vice versa, or to mirror a project from one to the other. So you can actually get the best of both worlds if that's important. And sort of on the technology side, we're trying to focus on, again, providing the features that will make it more reasonable to do data science work on DAGS Hub while not losing the connection to GitHub.
But our general approach is to create, and to help the community create, interesting projects, and then invite the community to collaborate on them. And I don't think you necessarily need a critical mass of projects. You just need a few meaningful projects that the community cares about, and people will go the extra mile, and we're already seeing that. Like, the interesting projects that are being created on DagsHub are the things that are drawing the most interest in the platform, not necessarily a feature here or there. So I think our plan is to continue to double down on that, which means integrating with other tools to make it easier for users to create their projects and get meaningful advantages as data scientists working on the platform, but also to engage with the community and see where we can help. So that's definitely helping or partnering with organizations that are working on open source data science, that are working on social good or public good data science projects. There are a few really interesting organizations there. One which we partnered with recently is Omdena, which basically does collaborative data science projects for public good causes.
We plan to continue to double down on that because we see that it's actually working.
[00:37:49] Unknown:
One of the interesting aspects of having these projects out in the open, similar to having platforms like GitHub and GitLab where there's source code available for viewing, is that it serves as a way for people to learn more about different patterns or ways of approaching problems. And I'm curious, in terms of your own experience of working on projects out in the open and looking at other people's projects, what are some of the lessons that you've learned about how to approach data science, and how has that impacted your own methods of thinking about problems and working on problems for this particular type of application?
[00:38:26] Unknown:
One of the main things that I learned as we were working on this and talking to people, seeing what they did, is that there are many different personas. Like, this is probably true in any field, but I think it's very true for data science. There are many different personas, and it's hard to try to serve them all at once. So, for example, some people, and I think maybe Jeremy Howard is the champion of this approach, are super into notebooks and will never leave them willingly. I don't want to start a flame war. Not saying I don't agree with that approach, but I'm saying, for those people, any dev tool that isn't directly usable as some notebook magic or Python API is just a nonstarter. Like, there's nothing to discuss, and there's usually not much point in talking about, like, reproducibility, versioning, and stuff like that.
It all comes down to personal goals for the person and the project, and taste. Like, if you're doing quick and dirty experiments and trying to be as creative as possible, that's amazing. If you're working on a longer term project, then you need to invest more in something like DagsHub. Not that one is good and one is bad. It's just different use cases. I think that's the thing that I learned which is very impactful, and true for many other cases. I'm trying to think of a, like, data science technique that I learned which was exciting for me. None that I can think of immediately.
[00:39:52] Unknown:
And in terms of the ways that people are using DAGS Hub, what are some of the most interesting or unexpected or innovative ways that you've seen it exercised?
[00:40:02] Unknown:
This was a while ago, but it's something that stuck in our mind. We had a really enthusiastic user we engaged with on a few different occasions, who really wanted to use DagsHub for version control in video game development. Obviously, video games have very large assets. You have, like, 3D models, if it's a 3D game, textures, music, many things that are very large files. Not exactly data, but they are 100% part of the source code of the project. They require versioning, because you have, like, designers working on these assets, and they're changing all the time. And so he wanted us to add more features to sort of compare 3D models, for example, across branches. And we were obviously very tempted, because we're both gamers.
He also wanted us to go and present DagsHub at a major game company event. But in the end, we decided that, obviously, our initial focus was data science and that we should continue focusing on that. But it was definitely a very interesting use case that we did not think about
[00:41:12] Unknown:
when we started DAGS Hub. In your experience of building out the DAGS Hub platform and using it for your own work and seeing other people using it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:41:23] Unknown:
This one is actually relatively recent. We were speaking with a user about creating automations for training machine learning models. So our approach with DagsHub is, again, integrating with existing solutions to do this. But we realized that today, there are no good solutions for adding automations that actually affect the project. What I mean by that is: tests, you sort of run them on the code that is, for example, being contributed in a pull request, and you get a success or fail result. But what happens if you have an automation that is actually supposed to push code into the repository, and you want to do that for an open source project that accepts contributions from strangers?
So this is actually a challenge that we're now sort of thinking about how to solve. And I think it's unique to a community first platform, but it's definitely very interesting. And there are a few sort of nuances there that we didn't expect. Like, it seems like it's a solved problem, but when you start thinking about it, you realize that it's complicated.
[00:42:28] Unknown:
I think one thing which I never thought about, but after thinking about it more, I realized what a problem it is: doing continuous integration in a closed project with only your collaborators is so much more flexible. Because, for example, I can say that the build script itself is part of the source code. It makes sense. Like, I have a Jenkinsfile or a Makefile or whatever file. And the CI is very thin. It just checks out the pull request version, builds it based on the Dockerfile or whatever file that tells it what to build. And it either passes or it doesn't. But if you're in an open source project, suddenly anyone can create a pull request and change the Dockerfile to do Bitcoin mining for them. Just open a ton of pull requests to do automatic Bitcoin mining at the expense of the open source project maintainer, at the expense of GitHub, or whoever is running the CI system.
Maybe Bitcoin mining is a less good example, because throttling would pretty much solve it. But let's say exporting secrets that the CI system is keeping, which are only usable as part of the build, like in environment variables or something. So those are the kinds of concerns that Dean is talking about, on top of the other concerns, which are like, okay, if I want to actually push back to the pull request, for example, a new experiment, which is a commit, then there are a lot of nuances that we didn't think about before.
[00:44:00] Unknown:
And for people who are looking to be able to collaborate on data science and machine learning projects, whether in the open or just within the confines of their team, what are the cases where DAGS Hub is the wrong choice?
[00:44:12] Unknown:
The easy answer to that is that if you're creating a throwaway project or a short term project, it probably doesn't make sense to use DAGS Hub. It might sound trivial, but a lot of data science work is sort of still in the proof of concept stage. So you just want a notebook that shows some capability of a model to learn, for example, and you don't really care about the details. So you're creating a notebook, you try some different things out, and then you never return to it. So you probably don't want to go through the process of setting everything up if that's what you're going to do. On the other hand, if you plan to share your work with others, or sort of work with collaborators, or maybe your boss or something like that, you'd still probably find DAGS Hub a very good way to show them what you've done, and, obviously, if you get back to that project later. I think another thing here that we stumbled upon is if you're working on projects as, like, a contractor, and you have, like, a one month project, and then you don't know what is going on. You don't care anymore. You tossed it over the fence.
Yeah. So in such a case, that's also going to feel like it's a lot of work to use DagsHub. Obviously, that's not true if you consider the entire sort of pipeline of the project, because now you're going to throw it over the fence and someone else is going to worry about it. But I would say that in such a case, it's probably in the interest of the client to insist on using a platform like DAGS Hub, so that the project is managed in an ordered way. But, yeah, for short term projects it doesn't make sense.
[00:45:45] Unknown:
As you continue to build out the platform and bring on new users and expand the capabilities, what do you have planned for the future of both the platform and the business and just the sustainability of this effort?
[00:45:58] Unknown:
On three fronts. The first is the product front. So our near term plans are to add real time experiment tracking capabilities. That's something that's been coming up from users. Obviously, you can already see your experiments on DAGS Hub, but if you want it in real time, if you're training a heavy deep learning model, you want to see that the loss curve is converging. We plan to add support for such a use case in the near future. We also plan on adding sort of integrations and help with automations, for example, continuous training and deployment, which we mentioned a few questions ago. Again, in all of these cases, we don't plan on building these things from scratch, but integrating with the best solutions that are already in the field, since that offers our users the best options that they could have.
Another aspect that we plan to double down on, which I think we also mentioned, is the sort of ability to interface with data. It's arguably the most important part of a data science project. It has to be easy and really convenient to understand what's going on with the data, interact with it in the environment that you choose, and then review and contribute to it. We're already providing value here, but we definitely plan to expand that. And on the community front, we plan to continue to add, and encourage others to add, open source data science projects, and to continue partnering with organizations that are working on this front. I gave a few examples of collaborative data science project platforms, but there are also things like Papers with Code and others.
And I think that another area which is neglected on this front is sort of examples of projects that combine more than one tool. A lot of times, people are using a tool and they want to add a capability for something, and it's really hard to understand how they should do that. So we want to lead by example and create many of these options, so that if you have a tool combination that you'd like to use, you can find a great example of a project like that and get started as quickly as possible. On the sustainability front, basically, I think it's very straightforward for us. Like, the approach is: DAGS Hub is free and will continue to be free for the community, for open projects, for social good projects, etcetera.
And then if we take the example of open source software, the best workflows, the best tooling, were built around open source software and then sort of permeated into the industry, which usually paid for it. So our approach is similar, in the sense that if a company wants to incorporate DAGS Hub and the workflow that we are building into their organization, then that will be a paid option. Another sort of point here is that many companies have data which they cannot expose, so they would need an on-premises installation, things like that. So that's our approach with respect to monetization.
[00:48:47] Unknown:
Are there any other aspects of the DagsHub platform, or the challenges of collaboration in data science, or doing data science out in the open, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:49:00] Unknown:
Yeah. Definitely. So there is, like, the open question of licensing and privacy. About licensing, I'm not so worried. Like, sometimes it's mentioned as an issue: my god, we don't know what the open source licenses are for data. And maybe I'm being naive, but I think in the end, I see that a lot of people just want to share the data and let anyone do whatever with it, which is like the BSD license or any number of other open source licenses that exist. And maybe you could have variations which are equivalent to GPL, copyleft, things like that.
But I think those are the kinds of issues which will sort themselves out, because people actually want to share data and to work on it openly. Having said that, one thing which, in the far future, I think we would like to address is the issue of supporting maintainers of open source datasets, something that GitHub and other companies are now trying to do, to connect open source developers with money contributions. So that's something very interesting that I think we would very much like to solve at some far future point. The other issue is privacy, which is more sticky. Like, maybe some project would be very useful as an open source project. Face detection is not a good example, because politically it's being misused. But you can think of other examples where you actually use people's information to train something which is very useful in general and can be used for good. Let's say medical companies, who have very useful data which can be used for a lot of drug discovery and things that are beneficial to society, but they just can't release the data. And that's a shame. Maybe things like differential privacy and stuff could solve this, but I'm not sure.
[00:50:56] Unknown:
Yeah. I think that the questions of privacy and bias in particular are interesting ones, particularly for work that's being done out in the open, but also the licensing and ownership challenges around the datasets: how do you determine what data is acceptable to put in an open repository? How much of it needs to be redacted? How do you handle things like, as you mentioned, differential privacy? Definitely interesting problems that don't have any easy answers right now.
[00:51:23] Unknown:
Yeah. I think there's enough of a mass of data which could just be open source to make this a very interesting area of activity anyway before those problems are solved
[00:51:33] Unknown:
sufficiently. I would also add that, putting aside, like, PII and HIPAA regulated data, if we put aside all the data that is problematic with respect to legal issues, I think the issues of bias, which are obviously very important to solve, would be solved faster if the data was accessible to everyone. It means more eyeballs, more working hands contributing to fixing the biases in data, or adding data to sort of counterbalance the problematic data that we already have. And that's definitely something that we would want DAGS Hub to contribute to and lead as much of as possible. When we came up with, like, the value proposition
[00:52:12] Unknown:
originally, that was really our go to example. Like, we would call biases just data bugs. Like, okay, you don't have representative data. If it was open sourced, some contributor from some not well represented community in another part of the world could come in and say, you have a data bug, you don't have enough examples from my area, from my culture, and here, I just contributed them to you. Instead of arguing about it online, we can just fix it.
[00:52:41] Unknown:
Yeah. It's definitely analogous to the security issues that plague software projects, and the fact of it being open meaning that more people who have the necessary expertise can have eyes on it and contribute fixes in the areas you mentioned, for bias in these source datasets or the way that the data is being processed. If there are more people who are able to access it, then, as you said, they can contribute back and incrementally move towards a more equitable result. Yep. Exactly. Well, for anybody who wants to follow along with the work that you're both doing and contribute or try out the platform, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. And this week, I'm going to choose a book that I am revisiting with my family and reading aloud to the kids, called The Remarkable Journey of Prince Jen by Lloyd Alexander.
It's a book that I read a long time ago when I was a kid. It's a very interesting storyline. The way that it's presented and structured is a little bit different than your typical sort of youth or young adult novel. It's a story of a sort of journey of discovery that this young prince goes on, out into his provinces, and just a very interesting way of presenting some of the useful sort of lessons that are necessary to help us all sort of be better people. So definitely a fun book to read and worth checking out. And so with that, I'll pass it to you, Dean. Do you have any picks this week?
[00:54:04] Unknown:
On the book front, I'm going to recommend a book that I'm currently reading, which I got from Guy for my birthday, which is called Quantum Computing Since Democritus. I don't know what you think that book is about, but I think it's really interesting because it covers a lot of scientific topics from math, computer science, physics, and more, and some philosophy. But the nice thing about it is that I've read a lot of sort of pop science books, and I've read a lot of academic content. And this book sort of strikes the balance between not being too pop sciency. He's not afraid of showing complex proofs as part of the story, but also not entirely academic, which would arguably be boring. I find that I'm really enjoying that book. And on the viewing front, I am currently watching the 5th season of The Expanse.
It is a great show. For those that don't know, it's a sci-fi show which tries to be relatively realistic. So, basically, it's not too far in the future, where humans have inhabited the moon, Mars, and the asteroid belt. And there are a lot of sort of politics and dynamics inside, which are interesting. It's really recommended. The fifth season is great, so I recommend it.
[00:55:19] Unknown:
Game of Thrones in Space. Definitely a great show.
[00:55:23] Unknown:
I've been enjoying it myself as well. So, yeah, Guy, what do you have for picks this week? By the way, this one wasn't one of my picks, but it's important. We mentioned
[00:55:31] Unknown:
Papers with Code. We would be remiss not to mention Connected Papers. While we sponsor them, they're a great service and idea which everyone should check out. They let you put in, let's say, an arXiv link, and you get a graph of all the related papers and how impactful each one is, and you can kind of build a reading list on a topic. So it's very useful and very recommended. They're great friends. In terms of picks, I kind of always wanted to prepare a recommendation list for podcasts which I would be interviewed on, because I always listen to podcasts and hear the recommendations, and I get frustrated when the guests say, I don't know, I haven't thought of anything. My recommendation is, like, I thought it appropriate to make it timeless, because I always advise: try to consume only the very best of available content, and not just the things that are coming out right now. The Expanse, although it's new, is very good, so that checks out. And I think it applies to books, TV shows, movies, programming, whatever.
So my picks are kind of the things I think are most useful in the long term, not necessarily new things. So, something which has influenced me a lot in my thinking, while I was, unrelatedly, learning about machine learning, is that I discovered lesswrong.com, which is a blog about rationality, AI, all kinds of topics. I think you might not agree with everything said there, but it really, really can change how you think of things, and maybe sometimes afterwards you would see online discussions and think, why are they even arguing about this? Like, LessWrong has already answered this question about AI philosophy very conclusively. So I would really recommend it.
And also the offshoot blog, Slate Star Codex, or Astral Codex Ten as it's now called. In terms of TV shows, I recently watched Avatar: The Last Airbender, like, very late, and I was shocked. Like, it's such a masterpiece. So I would recommend anyone who skipped it, maybe because they were too old or something: go and watch Avatar: The Last Airbender. It's not a kids' show. It's just a masterpiece. If you haven't already, go see 3Blue1Brown on YouTube. And one last thing, I always recommend to people who are into code: learn these two languages, Haskell and Clojure, or any other Lisp, but Clojure is like a modern, actually usable alternative. Because those are the two languages which completely changed how I think about things permanently.
So those are the kinds of recommendations that I like.
[00:58:11] Unknown:
Well, thank you to the both of you for taking the time today to join me and share your experience of building DAGS Hub and working through the problem of collaboration in data science. Definitely a very interesting and complex problem, and one that is increasingly relevant and necessary to address. So I appreciate all the time and energy that you've put into that, and I hope you two enjoy the rest of your day. Thank you for having us, Tobias. Thank you for the opportunity. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Overview
Meet the Founders: Dean and Guy
The Origin Story of DAGS Hub
Features and Benefits of DAGS Hub
Challenges in Data Science Collaboration
Useful Patterns for Data Science Projects
Building and Enhancing DAGS Hub
Attracting Users to DAGS Hub
Lessons Learned from Open Data Science Projects
When DAGS Hub is Not the Right Choice
Future Plans for DAGS Hub
Addressing Licensing and Privacy Issues
Closing Remarks and Picks