Summary
Python is a leading choice for data science due to the immense number of libraries and frameworks readily available to support it, but it is still difficult to scale. Dask is a framework designed to transparently run your data analysis across multiple CPU cores and multiple servers. Using Dask lifts a limitation on scaling your analytical workloads, but brings with it the complexity of server administration, deployment, and security. In this episode Matthew Rocklin and Hugo Bowne-Anderson discuss their recently formed company Coiled and how they are working to simplify the use and maintenance of Dask in production. They share their goals for the business, their approach to building a profitable company based on open source, and the difficulties they face while growing a new team during a global pandemic.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- This portion of Python Podcast is brought to you by Datadog. Do you have an app in production that is slower than you like? Is its performance all over the place (sometimes fast, sometimes slow)? Do you know why? With Datadog, you will. You can troubleshoot your app’s performance with Datadog’s end-to-end tracing and in one click correlate those Python traces with related logs and metrics. Use their detailed flame graphs to identify bottlenecks and latency in that app of yours. Start tracking the performance of your apps with a free trial at datadog.com/pythonpodcast. If you sign up for a trial and install the agent, Datadog will send you a free t-shirt.
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host as usual is Tobias Macey and today I’m interviewing Matthew Rocklin and Hugo Bowne-Anderson about their work building a business around the Dask ecosystem at Coiled
Interview
- Introductions
- How did you get introduced to Python?
- Can you give a quick overview of what Dask is and your motivations for creating it?
- How has Dask changed or evolved in the past 3 1/2 years since we last talked about it?
- How has the rest of the ecosystem changed in that time?
- After working on Dask for the past few years, what led you to the decision to build a business around it?
- What are the sharp edges of programming for Dask that users are looking for help on solving?
- What are the difficulties that users face in deploying and maintaining a production installation of Dask?
- What are the limitations of Dask when scaling both up and down?
- What are you building at Coiled to improve the user experience for users of Python and Dask?
- What are your thoughts on the pros and cons of orienting your messaging around the scalability of Python, as opposed to focusing on a specific industry or problem domain?
- What are the challenges that you are facing in managing the tensions between the open source and proprietary work that you are doing?
- How are you handling the ongoing governance of the Dask project?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while building and launching a company based on an open source project?
- What do you have planned for the future of both Coiled and Dask?
Keep In Touch
Picks
- Tobias
- The Hobbit
- Audiobook
- Audible Free Trial (affiliate link)
- The Hobbit
- Matt
- Hugo
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Sign up for the Coiled Beta!
- Coiled
- Dask
- Data Engineering Podcast Interview About Dask
- PyData
- NumPy
- SciPy
- Cell Biology
- Datacamp
- Dataframed
- Matthew Rocklin on Podcast.__init__ about functional programming with Toolz
- IPython Notebook
- PyTorch
- Airflow
- Prefect
- XGBoost
- Tornado
- Coiled Blog Post About The Goals of Dask
- Spark
- AsyncIO
- Concurrent.futures
- Pangeo
- Xarray
- RAPIDS
- Nvidia
- Cuda
- Prefect
- Celery
- Life Sciences
- Tensorflow
- Snorkel
- Dagster
- DevOps
- Docker
- Kubernetes
- Metaflow
- Ray
- Anyscale
- Yarn
- Gartner Hype Cycle
- Travis Oliphant
- Postgres
- Amazon ECS
- Django
- Django Allauth
- Quansight
- Wes McKinney
- Ursa Labs
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try out a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode today, that's L-I-N-O-D-E, and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis.
For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your own home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today. Your host as usual is Tobias Macey, and today I'm interviewing Matthew Rocklin and Hugo Bowne-Anderson about their work building a business around the Dask ecosystem at Coiled. So, Matt, can you start by introducing yourself? Hey, Tobias. Yeah. My name is Matthew Rocklin. I've been a long time open source maintainer in sort of the PyData space, so the NumPy, Pandas, scikit-learn, Jupyter space.
[00:01:36] Unknown:
Mostly I think about scalable computing, so parallelizing that ecosystem, mostly with the library Dask. I've been sort of 1 of the maintainers of Dask for the last 5 or 6 years. And you're also CEO of Coiled. And I'm also CEO of Coiled. That's right.
[00:01:51] Unknown:
And Hugo, how about yourself? Well, that's a wonderful segue into me. I run data science evangelism and marketing, wearing a few other hats, at Coiled, which we've just founded. My background is in math, research science in cell biology, and data science, and I've done a lot of work in data science education. I recently joined Coiled from DataCamp, where I built out, I suppose, foundational PyData ecosystem educational material. And that's actually how Matt and I met around 4 years ago: we were collaborating with Anaconda on a lot of online educational content for the PyData stack. Podcast listeners may recognize Hugo from his previous podcast, Dataframed,
[00:02:37] Unknown:
where he was the host.
[00:02:39] Unknown:
What is it like being on this side of the microphone, Hugo? I really like it. I like both sides, but I also like finding a new audience and a new listenership, and getting new feedback on the types of things we work on and are interested in. I gotta say, I do like asking the questions though, as well. So forgive me if that happens
[00:02:58] Unknown:
at some point, if my curiosity gets the better of me. Yeah, absolutely. As somebody who's been on both sides of the mic myself, I can definitely relate. It's a very different experience being the person being asked the questions than being the person who's driving the conversation.
[00:03:11] Unknown:
Absolutely.
[00:03:12] Unknown:
And so, Matt, do you remember how you first got introduced to Python? I think so. I've actually been on your show a couple of times, so I'm going to skip that question and instead answer about the first time I contributed a patch. Yeah, that sounds great. Yeah, I was playing around with SymPy a little bit. It's sort of a Mathematica clone inside of Python, and it was a Google Summer of Code project. Google Summer of Code is a great program. They give stipends to people to work on open source projects.
[00:03:39] Unknown:
And I did that. It was a great time, it taught me how to sort of live in the open source world, and I stayed ever since. And Hugo, do you remember how you first got introduced to Python? Oh, I definitely do. So I did my graduate work in pure math, and I've done a bunch of applied math, and I started my first postdoc, or my only postdoc, in Germany, in a cell biology lab, ostensibly to do applied math and mathematical modeling. It became clear that a lot of my colleagues were dealing with very large datasets.
And I'd done a bunch of programming bits and pieces in the past, but I needed to, you know, ingest, analyze, and reproduce scientific results on really large datasets that my biological collaborators were producing. And then I started self-teaching Python. Actually, I learned R and Python and worked in both at that time, but it was the advent
[00:04:32] Unknown:
of what were then IPython Notebooks that really helped me so much along the way, and helped me to educate, in the end, hundreds of other people in workshops at the Max Planck Institute as well. And so as we've already mentioned, we're here to talk about your work on the Dask project and the business that you're building on top of it. For people who aren't familiar with it, we did do an interview on the Data Engineering Podcast that went fairly in-depth about Dask itself, but for anybody who hasn't listened to that or who isn't familiar with the project, can you just give a quick overview of what Dask is and some of the motivations that you had for building it? Sure. Yeah. So Dask is an open source Python library that comes out of the PyData space. It was designed to parallelize other libraries, like NumPy, Pandas, scikit-learn,
[00:05:18] Unknown:
but it very quickly became far more general. And so today it's used by dozens of other PyData libraries in a variety of different fields, doing everything from advanced machine learning with PyTorch or XGBoost, to GPU accelerated work with RAPIDS, to backing Django websites in the same way you might use Celery, or backing Airflow or Prefect. So Dask is now sort of a general purpose distributed computing platform that's very Python native. It's pure Python; under the hood it's a Tornado application. So, yeah, it helps you run Python at scale. And I might just build on that by saying, Matt recently wrote what I think is a wonderful blog post on our blog, which is coiled.io/blog
[00:05:58] Unknown:
for those interested. But it's about the goals of Dask, and something that I didn't quite recognize, which of course now makes so much sense, is that Matt listed 2 technical goals. 1 was to harness the power of all the cores of a workstation in parallel, and the other was to support larger than memory computation. But in addition to that, there was a social goal, which was to invent nothing. And I quote, that Matt and everyone who created Dask wanted to be as familiar as possible to what users already knew in the PyData stack. And I think that's really important to recognize: all those things we're talking about are technical challenges, but they're also sociocultural challenges. And so having an API which not only mimics the APIs that we know and love, but also for the most part runs those APIs, and runs the code you're thinking of writing, in the back end, makes it Python data science native. And this is in contrast to distributed computing frameworks such as Spark, which has its own strengths. But it's a strength of Dask that it's native for people already working in the PyData ecosystem. Hugo, I love that in an explanation, you also referred to coiled.io, the website, and the blog. You're doing the cross marketing
[00:07:06] Unknown:
pitch. Absolutely. I appreciate that. And I'll also second the fact that it's a notable accomplishment that you have so far been able to make Dask essentially transparent to people who are trying to build these types of applications and numerical analyses, because
[00:07:22] Unknown:
it's definitely not easy to be able to keep up with the API changes of libraries that you are not primarily responsible for. Yeah. I mean, an interesting point is that there is no Dask API. Right? We work with the other communities, you know, NumPy, Pandas, Tornado, concurrent.futures. Every Dask API is a pre-existing API people are already familiar with. So, you know, if you are in an async application, you can submit futures in the same way you would with an executor, like a thread pool executor, and you can await them with async/await, and you can do all the things you normally do. So Dask is very much a movement to scale the existing code we all already use rather than a reinvention of the Python space. So it's very much part of the Python community.
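To make that "invent nothing" idea concrete, here is a minimal sketch: dask.distributed's Client exposes the same submit/result pattern as concurrent.futures, and dask.dataframe mirrors the pandas API. The toy function, the file glob, and the column names are illustrative, not taken from the episode.

```python
from dask.distributed import Client
import dask.dataframe as dd

def slow_square(x):
    return x ** 2

# Client() with no arguments starts a local scheduler and workers on your
# own cores; pointing it at a remote scheduler address scales the same
# code out to a cluster.
client = Client()

# concurrent.futures-style API: submit returns a future, result() blocks.
future = client.submit(slow_square, 10)
print(future.result())  # 100

# Fan out many tasks and gather the results, like Executor.map.
futures = client.map(slow_square, range(100))
print(sum(client.gather(futures)))

# dask.dataframe mirrors pandas; "data-*.csv" and the column names
# are illustrative placeholders.
df = dd.read_csv("data-*.csv")
print(df.groupby("name").amount.mean().compute())

# Inside an async application, Client(asynchronous=True) lets you
# await the same futures directly with async/await.
```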
[00:08:11] Unknown:
And as I mentioned, you were 1 of the early guests on the Data Engineering Podcast, where we dug fairly deep into Dask itself, but that was about 3 and a half years ago now. So I'm curious how the project has changed or evolved in the time since we last talked about it.
[00:08:26] Unknown:
Yeah. I would say technically it hasn't actually changed a ton. Dask has been relatively stable over the last few years. Where we've seen a lot of activity is in the growth of Dask-adjacent projects. This might be in domain specific applications. For example, all the earth scientists made Pangeo, which builds on top of xarray, which is a Dask-powered library, to do things like satellite imagery with NASA, climate change, oceanography. And there's hundreds or thousands of people over there who we've been supporting, who all use Dask. Similarly, there's the RAPIDS effort out of NVIDIA, a large effort inside of NVIDIA which is all powered by Dask and CUDA code to provide GPU accelerated data science. There are other projects like Prefect, which is sort of like a reinvented Airflow, and lots more. So I think most of what we've been seeing is a lot of growth of people using Dask, and most of the core contributors to Dask are mostly in sort of service mode. We are serving other communities who are now trying to scale. This week, for example, we've actually seen a lot of great growth in the life sciences. We've seen biomedical imaging pop up, we've seen population genomics pop up, we've seen single cell analysis.
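As a small illustration of what "a Dask-powered library" means in the Pangeo/xarray case: passing a chunks= argument to xarray backs its arrays with Dask arrays, so reductions over a large dataset run in parallel and out of core. The file name and variable below are made up for the sketch.

```python
import xarray as xr

# chunks= tells xarray to back its arrays with dask arrays; the file name
# and the "sst" variable are illustrative, not from the episode.
ds = xr.open_dataset("sea_surface_temperature.nc", chunks={"time": 365})

# This builds a lazy graph of per-chunk operations...
monthly_mean = ds["sst"].groupby("time.month").mean("time")

# ...which only runs (in parallel, locally or on a Dask cluster) on compute().
print(monthly_mean.compute())
```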
So this week, I've had, you know, 2 conversations with completely different groups who are both building
[00:09:37] Unknown:
infrastructure on top of Dask. 1 example that I love is NASA. I think they're doing a lot of incredible work. And on top of that, thinking about how Dask has changed or evolved in the past 3 and a half years is also a function of how Python has evolved and the increased adoption. I mean, David Robinson, when he was at Stack Overflow, wrote that wonderful post, The Incredible Growth of Python, looking at Stack Overflow trends. Right? And that was 3 years ago, but we've been seeing such incredible growth since. And I think scikit-learn is probably a dominant example, but there are so many others across different domains. And once again, as Dask is part of the PyData ecosystem, we've seen the entire ecosystem be lifted up together. The other thing I'll add, with my evangelist hat on, is that I'm just so excited to start surfacing more and more of the Dask stories that exist. And I know it may seem like I'm coming hard and fast with salesy pitches, but this actually comes from a place of serious excitement.
At the moment, it's mid August 2020. We're running weekly livestreams where we have people who use Dask come on. So check out our YouTube channel if you wanna check these out. Last week, we had Alex Egg, a senior data scientist at Grubhub, come and talk about how he uses Dask all throughout the productionization of search queries and intent recognition, using Dask, TensorFlow, Snorkel for weak supervision, and, like, using
[00:11:05] Unknown:
Dask to parallelize so many steps throughout the process. And that's really what I'm super excited about at the moment. So if you wanna check out more of these stories, definitely do come and join us for a Science Thursday. Yeah. That's definitely 1 of the things that I've seen that's notable about the past few years, that the majority of the growth and evolution has been in that surrounding ecosystem and the number of additional libraries that are using Dask as their method for distributing computation, rather than Dask taking the forefront. So you mentioned Prefect. I know that Dagster is another project that supports Dask, and people have replaced their Celery code with the dask.distributed project for being able to scale out their asynchronous tasks. And also, I think that when we first spoke, the dask.distributed effort was still fairly nascent. So it's been interesting seeing that become a sort of full fledged project in its own right, as just the execution and distribution method of the code, being something that you can use in
[00:12:16] Unknown:
that we don't have insight into. So it'd be great to hear from listeners who use Dask about their stories as well. To echo the Grubhub example, we were talking to Alex Egg. Actually, during
[00:12:25] Unknown:
that stream, Alex did a from snorkel.dask import something. And I was like, whoa, wait, Snorkel has Dask support? I had no idea that existed. And so that's kind of the level of penetration we're at now, which is a fun place to be. After having worked on Dask for the past few years,
[00:12:41] Unknown:
you have ultimately decided to build a business around it. I'm wondering if you can talk a bit about some of the decision points that led you
[00:12:49] Unknown:
to building that business and some of the motivations and goals that you have for building a company on top of and around Dask. Yeah. Building a business today on open source software, there's a lot of interesting variables at play. In terms of motivations, there are several. I mean, first, I wouldn't mind getting some money, but let's put that aside for a moment and pretend that I'm kind of a nice guy. I would say mostly it was that everyone was asking for it. Dask has been really well adopted by individuals and by other open source library authors, like we've talked about, but not particularly well adopted by large enterprises.
So if you look at a company like, let's say, Ford or any sort of large Fortune 50 company, almost certainly there's hundreds of people inside the company using Dask, but it's very unlikely that the company as a whole has adopted it, in the way that they would adopt something like Oracle or Spark. To really adopt a piece of software at a large institutional level, you really need to buy it from someone. Right? So Ford or NASA don't just adopt open source; they sort of want to buy something. So we need to give them something they could buy, so we need to have a company for that. Also, we sort of wanted to lower the barrier for individual users. So it's both serving the very large users, large companies, and also serving individuals. I care very deeply about making computation and data literacy accessible. And, you know, scalability is the hammer that I work on. What we found is that a lot of people could use Dask on their laptops. And if their workplace happened to have a really nicely set up cluster, maybe they're at a university and they have access to some HPC machine, maybe they have an IT department who understands the cloud and Kubernetes, they were able to use Dask happily at scale. But most people aren't in that situation. Most people were happy using Dask on a laptop, but they didn't really have the sort of DevOps skills necessary to launch at that scale. Yeah. So I might just add on top of that. I think you hit the nail on the head several times, in particular with all the DevOps that's required of modern data scientists,
[00:14:39] Unknown:
particularly whether it be, you know, using Docker and Kubernetes, and getting everything up and running on AWS, and getting something running locally, and moving seamlessly between your local computation and the cloud, which is something that is so challenging these days. Tobias, I listened to a recent episode of this podcast with someone from Netflix talking about Metaflow. I mean, large organizations are building things internally to solve these challenges as well, but not necessarily focused on the PyData ecosystem. So we're essentially making a product that makes scaling really easy. And that's our first product, which we're launching very soon, and so that is mid August 2020, and it's a Dask in the cloud service that just tries to be dead simple.
Our current users show that it's about 3 to 5 minutes from opening their beta invitation email to doing computing on the cloud at scale from your laptop, and having both environments and data access exactly the same on your laptop and in the cloud. I'll also add 1 more thing: we're solving for the end user, the data scientist, but we're also solving for institutional, cultural challenges. And I'll be more specific about what I mean by that. The open source ecosystem solves very well for people wanting to do and needing to do science, right? Data science, scientific research. It's building out some more collaborative techniques for sure, but it doesn't necessarily, and nor should it, solve for institutional cultural challenges. So allowing management to have insight into advanced telemetry, so that they can encourage collaboration more, and view costs and see what's happening across their org. Similarly, serving IT's needs. So we check all the boxes from IT, from network security to usage controls, everything you need in order to analyze your data safely. And I think part of the value prop is that our team has been doing this for a long time as well. So productizing that and taking it into organizations that require all these boxes to be ticked is what we're really excited about, as well as meeting
[00:16:47] Unknown:
the needs of end user data scientists. And I think that the overall push of moving to distributed computation is definitely very evident with the work that you're doing with Dask and the community that's built up around that, and some of the other movers of the ecosystem. I'm thinking in particular about the Ray project and their recent launch of the Anyscale company. So I'm wondering if you can just talk a bit more about some of the broader ecosystem changes that are necessitating these distributed capabilities and the ability to run Python code and other computation across large clusters of instances, beyond just what's capable on maybe, you know, a laptop or a decently sized server? Yeah. It's interesting. There's a few things going on. 1 thing is the growth of data. Right? You've got all these datasets, they're scaling up,
[00:17:34] Unknown:
and they're becoming, you know, arduous to work on from a laptop. Additionally, there's the rise of machine learning and the use of computation, and the clear value companies are getting out of computation. And then finally, there's just a lot more corporate engagement in Python. Right? Previously, you would have done all of this stuff in Java or in Scala, and there's sort of a rich ecosystem of tooling in the JVM world: Spark, Hive, Hadoop, etcetera. As machine learning rises, and as technologies like Docker have come up and shown that you can run Python in production pretty comfortably, all of that is sort of shifting over to the Python space. And as a result, we need to grow a distributed framework ecosystem very quickly. That's the kind of work we've been doing over the last 5 years. Right? Dask maintainers maintain the S3 connections and the GCS connections, and we maintain how to deploy on Yarn and all those sorts of things. So there's,
[00:18:24] Unknown:
yeah, there's a lot of business needs. There's a rise of data, and there's a great need today. And I'll add to that. I think we had the big data hype start a decade ago, or whatever it is now, and value, I think, was delivered clearly in tech, but not necessarily in a lot of other verticals. And so I think, if you view it as the Gartner hype cycle, in some industries we may be in the trough of disillusionment, but I do think there are huge gains to be made with the amount of data we have out there, and we just don't have the tooling to deal with it, for data scientists and machine learning engineers and people productionizing machine learning to deal with it. Matt and I are currently writing an opinionated and somewhat provocative post, that we truly believe in, about the fact that all-in-one data platforms serve very important purposes in infrastructure and data engineering, but currently they don't meet the needs of end user data scientists, who need nimbleness, flexibility, and ergonomic products that allow them to work with bleeding edge packages and move, as we've said before, seamlessly between the laptop and the cloud. And I think in terms of actually deriving the next level of value from all the data out there, we're gonna require these types of products to be built for data scientists to use to extract that value. And that's what we're really excited about as well. I'll also add, though, that big data, or whatever we wanna call it, isn't always the answer. When people ask, should I do distributed machine learning? I always say no, unless you have to. Plot your learning curves as you increase the size of your dataset, and if your accuracy, or whatever your evaluation metric of interest, plateaus well before you need to go distributed, maybe you don't need much more data. You know? Select your features accordingly before you go to a distributed setting.
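Hugo's learning-curve check can be done with plain scikit-learn before reaching for anything distributed. A rough sketch, using synthetic data as a stand-in for your own X and y:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for your real dataset.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5,
)

# If the validation score has flattened out well before the largest
# training size, more data (and a distributed setup) probably won't
# move the metric much.
for size, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{size:>5d} samples -> validation accuracy {score:.3f}")
```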
[00:20:12] Unknown:
So I think it's key, but it isn't always the solution. It's question and data specific. And to the point of the challenges of running things like Dask in production, or getting machine learning and data science workloads into a distributed capacity: for people who are using Dask, what are some of the existing sharp edges that you're looking to solve for as a company, and just as ongoing maintainers of the Dask library and its ecosystem?
[00:20:39] Unknown:
Yeah. I would say generally deployment is really hard, especially in an enterprise context, and especially for the kind of people who use Dask, who are typically data scientists, analysts, machine learning researchers without a lot of DevOps experience. It's honestly mostly a cultural thing. There's no novel thing that needs to be built to solve this problem. It's just how we sort of productionize it, how we turn that into a very routine operation. Hugo, you have lots of thoughts here. Do you have thoughts? 1 thing that springs to mind in particular is that when people
[00:21:08] Unknown:
try to move between local and the cloud, or move to larger datasets, their sense of flow is broken. And I don't necessarily wanna get too holistic here, but data scientists do their best work when they are in some sort of flow state with their environment, with their data, with their stack, with their computation, with their ecosystem. That's a huge thing that PyData has done. It's something that the tidyverse has done for the R ecosystem as well, I think. But in the countless conversations I've had with people doing distributed computing and needing to jump back and forth between different contexts, it's the context switching which breaks a flow state, which makes the process arduous and not ergonomic, and it also, I think, means everything just takes a lot longer as well. So that's why I always come back to meeting data scientists where they are. If they wanna do the DevOps and they enjoy doing that, that's fine. But for the most part, they wanna do what they do best and stay in that flow state of doing data science
[00:22:08] Unknown:
and machine learning. And I do think that is a technical and cultural challenge, as Matt has said. Yeah. So maybe let's give an anecdotal example. Let's say that you're using scikit-learn, you're doing some hyperparameter optimization on random forests, and you want to make it run faster because it's taking a while. So you do a web search for scikit-learn fast, you find the scikit-learn docs, you find they point to Dask. You try using Dask. You use it on your local machine. You're pretty happy. The single core to multicore on a single machine experience, that transition is really smooth today. A ton of people do that, and they're very happy. But then you wanna scale to many machines. So you say, hey, great, I've got this cloud account. You go to your IT department.
You get access to your AWS credentials. You then go to the Dask documentation. You don't know anything about the cloud, but the Dask docs are pretty good. They point you at Kubernetes. You spend an hour, learn how Kubernetes works well enough to stand something up on, you know, AKS or something, and then you launch a cluster with dask-kubernetes. If you're actually a sharp data science developer, you could probably do that today in about an hour, without really knowing much about how Kubernetes works, which is, you know, a testament to lots of open source projects. After that, you might get it running, and you say, actually, turns out the Docker image you're running remotely doesn't have scikit-learn installed. Bullcrap. Like, now I need to go build the Docker image, I need to push that up somewhere. You try it again, and, you know, the Python versions are different, because we're serializing data in some weird way. You fix that again. You push it up again. Magically, it works. It's been a couple of days, but you're working now. You're really happy. A month later, IT and network security say, hey, wait a minute. You left that machine up for a long time, you racked up around $10,000, and there's no security on that machine. And so, fortunately, no one found the address you were using, but you actually had credentials you copied onto the machine that were totally open to everyone in the world. And so you've been sort of locked down by IT now. So that's sort of a brief narrative of the kinds of things that a data scientist runs into, if they're actually really sharp and are able to jump past all those hurdles.
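The smooth single-machine step of that story looks roughly like the sketch below: scikit-learn's parallelism can be routed through Dask via joblib, and the change needed to go from your own cores to many machines is where the Client connects (the scheduler address in the comment is illustrative). Everything after that point, Docker images and credentials included, is the friction Matt describes.

```python
from dask.distributed import Client
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data for the sketch.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]},
    n_iter=5,
    cv=3,
    random_state=0,
)

# Local: Client() runs workers on your own cores. To go to many machines
# you would instead connect to a running scheduler, for example
# Client("tcp://scheduler-address:8786"), an illustrative address.
client = Client()

# Route scikit-learn's internal joblib parallelism through Dask.
with parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_)
```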
[00:24:07] Unknown:
But it's just a pain, right? And it's not something that most people do. I couldn't agree more. And I think 1 thing you've done wonderfully in there, Matt, is also help us reason about distributed data work and machine learning. I'll pull out 1 example; there are lots in there with respect to the processes that are necessary. I think it was maybe Travis Oliphant who I heard this from the first time: distinguishing scaling up, to multiple cores in a single workstation, as opposed to scaling out, to a variety of clusters or the cloud or on-prem or whatever you're doing, and thinking constructively and reasoning about these processes and what you need to do. Because maybe for the work you need to do, you only need to scale up, right, to multiple cores on your local workstation.
But then, what are the next steps? And that will be scaling out. So I think that framework's very instructive. And so in order to overcome some of these challenges
[00:24:58] Unknown:
and the complexities of building these distributed computation platforms, and managing all of the different security implications and the scaling implications and the operational aspects of running code in production, what are you working on building at Coiled to help alleviate some of those pain points?
[00:25:16] Unknown:
Essentially, the first iteration of the product is hosted and scalable Dask clusters, so that you can use Coiled to launch clusters on managed resources like the cloud with a single click. Certain things we handle really well are deploying containers, hooking up networking securely, and making it easy to connect, so you as a data scientist can get back to your real work and do what you do best. As we stated earlier, part of the value proposition is we don't want you to have to wait on DevOps, or have data scientists do all the DevOps themselves. So you can use clusters of machines, advanced libraries, and GPUs.
Coiled works anywhere, currently including cloud services like AWS SageMaker, open source solutions like JupyterHub, or even from the comfort of your own laptop. This includes, as we hinted at earlier, managed software environments with Conda or pip and Docker, user management, advanced telemetry, cost controls,
[00:26:13] Unknown:
these types of things, and everything that your IT team would be interested in as well with respect to auth and security. Yeah. Maybe I'll add to that, because this is a Python podcast and because people on the podcast actually know what we're doing, they can sort of appreciate this both from a user perspective and from an internal dev perspective. I'll tell you a little bit about how Coiled is engineered internally. Awesome. So Coiled combines, it's a Django web app, it's Postgres, it's Amazon ECS and Conda. Those are mostly the tools that we're throwing together, and Dask. And so Coiled is a Django app that you can authenticate into that will launch on-demand Dask clusters on Amazon ECS. ECS is like a container service. It's like Kubernetes, a bit simpler but more widely available. So you can log into Coiled, tell Coiled, hey, create a Dask cluster for me somewhere, and it'll do that. And that's step 1. Right? You then wanna tell that Dask cluster, hey, you need to have these libraries installed.
And so Coiled handles building software environments. Most data scientists don't like using Docker, so we solve Conda or pip environments for you. We hold on to those, we store those, we build Docker images for you, and we deploy them. We handle things like authentication with the cloud. So you may, on your laptop, have credentials to Amazon. We'll take those credentials, generate some secure tokens, and ship those up to the workers as well, so they can operate as you. All of that's done securely, so you can drive that cluster of machines on AWS from your laptop, or from some CircleCI scripts, or from some GUI application running Python. So Coiled is designed to solve the Dask deployment problem and the sort of peripheral problems just around that problem: software management, cost management, etcetera. Then also, because we're watching everything you do, because we're watching how big you're scaling that thing up, we're a good place to impose limits. Right? Maybe you manage a team, you're swiping a credit card so people can use that, and you can give people access to that up to a certain amount. You can gate their access. You can say, hey, this guy can or cannot use GPUs.
This person's really good, she can use, you know, $100 an hour of clusters if she wants to. And so it's all that sort of basic functionality around managing Dask. And, yeah, it's built with a bunch of actually really boring technologies that have been around for a decade, but it's really comfortable. As someone who uses Dask a ton, I use Dask in every single possible situation you can imagine. We built the way that I prefer using Dask, and I love it. It's really comfortable. It's really pleasant. It's really easy.
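For a sense of the workflow being described, here is an illustrative sketch only; the function and parameter names below are assumptions based on the conversation, not a documented Coiled API, so check the real documentation before relying on them.

```python
# Illustrative sketch: these names approximate the workflow described in
# the conversation and are NOT guaranteed to match Coiled's real API.
import coiled
from dask.distributed import Client

# 1. Declare a software environment; Coiled builds and stores the Docker
#    image for you (hypothetical signature).
coiled.create_software_environment(
    name="my-ml-env",
    conda=["python=3.8", "dask", "scikit-learn", "xgboost"],
)

# 2. Ask for a cluster on managed cloud resources (hypothetical signature).
cluster = coiled.Cluster(n_workers=10, software="my-ml-env")

# 3. From here it is ordinary Dask: the same Client you would use locally.
client = Client(cluster)
print(client.dashboard_link)
```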
[00:28:36] Unknown:
Yeah. That's Dask. That's Coiled. Yeah. There's definitely a lot to be said for boring technologies, because it's definitely interesting and exciting to be able to use bleeding edge new platforms or projects, but when you actually go to production with something, you want it to be boring, because you don't want it to wake you up in the middle of the night every day for the next 10 weeks. Yeah. I have actually never used Django before. I've been in Python for, you know, a decade plus,
[00:29:00] Unknown:
and I've always been on the data science side. It's actually really fun for me to interact more with the web side of Python. The people we have employed at Coiled, the engineers, people like Rami or Dan or Marcos, all understand this stack super well. And it's like, hey, how hard is it to add, you know, GitHub authentication? And, like, sure, it takes a day, because there's the Django Allauth package. This is probably really boring to most of your listeners, but for me, it's like I just discovered this new set of superpowers that we have. All the Django plugins, it's great. It's like finding, you know, visualization libraries on the data science side for the first time. Yeah. It's definitely
[00:29:35] Unknown:
interesting how the breadth of the Python ecosystem can lead you to be somebody who has worked in the overall language for, you know, a decade plus and then still have some area of the community that is completely new to you that has all of these exciting new shiny bells and whistles that everybody else has considered old hat for the past 5 years.
[00:29:56] Unknown:
It's also why building a product like this is super easy. Right? We happen to have this amazing distributed computing stack and this amazing web stack and connections to everything. The fact that everything is available in the same language makes, you know, cobbling these things together just really amazing. It's great.
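For anyone curious about the django-allauth anecdote a couple of exchanges back, wiring GitHub login into a Django project is mostly configuration. A sketch of the standard allauth setup, with project-specific pieces omitted:

```python
# settings.py -- a sketch of the usual django-allauth setup for GitHub login.
INSTALLED_APPS = [
    "django.contrib.sites",
    "allauth",
    "allauth.account",
    "allauth.socialaccount",
    "allauth.socialaccount.providers.github",
    # ... your own apps ...
]

AUTHENTICATION_BACKENDS = [
    "django.contrib.auth.backends.ModelBackend",
    "allauth.account.auth_backends.AuthenticationBackend",
]

SITE_ID = 1

# urls.py -- route allauth's login/signup/callback views.
# from django.urls import include, path
# urlpatterns = [path("accounts/", include("allauth.urls")), ...]
```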
[00:30:16] Unknown:
Continuing on the thread of Python, the main tagline that you have on your website right now is focused on the fact that you're scaling Python, which is in some sense, for somebody who's using Python, oh, this is great. But for somebody who is concerned primarily with just, I have this data problem, it might not necessarily be a trigger for them to say, oh, this is what I need. So I'm curious what your thoughts are on the pros and cons of orienting your messaging on that aspect of the scalability of Python as opposed to focusing on a particular problem domain or industry?
[00:30:50] Unknown:
Yeah. Super interesting question. First, actually, scaling Python right now is hitting pretty well in the market. A lot of companies are building up their Python stack or they're scaling their Python stack, and they're looking for solutions. So it's actually a pretty good time to try to sell that messaging. But more generally, I think you may have a good point that it's actually really tricky to sell something that's so broad. Right? If we go to the other extreme, for example, if we were selling a product that identified tumors in images to save lives in hospitals, or, you know, a biofuel that increased fuel efficiency by 20%, the value of those propositions is much clearer. It's much more easy and direct to sell. It's like, yep, I'm gonna save 20%, makes sense to buy, here's 10%.
Transaction done. Scaling Python is a lot more broad, a lot more indirect. Right? You have to get into a conversation with a customer about, okay, well, why do you care about data science? Why is that useful to you? Do you care about reducing cloud costs? How is it that making your people more efficient will actually save you money in the future? And so it's a much longer conversation to get to that point where they say, yes, this makes sense, we're going to invest a large chunk of money to purchase your product.
But on the other hand, it's super broad, and that's really, I think, where open source companies really shine. We can sell to, you know, to NASA, who who wants to care about satellite imagery. And then we can turn around and sell to, you know, Pfizer or some pharma company who cares about biomedical imaging. And, presently, biomedical imaging and satellite imagery are actually kind of the same problem. So it's,
[00:32:29] Unknown:
it's tricky. You have to really start to understand your customers and get a lot more empathy with people, but there's a ton of opportunity, which is fun. It's just a very big puzzle to solve. I agree with all of that, and I take a slightly different perspective. I think it's a great question as well, and this is something that is not boring and does wake me up at night, for clarity. We need to think about what our job is. We've got lots of jobs at Coiled currently, but is our job to sell the product as much as possible, or, at this early stage, to build the best product possible? And I think it's definitely the latter. And to that end, we don't necessarily want to be positioning ourselves as solving industry specific problems. We wanna get data science teams on board to use our product and to form a rapid iterative feedback loop into product development. Right? So it's essentially, in that sense, a bottom up motion and a grassroots movement of data scientists built into the product development cycle, solving their pain points. And, of course, as you state, if later on it becomes industry specific, we may even move to a top down sales motion. I love what, you know, Databricks and these types of companies do. I'm not saying that we will do that, but these are kind of the wide range of questions
[00:33:45] Unknown:
we're thinking about. It's interesting to be on a programming podcast and talking about sales motions and, you know, sales strategies. But it's actually kind of interesting from this perspective, because it's 1 of those cases where I think the profitable thing and the community thing happen to be aligned. We're gonna make most of our money selling to big companies. Right? You know, Ford, NASA, whatever, are gonna buy something for millions of dollars. But those can take a long time. In the meantime, we have a bunch of users who we can learn a lot from and who we can iterate with very quickly. And so we're both selling these big things that take years to close with big companies, but in the meantime we're making a product that's really designed for individuals to use, because they bring in a lot of information about how people want to scale their Python code. And so for the next year or so, we're really just focusing on making the sort of public access product as ergonomic and as friendly as possible. And that happens to align very well with what I care about altruistically: increasing access, increasing our society's ability to reason about the world through data. And so it's, again, 1 of those things where I think making money and helping the world end up being kind of the same actions. And it's also aligned with increasing
[00:34:57] Unknown:
adoption of the tools that we believe do good in the world as well, and that we build to do good in the world. So if we can build a product that grows the Dask and PyData pie as large as possible for people who find it valuable, I think that's amazing. And, of course, there are a lot of nuances here to make sure that these incentives are aligned, but that's the goal. And I know Matt is, and I'm definitely, excited to be part of this movement of garnering even more OSS institutional adoption through commercial products as well.
[00:35:33] Unknown:
This portion of Podcast.__init__ is brought to you by Datadog. Do you have an app in production that is slower than you like? Is its performance all over the place, sometimes fast and sometimes slow? Do you know why? With Datadog, you will. You can troubleshoot your app's performance with Datadog's end-to-end tracing and, in 1 click, correlate those Python traces with related logs and metrics. Use their detailed flame graphs to identify bottlenecks and latency in that app of yours. Start tracking the performance of your apps with a free trial at pythonpodcast.com/datadog. If you sign up for a trial and install the agent, Datadog will send you a free t-shirt to keep you comfortable while you keep track of your apps.
And continuing on that topic of the dichotomy of open source and commercial product, I'm wondering what you have seen so far as the challenges in being able to manage those competing tensions between the open source work that you do, working as maintainers and core contributors of the Dask project, as well as this proprietary stack that you're building around it and working on selling to companies to help drive the overall business forward, as well as trying to help support the open source ecosystem around it? I have tons of thoughts on that topic. That's like a whole podcast.
[00:36:49] Unknown:
Before I go off into things, Hugo, do you have thoughts? I do, but I will defer to you, because I think you're actually in a very different position to me with your role, in terms of being a core contributor and a maintainer of a package as well as building a product around that package. Okay. Having said that, my position, it's very, I mean, I'm head of data science evangelism on 1 side and marketing on another. So I joke that I sometimes feel like Two-Face from Batman. I'm a good guy, though, in terms of I have an enterprise face with respect to marketing, but an evangelism face with respect to community development. And I feel like the role of Coiled, and the role of myself in these positions, is to form connective tissue between both that creates value that's more than the sum of its parts. So I know this is very abstract, but ideally, the work the open source community puts in, they get more out of. And similarly with the enterprise.
And I think the direction from OSS to the enterprise is what we've been talking about, but the other direction is incredibly important as well. And I hinted at that with respect to garnering further adoption of open source, but also thinking about viable models for funding open source and for employing open source developers. And, I mean, we're both very excited to be part of a fascinating lineage here. You know, there have been companies for several decades now that have learned lessons here, that we're in close connection with, and that Matt has worked for, for example. So I think that's my take, but I'm even more interested in Matt's take, due to his unique position. Yeah. I like your Two-Face example. I'll say maybe many hats is the term that I would use.
[00:38:33] Unknown:
It doesn't turn you into a supervillain, at least, I think. I'm sure there's supervillains somewhere with lots of hats. They're not better. I like that. How about that? But, yeah, I mean, we've actually been kind of threading this needle with Dask since its inception. Right? Dask came out of a for-profit company. It came out of Anaconda. Anaconda was amazing in its support for Dask early on, and it hired me and lots of other people to work on Dask over time. And we had to sort of be very aware of when we were being Anacondans and when we were being open source Python people. I then moved to NVIDIA, and I had the same challenges with GPUs. And so, yeah, we've always had those sorts of challenges. I think it's now maybe up another level. Like, I now get direct compensation based on how well certain things do with Dask, so we need to be careful. But it's something that we're used to doing. For example, as a very small anecdote, Hugo asked me, I think, just yesterday, hey, we're doing all these community blog posts about people who use Dask, can we get the Dask Twitter account to retweet the Coiled content? And I said, yeah, sure, that sounds great, but I can't do that. Right? That would be a conflict of interest.
Go find someone else who also manages the Dask Twitter account and have them retweet that for you. You've gotta work that out with them independently of me. So, you know, it's something we always deal with. Go ahead. I was just gonna say, with respect to that example, I was fascinated that the Dask developer community had already
[00:39:54] Unknown:
thought about this, and you linked me to a GitHub issue in which it had been discussed, and there was already a path somewhat laid out for this. And I think what that flags for me is that, as you mentioned earlier, a lot of these concerns have almost been preempted in some sense, and it's about figuring out a seamless way to get these flows happening. But the open source developer community is an incredibly intelligent, thoughtful, empathetic bunch of people, for the most part.
[00:40:24] Unknown:
That's not entirely true, always. I agree. And I'm just thinking about Stack Overflow responses now, actually. What I would say, maybe in Dask's favor, is we have a lot of practice at this. And Dask is actually really well positioned to handle this sort of challenge, because we already have a bunch of companies that are involved in Dask. Right? We have weekly maintainer calls every Tuesday morning; there's gotta be people at 6 or 7 different companies who attend those calls. With a lot of open source companies, they're the company first. They release some open source software, and they maintain that software as part of their company. If the company fails, the software goes under, and probably everyone who has commit rights to that software works for that company. Super common model. That's definitely not the Dask model. Right? Dask operates much more like Pandas or NumPy or Jupyter.
It's very much a multi institution endeavor. It's not attached to 1 particular company. And that gives Dask a lot of its strength, a lot of its perspective, and a lot of its resilience.
[00:41:22] Unknown:
Another interesting element of the fact that you are starting Coiled at this particular point in time is that we're in the middle of a global pandemic, and I'm sure that has added some additional complicating factors to what is already a stressful time. So I'm wondering if you can maybe talk a bit about some of the ways that that has manifested in your current journey of launching this company and building a business at this particular point in the world?
[00:41:46] Unknown:
Yeah. It was great timing in various ways. I think the company was incorporated, I wanna say, February 11th. So, you know, a few weeks before the world ended. Mostly, it's actually been fine. Some of our intended anchor clients stepped back and said, wait, we're gonna wait until things settle down a little bit. They're not coming back. So there's definitely early clients that left, but there's actually a lot of churn right now in lots of ways, which is really fascinating. Hiring is super interesting right now. Right? A lot of people came on the market, and they were then scooped up very quickly. There's a lot of sort of activity.
We're a remote first company. Dask has always been remote first, so Coiled is too, which has been great in some ways, although stressed in some other ways. I'll let Hugo maybe talk about that for a moment.
[00:42:40] Unknown:
Yeah. I do think that it's interesting that we were already intending to build a 100% remote company, or as I like to joke, because we're in the distributed computing space, a 100% distributed company. All I'm saying there, essentially, is that a lot of the challenges we're facing due to that aspect of everything that's happening at the moment, we were preparing to face anyway. I do think 1 of the biggest stresses is the fact that I'm currently in Australia. For context, I'm from Australia, I've lived in the US for 6 years, and I was planning to still work in the US, but I'm currently waiting for US consulates to reopen for visa processing for me to get back on American soil. So the biggest stress currently is trying to build a company from seed stage on opposite sides of the world. There are a few wins that can happen with respect to these obscene time zone differences.
I think the biggest 1 being that we can hand off stuff to each other at the end of my day, and vice versa, and then when I get up, Matt or whoever else has worked a lot on it, and I can move on that. But in terms of syncing on a variety of things and being in the same headspace at the same time of day, there are a lot of kind of daily challenges that arise because of that, but it's something we're thinking about and working on actively, and we're excited to find solutions as well.
[00:44:10] Unknown:
Yeah. I'd say, as all of us have learned, I think COVID kind of killed working hours. It killed the workday. It's a mishmash today. But there's also lots of great opportunities. Companies are kind of in a state of reinvention right now, and, you know, they're looking for new things. I would say sales interest has not dropped off significantly.
[00:44:31] Unknown:
And on top of the challenges of building a business and doing it during pandemic times, you're also building it on top of open source, which we've touched on a little bit as far as some of the challenges there. But what are some of the most interesting or unexpected or challenging lessons that you've learned in the process of building and launching a company that is oriented around an open source platform?
[00:44:51] Unknown:
Yeah. I would say the biggest, not really a surprise, but maybe a reinforced preexisting bias, was that nobody has any idea how to do this well yet. There are a ton of models of successful companies. You know, there's the Red Hat model. There's Mongo and Confluent. There's Databricks, something totally different. There's Anaconda in the Python space. They're all very different paths, but none of them are worn very well. It's all a challenge. For example, we're currently working through contract legal with, like, 5 giant companies, each of whom have their own legal teams. And we have our amazing part-time legal rep that we engage.
But none of those companies has any idea how to deal with open source software in any sort of reasonable manner. IP issues, it's all a pain. So step by step, it feels very much like an uphill battle through relatively rough terrain. In the grand scheme of things, though, it's also great. There's an incredible amount of support. It's surprising how many people know us, know the Dask project, and genuinely want to see us succeed. It's very heartwarming to see the support; I think you get that being an open source company. Yeah. I think Matt hit several nails on their respective heads
[00:46:01] Unknown:
a bunch of times there. I'll just add that there are a lot of people who we really enjoy working with who are building similar companies at the moment, and we're really excited to be part of a community of people figuring this out. A handful come immediately to mind from friends and family: Travis Oliphant and all the wonderful people he works with at Quansight, Wes McKinney at Ursa Labs, and Peter Wang and the entire team at Anaconda as well. So there are a lot of challenges to figure out, but there are a lot of wonderful,
[00:46:36] Unknown:
incredible people who we're excited to work closely with on all of these things. And as you build out more of the Coiled business and continue on the path of bringing Dask to more users, what are some of the things that you have planned for the future of both the Coiled company and the Dask project?
[00:46:54] Unknown:
On the Dask side, I would say just keep it expanding. There's a ton of new users, there's a ton of use cases, and we're just going to grow that into new scientific domains and new industry verticals,
[00:47:04] Unknown:
and that's the focus. I think that's where we're seeing the most activity, and it's super exciting right now. On the Coiled side, we are just super excited to be building a product that is dead simple to use for data scientists and anybody who needs to do data science and/or machine learning at scale. So as a call to action, if you're doing data science and/or machine learning at scale and you love to break things, we'd really like for you to take Coiled for a test drive by signing up for our beta. You can do so at bit.ly/coiled-beta. We're looking for feedback and conversations and all of these things to build as good a product as we can.
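For listeners who want a concrete picture of what running a Dask workload on a hosted cluster can look like, here is a minimal sketch using the public coiled and dask APIs. The worker count, S3 path, and column names are illustrative placeholders and are not the specific beta interface discussed in the episode.

```python
# Minimal sketch, assuming the `coiled`, `dask`, and `s3fs` packages are
# installed and a Coiled account is already configured on this machine.
import coiled
import dask.dataframe as dd
from dask.distributed import Client

# Launch a managed Dask cluster in the cloud; 10 workers is an arbitrary
# example value, not a recommendation.
cluster = coiled.Cluster(n_workers=10)
client = Client(cluster)  # point the local Dask client at the remote cluster

# Run an ordinary Dask computation against the remote workers.
# The bucket, path, and column names below are hypothetical placeholders.
df = dd.read_parquet("s3://example-bucket/events/*.parquet")
print(df.groupby("user_id")["value"].mean().compute())

client.close()
cluster.close()
```

The point of the sketch is that the scaling step is only a couple of lines; the analysis code itself stays plain Dask.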
[00:47:47] Unknown:
Are there any other aspects of the work that you're doing on Dask or at Coiled, or the overall space of machine learning and data science and helping them scale, that we didn't discuss yet that you'd like to cover before we close out the show? There are hundreds of such topics, so I'll just hold those in my pocket for next time, Tobias. Alright. Well, you're welcome back anytime you like. For anybody who does want to follow along with either of you and keep in touch, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. This week, I'm going to choose the book The Hobbit. I've read it a few times and just recently started revisiting it in audiobook form with my kids, so I definitely recommend giving it another look, or a first look if you've never read it before. It's just a great story, a lot of good fun, and a good adventure.
So, with that, I'll pass it to you, Matt. Do you have any picks this week? Sure. One of the Dask maintainers,
[00:48:39] Unknown:
Jim Crist, just moved to the Prefect company, so maybe in honor of him, let's say Prefect. I think you might've done an episode with the Prefect founder on your other podcast, the Data Engineering Podcast. Yes, I did. I'll add a link to that. If you like Airflow, and Airflow is a great tool, you might find that Prefect is an even greater tool. It also happens to be Dask powered, so it's a little,
[00:49:02] Unknown:
little self serving. And Hugo, do you have any picks this week? Yeah. I'm currently rereading a book that I read late last year that impacted me a great deal. It's called Race After Technology by Ruha Benjamin, who has since been a constant source of inspiration to me. Ruha is an associate professor of African American Studies at Princeton University, and Race After Technology delves into the coevolution of race and technology: not only how they impact one another, but how coupled they are and how they co-evolve. What Ruha does in this book is provide an in-depth critique and analysis of a lot of the at-scale algorithms that are harming already disenfranchised people, in particular the racist structures that are emerging. But she also gives us a serious language to start describing the things we're seeing. One term in particular is what she refers to as the New Jim Code. I'm going to read just a small part of what she describes as the New Jim Code, which is the employment of new technologies that reflect and reproduce existing inequities but that are promoted and perceived as more objective or progressive than the discriminatory systems of a previous era.
So Ruha does an incredible job of allowing us to perceive what's happening by building a language and structure around it. She also gave a fascinating keynote recently at ICLR, which I encourage everyone to check out. When we talk about deep learning and think about how powerful it is, she makes clear, for example, that computational depth in deep learning without sociological depth is actually superficial learning. So I encourage everyone to check out the book Race After Technology, and her talk,
[00:50:56] Unknown:
her keynote, which I'll link to in the show notes. Alright. Well, thank you both for taking the time to join me today and discuss the work that you've been doing with Dask and building Coiled as a business around it. Dask is a project that I've been keeping a close eye on for a number of years now, and I appreciate all the time and energy that you each put into the open source work and the business that you're building on top of it. I look forward to seeing where it takes you, and I hope you enjoy the rest of your day. Yeah, you too, Tobias. Thanks a lot. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Matthew Rocklin's Background
Hugo Bowne-Anderson's Background
Hugo and Matthew's Collaboration
Hugo's Experience as a Podcast Host
Matthew's Introduction to Python
Hugo's Introduction to Python
Overview of Dask
Dask's Technical and Social Goals
Evolution of Dask
Building a Business Around Dask
Distributed Computation and Ecosystem Changes
Challenges in Using Dask in Production
Coiled's Solutions for Dask Deployment
Marketing and Messaging for Coiled
Balancing Open Source and Commercial Interests
Impact of the COVID-19 Pandemic
Lessons Learned in Building an Open Source Company
Future Plans for Coiled and Dask
Closing Remarks