Summary
A perennial problem of doing data science is that it works great on your laptop, until it doesn’t. Another problem is being able to recreate your environment so you can collaborate on a problem with colleagues. Saturn Cloud aims to help with both of those problems by providing an easy-to-use platform for creating reproducible environments that you can use to build data science workflows and scale them easily with a managed Dask service. In this episode Julia Signell, head of open source at Saturn Cloud, explains how she is working with the product team and the PyData community to reduce the points of friction that data scientists encounter as they are getting their work done.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Julia Signell about building distributed processing workflows in Python through the power of Dask
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what you are building at Saturn Cloud?
- Who are your target users and how does that inform the features and priorities that you build into your platform?
- What are the roadblocks that data scientists typically encounter when working on their laptop/workstation?
- How does open source factor into the Saturn product?
- What are some of the projects that you are collaborating with/contributing to as part of your work at Saturn?
- How has your experience at Anaconda informed your work at Saturn?
- Can you describe how the Saturn Cloud platform is architected?
- How has it changed or evolved since it was first launched?
- Can you describe the learning curve that data scientists go through when adopting Dask?
- What are some examples of projects or workflows that Dask enables which are not possible/practical to do locally?
- How would you characterize the overall awareness/adoption of Dask in the Python data science community?
- What are the most interesting, innovative, or unexpected ways that you have seen Saturn Cloud used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Saturn Cloud?
- When is Saturn Cloud the wrong choice?
- What do you have planned for the future of Saturn Cloud?
Keep In Touch
Picks
- Tobias
- Julia
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Saturn Cloud
- Dask
- Pangeo
- XArray
- Conda
- Mamba
- Holoviz
- Dash
- Anaconda
- Kubernetes
- Tornado
- Prefect
- Dagster
- Airflow
- Ray
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Julia Signell about building distributed processing workflows in Python through the power of Dask and her work as the head of open source for Saturn Cloud. So, Julia, can you start by introducing yourself? Hi. I'm Julia.
[00:01:12] Unknown:
So I'm the head of open source at Saturn Cloud, as you mentioned. So what that means is I spend half my time working on, like, the engineering side of Saturn Cloud, building the platform that we have as a company, and then I spend the other half of my time working on open source projects. So I'm a maintainer on Dask, and then I also work on several other open source projects in a more minor role. And do you remember how you first got introduced to Python? Yeah. I was trying to think. I mean, I took, like, intro to comp sci in college, and that's probably when I first started using Python. But when I really started using it properly was when I got out of college and I was working in hydrology labs, and I was doing data management and visualization.
And the PyData ecosystem was just, like, huge for that. It was so helpful. Xarray, in particular, was a really helpful library that I ended up really relying on. And so, like, that's how I really started getting into Python in the context of visualization
[00:02:11] Unknown:
and management of data in those labs. As you mentioned, now you're working with Saturn Cloud, and I know that the overall intent of that platform is to simplify some of the work of doing data science in Python. I'm wondering if you can just give a bit of an overview about what it is that you're building there and some of the overarching goals for the platform and who your target users are? Yeah. So Saturn's a platform that runs on AWS,
[00:02:37] Unknown:
and it provides easy access to Dask and Jupyter, both on CPU and GPU. So the people who we're targeting are really people who deal with data. So engineers, analysts, scientists, whoever that may be, and people who are comfortable in Python. So people who are looking to use Python to deal with large datasets. So the platform is purposely not trying to be anything too fancy. We're just really trying to use the most used, the most loved tools to create a space where it's just really easy to do your work. So we use Conda and Mamba for environment management.
We just use regular Jupyter. We don't have any special variation or flavor of Jupyter. Regular Dask. You can deploy Flask apps, or Voila, Dash, or Panel dashboards. So it's meant to feel really familiar and easy to work with for these people who are already familiar with this whole Python data science world.
[00:03:38] Unknown:
And in terms of the sort of building blocks for the platform, you mentioned that you're leaning heavily into sort of the core elements, and I'm wondering what the sort of overall approach is to being able to onboard new tools or new workflows to the Saturn Cloud system. If somebody comes in and they've got some esoteric Python dependency that maybe isn't part of the out of the box Anaconda distribution, or somebody who's maybe trying to do something that is sort of not the most common approach and workflow, to be able to maybe do some sort of deep learning training or some complex aspect of their overall workflow, sort of how does that factor into what you're building at Saturn and sort of the product direction that you think about? Yeah. So
[00:04:27] Unknown:
Saturn is super flexible. Like, you just get a Jupyter notebook. You can fully customize your image. Right? So you can use our UI to put in a conda environment.yaml or a pip requirements.txt. But you can also just build your own image and point us to that, and then you can use that as a basis for your Jupyter. So you have total flexibility in terms of what's in your environment, and then we don't restrict at all, like, what types of things you can deploy. I just rattled off a list of, like, you know, different dashboarding solutions. But there's nothing specific about those that's built into Saturn. It's just we have the notion of deployments.
And to support that, we expose a port. So, basically, like, if you know the right arguments for whatever framework you're using, then you can deploy it in Saturn. And similarly, if you're trying to use a specific package, you can either create an image that has it installed or you can install it in a start script that runs right at runtime.
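As a rough sketch of what that deployment model looks like in practice: any web app that binds to the exposed port will do. The framework choice and the port number below are illustrative assumptions, not Saturn specifics.

```python
# Minimal Flask app of the kind a deployment could point at.
# The port (8000) is an assumed value; use whatever port the
# platform actually exposes for deployments.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "Hello from a deployed app"

if __name__ == "__main__":
    # Bind to all interfaces so the platform's proxy can reach it.
    app.run(host="0.0.0.0", port=8000)
```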
[00:05:25] Unknown:
And then given your focus on data scientists and the work that they're trying to do and trying to simplify some of their overall work and the deployment story there, I'm wondering if you can talk through some of the common roadblocks that data scientists and data professionals run into as they're trying to iterate on their experiments, and I'm sure most of them are running it locally on their laptop or workstation due to some of the complexities that arise from trying to either build out a complex machine learning model or deal with larger volumes of data and just some of the difficulties that they encounter in that process?
[00:06:00] Unknown:
Yeah. So there's a couple things that people run into that are specifically solved by distributed computing and, in this case, Dask. And that's running out of memory and just the time of computations. So just things taking a long time. And that's something that any distributed library will solve, right, because it's able to access a large number of nodes and spread the computation out. And that solves your memory issues and hopefully makes things go faster. So those are two of the big issues that people come to us for. The others are reproducibility.
So this environment management that I was just talking about, of bring your image or we'll build you an image, that's something that people really struggle with locally, especially when they're going from running things locally to sharing them with a small group of people, maybe, like, coworkers or something like that. It can be challenging for people to create those reproducible environments. The other big things are we give access to GPUs, which a lot of people don't have locally, and that speeds things up. And then the last thing is colocation of data. So if your data's on S3, we run on AWS, and that is often a better idea than running things locally and pulling the data from S3. So you can do your computation sitting right next to your data, and you don't have to do all that transfer and egress.
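A minimal sketch of that colocation point: with Dask you can read straight from S3 so the computation stays in the same region as the data. The bucket path here is hypothetical, and the s3fs package is assumed to be installed.

```python
import dask.dataframe as dd

# Lazily point at a set of CSVs on S3; nothing is downloaded yet.
df = dd.read_csv("s3://my-bucket/logs/2021-*.csv")  # hypothetical path

# Building the aggregation only constructs a task graph; .compute()
# runs it on the workers, right next to the data when they share a region.
daily_counts = df.groupby("date").size().compute()
```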
[00:07:22] Unknown:
And then as far as the scalability and parallelization aspects, I'm wondering what are some of the potential issues that people might run into as they're going from: I have this running on my local machine, everything works great, but now I'm starting to run into memory issues because the volume of data I'm processing is growing. And now I want to run this across a Dask cluster where, in aggregate, the overall memory is larger, but now I need to figure out how to chunk this up appropriately. And how much of that is something that they need to factor into their overall code and approach, and how much of it is the sort of magic of Dask that figures out how to distribute things for them? So the goal is that the magic of Dask takes care of it. Right?
[00:08:02] Unknown:
The goal is that you don't have to think about it much at all. And take the example of data frames. Right? If you're using pandas, the goal is that you can just replace import pandas with import dask.dataframe. That's true in some simple cases, but oftentimes, people find that they need to be a little more, so let me take a step back. So Dask DataFrame mimics the pandas API. It tries to recreate most of the surface area of the pandas API, but instead of doing eager computations like pandas, where you get the result immediately back as soon as you call the method, you get a task graph that can then be optimized and computed later on, with the whole computation distributed across a cluster.
So that's a different thing. Right? Like, the thing that Dask is doing is different than what pandas is doing. And so sometimes that becomes a problem for people, or it's something that people have to deal with. So like you mentioned, people might have to think about how to chunk up their data, or think about when they want to set the index of their data, or when they want to, like, maybe trigger a partial computation. As you get more into using Dask, there are best practices. The Dask docs have really good documentation about the best practices for the different APIs. I've been talking about the data frame API, but Dask also has a NumPy-like API, Dask Array, and there's several others as well. So there's sort of best practices for each of those. But by default, Dask will try to do, like, a best guess about what your chunks should be or, you know, what your partition structure is based on whatever file type you're reading from. It's a long winded answer, but ideally, Dask handles the magic. Sometimes you need to think a little bit about what's actually happening and dive in a little more deeply as you get more involved or if you're trying to do slightly more custom operations.
But as Dask is developed, it's getting closer and closer to doing the best thing by itself. Like, every time someone submits a little patch or something that makes things better for their workflow, we try to think if there's a way to, like, generalize that to make it better for everyone. And I think there have been some big improvements recently in memory management, and the goal is to make it more and more so that the magic just works.
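A minimal sketch of the swap described above, with hypothetical file paths: in simple cases only the import and the final .compute() call change, and every intermediate step stays lazy.

```python
import dask.dataframe as dd  # instead of: import pandas as pd

df = dd.read_csv("data/*.csv")  # lazy: returns a task graph, not data

# The same method calls as pandas, still lazy:
result = df[df.value > 0].groupby("category").value.mean()

# Only now is the graph optimized and executed, locally or on a cluster:
print(result.compute())
```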
[00:10:19] Unknown:
We've been discussing things like Dask and pandas and Jupyter, and all of those are very well established open source projects in the overall Python ecosystem. And I'm wondering if you can talk a bit about some of the ways that open source factors into the overall architecture and strategy of the Saturn Cloud platform, and some of the ways that you're engaging with the broader community to identify what are the useful components to bring in, what are some of the ways that you can collaborate with the overall community, and sort of how you can grow the ecosystem?
[00:10:51] Unknown:
The Saturn Cloud product is built around the idea that people already know and love these open source libraries, and we take that very seriously. So we want to make sure that we are giving back. So the way that we do that right now is that I spend half my time on Dask maintenance, and that's not tied to a specific Saturn Cloud agenda or anything. I just do bug triaging and issue handling. A lot of what I focus on is maintaining compatibility between the pandas API and the Dask DataFrame API, and just sort of, like, the low level grunt work of maintaining an older library. Or, you know, Dask isn't that old, but in this world, that's how we see our contribution to open source: we'll contribute back to the packages that are the most essential ones our users use. So I contribute to Dask, and a little bit of minor stuff, maybe, to pandas or NumPy.
And then, hopefully, we'll be able to expand that to maybe contributing to Jupyter or some of the other projects that are so essential to our users.
[00:12:00] Unknown:
When I was preparing for the interview, I noticed that you were previously working for Anaconda, which is also well known for being a very large contributor to the overall ecosystem of Python data tools. And I'm wondering if there are any particular lessons that you learned there that you've been able to take with you to your work at Saturn Cloud or some of the ways that your experience at Anaconda has influenced your overall thinking about open source as a community and some of the ways that you can be a good steward of that community as a corporate entity?
[00:12:30] Unknown:
So I worked at Anaconda in two very separate jobs. So I worked on Anaconda Enterprise, which is probably something that maybe a lot of your listeners haven't heard of, but it's a similar thing to Saturn Cloud in that it's a data science platform. I subsequently worked on HoloViz, which is a suite of, like, high level visualization libraries that build on top of other renderers and make it really easy to go from data to visualizations. So on a personal level, during that time, I learned that I like both of those things very much. And, like, they're two pretty separate worlds. Right? Like, building an open source tool and building a platform, an application in Python.
So that was a big learning moment for me, to learn that I like to build those things. But I definitely, in my role on the HoloViz team, definitely learned about what it means to make promises to users about the API that you are providing on these open source tools, and how to engage with the community, like you said, how to make decisions about what to include and what not to include, how to, you know, how to review PRs. Like, there's all this stuff that goes along with open source maintenance
[00:13:40] Unknown:
that takes a while to get familiar with. And then in terms of the actual architecture of Saturn Cloud, I'm wondering if you can talk through some of the ways that it is designed and deployed and managed and some of the ways that you are working to integrate the different pieces of the stack to be able to create a smooth experience for the end user?
[00:14:02] Unknown:
So I can't speak as much about how it's deployed and managed because I purposefully don't know about that part as much. But, essentially, it's built on Kubernetes, and it's a Tornado app. That's, like, the bare bones of it, and then it's deployed in AWS. We have two versions, actually. We have a version that people can purchase on the marketplace and deploy into their own AWS. So that's something that's good for companies. And then we have a hosted version that we deploy, where you can have a free account, where you get a certain limited number of compute hours and things, or you can be billed, you know, in a regular way. So that's the core of the functionality.
And from the user perspective, we're just trying to make it, like, as easy as possible to spin things up and to customize their environment to have what they need in it. So we're trying to take all the DevOps burden onto ourselves. And by ourselves, I mean, other people on my team, not my own self. So the goal is to really streamline that entirely out from the user's viewpoint, particularly on the hosted version that we manage. But even on the marketplace version, I think it's like you provide an IAM role and we go from there.
[00:15:20] Unknown:
And then in terms of the actual application itself, you mentioned that it is built on top of the Tornado framework. I'm wondering if you can talk through some of the aspects of building the application and the environment for being able to tie together things like Jupyter and Dask, and be able to manage some of the interaction there, and being able to pass the workflows from the user building their experiment or trying to build their model, and being able to handle how they sort of distribute the data, you know, upload the data or generate the data within the Saturn ecosystem?
[00:15:54] Unknown:
Yeah. So it's a Tornado app with a Vue front end. And the user data, in my mind, we don't really interact with the user's code or anything. But the way the Dask part works, which is the part that I'm most familiar with, is we use Dask Kubernetes, which is an open source Dask project. And we send requests, basically. We have a special client, a special little library called dask-saturn, that creates these Saturn cluster objects. It's a thin wrapper around the Dask cluster object that's available in Dask Distributed. And we just send requests to create a cluster using a little microservice.
And then we just rely on regular Dask protocols to move the data around. The scheduler is exposed at a particular endpoint, and then there's a proxy to, like, make that all work. And Jupyter works in a similar way. There's just, like, a proxy too, I think. This is maybe not the most specific answer, but it's not a very heavy system.
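A sketch of that client-side flow, assuming the dask-saturn package described above; the exact constructor arguments are an assumption and may differ between versions.

```python
from dask.distributed import Client
from dask_saturn import SaturnCluster  # thin wrapper over Dask's cluster object

# Creating the cluster object sends a request to Saturn's microservice,
# which spins up the scheduler and workers on Kubernetes.
cluster = SaturnCluster(n_workers=3)  # n_workers is an assumed argument
client = Client(cluster)

# From here on it's plain Dask: data moves over the usual Dask protocols.
print(client.dashboard_link)
```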
[00:17:02] Unknown:
In terms of Dask itself, as we were discussing earlier, there is some measure of sort of built in intelligence as to how to chunk up data frames or chunk up processing to be able to handle the distribution aspect. But there might be a situation where somebody has written their code in such a way that it is not parallelizable until they either adopt some of the APIs of pandas or Dask itself, or sort of restructure the way that their computation is being executed. And I'm wondering if there are any sort of learning curves that you've seen people go through of going from: here's a simple workflow, I have a Jupyter notebook and everything works on my machine, and now I want to be able to scale it up to handle larger data volumes or more complex workflows, and just being able to figure out how to best leverage the capabilities of Dask while still being able to, you know, run it locally for quick experimentation or debugging purposes?
[00:18:01] Unknown:
You can run Dask locally as well. So that's really the best way to get these workflows running, or it's a good way at least: just don't take it back out of Dask. Once you've started down the Dask path, you can just stay in that world. And you can use a local cluster, or you can just not specify a cluster at all, and that'll all just work. So Dask DataFrame is one of the higher level Dask APIs, but there's also lower level ones. So there's Dask Delayed, which allows you to wrap any object or function and make it lazy.
So, basically, if anyone has a workflow that has a for loop or something, that could benefit from Dask Delayed. So there's really simple interventions that you can take that can improve parallelization. If you don't already have anything that's parallelizable, like, yeah, that might be some work. But oftentimes, maybe there's ways you could read in partial pieces of your data, or you can, you know, find ways to partition up your workflow in a way that does allow it to be parallelizable. So, yeah, there's cases where you might just, like, need to do some work, but I don't know if there's, like, common patterns that every person who goes down that path will run into.
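A minimal sketch of that for-loop-to-Dask-Delayed intervention, run entirely locally; the process function and its inputs are stand-ins.

```python
from dask import delayed
from dask.distributed import Client, LocalCluster

def process(x):
    return x * 2  # stand-in for real per-item work

# A local cluster behaves the same as a remote one, so code can move
# between laptop and cluster unchanged.
cluster = LocalCluster()
client = Client(cluster)

# Wrapping each call builds a task graph; nothing runs yet.
tasks = [delayed(process)(i) for i in range(100)]
total = delayed(sum)(tasks)

# Execute the whole graph in parallel across the local workers.
print(total.compute())
```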
[00:19:24] Unknown:
The other motivation for somebody adopting Saturn is the reproducibility angle of: I have this execution environment running on my machine, everything works fine, but now when I try to send this code to my coworker, they have to go through all the same steps to get it set up. And I'm wondering if you can talk through some of the ways that Saturn is able to help with some of the collaboration aspects, and some of the other motivations for reproducibility in a given data workflow.
[00:19:49] Unknown:
A lot of them are pretty simple. Like, Saturn gives you the ability to share your Jupyter or your deployment or whatever with anyone else who's on Saturn. So what that means is that they can then create a clone that has the same image. So it has the same environment as yours, it has the same files in it, and so then they can see and reproduce and just carry on with your work. You can also attach Git repositories to Jupyter. So if you're working from the same Git repository, you can connect that up and send them a link to that Jupyter that you've shared with them, and your collaborator can create their version of it and, you know, set up their credentials and push and pull and do whatever.
So in that regard, it makes it easy. The most important thing about reproducibility is the environment. Yeah, you can email someone your notebook, but even if you have an environment.yaml, every time that resolves, it's gonna be slightly different unless you've got it pinned, like, all the way down. So having one image really helps cut down on the issues that you can run into with reproducibility.
[00:21:03] Unknown:
Digging a bit more into your experience of being a contributor to the Dask project and some of the ways that that plays into what you're building at Saturn, I'm wondering if you can just talk through some of the specific areas of the project that you've found yourself working within, or some of the sort of interesting edge cases that you've run into trying to run Dask as a service, and some of the weird ways that people are stressing that overall infrastructure.
[00:21:30] Unknown:
One of the common issues that I see is people thinking that Dask can be, like, even more magical than it purports to be. Maybe thinking that they can just install Dask and that will change things in their code, or that just by having Dask and a GPU, that will automatically distribute their computation across all the GPUs. And there's actually a couple more steps that you need to do to get that working properly. I don't take issues that our customers run into and go, like, file them and fix them on Dask necessarily. I spend my time more reading issues that people have already reported on Dask proper and addressing those as they're written.
So the issues that I encounter are much more, like I tried to do this merge on this data frame and turns out, like, this keyword argument isn't supported. Like, can we add that? Like, there's a lot of small compatibility issues that we run into. And trying to smooth out the wrinkles between the pandas experience and the Dask experience is something that I'm really interested in focusing on and trying to make that as smooth a process as possible. And when it fails, when we can't do what Pandas is going to do, trying to raise a warning or an error as soon as possible to give people somewhere to go from there, some understanding of what's going on, why it's going on, and what to do next. That's really what I spend my time doing.
There are, like, gnarly things that customers run into. There are some issues right now that a bunch of people are working on in Dask around really large Parquet files and how to read those efficiently. And that's super interesting work, and I think it's gonna be really beneficial to all sorts of people, especially people who have tabular data. But I haven't been as involved in that work. There's also an effort that's going on right now, which is a ways out yet, which is about higher level expressions and trying to figure out how to do better optimizations, essentially, on Dask DataFrames: trying to know more about what the outputs are gonna be, and what each specific task that you've chained together in your task graph is trying to do, so that we can do better optimizations.
[00:23:56] Unknown:
Bringing up the subject of a task graph reminds me too of the fact that, you know, a lot of the ways that people are interacting with Dask, particularly from a data science perspective, are, you know, through its compatibility layers with pandas and NumPy. But there's also just the underlying dask.distributed, which is being used by a lot of systems to do things like workflow orchestration, with projects such as Prefect or Dagster, and also, in some cases, just potentially replacing tools such as Celery for doing asynchronous task execution. I'm wondering if there are any interesting applications of Dask that you've seen people leveraging beyond just the pandas and NumPy layers.
[00:24:37] Unknown:
Yeah. I mean, you can basically do anything. Right? A task graph is just a dictionary. So there's a lot of cool things that you can do just by accessing that layer, and Prefect is a good example. We really encourage people to use Prefect. It has a similar goal to Airflow, right, of, like, just doing this workflow orchestration like you said, but it can use Dask as an executor to achieve parallelism. And you can basically access Dask at any level, and there's people doing it. Right? Like, it's not an uncommon thing for people to be working at that really low level. I think they tend to be people that we hear from less, like banks or people who are really, really focused on this high speed computation.
So I think it's a slightly less well understood group of users.
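A minimal sketch of that "a task graph is just a dictionary" point, using Dask's documented graph format with toy functions:

```python
from dask.threaded import get

def inc(x):
    return x + 1

def add(a, b):
    return a + b

# Keys name tasks; tuples are (function, *arguments); bare strings
# refer to other keys in the same graph.
dsk = {
    "x": 1,
    "y": (inc, "x"),
    "z": (add, "x", "y"),
}

print(get(dsk, "z"))  # walks the graph and prints 3
```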
[00:25:30] Unknown:
Another angle on Dask that I've come across a couple of times is as a replacement for something like Spark, where Spark has gained a lot of ground and popularity in the overall data ecosystem as a successor to the types of things people were doing with Hadoop and MapReduce, because of its ability to do sort of, you know, microbatches and streaming style workflows, as well as its capabilities as a workflow orchestration layer for doing things like ETL processing, and then also some of the built in machine learning capabilities. But with Dask being so closely tied into the Python ecosystem, a lot of those machine learning aspects are easily filled with Python native tooling. And because a lot of the data processing ecosystem is moving into Python, Dask becomes a natural place for that to go as well. I'm wondering if you've seen any of that style of workflow being run in Saturn.
[00:26:25] Unknown:
Machine learning workflows in particular are really common. And, yeah, Dask has integrations with scikit-learn, and there's also some integrations with PyTorch and other machine learning libraries that make it easier to use these things together. I mean, the main thing that Dask has over Spark is really just Python, right, and the whole Python ecosystem. So, like, if people want that, then Dask is a great solution for them.
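A sketch of the scikit-learn integration mentioned above, using the joblib backend registered by Dask's distributed scheduler; the estimator and data are toy stand-ins.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

client = Client()  # local cluster here; could point at a remote scheduler

X, y = make_classification(n_samples=1000)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)

# Fan the cross-validation fits out over the Dask workers instead of
# local processes.
with joblib.parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_)
```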
[00:26:56] Unknown:
In terms of the Dask community and the sort of visibility in the overall Python ecosystem, I'm wondering what your sense has been as far as the level of awareness that it has gained for people who are building machine learning workflows, or doing work with data, or who would benefit from this sort of distribution and parallelism, and particularly if you have any sense of its sort of relationship or juxtaposition with the Ray project, which is another project that is aiming for some of those same goals of parallelism and easily going from local to distributed computation?
[00:27:35] Unknown:
So first off, it's really hard for me to tell because, like, I'm nestled deep within PyData land. And from my perspective, it seems like most people know about Dask and most people like Dask, but I understand that I'm in, like, this particular little bubble. I think that the people who work on Dask are a lot of the same people who work on pandas, on NumPy, on xarray, on whatever else in the PyData world. So that seems like an advantage to me. So there's other little universes, right, in Python data science. It's not all PyData. But within this PyData world, Dask seems to be the one that is most tied into that community.
[00:28:19] Unknown:
As far as the sort of future direction of the Dask project, you've mentioned some of the ongoing work and some of the major projects going on there. But what are some of the ways that you see the project and the community evolving, and some of the, I guess, untapped potential in the project and the community and some of the ways that it can be used?
[00:28:42] Unknown:
With another Dask maintainer, I'm working on a documentation refresh right now. So, hopefully, that will clarify where we're at and make it clear what is currently possible in Dask. But I think that there's definitely a lot of potential to expand in the machine learning space, to expand how Dask interoperates with other machine learning tools like TensorFlow. Or, within the Dask org, there is a Dask-ML module that could be expanded, which provides a lot of the more basic machine learning algorithms.
There's opportunities for expansion there, but from my perspective, I think there's a lot of opportunities to help people understand what exactly is going on when they call Dask operations, and trying to make it so that if people get confused when they first use Dask, they don't then, like, stop using Dask. Right? Like, I think a lot of people's experience is they try to use Dask as a replacement for pandas, and then it doesn't quite work the way that they expect, and that might trigger them to then stop using Dask. But I try and figure out ways to catch those people and help them, like, get back into the fold.
That's what I hope for the project. And I think that looks like documentation; that looks like earlier, better error messages; and little stuff like that that really helps people
[00:30:13] Unknown:
diagnose their own issues and gain understanding about what's going on. In your work, both in the Dask community and at Saturn, I'm wondering if you can share some of the most interesting or innovative or unexpected ways that you've seen them used?
[00:30:30] Unknown:
I'm not sure if you're familiar with Pangeo; it's hard to define. But Pangeo is a group of scientists, mostly geo related scientists, so meteorologists, oceanographers, and things like that. And they have been using Dask and xarray and JupyterHub. They have been, like, power users of those tools for a while now, and their goal is really to enable computation on these massive earth science datasets that are stored on S3 or on Google Cloud. And so they're really, like, where I first learned about colocation of data being an important thing, because Pangeo's ability to use these tools to process their data right next to where it's stored has really enabled this whole world of open source computation in the geosciences that was previously rare.
It also allows people who maybe don't live in countries that have, like, massive computing institutions to actually work on their own datasets and to access cloud compute to do that. That's one of the coolest applications of Dask that I know of, and it's definitely worth looking at.
[00:31:51] Unknown:
In your experience of working at Saturn and helping to grow the platform and contribute back to the open source tools that you're using and building on top of, what are some of the most interesting, unexpected, or challenging lessons that you've learned in the process?
[00:32:05] Unknown:
So I feel like I keep learning over and over again about API creep, and how to try and limit the scope of both tools like Dask and applications like Saturn. I feel like there's this ongoing challenge of how to expand functionality in a responsible way that doesn't then create this maintenance burden down the road, and doesn't make promises to users that you won't be able to keep. So that's a big part of what I think about, and that's something that I think people don't necessarily think about when they try to go do their first contribution to an open source project. But it's definitely
[00:32:46] Unknown:
something I think about. Yeah. There's the sort of old trope that open source, you know, can be free as in puppy, where anytime somebody adds a contribution to the project, it's great and wonderful until it makes a big mess.
[00:33:00] Unknown:
Right. Right. And it's hard to be protective of a project like that while still being excited and interested and curious about the enhancements and improvements that people are trying to make. I think that's a real challenge. Yeah. It's definitely one of the ongoing issues that a lot of projects have to relearn, either
[00:33:21] Unknown:
from other projects or multiple times within the same community, particularly if there are sort of generations or cycles of contributors and maintainers who are coming into the project. So as the community grows, the same lesson has to be kind of relearned about what are the overall goals and, you know, intended scope of the project, and what are the points at which we say: that is a good idea, but it needs to be implemented outside of this project and not as a sort of core component of it, which adds some layer of friction as well. Yeah. I think in Dask, this issue is particularly
[00:33:55] Unknown:
distilled, because it's particularly in the APIs that are mimicking other APIs: it doesn't have to be a discussion about what the API will look like or how it'll be spelled, because that's already, you know, pandas has already decided or NumPy has already decided. So the issues are much more, like, whether it should be done and then, like, how. It's an interesting project to work on. So, definitely, if people don't have large data problems at all and they don't have collaborators, then, you know, they really don't need Saturn Cloud.
I mean, you can always stand up a JupyterHub. It's really a personal choice. Right? If you want to administer your JupyterHub and stand up, like, a Dask Gateway, that's great. You know, that's fine. Do that. But I think that our goal is really, for people who don't want that DevOps burden, that they don't have to take that on. And that stuff's all getting easier and easier, and people should 100% go try to do that if that sounds like a fun challenge. But if not, then Saturn Cloud is good for people who don't want to do that.
[00:35:27] Unknown:
And we talked a little bit about some of the future direction for the Dask project, but what is in store for the near to medium term for Saturn Cloud?
[00:35:35] Unknown:
Yeah. So I talked about how we have a version that you can just buy on the AWS Marketplace, and it installs into your own AWS. And then we also have a hosted version that's really good for individuals. We'd like to have sort of an in between, like a hosted teams version that makes it easy for smaller companies, maybe, or groups of academics or something like that, to work together without having to go through the process of setting it up in their own AWS. That's something that we're
[00:36:05] Unknown:
gonna be working on in the short term. Well, for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose the movie Peter Rabbit 2. Watched that one recently with the family, and it was just a fun, hilarious, well done movie. Definitely worth a watch if you're looking for something to keep you entertained for a little while. And so with that, I'll pass it to you, Julia. Do you have any picks this week? Yeah. My pick is the fruit, the pawpaw.
[00:36:35] Unknown:
I had my first pawpaw recently. And also I just learned that they're indigenous to the East Coast of the United States, which I did not realize at all. And they're very good. They're in season right now. And if you're not familiar, they kind of look like a giant kiwi, but with no fur. And then they taste sort of like a persimmon, but custardy or
[00:36:57] Unknown:
so I recommend them if you can find them. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Saturn Cloud and helping to contribute back to the Dask project and some of the other components of the Python data ecosystem. I appreciate all of the time and effort you've put into both of those, and I hope you enjoy the rest of your day. Thank you. Thanks for having me. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes.
And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Julia's Journey with Python
Overview of Saturn Cloud
Onboarding New Tools and Workflows
Common Roadblocks for Data Scientists
Scalability and Parallelization with Dask
Open Source and Community Engagement
Architecture of Saturn Cloud
Learning Curves and Best Practices
Challenges and Edge Cases in Dask
Dask Beyond Pandas and NumPy
Machine Learning Workflows
Dask vs. Ray
Future Directions for Dask
Interesting Applications of Dask
Lessons Learned at Saturn Cloud
Who Needs Saturn Cloud?
Future Plans for Saturn Cloud
Contact Information and Picks