Summary
Data applications are complex and continually evolving, often requiring collaboration across multiple teams. In order to keep everyone on the same page a high level abstraction is needed to facilitate a cross-cutting view of the data orchestration across integration, transformation, analytics, and machine learning. Dagster is an innovative new framework that leans on the power and flexibility of Python to provide an extensible interface to the complete lifecycle of data projects. In this episode Nick Schrock explains how he designed the Dagster project to allow for integration with the entire data ecosystem while providing an opinionated structure for connecting the different stages of computation. He also discusses how he is working to grow an open ecosystem around the Dagster project, and his thoughts on building a sustainable business on top of it without compromising the integrity of the community. This was a great conversation about playing the long game when building a business while providing a valuable utility to a complex problem domain.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- This portion of Python Podcast is brought to you by Datadog. Do you have an app in production that is slower than you like? Is its performance all over the place (sometimes fast, sometimes slow)? Do you know why? With Datadog, you will. You can troubleshoot your app’s performance with Datadog’s end-to-end tracing and in one click correlate those Python traces with related logs and metrics. Use their detailed flame graphs to identify bottlenecks and latency in that app of yours. Start tracking the performance of your apps with a free trial at pythonpodcast.com/datadog. If you sign up for a trial and install the agent, Datadog will send you a free t-shirt.
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host as usual is Tobias Macey and today I’m interviewing Nick Schrock about Dagster, an open source data orchestrator for powering data engineering, analytics, and machine learning
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what Dagster is and how it got started?
- What are the most common difficulties that organizations face when working with data projects?
- How does Dagster help in addressing those challenges?
- There are a number of workflow orchestration platforms, spanning a few generations of tooling. What do you see as the defining characteristics of the various options, and how does Dagster fit in that ecosystem?
- What are the assumptions that you made at the start of building Dagster and how have they been challenged, updated, or invalidated over the past year of working with end users?
- How are the internals of Dagster implemented?
- How has the design changed or evolved since you first began working on it?
- For someone who is building on top of Dagster, what is their workflow from first steps through to production?
- What are your guiding principles for designing the user-facing API?
- What are the available extension points for Dagster?
- What was your reason for implementing Dagster as a Python framework?
- With the benefit of hindsight, would you make the same decision today?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Dagster used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building Dagster and working to grow its ecosystem?
- When is Dagster the wrong choice?
- As you continue to build Dagster, what is your vision for it and its ecosystem?
- What are the next steps that you are taking to achieve that vision?
Keep In Touch
Picks
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Dagster
- Elementl
- IronPython
- Fluent Python
- GraphQL
- Maslow’s Hierarchy of Needs
- Hierarchy of Data Needs
- DAG == Directed Acyclic Graph
- Informatica
- Airflow
- Luigi
- Dagster Config Schema
- Dask
- gRPC
- MyPy
- Data Lineage
- Pandas
- Amundsen
- DataHub
- Gatsby.js
- Panama Papers
- Mode Analytics
- Papermill
- DBT
- Databricks
- Tobias’ Dagster Repository
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try out a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode today. That's l i n o d e, and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis.
For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your own home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today. Your host as usual is Tobias Macey. And today, I'm interviewing Nick Schrock about Dagster, an open source data orchestrator for powering data engineering, analytics, and machine learning workloads. So, Nick, can you start by introducing yourself? Yeah. My name is Nick Schrock, and I'm the founder of a company called Elementl. And our primary project is what you mentioned. Thanks. And do you remember how you first got introduced to Python? I actually don't remember kind of the first line of code.
[00:01:40] Unknown:
I believe in the late 2000s, I was actually figuring out how to script the .NET framework and there was a project called IronPython, and I think that's how I got introduced to it. But in reality, Dagster is kind of my first significant project in Python. It's where I really started being a real Python programmer.
[00:01:59] Unknown:
For somebody who is using this as their first entree into Python programming, what has your experience been as far as being able to ramp up and use it effectively, and some of the useful resources that you've been able to lean on to build the framework that other people will enjoy using?
[00:02:15] Unknown:
Yeah. So I think it's really important, when you're learning a new programming language, to find kind of a canonical book that can not just teach you the features, but the common idioms and the ways you should program, because best practices, you know, especially in kind of an older language, are just as important as knowing the features. I liked the book called Fluent Python, actually, which I felt really got me up to speed. Really, I think this is very common, but the challenge of Python is not the programming language itself. There are a couple quirks, but like it's very intuitive and whatnot. The challenge of Python is all the surrounding infrastructure, so managing dependencies, virtual environments, things of that nature, and especially as a framework author, dealing with the universe of Python dependency management is, you know, a total nightmare, and I think that's been really challenging.
I like to joke that, you know, the zen of Python is that there should be 1 and only 1 obvious way of doing things, but we're actually gonna have 4 competing package management systems, and we're gonna fork the language. So I think those have been the most challenging parts of dealing with Python.
[00:03:19] Unknown:
Yeah. Package management is definitely the bane of many Pythonistas and has been for a number of years. Although, I do feel like it's getting better, particularly with some of the forward looking packages like PyOxidizer for figuring out what the deployment experience looks like and simplifying the experience for end users of the package. So I'm excited to see where that goes. Anyway, bringing us back to the topic at hand, can you give a bit more of a description about what the Dagster project is and how it got started and your motivation for building it? Let's start off with the origin story. My motivation was that I was a Facebook engineer from 2009 to 2017,
[00:03:55] Unknown:
and kind of the thing I'm known for outside of the company is that I was 1 of the original co-creators of GraphQL, which is now a relatively popular open source technology. So I had this experience in developer tooling, and when I left Facebook, I was kind of searching around for what to do next and I kept on asking companies, both inside and outside the valley, what their biggest technical liability was, what they felt their rate limiting step was in their engineering processes. And this data engineering stuff kept on coming up continuously, as well as, like, ML infrastructure. People would say the same thing in slightly different ways, but it was this core problem. And then I started talking to real practitioners and seeing the tools they use, developer workflows, etcetera. And, you know, many aspects of the systems that they use are really technically interesting, but in terms of, like, developer ergonomics, workflow, productivity, etcetera, I was, like, relatively appalled, and I thought there was just a huge opportunity here. And I just get personally very frustrated and mad when I see really, really talented and smart people wasting their time or having an unnecessarily stressful experience doing their job, when that can be solved by software abstractions.
So I really got into this, and then when I dug in I really felt that data engineering and data science, in terms of kind of engineering best practices and hygiene, were similar to where front end was like 10 years ago. So like prior to GraphQL, React, etcetera, you know, front end engineering was a real wild wild west and there was kind of a software engineering deficit there. Like, you know, people didn't test things, there weren't really nicely structured frameworks, etcetera, etcetera. And I do feel this data domain is in a similar spot today. But it's in fact, like, societally more important, I think, to get ML and data analytics right than front end, because ML is supplanting or replacing or augmenting human decision making. Analytics drives incredibly important business and policy decisions. So it's really important to get this stuff right. And in terms of the difficulties that those organizations
[00:05:53] Unknown:
are running into and the conversations that you were having as you were deciding what to build next, what were some of the biggest challenges or most common issues that they were running into when trying to build and manage data projects?
[00:06:07] Unknown:
It's difficult to know where to start because the problems are so fundamental and on multiple dimensions. You know, if you kind of know Maslow's Hierarchy of Needs, where it starts, for a human, with like food and shelter and then builds up to self actualization. I feel like in the software equivalent of this, the data stuff is kind of still at layer 1, where the basics of, like, getting these things working, keeping them alive, having reasonable testing, it's in fairly dire straits in my view. And from a business perspective, you talk to any major decision maker and it's relatively clear, like, they can't manage and organize the data that they have. So they don't know what the data is, they don't know where it came from, and the failure rate of these projects is very high. It's very difficult to hire the right people to do them. So just at the very basics of getting things working in these data systems and kind of, like, keeping them alive and having productive developers, it's just incredibly challenging. And then you combine it with the fact that these systems are very complex and often involve multiple different kinds of roles interacting with each other in novel ways. You know, this, like, data siloing. Right? So you often have, like, data engineers and data scientists, for instance, attempting to collaborate together, but it's very siloed. But it's just a very subtly difficult domain of software.
And I just think this
[00:07:33] Unknown:
software engineering deficit is really the primary thing going on here. And in terms of trying to address those challenges, what are the design elements of Dagster that you had in mind that would help to simplify some of those problems and help to address the shortcomings of the existing ecosystem of data tools? Yeah. That's a good question.
[00:07:55] Unknown:
I think what it is is that I felt you needed a more opinionated software framework to structure these computations. It does a few things. 1 is that it forces you to structure your computations more in terms of functions rather than just tasks. So there's inputs and outputs. We have a type system on top of that that helps make these flows more self describing as well as more reliable. We also structured the software so there's natural seams for testability. I think another critical aspect is that you can actually view the graphs of computation prior to deployment, prior to execution, which makes them not just kind of this deployed artifact, but also a way for the different personas to interact with each other. So, like, the data scientists can view the graph of computations that the data engineers have created without having to, like, deploy it or execute it. And then just, like, higher quality tooling. I thought that, like, a lot of these systems, the workflow and orchestration systems that were out there, really didn't focus enough on the full end to end workflow. So I think it's important for this orchestration graph, as we call it, or DAGs more commonly, to be in the developer workflow, like, on your laptop, not as just a deployed artifact.
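To make that concrete, here is a minimal sketch of what structuring a computation as typed, functional solids looks like in the Dagster API of this era (circa 0.9); the solid and pipeline names are hypothetical, not taken from the episode:

```python
from dagster import pipeline, solid


@solid
def extract_users(context) -> list:
    # A solid is a function with declared inputs and outputs rather than
    # an opaque task; downstream solids consume what it returns.
    context.log.info("extracting users")
    return [{"name": "alice"}, {"name": "bob"}]


@solid
def count_users(context, users: list) -> int:
    # The annotations become typed edges in the graph, which Dagit can
    # render before anything is deployed or executed.
    context.log.info(f"counting {len(users)} users")
    return len(users)


@pipeline
def user_pipeline():
    # Composing the solids is what defines the DAG itself.
    count_users(extract_users())
```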
[00:09:14] Unknown:
To the point of the overall ecosystem, there have been different generations of projects that have tried to address this problem. They've all taken slightly different approaches, and they have slightly different ergonomics or focus for the end user. Wondering if you can just broadly characterize the existing ecosystem of workflow orchestration tools as it stands today
[00:09:43] Unknown:
and some of the defining characteristics of the differing options and then how Dagster fits within that overall ecosystem and what you see as being the differentiating elements of it? Yeah. That's a good question. I view it as, kind of, like there's 3 generations. There's very old things that, like, predate Airflow and Luigi, and often those are more, like, graphical or even, like, drag and drop tools, like Informatica or something. And then there's Airflow and Luigi, which I would call, like, generation 2. They were fairly exclusively workflow engines, meaning they were exclusively involved in ordering the computations. Like if you look at Airflow, the DAG is essentially a deployment only artifact.
It is a complete black box in terms of what computations are happening within each task. It is exclusively focused on just managing the order of computations, managing retries, and things of that nature. Now there's kind of a more next generation set of workflow engines and orchestrators, of which there are many, and I think there's a few trends among them. Some are very focused on vertically stacking and really coupling themselves to containerization platforms. So you're vertically integrating Kubernetes and really focusing on that aspect of it, which I think is an improvement in terms of deployment and operation, but by doing that, you often sacrifice kind of the developer workflow and local developer experience, which can actually still be very, very challenging. And then there's folks who are just kind of trying to do, I would say, like, a better job of exactly what Airflow did. I think our differentiator is that we think the orchestration graph, this DAG which interrelates all the computations, should be a very rich object.
The edges should be typed. It should embrace functional data engineering principles. It should be tracking what assets it produces so that you can link the computations and the assets together. And then it should also have a slightly opinionated programming model to facilitate testability and other aspects like that. So if you look at Dagster, we have the basic requirements of a workflow engine, i.e., we order computations, we do manage retries, we do that stuff, but we also have these higher order concepts: we have a type system, we have a configuration system, we have this notion of resources and some other abstractions, which I think are critical to structuring these systems properly.
[00:12:13] Unknown:
Yeah. I think the resources in particular are especially valuable because it makes it easy for 1 person to be able to define concretely how a particular set of computations needs to be able to interact with something like a database or the file system, or, for security purposes, being able to pull credentials from HashiCorp Vault, for instance, and then package that up and either distribute it within the context of a pipeline definition or as its own independent Python package that another user of Dagster can install and use just by passing in the necessary configuration objects. As somebody who has been working with Dagster, the exposure of the configuration schema to very clearly define what attributes are necessary to make a given resource function properly is very useful for having those strict contracts between the different components of the system, so that there isn't a lot of the confusion that can happen if you are facing a function that just says *args and **kwargs and wondering, okay, well, what is actually going to happen when I pass some of these arbitrary values into this function?
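As a loose illustration of that resource-plus-config-schema contract, here is a hypothetical sketch in the Dagster API of this period; the resource, solid, and bucket names are all invented rather than taken from the episode:

```python
from dagster import Field, ModeDefinition, execute_pipeline, pipeline, resource, solid


class ReportStore:
    """Tiny stand-in for a real storage client such as boto3."""

    def __init__(self, bucket):
        self.bucket = bucket

    def write(self, key, body):
        print(f"writing s3://{self.bucket}/{key}: {body}")


@resource(config_schema={"bucket": Field(str, description="Bucket that reports land in")})
def report_store(init_context):
    # The config schema is the explicit contract: whoever wires this resource
    # up knows exactly what must be provided, with no **kwargs guessing.
    return ReportStore(init_context.resource_config["bucket"])


@solid
def build_report(_context) -> str:
    return "all systems nominal"


@solid(required_resource_keys={"report_store"})
def publish_report(context, report_body: str):
    # The solid declares which resource it needs instead of reaching for globals.
    context.resources.report_store.write("daily.txt", report_body)


@pipeline(mode_defs=[ModeDefinition(resource_defs={"report_store": report_store})])
def reporting_pipeline():
    publish_report(build_report())


if __name__ == "__main__":
    # Launching a run means supplying exactly the declared config fields.
    execute_pipeline(
        reporting_pipeline,
        run_config={"resources": {"report_store": {"config": {"bucket": "reports"}}}},
    )
```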
[00:13:25] Unknown:
I couldn't have said it better myself, Tobias. Yeah. I
[00:13:28] Unknown:
might quote you on that 1. That was a lovely explanation. Thank you. As somebody who is relatively new to this overall problem domain as well as the language that you're working in and you're trying to solve a very large and fundamental problem. What are some of the assumptions that you made going into the start of building Dagster that have been challenged or updated or invalidated over the past year or 2 that you've been working on it and working with end users who are giving you feedback on the ways that they're trying to use the framework?
[00:14:01] Unknown:
The 2 things that kind of come to mind, and I think they're related, is that when I kind of first started building the prototype and the initial version of the system, it was very much targeted towards what I call the leaf developer. That's a generalized term for the data scientist, the data engineer, the people who are responsible for actually the business logic of the computation. And we were going to try to be, like, very infrastructure agnostic. So, like, effectively, it would only be, like, a software abstraction that you could execute over existing workflow engines. Right? So you could, like, execute on Airflow or Dask, which is a slightly lower level of abstraction. But really try to be only a software abstraction.
That was a little naive, I think. I think we found out 2 things. 1 is that we still don't believe in vertically integrating with infrastructure. Like, we support Kubernetes. We don't require it. You know? We can still execute on arbitrary computational substrates, but we have to have more opinions about the way that infrastructure works. And second of all, it's not just the leaf developers. I think what we found, you know, because, like, your resource example is perfect, is that a really big challenge in these systems is managing the relationships between the different personas or jobs in the system. And this is where we get into who this is targeted for.
Now the framework that we use when we are designing or modifying the system is often not just, like, the technical abstraction layers, but the abstraction layers between the different roles or jobs in the system, right? So this is what you were talking about with the resources stuff. Yeah, the way our resource system is designed is that in our head we're like, okay, there's, like, 1 team or 1 person who's responsible for kind of crafting these resources and providing an interface to the leaf developers. The resource abstraction is not just a technical API layer, but it's also, like, an organizational API layer. And even for someone like you, who is more of a full stack developer responsible for both infrastructure and the domain logic, you are doing 2 jobs at the same time. So it's actually very useful for you to have an abstraction layer so you can, like, organize the thoughts in your head a bit. I think we've realized that in order to address the problem we had to kind of expand the scope of the system a bit and then really deeply think about how the different roles in the system interact.
[00:16:35] Unknown:
Yeah. And that's a common theme that's come up in a lot of the conversations on my other show, the data engineering podcast, where the reason that solving for a particular problem in the data space is so complicated is not just because of the technical issues, but because there are so many different stakeholders throughout the entire life cycle of any project, that is not the case with just a web application or an alerting tool where you're primarily focused on solving for the needs of the developer who's building the application. And then the developer's job is figuring out what the end user's concerns are.
Whereas in the data domain, you have this complete loop of the data provider who is producing information using something like a click tracking system or pulling data from an application database. You've got your data engineer who needs to be able to pull that information into a particular location, maybe model it so that it fits particular schema for being able to be analyzed easily. You got your analysts or data scientists who are actually working with that processed data to be able to gain some sort of insights. You've got the business users who are using those insights to make decisions, and then they have additional questions that they need to ask. But in order for those questions to be answered, it needs to go all the way back to the source systems to say, I need you to track this additional information, or I need the data engineer to be able to pull that from the source systems and process it for my analysts. And so there are a lot more people who need to be able to work within the context of that overall framework rather than just an isolated data engineer or an isolated web developer.
[00:18:12] Unknown:
I should no longer be stunned, but I am continually stunned by the complexity of both the systems and the interactions between people in these systems. 1 of the things that we believe is you can't just, like, wish this complexity away. You have to embrace it and manage it. So I think that a lot of people who are just like, oh, we have this very simple solution, often kind of miss the mark, because the notion is that you can't make this stuff simple. What you wanna do is be able to compartmentalize and manage the complexity, so the right people are doing the right things
[00:18:42] Unknown:
rather than trying to have some sort of silver bullet here. That's also where a lot of the trends in things like microservices versus monoliths come up: what are the actual communication patterns of your organization, and the tendency for the communication patterns of your software to reflect the organizational hierarchies. And as you said, being able to compartmentalize those aspects and then having that within the framework of a tool chain that's intended to maintain the overall capabilities of the end to end workflow, it definitely means that you have to think very carefully about how you create those dividing lines and the handoff points to support those communication patterns while still being able to have the entire system interoperate as a single unit. Continuing on that, can you dig a bit more into how Dagster itself is architected and the ways that the design and implementation of the system has changed or evolved since you first began working on it? Yeah. So there's a lot of different
[00:19:42] Unknown:
ways that we can approach that problem. I'll just pick 1, I guess, in terms of, I think, an interesting evolution and how we ended up being structured. So at the beginning, 1 would simply write 1 of these DAGs, we call them a pipeline, which consists of nodes which we call solids, and then we have this local development tool as well as our production tool called Dagit, which is a web front end for this. And we would actually literally load that code in process into that same server process, and that would be kind of how you would, like, inspect the pipelines and execute them and whatnot. But as time has gone on, as we've kind of gotten more mature about thinking how infrastructure works and how team structure works, we actually did a huge rearchitecture this spring where we made it so that we rip out the user code and then have our web server and local development tool actually communicate with that user code over a gRPC layer. This is very interesting because this ends up allowing you, say in complicated deployments where you're serving multiple teams, to have the infrastructure team managing the process or container which contains the core infrastructure tools. Those are communicating over well structured gRPC interfaces to containers that contain the various user pipelines.
And so that means that all these things can be updated and deployed independently. You can keep the core infrastructure up 100% of the time without having to restart it as the users are updating their code, and it has a lot of other kind of positive aspects. So that's kind of like 1 infrastructure element of design where we've very clearly separated user code from system code, and then the teams' code is separated from each other. So you can have 1 set of pipelines that takes in 1 set of dependencies, another set of pipelines that takes in another set of dependencies or even a different Python version, speaking to my first gripe at the beginning of the show about how difficult it is to deal with the Python ecosystem on that front. So I think that's 1 interesting aspect of this. And then the other aspect is that we're very focused on layering the system, in our view, properly so that it becomes a true platform and not just a vertically integrated tool. So I talked about this, like, gRPC interface.
Right? We also have a GraphQL API over kind of our higher level web server, which means that we have our front end tool Dagit, but you can also build your own tools on top of that. And then as you go down the system, we also have other pluggable layers so that, for example, we can deploy and manage arbitrary infrastructure. We have, like, a prefab Kubernetes deployment, but we also have users plugging into those same abstractions that our Kubernetes extension uses to, like, execute this thing on their own custom PaaS or some, you know, kind of ECS on AWS or whatever custom infrastructure that people come up with. You know, we've really focused on layering the system to have, like, vertically stacked instances of the system that are easy to use, but make it possible to use across an entire universe of infrastructure.
[00:22:46] Unknown:
And for those extension points, what is your guiding principle or your overall heuristic for deciding how to create those interfaces and how to expose them to users in order to make them easy to implement, but also easy for you to maintain and achieve the necessary level of flexibility and expressivity of the overall system.
[00:23:10] Unknown:
Kind of my overall principle for API design is always to make simple things simple and complex things possible, which is really easy to say and hard to do, because there's always this tension in these systems where you desire to impose constraints on your user, and that constrained interface therefore allows the infrastructure provider to have a ton of flexibility in terms of how to implement it. But if you do it in such a way where it's over constrained and there's no escape hatches, then you've severely limited the different use cases for your product.
Being very deliberate about, like, what constraints you're imposing on your user, but then allowing maximal flexibility on the other dimensions, is something we think about a lot. You can see this, for example, in our type system, which is kind of this gradual, optional type system where we try to make very straightforward things easy, like passing around scalars, and we have common libraries for passing around data frames and common tools and whatnot. But the core type check is just an arbitrary function that a user can execute and can impose whatever constraints they want on the data that's flowing through their system.
And that allows the system to be able to adjust to this extremely heterogeneous complicated world that is the reality of the world that we're dealing with.
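As a hedged sketch of that idea, a Dagster type in the API of this era is essentially a name plus an arbitrary check function; the non-empty DataFrame rule here is just an invented example, not something prescribed by the project:

```python
import pandas as pd
from dagster import DagsterType, OutputDefinition, TypeCheck, solid


def non_empty_frame_check(_context, value):
    # The check is an arbitrary function, so it can impose whatever
    # constraints make sense for the data flowing through the graph.
    if not isinstance(value, pd.DataFrame):
        return TypeCheck(success=False, description="expected a pandas DataFrame")
    return TypeCheck(success=len(value) > 0, description=f"{len(value)} rows")


NonEmptyDataFrame = DagsterType(
    name="NonEmptyDataFrame",
    type_check_fn=non_empty_frame_check,
    description="A pandas DataFrame with at least 1 row",
)


@solid(output_defs=[OutputDefinition(NonEmptyDataFrame)])
def load_orders(_context):
    # The declared type shows up on the typed edge in Dagit, so the DAG
    # documents itself as you trace through the computation.
    return pd.DataFrame({"order_id": [1, 2, 3]})
```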
[00:24:36] Unknown:
And on the note of typing, because of the fact that you have your own means of expressing certain types that are native to the Dagster ecosystem, but then you also have the type information that is associated with the different variables and return values of functions. How are you working to maintain compatibility in both directions where somebody can use your type system for making sure that the outputs of computations are in the appropriate shapes and the appropriate objects that are being passed around for the overall workflow execution, but that the type information for something like mypy is also being propagated through the system to ensure that you are able to try to reduce the occurrence of bugs because of improper values being passed through the actual logic that's being written in Python within the context of the Dagster framework? Yeah. Dealing with MyPy in particular
[00:25:28] Unknown:
is interesting because, like, we love it when people use MyPy, but now you're kind of, like, dealing with 2 type systems at the same time. There's a bunch of behavior attached to Dagster types that goes beyond the Python type system. 1, it's more flexible by nature, in terms of it's defined by arbitrary computation, but we also, like, tie other behavior to types. Like, types project into our config space, so you can decide how to load them from the outside world via our config system. It also controls serialization behavior as we marshal data from 1 node of computation to another, and other kinds of behaviors like this. You know, what we try to do, if you've been very structured about typing your existing code base, is we make it very easy to create Dagster types that are effectively just like, listen, I've already typed this thing, so don't impose any additional type checks, just do, like, an isinstance check. There's, like, 1 dimension if you've done that, and then there's the other dimension where, like, the code base isn't typed at all and then you wanna kind of put the typing information in the Dagster type checks themselves. But the nice thing about the Dagster types is that they end up being exposed in our tooling. So you can, like, open up your DAG, and your DAG is typed. So it serves as, like, this really compelling substrate of documentation where you can kind of, like, trace through your computation to understand what's happening. That's another interesting element of the extensibility
[00:26:49] Unknown:
and integration capabilities of Dagster is the metadata that you're able to produce throughout the context of executing these computation graphs and the way that that hooks into the broader data ecosystem for things like lineage tracking or metadata management or auditability and governance of the data that you're processing. So I'm wondering if you could maybe talk to the ways that Dagster fits within that broader ecosystem and the efforts that you're making to build out an ecosystem around Dagster as a platform to make it easier for newcomers to the system to be able to adopt it and be effective in the shortest possible time. I think 1 of the interesting things about the data ecosystem
[00:27:34] Unknown:
is that there's integrations to do on multiple dimensions. Yeah. You need to integrate it with, like, a physical computational substrate, storage systems, data lineage and provenance systems, like you were talking about, the data tools themselves, you know, whether it be Spark, Pandas, a data warehouse, etcetera. And the centralized thing that needs to be aware of all these things is this orchestration graph. Because as the computation unfolds, that's the thing that is, like, storing everything and orchestrating the computation, which in turn, in my view, should be populating metadata management or whatnot. In terms of the topics you were talking about in your question, the data cataloguing, metadata management, etcetera, the primary way that we communicate and provide a substrate for metadata management is our event log.
So as these computations unfold, we actually have this structured event stream which says, like, hey, I started this step of computation. It has this input. This input has these properties. I've produced this asset, this, like, file in some S3 store somewhere. We call those asset materializations, and we end up building this immutable log of metadata about everything that's happened. That's the base of our operational tools. It's the basis of our tool that we call the asset manager. And I think the really powerful thing about that immutable log is that it ties all these artifacts, whether it be an asset materialization or, like, a data quality test, we call those expectations, it ties it to a computation in, like, a very verifiable way. So you finally have a place where you can go and look up the name of an asset, which is totally user defined. Let's say you're just, like, naming files in your S3 data lake, and you can look that up with, like, a fast type ahead and see, like, oh, this was touched yesterday by this pipeline and 2 days ago by this pipeline. Then you go to another file and you see, like, it hasn't been touched for 8 days, and the last successful run was from this pipeline, so I probably have to go talk to the person who runs that. So this provides a base layer which links computation to data.
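To illustrate what 1 of those structured events might look like in code, here is a hypothetical solid (not from the episode) that yields an asset materialization into that event log; the asset key, path, and metadata fields are invented:

```python
from dagster import AssetMaterialization, EventMetadataEntry, Output, solid


@solid
def publish_daily_summary(context, rows: list):
    path = "s3://example-bucket/summaries/daily.csv"  # hypothetical location
    # ... write `rows` to that path with whatever storage client is in use ...

    # Record in the immutable event log that this run produced this asset,
    # along with metadata that downstream tooling can surface later.
    yield AssetMaterialization(
        asset_key="daily_summary",
        description="Daily summary written to object storage",
        metadata_entries=[
            EventMetadataEntry.int(len(rows), "row count"),
            EventMetadataEntry.path(path, "path"),
        ],
    )
    # Because this solid is a generator, its output is yielded explicitly too.
    yield Output(path)
```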
I think that it will provide a very interesting kind of base substrate of metadata and data, because we're not going to end up building, like, a full catalog in any sort of near term time frame. We're not gonna build, like, a competitor to Amundsen or DataHub or any number of the proprietary tools. But what I think those tools can do is consume this event stream to build this kind of base layer of metadata and then link it to all the computations. And that's the basis of really interesting data lineage products and data cataloging and so on and so forth. The other element of building the ecosystem
[00:30:23] Unknown:
is making it discoverable for people to see examples of code that people have written using Dagster or libraries that people have written for being able to provide things like default resources for people who might be interacting with AWS or particular databases, as well as the solids that I know are composable that can be potentially used within multiple contexts because of the pluggability of the configuration schema and this functional orientation of the overall flow. So I'm curious what your thoughts are on fostering that type of ecosystem and growth of the community.
[00:31:03] Unknown:
That's a fantastic question because that's really something we need to focus on. We really haven't been focused on community growth over the last year because we've been explicitly focused on a limited set of design partners to mature the system. And now we're kind of transitioning to more of a growth phase. We've really seen our kind of early users get a ton of value out of the system, and we're really confident it's more generally applicable. But now this leads to your question, which is kind of community growth, both in terms of user growth as well as, like, ecosystem management. And this is something I think we really need to work on, because right now the way we're structured is we have a monorepo with all the kind of integrations that we manage as well as the core system. That makes it a lot easier for us to manage, because if we do, like, a cross cutting change that touches a few integrations simultaneously, we can push those changes in an atomic fashion, and we end up, like, pushing up, I don't know, like, 20 packages every time we do a release.
But that is not gonna work forever. I think, you know, the eventual vision is to have a broader ecosystem of plugins that can live in independent GitHub repos. I actually look up to the Gatsby project in terms of the way they manage this, because they do an interesting thing where if you, like, annotate a GitHub repo in a certain way, I forget if it's tags or something else, then they actually crawl those things and build an index of community available plugins. And I think we're gonna have to do something like that in order to manage the inevitable: in the terminal success case, there will probably be hundreds if not thousands of libraries to integrate with all the various tools that exist out there. You know, you can't have that in 1 GitHub repo. I think it's a super insightful question, and I'm excited to get this built out. Yeah. And 1 of the stepping stones that I've seen work at least reasonably well sometimes is curating a sort of awesome Dagster GitHub repository where it's just a list of
[00:32:57] Unknown:
libraries or open source repositories of examples that people have built as a way to let people discover some of that versus having to maybe crawl through the backlog of the Slack history or look through the mailing list or GitHub issues to try and see who are the people who are actually using this or the dependents graph in GitHub that they've introduced.
[00:33:17] Unknown:
That's a fantastic idea, Tobias. I might put this on the to-do list for the team.
[00:33:23] Unknown:
Going back to the end user perspective of somebody who is building a data workflow and they're using Dagster, what is the overall workflow of getting started with writing a set of solids or resources and then going through to getting it put into production and executing it and maintaining it and just some of the edge cases or challenges that they should be aware of in that process?
[00:33:46] Unknown:
It's an open source Python framework. So the first step is pip install dagster, and then you almost always wanna pip install dagit, our graphical tool, and then you start writing some code. I think this aspect of the system is something that works pretty well. You can go from kind of 0 to a hello world DAG and then have it running locally on your laptop and be able to play with the system very, very quickly. I think where we definitely need to do better work is on the deployment side of things. I think it's partially because we made an explicit decision to be execution substrate and storage substrate agnostic, which means that the system is a little more pluggable and complicated.
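As a rough sketch of that zero-to-hello-world path on a laptop (a hypothetical file, not code from the episode), the local loop looks roughly like this, with the Dagit UI pointed at the same file via something like `dagit -f hello_dagster.py`:

```python
# hello_dagster.py -- run directly with `python hello_dagster.py`,
# or load it into the local UI with `dagit -f hello_dagster.py`
from dagster import execute_pipeline, pipeline, solid


@solid
def get_name(_context) -> str:
    return "world"


@solid
def say_hello(context, name: str):
    context.log.info(f"Hello, {name}!")


@pipeline
def hello_pipeline():
    say_hello(get_name())


if __name__ == "__main__":
    execute_pipeline(hello_pipeline)
```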
So it all depends on what your deployment target is. Our probably best supported deployment target is Kubernetes. So if you have a Kubernetes cluster that you can target, we have a prefab Helm chart. So you kind of build up your local pipeline, get it executing, then you have to kind of plug this into a Helm chart and decide, like, exactly how you want to configure stuff, and then you can deploy it relatively easily. And then you also have to think about how you store the intermediates that are flowing between the different nodes of computation in the system. Again, if you use, like, S3, there's an out of the box integration for that, which is very little effort. But if you need to use something more complicated, you're gonna have to dig in and figure out how some of our internal interfaces work. Was there anything specific you were thinking of, Tobias? No. Just generally what the user experience looks like, from I'm interested in this tool to I've built something with this tool. And as you said, the deployment
[00:35:31] Unknown:
story, it depends. So the fact that it's easily deployable on Kubernetes is definitely a useful first step, particularly for people who are inclined to just use a managed Kubernetes platform. Or the fact that you can just deploy it all on a single box in EC2 is also a useful aspect, which is the approach that I'm taking personally for my deployment, at least to start with, largely trying to get an idea of what the developer experience looks like for building for Dagster and writing code in the Dagster framework. I think that 1 of the interesting aspects of the Dagster programming model is the functional composition and the solid as the core building block. And the fact that for each of those different solids, you can pass in a configuration schema to say very concretely, this is the information that you need to provide to this. This is what's optional.
These are the dependent inputs and outputs so that, as you said, you can construct that overall graph to see how everything flows through the system, but also being able to treat them as isolated functions for testability purposes. Maybe you wanna dig more into the testing aspect and the sort of validation and the quality control capabilities that DAGSTER provides.
[00:36:47] Unknown:
You know, the goal of 1 of these solids is for it to be relatively self describing. So there's a few dimensions of that. Right? There's its inputs and outputs. Right? So what's the input data that it's gonna need in order to successfully compute? And then what's kind of the output data that it would pass to the downstream solids so that they can instigate computation? It also self describes the configuration it needs, as you mentioned, right, which is less like the input data and more like knobs, you can think of them as control knobs, like changes of behavior and whatnot. And then lastly, they also self describe what resources they require.
Right? Do they require a connection to an S3, you know, system or a connection to a database? And the goal here is so you can look at 1 of these solids in isolation in our tooling and understand everything that needs to be true in the world in order for that computation to successfully complete. By doing so and embedding it in our artifacts like that, it provides 2 dimensions of value. Right? 1 is what I just described, which is, like, understandability. But then by also kind of allowing programmers to get into these seams, so to speak, and, like, articulate these seams, you also provide a testability aspect, because these data computations are very difficult to test. You have to be able to parameterize the data, but they often also depend on heavy external resources like a Spark cluster or data warehouse or S3 or something.
So if you want to test your code you actually have to have a layer of indirection between your business logic and the infrastructure concerns, and that's where that resource aspect comes into play. And then you also wanna be able to change the behavior of the solid based on configuration. And I really think this architectural decision is what distinguishes the framework from other similar frameworks. So just comparing to the trade offs that, like, Airflow makes, which is kind of the, I would say, the primary incumbent in the space, their nodes in their task graph, their DAGs are explicitly bound to a single environment.
Right? In fact, their documentation says that if you have to pass data between the tasks, you probably need to restructure your tasks. So they explicitly guide you to create functions which have no parameters and that are bound to a specific execution environment, which makes that DAG inherently not testable. We made the trade off, and we had to build more abstractions to do it. So it's like you have to learn more things in order to get this done. But our graph is a much more logical graph that's executable in multiple contexts, on arbitrary data, and you can execute arbitrary subsets of that graph, and those are really kind of the 3 legs of the stool of testability.
If any of those legs go away, it kind of all falls apart, right? Like, you might be able to parameterize data, but if you can't execute it in a different environment than production, it's still not testable. Likewise, if you can execute in different environments but you can't change the input data, it is likewise not testable. So we've really gone to a lot of effort to make these units of computation logical units. They are named units of business logic, and they're meant to be executable in multiple contexts, which makes them much more testable and reusable.
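As a hedged sketch of what that testability can look like in practice, here is a hypothetical solid executed in isolation with stand-in input data and a stand-in resource; the names and wiring reflect the test utilities of this era, not anything specific from the episode:

```python
from dagster import ModeDefinition, ResourceDefinition, execute_solid, solid


@solid(required_resource_keys={"warehouse"})
def count_new_users(context, users: list) -> int:
    # Business logic sits behind a resource seam, so production
    # infrastructure can be swapped for a stub in tests.
    already_known = context.resources.warehouse.known_user_count
    return len(users) - already_known


class FakeWarehouse:
    known_user_count = 1


def test_count_new_users():
    # Execute the solid in isolation: fake resource, parameterized input data.
    result = execute_solid(
        count_new_users,
        mode_def=ModeDefinition(
            resource_defs={"warehouse": ResourceDefinition.hardcoded_resource(FakeWarehouse())}
        ),
        input_values={"users": [{"name": "alice"}, {"name": "bob"}]},
    )
    assert result.success
    assert result.output_value() == 1
```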
[00:40:21] Unknown:
This episode of Podcast.__init__ is sponsored by Datadog, the premier monitoring solution for modern environments. Datadog's latest features help teams visualize granular application data for more effective troubleshooting and optimization. Datadog Continuous Profiler analyzes your production level code and collects different profile types, such as CPU, memory allocation, IO, and more, enabling you to search, analyze, and debug code level performance in real time. Correlate and pivot between profiles and traces, and Application Performance Monitoring live search lets you search across a real time stream of all ingested traces from your services.
For even more detail, filter individual traces by infrastructure, application, and custom tags. Datadog has a special offer for Podcast.__init__ listeners. Sign up for a free 14 day trial at pythonpodcast.com/datadog. Install the Datadog agent and receive 1 of Datadog's famously cozy t shirts for free. Digging more into the specifics of Dagster as a Python framework, what was your decision process for figuring out what language to implement this in, since it started off as a greenfield project where you were focused more on the overall problem space than on any particular language capabilities?
And with the benefit of hindsight and the past couple of years of work behind you, do you think you would make the same decision today, or are there any changes to the overall architecture or implementation strategy that you would go with?
[00:41:57] Unknown:
So, obviously, if I could go back in time, I would, you know, take everything I've learned and apply it then. But I certainly don't regret using Python at all. It is the lingua franca of data today. Everyone who deals in the data ecosystem at some level has to deal with Python. I think the power of Python, the reason why it's so popular in the data community, or 1 of the reasons, is that it's a language of what I call wide dynamic range. Right? You can build very real systems in Python. Like, it really does scale up conceptually. The Instagram back end is a Python monolith, for example.
But it also scales down very effectively in a conceptual way. Meaning, you can throw a relatively nontechnical user into a Jupyter notebook in that environment and they can be productive and do useful things. So this wide dynamic range of Python makes it a very powerful tool in these multi persona systems. So I certainly don't regret using it. That was the reason why we chose it a couple of years ago. It's kind of the obvious choice. You know, as time goes on and our infrastructure needs get more complex, I can imagine extracting out services in more complicated deployments, or even, like, commercial contexts where we'll have services written in other languages for performance reasons or technical reasons and so on and so forth. But for the APIs and the system that we present to our direct customer, I definitely have no regrets choosing Python and would do the same thing today.
[00:43:33] Unknown:
And in terms of the ways that you've seen Dagster used or things that you've built with it, what are some of the most interesting or innovative or unexpected things that you've seen built? Oh, this is 1 of my favorite subjects. So a couple things come to mind. We have 1 user who's actually
[00:43:50] Unknown:
I believe they are either in Israeli intelligence or adjacent to Israeli intelligence. So they use Dagster in an air gapped data center to do data analysis on graphs of payment data in order to do investigatory work on the Panama Papers network of money laundering, which I thought was amazing. That was, like, almost like filling out a Mad Lib of, like, interesting subjects to use for data processing. I also liked it because it spoke to kind of the flexibility. Like, they had all these, like, crazy custom infrastructure needs and were integrating with a lot of interesting tools. They actually posted about it, I believe, and blogged about it. Another really interesting use case comes to mind.
1 of our early design partners is Good Eggs, and they use the system for, like, their data processing. They do all sorts of interesting stuff. So they have to ingest Google Sheets that are manually inputted, and they use our type checks to ensure that those Google Sheets abide by their contract. If those fail, they actually email the person responsible for manually inputting it, as well as their manager, who's a nontechnical ops person. That ops person is actually able to talk to that person. They fix the spreadsheet, and that ops person can actually go into our tools and, like, restart the pipeline. So they've been able to build this self-service platform where someone who cannot code is able to kind of self-service 1 of these pipelines and completely take the data platform team out of the loop. So that was super cool. The other awesome thing that this team does is they actually build pipelines to manage the data system itself.
So they have a pipeline which actually goes out and talks to both their data warehouse as well as their Mode Analytics instance, and they detect which tables are no longer being accessed by reports, effectively which tables are no longer used and which reports are no longer used. And through this 1 pipeline, they actually do the computations which do that analysis, the meta analysis of the data pipelining system, and then they use Dagster. They do their actual production computation in a Jupyter notebook using our integration with a thing called Papermill, and then they actually automatically submit GitHub PRs which delete the tables or the Mode reports in question, and then they use our asset tracking system to track those GitHub PRs.
So you can actually go in and search for a specific GitHub PR, you know, if they're like, why was this produced? And you can see, like, oh, this pipeline produced it. Here's why it produced it. This provides, like, an amazing capability for them to, like, manage their own data platform with the data platform itself, which I thought was, like, incredibly novel. And we're actually gonna be partnering with them to produce a case study in the next few months, which I'm really excited about. I just was blown away by the novelty of that application. It was so cool.
[00:46:53] Unknown:
Interesting to see the problems that people will solve when you give them the tools that help them think about it in a particular way. Yeah. Exactly. And as somebody who has been building the technology and managing the team who is producing the framework and working with the community of end users, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:47:15] Unknown:
It's a big challenge to grow a user base, and, you know, I think we're still working on it. Like, I think someone could listen to this podcast and, like, critique me and be like, listen, you still don't have, like, the exact right messaging around this. But, like, it's hard translating what you've built into very precise messaging that can be quickly understood, especially with 1 of these systems where you don't really get it a 100% until you're in it. You know? I think it's 1 thing for us to be talking about the resource system, just as an example since you mentioned that. And someone listening might be kind of like, oh, that's somewhat interesting, I guess. I don't think you can really get its value and power until you use it. I think, like, it's that art form of knowing what's the right thing to communicate to the right people in order to get them interested in the system, so that once they're in it they kind of, like, end up getting all the other aspects of it that are interesting.
I don't know if that resonates with you, given the fact that I referred to exactly what you were talking about. Yeah. No. It's definitely
[00:48:22] Unknown:
easy to see something and say, oh, well, that's kinda interesting, but not realize what you personally might use it for until you see somebody who's doing something similar and solving a problem that you happen to have. And then you say, oh, I get it now. Now I'm gonna go and actually check this out and see what other things I can use it to build. You know, it's definitely a problem in the technical space of there are so many different choices for so many different problems and so many different areas of potential overlap that it's hard to really be able to focus down on. This is something that solves my problem in the way that I think about it versus that's something that is interesting but doesn't address my needs.
[00:48:58] Unknown:
We struggled with this with the GraphQL experience to some degree too. 1 illustrative example I always talk about is that in the GraphQL context, a lot of our messaging was around solving this problem of multiple round trips between client and server for mobile developers. And that was, like, a very concrete pitch: instead of, like, going back and forth from the client to the server multiple times, you could just do it once. And for a mobile developer, that made a ton of sense. It was an immediate value proposition that you could grab on to. And then only through that did they actually learn that the real value is this kind of new client-server programming model that GraphQL entails. Honestly, that dimension of value, that single round trip value, doesn't matter as much; like, most people are using GraphQL to build, like, internal apps that companies have, in which case, like, the round trips between the client and the server don't matter. And it's really actually this programming model. So I think, like, that's an interesting example of, like, what you communicate in terms of getting people interested and a clear value proposition versus, like, what the true value is once you're dealing with the system. For people who are taking a look at Dagster and they're considering using it to solve their problems, what are the cases where it's the wrong choice? Yeah. So the most obvious 1 that comes to mind is that if your computations are exclusively in SQL and exclusively on a single data warehouse, Dagster is not the right choice.
Something like dbt is, for example. The other thing is, if you don't care about the things that we care about, if you don't think testing your pipeline is important, and you don't think there's a software engineering deficit in these systems, and this all seems like a bunch of unnecessary work, then it's probably not the right choice for you, and that's totally fine. Where we've found it to be right is with people who want to do testing and want more well-structured systems, but don't really have a toolkit through which to do it. So it's that engineering ergonomics thing. And the really successful teams have been these data platform teams who need to serve multiple constituencies and want a structured way of plugging their data science team, their analysts, and their data engineers into one cohesive system where you can actually logically interlink all the work that they're doing.
But that's not everyone.
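To make the testing and resource point concrete, here is a minimal sketch of the pattern being described, using roughly the solid/pipeline/mode/resource APIs Dagster exposed around the time of this conversation. The warehouse resource, its class, and the pipeline are all made up for illustration; the idea is only that the solid depends on an abstract resource, so the same graph can run against a fake in a unit test and a real system in production.

```python
from dagster import ModeDefinition, execute_pipeline, pipeline, resource, solid


class InMemoryWarehouse:
    """Stand-in warehouse used for tests; it just records writes in memory."""

    def __init__(self):
        self.tables = {}

    def write(self, table, rows):
        self.tables.setdefault(table, []).extend(rows)


@resource
def test_warehouse(_init_context):
    return InMemoryWarehouse()


@resource
def prod_warehouse(_init_context):
    # In a real project this would construct a client for the production
    # warehouse; it is left as the in-memory stand-in here so the sketch runs.
    return InMemoryWarehouse()


@solid(required_resource_keys={"warehouse"})
def load_users(context):
    # The solid only talks to the abstract "warehouse" resource, never to a
    # concrete system, which is what makes it testable.
    context.resources.warehouse.write("users", [{"id": 1, "name": "ada"}])


@pipeline(
    mode_defs=[
        ModeDefinition(name="prod", resource_defs={"warehouse": prod_warehouse}),
        ModeDefinition(name="test", resource_defs={"warehouse": test_warehouse}),
    ]
)
def users_pipeline():
    load_users()


if __name__ == "__main__":
    # Run the same logical pipeline against the test resources.
    execute_pipeline(users_pipeline, mode="test")
```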
[00:51:20] Unknown:
And as you continue to build out Dagster and grow its ecosystem and grow the business around it, what is your overall vision for the technical and business and community aspects of the project?
[00:51:33] Unknown:
Yeah. That's a great question, and there are, I think, a few ways to answer it. On the community aspect, one of the things, and you alluded to it in our discussion of reusable solids and resources and all these other components, is that I think Dagster has a real opportunity to have a very powerful network and ecosystem effect. As an ecosystem, you end up with, say, a super reliable, testable way of interacting with this cloud provider or that piece of infrastructure, and that compounds. So I'm very excited to kick off the growth of that ecosystem and really start to accumulate it. In terms of the rest of the vision, I still think there are a ton of legs in having a system that's aware of both the assets and the computation.
And we're envisioning some higher level, more constrained programming models built on top of Dagster that focus on letting you declare this logical graph of computations in a more constrained fashion. As a result, you get things like incremental computation for free, because I really want these systems to move beyond effectively cron-based scheduling, which is where we're at right now. We're doing roadmap planning right now, and I'm really excited about the notion of being able to fire these pipelines more continuously and have them abide by incremental computation. I call it continuous data processing. I think that's a super interesting technical direction.
We also really want to go up the stack, meaning that as we get these base layers in place, like our event log as a system of record for metadata, I think there's an opportunity to go in any number of directions to provide a ton of value to the ecosystem. Those are just the things that come to mind.
[00:53:39] Unknown:
And are there any particular next steps on that path that you have planned out, things listeners should be looking for, be excited about, or that might draw them to dig deeper into Dagster?
[00:53:52] Unknown:
We've been relatively silent up until now. That's gonna change. I think you'll see a weekly cadence of really interesting technical content where we explain the unique things you can do with the system, or partner with other projects in the ecosystem to jointly communicate about how our systems integrate with each other. I think that's really exciting. We have a ton of stuff planned in terms of new features we're gonna build; the team is actually in the midst of its roadmap planning process right now. But I think you'll see us really digging deep into this asset awareness I mentioned before, having the workflow system, the orchestration system, be aware of what it's producing, which is incredibly valuable.
And I think you'll also see us focusing on streamlined deployment scenarios for common deployment targets. In the next month or two, you'll really start to see this asset awareness shine. We're producing some really novel UIs and workflows for backfilling, which is a perpetual source of extraordinary complexity, wasted computation, and errors, and it's really difficult. So we always have a lot of stuff cooking. But, yeah, you
[00:55:22] Unknown:
know, expect to hear more from us as time goes on. And are there any other aspects of the Dagster project or the data ecosystem that you're working with, or ways that listeners can help to contribute and grow the project, that we didn't discuss that you'd like to cover before we close out the show? The best thing
[00:55:42] Unknown:
that anyone can do to help out one of these projects is to use it and then provide us feedback. You know? We have a public Slack, and we're very responsive on it. We have people in there all the time providing useful feedback, both about the system and about the documentation. So that is honestly the most helpful thing. And then if you really wanna go nuts, start building integrations; community contributions are always welcome. Open source communities never cease to surprise me. We had this one gentleman from England just kinda come out of the blue and build a really nice integration with Databricks, for example.
That allowed us to have this amazing showcase of our system where you could write a logical PySpark computation and then, by only changing configuration, execute it on your laptop, on AWS's EMR, and then on the Databricks runtime. It was really the community contribution that made that demo sing. So, yeah, we love that. But the primary thing, just using the system and giving feedback, is really what's most helpful.
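The laptop/EMR/Databricks demo described here relies on that same mode-and-resource machinery: the solids express the logical computation, and the execution target comes entirely from which mode and run config you select. The sketch below is a simplified stand-in under that assumption, not the actual integration; the real demo used Dagster's Spark, EMR, and Databricks libraries, and the pyspark_cluster resource here is a hypothetical placeholder.

```python
from dagster import ModeDefinition, execute_pipeline, pipeline, resource, solid


@resource(config_schema={"master_url": str})
def pyspark_cluster(init_context):
    # Illustrative only: return whatever handle the solids need to submit Spark
    # work. A real setup would build a SparkSession or a step launcher here.
    return {"master_url": init_context.resource_config["master_url"]}


@solid(required_resource_keys={"cluster"})
def transform(context):
    # The logical computation never names a concrete cluster.
    context.log.info(
        "Submitting to {}".format(context.resources.cluster["master_url"])
    )


@pipeline(
    mode_defs=[
        ModeDefinition(name="local", resource_defs={"cluster": pyspark_cluster}),
        ModeDefinition(name="emr", resource_defs={"cluster": pyspark_cluster}),
    ]
)
def spark_pipeline():
    transform()


if __name__ == "__main__":
    # Switching targets is a matter of mode selection plus run config,
    # not code changes.
    execute_pipeline(
        spark_pipeline,
        mode="local",
        run_config={
            "resources": {"cluster": {"config": {"master_url": "local[*]"}}}
        },
    )
```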
[00:56:55] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. I'll also add a link to the open source repository where I'm building my workflows, for anybody who wants to take a look at an example. And so with that, I'll move us into the picks. This week, I'm going to choose the Caddy web server. I recently started using it and playing around with it. It's a sort of new generation of web server that can replace things like NGINX or Apache, with some nice characteristics in terms of deployability, automatic TLS management, and a pluggable module system for doing things like having an exec function that fires when you hit an endpoint, as well as being able to wrap that in authentication. So it's definitely worth poking at if you're looking for something to handle static or dynamic web content. And so with that, I'll pass it to you, Nick. Do you have any picks this week? Oh, boy. Put me on the spot here. I guess I already mentioned that Fluent Python book.
[00:57:54] Unknown:
One pick I'll offer is probably a tool that's familiar to listeners, but I like it from a tooling culture aspect: the tool Black in the Python ecosystem. All it is is an automatic code formatter, very similar to Prettier from the JS ecosystem. I'll pick it because I think this way of operating, avoiding formatting wars and code style wars and just solving it with tooling, is really important. I think it's much more common in the JS community than the Python community, and it makes for much healthier software projects.
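For anyone who hasn't used it, Black's effect is easy to show: it rewrites files in place to one deterministic style (run with something like `black my_module.py`). The snippet below is a small made-up example of the kind of rewrite it performs.

```python
# Before: hand-formatted code with inconsistent quoting and spacing.
def load(path,*, retries = 3):
    return {'path':path,"retries":retries}


# After running Black on the file, the same function comes out normalized:
# double quotes, spaces after commas and colons, no spaces around the
# keyword-only default.
def load(path, *, retries=3):
    return {"path": path, "retries": retries}
```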
[00:58:32] Unknown:
Thank you very much for taking the time today to join me and discuss the work that you've been doing with Dagster, and for all the effort that you and your team have put into it. I appreciate all of that. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Chapters
- Introduction and Sponsor Message
- Interview with Nick Schrock: Introduction to Dagster
- Dagster's Origin and Motivation
- Design Elements of Dagster
- Comparison with Other Workflow Orchestration Tools
- Challenges and Assumptions in Building Dagster
- Architecture and Evolution of Dagster
- Integration and Ecosystem of Dagster
- User Experience and Deployment
- Choosing Python for Dagster
- Interesting Use Cases of Dagster
- Lessons Learned and Challenges
- When Dagster is the Wrong Choice
- Future Vision for Dagster
- How Listeners Can Contribute
- Contact Information and Picks