Summary
Data applications are complex and continually evolving, often requiring collaboration across multiple teams. In order to keep everyone on the same page a high level abstraction is needed to facilitate a cross-cutting view of the data orchestration across integration, transformation, analytics, and machine learning. Dagster is an innovative new framework that leans on the power and flexibility of Python to provide an extensible interface to the complete lifecycle of data projects. In this episode Nick Schrock explains how he designed the Dagster project to allow for integration with the entire data ecosystem while providing an opinionated structure for connecting the different stages of computation. He also discusses how he is working to grow an open ecosystem around the Dagster project, and his thoughts on building a sustainable business on top of it without compromising the integrity of the community. This was a great conversation about playing the long game when building a business while providing a valuable utility to a complex problem domain.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- This portion of Python Podcast is brought to you by Datadog. Do you have an app in production that is slower than you like? Is its performance all over the place (sometimes fast, sometimes slow)? Do you know why? With Datadog, you will. You can troubleshoot your app’s performance with Datadog’s end-to-end tracing and in one click correlate those Python traces with related logs and metrics. Use their detailed flame graphs to identify bottlenecks and latency in that app of yours. Start tracking the performance of your apps with a free trial at pythonpodcast.com/datadog. If you sign up for a trial and install the agent, Datadog will send you a free t-shirt.
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host as usual is Tobias Macey and today I’m interviewing Nick Schrock about Dagster, an open source data orchestrator for powering data engineering, analytics, and machine learning
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what Dagster is and how it got started?
- What are the most common difficulties that organizations face when working with data projects?
- How does Dagster help in addressing those challenges?
- There are a number of workflow orchestration platforms, spanning a few generations of tooling. What do you see as the defining characteristics of the various options, and how does Dagster fit in that ecosystem?
- What are the assumptions that you made at the start of building Dagster and how have they been challenged, updated, or invalidated over the past year of working with end users?
- How are the internals of Dagster implemented?
- How has the design changed or evolved since you first began working on it?
- For someone who is building on top of Dagster, what is their workflow from first steps through to production?
- What are your guiding principles for designing the user-facing API?
- What are the available extension points for Dagster?
- What was your reason for implementing Dagster as a Python framework?
- With the benefit of hindsight, would you make the same decision today?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Dagster used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building Dagster and working to grow its ecosystem?
- When is Dagster the wrong choice?
- As you continue to build Dagster, what is your vision for it and its ecosystem?
- What are the next steps that you are taking to achieve that vision?
Keep In Touch
Picks
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Dagster
- Elementl
- IronPython
- Fluent Python
- GraphQL
- Maslow’s Hierarchy of Needs
- Hierarchy of Data Needs
- DAG == Directed Acyclic Graph
- Informatica
- Airflow
- Luigi
- Dagster Config Schema
- Dask
- gRPC
- MyPy
- Data Lineage
- Pandas
- Amundsen
- DataHub
- Gatsby.js
- Panama Papers
- Mode Analytics
- Papermill
- DBT
- Databricks
- Tobias’ Dagster Repository
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try out a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode today. That's l i n o d e, and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis.
For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your own home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today. Your host as usual is Tobias Macey. And today, I'm interviewing Nick Schrock about Dagster, an open source data orchestrator for powering data engineering, analytics, and machine learning workloads. So, Nick, can you start by introducing yourself? Yeah. My name is Nick Schrock, and I'm the founder of a company called Elementl. And our primary project is what you mentioned. Thanks. And do you remember how you first got introduced to Python? I actually don't remember kind of the first line of code.
[00:01:40] Unknown:
I believe in the late 2000s, I was actually figuring out how to script the .NET framework and there was a project called IronPython, and I think that's how I got introduced to it. But in reality, Dagster is kind of my first significant project in Python. It's where I really started being a real Python programmer.
[00:01:59] Unknown:
For somebody who is using this as their first entree into Python programming, what has your experience been as far as being able to ramp up and use it effectively, and some of the useful resources that you've been able to lean on to build the framework that other people will enjoy using?
[00:02:15] Unknown:
Yeah. So I think it's really important, when you're learning a new programming language, to find kind of a canonical book that can not just teach you the features, but the common idioms and the ways you should program, because best practices, you know, especially in kind of an older language, are just as important as knowing the features. I liked the book called Fluent Python, actually, which I felt really got me up to speed. Really, I think this is very common, but the challenge of Python is not the programming language itself. There are a couple quirks, but like it's very intuitive and whatnot. The challenge of Python is all the surrounding infrastructure, so managing dependencies, virtual environments, things of that nature, and especially as a framework author, dealing with the universe of Python dependency management is, you know, a total nightmare, and I think that's been really challenging.
I like to joke that, you know, the zen of Python is that there should be 1 and only 1 obvious way of doing things, but we're actually gonna have 4 competing package management systems, and we're gonna fork the language. So I think those have been the most challenging parts of dealing with Python.
[00:03:19] Unknown:
Yeah. Package management is definitely the bane of many Pythonistas and has been for a number of years. Although, I do feel like it's getting better, particularly with some of the forward looking packages like PyOxidizer for figuring out what the deployment experience looks like and simplifying the experience for end users of the package. So I'm excited to see where that goes. Anyway, bringing us back to the topic at hand, can you give a bit more of a description about what the Dagster project is and how it got started and your motivation for building it? Let's start off with the origin story. My motivation was that I was a Facebook engineer from 2009 to 2017,
[00:03:55] Unknown:
and kind of the thing I'm known for outside of the company is that I was 1 of the original co-creators of GraphQL, which is now a relatively popular open source technology. So I had this experience in developer tooling, and when I left Facebook, I was kind of searching around for what to do next and I kept on asking companies, both inside and outside the valley, what their biggest technical liability was, what they felt their rate limiting step was in their engineering processes. And this data engineering stuff kept on coming up continuously, as well as, like, ML infrastructure. People would say the same thing in slightly different ways, but it was this core problem. And then I started talking to real practitioners and seeing the tools they use, developer workflows, etcetera. And, you know, many aspects of the systems that they use are really technically interesting, but in terms of, like, developer ergonomics, workflow, productivity, etcetera, I was, like, relatively appalled, and I thought there was just a huge opportunity here. And I just get personally very frustrated and mad when I see really, really talented and smart people wasting their time or having an unnecessarily stressful experience doing their job, when that can be solved by software abstractions.
So I really got into this, and then when I dug in I really felt that data engineering and data science, in terms of kind of engineering best practices and hygiene, were similar to where front end was like 10 years ago. So like prior to GraphQL, React, etcetera, you know, front end engineering was a real wild wild west and there was kind of a software engineering deficit there. Like, you know, people didn't test things, there weren't really nicely structured frameworks, etcetera, etcetera. And I do feel this data domain is in a similar spot today. But it's in fact, like, societally more important, I think, to get ML and data analytics right than front end, because ML is supplanting or replacing or augmenting human decision making. Analytics drives incredibly important business and policy decisions. So it's really important to get this stuff right. And in terms of the difficulties that those organizations
[00:05:53] Unknown:
are running into and the conversations that you were having as you were deciding what to build next, what were some of the biggest challenges or most common issues that they were running into when trying to build and manage data projects?
[00:06:07] Unknown:
It's difficult to know where to start because the problems are so fundamental and on multiple dimensions. You know, if you kind of know Maslow's Hierarchy of Needs, where it starts, for a human, with like food and shelter and then builds up to self actualization. I feel like in the software equivalent of this, the data stuff is kind of still at layer 1, where the basics of, like, getting these things working, keeping them alive, having reasonable testing, it's in fairly dire straits in my view. And from a business perspective, you talk to any major decision maker and it's relatively clear, like, they can't manage and organize the data that they have. So they don't know what the data is, they don't know where it came from, and the failure rate of these projects is very high. It's very difficult to hire the right people to do them. So just at the very basics of getting things working in these data systems and kind of, like, keeping them alive and having productive developers, it's just incredibly challenging. And then you combine it with the fact that these systems are very complex and often involve multiple different kinds of roles interacting with each other in novel ways. You know, this, like, data siloing. Right? So you often have, like, data engineers and data scientists, for instance, attempting to collaborate together, but it's very siloed. But it's just a very subtly difficult domain of software.
And I just think this
[00:07:33] Unknown:
software engineering deficit is really the primary thing going on here. And in terms of trying to address those challenges, what are the design elements of Dagster that you had in mind that would help to simplify some of those problems and help to address the shortcomings of the existing ecosystem of data tools? Yeah. That's a good question.
[00:07:55] Unknown:
I think what it is is that I felt you needed a more opinionated software framework to structure these computations. It does a few things. 1 is that it forces you to structure your computations more in terms of functions rather than just tasks. So there's inputs and outputs. We have a type system on top of that that helps make these flows more self describing as well as more reliable. We also structured the software so there's natural seams for testability. I think another critical aspect is that you can actually view the graphs of computation prior to deployment, prior to execution, which makes them not just kind of this deployed artifact, but also a way for the different personas to interact with each other. So, like, the data scientists can view the graph of computations that the data engineers have created without having to, like, deploy it or execute it. And then just, like, higher quality tooling. I thought that, like, a lot of these systems, the workflow and orchestration systems that were out there, really didn't focus enough on the full end to end workflow. So I think it's important for this orchestration graph, as we call it, or DAGs more commonly, to be in the developer workflow, like, on your laptop, not as just a deployed artifact.
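To make that concrete, here is a minimal sketch of what structuring a computation as typed, functional solids looks like in the Dagster API of this era (circa 0.9); the solid and pipeline names are hypothetical, not taken from the episode:

```python
from dagster import pipeline, solid


@solid
def extract_users(context) -> list:
    # A solid is a function with declared inputs and outputs rather than
    # an opaque task; downstream solids consume what it returns.
    context.log.info("extracting users")
    return [{"name": "alice"}, {"name": "bob"}]


@solid
def count_users(context, users: list) -> int:
    # The annotations become typed edges in the graph, which Dagit can
    # render before anything is deployed or executed.
    context.log.info(f"counting {len(users)} users")
    return len(users)


@pipeline
def user_pipeline():
    # Composing the solids is what defines the DAG itself.
    count_users(extract_users())
```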
[00:09:14] Unknown:
To the point of the overall ecosystem, there have been different generations of projects that have tried to address this problem. They've all taken slightly different approaches, and they have slightly different ergonomics or focus for the end user. Wondering if you can just broadly characterize the existing ecosystem of workflow orchestration tools as it stands today
[00:09:43] Unknown:
and some of the defining characteristics of the differing options and then how Dagster fits within that overall ecosystem and what you see as being the differentiating elements of it? Yeah. That's a good question. I view it as, kind of, like there's 3 generations. There's very old things that, like, predate Airflow and Luigi, and often those are more, like, graphical or even, like, drag and drop tools, like Informatica or something. And then there's Airflow and Luigi, which I would call, like, generation 2. They were fairly exclusively workflow engines, meaning they were exclusively involved in ordering the computations. Like if you look at Airflow, the DAG is essentially a deployment only artifact.
It is a complete black box in terms of what computations are happening within each task. It is exclusively focused on just managing the order of computations, managing retries, and things of that nature. Now there's kind of a more next generation set of workflow engines and orchestrators, of which there are many, and I think there's a few trends among them. Some are very focused on vertically stacking and really coupling themselves to containerization platforms. So you're vertically integrating Kubernetes and really focusing on that aspect of it, which I think is an improvement in terms of deployment and operation, but by doing that, you often sacrifice kind of the developer workflow and local developer experience, which can actually still be very, very challenging. And then there's folks who are just kind of trying to do, I would say, like, a better job of exactly what Airflow did. I think our differentiator is that we think the orchestration graph, this DAG which interrelates all the computations, should be a very rich object.
The edges should be typed. It should embrace functional data engineering principles. It should be tracking what assets it produces so that you can link the computations and the assets together. And then it should also have a slightly opinionated programming model to facilitate testability and other aspects like that. So if you look at Dagster, we have the basic requirements of a workflow engine, i.e., we order computations, we do manage retries, we do that stuff, but we also have these higher order concepts: we have a type system, we have a configuration system, we have this notion of resources and some other abstractions, which I think are critical to structuring these systems properly.
[00:12:13] Unknown:
Yeah. I think the resources in particular are especially valuable because it makes it easy for 1 person to be able to define concretely how a particular set of computations needs to be able to interact with something like a database or the file system, or, for security purposes, being able to pull credentials from HashiCorp Vault, for instance, and then package that up and either distribute it within the context of a pipeline definition or as its own independent Python package that another user of Dagster can install and use just by passing in the necessary configuration objects. As somebody who has been working with Dagster, the exposure of the configuration schema to very clearly define what attributes are necessary to make a given resource function properly is very useful for having those strict contracts between the different components of the system, so that there isn't a lot of the confusion that can happen if you are facing a function that just says *args and **kwargs and wondering, okay, well, what is actually going to happen when I pass some of these arbitrary values into this function?
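As a loose illustration of that resource-plus-config-schema contract, here is a hypothetical sketch in the Dagster API of this period; the resource, solid, and bucket names are all invented rather than taken from the episode:

```python
from dagster import Field, ModeDefinition, execute_pipeline, pipeline, resource, solid


class ReportStore:
    """Tiny stand-in for a real storage client such as boto3."""

    def __init__(self, bucket):
        self.bucket = bucket

    def write(self, key, body):
        print(f"writing s3://{self.bucket}/{key}: {body}")


@resource(config_schema={"bucket": Field(str, description="Bucket that reports land in")})
def report_store(init_context):
    # The config schema is the explicit contract: whoever wires this resource
    # up knows exactly what must be provided, with no **kwargs guessing.
    return ReportStore(init_context.resource_config["bucket"])


@solid
def build_report(_context) -> str:
    return "all systems nominal"


@solid(required_resource_keys={"report_store"})
def publish_report(context, report_body: str):
    # The solid declares which resource it needs instead of reaching for globals.
    context.resources.report_store.write("daily.txt", report_body)


@pipeline(mode_defs=[ModeDefinition(resource_defs={"report_store": report_store})])
def reporting_pipeline():
    publish_report(build_report())


if __name__ == "__main__":
    # Launching a run means supplying exactly the declared config fields.
    execute_pipeline(
        reporting_pipeline,
        run_config={"resources": {"report_store": {"config": {"bucket": "reports"}}}},
    )
```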
[00:13:25] Unknown:
I couldn't have said it better myself, Tobias. Yeah. I
[00:13:28] Unknown:
might quote you on that 1. That was a lovely explanation. Thank you. As somebody who is relatively new to this overall problem domain as well as the language that you're working in and you're trying to solve a very large and fundamental problem. What are some of the assumptions that you made going into the start of building Dagster that have been challenged or updated or invalidated over the past year or 2 that you've been working on it and working with end users who are giving you feedback on the ways that they're trying to use the framework?
[00:14:01] Unknown:
The 2 things that kind of come to mind, and I think they're related, is that when I kind of first started building the prototype and the initial version of the system, it was very much targeted towards what I call the leaf developer. That's a generalized term for the data scientist, the data engineer, the people who are responsible for actually the business logic of the computation. And we were going to try to be, like, very infrastructure agnostic. So, like, effectively, it would only be, like, a software abstraction that you could execute over existing workflow engines. Right? So you could, like, execute on Airflow or Dask, which is a slightly lower level of abstraction. But really try to be only a software abstraction.
That was a little naive, I think. I think we found out 2 things. 1 is that we still don't believe in vertically integrating with infrastructure. Like, we support Kubernetes. We don't require it. You know? We can still execute on arbitrary computational substrates, but we have to have more opinions about the way that infrastructure works. And second of all, it's not just the leaf developers. I think what we found, you know, because, like, your resource example is perfect, is that a really big challenge in these systems is managing the relationships between the different personas or jobs in the system. And this is where we get into who this is targeted for.
Now the framework that we use when we are designing or modifying the system is often not just, like, the technical abstraction layers, but the abstraction layers between the different roles or jobs in the system, right? So this is what you were talking about with the resources stuff. Yeah, the way our resource system is designed is that in our head we're like, okay, there's, like, 1 team or 1 person who's responsible for kind of crafting these resources and providing an interface to the leaf developers. The resource abstraction is not just a technical API layer, but it's also, like, an organizational API layer. And even for someone like you, who is more of a full stack developer responsible for both infrastructure and the domain logic, you are doing 2 jobs at the same time. So it's actually very useful for you to have an abstraction layer so you can, like, organize the thoughts in your head a bit. I think we've realized that in order to address the problem we had to kind of expand the scope of the system a bit and then really deeply think about how the different roles in the system interact.
[00:16:35] Unknown:
Yeah. And that's a common theme that's come up in a lot of the conversations on my other show, the data engineering podcast, where the reason that solving for a particular problem in the data space is so complicated is not just because of the technical issues, but because there are so many different stakeholders throughout the entire life cycle of any project, that is not the case with just a web application or an alerting tool where you're primarily focused on solving for the needs of the developer who's building the application. And then the developer's job is figuring out what the end user's concerns are.
Whereas in the data domain, you have this complete loop of the data provider who is producing information using something like a click tracking system or pulling data from an application database. You've got your data engineer who needs to be able to pull that information into a particular location, maybe model it so that it fits particular schema for being able to be analyzed easily. You got your analysts or data scientists who are actually working with that processed data to be able to gain some sort of insights. You've got the business users who are using those insights to make decisions, and then they have additional questions that they need to ask. But in order for those questions to be answered, it needs to go all the way back to the source systems to say, I need you to track this additional information, or I need the data engineer to be able to pull that from the source systems and process it for my analysts. And so there are a lot more people who need to be able to work within the context of that overall framework rather than just an isolated data engineer or an isolated web developer.
[00:18:12] Unknown:
I should no longer be stunned, but I am continually stunned by the complexity of both the systems and the interactions between people in these systems. 1 of the things that we believe is you can't just, like, wish this complexity away. You have to embrace it and manage it. So I think that a lot of people who are just like, oh, we have this very simple solution, often kind of miss the mark, because the notion is that you can't make this stuff simple. What you wanna do is be able to compartmentalize and manage the complexity, so the right people are doing the right things
[00:18:42] Unknown:
rather than trying to have some sort of silver bullet here. That's also where a lot of the trends in things like microservices versus monoliths come up: what are the actual communication patterns of your organization, and the tendency for the communication patterns of your software to reflect the organizational hierarchies. And as you said, being able to compartmentalize those aspects and then having that within the framework of a tool chain that's intended to maintain the overall capabilities of the end to end workflow, it definitely means that you have to think very carefully about how you create those dividing lines and the handoff points to support those communication patterns while still being able to have the entire system interoperate as a single unit. Continuing on that, can you dig a bit more into how Dagster itself is architected and the ways that the design and implementation of the system has changed or evolved since you first began working on it? Yeah. So there's a lot of different
[00:19:42] Unknown:
ways that we can approach that problem. I'll just pick 1, I guess, in terms of, I think, an interesting evolution and how we ended up being structured. So at the beginning, 1 would simply write 1 of these DAGs, we call them a pipeline, which consists of nodes which we call solids, and then we have this local development tool as well as our production tool called Dagit, which is a web front end for this. And we would actually literally load that code in process into that same server process, and that would be kind of how you would, like, inspect the pipelines and execute them and whatnot. But as time has gone on, as we've kind of gotten more mature about thinking how infrastructure works and how team structure works, we actually did a huge rearchitecture this spring where we made it so that we rip out the user code and then have our web server and local development tool actually communicate with that user code over a gRPC layer. This is very interesting because this ends up allowing you, say in complicated deployments where you're serving multiple teams, to have the infrastructure team managing the process or container which contains the core infrastructure tools. Those are communicating over well structured gRPC interfaces to containers that contain the various user pipelines.
And so that means that all these things can be updated and deployed independently. You can keep the core infrastructure up 100% of the time without having to restart it as the users are updating their code, and it has a lot of other kind of positive aspects. So that's kind of like 1 infrastructure element of design where we've very clearly separated user code from system code, and then the teams' code is separated from each other. So you can have 1 set of pipelines that takes in 1 set of dependencies, another set of pipelines that takes in another set of dependencies or even a different Python version, speaking to my first gripe at the beginning of the show about how difficult it is to deal with the Python ecosystem on that front. So I think that's 1 interesting aspect of this. And then the other aspect is that we're very focused on layering the system, in our view, properly so that it becomes a true platform and not just a vertically integrated tool. So I talked about this, like, gRPC interface.
Right? We also have a GraphQL API over kind of our higher level web server, which means that we have our front end tool Dagit, but you can also build your own tools on top of that. And then as you go down the system, we also have other pluggable layers so that, for example, we can deploy and manage arbitrary infrastructure. We have, like, a prefab Kubernetes deployment, but we also have users plugging into those same abstractions that our Kubernetes extension uses to, like, execute this thing on their own custom PaaS or some, you know, kind of ECS on AWS or whatever custom infrastructure that people come up with. You know, we've really focused on layering the system to have, like, vertically stacked instances of the system that are easy to use, but make it possible to use across an entire universe of infrastructure.
[00:22:46] Unknown:
And for those extension points, what is your guiding principle or your overall heuristic for deciding how to create those interfaces and how to expose them to users in order to make them easy to implement, but also easy for you to maintain and achieve the necessary level of flexibility and expressivity of the overall system.
[00:23:10] Unknown:
Kind of my overall principle for API design is always to make simple things simple and complex things possible, which is really easy to say and hard to do, because there's always this tension in these systems where you desire to impose constraints on your user, and that constrained interface therefore allows the infrastructure provider to have a ton of flexibility in terms of how to implement it. But if you do it in such a way where it's over constrained and there's no escape hatches, then you've severely limited the different use cases for your product.
Being very deliberate about, like, what constraints you're imposing on your user, but then allowing maximal flexibility on the other dimensions, is something we think about a lot. You can see this, for example, in our type system, which is kind of this gradual, optional type system where we try to make very straightforward things easy, like passing around scalars, and we have common libraries for passing around data frames and common tools and whatnot. But the core type check is just an arbitrary function that a user can execute and can impose whatever constraints they want on the data that's flowing through their system.
And that allows the system to be able to adjust to this extremely heterogeneous complicated world that is the reality of the world that we're dealing with.
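As a hedged sketch of that idea, a Dagster type in the API of this era is essentially a name plus an arbitrary check function; the non-empty DataFrame rule here is just an invented example, not something prescribed by the project:

```python
import pandas as pd
from dagster import DagsterType, OutputDefinition, TypeCheck, solid


def non_empty_frame_check(_context, value):
    # The check is an arbitrary function, so it can impose whatever
    # constraints make sense for the data flowing through the graph.
    if not isinstance(value, pd.DataFrame):
        return TypeCheck(success=False, description="expected a pandas DataFrame")
    return TypeCheck(success=len(value) > 0, description=f"{len(value)} rows")


NonEmptyDataFrame = DagsterType(
    name="NonEmptyDataFrame",
    type_check_fn=non_empty_frame_check,
    description="A pandas DataFrame with at least 1 row",
)


@solid(output_defs=[OutputDefinition(NonEmptyDataFrame)])
def load_orders(_context):
    # The declared type shows up on the typed edge in Dagit, so the DAG
    # documents itself as you trace through the computation.
    return pd.DataFrame({"order_id": [1, 2, 3]})
```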
[00:24:36] Unknown:
And on the note of typing, because of the fact that you have your own means of expressing certain types that are native to the Dagster ecosystem, but then you also have the type information that is associated with the different variables and return values of functions. How are you working to maintain compatibility in both directions where somebody can use your type system for making sure that the outputs of computations are in the appropriate shapes and the appropriate objects that are being passed around for the overall workflow execution, but that the type information for something like mypy is also being propagated through the system to ensure that you are able to try to reduce the occurrence of bugs because of improper values being passed through the actual logic that's being written in Python within the context of the Dagster framework? Yeah. Dealing with MyPy in particular
[00:25:28] Unknown:
is interesting because, like, we love it when people use MyPy, but now you're kind of, like, dealing with 2 type systems at the same time. There's a bunch of behavior attached to Dagster types that goes beyond the Python type system. 1, it's more flexible by nature, in terms of it's defined by arbitrary computation, but we also, like, tie other behavior to types. Like, types project into our config space, so you can decide how to load them from the outside world via our config system. It also controls serialization behavior as we marshal data from 1 node of computation to another, and other kinds of behaviors like this. You know, what we try to do, if you've been very structured about typing your existing code base, is we make it very easy to create Dagster types that are effectively just like, listen, I've already typed this thing, so don't impose any additional type checks, just do, like, an isinstance check. There's, like, 1 dimension if you've done that, and then there's the other dimension where, like, the code base isn't typed at all and then you wanna kind of put the typing information in the Dagster type checks themselves. But the nice thing about the Dagster types is that they end up being exposed in our tooling. So you can, like, open up your DAG, and your DAG is typed. So it serves as, like, this really compelling substrate of documentation where you can kind of, like, trace through your computation to understand what's happening. That's another interesting element of the extensibility
[00:26:49] Unknown:
and integration capabilities of Dagster is the metadata that you're able to produce throughout the context of executing these computation graphs and the way that that hooks into the broader data ecosystem for things like lineage tracking or metadata management or auditability and governance of the data that you're processing. So I'm wondering if you could maybe talk to the ways that Dagster fits within that broader ecosystem and the efforts that you're making to build out an ecosystem around Dagster as a platform to make it easier for newcomers to the system to be able to adopt it and be effective in the shortest possible time. I think 1 of the interesting things about the data ecosystem
[00:27:34] Unknown:
is that there's integrations to do on multiple dimensions. Yeah. You need to integrate it with, like, a physical computational substrate, storage systems, data lineage and provenance systems, like you were talking about, the data tools themselves, you know, whether it be Spark, Pandas, a data warehouse, etcetera. And the centralized thing that needs to be aware of all these things is this orchestration graph. Because as the computation unfolds, that's the thing that is, like, storing everything and orchestrating the computation, which in turn, in my view, should be populating metadata management or whatnot. In terms of the topics you were talking about in your question, the data cataloguing, metadata management, etcetera, the primary way that we communicate and provide a substrate for metadata management is our event log.
So as these computations unfold, we actually have this structured event stream which says, like, hey, I started this step of computation. It has this input. This input has these properties. I've produced this asset, this, like, file in some S3 store somewhere. We call those asset materializations, and we end up building this immutable log of metadata about everything that's happened. That's the base of our operational tools. It's the basis of our tool that we call the asset manager. And I think the really powerful thing about that immutable log is that it ties all these artifacts, whether it be an asset materialization or, like, a data quality test, we call those expectations, it ties it to a computation in, like, a very verifiable way. So you finally have a place where you can go and look up the name of an asset, which is totally user defined. Let's say you're just, like, naming files in your S3 data lake, and you can look that up with, like, a fast type ahead and see, like, oh, this was touched yesterday by this pipeline and 2 days ago by this pipeline. Then you go to another file and you see, like, it hasn't been touched for 8 days, and the last successful run was from this pipeline, so I probably have to go talk to the person who runs that. So this provides a base layer which links computation to data.
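To illustrate what 1 of those structured events might look like in code, here is a hypothetical solid (not from the episode) that yields an asset materialization into that event log; the asset key, path, and metadata fields are invented:

```python
from dagster import AssetMaterialization, EventMetadataEntry, Output, solid


@solid
def publish_daily_summary(context, rows: list):
    path = "s3://example-bucket/summaries/daily.csv"  # hypothetical location
    # ... write `rows` to that path with whatever storage client is in use ...

    # Record in the immutable event log that this run produced this asset,
    # along with metadata that downstream tooling can surface later.
    yield AssetMaterialization(
        asset_key="daily_summary",
        description="Daily summary written to object storage",
        metadata_entries=[
            EventMetadataEntry.int(len(rows), "row count"),
            EventMetadataEntry.path(path, "path"),
        ],
    )
    # Because this solid is a generator, its output is yielded explicitly too.
    yield Output(path)
```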
I think that it will provide a very interesting kind of base substrate of metadata and data, because we're not going to end up building, like, a full catalog in any sort of near term time frame. We're not gonna build, like, a competitor to Amundsen or DataHub or any number of the proprietary tools. But what I think those tools can do is consume this event stream to build this kind of base layer of metadata and then link it to all the computations. And that's the basis of really interesting data lineage products and data cataloging and so on and so forth. The other element of building the ecosystem
[00:30:23] Unknown:
is making it discoverable for people to see examples of code that people have written using Dagster or libraries that people have written for being able to provide things like default resources for people who might be interacting with AWS or particular databases, as well as the solids that I know are composable that can be potentially used within multiple contexts because of the pluggability of the configuration schema and this functional orientation of the overall flow. So I'm curious what your thoughts are on fostering that type of ecosystem and growth of the community.
[00:31:03] Unknown:
That's a fantastic question because that's really something we need to focus on. We really haven't been focused on community growth over the last year because we've been explicitly focused on a limited set of design partners to mature the system. And now we're kind of transitioning to more of a growth phase. We've really seen our kind of early users get a ton of value out of the system, and we're really confident it's more generally applicable. But now this leads to your question, which is kind of community growth, both in terms of user growth as well as, like, ecosystem management. And this is something I think we really need to work on, because right now the way we're structured is we have a monorepo with all the kind of integrations that we manage as well as the core system. That makes it a lot easier for us to manage, because if we do, like, a cross cutting change that touches a few integrations simultaneously, we can push those changes in an atomic fashion, and we end up, like, pushing up, I don't know, like, 20 packages every time we do a release.
But that is not gonna work forever. I think, you know, the eventual vision is to have a broader ecosystem of plugins that can live in independent GitHub repos. I actually look up to the Gatsby project in terms of the way they manage this, because they do an interesting thing where if you, like, annotate a GitHub repo in a certain way, I forget if it's tags or something else, then they actually crawl those things and build an index of community available plugins. And I think we're gonna have to do something like that in order to manage the inevitable: in the terminal success case, there will probably be hundreds if not thousands of libraries to integrate with all the various tools that exist out there. You know, you can't have that in 1 GitHub repo. I think it's a super insightful question, and I'm excited to get this built out. Yeah. And 1 of the stepping stones that I've seen work at least reasonably well sometimes is curating a sort of awesome Dagster GitHub repository where it's just a list of
[00:32:57] Unknown:
libraries or open source repositories of examples that people have built as a way to let people discover some of that versus having to maybe crawl through the backlog of the Slack history or look through the mailing list or GitHub issues to try and see who are the people who are actually using this or the dependents graph in GitHub that they've introduced.
[00:33:17] Unknown:
That's a fantastic idea, Tobias. I might put this on the to-do list for the team.
[00:33:23] Unknown:
Going back to the end user perspective of somebody who is building a data workflow and they're using Dagster, what is the overall workflow of getting started with writing a set of solids or resources and then going through to getting it put into production and executing it and maintaining it and just some of the edge cases or challenges that they should be aware of in that process?
[00:33:46] Unknown:
It's an open source Python framework. So the first step is pip install dagster, and then you almost always wanna pip install dagit, our graphical tool, and then you start writing some code. I think this aspect of the system is something that works pretty well. You can go from kind of 0 to a hello world DAG and then have it running locally on your laptop and be able to play with the system very, very quickly. I think where we definitely need to do better work is on the deployment side of things. I think it's partially because we made an explicit decision to be execution substrate and storage substrate agnostic, which means that the system is a little more pluggable and complicated.
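As a rough sketch of that zero-to-hello-world path on a laptop (a hypothetical file, not code from the episode), the local loop looks roughly like this, with the Dagit UI pointed at the same file via something like `dagit -f hello_dagster.py`:

```python
# hello_dagster.py -- run directly with `python hello_dagster.py`,
# or load it into the local UI with `dagit -f hello_dagster.py`
from dagster import execute_pipeline, pipeline, solid


@solid
def get_name(_context) -> str:
    return "world"


@solid
def say_hello(context, name: str):
    context.log.info(f"Hello, {name}!")


@pipeline
def hello_pipeline():
    say_hello(get_name())


if __name__ == "__main__":
    execute_pipeline(hello_pipeline)
```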
So it all depends on what your deployment target is. Our probably best supported deployment target is Kubernetes. So if you have a Kubernetes cluster that you can target, we have a prefab Helm chart. So you kind of build up your local pipeline, get it executing, then you have to kind of plug this into a Helm chart and decide, like, exactly how you want to configure stuff, and then you can deploy it relatively easily. And then you also have to think about how you store the intermediates that are flowing between the different nodes of computation in the system. Again, if you use, like, S3, there's an out of the box integration for that, which is very little effort. But if you need to use something more complicated, you're gonna have to dig in and figure out how some of our internal interfaces work. Was there anything specific you were thinking of, Tobias? No. Just generally what the user experience looks like, from I'm interested in this tool to I've built something with this tool. And as you said, the deployment
[00:35:31] Unknown:
story, it depends. So the fact that it's easily deployable on Kubernetes is definitely a useful first step, particularly for people who are inclined to just use a managed Kubernetes platform. Or the fact that you can just deploy it all on a single box in EC2 is also a useful aspect, which is the approach that I'm taking personally for my deployment, at least to start with, largely trying to get an idea of what the developer experience looks like for building for Dagster and writing code in the Dagster framework. I think that 1 of the interesting aspects of the Dagster programming model is the functional composition and the solid as the core building block. And the fact that for each of those different solids, you can pass in a configuration schema to say very concretely, this is the information that you need to provide to this. This is what's optional.
These are the dependent inputs and outputs so that, as you said, you can construct that overall graph to see how everything flows through the system, but also being able to treat them as isolated functions for testability purposes. Maybe you wanna dig more into the testing aspect and the sort of validation and the quality control capabilities that DAGSTER provides.
[00:36:47] Unknown:
You know, the goal of 1 of these solids is for it to be relatively self describing. So there's a few dimensions of that. Right? There's its inputs and outputs. Right? So what's the input data that it's gonna need in order to successfully compute? And then what's kind of the output data that it would pass to the downstream solids so that they can instigate computation? It also self describes the configuration it needs, as you mentioned, right, which is less like the input data and more like knobs, you can think of them as control knobs, like changes of behavior and whatnot. And then lastly, they also self describe what resources they require.
Right? Do they require a connection to an S3, you know, system or a connection to a database? And the goal here is so you can look at 1 of these solids in isolation in our tooling and understand everything that needs to be true in the world in order for that computation to successfully complete. By doing so and embedding it in our artifacts like that, it provides 2 dimensions of value. Right? 1 is what I just described, which is, like, understandability. But then by also kind of allowing programmers to get into these seams, so to speak, and, like, articulate these seams, you also provide a testability aspect, because these data computations are very difficult to test. You have to be able to parameterize the data, but they often also depend on heavy external resources like a Spark cluster or data warehouse or S3 or something.
So if you want to test your code you actually have to have a layer of indirection between your business logic and the infrastructure concerns, and that's where that resource aspect comes into play. And then you also wanna be able to change the behavior of the solid based on configuration. And I really think this architectural decision is what distinguishes the framework from other similar frameworks. So just comparing to the trade offs that, like, Airflow makes, which is kind of the, I would say, the primary incumbent in the space, their nodes in their task graph, their DAGs are explicitly bound to a single environment.
Right? In fact, their documentation says that if you have to pass data between the tasks, you probably need to restructure your tasks. So they explicitly guide you to create functions which have no parameters and that are bound to a specific execution environment, which makes that DAG inherently not testable. We made the trade off, and we had to build more abstractions to do it. So it's like you have to learn more things in order to get this done. But our graph is a much more logical graph that's executable in multiple contexts, on arbitrary data, and you can execute arbitrary subsets of that graph, and those are really kind of the 3 legs of the stool of testability.
If any of those legs go away, it kind of all falls apart, right? Like, you might be able to parameterize data, but if you can't execute it in a different environment than production, it's still not testable. Likewise, if you can execute in different environments but you can't change the input data, it is likewise not testable. So we've really gone to a lot of effort to make these units of computation logical units. They are named units of business logic, and they're meant to be executable in multiple contexts, which makes them much more testable and reusable.
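As a hedged sketch of what that testability can look like in practice, here is a hypothetical solid executed in isolation with stand-in input data and a stand-in resource; the names and wiring reflect the test utilities of this era, not anything specific from the episode:

```python
from dagster import ModeDefinition, ResourceDefinition, execute_solid, solid


@solid(required_resource_keys={"warehouse"})
def count_new_users(context, users: list) -> int:
    # Business logic sits behind a resource seam, so production
    # infrastructure can be swapped for a stub in tests.
    already_known = context.resources.warehouse.known_user_count
    return len(users) - already_known


class FakeWarehouse:
    known_user_count = 1


def test_count_new_users():
    # Execute the solid in isolation: fake resource, parameterized input data.
    result = execute_solid(
        count_new_users,
        mode_def=ModeDefinition(
            resource_defs={"warehouse": ResourceDefinition.hardcoded_resource(FakeWarehouse())}
        ),
        input_values={"users": [{"name": "alice"}, {"name": "bob"}]},
    )
    assert result.success
    assert result.output_value() == 1
```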
[00:40:21] Unknown:
This episode of Podcast.__init__ is sponsored by Datadog, the premier monitoring solution for modern environments. Datadog's latest features help teams visualize granular application data for more effective troubleshooting and optimization. Datadog Continuous Profiler analyzes your production level code and collects different profile types, such as CPU, memory allocation, IO, and more, enabling you to search, analyze, and debug code level performance in real time. Correlate and pivot between profiles and traces, and Application Performance Monitoring live search lets you search across a real time stream of all ingested traces from your services.
For even more detail, filter individual traces by infrastructure, application, and custom tags. Datadog has a special offer for Podcast.__init__ listeners. Sign up for a free 14 day trial at pythonpodcast.com/datadog. Install the Datadog agent and receive 1 of Datadog's famously cozy t shirts for free. Digging more into the specifics of Dagster as a Python framework, what was your decision process for figuring out what language to implement this in, since it started off as a greenfield project where you were focused more on the overall problem space than on any particular language capabilities?
And with the benefit of hindsight and the past couple of years of work behind you, do you think you would make the same decision today, or are there any changes to the overall architecture or implementation strategy that you would go with?
[00:41:57] Unknown:
So, obviously, if I could go back in time, I would, you know, take everything I've learned and apply it then. But I certainly don't regret using Python at all. It is the lingua franca of data today. Everyone who deals in the data ecosystem at some level has to deal with Python. I think the power of Python, the reason why it's so popular in the data community, or 1 of the reasons, is that it's a language of what I call wide dynamic range. Right? You can build very real systems in Python. Like, it really does scale up conceptually. The Instagram back end is a Python monolith, for example.
But it also scales down very effectively in a conceptual way. Meaning, you can throw a relatively nontechnical user into a Jupyter notebook in that environment and they can be productive and do useful things. So this wide dynamic range of Python makes it a very powerful tool in these multi persona systems. So I certainly don't regret using it. That was the reason why we chose it a couple of years ago. It's kind of the obvious choice. You know, as time goes on and our infrastructure needs get more complex, I can imagine extracting out services in more complicated deployments, or even, like, commercial contexts where we'll have services written in other languages for performance reasons or technical reasons and so on and so forth. But for the APIs and the system that we present to our direct customer, I definitely have no regrets choosing Python and would do the same thing today.
[00:43:33] Unknown:
And in terms of the ways that you've seen Dagster used or things that you've built with it, what are some of the most interesting or innovative or unexpected things that you've seen built? Oh, this is 1 of my favorite subjects. So a couple things come to mind. We have 1 user who's actually
[00:43:50] Unknown:
I believe they are either in Israeli intelligence or adjacent to Israeli intelligence. So they use Dagster in an air gapped data center to do data analysis on graphs of payment data in order to do investigatory work on the Panama Papers network of money laundering, which I thought was amazing. That was, like, almost like filling out a Mad Lib of, like, interesting subjects to use for data processing. I also liked it because it spoke to kind of the flexibility. Like, they had all these, like, crazy custom infrastructure needs and were integrating with a lot of interesting tools. They actually posted about it, I believe, and blogged about it. Another really interesting use case comes to mind.
1 of our early design partners is Good Eggs, and they use the system for, like, their data processing. They do all sorts of interesting stuff. So they have to ingest Google Sheets that are manually inputted, and they use our type checks to ensure that those Google Sheets abide by their contract. If those fail, they actually email the person responsible for manually inputting it, as well as their manager, who's a nontechnical ops person. That ops person is actually able to talk to that person. They fix the spreadsheet, and that ops person can actually go into our tools and, like, restart the pipeline. So they've been able to build this self-service platform where someone who cannot code is able to kind of self-service 1 of these pipelines and completely take the data platform team out of the loop. So that was super cool. The other awesome thing that this team does is they actually build pipelines to manage the data system itself.
So they have a pipeline which actually goes out and talks to both their data warehouse as well as their Mode Analytics instance, and they detect which tables are no longer being accessed by reports, effectively which tables are no longer used and which reports are no longer used. And through this 1 pipeline, they actually do the computations which do that analysis, the meta analysis of the data pipelining system, and then they use Dagster. They do their actual production computation in a Jupyter notebook using our integration with a thing called Papermill, and then they actually automatically submit GitHub PRs which delete the tables or the Mode reports in question, and then they use our asset tracking system to track those GitHub PRs.
So you can actually go in and search for a specific GitHub PR, you know, if they're like, why was this produced? And you can see, like, oh, this pipeline produced it. Here's why it produced it. This provides, like, an amazing capability for them to, like, manage their own data platform with the data platform itself, which I thought was, like, incredibly novel. And we're actually gonna be partnering with them to produce a case study in the next few months, which I'm really excited about. I just was blown away by the novelty of that application. It was so cool.
[00:46:53] Unknown:
Interesting to see the problems that people will solve when you give them the tools that help them think about it in a particular way. Yeah. Exactly. And as somebody who has been building the technology and managing the team who is producing the framework and working with the community of end users, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:47:15] Unknown:
It's a big challenge to grow a user base, and, you know, I think we're still working on it. Like, I think someone could listen to this podcast and, like, critique me and be like, listen, you still don't have, like, the exact right messaging around this. But, like, it's hard translating what you've built into very precise messaging that can be quickly understood, especially with 1 of these systems where you don't really get it a 100% until you're in it. You know? I think it's 1 thing for us to be talking about the resource system, just as an example since you mentioned that. And someone listening might be kind of like, oh, that's somewhat interesting, I guess. I don't think you can really get its value and power until you use it. I think, like, it's that art form of knowing what's the right thing to communicate to the right people in order to get them interested in the system, so that once they're in it they kind of, like, end up getting all the other aspects of it that are interesting.
I don't know if that resonates with you, given the fact that I referred to exactly what you were talking about. Yeah. No. It's definitely
[00:48:22] Unknown:
easy to see something and say, oh, well, that's kinda interesting, but not realize what you personally might use it for until you see somebody who's doing something similar and solving a problem that you happen to have. And then you say, oh, I get it now. Now I'm gonna go and actually check this out and see what other things I can use it to build. You know, it's definitely a problem in the technical space of there are so many different choices for so many different problems and so many different areas of potential overlap that it's hard to really be able to focus down on. This is something that solves my problem in the way that I think about it versus that's something that is interesting but doesn't address my needs.
[00:48:58] Unknown:
We struggled with this with the GraphQL experience to some degree too. 1 illustrative example I always talk about is that in the GraphQL context, a lot of our messaging was around solving this problem of multiple round trips between client and server for mobile developers. And that was, like, a very concrete pitch: instead of, like, going back and forth from the client to the server multiple times, you could just do it once. And for a mobile developer, that made a ton of sense. It was an immediate value proposition that you could grab on to. And then only through that did they actually learn that the real value is this kind of new client-server programming model that GraphQL entails. Honestly, that dimension of value, that single round trip value, doesn't matter as much; like, most people are using GraphQL to build, like, internal apps that companies have, in which case, like, the round trips between the client and the server don't matter. And it's really actually this programming model. So I think, like, that's an interesting example of, like, what you communicate in terms of getting people interested and a clear value proposition versus, like, what the true value is once you're dealing with the system. For people who are taking a look at Dagster and they're considering using it to solve their problems, what are the cases where it's the wrong choice? Yeah. So the most obvious 1 that comes to mind is that if your computations are exclusively in SQL and exclusively on a single data warehouse, Dagster is not the right choice.
Something like dbt is, for example. The other thing is, if you don't care about the things that we care about, if you don't think testing your pipeline is important, and you don't think there's a software engineering deficit in these systems, and this all seems like a bunch of unnecessary work, then it's probably not the right choice for you, and that's totally fine. Where we've found it to be right is with people who want to do testing and want more well-structured systems, but don't really have a toolkit through which to do it. So it's that engineering ergonomics thing. And the really successful teams have been these data platform teams who need to serve multiple constituencies and want a structured way of plugging their data science team, their analysts, and their data engineers into one cohesive system where you can actually logically interlink all the work that they're doing.
But that's not everyone.
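To make the testing and resource point concrete, here is a minimal sketch of the pattern being described, using roughly the solid/pipeline/mode/resource APIs Dagster exposed around the time of this conversation. The warehouse resource, its class, and the pipeline are all made up for illustration; the idea is only that the solid depends on an abstract resource, so the same graph can run against a fake in a unit test and a real system in production.

```python
from dagster import ModeDefinition, execute_pipeline, pipeline, resource, solid


class InMemoryWarehouse:
    """Stand-in warehouse used for tests; it just records writes in memory."""

    def __init__(self):
        self.tables = {}

    def write(self, table, rows):
        self.tables.setdefault(table, []).extend(rows)


@resource
def test_warehouse(_init_context):
    return InMemoryWarehouse()


@resource
def prod_warehouse(_init_context):
    # In a real project this would construct a client for the production
    # warehouse; it is left as the in-memory stand-in here so the sketch runs.
    return InMemoryWarehouse()


@solid(required_resource_keys={"warehouse"})
def load_users(context):
    # The solid only talks to the abstract "warehouse" resource, never to a
    # concrete system, which is what makes it testable.
    context.resources.warehouse.write("users", [{"id": 1, "name": "ada"}])


@pipeline(
    mode_defs=[
        ModeDefinition(name="prod", resource_defs={"warehouse": prod_warehouse}),
        ModeDefinition(name="test", resource_defs={"warehouse": test_warehouse}),
    ]
)
def users_pipeline():
    load_users()


if __name__ == "__main__":
    # Run the same logical pipeline against the test resources.
    execute_pipeline(users_pipeline, mode="test")
```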
[00:51:20] Unknown:
And as you continue to build out Dagster and grow its ecosystem and grow the business around it, what is your overall vision for the technical and business and community aspects of the project?
[00:51:33] Unknown:
Yeah. That's a great question, and there are, I think, a few ways to answer it. On the community aspect, one of the things, and you alluded to it in our discussion of reusable solids and resources and all these other components, is that I think Dagster has a real opportunity to have a very powerful network and ecosystem effect. As an ecosystem, you end up with, say, a super reliable, testable way of interacting with this cloud provider or that piece of infrastructure, and that compounds. So I'm very excited to kick off the growth of that ecosystem and really start to accumulate it. In terms of the rest of the vision, I still think there are a ton of legs in having a system that's aware of both the assets and the computation.
And we're envisioning some higher level, more constrained programming models built on top of Dagster that focus on letting you declare this logical graph of computations in a more constrained fashion. As a result, you get things like incremental computation for free, because I really want these systems to move beyond effectively cron-based scheduling, which is where we're at right now. We're doing roadmap planning right now, and I'm really excited about the notion of being able to fire these pipelines more continuously and have them abide by incremental computation. I call it continuous data processing. I think that's a super interesting technical direction.
We also really want to go up the stack, meaning that as we get these base layers in place, like our event log as a system of record for metadata, I think there's an opportunity to go in any number of directions to provide a ton of value to the ecosystem. Those are just the things that come to mind.
[00:53:39] Unknown:
And are there any particular next steps on that path that you have planned out, things listeners should be looking for, be excited about, or that might draw them to dig deeper into Dagster?
[00:53:52] Unknown:
We've been relatively silent up until now. That's gonna change. I think you'll see a weekly cadence of really interesting technical content where we explain the unique things you can do with the system, or partner with other projects in the ecosystem to jointly communicate about how our systems integrate with each other. I think that's really exciting. We have a ton of stuff planned in terms of new features we're gonna build; the team is actually in the midst of its roadmap planning process right now. But I think you'll see us really digging deep into this asset awareness I mentioned before, having the workflow system, the orchestration system, be aware of what it's producing, which is incredibly valuable.
And I think you'll also see us focusing on streamlined deployment scenarios for common deployment targets. In the next month or two, you'll really start to see this asset awareness shine. We're producing some really novel UIs and workflows for backfilling, which is a perpetual source of extraordinary complexity, wasted computation, and errors, and it's really difficult. So we always have a lot of stuff cooking. But, yeah, you
[00:55:22] Unknown:
know, expect to hear more from us as time goes on. And are there any other aspects of the Dagster project or the data ecosystem that you're working with, or ways that listeners can help to contribute and grow the project, that we didn't discuss that you'd like to cover before we close out the show? The best thing
[00:55:42] Unknown:
that anyone can do to help out one of these projects is to use it and then provide us feedback. You know? We have a public Slack, and we're very responsive on it. We have people in there all the time providing useful feedback, both about the system and about the documentation. So that is honestly the most helpful thing. And then if you really wanna go nuts, start building integrations; community contributions are always welcome. Open source communities never cease to surprise me. We had this one gentleman from England just kinda come out of the blue and build a really nice integration with Databricks, for example.
That allowed us to have this amazing showcase of our system where you could write a logical PySpark computation and then, by only changing configuration, execute it on your laptop, on AWS's EMR, and then on the Databricks runtime. It was really the community contribution that made that demo sing. So, yeah, we love that. But the primary thing, just using the system and giving feedback, is really what's most helpful.
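The laptop/EMR/Databricks demo described here relies on that same mode-and-resource machinery: the solids express the logical computation, and the execution target comes entirely from which mode and run config you select. The sketch below is a simplified stand-in under that assumption, not the actual integration; the real demo used Dagster's Spark, EMR, and Databricks libraries, and the pyspark_cluster resource here is a hypothetical placeholder.

```python
from dagster import ModeDefinition, execute_pipeline, pipeline, resource, solid


@resource(config_schema={"master_url": str})
def pyspark_cluster(init_context):
    # Illustrative only: return whatever handle the solids need to submit Spark
    # work. A real setup would build a SparkSession or a step launcher here.
    return {"master_url": init_context.resource_config["master_url"]}


@solid(required_resource_keys={"cluster"})
def transform(context):
    # The logical computation never names a concrete cluster.
    context.log.info(
        "Submitting to {}".format(context.resources.cluster["master_url"])
    )


@pipeline(
    mode_defs=[
        ModeDefinition(name="local", resource_defs={"cluster": pyspark_cluster}),
        ModeDefinition(name="emr", resource_defs={"cluster": pyspark_cluster}),
    ]
)
def spark_pipeline():
    transform()


if __name__ == "__main__":
    # Switching targets is a matter of mode selection plus run config,
    # not code changes.
    execute_pipeline(
        spark_pipeline,
        mode="local",
        run_config={
            "resources": {"cluster": {"config": {"master_url": "local[*]"}}}
        },
    )
```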
[00:56:55] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. I'll also add a link to the open source repository where I'm building my workflows, for anybody who wants to take a look at an example. And so with that, I'll move us into the picks. This week, I'm going to choose the Caddy web server. I recently started using it and playing around with it. It's a sort of new generation of web server that can replace things like NGINX or Apache, with some nice characteristics in terms of deployability, automatic TLS management, and a pluggable module system for doing things like having an exec function that fires when you hit an endpoint, as well as being able to wrap that in authentication. So it's definitely worth poking at if you're looking for something to handle static or dynamic web content. And so with that, I'll pass it to you, Nick. Do you have any picks this week? Oh, boy. Put me on the spot here. I guess I already mentioned that Fluent Python book.
[00:57:54] Unknown:
One pick I'll offer is probably a tool that's familiar to listeners, but I like it from a tooling culture aspect: the tool Black in the Python ecosystem. All it is is an automatic code formatter, very similar to Prettier from the JS ecosystem. I'll pick it because I think this way of operating, avoiding formatting wars and code style wars and just solving it with tooling, is really important. I think it's much more common in the JS community than the Python community, and it makes for much healthier software projects.
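For anyone who hasn't used it, Black's effect is easy to show: it rewrites files in place to one deterministic style (run with something like `black my_module.py`). The snippet below is a small made-up example of the kind of rewrite it performs.

```python
# Before: hand-formatted code with inconsistent quoting and spacing.
def load(path,*, retries = 3):
    return {'path':path,"retries":retries}


# After running Black on the file, the same function comes out normalized:
# double quotes, spaces after commas and colons, no spaces around the
# keyword-only default.
def load(path, *, retries=3):
    return {"path": path, "retries": retries}
```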
[00:58:32] Unknown:
Thank you very much for taking the time today to join me and discuss the work that you've been doing with Dagster, and for all the effort that you and your team have put into it. I appreciate all of that. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Chapters
- Introduction and Sponsor Message
- Interview with Nick Schrock: Introduction to Dagster
- Dagster's Origin and Motivation
- Design Elements of Dagster
- Comparison with Other Workflow Orchestration Tools
- Challenges and Assumptions in Building Dagster
- Architecture and Evolution of Dagster
- Integration and Ecosystem of Dagster
- User Experience and Deployment
- Choosing Python for Dagster
- Interesting Use Cases of Dagster
- Lessons Learned and Challenges
- When Dagster is the Wrong Choice
- Future Vision for Dagster
- How Listeners Can Contribute
- Contact Information and Picks