Visit our site to listen to past episodes, support the show, join our community, and sign up for our mailing list.
Summary
Are you struggling with trying to manage a series of related, interdependent batch jobs? Then you should check out Airflow. In this episode we spoke with the project’s creator Maxime Beauchemin about what inspired him to create it, how it works, and why you might want to use it. Airflow is a data pipeline management tool that will simplify how you build, deploy, and monitor your complex data processing tasks so that you can focus on getting the insights you need from your data.
Brief Introduction
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- Subscribe on iTunes, Stitcher, TuneIn or RSS
- Follow us on Twitter or Google+
- Give us feedback! Leave a review on iTunes, Tweet to us, send us an email or leave us a message on Google+
- Join our community! Visit discourse.pythonpodcast.com for your opportunity to find out about upcoming guests, suggest questions, and propose show ideas.
- I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. For details on how to support the show you can visit our site at pythonpodcast.com
- Linode is sponsoring us this week. Check them out at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for your next project
- I would also like to thank Hired, a job marketplace for developers and designers, for sponsoring this episode of Podcast.__init__. Use the link hired.com/podcastinit to double your signing bonus.
- Your hosts as usual are Tobias Macey and Chris Patti
- Today we are interviewing Maxime Beauchemin about his work on the Airflow project.
Interview with Maxime Beauchemin
- Introductions
- How did you get introduced to Python? – Chris
- What is Airflow and what are some of the kinds of problems it can be used to solve? – Chris
- What are some of the biggest challenges that you have seen when implementing a data pipeline with a workflow engine? – Tobias
- What are some of the signs that a workflow engine is needed? – Tobias
- Can you share some of the design and architecture of Airflow and how you arrived at those decisions? – Tobias
- How does Airflow compare to other workflow management solutions, and why did you choose to write your own? – Chris
- One of the features of Airflow that is emphasized in the documentation is the ability to dynamically generate pipelines. Can you describe how that works and why it is useful? – Tobias
- For anyone who wants to get started with using Airflow, what are the infrastructure requirements? – Tobias
- Airflow, like a number of the other tools in the space, supports interoperability with Hadoop and its ecosystem. Can you elaborate on why JVM technologies have become so prevalent in the big data space and how Python fits into that overall problem domain? – Tobias
- Airflow comes with a web UI for visualizing workflows, as do a few of the other Python workflow engines. Why is that an important feature for this kind of tool and what are some of the tasks and use cases that are supported in the Airflow web portal? – Tobias
- One problem with data management is tracking the provenance of data as it is manipulated and shuttled between different systems. Does Airflow have any support for maintaining that kind of information and if not do you have recommendations for how practitioners can approach the issue? – Tobias
- What other kinds of metadata can Airflow track as it executes tasks and what are some of the interesting uses you have seen or created for that information? – Tobias
- With all the other languages competing for mindshare, what made you choose Python when you built Airflow? – Chris
- I notice that Airflow supports Kerberos. It’s an incredibly capable security model but that comes at a high price in terms of complexity. What were the challenges and was it worth the additional implementation effort? – Chris
- When does the data pipeline/workflow management paradigm break down and what other approaches or tools can be used in those cases? – Tobias
- So, you wrote another tool recently called Panoramix. Can you describe what it is and maybe explain how it fits in the data management domain in relation to Airflow? – Tobias
Keep In Touch
Picks
- Tobias
- Empire of the East by Fred Saberhagen
- The Book of Swords by Fred Saberhagen
- Chris
- Maxime
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. You can subscribe to our show on iTunes, Stitcher, TuneIn Radio, or add our RSS feed to your podcatcher of choice. You can also follow us on Twitter or Google Plus, and please give us feedback. You can leave us a review on iTunes to help other people find the show, send us a tweet or an email, leave us a message on Google Plus or on our show notes, and please join our community. You can visit discourse.pythonpodcast.com for your opportunity to find out about upcoming guests, suggest questions, and propose show ideas.
I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. For details on how to support the show, you can visit our site at pythonpodcast.com. Linode is sponsoring us this week. Check them out at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for your next project. I would also like to thank Hired, a job marketplace for developers and designers, for sponsoring this episode of Podcast.__init__. Use the link hired.com/podcastinit to double your signing bonus.
Your hosts, as usual, are Tobias Macey and Chris Patti. And today, we are interviewing Maxime Beauchemin about his work on the Airflow project. Max, could you please introduce yourself?
[00:01:25] Unknown:
Yeah, sure. So my name is Maxime Beauchemin. I'm a data engineer at Airbnb, so I'm speaking today from San Francisco, from the Airbnb office. And today we're gonna talk about Airflow. Looking forward to, you know, sharing a little bit about the project.
[00:01:41] Unknown:
How did you get introduced to Python?
[00:01:44] Unknown:
So I think I first heard of Python when I was at Yahoo in about 2007. When I first joined Yahoo, there was a fair amount of Perl scripting there, so it was kind of the de facto scripting language. And I heard from coworkers that there was this greater, faster, more sane language called Python. So I picked up a book, and I pretty much grew a set of wings from that point. Right? I just rediscovered software. I rediscovered web development through Django at the time. I had done some web development in the early 2000s, so quite a while ago, and I remember back then thinking there must be something greater, some easier way to do things. When I discovered Django, I got super enthusiastic about just how fast and easy it was to build a website from that point. So I fell in love with Python and then just started using it for most of my personal projects and at work, whenever I could.
[00:02:48] Unknown:
Very cool. It's really funny to me how the xkcd flight meme has become so intimately ingrained into the dynamic language landscape. It's like, I'm not sure how, but the Python community should do something with that.
[00:03:04] Unknown:
We could do a crossover with Red Bull.
[00:03:07] Unknown:
Yeah. There you go. There you go.
[00:03:10] Unknown:
So just to take a little detour for a minute, you mentioned that you're a data engineer. And while I'm familiar with it, having done a fair bit of it myself, I'm wondering if you could just give our listeners a bit of a flavor of how you differentiate data engineering from data science.
[00:03:26] Unknown:
Right, sure. And maybe I'll start with where it started from. There used to be people called business intelligence engineers or data warehouse architects, and this world has changed quite a bit. The world of business intelligence is kind of being renamed into analytics, and people who used to be business intelligence engineers are nowadays being called data engineers, and the function has changed quite a bit too. Data engineers, when compared to data scientists, are definitely more on the engineering side. Really often, one of the ways you can describe the role is as the people who build the data pipelines and the data structures that hold the company's data. So maintaining the data warehouse, building and growing the data warehouse, and organizing and cleaning the company's data as an asset. Data scientists might be more on finding insight, or, I guess, analysts, data analysts, or data scientists might run machine learning models, and they might also try to squeeze insight out of data. And it really helps them if the data has been organized for them by data engineers.
[00:04:43] Unknown:
What is Airflow and what are some of the kinds of problems it can be used to solve?
[00:04:49] Unknown:
Right. So Airflow is an open source platform to programmatically author, schedule, and monitor data pipelines or workflows. When you think of large organizations, and not so large organizations like Airbnb, these organizations are making significant investments in data, and most companies now have an analytics or a data team. And there are often people in that data team who will write batch processes. Right? Really often they need to make sure that these batch processes run every day, and it doesn't take very long before you have quite a lot of data processes that need to run every day, or on a schedule, or every hour.
And really often, these data processes depend on data structures, or on data being there, or on other processes or other jobs. It doesn't take long before you end up with a very complex dependency graph, basically a symphony that needs to be orchestrated every day, where certain scripts and certain sensors need to run in a very defined order. So Airflow is really about keeping data workers sane while working with this complex network of jobs that need to run every day. I can describe quickly the kind of stuff we orchestrate, or the kind of data processes I'm talking about. The very classic use case is to build a data warehouse, where you take all your log data, you take all of your events and maybe your production databases, and you store it in a place like Hadoop, or in a persistent store where you organize this data. You might cleanse it in some way. You might patch in some holes. You might apply some business rules.
So that is kind of the classic art of data warehousing. And then there are all the machine learning models that need to run, all of the analytics that needs to be built, all the stats work, and things like A/B testing frameworks and search ranking. All these batch jobs need to run every day, so Airflow helps orchestrate all of that.
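As a rough illustration of what "programmatically author" means in practice, here is a minimal sketch of a DAG definition against a 1.x-era Airflow API. The DAG id, task ids, and commands are hypothetical, and import paths differ between Airflow versions.

```python
# Minimal sketch of a programmatically authored Airflow DAG (1.x-era API;
# import paths and defaults vary by version). Names and commands are
# purely illustrative.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# One DAG object per workflow; this one runs on a daily schedule.
dag = DAG(
    dag_id="example_warehouse_load",
    default_args=default_args,
    start_date=datetime(2016, 1, 1),
    schedule_interval="@daily",
)

# Each operator instance is a task in the dependency graph.
extract = BashOperator(task_id="extract_logs", bash_command="echo extract", dag=dag)
load = BashOperator(task_id="load_warehouse", bash_command="echo load", dag=dag)

# Dependencies are declared explicitly: load runs after extract.
load.set_upstream(extract)
```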
[00:07:06] Unknown:
That's really interesting. So it's funny. I mean, I've been working in the tech industry for quite a number of years, and I guess I didn't realize that there had been this progression from analytics and business intelligence to data warehousing to something like this. It almost seems like things have become more and more general over time. Right? Like, you had these kind of specialized software packages to do data warehousing, and it almost seems, based on what you're saying, like the datasets have become larger and the number and complexity of tasks that you need to actually enact on the data have grown such that the old sort of business analytics or business intelligence model doesn't even apply anymore. Is that fair?
[00:07:54] Unknown:
Completely. Yeah. So the field has changed a lot. And with the advent of Hadoop too, everyone has started storing all sorts of data that they probably would not have stored before. And I think about the solutions that we had before around ETL tools, the industry solutions that were in the landscape 5 or 10 years ago, that are still in the landscape, I guess, but that modern companies are evolving away from. So there are things like Informatica, IBM DataStage, things like Ab Initio, or SQL Server Analysis or Integration Services. All these packages were designed for kind of drag and drop usage, so that engineers, or potentially not real software engineers, but people that knew about data structures and data modeling and how to process data, would drag and drop their data flow in a UI. And that's the way that people used to think about that. And I think the problems we're solving nowadays are far too complex to be solved and maintained in a GUI. Right? We need to write code that can express dynamically how to generate these pipelines.
So that's one of the key differentiators with Airflow: we define data pipelines programmatically, which makes them much easier to collaborate on, maintain, and test. So that's one of the things that differentiates Airflow from earlier solutions.
[00:09:26] Unknown:
I can relate to some of that evolution as well, because when I first started in the tech industry, I had some exposure to SSIS, or SQL Server Integration Services or Studio. As you mentioned, it's just drag and drop. For each of the elements you can do some configuration, but there wasn't a lot of actual software design happening there. And it would run on some sort of periodic basis, usually with a cron job or the Windows scheduler. But now, with the volume and diversity of data sources and data destinations, and the manipulations that need to happen to that data, like you said, you just can't keep up. And also, things like SQL Server Integration Studio don't really have any capability for feeding your data into another processing engine, something like maybe Spark or Hadoop or another service where you're doing some sort of machine learning analysis on the data before it then gets shipped off somewhere else.
[00:10:20] Unknown:
You were limited by the system or the platform. If the platform is not very extensible, or only extensible in proprietary ways, it became really hard for people to break out from that or even share components. Right? So open source wins in the end, because we can build all sorts of operators in Airflow that we can share with other companies, and we can actually keep up with the pace of change, where other companies using packaged software are at the mercy of these vendors to keep up with the stack you might be using, and the stack is moving really fast nowadays. So it's hard to keep up. I feel like we're still in a phase of divergence in terms of stack. There's some consolidation in some areas, but it's still hard to keep up; just the database ecosystem, or the key value store ecosystem, is still going crazy.
[00:11:15] Unknown:
And along those lines, what are some of the biggest challenges that you've seen when implementing a data pipeline with a workflow engine? And, what are some of the signs that you actually need that workflow engine?
[00:11:26] Unknown:
Right. So I think the signs come through maintenance really often, where you're kind of unclear as to: did this job run, when did it start, when did it end, how do I get to my log file. When you start to lose clarity as to who's the owner of something and whether it ran, that's when you know you need a robust solution to handle your workflows. And that happens typically very early in the life cycle of a data team. You only need a few people creating workflows or individual processes. Say you have a team of three that creates one process a day; it doesn't take very long before you have a complex thing to manage and to monitor every day. So when you start, some of the symptoms might be things like data quality issues or data latency issues, and just a lack of clarity in the team as to what happened.
In terms of the challenges, I think it's always been a challenge to work with data at scale and to have unit tests. In the world of software engineering, there are tons of best practices in terms of unit tests; you can get some really good reports on coverage, for instance, and have some really good continuous integration. On the data processing side, or on ETL, or data engineering, it's a lot more challenging, because we cannot really afford to run an extra cluster at scale to test that our pipelines are running well. So we have solutions in Airflow that make it really easy to perform data quality checks as kind of breakpoints in the pipeline.
But it's always been a challenge, you know, change management on large data structures, large pipelines that might process for hours. So that's been a challenge, and we don't have a perfect solution for all of this at this point.
[00:13:31] Unknown:
Yeah. Fast feedback is definitely 1 of the critical pieces when developing effective software. And, yes, as soon as you start dealing with massive amounts of data, it's hard to actually maintain that rapid feedback cycle, and particularly when you need to chain together multiple different stages. So like you said, doing unit testing and probably doing some data sampling for verifying the actual flows, I imagine, would be 1 way around that. But when you're working at scale and you run into an error, it's hard to fix it and then verify that you fixed it appropriately because all problems become exacerbated at an appropriate scale.
[00:14:09] Unknown:
Exactly. Sampling is definitely a good solution: you can have an immutable dataset, make it go through the pipe, and make sure you get a specific set of results, and that works for some unit tests. But there are a lot of things that come only at scale, and that is still a challenge. One interesting thing is that Airflow is not necessarily very prescriptive about the way of doing that. It's up to every data team, and every pipeline or workflow, to define how you want to implement your tests, your methodology in terms of data quality checks, or your unit tests. But Airflow certainly allows you to parameterize your workflow and run it in test mode, and lets you define what your test mode or your load testing might look like. Right? So you're free to define that yourself using the framework.
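For reference, the 1.x-era CLI is one way to get that fast feedback on a single task or a small date range outside of a full production run; the dag and task ids below reuse the hypothetical names from the earlier sketch.

```bash
# Hedged sketch against the 1.x-era CLI: "airflow test" runs one task for one
# execution date in isolation, without recording state in the database.
airflow test example_warehouse_load load_warehouse 2016-01-01

# "airflow backfill" re-runs the scheduled tasks of a DAG over a date window.
airflow backfill example_warehouse_load -s 2016-01-01 -e 2016-01-07
```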
[00:15:05] Unknown:
And can you share some of the design and architecture of Airflow and how you arrived at those decisions?
[00:15:10] Unknown:
Right. So it's very Python centric, so that's good for a Python podcast. But yeah, I think a lot of the decisions around architecture were made thinking about: I've got this set of problems I'm trying to solve, what are some of the great things in the Python ecosystem that will help me solve them? There's definitely a challenge that's kind of specific to building a workflow engine, which is that it needs to be connected with pretty much all of the data systems, eventually all the data systems that are relevant in this day and age, which means there are a lot of external dependencies. So, early on, I found some ways to use the setuptools extras_require parameter to define sub-packages for Airflow. You can pip install airflow, which will install the core, but you can also say pip install airflow[hive], which will install all the Hive related libraries you might need. We've made some extensive use of that, and then some dynamic importing, so that we basically only expose the packages and modules for which you have the dependencies installed.
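A quick sketch of that extras mechanism; the exact extra names available depend on the Airflow version installed.

```bash
# Base package installs the core; bracketed extras pull in the optional
# dependencies for a given system (extra names vary by Airflow version).
pip install airflow                       # core only
pip install "airflow[hive]"               # core plus Hive-related libraries
pip install "airflow[celery,postgres]"    # multiple extras at once
```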
So that's been a challenge, but I think that's been a solved challenge. In terms of how we made some of the architectural decisions: I think in all cases, when you design software, you definitely have a bias based on what you've used before. Say the people who designed Flask, I'm sure they had some great experiences with Django and some shortcomings with Django, and that informed the decisions that they made in their architecture and design. And what came out of Flask is something more modular, probably because the person building it wanted something more modular.
From my experience: before Airbnb, I was at Facebook, where I used a tool called Dataswarm, and that was definitely a source of inspiration, on the good side as well as, you know, there were definitely some patterns from that solution that I did not want to repeat. And then I've used all the ETL tools over the years, the more vendor type solutions, and I'm very well aware of the solutions that are also in the Python and open source community. So you look at all these things, and as you move forward and you solve your problems, you try to take steps in the right direction.
[00:17:39] Unknown:
And am I correct in understanding that a good portion of the actual execution engine that powers Airflow is actually Celery under the covers?
[00:17:50] Unknown:
Right. So that has to do also with the architecture. Maybe I should talk a little bit more about the extensibility. For a workflow engine to be successful, it needs to be very extensible. And I think the reason why there are maybe not that many out there, though there's a growing number of workflow engines, is because often people will tightly couple it with their environment. So as we built Airflow at Airbnb, we could have had tight coupling with, say, something like Kafka for logging or something like ZooKeeper for managing state.
And we intentionally decided not to do that, and to write interfaces instead, so that people could run it and make use of it in their environment. So the Celery engine is one of the executors that exists for Airflow, and at Airbnb we're actually planning on using a different executor in the future. We're planning on writing a YARN executor and moving away from Celery, to have more support for containment. So maybe I'll talk a little bit more about executors. The executor part of Airflow is an interface that allows jobs to run remotely. The Airflow scheduler can schedule jobs to run on different pieces of software that can do that. So Celery is one of those.
Mesos, the Mesos executor, is another alternative at scale. Then we have a sequential executor, which is just a local sequential executor that runs in process; that's used for testing. And we have another one, a multi threaded local executor, that you can use on a local machine to, say, spin off 32 threads, or n threads, that will run Airflow jobs locally. So we do use and support Celery, but we have alternatives. Now we're writing a YARN executor, which will leverage some of our Hadoop infrastructure, so that if you have a YARN cluster, you'll be able to kind of rationalize your resources and use the YARN processes to run your Airflow jobs.
[00:20:05] Unknown:
Max, you mentioned that the YARN executor that you're writing has better support for containment, and you mentioned Mesos just a moment ago. Does containment in this context mean something like a Docker container?
[00:20:18] Unknown:
Completely. Yeah. So containment is really important when you run processes at scale and when you don't know what people are gonna use your workflow engine to run inside of it. It would be somewhat easy with the Celery executor for someone to write a job that will creep up to use all the memory on your system, or to spin up multiple threads and take over the CPU on a worker. So things like Docker, and Docker is built on top of Linux containers; I think Mesos is also kind of a distributed computing platform built on top of Linux containers as well. What Linux containers allow you to do is run a small, lightweight virtual environment that can limit resource usage. So based on this assumption, you can ensure more stability of the system.
Saying, for instance, I will only allow this process to use a CPU core, two gigs of RAM, and a certain amount of virtual disk space. And if it goes outside these boundaries, it will either starve or it might get killed externally. So it is kind of a vital feature in a workflow engine, or in any sort of distributed computing. And though I think we've had quite a run with Celery, if we wanna grow beyond that and sleep at night, it's a good thing to have containment.
[00:21:50] Unknown:
Absolutely. Would that allow you to essentially have the Docker containers that represent the jobs in your workflow... I guess what I'm wondering is, would you be able to do something like implement the equivalent of software contracts, where you say, this job requires a container that is listening on these ports and does the following things with the data, such that you could swap out the underpinnings if you needed to and still have your job work as designed?
[00:22:20] Unknown:
Right. So that's definitely a direction we're moving towards. I think someone from the community contributed one; I just merged a PR that is a Docker operator. Maybe a little bit of context on what an operator is. An operator in Airflow is a task factory. When you call, say, a Hive operator, it will return a Hive task, and it receives the parameters that are relevant to this operator, so in this case probably a Hive script and some parameters specific to Hive. There are kind of three types of operators. There are sensors, which are just waiting for an event to happen. That could be waiting for a file to land in HDFS, or a file to land in S3, or waiting for arbitrary things in your environment to arrive. Then there are the remote job operators that will tell an external system to run a job. That could be a MapReduce operator, or a Pig operator, or a Bash operator, or a Python operator that will just execute a script in the remote system.
And then we have all the data transfer operators, which will just take data from one system and move it to another system. So in the context of what an operator is, a Docker operator will fire up a Docker container of your choice and run a command within it. That's pretty empowering. Then we can have this contract of: you provide arbitrary containers, and we'll run them for you. I think we want a deeper integration with Docker in the future, where every task, so every operator, could be run within a Docker container on demand. That means on the base operator, when you call an operator, you would say in which container you want this task to run, and then we can fire it up within that Docker container.
So that does not exist today, but it's definitely on the roadmap, and it's coming up soon.
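As a sketch of how the sensor and remote-job operator flavors fit together in a DAG, here is a minimal example against a 1.x-era API. Import paths and operator parameters vary by version, and the bucket, key, and command below are hypothetical.

```python
# Sketch: a sensor gating a downstream task (1.x-era API; paths/names illustrative).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.sensors import S3KeySensor

dag = DAG(
    dag_id="example_sensor_flow",
    start_date=datetime(2016, 1, 1),
    schedule_interval="@daily",
)

# Sensor: poke until the upstream file lands in S3.
wait_for_events = S3KeySensor(
    task_id="wait_for_events",
    bucket_name="example-bucket",
    bucket_key="events/_SUCCESS",
    dag=dag,
)

# "Remote job" style operator (here just bash for illustration).
process_events = BashOperator(
    task_id="process_events",
    bash_command="echo processing {{ ds }}",
    dag=dag,
)

process_events.set_upstream(wait_for_events)
```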
[00:24:25] Unknown:
It's kind of amazing to watch how containers, which at first the old guard was kind of dismissing as, you know, this is not a new concept, why is this such a big deal? But it seems like Docker has really driven adoption and utilization in all kinds of interesting ways. And so you're seeing these containers enable all kinds of really interesting reuse and efficiencies that you can pull out of that architecture. It's really cool to see.
[00:24:56] Unknown:
Yeah, definitely. I think there was a bad separation of concerns that existed before, where maybe ops people would be in charge of defining the software to be deployed on machines, and then developers would be in charge of maintaining the code that would run on these machines. And that separation was kind of unhealthy, because it's hard to work together, and maybe the ops people were trying to fight the developers, saying, oh, we need to have supported packages that are validated in CentOS, you know, 2.0 or whatever, and some packages that have been written that are very stable but not up to date. And that was competing with developers that maybe wanted to use the bleeding edge features of the new library out there. So now I feel like it makes more sense for the software developer to package his own stuff and just send this package to an ops person, knowing it is contained.
And, you know, the contract is a lot more clear in that sense now, and it allows for better distributed computing in general.
[00:26:03] Unknown:
So how does Airflow compare to other workflow management solutions, and why did you choose to write your own?
[00:26:10] Unknown:
Right. So, coming out of Facebook: Facebook, we know, is an innovator in the data processing space, or just in the data analytics space. And Facebook has a great array of tools that empowers the people who use them to kind of naturally build some very dynamic processes. So coming out of there, I wanted to make sure I was gonna have a similar set of tools in some ways, where I could do things like analysis automation, analytics as a service, and build some great things with the right packages. Sometimes when you have the right tool for the job, it becomes kind of trivial, and it can change the way you think about this job. Right? If you've held a hammer, the world really changes once you've held, how do they call it, one of those automated hammers that work with air pressure. Mhmm. Oh. Once you've held one of those, you're like, oh, that really changes the way I think about, I don't know, redoing my roof or whatever.
So looking at the ecosystem and the different tools that were out there, coming out of a place like Facebook, you can't help but think, you know, I'm gonna need to build my own set of tools if I want to be able to accomplish things at the same level. Looking at what was out there at the time: we have at Airbnb people coming from LinkedIn that have used Azkaban, and we have people from Yahoo that have used Oozie, and I've used some of these tools and definitely reviewed them. And the people that had used these tools were saying, whatever we need to do, we can't use the one that I used before.
And, you know, I was coming from a place where people loved their workflow engine. So I was like, it's possible to build a workflow engine that people will actually love to use. So that was the intuition. And just looking at kind of where I was at, I felt like I had a good grasp on what was needed and that I could just build it. You know? Sometimes you just feel like, I can do this. I can build this thing on my own, or not necessarily on my own, but I know how to carry this project through.
[00:28:31] Unknown:
And moreover, it seems like, from what you're saying, Airflow also represents kind of a new level, a new kind of workflow management solution that really doesn't compare to other solutions out there. Right?
[00:28:44] Unknown:
Yeah. Luigi is probably the closest one. I don't know if you guys have looked into Luigi. It seems like a really interesting project. I looked into it, and when I did, there were a few things that I wanted built a slightly different way. And it's not necessarily criticism towards Luigi, which I think is a great tool that people are getting tons of value out of. But one thing that was really important to me was for the API to allow a really natural way of creating tasks dynamically. In Airflow, to create a task, you need to instantiate an object, where in Luigi, you need to derive a class.
Meaning that if you wanna write a for loop that creates a set of tasks in Luigi, you have to get into metaprogramming. In Airflow, you can just simply instantiate objects, put them in a list, associate them with your workflow object, and that just works. Looking at that, I was like, I want to make it very natural, and I think it's a game changer to allow people to author their workflows in a very dynamic fashion.
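Here is a minimal sketch of that dynamic authoring pattern, again against a 1.x-era API with hypothetical table names: tasks are plain object instantiations, so an ordinary for loop is enough to fan out a set of similar tasks.

```python
# Sketch: generating tasks in a loop instead of metaprogramming (1.x-era API).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="example_dynamic_tasks",
    start_date=datetime(2016, 1, 1),
    schedule_interval="@daily",
)

tables = ["users", "listings", "bookings"]  # hypothetical table names
load_tasks = []
for table in tables:
    task = BashOperator(
        task_id="load_{}".format(table),
        bash_command="echo loading {}".format(table),
        dag=dag,
    )
    load_tasks.append(task)

# A single downstream task can depend on the whole generated list.
finalize = BashOperator(task_id="finalize", bash_command="echo done", dag=dag)
for task in load_tasks:
    finalize.set_upstream(task)
```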
[00:29:57] Unknown:
Yeah. Luigi is definitely an interesting tool. When I was in the process of picking a workflow engine for managing some of our data pipelines, it was mostly a toss-up between Airflow and Luigi, trying to figure out which one would fit best with our requirements. They're both definitely very interesting projects with their own particular pros and cons. And from my understanding of it, anyway, it seems that Luigi gains a lot of popularity because of its simplicity in terms of the deployment requirements, because it doesn't really have any hard dependencies at an infrastructure level, and you can build your pipeline however you want. Whereas it seems that Airflow has a little bit more in terms of initial setup. So I don't know if you can elaborate a bit on what the deployment story looks like with Airflow.
[00:30:48] Unknown:
Right. So one thing that was really important for us was to make it really easy for people to go through the tutorial without having to install anything in particular. So to just be able to pip install Airflow and get a feel for it, get the UI, get the web server running, and go through the tutorial. I think we've achieved that very well. When you wanna go beyond the tutorial, very quickly you need a decent database, a MySQL instance or a Postgres instance, to store the state. So that's one thing that Airflow does that Luigi does in a different way. We store state for each task instance, and there's this notion of schedule that is kind of deeply ingrained in the Airflow fabric. It kind of assumes that most jobs or most workflows run on a schedule, say a daily schedule or an hourly schedule, where Luigi, I believe, defines whether a task needs to run or not based on whether the target is present or isn't, which changes things in the way you think about it in some ways. Sometimes when you look at whether you should build your own tool or use someone else's, the question is whether you're able to do the exercise of bending your mind to fit the maker's mind. Right? And when I looked into the different tools out there, the way I think about workflows and how to author them and schedule them was not fully compatible. I felt like if I tried to bend my mind into thinking like these other makers, something might crack. So I needed to build something that fits.
So quickly you need a database, because by default it ships with a SQLite database, which won't allow you to run tasks in parallel, because there are issues with SQLite and having multiple threads talk to the same file on your local disk. So quickly you need any of the SQLAlchemy supported database backends, so Postgres or MySQL. And the next step after that: you can run your local executor on a single machine until you essentially run out of resources on one box, or it starts to feel a little bit tight, at which point you probably wanna set up Celery or Mesos.
And Celery is very lightweight. It's easy to set up, but you do need a broker, so you might need a Redis instance or a RabbitMQ instance as the broker for Celery. For those who are not familiar with Celery, it's a mature Python async processing framework, which is great. It's super easy to use, and it makes it really easy to, say, execute a function remotely on an array of workers. People use it to perform all sorts of async tasks. I think people use it quite a bit in the web world to, say, process thumbnails or do all sorts of asynchronous workloads.
So to grow an Airflow cluster, at some point you need to install Redis or RabbitMQ. And then, you know, that comes with managing multiple servers. Probably if you work at a company, you already have a way to deploy, to spin up machines, either on AWS or however you do that, but you need some sort of infrastructure to manage your machines. And Airflow is kind of agnostic in those terms. Right? Airflow knows which tasks run on which machines, but you still need to monitor these boxes the same way you monitor any boxes on your network.
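For orientation, these are roughly the airflow.cfg knobs that deployment story touches on. Section and option names follow the 1.x-era configuration, and the connection strings are placeholders for your own hosts and credentials.

```ini
[core]
# Swap the default SQLite store for a real database via a SQLAlchemy URI.
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@db-host/airflow
# SequentialExecutor -> LocalExecutor -> CeleryExecutor as you scale out.
executor = CeleryExecutor

[celery]
# Celery needs a broker (Redis or RabbitMQ) and a result backend.
broker_url = redis://redis-host:6379/0
celery_result_backend = db+postgresql://airflow:airflow@db-host/airflow
```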
[00:34:40] Unknown:
And Airflow, like a number of other tools in the space, supports interoperability with Hadoop and its ecosystem, like you mentioned. I'm wondering if you can postulate on why the JVM technologies have become so prevalent in the big data space and how Python fits into that overall domain.
[00:34:56] Unknown:
Right. So, yeah, the JVM is great for a lot of reasons. There's some containment that comes with it; there are all sorts of best practices there. I'm not much of a JVM guy. And I think that those lines are also starting to blur now. Right? We see that we have faster just in time compilers for Python. And then, on the other side, there are more dynamic languages built on top of the JVM, like Scala, and there's Jython. I'm not sure how viable Jython is nowadays, but I was considering enabling it in some ways on the Airflow workers at some point, because the JVM is a great kind of unit of containment, or unit of work, with a whole set of guarantees.
And it seems like it's very prevalent in the Hadoop ecosystem, but we're seeing kind of the counter movement of more dynamic languages starting to be built on the JVM itself. So that's kind of interesting, to see where this is all moving.
[00:36:07] Unknown:
And Airflow comes with a web UI for visualizing the workflows as do a few of the other Python engines. I'm wondering why that's an important feature for this kind of tool, and what are some of the tasks and use cases that are supported in the Airflow web portal?
[00:36:21] Unknown:
Right. So it is vital to know what's going on. We're executing probably close to 10,000 tasks a day in Airflow at Airbnb, and there are multiple people iterating on these tasks, and there are all sorts of things going on in this big symphony that's taking place every day. If you're not able to see what's going on, you have no clarity, and you basically can't do the work of a data engineer if you don't have an easy way to know the state of things and get access to your logs. So, some of the things that the Airflow UI does, and it's probably one of the best UIs out there among workflow engines: the first thing is to visualize your graph of dependencies and to easily see the state. Airflow will show you a graph view where you can kind of zoom in and zoom out and hover over your tasks and get all sorts of tooltips that will tell you what the state of that task is, when it started, and when it ended.
If you wanna see any of your parameters or attributes for any of your objects or tasks, it's really easy to do so. There's a tree view that will show you the tree representation of your graph over time, and that brings a lot of clarity when you're trying to understand why a task is blocked or how long it has been blocked for. Airflow will also allow you to interact with the state of your tasks. So you can say, for instance, I need to rerun this task for a certain date range and everything downstream from it. Those are things that we do every day as data engineers. Right? There are false positives, false negatives, there are some errors, and we need to reprocess things. So the Airflow UI gives you a lot of clarity on what needs to be done, and it allows you to perform all these surgeries on your workflows.
There's also a lot of clarity around start time and end time. Things like Gantt charts, or charts that will show for each task at what time it lands every day. So time series of landing times, time series of task durations. You can see the source code for the workflow as well. All of this makes it a lot easier for people to be productive, to have clarity, and not be frustrated while authoring batch processes.
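The "rerun a task for a date range and everything downstream" surgery described above can also be driven from the 1.x-era CLI; a hedged sketch with hypothetical dag and task ids follows. Clearing task instances marks them for re-execution by the scheduler.

```bash
# Clear one task (and everything downstream of it) over a date window so the
# scheduler re-runs it. Flags per the 1.x-era CLI: -t task regex, -d downstream.
airflow clear example_warehouse_load \
    -t load_warehouse \
    -d \
    -s 2016-01-01 -e 2016-01-07
```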
[00:39:01] Unknown:
And one problem with data management is tracking the provenance of the data as it's manipulated and shuttled between different systems. I'm wondering if Airflow has any support for maintaining that kind of information, and if not, if you have recommendations for how practitioners can approach the issue.
[00:39:16] Unknown:
Right. So data lineage is what you're talking about, which is trying to understand, when you look at a piece of information, say in a dashboard or in a report, where it is coming from. If you have good data lineage, you should be able to answer the question of: this metric came from these tables, and see the whole track of the business rules that were applied on the way there. So Airflow does not have a graph of data lineage per se. We have a lot of clarity on the lineage of jobs and how jobs depend on other jobs, but there's no built in, direct way of visualizing your data objects and how they're graphed together, how they're related.
So, having clarity on your jobs, your job structure, and how they depend on each other; and if you have some sane rules around that, maybe most jobs are associated with an underlying table, and maybe there are naming conventions where, you know, prc_fact_sessions is the process that loads the table called fact_sessions, then it's pretty easy to infer the data lineage from the job lineage. That's the way we've been doing it. We're also interested in allowing for some code introspection, so Airflow could look into the SQL code and some of the underlying code to be able to infer the lineage, or to allow people to annotate their workflows to structure and clearly define the data lineage behind certain tasks. So that's on the roadmap. That's something we're committed to, and a problem we definitely wanna solve for ourselves and for the community.
[00:41:10] Unknown:
And are there any other kinds of metadata that Airflow can track as it executes tasks? And what are some of the interesting uses you've seen or created for that information?
[00:41:19] Unknown:
Right. So I think one key thing is that metadata is the key to success in the data processing world. Right? We do analytics, so having metadata and doing analytics on it is a way to stay on top while being a data engineer. Outside of the task instance metadata that we have in Airflow, internally at Airbnb we've written a job parser, so that after each Hive job we're able to gather some of the statistics around resource usage behind all of our Hive jobs. We're a Hive shop; we run a lot of our batch processes in Apache Hive.
And every time a Hive job finishes, we parse the logs to try to get as much information as we can around how many CPU cycles, how much IO, and how many mappers and reducers we've used, and try to gather stats around skew, say join skew or group-by skew. So that's some of the stuff we've built on top of Airflow that gathers underlying metadata. Beyond that, we do metadata processing jobs using the Airflow engine. We're able to call APIs on a schedule, gather information and metadata, and stamp it or snapshot it in the database using Airflow, and that's something we surely do.
[00:42:59] Unknown:
So with all the other languages competing for mindshare, what made you choose Python when you built Airflow?
[00:43:06] Unknown:
Right. So if English is the language of business, I'd say Python might be the language of data, or at least a language that a lot of people speak in the data world, especially for everything related to gluing things together. When I look at the tool that we were building, one thing that's really important is connectivity with all these other systems, because it's a workflow engine, and we need to be able to talk to all the different services. We need to be able to talk to any external database. And Python's got extremely good support, a very wide coverage, for interfacing with external tools.
So that was one key element. And I knew we were gonna need some async processing. I knew we were gonna need a good web framework to build a nice UI. And I wanted something that was high velocity, where I could write code quickly and get results quickly too. As a Python programmer, it's just my language of choice. I think in Python. When I write in other languages, I often think in Python first and translate it to that language, with a little bit of frustration in the process. So I think Python is a really good choice for a workflow engine. I think we can see that with other efforts out there too, like Luigi and Pinball.
And it seems like it's a great fit to build that kind of solution.
[00:44:36] Unknown:
Absolutely. Speaking of Python, and Python being the language of data, I love your quote, by the way. Have you or has anybody else written any integrations between Airflow and Jupyter or IPython Notebook?
[00:44:49] Unknown:
Right. So I remember, when I first started the project, I was actually iterating within IPython Notebook, or actually, well, Jupyter. And I wanted to make it easy for people to use the operators and run them interactively, using the API directly in Jupyter. Today, I think it's still possible to do that, but it's not the way that people necessarily do it. What is possible is to use the hooks. The hooks are the interface to talk to external systems in Airflow. And, you know, a lot of people at Airbnb have been using the hooks in an IPython notebook context.
That works pretty well. It's not necessarily a first class citizen, but I think it should be, and there are a lot of things that could be leveraged directly from there.
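A small sketch of that hooks-in-a-notebook usage, assuming a 1.x-era API: hooks wrap connections defined in Airflow's metadata database, so you can poke at the same systems interactively. The connection id and query here are hypothetical, and hook import paths differ between versions.

```python
# Sketch: using an Airflow hook interactively (e.g. in a Jupyter notebook).
from airflow.hooks.postgres_hook import PostgresHook

# "analytics_db" is a hypothetical connection id configured in Airflow.
hook = PostgresHook(postgres_conn_id="analytics_db")

# DbApiHook-style hooks expose helpers like get_records / get_pandas_df.
rows = hook.get_records("SELECT count(*) FROM events WHERE ds = '2016-01-01'")
df = hook.get_pandas_df("SELECT * FROM events LIMIT 100")
```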
[00:45:43] Unknown:
Very cool. So I noticed that Airflow supports Kerberos. It's an incredibly capable security model, but that comes at a high price in terms of complexity. What were the challenges, and was it worth the additional implementation effort?
[00:45:58] Unknown:
That's an interesting one. So this one is community contributed. The mindshare there was to try to review it while not necessarily knowing what it was all about. I believe that might be under the contrib folder of the project. So it is not fully supported by Airbnb, but it is community contributed, and we definitely wanna grow in that direction. This year we're having kind of a security effort internally, and we'll most likely be using Kerberos at Airbnb on top of some of our Hadoop services.
So it's really cool to see that the community contributed something to the project that we actually need, and that will become really handy once we get to that bridge. We'll be able to cross that bridge without having to build it. So that's one of the successes of open source, where the contributors have built things that we actually ended up needing.
[00:46:58] Unknown:
Absolutely. It's kind of amazing to me, when you think about just how old Kerberos is, how successful and pervasive it has been. I think it's a testament to the good design and really serious thought that went into it that it's not only still around, but really still thriving. And, you know, companies like yours are considering adopting it now, in 2016. It's a pretty incredible piece of technology.
[00:47:28] Unknown:
Yeah. I'm interested to look into it some more. At this point, to me, it has been a PR that I made sure was not gonna affect the core of the project. And I'm generally not very inclined towards security types of efforts, but that makes me even happier that it's been done on our behalf.
[00:47:50] Unknown:
Yeah. Earlier, when you were mentioning the ability to introspect the source code of the workflows, one of the things that came to mind is having the security model in Airflow help lock that down a bit, so that information doesn't end up getting out into the open, because it could potentially be sensitive data, or could potentially become an attack vector if the actual source code is too out in the open. I mean, obviously, you would wanna lock down your workflow engine and keep it all internal anyway, but having some additional layers of security would definitely benefit that.
[00:48:25] Unknown:
Yes. Definitely very important to keep everything secure. I know a lot of companies seem to be taking an approach with security where most engineers have access to most internal databases, to kind of create this safe zone within which people have rights. And I think it's good to start with a default of most people having access to most things; that really allows engineers to be efficient, not having to ask permission to do something useful. I guess it's true also of source code. Right? It used to be that different teams would keep their source code insulated, and you'd have to ask for permission to see the source code. I think that's changing too, where most companies are enabling engineers to have access to data and to the source code because it might be useful to them. And if you don't start with that default of having access to things, then there might be a lot of missed opportunities there.
[00:49:30] Unknown:
So when does the data pipeline and workflow management paradigm break down, and what other approaches or tools can be used in those cases?
[00:49:38] Unknown:
So I've seen it break down quite a bit, in the community too. Typically, it's where people are trying to write software using a workflow engine. That would show up as a workflow that changes in shape at every execution. For me, a workflow, if you think of it as a graph of dependencies, should be slowly changing. From one execution to the next, the workflow is fairly similar. There might be some branching, saying I'm gonna go down this path and not this path, but the structure of the workflow itself should be pretty static.
In some cases, people wanna build a very dynamic workflow where, say, a task defines the next task that should run after it, dynamically, based on its result. That's probably not a super good use case for a workflow engine. There's also when people start scheduling things to run every minute. Airflow is a workflow engine that was designed for daily or hourly batch processing. If you start running things every 5 minutes, it works, but it becomes kind of metadata heavy, and you're getting close to the world of real time data and stream processing. And there are some much better solutions out there to do data stream processing, like Storm.
There's, I think, something called Heron. There's Spark Streaming, Samza. So there are all sorts of solutions there that work very well and are designed for those use cases. And if you're trying to schedule your Airflow job to run every minute, most likely you're not using the right system.
[00:51:26] Unknown:
So you wrote another tool recently called Panoramix. Can you describe what it is and explain how it fits in the data management domain in relation to Airflow?
[00:51:34] Unknown:
Sure. Yeah. So Panoramix started last June as a hackathon project. It's a fairly new project, and it is a data exploration and dashboarding platform. It's going to be open source, and we're gonna announce it very soon as an Airbnb project over the next few months. The name of the project might still change, so by the time you listen to this podcast, you can look for data exploration Airbnb and see what name the tool has at that point in time. But this tool allows people to visualize and explore datasets, arbitrary datasets that may live in different SQL databases or in Druid.io.
Druid.io is a column store, real time database that we use at Airbnb. And Panoramix can connect to Druid datasets as well as MySQL databases, Postgres databases, Presto, Impala, Redshift, all of these things, and make it really easy for people to create data visualizations, explore data, and create collections of data slices, or data visualizations, in dashboards. So look forward to hearing about this project. It should be announced very soon.
[00:53:04] Unknown:
Are there any questions that we didn't ask that you think we should have or anything else that you wanna bring up before we move on?
[00:53:10] Unknown:
Certainly. You know, there's a lot more to talk about around workflow engines and Airflow, but I think we covered definitely the basics and a little bit beyond. There's great documentation out there for people that have more questions to start looking into, and we have a Google group that's very active, as well as a Gitter chat channel that everyone's welcome to join.
[00:53:36] Unknown:
Is there anything that our users can help out with for the project?
[00:53:40] Unknown:
Well, on the GitHub repo, we have a CONTRIBUTING.md that tries to define how people can help. For users of the project, probably the first level of helping, as you onboard, is to read the documentation, help clarify the documentation, help improve the tutorial, as well as raise issues. So for all of our users to come to GitHub and, whenever they find things that could be improved, open issues, that really helps. And as people onboard, we're very welcoming to any PR that makes our product better.
[00:54:21] Unknown:
And, one last question I just thought of: when I was reading through the documentation a little while back, when I was evaluating it, it mentioned briefly that it only supported Python 2.7. So I'm wondering if that's still the case or if there's Python 3 support as well. Right. I think the docs might still say only 2.7,
[00:54:39] Unknown:
but we do support 3.4 now. And I think many people in the community are using 3.4, and we run continuous integration against 2.7 and 3.4; Postgres and MySQL are in our build matrix for continuous integration, which includes 3.4. Great.
[00:54:59] Unknown:
So for anybody who wants to follow you and keep up to date with projects you're working on, what would be the best way for them to do that?
[00:55:06] Unknown:
I would say the Google Group is probably the best place to start, and the Gitter channel would be the next place. There are links on the GitHub repo, so probably watching the GitHub repo is also a great way to get updates and follow the project. And on the GitHub repo itself, you can find all the links to related resources, including automation, say Puppet and Chef recipes, as well as other things that make it really easy for you to install Airflow and make it run in your environment.
[00:55:42] Unknown:
So with that, I will move on to the picks. And my picks this week are I recently finished a book called or actually a trilogy of books called the Empire of the East by Fred Saberhagen. And they're a very interesting blend of sci fi and fantasy. The general thematics are predominantly fantasy, but there's a lot of sci fi elements throughout in that it takes place far into the future after our modern era has fallen and given rise to a new era of magic. And there are still artifacts from our time that are present in this far distant future, so things like our telescope or different machinery or flashlights, things like that that show up as artifacts throughout the story line. And also some other elements that I'm not gonna give away because, it would sort of spoil the story for anybody who wants to read it. But he does a really great job of seamlessly blending the 2 themes and the 2 styles.
And he has some really compelling storyline as to how that change came about and what it leads to. And then it also leads into another series that I am currently rereading, called the book of swords. And this takes place further into the future beyond the events of Empire of the East. And the gods have sort of arisen and forged these 12 swords, and each sword has a different set of powers that the bearer can take advantage of. And so, it's, again, just another interesting view of humanity's interaction with the gods and what it actually means for them to be gods and how the swords affect the power dynamics within this world. So a lot of fun to read. Definitely recommend it for anybody who is a fan of sci fi and or fantasy. And with that, I'll pass it to you, Chris.
[00:57:46] Unknown:
Very cool. My first pick is a band that I found on Spotify called Buraka Som Sistema. They're a Portuguese band. Their music is a really interesting mix of sort of African tropical infused techno electronic beats, combined with this African music form called zouk that I had never heard of. And I just personally think the word zouk is really cool, z o u k. So that's my first pick. I think they're a lot of fun. It's definitely not music to fall asleep to; it's incredibly upbeat. I like to work to it. My next pick is a pick of dubious legality, but I'm gonna pick it anyway. It is a fan made edit of the Star Wars trilogy called the Despecialized Edition.
This thing is a labor of love. There's a YouTube video, I believe, linked to from the site that I've linked to in the show notes, that shows just how much work these folks put into restoring the Star Wars trilogy to as close as they could possibly get it to the original seventies film release, but updated for the modern day in terms of upscaling to HD and the like. For someone who grew up with Star Wars and saw it in the theaters in the late seventies when it came out, I've just, for a long time, felt so disappointed by all the official DVD and Blu-ray releases that were coming out, with all the Lucasfilm edits, and this is just such a refreshing development to see these folks bringing it back to its original state.
It's totally worth seeing. My next pick is a series of books by Kevin Hearne. These are definitely not, like, Shakespeare or literary classics for all time, but, boy, they are a lot of fun. They're called the Iron Druid Chronicles, and they are about the last surviving druid, Atticus O'Sullivan, roaming the earth, getting himself into various types of trouble. But it's a really interesting mix of interesting bits of Celtic myth, a lot of great geeky fan service, and just some really entertaining story lines, especially if you enjoy reading fantasy that's infused with the gods and goddesses and other minor denizens of various mythos from around the world. It's really fun stuff.
I I really enjoy it and and really look forward to each new book as it comes out. That's it for me. And, Maxime, what do you have for us?
[01:00:38] Unknown:
So since this is a Python podcast, I felt like I had to pick a Python package as a pick. So I picked a project that I think deserves a lot more attention. It's called Flask-AppBuilder. Flask-AppBuilder is a framework on top of Flask. Flask is a framework already, but it's very modular and doesn't do much in certain areas; it lets people build their own things. Flask-AppBuilder ships with authentication, so it makes it really easy, if you build an app, and especially if you build an open source app like Panoramix, to allow the people that will install your app to define which type of authentication they wanna use. They can plug in, you know, LDAP, OpenID, OAuth, and it just all works. It creates a set of permissions for you and allows you to tie those to roles and users.
Those are things that came with Django that don't ship with Flask. There are also CRUD modules, to do all the create, read, update, delete views dynamically, based on the models from your ORM. So it's just a very productive framework that is super easy to use and works very well. So if you're starting a new web app, check out Flask-AppBuilder, if you want something maybe slightly lighter weight and less mature than Django, so that you can contribute to this project.
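As a taste of what Flask-AppBuilder gives you out of the box, here is a minimal sketch; the model and view names are hypothetical, and setup details vary between Flask-AppBuilder versions.

```python
# Minimal Flask-AppBuilder sketch: auth, permissions, and CRUD views come
# from the framework; only the model and view are defined here.
from flask import Flask
from flask_appbuilder import AppBuilder, Model, ModelView, SQLA
from flask_appbuilder.models.sqla.interface import SQLAInterface
from sqlalchemy import Column, Integer, String

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///app.db"
app.config["SECRET_KEY"] = "change-me"

db = SQLA(app)
appbuilder = AppBuilder(app, db.session)  # wires up auth, roles, permissions


class Dataset(Model):
    # Hypothetical model; CRUD views are generated from it.
    id = Column(Integer, primary_key=True)
    name = Column(String(100), unique=True, nullable=False)


class DatasetView(ModelView):
    datamodel = SQLAInterface(Dataset)


appbuilder.add_view(DatasetView, "Datasets", category="Data")

if __name__ == "__main__":
    db.create_all()
    app.run(debug=True)
```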
[01:02:18] Unknown:
Well, we really appreciate you taking the time out of your day to join us and tell us all about Airflow, and tantalize us with promises of Panoramix. So we definitely look forward to seeing some more of the work that you put out. And, again, thank you for joining us. Have a good evening.
[01:02:36] Unknown:
Thank you very much. Very exciting. I've been listening to all sorts of podcasts in the past, and it's good to be on this side of the microphone. So thank you for having me. We're happy to have you. Thank you. Bye.
Introduction and Host Welcome
Interview with Maxime Beauchemin
Maxime's Background and Introduction to Python
Data Engineering vs Data Science
Introduction to Airflow
Challenges in Data Pipeline Implementation
Design and Architecture of Airflow
Airflow Executors and Extensibility
Comparison with Other Workflow Management Solutions
Airflow's Web UI and Visualization Features
Data Lineage and Metadata Tracking
Choosing Python for Airflow
Security and Kerberos Support
When Workflow Management Paradigms Break Down
Introduction to Panoramix
Community Involvement and Contributions
Picks and Recommendations