Visit our site to listen to past episodes, support the show, join our community, and sign up for our mailing list.
Summary
Are you struggling with trying to manage a series of related, interdependent batch jobs? Then you should check out Airflow. In this episode we spoke with the project’s creator Maxime Beauchemin about what inspired him to create it, how it works, and why you might want to use it. Airflow is a data pipeline management tool that will simplify how you build, deploy, and monitor your complex data processing tasks so that you can focus on getting the insights you need from your data.
Brief Introduction
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- Subscribe on iTunes, Stitcher, TuneIn or RSS
- Follow us on Twitter or Google+
- Give us feedback! Leave a review on iTunes, Tweet to us, send us an email or leave us a message on Google+
- Join our community! Visit discourse.pythonpodcast.com for your opportunity to find out about upcoming guests, suggest questions, and propose show ideas.
- I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. For details on how to support the show you can visit our site at pythonpodcast.com
- Linode is sponsoring us this week. Check them out at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for your next project
- I would also like to thank Hired, a job marketplace for developers and designers, for sponsoring this episode of Podcast.__init__. Use the link hired.com/podcastinit to double your signing bonus.
- Your hosts as usual are Tobias Macey and Chris Patti
- Today we are interviewing Maxime Beauchemin about his work on the Airflow project.
Interview with Maxime Beauchemin
- Introductions
- How did you get introduced to Python? – Chris
- What is Airflow and what are some of the kinds of problems it can be used to solve? – Chris
- What are some of the biggest challenges that you have seen when implementing a data pipeline with a workflow engine? – Tobias
- What are some of the signs that a workflow engine is needed? – Tobias
- Can you share some of the design and architecture of Airflow and how you arrived at those decisions? – Tobias
- How does Airflow compare to other workflow management solutions, and why did you choose to write your own? – Chris
- One of the features of Airflow that is emphasized in the documentation is the ability to dynamically generate pipelines. Can you describe how that works and why it is useful? – Tobias
- For anyone who wants to get started with using Airflow, what are the infrastructure requirements? – Tobias
- Airflow, like a number of the other tools in the space, supports interoperability with Hadoop and its ecosystem. Can you elaborate on why JVM technologies have become so prevalent in the big data space and how Python fits into that overall problem domain? – Tobias
- Airflow comes with a web UI for visualizing workflows, as do a few of the other Python workflow engines. Why is that an important feature for this kind of tool and what are some of the tasks and use cases that are supported in the Airflow web portal? – Tobias
- One problem with data management is tracking the provenance of data as it is manipulated and shuttled between different systems. Does Airflow have any support for maintaining that kind of information and if not do you have recommendations for how practitioners can approach the issue? – Tobias
- What other kinds of metadata can Airflow track as it executes tasks and what are some of the interesting uses you have seen or created for that information? – Tobias
- With all the other languages competing for mindshare, what made you choose Python when you built Airflow? – Chris
- I notice that Airflow supports Kerberos. It’s an incredibly capable security model but that comes at a high price in terms of complexity. What were the challenges and was it worth the additional implementation effort? – Chris
- When does the data pipeline/workflow management paradigm break down and what other approaches or tools can be used in those cases? – Tobias
- So, you wrote another tool recently called Panoramix. Can you describe what it is and maybe explain how it fits in the data management domain in relation to Airflow? – Tobias
Keep In Touch
Picks
- Tobias
- Empire of the East by Fred Saberhagen
- The Book of Swords by Fred Saberhagen
- Chris
- Maxime
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. You can subscribe to our show on iTunes, Stitcher, TuneIn Radio, or add our RSS feed to your podcatcher of choice. You can also follow us on Twitter or Google Plus, and please give us feedback. You can leave us a review on iTunes to help other people find the show, send us a tweet or an email, leave us a message on Google Plus or on our show notes, and please join our community. You can visit discourse.pythonpodcast.com for your opportunity to find out about upcoming guests, suggest questions, and propose show ideas.
I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. For details on how to support the show, you can visit our site at pythonpodcast.com. Linode is sponsoring us this week. Check them out at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for your next project. I would also like to thank Hired, a job marketplace for developers and designers, for sponsoring this episode of Podcast.__init__. Use the link hired.com/podcastinit to double your signing bonus.
Your hosts, as usual, are Tobias Macey and Chris Patti. And today, we are interviewing Maxime Beauchemin about his work on the Airflow project. Max, could you please introduce yourself?
[00:01:25] Unknown:
Yeah, sure. So my name is Maxime Beauchemin. I'm a data engineer at Airbnb, so I'm speaking today from San Francisco, from the Airbnb office. And today we're gonna talk about Airflow. Looking forward to, you know, sharing a little bit about the project.
[00:01:41] Unknown:
How did you get introduced to Python?
[00:01:44] Unknown:
So I think I first heard of Python when I was at Yahoo in about 2007. When I first joined Yahoo, there was a fair amount of Perl scripting there, so it was kind of the de facto scripting language. And I heard from coworkers that there was this greater, faster, more sane language called Python. So I picked up a book, and I pretty much grew a set of wings from that point. Right? I just rediscovered software. I rediscovered web development through Django at the time. I had done some web development in the early 2000s, so quite a while ago, and I remember back then thinking there must be something greater, some easier way to do things. When I discovered Django, I got super enthusiastic about just how fast and easy it was to build a website from that point. So I fell in love with Python and then just started using it for most of my personal projects and at work, whenever I could.
[00:02:48] Unknown:
Very cool. It's really funny to me how the xkcd flight meme has become so intimately ingrained into the dynamic language landscape. It's like, I'm not sure how, but the Python community should do something with that.
[00:03:04] Unknown:
We could do a crossover with Red Bull.
[00:03:07] Unknown:
Yeah. There you go. There you go.
[00:03:10] Unknown:
So just to take a little detour for a minute, you mentioned that you're a data engineer. And while I'm familiar with it, having done a fair bit of it myself, I'm wondering if you could just give our listeners a bit of a flavor of how you differentiate data engineering from data science.
[00:03:26] Unknown:
Right, sure. And maybe I'll start with where it started from. There used to be people called business intelligence engineers or data warehouse architects, and this world has changed quite a bit. The world of business intelligence is kind of being renamed into analytics, and people who used to be business intelligence engineers are nowadays being called data engineers, and the function has changed quite a bit too. Data engineers, when compared to data scientists, are definitely more on the engineering side. Really often, one of the ways you can describe the role is as the people who build the data pipelines and the data structures that hold the company's data. So maintaining the data warehouse, building and growing the data warehouse, and organizing and cleaning the company's data as an asset. Data scientists might be more on finding insight, or, I guess, analysts, data analysts, or data scientists might run machine learning models, and they might also try to squeeze insight out of data. And it really helps them if the data has been organized for them by data engineers.
[00:04:43] Unknown:
What is Airflow and what are some of the kinds of problems it can be used to solve?
[00:04:49] Unknown:
Right. So Airflow is an open source platform to programmatically author, schedule, and monitor data pipelines or workflows. When you think of large organizations, and not so large organizations like Airbnb, these organizations are making significant investments in data, and most companies now have an analytics or a data team. And there are often people in that data team who will write batch processes. Right? Really often they need to make sure that these batch processes run every day, and it doesn't take very long before you have quite a lot of data processes that need to run every day, or on a schedule, or every hour.
And really often, these data processes depend on data structures, or on data being there, or on other processes or other jobs. It doesn't take long before you end up with a very complex dependency graph, basically a symphony that needs to be orchestrated every day, where certain scripts and certain sensors need to run in a very defined order. So Airflow is really about keeping data workers sane while working with this complex network of jobs that need to run every day. I can describe quickly the kind of stuff we orchestrate, or the kind of data processes I'm talking about. The very classic use case is to build a data warehouse, where you take all your log data, you take all of your events and maybe your production databases, and you store it in a place like Hadoop, or in a persistent store where you organize this data. You might cleanse it in some way. You might patch in some holes. You might apply some business rules.
So that is kind of the classic art of data warehousing. And then there are all the machine learning models that need to run, all of the analytics that needs to be built, all the stats work, and things like A/B testing frameworks and search ranking. All these batch jobs need to run every day, so Airflow helps orchestrate all of that.
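As a rough illustration of what "programmatically author" means in practice, here is a minimal sketch of a DAG definition against a 1.x-era Airflow API. The DAG id, task ids, and commands are hypothetical, and import paths differ between Airflow versions.

```python
# Minimal sketch of a programmatically authored Airflow DAG (1.x-era API;
# import paths and defaults vary by version). Names and commands are
# purely illustrative.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# One DAG object per workflow; this one runs on a daily schedule.
dag = DAG(
    dag_id="example_warehouse_load",
    default_args=default_args,
    start_date=datetime(2016, 1, 1),
    schedule_interval="@daily",
)

# Each operator instance is a task in the dependency graph.
extract = BashOperator(task_id="extract_logs", bash_command="echo extract", dag=dag)
load = BashOperator(task_id="load_warehouse", bash_command="echo load", dag=dag)

# Dependencies are declared explicitly: load runs after extract.
load.set_upstream(extract)
```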
[00:07:06] Unknown:
That's really interesting. So it's funny. I mean, I've been working in the tech industry for quite a number of years, and I guess I didn't realize that there had been this progression from analytics and business intelligence to data warehousing to something like this. It almost seems like things have become more and more general over time. Right? Like, you had these kind of specialized software packages to do data warehousing, and it almost seems, based on what you're saying, like the datasets have become larger and the number and complexity of tasks that you need to actually enact on the data have grown such that the old sort of business analytics or business intelligence model doesn't even apply anymore. Is that fair?
[00:07:54] Unknown:
Completely. Yeah. So the field has changed a lot. And with the advent of Hadoop too, everyone has started storing all sorts of data that they probably would not have stored before. And I think about the solutions that we had before around ETL tools, the industry solutions that were in the landscape 5 or 10 years ago, that are still in the landscape, I guess, but that modern companies are evolving away from. So there are things like Informatica, IBM DataStage, things like Ab Initio, or SQL Server Analysis or Integration Services. All these packages were designed for kind of drag and drop usage, so that engineers, or potentially not real software engineers, but people that knew about data structures and data modeling and how to process data, would drag and drop their data flow in a UI. And that's the way that people used to think about that. And I think the problems we're solving nowadays are far too complex to be solved and maintained in a GUI. Right? We need to write code that can express dynamically how to generate these pipelines.
So that's one of the key differentiators with Airflow: we define data pipelines programmatically, which makes them much easier to collaborate on, maintain, and test. So that's one of the things that differentiates Airflow from earlier solutions.
[00:09:26] Unknown:
I can relate to some of that evolution as well, because when I first started in the tech industry, I had some exposure to SSIS, or SQL Server Integration Services or Studio. As you mentioned, it's just drag and drop. For each of the elements you can do some configuration, but there wasn't a lot of actual software design happening there. And it would run on some sort of periodic basis, usually with a cron job or the Windows scheduler. But now, with the volume and diversity of data sources and data destinations, and the manipulations that need to happen to that data, like you said, you just can't keep up. And also, things like SQL Server Integration Studio don't really have any capability for feeding your data into another processing engine, something like maybe Spark or Hadoop or another service where you're doing some sort of machine learning analysis on the data before it then gets shipped off somewhere else.
[00:10:20] Unknown:
You were limited by the system or the platform. If the platform is not very extensible, or only extensible in proprietary ways, it became really hard for people to break out from that or even share components. Right? So open source wins in the end, because we can build all sorts of operators in Airflow that we can share with other companies, and we can actually keep up with the pace of change, where other companies using packaged software are at the mercy of these vendors to keep up with the stack you might be using, and the stack is moving really fast nowadays. So it's hard to keep up. I feel like we're still in a phase of divergence in terms of stack. There's some consolidation in some areas, but it's still hard to keep up; just the database ecosystem, or the key value store ecosystem, is still going crazy.
[00:11:15] Unknown:
And along those lines, what are some of the biggest challenges that you've seen when implementing a data pipeline with a workflow engine? And, what are some of the signs that you actually need that workflow engine?
[00:11:26] Unknown:
Right. So I think the signs come through maintenance really often, where you're kind of unclear as to: did this job run, when did it start, when did it end, how do I get to my log file. When you start to lose clarity as to who's the owner of something and whether it ran, that's when you know you need a robust solution to handle your workflows. And that happens typically very early in the life cycle of a data team. You only need a few people creating workflows or individual processes. Say you have a team of three that creates one process a day; it doesn't take very long before you have a complex thing to manage and to monitor every day. So when you start, some of the symptoms might be things like data quality issues or data latency issues, and just a lack of clarity in the team as to what happened.
In terms of the challenges, I think it's always been a challenge to work with data at scale and to have unit tests. In the world of software engineering, there are tons of best practices in terms of unit tests; you can get some really good reports on coverage, for instance, and have some really good continuous integration. On the data processing side, or on ETL, or data engineering, it's a lot more challenging, because we cannot really afford to run an extra cluster at scale to test that our pipelines are running well. So we have solutions in Airflow that make it really easy to perform data quality checks as kind of breakpoints in the pipeline.
But it's always been a challenge, you know, change management on large data structures, large pipelines that might process for hours. So that's been a challenge, and we don't have a perfect solution for all of this at this point.
[00:13:31] Unknown:
Yeah. Fast feedback is definitely 1 of the critical pieces when developing effective software. And, yes, as soon as you start dealing with massive amounts of data, it's hard to actually maintain that rapid feedback cycle, and particularly when you need to chain together multiple different stages. So like you said, doing unit testing and probably doing some data sampling for verifying the actual flows, I imagine, would be 1 way around that. But when you're working at scale and you run into an error, it's hard to fix it and then verify that you fixed it appropriately because all problems become exacerbated at an appropriate scale.
[00:14:09] Unknown:
Exactly. Sampling is definitely a good solution: you can have an immutable dataset, make it go through the pipe, and make sure you get a specific set of results, and that works for some unit tests. But there are a lot of things that come only at scale, and that is still a challenge. One interesting thing is that Airflow is not necessarily very prescriptive about the way of doing that. It's up to every data team, and every pipeline or workflow, to define how you want to implement your tests, your methodology in terms of data quality checks, or your unit tests. But Airflow certainly allows you to parameterize your workflow and run it in test mode, and lets you define what your test mode or your load testing might look like. Right? So you're free to define that yourself using the framework.
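For reference, the 1.x-era CLI is one way to get that fast feedback on a single task or a small date range outside of a full production run; the dag and task ids below reuse the hypothetical names from the earlier sketch.

```bash
# Hedged sketch against the 1.x-era CLI: "airflow test" runs one task for one
# execution date in isolation, without recording state in the database.
airflow test example_warehouse_load load_warehouse 2016-01-01

# "airflow backfill" re-runs the scheduled tasks of a DAG over a date window.
airflow backfill example_warehouse_load -s 2016-01-01 -e 2016-01-07
```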
[00:15:05] Unknown:
And can you share some of the design and architecture of Airflow and how you arrived at those decisions?
[00:15:10] Unknown:
Right. So it's very Python centric, so that's good for a Python podcast. But yeah, I think a lot of the decisions around architecture were made thinking about: I've got this set of problems I'm trying to solve, what are some of the great things in the Python ecosystem that will help me solve them? There's definitely a challenge that's kind of specific to building a workflow engine, which is that it needs to be connected with pretty much all of the data systems, eventually all the data systems that are relevant in this day and age, which means there are a lot of external dependencies. So, early on, I found some ways to use the setuptools extras_require parameter to define sub-packages for Airflow. You can pip install airflow, which will install the core, but you can also say pip install airflow[hive], which will install all the Hive related libraries you might need. We've made some extensive use of that, and then some dynamic importing, so that we basically only expose the packages and modules for which you have the dependencies installed.
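A quick sketch of that extras mechanism; the exact extra names available depend on the Airflow version installed.

```bash
# Base package installs the core; bracketed extras pull in the optional
# dependencies for a given system (extra names vary by Airflow version).
pip install airflow                       # core only
pip install "airflow[hive]"               # core plus Hive-related libraries
pip install "airflow[celery,postgres]"    # multiple extras at once
```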
So that's been a challenge, but I think that's been a solved challenge. In terms of how we made some of the architectural decisions: I think in all cases, when you design software, you definitely have a bias based on what you've used before. Say the people who designed Flask, I'm sure they had some great experiences with Django and some shortcomings with Django, and that informed the decisions that they made in their architecture and design. And what came out of Flask is something more modular, probably because the person building it wanted something more modular.
From my experience: before Airbnb, I was at Facebook, where I used a tool called Dataswarm, and that was definitely a source of inspiration, on the good side as well as, you know, there were definitely some patterns from that solution that I did not want to repeat. And then I've used all the ETL tools over the years, the more vendor type solutions, and I'm very well aware of the solutions that are also in the Python and open source community. So you look at all these things, and as you move forward and you solve your problems, you try to take steps in the right direction.
[00:17:39] Unknown:
And am I correct in understanding that a good portion of the actual execution engine that powers Airflow is actually Celery under the covers?
[00:17:50] Unknown:
Right. So that has to do also with the architecture. Maybe I should talk a little bit more about the extensibility. For a workflow engine to be successful, it needs to be very extensible. And I think the reason why there are maybe not that many out there, though there's a growing number of workflow engines, is because often people will tightly couple it with their environment. So as we built Airflow at Airbnb, we could have had tight coupling with, say, something like Kafka for logging or something like ZooKeeper for managing state.
And we intentionally decided not to do that, and to write interfaces instead, so that people could run it and make use of it in their environment. So the Celery engine is one of the executors that exists for Airflow, and at Airbnb we're actually planning on using a different executor in the future. We're planning on writing a YARN executor and moving away from Celery, to have more support for containment. So maybe I'll talk a little bit more about executors. The executor part of Airflow is an interface that allows jobs to run remotely. The Airflow scheduler can schedule jobs to run on different pieces of software that can do that. So Celery is one of those.
Mesos, the Mesos executor, is another alternative at scale. Then we have a sequential executor, which is just a local sequential executor that runs in process; that's used for testing. And we have another one, a multi threaded local executor, that you can use on a local machine to, say, spin off 32 threads, or n threads, that will run Airflow jobs locally. So we do use and support Celery, but we have alternatives. Now we're writing a YARN executor, which will leverage some of our Hadoop infrastructure, so that if you have a YARN cluster, you'll be able to kind of rationalize your resources and use the YARN processes to run your Airflow jobs.
[00:20:05] Unknown:
Max, you mentioned that the YARN executor that you're writing has better support for containment, and you mentioned Mesos just a moment ago. Does containment in this context mean something like a Docker container?
[00:20:18] Unknown:
Completely. Yeah. So containment is really important when you run processes at scale and when you don't know what people are gonna use your workflow engine to run inside of it. It would be somewhat easy with the Celery executor for someone to write a job that will creep up to use all the memory on your system, or to spin up multiple threads and take over the CPU on a worker. So things like Docker, and Docker is built on top of Linux containers; I think Mesos is also kind of a distributed computing platform built on top of Linux containers as well. What Linux containers allow you to do is run a small, lightweight virtual environment that can limit resource usage. So based on this assumption, you can ensure more stability of the system.
Saying, for instance, I will only allow this process to use a CPU core, two gigs of RAM, and a certain amount of virtual disk space. And if it goes outside these boundaries, it will either starve or it might get killed externally. So it is kind of a vital feature in a workflow engine, or in any sort of distributed computing. And though I think we've had quite a run with Celery, if we wanna grow beyond that and sleep at night, it's a good thing to have containment.
[00:21:50] Unknown:
Absolutely. Would that allow you to essentially have the Docker containers that represent the jobs in your workflow... I guess what I'm wondering is, would you be able to do something like implement the equivalent of software contracts, where you say, this job requires a container that is listening on these ports and does the following things with the data, such that you could swap out the underpinnings if you needed to and still have your job work as designed?
[00:22:20] Unknown:
Right. So that's definitely a direction we're moving towards. I think someone from the community contributed one; I just merged a PR that is a Docker operator. Maybe a little bit of context on what an operator is. An operator in Airflow is a task factory. When you call, say, a Hive operator, it will return a Hive task, and it receives the parameters that are relevant to this operator, so in this case probably a Hive script and some parameters specific to Hive. There are kind of three types of operators. There are sensors, which are just waiting for an event to happen. That could be waiting for a file to land in HDFS, or a file to land in S3, or waiting for arbitrary things in your environment to arrive. Then there are the remote job operators that will tell an external system to run a job. That could be a MapReduce operator, or a Pig operator, or a Bash operator, or a Python operator that will just execute a script in the remote system.
And then we have all the data transfer operators, which will just take data from one system and move it to another system. So in the context of what an operator is, a Docker operator will fire up a Docker container of your choice and run a command within it. That's pretty empowering. Then we can have this contract of: you provide arbitrary containers, and we'll run them for you. I think we want a deeper integration with Docker in the future, where every task, so every operator, could be run within a Docker container on demand. That means on the base operator, when you call an operator, you would say in which container you want this task to run, and then we can fire it up within that Docker container.
So that does not exist today, but it's definitely on the roadmap, and it's coming up soon.
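As a sketch of how the sensor and remote-job operator flavors fit together in a DAG, here is a minimal example against a 1.x-era API. Import paths and operator parameters vary by version, and the bucket, key, and command below are hypothetical.

```python
# Sketch: a sensor gating a downstream task (1.x-era API; paths/names illustrative).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.sensors import S3KeySensor

dag = DAG(
    dag_id="example_sensor_flow",
    start_date=datetime(2016, 1, 1),
    schedule_interval="@daily",
)

# Sensor: poke until the upstream file lands in S3.
wait_for_events = S3KeySensor(
    task_id="wait_for_events",
    bucket_name="example-bucket",
    bucket_key="events/_SUCCESS",
    dag=dag,
)

# "Remote job" style operator (here just bash for illustration).
process_events = BashOperator(
    task_id="process_events",
    bash_command="echo processing {{ ds }}",
    dag=dag,
)

process_events.set_upstream(wait_for_events)
```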
[00:24:25] Unknown:
It's kind of amazing to watch how containers, which at first the old guard was kind of dismissing as, you know, this is not a new concept, why is this such a big deal? But it seems like Docker has really driven adoption and utilization in all kinds of interesting ways. And so you're seeing these containers enable all kinds of really interesting reuse and efficiencies that you can pull out of that architecture. It's really cool to see.
[00:24:56] Unknown:
Yeah, definitely. I think there was a bad separation of concerns that existed before, where maybe ops people would be in charge of defining the software to be deployed on machines, and then developers would be in charge of maintaining the code that would run on these machines. And that separation was kind of unhealthy, because it's hard to work together, and maybe the ops people were trying to fight the developers, saying, oh, we need to have supported packages that are validated in CentOS, you know, 2.0 or whatever, and some packages that have been written that are very stable but not up to date. And that was competing with developers that maybe wanted to use the bleeding edge features of the new library out there. So now I feel like it makes more sense for the software developer to package his own stuff and just send this package to an ops person, knowing it is contained.
And, you know, the contract is a lot more clear in that sense now, and it allows for better distributed computing in general.
[00:26:03] Unknown:
So how does Airflow compare to other workflow management solutions, and why did you choose to write your own?
[00:26:10] Unknown:
Right. So, coming out of Facebook: Facebook, we know, is an innovator in the data processing space, or just in the data analytics space. And Facebook has a great array of tools that empowers the people who use them to kind of naturally build some very dynamic processes. So coming out of there, I wanted to make sure I was gonna have a similar set of tools in some ways, where I could do things like analysis automation, analytics as a service, and build some great things with the right packages. Sometimes when you have the right tool for the job, it becomes kind of trivial, and it can change the way you think about this job. Right? If you've held a hammer, the world really changes once you've held, how do they call it, one of those automated hammers that work with air pressure. Mhmm. Oh. Once you've held one of those, you're like, oh, that really changes the way I think about, I don't know, redoing my roof or whatever.
So looking at the ecosystem and the different tools that were out there, coming out of a place like Facebook, you can't help but think, you know, I'm gonna need to build my own set of tools if I want to be able to accomplish things at the same level. Looking at what was out there at the time: we have at Airbnb people coming from LinkedIn that have used Azkaban, and we have people from Yahoo that have used Oozie, and I've used some of these tools and definitely reviewed them. And the people that had used these tools were saying, whatever we need to do, we can't use the one that I used before.
And, you know, I was coming from a place where people loved their workflow engine. So I was like, it's possible to build a workflow engine that people will actually love to use. So that was the intuition. And just looking at kind of where I was at, I felt like I had a good grasp on what was needed and that I could just build it. You know? Sometimes you just feel like, I can do this. I can build this thing on my own, or not necessarily on my own, but I know how to carry this project through.
[00:28:31] Unknown:
And moreover, it seems like, from what you're saying, Airflow also represents kind of a new level, a new kind of workflow management solution that really doesn't compare to other solutions out there. Right?
[00:28:44] Unknown:
Yeah. Luigi is probably the closest one. I don't know if you guys have looked into Luigi. It seems like a really interesting project. I looked into it, and when I did, there were a few things that I wanted built a slightly different way. And it's not necessarily criticism towards Luigi, which I think is a great tool that people are getting tons of value out of. But one thing that was really important to me was for the API to allow a really natural way of creating tasks dynamically. In Airflow, to create a task, you need to instantiate an object, where in Luigi, you need to derive a class.
Meaning that if you wanna write a for loop that creates a set of tasks in Luigi, you have to get into metaprogramming. In Airflow, you can just simply instantiate objects, put them in a list, associate them with your workflow object, and that just works. Looking at that, I was like, I want to make it very natural, and I think it's a game changer to allow people to author their workflows in a very dynamic fashion.
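Here is a minimal sketch of that dynamic authoring pattern, again against a 1.x-era API with hypothetical table names: tasks are plain object instantiations, so an ordinary for loop is enough to fan out a set of similar tasks.

```python
# Sketch: generating tasks in a loop instead of metaprogramming (1.x-era API).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="example_dynamic_tasks",
    start_date=datetime(2016, 1, 1),
    schedule_interval="@daily",
)

tables = ["users", "listings", "bookings"]  # hypothetical table names
load_tasks = []
for table in tables:
    task = BashOperator(
        task_id="load_{}".format(table),
        bash_command="echo loading {}".format(table),
        dag=dag,
    )
    load_tasks.append(task)

# A single downstream task can depend on the whole generated list.
finalize = BashOperator(task_id="finalize", bash_command="echo done", dag=dag)
for task in load_tasks:
    finalize.set_upstream(task)
```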
[00:29:57] Unknown:
Yeah. Luigi is definitely an interesting tool. When I was in the process of picking a workflow engine for managing some of our data pipelines, it was mostly a toss-up between Airflow and Luigi, trying to figure out which one would fit best with our requirements. They're both definitely very interesting projects with their own particular pros and cons. And from my understanding of it, anyway, it seems that Luigi gains a lot of popularity because of its simplicity in terms of the deployment requirements, because it doesn't really have any hard dependencies at an infrastructure level, and you can build your pipeline however you want. Whereas it seems that Airflow has a little bit more in terms of initial setup. So I don't know if you can elaborate a bit on what the deployment story looks like with Airflow.
[00:30:48] Unknown:
Right. So one thing that was really important for us was to make it really easy for people to go through the tutorial without having to install anything in particular. So to just be able to pip install Airflow and get a feel for it, get the UI, get the web server running, and go through the tutorial. I think we've achieved that very well. When you wanna go beyond the tutorial, very quickly you need a decent database, a MySQL instance or a Postgres instance, to store the state. So that's one thing that Airflow does that Luigi does in a different way. We store state for each task instance, and there's this notion of schedule that is kind of deeply ingrained in the Airflow fabric. It kind of assumes that most jobs or most workflows run on a schedule, say a daily schedule or an hourly schedule, where Luigi, I believe, defines whether a task needs to run or not based on whether the target is present or isn't, which changes things in the way you think about it in some ways. Sometimes when you look at whether you should build your own tool or use someone else's, the question is whether you're able to do the exercise of bending your mind to fit the maker's mind. Right? And when I looked into the different tools out there, the way I think about workflows and how to author them and schedule them was not fully compatible. I felt like if I tried to bend my mind into thinking like these other makers, something might crack. So I needed to build something that fits.
So quickly you need a database, because by default it ships with a SQLite database, which won't allow you to run tasks in parallel, because there are issues with SQLite and having multiple threads talk to the same file on your local disk. So quickly you need any of the SQLAlchemy supported database backends, so Postgres or MySQL. And the next step after that: you can run your local executor on a single machine until you essentially run out of resources on one box, or it starts to feel a little bit tight, at which point you probably wanna set up Celery or Mesos.
And Celery is very lightweight. It's easy to set up, but you do need a broker, so you might need a Redis instance or a RabbitMQ instance as the broker for Celery. For those who are not familiar with Celery, it's a mature Python async processing framework, which is great. It's super easy to use, and it makes it really easy to, say, execute a function remotely on an array of workers. People use it to perform all sorts of async tasks. I think people use it quite a bit in the web world to, say, process thumbnails or do all sorts of asynchronous workloads.
So to grow an Airflow cluster, at some point you need to install Redis or RabbitMQ. And then, you know, that comes with managing multiple servers. Probably if you work at a company, you already have a way to deploy, to spin up machines, either on AWS or however you do that, but you need some sort of infrastructure to manage your machines. And Airflow is kind of agnostic in those terms. Right? Airflow knows which tasks run on which machines, but you still need to monitor these boxes the same way you monitor any boxes on your network.
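For orientation, these are roughly the airflow.cfg knobs that deployment story touches on. Section and option names follow the 1.x-era configuration, and the connection strings are placeholders for your own hosts and credentials.

```ini
[core]
# Swap the default SQLite store for a real database via a SQLAlchemy URI.
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@db-host/airflow
# SequentialExecutor -> LocalExecutor -> CeleryExecutor as you scale out.
executor = CeleryExecutor

[celery]
# Celery needs a broker (Redis or RabbitMQ) and a result backend.
broker_url = redis://redis-host:6379/0
celery_result_backend = db+postgresql://airflow:airflow@db-host/airflow
```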
[00:34:40] Unknown:
And Airflow, like a number of other tools in the space, supports interoperability with Hadoop and its ecosystem, like you mentioned. I'm wondering if you can postulate on why the JVM technologies have become so prevalent in the big data space and how Python fits into that overall domain.
[00:34:56] Unknown:
Right. So, yeah, the JVM is great for a lot of reasons. There's some containment that comes with it; there are all sorts of best practices there. I'm not much of a JVM guy. And I think that those lines are also starting to blur now. Right? We see that we have faster just in time compilers for Python. And then, on the other side, there are more dynamic languages built on top of the JVM, like Scala, and there's Jython. I'm not sure how viable Jython is nowadays, but I was considering enabling it in some ways on the Airflow workers at some point, because the JVM is a great kind of unit of containment, or unit of work, with a whole set of guarantees.
And it seems like it's very prevalent in the Hadoop ecosystem, but we're seeing kind of the counter movement of more dynamic languages starting to be built on the JVM itself. So that's kind of interesting, to see where this is all moving.
[00:36:07] Unknown:
And Airflow comes with a web UI for visualizing the workflows as do a few of the other Python engines. I'm wondering why that's an important feature for this kind of tool, and what are some of the tasks and use cases that are supported in the Airflow web portal?
[00:36:21] Unknown:
Right. So it is vital to know what's going on. We're executing probably close to 10,000 tasks a day in Airflow at Airbnb, and there are multiple people iterating on these tasks, and there are all sorts of things going on in this big symphony that's taking place every day. If you're not able to see what's going on, you have no clarity, and you basically can't do the work of a data engineer if you don't have an easy way to know the state of things and get access to your logs. So, some of the things that the Airflow UI does, and it's probably one of the best UIs out there among workflow engines: the first thing is to visualize your graph of dependencies and to easily see the state. Airflow will show you a graph view where you can kind of zoom in and zoom out and hover over your tasks and get all sorts of tooltips that will tell you what the state of that task is, when it started, and when it ended.
If you wanna see any of your parameters or attributes for any of your objects or tasks, it's really easy to do so. There's a tree view that will show you the tree representation of your graph over time, and that brings a lot of clarity when you're trying to understand why a task is blocked or how long it has been blocked for. Airflow will also allow you to interact with the state of your tasks. So you can say, for instance, I need to rerun this task for a certain date range and everything downstream from it. Those are things that we do every day as data engineers. Right? There are false positives, false negatives, there are some errors, and we need to reprocess things. So the Airflow UI gives you a lot of clarity on what needs to be done, and it allows you to perform all these surgeries on your workflows.
There's also a lot of clarity around start time and end time. Things like Gantt charts, or charts that will show for each task at what time it lands every day. So time series of landing times, time series of task durations. You can see the source code for the workflow as well. All of this makes it a lot easier for people to be productive, to have clarity, and not be frustrated while authoring batch processes.
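The "rerun a task for a date range and everything downstream" surgery described above can also be driven from the 1.x-era CLI; a hedged sketch with hypothetical dag and task ids follows. Clearing task instances marks them for re-execution by the scheduler.

```bash
# Clear one task (and everything downstream of it) over a date window so the
# scheduler re-runs it. Flags per the 1.x-era CLI: -t task regex, -d downstream.
airflow clear example_warehouse_load \
    -t load_warehouse \
    -d \
    -s 2016-01-01 -e 2016-01-07
```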
[00:39:01] Unknown:
And one problem with data management is tracking the provenance of the data as it's manipulated and shuttled between different systems. I'm wondering if Airflow has any support for maintaining that kind of information, and if not, if you have recommendations for how practitioners can approach the issue.
[00:39:16] Unknown:
Right. So data lineage is what you're talking about, which is trying to understand, when you look at a piece of information, say in a dashboard or in a report, where it is coming from. If you have good data lineage, you should be able to answer the question of: this metric came from these tables, and see the whole track of the business rules that were applied on the way there. So Airflow does not have a graph of data lineage per se. We have a lot of clarity on the lineage of jobs and how jobs depend on other jobs, but there's no built in, direct way of visualizing your data objects and how they're graphed together, how they're related.
So, having clarity on your jobs, your job structure, and how they depend on each other; and if you have some sane rules around that, maybe most jobs are associated with an underlying table, and maybe there are naming conventions where, you know, prc_fact_sessions is the process that loads the table called fact_sessions, then it's pretty easy to infer the data lineage from the job lineage. That's the way we've been doing it. We're also interested in allowing for some code introspection, so Airflow could look into the SQL code and some of the underlying code to be able to infer the lineage, or to allow people to annotate their workflows to structure and clearly define the data lineage behind certain tasks. So that's on the roadmap. That's something we're committed to, and a problem we definitely wanna solve for ourselves and for the community.
[00:41:10] Unknown:
And are there any other kinds of metadata that Airflow can track as it executes tasks? And what are some of the interesting uses you've seen or created for that information?
[00:41:19] Unknown:
Right. So I think one key thing is that metadata is the key to success in the data processing world. Right? We do analytics, so having metadata and doing analytics on it is a way to stay on top while being a data engineer. Outside of the task instance metadata that we have in Airflow, internally at Airbnb we've written a job parser, so that after each Hive job we're able to gather some of the statistics around resource usage behind all of our Hive jobs. We're a Hive shop; we run a lot of our batch processes in Apache Hive.
And every time a Hive job finishes, we parse the logs to try to get as much information as we can around how many CPU cycles, how much IO, and how many mappers and reducers we've used, and try to gather stats around skew, say join skew or group-by skew. So that's some of the stuff we've built on top of Airflow that gathers underlying metadata. Beyond that, we do metadata processing jobs using the Airflow engine. We're able to call APIs on a schedule, gather information and metadata, and stamp it or snapshot it in the database using Airflow, and that's something we surely do.
[00:42:59] Unknown:
So with all the other languages competing for mindshare, what made you choose Python when you built Airflow?
[00:43:06] Unknown:
Right. So if English is the language of business, I'd say Python might be the language of data, or at least a language that a lot of people speak in the data world, especially for everything related to gluing things together. When I look at the tool that we were building, one thing that's really important is connectivity with all these other systems, because it's a workflow engine, and we need to be able to talk to all the different services. We need to be able to talk to any external database. And Python's got extremely good support, a very wide coverage, for interfacing with external tools.
So that was one key element. And I knew we were gonna need some async processing. I knew we were gonna need a good web framework to build a nice UI. And I wanted something that was high velocity, where I could write code quickly and get results quickly too. As a Python programmer, it's just my language of choice. I think in Python. When I write in other languages, I often think in Python first and translate it to that language, with a little bit of frustration in the process. So I think Python is a really good choice for a workflow engine. I think we can see that with other efforts out there too, like Luigi and Pinball.
And it seems like it's a great fit to build that kind of solution.
[00:44:36] Unknown:
Absolutely. Speaking of Python, and Python being the language of data, I love your quote, by the way. Have you or has anybody else written any integrations between Airflow and Jupyter or IPython Notebook?
[00:44:49] Unknown:
Right. So I remember, when I first started the project, I was actually iterating within IPython Notebook, or actually, well, Jupyter. And I wanted to make it easy for people to use the operators and run them interactively, using the API directly in Jupyter. Today, I think it's still possible to do that, but it's not the way that people necessarily do it. What is possible is to use the hooks. The hooks are the interface to talk to external systems in Airflow. And, you know, a lot of people at Airbnb have been using the hooks in an IPython notebook context.
That works pretty well. It's not necessarily a first class citizen, but I think it should be, and there are a lot of things that could be leveraged directly from there.
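A small sketch of that hooks-in-a-notebook usage, assuming a 1.x-era API: hooks wrap connections defined in Airflow's metadata database, so you can poke at the same systems interactively. The connection id and query here are hypothetical, and hook import paths differ between versions.

```python
# Sketch: using an Airflow hook interactively (e.g. in a Jupyter notebook).
from airflow.hooks.postgres_hook import PostgresHook

# "analytics_db" is a hypothetical connection id configured in Airflow.
hook = PostgresHook(postgres_conn_id="analytics_db")

# DbApiHook-style hooks expose helpers like get_records / get_pandas_df.
rows = hook.get_records("SELECT count(*) FROM events WHERE ds = '2016-01-01'")
df = hook.get_pandas_df("SELECT * FROM events LIMIT 100")
```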
[00:45:43] Unknown:
Very cool. So I noticed that Airflow supports Kerberos. It's an incredibly capable security model, but that comes at a high price in terms of complexity. What were the challenges, and was it worth the additional implementation effort?
[00:45:58] Unknown:
That's an interesting one. So this one is community contributed. The mindshare there was to try to review it while not necessarily knowing what it was all about. I believe that might be under the contrib folder of the project. So it is not fully supported by Airbnb, but it is community contributed, and we definitely wanna grow in that direction. This year we're having kind of a security effort internally, and we'll most likely be using Kerberos at Airbnb on top of some of our Hadoop services.
So it's really cool to see that the community contributed something to the project that we actually need, and that will become really handy once we get to that bridge. We'll be able to cross that bridge without having to build it. So that's one of the successes of open source, where the contributors have built things that we actually ended up needing.
[00:46:58] Unknown:
Absolutely. It's kind of amazing to me, when you think about just how old Kerberos is, how successful and pervasive it has been. I think it's a testament to the good design and really serious thought that went into it that it's not only still around, but really still thriving. And, you know, companies like yours are considering adopting it now, in 2016. It's a pretty incredible piece of technology.
[00:47:28] Unknown:
Yeah. I'm interested to look into it some more. At this point, to me, it has been a PR that I made sure was not gonna affect the core of the project. And I'm generally not very inclined towards security types of efforts, but that makes me even happier that it's been done on our behalf.
[00:47:50] Unknown:
Yeah. Earlier, when you were mentioning the ability to introspect the source code of the workflows, one of the things that came to mind is having the security model in Airflow help lock that down a bit, so that information doesn't end up getting out into the open, because it could potentially be sensitive data, or could potentially become an attack vector if the actual source code is too out in the open. I mean, obviously, you would wanna lock down your workflow engine and keep it all internal anyway, but having some additional layers of security would definitely benefit that.
[00:48:25] Unknown:
Yes. Definitely very important to keep everything secure. I know a lot of companies seem to be taking an approach with security where most engineers have access to most internal databases, to kind of create this safe zone within which people have rights. And I think it's good to start with a default of most people having access to most things; that really allows engineers to be efficient, not having to ask permission to do something useful. I guess it's true also of source code. Right? It used to be that different teams would keep their source code insulated, and you'd have to ask for permission to see the source code. I think that's changing too, where most companies are enabling engineers to have access to data and to the source code because it might be useful to them. And if you don't start with that default of having access to things, then there might be a lot of missed opportunities there.
[00:49:30] Unknown:
So when does the data pipeline and workflow management paradigm break down, and what other approaches or tools can be used in those cases?
[00:49:38] Unknown:
So I've seen it break down quite a bit, in the community too. Typically, it's where people are trying to write software using a workflow engine. That would show up as a workflow that changes in shape at every execution. For me, a workflow, if you think of it as a graph of dependencies, should be slowly changing. From one execution to the next, the workflow is fairly similar. There might be some branching, saying I'm gonna go down this path and not this path, but the structure of the workflow itself should be pretty static.
In some cases, people wanna build a very dynamic workflow where, say, a task defines the next task that should run after it, dynamically, based on its result. That's probably not a super good use case for a workflow engine. There's also when people start scheduling things to run every minute. Airflow is a workflow engine that was designed for daily or hourly batch processing. If you start running things every 5 minutes, it works, but it becomes kind of metadata heavy, and you're getting close to the world of real time data and stream processing. And there are some much better solutions out there to do data stream processing, like Storm.
There's, I think, something called Heron. There's Spark Streaming, Samza. So there are all sorts of solutions there that work very well and are designed for those use cases. And if you're trying to schedule your Airflow job to run every minute, most likely you're not using the right system.
[00:51:26] Unknown:
So you wrote another tool recently called Panoramix. Can you describe what it is and explain how it fits in the data management domain in relation to Airflow?
[00:51:34] Unknown:
Sure. Yeah. So Panoramix started last June as a hackathon project. It's a fairly new project, and it is a data exploration and dashboarding platform. It's going to be open source, and we're gonna announce it very soon as an Airbnb project over the next few months. The name of the project might still change, so by the time you listen to this podcast, you can look for data exploration Airbnb and see what name the tool has at that point in time. But this tool allows people to visualize and explore datasets, arbitrary datasets that may live in different SQL databases or in Druid.io.
Druid.io is a column store, real time database that we use at Airbnb. And Panoramix can connect to Druid datasets as well as MySQL databases, Postgres databases, Presto, Impala, Redshift, all of these things, and make it really easy for people to create data visualizations, explore data, and create collections of data slices, or data visualizations, in dashboards. So look forward to hearing about this project. It should be announced very soon.
[00:53:04] Unknown:
Are there any questions that we didn't ask that you think we should have or anything else that you wanna bring up before we move on?
[00:53:10] Unknown:
Certainly. You know, there's a lot more to talk about around workflow engines and Airflow, but I think we covered definitely the basics and a little bit beyond. There's great documentation out there for people that have more questions to start looking into, and we have a Google group that's very active, as well as a Gitter chat channel that everyone's welcome to join.
[00:53:36] Unknown:
Is there anything that our users can help out with for the project?
[00:53:40] Unknown:
Well, on the GitHub repo, we have a CONTRIBUTING.md that tries to define how people can help. For users of the project, probably the first level of helping, as you onboard, is to read the documentation, help clarify the documentation, help improve the tutorial, as well as raise issues. So for all of our users to come to GitHub and, whenever they find things that could be improved, open issues, that really helps. And as people onboard, we're very welcoming to any PR that makes our product better.
[00:54:21] Unknown:
And, one last question I just thought of: when I was reading through the documentation a little while back, when I was evaluating it, it mentioned briefly that it only supported Python 2.7. So I'm wondering if that's still the case or if there's Python 3 support as well. Right. I think the docs might still say only 2.7,
[00:54:39] Unknown:
but we do support 3.4 now. And I think many people in the community are using 3.4, and we run continuous integration against 2.7 and 3.4; Postgres and MySQL are in our build matrix for continuous integration, which includes 3.4. Great.
[00:54:59] Unknown:
So for anybody who wants to follow you and keep up to date with projects you're working on, what would be the best way for them to do that?
[00:55:06] Unknown:
I would say the Google Group is probably the best place to start, and the Gitter channel would be the next place. There are links on the GitHub repo, so probably watching the GitHub repo is also a great way to get updates and follow the project. And on the GitHub repo itself, you can find all the links to related resources, including automation, say Puppet and Chef recipes, as well as other things that make it really easy for you to install Airflow and make it run in your environment.
[00:55:42] Unknown:
So with that, I will move on to the picks. And my picks this week are I recently finished a book called or actually a trilogy of books called the Empire of the East by Fred Saberhagen. And they're a very interesting blend of sci fi and fantasy. The general thematics are predominantly fantasy, but there's a lot of sci fi elements throughout in that it takes place far into the future after our modern era has fallen and given rise to a new era of magic. And there are still artifacts from our time that are present in this far distant future, so things like our telescope or different machinery or flashlights, things like that that show up as artifacts throughout the story line. And also some other elements that I'm not gonna give away because, it would sort of spoil the story for anybody who wants to read it. But he does a really great job of seamlessly blending the 2 themes and the 2 styles.
And he has some really compelling storyline as to how that change came about and what it leads to. And then it also leads into another series that I am currently rereading, called the book of swords. And this takes place further into the future beyond the events of Empire of the East. And the gods have sort of arisen and forged these 12 swords, and each sword has a different set of powers that the bearer can take advantage of. And so, it's, again, just another interesting view of humanity's interaction with the gods and what it actually means for them to be gods and how the swords affect the power dynamics within this world. So a lot of fun to read. Definitely recommend it for anybody who is a fan of sci fi and or fantasy. And with that, I'll pass it to you, Chris.
[00:57:46] Unknown:
Very cool. My first pick is a band that I found on Spotify called Buraka Som Sistema. They're a Portuguese band. Their music is a really interesting mix of sort of African tropical infused techno electronic beats, combined with this African music form called zouk that I had never heard of. And I just personally think the word zouk is really cool, z o u k. So that's my first pick. I think they're a lot of fun. It's definitely not music to fall asleep to; it's incredibly upbeat. I like to work to it. My next pick is a pick of dubious legality, but I'm gonna pick it anyway. It is a fan made edit of the Star Wars trilogy called the Despecialized Edition.
This thing is a labor of love. There's a YouTube video, I believe, linked to from the site that I've linked to in the show notes, that shows just how much work these folks put into restoring the Star Wars trilogy to as close as they could possibly get it to the original seventies film release, but updated for the modern day in terms of upscaling to HD and the like. For someone who grew up with Star Wars and saw it in the theaters in the late seventies when it came out, I've just, for a long time, felt so disappointed by all the official DVD and Blu-ray releases that were coming out, with all the Lucasfilm edits, and this is just such a refreshing development to see these folks bringing it back to its original state.
It's totally worth seeing. My next pick is a series of books by Kevin Hearne. These are definitely not, like, Shakespeare or literary classics for all time, but, boy, they are a lot of fun. They're called the Iron Druid Chronicles, and they are about the last surviving druid, Atticus O'Sullivan, roaming the earth, getting himself into various types of trouble. But it's a really interesting mix of interesting bits of Celtic myth, a lot of great geeky fan service, and just some really entertaining story lines, especially if you enjoy reading fantasy that's infused with the gods and goddesses and other minor denizens of various mythos from around the world. It's really fun stuff.
I I really enjoy it and and really look forward to each new book as it comes out. That's it for me. And, Maxime, what do you have for us?
[01:00:38] Unknown:
So since this is a Python podcast, I felt like I had to pick a Python package as a pick. So I picked a project that I think deserves a lot more attention. It's called Flask-AppBuilder. Flask-AppBuilder is a framework on top of Flask. Flask is a framework already, but it's very modular and doesn't do much in certain areas; it lets people build their own things. Flask-AppBuilder ships with authentication, so it makes it really easy, if you build an app, and especially if you build an open source app like Panoramix, to allow the people that will install your app to define which type of authentication they wanna use. They can plug in, you know, LDAP, OpenID, OAuth, and it just all works. It creates a set of permissions for you and allows you to tie those to roles and users.
Those are things that came with Django that don't ship with Flask. There are also CRUD modules, to do all the create, read, update, delete views dynamically, based on the models from your ORM. So it's just a very productive framework that is super easy to use and works very well. So if you're starting a new web app, check out Flask-AppBuilder, if you want something maybe slightly lighter weight and less mature than Django, so that you can contribute to this project.
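As a taste of what Flask-AppBuilder gives you out of the box, here is a minimal sketch; the model and view names are hypothetical, and setup details vary between Flask-AppBuilder versions.

```python
# Minimal Flask-AppBuilder sketch: auth, permissions, and CRUD views come
# from the framework; only the model and view are defined here.
from flask import Flask
from flask_appbuilder import AppBuilder, Model, ModelView, SQLA
from flask_appbuilder.models.sqla.interface import SQLAInterface
from sqlalchemy import Column, Integer, String

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///app.db"
app.config["SECRET_KEY"] = "change-me"

db = SQLA(app)
appbuilder = AppBuilder(app, db.session)  # wires up auth, roles, permissions


class Dataset(Model):
    # Hypothetical model; CRUD views are generated from it.
    id = Column(Integer, primary_key=True)
    name = Column(String(100), unique=True, nullable=False)


class DatasetView(ModelView):
    datamodel = SQLAInterface(Dataset)


appbuilder.add_view(DatasetView, "Datasets", category="Data")

if __name__ == "__main__":
    db.create_all()
    app.run(debug=True)
```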
[01:02:18] Unknown:
Well, we really appreciate you taking the time out of your day to join us and tell us all about Airflow, and tantalize us with promises of Panoramix. So we definitely look forward to seeing some more of the work that you put out. And, again, thank you for joining us. Have a good evening.
[01:02:36] Unknown:
Thank you very much. Very exciting. I've been listening to all sorts of podcasts in the past, and it's good to be on this side of the microphone. So thank you for having me. We're happy to have you. Thank you. Bye.
Introduction and Host Welcome
Interview with Maxime Beauchemin
Maxime's Background and Introduction to Python
Data Engineering vs Data Science
Introduction to Airflow
Challenges in Data Pipeline Implementation
Design and Architecture of Airflow
Airflow Executors and Extensibility
Comparison with Other Workflow Management Solutions
Airflow's Web UI and Visualization Features
Data Lineage and Metadata Tracking
Choosing Python for Airflow
Security and Kerberos Support
When Workflow Management Paradigms Break Down
Introduction to Panoramix
Community Involvement and Contributions
Picks and Recommendations