Summary
The need to process unbounded and continually streaming sources of data has become increasingly common. One of the popular platforms for implementing this is Kafka, along with its Streams API. Unfortunately, this requires all of your processing or microservice logic to be implemented in Java, so what's a poor Python developer to do? If that developer is Ask Solem, of Celery fame, then the answer is to help re-implement the Streams API in Python. In this episode Ask describes how Faust got started, how it works under the covers, and how you can start using it today to process your fast-moving data in easy-to-understand Python code. He also discusses ways in which Faust might be able to replace your Celery workers, and all of the pieces that you can replace with your own plugins.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you're ready to launch your next app you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you've got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Join the community in the new Zulip chat workspace at podcastinit.com/chat
- Your host as usual is Tobias Macey and today I’m interviewing Ask Solem about Faust, a library for building high performance, high throughput streaming systems in Python
Interview
- Introductions
- How did you get introduced to Python?
- What is Faust and what was your motivation for building it?
- What were the initial project requirements that led you to use Kafka as the primary infrastructure component for Faust?
- Can you describe the architecture for Faust and how it has changed from when you first started writing it?
- What mechanism does Faust use for managing consensus and failover among instances that are working on the same stream partition?
- What are some of the lessons that you learned while building Celery that were most useful to you when designing Faust?
- What have you found to be the most common areas of confusion for people who are just starting to build an application on top of Faust?
- What has been the most interesting/unexpected/difficult aspects of building and maintaining Faust?
- What have you found to be the most challenging aspects of building streaming applications?
- What was the reason for releasing Faust as an open source project rather than keeping it internal to Robinhood?
- What would be involved in adding support for alternate queue or stream implementations?
- What do you have planned for the future of Faust?
Keep In Touch
Picks
- Tobias
  - Super Troopers 2
- Ask
  - Microsound by Curtis Roads
Links
- Faust
- Robinhood
- Kafka Streams
- RabbitMQ
- AsyncIO
- Celery
- Kafka
- Confluent
- Write-Ahead Log
- RocksDB
- Redis
- Pulsar
- KSQL
- Exactly Once Semantics
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 200 gigabit network, all controlled by a brand new API, you've got everything you need to scale. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute. And visit the site at podcastinit.com to subscribe to the show, sign up for the newsletter, and read the show notes. Your host as usual is Tobias Macey, and today I'm interviewing Ask Solem about Faust, a library for building high performance, high throughput streaming systems in Python. So, Ask, could you start by introducing yourself?
[00:00:53] Unknown:
Hey. My name is Ask Solem. I work at Robinhood on the data team, and I built Faust with Vineet Goel, who also works on the data team, but he's not here. And do you remember how you first got introduced to Python? Yeah. It was actually my friend who was pestering me to learn Python. I was writing Perl at the time, and he was constantly saying how much better Python is. And finally, I gave in and learned it, and he was right.
[00:01:24] Unknown:
So can you start by talking a bit about what the Faust project is and your motivation for building it initially? Yeah. So at Robinhood,
[00:01:32] Unknown:
we are a financial institution. We process market orders and do risk and fraud checks and things like that. And to do that, we have many batch jobs that run in the evening, and there are becoming so many that we were having problems processing them all before the next market day opens. So we needed something to do stream processing with. And we looked at Kafka Streams, but we didn't wanna write Java, because our developers are used to Python.
[00:02:11] Unknown:
So when you first started building it, it sounds like you had settled on Kafka as the data source, and you tried to build an alternate interface on top of that platform. So what were the initial project requirements that led you to choose Kafka as that data source, and the requirements from the developer perspective as far as the interface that you provided to them for being able to consume and process that data? Well, Kafka is already becoming
[00:02:42] Unknown:
very much an industry standard in that regard, but we also have compliance requirements, so everything that we process needs to be logged. And Kafka is perfect for that, because it is a log. So we also keep the history of everything that we have put into it, as compared to something like RabbitMQ, where the data is ephemeral and just disappears.
[00:03:10] Unknown:
Yeah. That makes sense, because particularly with Rabbit, if you were to create those logs, you would have to have another consumer just to persist that information in an alternate storage system. So you might as well get the best of having it all in one interface, rather than having to put together a Rube Goldberg contraption and then have to deal with whether or not the logging consumer is actually still running or if it's dropping things. Yeah. Exactly. And can you talk a bit about how you came up with the name for it? Because it's an interesting name given the literary history.
[00:03:44] Unknown:
Yeah. We were just trying to find names, and we thought that stream processing is kind of like selling your soul to the devil. And that was why we named it Faust. It reminds us that we have to be careful with the data that we store.
[00:04:00] Unknown:
That's very clever. It's good to have a reminder every time you're working with it that you need to be thinking about how you're working with the data and what kinds of information you're keeping. Yeah. Exactly. That was the idea behind it. And so can you discuss the internal architecture of the library itself and some of the components that it consists of? Yeah. So internally,
[00:04:21] Unknown:
the worker is composed of many different small services that are started and stopped in a certain order. And we also have supervisors that handle failures and restarts.
[00:04:35] Unknown:
I don't know if you wanna dig into the asynchronous aspects of it and some of the design challenges that you faced while building it up. Yeah. I mean, the architecture is based on asyncio. So
[00:04:48] Unknown:
it's a library that you can drop into any asyncio program, like your web server or network server or whatever it is that's using asyncio, and it can actually run alongside it just fine. If you think about Celery, you need to start it separately from your web server, but Faust can actually run in the web server itself,
[00:05:09] Unknown:
which I think is kinda cool. Yeah. That definitely simplifies the deployment mechanism, where you don't have to worry about keeping a separate process running
[00:05:17] Unknown:
or whether or not it has access to the right system libraries or the right runtime. Yeah. And you can use it as a database as well, which means that you can basically deploy a full back end service where the only dependency is Kafka.
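To make that concrete, here is a minimal sketch (my own example, not code from the episode; topic and field names are assumptions) of a Faust app that processes a stream, keeps state in a table, and serves an HTTP endpoint from the same worker process:

```python
import faust

app = faust.App('service-demo', broker='kafka://localhost:9092')
orders_topic = app.topic('orders', key_type=str, value_type=str)
order_counts = app.Table('order_counts', default=int)

@app.agent(orders_topic)
async def process(orders):
    # Count orders per account as events stream in from Kafka.
    async for account_id in orders:
        order_counts[account_id] += 1

@app.page('/stats/')
async def stats(web, request):
    # Served by the same asyncio process that runs the stream processor.
    return web.json({'total_orders': sum(order_counts.values())})

if __name__ == '__main__':
    app.main()  # run with: python service.py worker
```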
[00:05:32] Unknown:
And one of the topics that I came across while I was looking through the documentation is the idea of managing consensus and failover with multiple workers consuming from one topic, but having the stream partitioned based on some identifier. So I'm curious what you're using to manage that synchronization between multiple workers, so that you don't have them either processing the same event multiple times or having an issue of split brain?
[00:06:03] Unknown:
So, well, for this, we depend on Kafka pretty much completely. But a partition is never shared between workers. One partition is always handled by one worker at a time, but if that worker goes down, it's assigned to a new worker. So we have a partition assigner that assigns partitions to our workers in a smart way. One property is that if a worker is subscribed to partition 0 of one topic, it needs to be subscribed to partition 0 of all the topics. This ensures that there's a relationship between the source event and any state that you store after processing the event.
[00:06:46] Unknown:
And for anybody who's not familiar, you are the primary author and maintainer of the Celery library as well, which has been around for a very long time and is used in a large variety of contexts. So I'm curious, what are some of the lessons that you learned while building Celery, or some of the design issues that you encountered while trying to adapt it to all of the different use cases that you were considering while you were working on Faust?
[00:07:15] Unknown:
Well, one of the lessons that I took from Celery was how the internals are organized, based on having small services that are started and stopped in a certain order, because that's how Celery works as well. But in this case, we had the opportunity to start out that way, so there are not a lot of backward compatibility issues at this point, and the code is much more readable, I think. But also, there's so much stuff that I learned from writing Celery, like the importance of having stress tests and integration tests. So writing integration tests and stress tests that make sure everything continues to work as we make changes is something that we do a lot for Faust. And in terms of
[00:08:02] Unknown:
the way that Celery was built, it is very adaptable in terms of the queuing mechanisms that you can integrate it with, whereas Faust is primarily focused on Kafka. So I'm wondering if that has helped to simplify some of the overall design and implementation of Faust as compared to Celery, and if you have any plans in the future to add a pluggable option for alternate queuing or streaming mechanisms.
[00:08:30] Unknown:
Yeah. Definitely. I mean, that's just how I write software generally, by keeping it extensible. And you can replace Kafka in Faust very easily. It's actually easier to write something like a RabbitMQ backend for Faust than it is for Celery. We have created an abstraction called the channel, which is basically just a queue that you put into and you get from. So to create an alternate back end, all you need to define is a new channel type. And with Faust, when you're consuming from a topic, does the process keep track of the offset from where it was consuming, so that if the process dies, it can just resume from where it left off? Yes. Exactly. It keeps track of the last committed offset, but that's not in Faust. That happens in Kafka itself.
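As a rough sketch of the channel abstraction he describes (the names here are my own), an agent can consume from a plain in-process channel exactly as it would from a Kafka topic, which is what makes alternate backends a matter of providing a new channel type:

```python
import faust

app = faust.App('channel-demo', broker='kafka://localhost:9092')

# A channel has the same put/get interface as a topic, without Kafka behind it.
channel = app.channel(value_type=str)

@app.agent(channel)
async def consume(stream):
    async for value in stream:
        print(f'received: {value}')

@app.timer(interval=5.0)
async def produce():
    # Send into the agent's channel from anywhere in the process.
    await consume.send(value='ping')
```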
[00:09:20] Unknown:
And so for somebody who is starting a new project, do you think that they would be better suited building on top of Celery, or working with Faust and possibly adding an adapter for a different queuing mechanism that fits their needs? I don't know yet. I think Faust is radically different
[00:09:35] Unknown:
from Celery in many ways, but it also enables a lot more use cases that are difficult to do in Celery. For example, if you wanna have a task that always runs on only one worker, as mutual exclusion, that is easy to do in Faust, but very tricky to do in Celery.
[00:09:54] Unknown:
And one thing that comes to mind that might not be built into Faust is the idea of the scheduled tasks that Celery has supported for a long time, but that can sometimes be problematic when you need to run multiple instances of a Celery worker. So I'm curious what your thoughts are on things like that, or some of the other features that are present in Celery that you don't plan on implementing directly in Faust, or that somebody might be able to layer on top of it? Yeah. So the scheduling,
[00:10:23] Unknown:
then you mean, like, the countdown tasks or the periodic tasks? The periodic tasks, like the cron-style ones. Yeah. I mean, we could build that on Faust pretty easily, I think. In Faust, we have leader elections, so one of the workers can be the leader. So to implement scheduling there, you only need to implement it as a leader task: the leader can send out the tasks on a schedule. And that would also make it highly available, which Celery periodic tasks are not. That sounds very useful. Yeah. I think that is definitely a good use case for Faust.
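A minimal sketch of that leader-task idea (my example; the interval and function are made up): Faust's timer decorator accepts an on_leader flag, so periodic work fires only on the currently elected leader:

```python
import faust

app = faust.App('scheduler-demo', broker='kafka://localhost:9092')

@app.timer(interval=60.0, on_leader=True)
async def send_scheduled_tasks():
    # Runs once per minute, and only on the elected leader worker,
    # so the schedule survives worker failures without duplicate runs.
    print('dispatching scheduled work from the leader')
```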
[00:10:57] Unknown:
And so for people who are just getting started building on top of Faust, are there any particular areas of confusion or difficulty that they come across, particularly when working with the async aspects of it? Well, asyncio is quite
[00:11:10] Unknown:
difficult to learn, I would say, but I don't think people generally have a problem with that. What people have more problems with is the partitioning, and thinking about repartitioning a stream, when that's required and when it's not. I think that is the most confusing element of it. So say you have a stream of bank withdrawals, and it's partitioned by the account ID of the user. That means that whenever you send a withdrawal request, it will always be sent to the same worker. But sometimes you need to repartition the stream by something else, say the country that the user originates from. And, yeah, the concept of partitioning is easy, but it's something that you have to think about and optimize for, for example by creating a sufficient number of partitions in Kafka.
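A sketch of that withdrawals example (the field and topic names are my assumptions): group_by repartitions the stream through a new internal topic keyed by the new field:

```python
import faust

class Withdrawal(faust.Record):
    account_id: str
    country: str
    amount: float

app = faust.App('withdrawals-demo', broker='kafka://localhost:9092')
withdrawals_topic = app.topic('withdrawals', value_type=Withdrawal)

@app.agent(withdrawals_topic)
async def track(withdrawals):
    # The stream arrives partitioned by account ID; group_by re-keys it by
    # country, sending events through a new repartitioned topic.
    async for withdrawal in withdrawals.group_by(Withdrawal.country):
        print(withdrawal.country, withdrawal.amount)
```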
And every topic in Kafka needs to have a configured number of partitions. I think Confluent is working on something that will automatically manage that in the future. But right now, it is a challenge
[00:12:17] Unknown:
that you have to be aware of. And one of the other interesting aspects that you mentioned briefly is that you can embed Faust directly into a web process. And in the documentation, it mentions even being able to interface with things like Flask and Django via either Gevent or Eventlet. So I'm wondering, what are some of the other sort of interesting or unexpected uses of Faust that you have either built or seen people build on top of it? So at Robinhood, we have a new chat feature where you can chat about cryptocurrencies,
[00:12:52] Unknown:
and the back end for this is completely built in Faust. We have a Faust service with a WebSocket server that serves 50,000 concurrent WebSockets running in just two Python processes. This runs alongside Faust, so it is a full web server serving the API for the chat service, and the chat messages go through Faust. We deliver them from Kafka to the WebSockets
[00:13:20] Unknown:
the clients are connected to. And also, you mentioned that Faust has the capacity for storing data as well, using the RocksDB layer. And in terms of a high availability topology, do you have a synchronization mechanism for being able to replicate that information, or do the standbys just process all of the tasks in the same manner as the primary to manage that consistency?
[00:13:48] Unknown:
Yeah. So we have tables. Tables are basically just a Python dictionary that you can update. But whenever you change something in it, we send a message to Kafka. So if the worker dies, it will read the changelog to get back up to date. And we also have standby nodes. So for every table, we have three or more standby nodes that are ready to take over if one of the nodes fails. And so it sounds like it's using Kafka essentially as the distribution mechanism for a write-ahead log, so that the other instances can consume that to rebuild the current state? Yes. Exactly. And we have an in-memory cache, or not in memory, but a RocksDB cache of that state, so accessing it is very fast.
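To make the table/changelog mechanics concrete, here is a small sketch (topic names and setting values are my assumptions): every table update is also published to a changelog topic in Kafka, and standby replicas keep warm copies:

```python
import faust

app = faust.App(
    'table-demo',
    broker='kafka://localhost:9092',
    store='rocksdb://',          # local cache of table state in RocksDB
    table_standby_replicas=2,    # extra workers keeping a hot copy of each table
)
transfers_topic = app.topic('transfers', key_type=str, value_type=float)
balances = app.Table('balances', default=float)

@app.agent(transfers_topic)
async def apply_transfer(transfers):
    async for account_id, amount in transfers.items():
        # This write also goes to the table's changelog topic in Kafka,
        # the write-ahead log a recovering worker replays to rebuild state.
        balances[account_id] += amount
```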
[00:14:40] Unknown:
And so for somebody who is replacing the Kafka implementation with another queuing mechanism, would they potentially lose out on that capability? Or would there be any sort of requirement in terms of the underlying system to have some measure of persistence for being able to manage that write-ahead log? Or is that just largely
[00:14:57] Unknown:
up to the implementer to figure out? You mean if you replace Kafka with something else? Yes. Yeah. I don't think you could replace Kafka with something else in regards to the tables. Well, I'm sure you could, but Right. Not RabbitMQ. Right. And not anything like that. It would have to be a database. At least not without doing a bunch of back flips. Yeah. That would be very tricky. But you can use Faust without the tables as well. It's basically a way to have key-value stores that you create. And so it sounds like if somebody wanted to
[00:15:28] Unknown:
use the tabular aspect without RocksDB, they might be able to layer it on top of something like Redis, where it's shared storage as opposed to per-instance? Yeah. Sure, I guess. But it's not really built for that, and in that case, I would just use Redis directly, I think. And so what have you found to be some of the most challenging or unexpected aspects of building streaming applications
[00:15:54] Unknown:
and building Faust to facilitate those? Well, the most challenging aspect has probably been Kafka, because it is an incredibly tricky technology. The most challenging part is probably the rebalancing. Every time a Kafka consumer node joins or leaves, Kafka needs to stop the world and reassign partitions to a different node. Right? But when this happens, it means that if a worker crashes, it also needs to rebalance. And if one worker crashes and that leads to another worker crashing, then you have a perpetual state of rebalancing, and tracking those errors down has been quite a challenge. So I hope that we can get rid of the centralized nature of Kafka at some point. Have you looked at all into whether Pulsar would fit your use case? I haven't looked at Pulsar directly, but I'm definitely interested in it. But at the same time, we know how to manage
[00:16:47] Unknown:
Kafka here. Right. It is something that we already need to deal with. Once you get to a certain level of institutional knowledge, it's easier to just keep it running than try and replace it, regardless of any benefits that might result from that. Yeah. Exactly.
[00:17:01] Unknown:
And it seems like Kafka is increasingly being used elsewhere as well. Like, I interview a lot of people, and many of the companies here in the Bay Area are using Kafka for stream processing. Right? And one of the additional
[00:17:14] Unknown:
challenges in terms of processing streams of information is that, by their nature, they're unbounded. So sometimes it can be difficult to know when to yield results from a particular
[00:17:33] Unknown:
window. Yeah. The windowing is basically such that you can keep track of the number of visits to a website per hour, or in the last 30 minutes, or whatever. But since we have at-least-once processing, the consistency of that is not always correct.
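The visits-per-hour example he gives maps onto Faust's windowed tables; a minimal sketch (topic and table names assumed):

```python
from datetime import timedelta
import faust

app = faust.App('views-demo', broker='kafka://localhost:9092')
page_views = app.topic('page_views', key_type=str, value_type=str)

# Tumbling one-hour windows, retained for two hours after they close.
views = app.Table('views', default=int).tumbling(
    timedelta(hours=1), expires=timedelta(hours=2),
)

@app.agent(page_views)
async def count_views(stream):
    async for page in stream:
        views[page] += 1                # increment the count in the current window
        print(page, views[page].now())  # read the value for the window around "now"
```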
[00:17:54] Unknown:
I don't know, have you heard of the exactly-once semantics in Kafka Streams? Yeah. It's a difficult problem in a lot of distributed systems, the idea of exactly once. And my understanding is that in most systems that support the idea of exactly-once semantics, it's based on having the computations be idempotent, so that even if you do process the same event multiple times, it'll end up with the same result. Yes. Yes. And that's basically how Kafka Streams
[00:18:19] Unknown:
has solved it. Well, they haven't really solved it, but they are deduplicating the data. We don't have support for that yet, but we will as soon as a Python Kafka client supports Kafka transactions.
[00:18:31] Unknown:
And so you mentioned that Faust is based at least somewhat on the Kafka Streams implementation. So I'm curious if you can talk a bit about the areas that you don't currently support, and if there are any additional features that aren't present in Kafka Streams that you've built into Faust? Yeah. Mostly, it's the exactly-once
[00:18:52] Unknown:
processing that we don't support, but we also don't have stream-to-stream joins yet, and neither do we have table-to-table joins. But those are not very tricky to implement. We just haven't used them ourselves yet. But the idea is that you can join two streams together pretty much as a database does. And there's also KSQL, which is using the SQL language to create streams. I think we have one person working on that, but it's not ready yet. And the Faust project has been released as open source for other people to be able to use and contribute to. So I'm wondering what the motivation was for doing that, rather than keeping
[00:19:32] Unknown:
it as a project internal to Robinhood and just using it within your company. Well, we built this,
[00:19:39] Unknown:
with the idea that it would be open source from the start. So the question was, could we easily implement Kafka Streams in Python? The answer to that was no. Definitely not. Like, you can't just hook into it with a client or anything like that. You need to reinvent all of it. But that's what we wanted to do, so we did it. Yeah. It's definitely a fairly substantial amount of engineering, and it looks like it's got a very
[00:20:06] Unknown:
easy API interface for being able to get up and running with it. So it seems that you've done a good job with building it and architecting it in a way that it should be
[00:20:16] Unknown:
maintainable and usable going forward. Thank you so much. Well, it was with the help of the data team here, and Vineet, who is the co-creator of Faust. And so you talked a bit about
[00:20:28] Unknown:
the pluggable nature of Faust for being able to support alternate implementations of queues or streams aside from Kafka. So I'm wondering, what are some of the other areas of pluggability for somebody who wants to swap out different pieces of Faust for their own purposes?
[00:20:45] Unknown:
Oh, absolutely. Everything is pluggable.
[00:20:47] Unknown:
I would say, like, even the partition assigner or the tables, absolutely everything in Faust is replaceable and extendable. I guess if you wanna go a bit deeper on the way that you implement pluggability, because I've seen it done a few different ways with different projects in different languages, and it's something that I've always been fairly fascinated with, in terms of how you build these areas for being able to drop in different implementations for large swaths of the code. Yeah. I mean, that is quite a big subject,
[00:21:17] Unknown:
but what I usually do is that I write methods that are short, so that you can easily replace any method. And I usually make anything that the class references, such as a class that it uses, replaceable. If an order has an account, for example, the class used for that account is referenced so that you can replace it with a different class. And then with things like the partition assigner or the stream, I leave the ability open to use a string configuration variable to define the class, so you can replace it with your own implementation.
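A small sketch of that string-configuration style (the module path and subclass are hypothetical): Faust settings accept either a class or a dotted path to one, resolved through the import system:

```python
import faust

# myproj/streams.py would define the hypothetical subclass:
#
#     class LoggingStream(faust.Stream):
#         """A Stream subclass that adds logging around iteration."""

app = faust.App(
    'pluggable-demo',
    broker='kafka://localhost:9092',
    Stream='myproj.streams.LoggingStream',  # dotted path instead of a class object
)
```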
[00:21:55] Unknown:
And so when the process is loading, it's pulling in these different stringified namespaces for the class implementation, and relying on the import system for being able to override some of the built-in implementations?
[00:22:09] Unknown:
Yes. Exactly.
[00:22:11] Unknown:
And so are there any upcoming projects that you have planned for building on top of Faust, or interesting architectural or design challenges that you're hoping to be able to leverage Faust for implementing? Well, yeah. We are using it internally
[00:22:30] Unknown:
for now, for writing general back end services, I think. I can't talk about the services themselves. Yeah. Absolutely. Basically, at Robinhood, we have an iOS client and an Android client that are communicating with our back end services, and the website is also communicating with the back end services. So we are basically writing small microservices that those different platforms are using.
[00:23:01] Unknown:
And so going back to the lessons learned from Celery that you have brought forward into Faust, it seems that one thing that helps to ease the on-ramp of getting started with Faust is the fact that it's just pure Python, and there aren't any DSL layers or special syntax needed for being able to hook into its capabilities. It looks like it's mainly just a couple of different decorators to wrap your functions in Faust to have them executed within that runtime, versus some of the more deeply reaching aspects of Celery for hooking into the task system there. Oh, yeah. I think that is absolutely
[00:23:39] Unknown:
an important subject, because what makes Faust similar to Celery is that even though we ported Kafka Streams, you can use it as a massive message-passing system, basically. So we have implemented actors that are doing stream processing. So in Faust, we have the agents. Every agent processes a stream using an async iterator. And you can also have a concurrent agent, so one agent has 1,000 instances, for example. That means you're processing the stream concurrently, not necessarily in parallel. And when you start this agent on multiple machines, it turns into a distributed system that you can send messages to. And in that way, it is closely related to the actor model of concurrency, which Celery is also similar to. But Kafka Streams does not take advantage of this at all.
It's just a DSL that you use for processing data. You can't really call into Java or call other functions or stuff like that. But with Faust, you can actually perform web requests while you're processing the stream.
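As a sketch of what that looks like in practice (the topic name and request logic are my own), an agent can run many concurrent instances and perform arbitrary async work, such as HTTP requests, while iterating the stream:

```python
import aiohttp
import faust

app = faust.App('fetch-demo', broker='kafka://localhost:9092')
urls_topic = app.topic('urls', value_type=str)

# concurrency=100 starts 100 instances of this agent in a single worker,
# processing the stream concurrently (not in parallel).
@app.agent(urls_topic, concurrency=100)
async def fetch(urls):
    async with aiohttp.ClientSession() as session:
        async for url in urls:
            # Plain Python mid-stream: here, an HTTP request per event.
            async with session.get(url) as response:
                print(url, response.status)
```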
[00:24:54] Unknown:
Yeah. That definitely lends a lot of power, that you can just do arbitrary operations on these events because it's just pure Python, rather than being restricted to a subset of capabilities as defined by the library. Yeah. Exactly.
[00:25:08] Unknown:
And then you can also use PyTorch and TensorFlow and all the cool things that engineers are using.
[00:25:16] Unknown:
Right. And also, given the tabular capabilities for being able to run aggregate operations, and then bringing in machine learning libraries, I can see where that would lend a lot of potential power to anybody who wants to process unbounded streams of data. Yeah. I don't even know how it can be used yet.
[00:25:33] Unknown:
But, you know, people are using it in surprising ways,
which I like. It's basically just a pattern that... And do you have any particular plans for new features or improvements that are coming in future releases of Faust?
which I like. It's basically just a a pattern that And do you have any particular plans for new features or improvements that are coming in future releases of Faust?
[00:25:48] Unknown:
The thing we want the most is the exactly-once processing, which is not really exactly once, but it is for the purposes of tables and windowing. So you won't get duplicate counts, for example. Then we could actually do accounting in Faust, which excites me. And are there any other aspects
[00:26:10] Unknown:
of Faust or stream processing, or your work with Celery and how it relates to Faust, that we didn't discuss yet which you think we should cover before we close out the show? Well, yeah, maybe
[00:26:22] Unknown:
the testing part is maybe interesting. From Celery, I know it's a huge project with lots of users, but whenever we change something, we usually break something for someone. Even though we have 100% coverage, we always break something. So the way that we've done integration tests in Faust is that we have a cloud cluster of Faust workers that are constantly processing small parts of the actual production apps that we have in production at Robinhood. So if you break something, we almost immediately notice it, which I think is great.
[00:27:13] Unknown:
Yeah. So for anybody who wants to get in touch with you and follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose the movie Super Troopers 2, which I watched recently, because it was hilarious. It's not very suitable for younger kids, but it's hilarious and worth a watch. I'm just gonna leave it at that. And so with that, I'll pass it to you, Ask. Do you have any picks this week? Well, I have a book.
[00:27:44] Unknown:
It's called Microsound, and it's by Curtis Roads. It is probably very different from the movie that you mentioned, but it's very interesting. It's about finding small sections of sounds within sounds.
[00:28:02] Unknown:
That sounds fascinating. Mhmm. Alright. Well, thank you very much for taking the time today to join me and talk about your work on Faust. It's a very interesting library, and one that I hope to start leveraging for my own work in the near future. So thank you for that, and I hope you enjoy the rest of your day. Thank you so much. You too.
Introduction and Guest Introduction
Ask Solem's Journey to Python
Overview of Faust and Its Motivation
Choosing Kafka and Initial Requirements
Naming Faust and Its Significance
Internal Architecture of Faust
Managing Consensus and Failover
Lessons from Celery Applied to Faust
Extensibility and Future Plans for Faust
Periodic Tasks and Leader Elections
Challenges with Async IO and Partitioning
Interesting Uses of Faust
High Availability and Synchronization
Replacing Kafka and Alternative Implementations
Challenges in Building Streaming Applications
Handling Unbounded Streams and Windowing
Comparison with Kafka Streams
Open Sourcing Faust
Pluggability in Faust
Upcoming Projects and Design Challenges
Ease of Use and API Design
Future Features and Improvements
Testing and Integration
Contact Information and Picks