Summary
Analysis of streaming data in real time has long been the domain of big data frameworks, predominantly written in Java. Taking advantage of those capabilities from Python requires client libraries that suffer from impedance mismatches, which make the work harder than necessary. Bytewax is a new open source platform for writing stream processing applications in pure Python that don’t have to be translated into foreign idioms. In this episode Bytewax founder Zander Matheson explains how the system works and how to get started with it today.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. And now you can launch a managed MySQL, Postgres, or Mongo database cluster in minutes to keep your critical data safe with automated backups and failover. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with a fully automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your dbt, Snowflake, Tableau, Looker, or whatever you’re using and Select Star will set everything up in just a few hours. Go to pythonpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
- Need to automate your Python code in the cloud? Want to avoid the hassle of setting up and maintaining infrastructure? Shipyard is the premier orchestration platform built to help you quickly launch, monitor, and share python workflows in a matter of minutes with 0 changes to your code. Shipyard provides powerful features like webhooks, error-handling, monitoring, automatic containerization, syncing with Github, and more. Plus, it comes with over 70 open-source, low-code templates to help you quickly build solutions with the tools you already use. Go to dataengineeringpodcast.com/shipyard to get started automating with a free developer plan today!
- Your host as usual is Tobias Macey and today I’m interviewing Zander Matheson about Bytewax, an open source Python framework for building highly scalable dataflows to process ANY data stream.
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what Bytewax is and the story behind it?
- Who are the target users for Bytewax?
- What is the problem that you are trying to solve with Bytewax?
- What are the alternative systems/architectures that you might replace with Bytewax?
- Can you describe how Bytewax is implemented?
- What are the benefits of Timely Dataflow as a core building block for a system like Bytewax?
- How have the design and goals of the project changed/evolved since you first started working on it?
- What are the axes available for scaling Bytewax execution?
- How have you approached the design of the Bytewax API to make it accessible to a broader audience?
- Can you describe what is involved in building a project with Bytewax?
- What are some of the stream processing concepts that engineers are likely to run up against as they are experimenting and designing their code?
- What is your motivation for providing the core technology of your business as an open source engine?
- How are you approaching the balance of project governance and sustainability with opportunities for commercialization?
- What are the most interesting, innovative, or unexpected ways that you have seen Bytewax used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Bytewax?
- When is Bytewax the wrong choice?
- What do you have planned for the future of Bytewax?
Keep In Touch
Picks
- Tobias
- Zander
Links
- Bytewax
- Flink
- Spark Streaming
- Kafka Connect
- Faust
- Ray
- Dask
- Timely Dataflow
- PyO3
- Materialize
- HyperLogLog
- Python River Library
- Shannon Entropy Calculation
- The blog post using incremental shannon entropy
- NATS
- waxctl
- Prometheus
- Grafana
- Streamz
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. So now your modern data stack is set up. How is everyone going to find the data they need and understand it? Select Star is a data discovery platform that automatically analyzes and documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who's using it in the company, and how they're using it, all the way down to the SQL queries. Best of all, it's simple to set up and easy for both engineering and operations teams to use.
With Select Star's data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial at pythonpodcast.com/selectstar. You'll also get a swag package when you continue on a paid plan. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, and dedicated CPU and GPU instances.
And now you can launch a managed MySQL, Postgres, or Mongo database cluster in minutes to keep your critical data safe with automated backups and failover. Go to pythonpodcast.com/linode today to get a $100 credit to try out their new database service, and don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey, and today I'm interviewing Zander Matheson about Bytewax, an open source Python framework for building highly scalable dataflows to process any data stream. So, Zander, can you start by introducing yourself? Sure. Thanks for having me on, Tobias. I'm Zander. I'm the founder of Bytewax.
[00:01:59] Unknown:
Yeah. I'm stoked to be here today. And do you remember how you first got introduced to Python? Yeah. It was actually during business school. I have, like, a bit of a nontraditional background, maybe. I was a civil engineer, and I went to business school. And in business school, I took a Python CS class. I was always kind of into, you know, messing around with computers in one way or another, but that's how I got introduced to Python, and I really enjoyed it. And after business school, I tried to create a career where I could continue to use programming, and that's how I ended up in data science, actually.
[00:02:32] Unknown:
In terms of the Bytewax project, I'm wondering if you can describe a bit about what it is and some of the story behind how it came to be and why you decided that this is where you wanted to spend your time and energy.
[00:02:42] Unknown:
Yeah. Sure. So Bytewax is a Python library, but it's also a company by the same name. And I started Bytewax. I was at GitHub actually before starting Bytewax. Left GitHub to start something in the machine learning space. I was working on machine learning infrastructure at GitHub, and so actually, the first iteration of what we were working on was a machine learning platform. And we pivoted the company from that to working on an open source project and focused on the stream processing library. And, I mean, the reasoning behind that was it was somewhat of a convoluted space, the hosted machine learning model space. We didn't feel it was meeting as big of a need as we first anticipated, and we also kept running into people who wanted to be able to build data processing against data streams in Python. We had a good start to that, so we kept working on Bytewax as the open source project.
[00:03:38] Unknown:
And as far as the overall use cases for it and the end users that you have in mind as you're building the project and iterating on what the commercial road map looks like, who are the target users of the Bytewax library and framework, and what are the specific problem areas that you're trying to address for those users?
[00:04:00] Unknown:
To be honest, we keep learning about more use cases. But I'll start with the users. When we've been developing Bytewax, the users we have in mind are data engineers and data scientists or machine learning engineers. Quickly after we announced the project, we had a lot of interest from the machine learning community to use this for feature transformations. So oftentimes in machine learning, you train a model on some large feature set that you create, then you want to run some inference in real time, and you need to be able to do whatever you did to the data in the training step to do that in the inference step. And so there seemed to be a nice overlap in abilities and requirements there. Coming back to the use cases, it is a generalized data processing framework, but what we are seeing as great use cases are complex analysis on streams of data. So that's your, like, more machine learning esque type things. But in addition, people are using Bytewax for moving data around and doing small transformations. So maybe you're moving data from Kafka to Snowflake.
You can use a Kafka connector, but let's say that you actually wanna augment that data in some way before you move it to Snowflake. Say you wanna join it with some data from another database or turn an IP address into a geolocation or something like that. So instead of needing to do that in two hops, you know, one for the augmentation then back to Kafka and then another one using Kafka Connect, you could do that in one step, which is a slightly different use case than initially intended, but I think it's awesome that people found the utility there.
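The enrichment step described above can be sketched as a plain Python map function. This is a toy illustration, not Bytewax code: the `GEO_TABLE` lookup and the event shape are hypothetical stand-ins for whatever GeoIP database or service you would actually query before sinking to Snowflake.

```python
# Hypothetical in-memory geolocation table; a real pipeline would
# query a GeoIP database or an external service here.
GEO_TABLE = {
    "203.0.113.0": {"country": "US", "city": "Seattle"},
    "198.51.100.7": {"country": "DE", "city": "Berlin"},
}

def enrich_with_geo(event):
    """Map step: attach a geolocation to an event based on its IP."""
    geo = GEO_TABLE.get(event.get("ip"), {"country": "unknown", "city": "unknown"})
    # Return a new dict rather than mutating the input record.
    return {**event, "geo": geo}

record = {"user": "a1", "ip": "203.0.113.0"}
print(enrich_with_geo(record)["geo"]["city"])  # Seattle
```

Running a function like this inside the stream is what collapses the two-hop augment-then-connect pattern into a single step.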
[00:05:38] Unknown:
And in terms of the ways that people have been solving this problem up till now, you mentioned one in the Kafka use case, where you might be using, you know, Kafka Connect, or you might be using Kafka Streams, or something like the Faust project from Robinhood to be able to pull the data out from a topic, manipulate it, write it back into another topic, and then use Kafka Connect to write it out into the destination system. In the kind of stream processing world, there are a whole bunch of options out there. I'm just wondering, as you've seen people starting to adopt Bytewax, what are some of the ways that they're factoring that into their existing architectures? Or for the case where maybe they're doing a greenfield project with it, what are some of the alternative structures that they were considering before they settled on Bytewax as the solution to their problem?
[00:06:26] Unknown:
Yeah. I think you covered most of the alternative solutions there. Yeah. You have Flink, Spark Structured Streaming. In the Kafka ecosystem, you have Kafka Streams, Kafka Connect. And those are largely a heavy lift for operations and, you know, getting started, and they're mostly Java focused, so there's extra overhead for Python folks to adopt that Java based ecosystem of tools. As far as the architecture where you would potentially replace this with Bytewax, or where it would be a good place to get started with Bytewax, oftentimes, as a Python developer, you wanna consume some data from Kafka. It is fairly easy to take the Python Kafka library or the Confluent Kafka library and start consuming data and processing it. But very quickly, when you start to work with partitioned data and you need to aggregate things or window things, it becomes increasingly difficult, or if you want to scale your consumer group. And that's where people have reached for Faust in the past in Python. I would say this is a good place to introduce Bytewax, because you can write your dataflows to do aggregations and windowing, and you don't have to worry about the complexity of managing the multiprocessing and exchanging data across the processes.
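To make the windowing and aggregation concepts concrete, here is a toy, in-memory version of a keyed tumbling-window count in plain Python. This deliberately ignores everything that makes the real problem hard, such as out-of-order data, state recovery, and scaling across workers, which is exactly the complexity a framework like Bytewax is meant to abstract away.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Count events per key inside fixed-size (tumbling) windows.

    events: iterable of (timestamp, key) pairs.
    Returns {(window_start, key): count}.
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Each timestamp falls into exactly one non-overlapping window.
        window_start = (ts // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(0, "a"), (3, "b"), (7, "a"), (12, "a"), (14, "b")]
print(tumbling_window_counts(events, 10))
# {(0, 'a'): 2, (0, 'b'): 1, (10, 'a'): 1, (10, 'b'): 1}
```

Hand-rolling this over a partitioned Kafka consumer group, with state that survives restarts, is where the difficulty the answer describes really begins.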
[00:07:45] Unknown:
All of that kind of is abstracted away. In the Python ecosystem, a couple of other projects that people might be using for adjacent use cases are things like Dask or Ray. And I'm wondering how you think about the ways that Bytewax either complements those tools and their ecosystems or some of the ways that Bytewax might be an alternative to them for specific use cases.
[00:08:06] Unknown:
Yeah. I think those are both, like, really great tools, and they have a lot of integrations built in, you know, the data frame library, or they're very Python esque and great for batch processing or handling scaling across multiple machines when you wanna scale out your ability to respond to API requests. I think they're great tools. Where we're focused is more on streaming data, and part of this is due to what we leverage underneath. Underneath Bytewax is a project called Timely Dataflow, which is a Rust library. And Timely Dataflow uses a dataflow programming model. So when you write a Bytewax dataflow and you have all these operations that are sequential, those all happen on one worker. And that lends itself well to parallelizing across streams, because you can have the input and output parallelized, and you don't need to have the same orchestration or scheduling going on that you might have with a centralized execution framework. And so that's kind of where we see the complementary systems: one for, you know, that kind of parallelized IO situation, and another one, Dask and Ray, etcetera, is great for when you wanna farm out one thing to a bunch of workers and have it executed.
[00:09:21] Unknown:
As far as the Bytewax project itself, you mentioned a bit about the fact that it's relying on Timely Dataflow as its core kind of data management layer. I'm wondering if you can just talk a bit more about some of the architectural and implementation details of Bytewax and some of the ways that it was designed to be able to address some of these data processing and data manipulation and scalable data workflows.
[00:09:45] Unknown:
Yeah. I have to also mention, when we talk about the implementation of Bytewax, PyO3. It's a framework for bridging Python and Rust, and it's a really great project. And, you know, for any Python developer who's looking for performance benefits, I think it's a really good place to look. You know, we got a lot of the capabilities of Bytewax from Timely Dataflow. Timely Dataflow is a project, or the library itself, that was authored by someone whose name is Frank McSherry, and I won't, like, try to speak exactly to it. I'll let him do that. There's a lot of great material about Timely Dataflow that he's written and recorded. He works on the Materialize streaming database.
That's the company he founded. And with Timely Dataflow, it was out of Microsoft Research that they came up with this idea to kind of change the execution model. And I think that there's a good paper on it that I won't try to summarize here, that Frank wrote, called Scalability! But at what COST? It just talks about how there's so much overhead with these distributed systems sometimes that until you reach a certain scale, you don't actually get the benefits of the parallelization, because of, like, network overhead or serialization overhead. And using that dataflow programming model, they were able to, I think he uses a single core machine to match the capabilities of, like, a hundred Spark nodes or something like that. So Timely Dataflow gives us a lot of benefit there for performance. And also, the library has all these operators built in. So it manages the communication between the workers, spinning up the workers, etcetera, and some of the operators. So the work we've done is turning that into something that is less academic and more that can be used in production. So working on persisting state in a way that the dataflows can be recovered, adding integrations like with Kafka, etcetera, and other semantics that work for the Python library.
[00:11:45] Unknown:
With Timely Dataflow as the core engine for being able to manage some of that data distribution and communication across the different processes and workers as you scale out horizontally, I imagine that for being able to actually bring that data in, initially, you're relying a lot on some of the ecosystem of Python tooling for database connectors and client libraries for different data systems. I'm curious how you have worked through maybe some of the scalability considerations around that, where you might be bottlenecked on IO and figuring out how do I now, you know, segment these data loads and parallelize across these workers, and just some of the coordination that is involved in actually managing the ingress and egress of data into the system until you hand it off to Timely Dataflow to manage some of the, you know, internal operations of data distribution as you're processing it.
[00:12:39] Unknown:
So the nice thing about using PyO3 and having access to writing things in Rust is we can move some of those down to the Rust layer. So as an example, we wrote a Kafka input config helper. So it's essentially like pushing the Kafka integration to the Rust layer, and that helps the IO. You know, there's gonna be many opportunities, I think, for moving things to the Rust layer and then providing a binding to the Python layer to help with performance.
[00:13:13] Unknown:
The other interesting element of the Bytewax project is that you are exposing it as this Python layer, and so there are a lot of design considerations that go into: how do I make this idiomatic for people who are working in that Python space? How do I bring in appropriate abstractions for some of the data considerations, being able to work with streaming environments, you know, how do I understand what size windows to use for being able to manage these, you know, streaming partitions? And just some of the overall design aspects of figuring out how do I actually take this very complicated but valuable space and map it into something that is accessible for people who are operating across different ranges of experience and different backgrounds, and provide sort of a cohesive experience, so that somebody who is working in the application development ecosystem can understand it and make use of it, and also be able to hand that off to a data engineer or a machine learning engineer and be able to map those same concepts into a shared problem space?
[00:14:18] Unknown:
Yeah. That is, like, the crux of designing the right API and making something usable that has the right key abstractions. I mean, we're kind of operating under the premise that, okay, we're gonna screw this up at least once, probably many times, and we'll have to continue to iterate until we find that right abstraction layer. Yeah. There are certain concepts that exist in Timely Dataflow that, as ported into Python, are maybe not natural for most of us, for all of us, for me, around this aspect of time that we call epochs. This project is constantly being changed right now, and we're trying to figure out what is the best or the happy medium. But that's one area where it's a good example of, like, how much do we expose to the user? How much do we abstract? And right now, it's very low level.
So a user can, you know, be in total control of how data is processed in the dataflow. You know, that's what the time, in the sense of Timely Dataflow, or our epochs, are used for. And whether or not we'll keep that low level, or hide it and still expose it in some way, shape, or form, those are sort of the things we're working through. Because right now, you know, it's so open ended. You can connect to anything as long as there's a Python library to do that, and you can make windows using epochs. And then you can be in control of advancing the dataflow and emitting a new epoch. And all that can get really confusing, and so, you know, we continue operating like, let's give some of the lower level access. We'll try abstractions. And when they don't work, we'll go back to the drawing board and then work on them again. And one place we've been kind of working on this is on the inputs.
Providing tumbling windows or other windowing semantics or ordering semantics, and how those work and how users will use them, I think that's probably one of the hardest parts of marrying up the ideas in Timely Dataflow and what Python developers are used to. It looks like a functional programming paradigm, and it's different. Ray uses decorators, and, like, what we're used to as Python developers maybe isn't totally translated to Bytewax, so there's some learning curve there. And I think that's partially because we're kind of exposing that from Timely Dataflow. I'm really interested to see how that changes.
Our, like, method right now is, you know, put something out there and learn from it, and then try and iterate on it to make it more user friendly and a better, like, broader experience when it comes to the API.
[00:16:57] Unknown:
Another interesting element of the specific semantics of streaming data that come in when you're trying to process it as it traverses the system and as it kind of flies by is some of the ways that you have to think about the algorithmic aspects differently. So one of the canonical examples that I always think about in this space is if you're just trying to count something. If you have a batch approach, you have all of the data available for a given time distribution. You can just count across all of it, and then you're done. But as soon as you start having to manage counts across different, you know, events as they traverse the system, and they potentially come in out of order, you know, then you start having to think about, okay, I'm going to use a HyperLogLog so I can take an approximate count over this time distribution.
And then, eventually, I'm going to be accepting this sort of margin of error in the actual specific value at a given point in time. And I'm wondering how you've been thinking about how much of those algorithms and considerations to build into Bytewax natively versus building in recommendations of: if this is the type of thing that you're trying to do, here is a library that implements this well, you can pull this into your workflow.
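For readers unfamiliar with the HyperLogLog mentioned here, the sketch below is a compact textbook-style implementation of the idea, not anything from Bytewax: distinct items are hashed, a few bits pick a register, and each register remembers the longest run of leading zeros it has seen, from which the number of distinct items can be estimated within a few percent using constant memory.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=8):
        self.p = p
        self.m = 1 << p              # 2**p registers
        self.registers = [0] * self.m

    def add(self, item):
        # Stable 64-bit hash of the item.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)     # first p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # Small-range (linear counting) correction.
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return raw
```

With `p=8` (256 registers, about 256 bytes of state), the estimate for a stream of a thousand distinct keys typically lands within roughly 6 or 7 percent of the true count, which is the accuracy-for-memory trade the question is describing.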
[00:18:08] Unknown:
I've been writing a bunch of, like, blog posts on how to use Bytewax and some tutorials and trying to incorporate other libraries. And there's one library in the Python ecosystem called River that is a really great library, and they have a ton of these algorithms that are for working on streams, both ML and other algorithms. And I think right now, we lean on tools like that. But we have come across some instances where it would be great to have a standard library sort of support in Bytewax for some of these algorithms. Just at the end of June, I was working on a blog post that would look at network data and try to detect if something was funky. And I wanted to use the entropy approach.
It was stuff I didn't really know about. So when I was doing research, I found this entropy approach, the Shannon entropy calculation, to kind of score whether things were random or not random. I found, like, one paper that had an incremental algorithm that would calculate Shannon entropy. And I kind of reformatted that to work with Bytewax, and it was further reinforcing that I think we should probably add a standard library where you can do things like that. Sliding HyperLogLog is another one. I have a tab open right now with a paper that has sliding HyperLogLog, and I would like to make a tutorial or a blog post; that would be another instance when it would be great to have it in there. The Apriori algorithm is another one. All of these incremental versions of the static or batch compute versions would be great to include, I think, at some point. We have a lot to get right first before adding that would be more of a value add, I think.
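An incremental Shannon entropy calculator in the spirit described here can be quite small. This is a generic sketch, not the algorithm from the blog post: it keeps a running sum S = Σ cᵢ·log₂(cᵢ) alongside the total count n, so each event updates the entropy H = log₂(n) − S/n in O(1) instead of recomputing over all counts.

```python
import math
from collections import defaultdict

class StreamingEntropy:
    """Incrementally maintain Shannon entropy of a symbol stream."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.n = 0        # total symbols seen
        self.s = 0.0      # running sum of c_i * log2(c_i)

    def update(self, symbol):
        c = self.counts[symbol]
        if c:                          # remove the stale c*log2(c) term
            self.s -= c * math.log2(c)
        c += 1
        self.counts[symbol] = c
        self.s += c * math.log2(c)     # add the refreshed term
        self.n += 1

    def entropy(self):
        if self.n == 0:
            return 0.0
        return math.log2(self.n) - self.s / self.n
```

Feeding the stream `"aabb"` gives 1.0 bit (two equally likely symbols), while a constant stream gives 0.0; a sudden jump in entropy over network field values is the kind of "something funky" signal the blog post was after.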
[00:19:48] Unknown:
As far as the kind of module design as well, there's also the question of: do I offer this as a standard library built into the Bytewax package? Do I create this as an add on library that is included by default in a Bytewax distribution but is available to other users in the ecosystem as a standalone package? It's always interesting to see sort of where those dividing lines end up making sense. Yeah. We have a little bit of a natural one if we implement it in the Rust layer. Bytewax is like a binary.
[00:20:18] Unknown:
So if it's written in the Rust layer, it'll probably be easier, well, it will be packaged up with Bytewax. Yeah. That's interesting too. We thought about that when we were doing input and output too. It's like, okay, Bytewax IO could be a separate thing, and Bytewax Algo could be a separate package, or it could be part of it. And I think leading the decision is where it happens right now: at the Rust layer or at the Python layer.
[00:20:46] Unknown:
Need to automate your Python code in the cloud? Want to avoid the hassle of setting up and maintaining infrastructure? Shipyard is the premier orchestration platform built to help you quickly launch, monitor, and share Python workflows in a matter of minutes with 0 changes to your code. Shipyard provides powerful features like webhooks, error handling, monitoring, automatic containerization, syncing with GitHub, and more. Plus, it comes with over 70 open source, low code templates to help you quickly build solutions with the tools you already use. Go to pythonpodcast.com/shipyard today to get started automating with a free developer plan.
Going back to some of the kind of evolution of the project, you mentioned that your initial goal was actually to build a machine learning platform, and then you ended up shifting direction into where you are now. I'm wondering if you can talk to some of the sort of early ideas and goals that you had and how you ended up moving in this direction instead and just some of the ways that the project, as it stands now, has evolved from some of those early ideas and experiments.
[00:21:49] Unknown:
So when we set out to build a machine learning platform, a serverless machine learning platform, we wanted to enable smaller teams to have the equivalent infrastructure of, you know, some of what the larger companies have, teams that might not have the operational capacity or the people to manage this. We also wanted to provide a way for people to manage real time inference, including pre and post processing. So what we developed was, like, a beta version of this, and what we tried out with some, you know, early customers was we built a Python SDK where you could decorate Python functions. And then when you deploy them, they would be built into a graph.
So it was sort of like this Python based directed graph for real time processing, but it was leveraging a queuing system called NATS that runs on Kubernetes, and then we would run the steps, the individual processes, as pods on top of Kubernetes. And this was all hosted. And we quickly learned from that that we would need to add an additional state layer to really provide what people wanted to do with some of this processing that was coming through the pipeline, and, you know, that was part of the impetus. But with respect to going open source, what we found was in these smaller companies, I guess, wrongly, we made the assumption that they wouldn't have the same requirements for data privacy and security, and they would be more open to using a hosted platform, because the value add of being able to get going quickly and not having to maintain it would outweigh security concerns.
And that wasn't the case. So there was a lot of friction there. We still wanted people to be able to run Bytewax as easily as possible. And so we ended up going the open source route, and we've added this, like, deployment tool called waxctl that allows you to easily deploy Bytewax on most of the clouds, on a VM, or on top of Kubernetes. So we're kind of doing as much as we can to allow people to run this in their own networks if they need to, you know, check those data privacy and security boxes, but still be able to try it out with very little additional operational complexity.
[00:24:12] Unknown:
And so as far as the core Bytewax package, you mentioned that it's able to scale these workflows across multiple processes, and I know that it can also scale across multiple machines. I'm wondering if you can just talk about some of the axes that are available for being able to manage some of those scaling operations, both in terms of, you know, multithreading versus multiprocess versus multi-machine, scaling up in terms of sort of instance sizing or scaling horizontally as far as number of cores, and just some of the ways that that problem manifests as you work through some of these different data processing workflows and some of the requirements that are involved as far as accuracy versus latency, etcetera?
[00:24:55] Unknown:
Bytewax has two dimensions. There are workers, which are threads, and then there are processes, which are processes that you can run a dataflow and scale it across. And the way that, I mean, I often talk about it, just depends on what your workload is bound to. So if you're processing things and you require a lot of CPU, then you're probably gonna want to parallelize it with more processes and fewer threads. If it's more just IO bound, like lifting and shifting, then you can probably get away with using threads. It'll save some of the performance knock of moving things around across processes. In terms of scaling the machines, we actually have some work to do here. So right now, we're working on this state recovery across multiple workers. And a next step on that is how you would add a single new worker or a single new process to a dataflow, so you can scale the number of parallel dataflows running. You know, it's not a trivial problem to solve, because you have to reshuffle state across these machines. But once we've solved the ability to add another worker or process and we have state recovery, then you should be able to have the ability to, you know, change the node, change a machine type, or increase the amount of CPU or memory and scale that way as well.
[00:26:18] Unknown:
For people who are building on top of Bytewax, can you talk through some of the overall design and development workflow of actually building a solution with Bytewax for being able to manage some of these data processing use cases? Just sort of going from, I have this idea or I have this requirement of, I need to, you know, move this data, perform this transformation on it, or, you know, continuously manage these transformations, and then actually implementing that in the API that's exposed and getting that deployed onto a running system?
[00:26:49] Unknown:
I think this is where using Timely Dataflow and Bytewax is a great experience, because you can run it locally really easily. You don't have to have a centralized cluster to send your work to. So getting started looks like, you know, pip install bytewax, whatever your required credentials are to connect to your IO or your input and output sources, and you can write a Python file with your dataflow using the operators that are available to you. And then you can just run it locally as a Python file, python my_file.py, with three workers or whatever it might be. So that's great because you can run it locally, and you can also write tests.
So as long as you can either connect to some staging environment for your input and output sources, or recreate them in some way, shape, or form, you can write tests that you can include in your CI/CD process. The next step is deployment, and that's where we're continually working on waxctl, "wax control". That's the tool that allows you to deploy on Kubernetes or on a single VM in the cloud. The next step is how you configure that to work with whatever system you're using for version control, or whatever you're using in that sense, and then you'll deploy your dataflow. The next part of that process, observability and so on, is kind of open ended today. Whether you're using Prometheus and Grafana or you have some other tool, that is left to the developer's imagination today.
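Since dataflow steps are ordinary Python callables, one low-friction way to get tests into CI/CD without touching real input or output systems is to unit-test the transformation functions on their own. The `mask_email` step below is hypothetical, not part of Bytewax; it just shows the pattern.

```python
# A hypothetical dataflow step, tested in isolation from any
# running flow or external system.
def mask_email(record):
    """Keep only the domain of the user's email address."""
    user = dict(record)
    _, _, domain = user["email"].partition("@")
    user["email"] = f"***@{domain}" if domain else "***"
    return user

def test_mask_email_keeps_domain():
    out = mask_email({"id": 1, "email": "alice@example.com"})
    assert out["email"] == "***@example.com"
    assert out["id"] == 1  # other fields untouched

def test_mask_email_handles_missing_at_sign():
    assert mask_email({"id": 2, "email": "bogus"})["email"] == "***"

# Run the checks directly; pytest would discover them by name.
test_mask_email_keeps_domain()
test_mask_email_handles_missing_at_sign()
print("ok")
```

Integration tests against a staging Kafka topic or database, as described above, then only need to cover the wiring, not the logic.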
But we'll hopefully learn as we go there and be able to add some additional tooling or some integrations to help with that next step of observability.
[00:28:38] Unknown:
In terms of actually building in testing or quality checks, what are some of the considerations, beyond just standard Python practices, for being able to validate the work that you're doing with a Bytewax workflow? And what are some ways to think about scaling the logical complexity as you move across multiple different operations on a particular unit of data as it goes from source through to destination?
[00:29:06] Unknown:
So: Python best practices for writing tests and writing your code. By design, you can't really shoot yourself in the foot with scaling and having the right data on the right worker, because when there's an operator that is stateful, like stateful map or reduce epoch, you're required to have your data in the format of a tuple with a key and then the data. That key is used for exchanging data across workers. Because of that, as you scale from testing to production, you're somewhat protected from ending up with the wrong data on the wrong workers. It could still happen potentially, but if you're aggregating, you won't. I think that's nice in terms of the scalability story. Once again, this is partly Timely Dataflow; it's great for that scalability story of running locally and then scaling it across multiple processes
[00:30:10] Unknown:
or workers. And then there are the organizational aspects of the code. Say I start with my initial implementation: I have a user record that is being processed, and in my first operation I want to obfuscate the email address so it doesn't flow through into the destination system. Say I erase everything before the at sign, so all I know is the domains that people are using for their email. And then, okay, in the next step I actually need to check whether there is any Social Security information in this user record. As you start to add in different requirements and different operations, what are the best practices or common approaches that you've seen people use to manage the logical complexity of these transformations as they scale beyond one, into a network of multiple transformations where there might be forking and merging of different operations as data goes from point A to point B?
[00:31:16] Unknown:
That is probably one of the complexities people will face when building dataflows: what the data looks like at each step. It's something we all already deal with when working with data: what does the data actually look like right now? The first thing I'd say is that there's an inspect operator that allows you to look at what the data looks like at a certain point. You can drop it into the dataflow at any point, and that inspect operator can log stuff or print stuff. It doesn't do anything to the data; it's just a way to peek at what's going on in the stream.
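The essence of an inspect-style operator is easy to show in plain Python: observe each item, pass it through untouched. This is just the idea, not Bytewax's actual implementation.

```python
# A pass-through "inspect" step: it records (or could print) each
# item that flows past, and returns it unchanged.
seen = []

def inspect(item):
    seen.append(item)   # peek at the stream at this point
    return item         # hand the item on, unmodified

data = [{"user": "a"}, {"user": "b"}]
out = [inspect(x) for x in data]

print(out == data)  # True: the data is unmodified
print(len(seen))    # every item was observed
```

Dropping such a step between any two operators is a cheap way to answer "what does the data actually look like right now?" while developing.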
So that's helpful when developing, so you can see what the data looks like at that step. Managing the data, recreating it, having test sets, and stuff like that is the same problem outside of Bytewax as it is
[00:32:03] Unknown:
using Bytewax too. Moving into the business aspect of what you're building at Bytewax, I'm wondering how you're thinking about the governance and sustainability aspects of the open source project, and what the commercialization options are for turning this into a sustainable business on top of the open source.
[00:32:24] Unknown:
You know, there are many opinionated people on how to do this correctly. At the end of the day, we want to make sure that the core open source product, Bytewax, is something that can be taken and used by a single developer or even a small team. It's not until you are a larger organization that you'll require some of the features that sit outside of that open source product: some of the security features, enhanced collaboration features for dataflows across teams, managing multiple teams using dataflows, and other kinds of maintenance and support.
So that's what we think of in terms of commercialization. The core product will be open source, and whenever there is something developed for an enterprise that would also help a single developer or a small team use Bytewax, we push it down to the open source project. If it's something that we don't think is necessarily required by smaller teams and single users, then we may not push it down. In terms of commercialization and the sustainability of the business, the enterprise product would be licensed and come with support and maintenance, as well as some of these peripheral features like a user interface and integration with other enterprise systems. It will be licensed and paid for, and that will be the commercial aspect of the business outside of the core open source product.
[00:33:46] Unknown:
As you have been working with some of your early users and people in the community, what are some of the most interesting or innovative or unexpected ways that you've seen Bytewax applied?
[00:33:56] Unknown:
Yeah. I think one of the most interesting ways was using Bytewax for batch processing. You can process any data with Bytewax; we've just focused on streaming data because we thought that's where the gap was. It's a specific use case: sort of a lift and shift from one data store, like a data warehouse that is not supposed to be used in production and doesn't have the latency requirements, over to a database that is more suitable for production and has lower latency. That was an interesting use case because it was unexpected, but once we looked at it and went through the effort of what it would look like, it made sense, because you could split the offline store. Let's call it the data warehouse.
Let's say it's Parquet files in S3. You can hand a list of files to Bytewax, split them all over the workers or processes, parallelize that, and then write it into this production database. So that was an interesting use case for Bytewax that wasn't fully expected. We largely had in mind these more advanced processing use cases: complex business processing, machine learning type use cases.
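The batch pattern described above, splitting a list of files across workers, can be sketched in a few lines. The bucket paths and the round-robin scheme are illustrative, not Bytewax's actual partitioning logic.

```python
# Deal a list of input files out across N workers, round-robin.
def partition(paths, n_workers):
    piles = [[] for _ in range(n_workers)]
    for i, path in enumerate(paths):
        piles[i % n_workers].append(path)
    return piles

# Hypothetical warehouse layout: 7 Parquet files in S3.
paths = [f"s3://warehouse/events/part-{i:03}.parquet" for i in range(7)]
piles = partition(paths, 3)

for worker, pile in enumerate(piles):
    # Each worker would read only its own files and write rows to
    # the low-latency production database, all in parallel.
    print(worker, len(pile))
```

The key property is that the input is splittable up front, so each worker gets an independent slice of the work; that is what makes the lift-and-shift parallelize.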
[00:35:08] Unknown:
In your experience of building and growing this open source project and the business around it, and figuring out what the actual useful problems are that people are experiencing and how to address them, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:35:26] Unknown:
One of the most interesting things that I've come across is that you end up very close to the library you're working on, and you have this predisposition about how it should be used. Then someone grabbed Bytewax and put it in a Flask application so that they could host it on a web server framework and run the dataflow: they actually had a route where you would say, run my dataflow, and it would kick off this dataflow that was ingesting data from Kinesis on AWS and doing something with it. I did not expect Bytewax to be used in that way, and it was just kind of interesting. Potentially we have this idea that everything ends up in a web server; you always end up with a FastAPI or a Flask service running your thing, and it just became, okay, cool, I guess Bytewax can also run in a web service that way.
It was just interesting to me how we're all kind of biased towards making things into web services.
[00:36:28] Unknown:
And for people who are interested in being able to scale out repeatable dataflows or data manipulation workflows, what are the cases where Bytewax is the wrong choice and they're better suited to just using a Pandas workflow or throwing something into Spark? What are some of the cases where Bytewax is not the right fit and you'd be better suited with a different existing tool?
[00:36:52] Unknown:
Well, I think we all have to be reasonable, and if your organization is a Spark organization and you already have support for Spark, sometimes it's best not to fight that in order to use a new tool. You brought up Pandas: if you're doing Pandas workflows, those parallelize so well on some of the other frameworks, like Dask and Ray, and we haven't built the support into Bytewax to do that at the same scale. So that's another case where something else is a better fit. And if you cannot split the input in order to parallelize on it, it might not be the best use case. If it's not a list of files but a single file, and you have to load and read it all into one worker, and only after processing it there could the work be parallelized, then Bytewax is probably not the best tool today.
[00:37:43] Unknown:
And as you continue to iterate on the project and the business, what are some of the things you have planned for the near to medium term or any particular problem areas that you're excited to dig into?
[00:37:54] Unknown:
As I mentioned earlier, adding additional workers. We're also still working on adding integrations for input and output, and we have an upcoming release that will have the first of that work and the first of the state recovery work. I'm really excited about adding more input and output integrations, because that will help people a lot; it's just a lot easier to get started. I'm also excited about, at some point in the future, adding some of these incremental algorithms so they can be used off the shelf and you don't have to understand how the math works or reimplement it yourself from a paper. That's pretty exciting. And in addition to all of that, the work around how Bytewax is deployed and how it fits into the rest of the data ecosystem is also really interesting. We have a lot to learn there, and I think we still have a lot of work to do to make that all play nicely together.
[00:38:48] Unknown:
Are there any other aspects of the Bytewax project, or the overall problem space of managing performant and scalable data workflows in the Python ecosystem, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:39:03] Unknown:
Yes. We didn't mention Streamz. Streamz is a Python project that is very similar to Bytewax; it's meant for processing iterable objects. It was started by the founder of Coiled, who created Dask, actually, and it's now maintained by someone else. It's a cool project, and we've looked at it multiple times. I think it has great integrations for visualizing streams, which is a big space that has not been addressed very well in the Python ecosystem. The main difference between Bytewax and Streamz is that ability to parallelize the IO.
It has a scatter and, I can't remember what they call it, a scatter and combine, reduce type of flow. But yeah, that's another one worth mentioning for sure.
[00:39:49] Unknown:
For anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. My pick this week is actually going to be a bike rack that I picked up for hauling bikes on my car, because the one I had been using before was a pain in the butt to fit all the bikes onto. I picked one up from a place called AltaRacks, and it's got a hanging configuration: you just lift your front tire up into a little basket, and then the whole bike hangs there, so it's easy to fit a bunch of bikes on the car. The one I got will actually fit up to 6 bikes, so you can have a whole crew of people with you. Definitely worth checking out if you're looking for an easy way to get the whole family out on the trail. So with that, I'll pass it to you, Zander. Do you have any picks this week? Nice. Yeah. Well, why not, since we're on bikes? I'm a big mountain biker. I love to go mountain biking in my spare time; I've done it my whole life. And I am obsessed with this bike. I don't have this bike. I would love to have this bike. There's a family, their last name is Atherton, and they've all been, like, World Cup downhillers,
[00:40:53] Unknown:
crazy bikers. And a few years back they started a line of bikes called Atherton Bikes. I'm obsessed with these bikes because they're 3D printed titanium lugs with carbon fiber tubes glued in between. They have a really cool suspension design, and one day I dream of having one. They 3D print them, and they're fully customizable. So the price point is
[00:41:16] Unknown:
out of my reach, but maybe one day. Alright, I'll definitely have to take a look at those. So, definitely appreciate you taking the time today to join me and share the work that you've been doing with Bytewax. It's a very interesting and useful project, and it's always great to see more of these entrants into the Python ecosystem for managing some of these scalable data workflows that have, up till now, been largely the domain of Java. I appreciate all the time and energy that you and your team have been putting into that, and I hope you enjoy the rest of your day. Thanks, Tobias. It was great to be here. Thank you so much for having me. Thank you for listening. Don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest on modern data management, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you learned something or tried out a project from the show, then tell us about it. Email hosts@pythonpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Zander Matheson and Bytewax
The Genesis of Bytewax
Target Users and Use Cases for Bytewax
Comparing Bytewax with Other Streaming Solutions
Architectural and Implementation Details
Designing an Idiomatic Python API
Evolution from Machine Learning Platform to Bytewax
Scaling Bytewax Workflows
Building and Deploying Bytewax Solutions
Governance and Commercialization of Bytewax
Interesting Use Cases and Lessons Learned
Future Plans and Integrations for Bytewax
Closing Remarks and Picks