Summary
Analysis of streaming data in real time has long been the domain of big data frameworks, predominantly written in Java. Taking advantage of those capabilities from Python requires client libraries that suffer from impedance mismatches, which make the work harder than necessary. Bytewax is a new open source platform for writing stream processing applications in pure Python that don’t have to be translated into foreign idioms. In this episode Bytewax founder Zander Matheson explains how the system works and how to get started with it today.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. And now you can launch a managed MySQL, Postgres, or Mongo database cluster in minutes to keep your critical data safe with automated backups and failover. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with a fully automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your dbt, Snowflake, Tableau, Looker, or whatever you’re using and Select Star will set everything up in just a few hours. Go to pythonpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
- Need to automate your Python code in the cloud? Want to avoid the hassle of setting up and maintaining infrastructure? Shipyard is the premier orchestration platform built to help you quickly launch, monitor, and share python workflows in a matter of minutes with 0 changes to your code. Shipyard provides powerful features like webhooks, error-handling, monitoring, automatic containerization, syncing with Github, and more. Plus, it comes with over 70 open-source, low-code templates to help you quickly build solutions with the tools you already use. Go to dataengineeringpodcast.com/shipyard to get started automating with a free developer plan today!
- Your host as usual is Tobias Macey and today I’m interviewing Zander Matheson about Bytewax, an open source Python framework for building highly scalable dataflows to process ANY data stream.
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what Bytewax is and the story behind it?
- Who are the target users for Bytewax?
- What is the problem that you are trying to solve with Bytewax?
- What are the alternative systems/architectures that you might replace with Bytewax?
- Can you describe how Bytewax is implemented?
- What are the benefits of Timely Dataflow as a core building block for a system like Bytewax?
- How have the design and goals of the project changed/evolved since you first started working on it?
- What are the axes available for scaling Bytewax execution?
- How have you approached the design of the Bytewax API to make it accessible to a broader audience?
- Can you describe what is involved in building a project with Bytewax?
- What are some of the stream processing concepts that engineers are likely to run up against as they are experimenting and designing their code?
- What is your motivation for providing the core technology of your business as an open source engine?
- How are you approaching the balance of project governance and sustainability with opportunities for commercialization?
- What are the most interesting, innovative, or unexpected ways that you have seen Bytewax used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Bytewax?
- When is Bytewax the wrong choice?
- What do you have planned for the future of Bytewax?
Keep In Touch
Picks
- Tobias
- Zander
Links
- Bytewax
- Flink
- Spark Streaming
- Kafka Connect
- Faust
- Ray
- Dask
- Timely Dataflow
- PyO3
- Materialize
- HyperLogLog
- Python River Library
- Shannon Entropy Calculation
- The blog post using incremental shannon entropy
- NATS
- waxctl
- Prometheus
- Grafana
- Streamz
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. So now your modern data stack is set up. How is everyone going to find the data they need and understand it? Select Star is a data discovery platform that automatically analyzes and documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who's using it in the company, and how they're using it, all the way down to the SQL queries. Best of all, it's simple to set up and easy for both engineering and operations teams to use.
With Select Star's data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial at pythonpodcast.com/selectstar. You'll also get a swag package when you continue on a paid plan. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, and dedicated CPU and GPU instances.
And now you can launch a managed MySQL, Postgres, or Mongo database cluster in minutes to keep your critical data safe with automated backups and failover. Go to pythonpodcast.com/linode today to get a $100 credit to try out their new database service, and don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey, and today I'm interviewing Zander Matheson about Bytewax, an open source Python framework for building highly scalable dataflows to process any data stream. So, Zander, can you start by introducing yourself? Sure. Thanks for having me on, Tobias. I'm Zander. I'm the founder of Bytewax.
[00:01:59] Unknown:
Yeah. I'm stoked to be here today. And do you remember how you first got introduced to Python? Yeah. It was actually during business school. I have, like, a bit of a nontraditional background, maybe. I was a civil engineer, and I went to business school. And in business school, I took a Python CS class. I was always kind of into, you know, messing around with computers in one way or another, but that's how I got introduced to Python, and I really enjoyed it. And after business school, I tried to create a career where I could continue to use programming, and that's how I ended up in data science, actually.
[00:02:32] Unknown:
In terms of the Bytewax project, I'm wondering if you can describe a bit about what it is and some of the story behind how it came to be and why you decided that this is where you wanted to spend your time and energy.
[00:02:42] Unknown:
Yeah. Sure. So Bytewax is a Python library, but it's also a company by the same name. And I started Bytewax. I was at GitHub actually before starting Bytewax. Left GitHub to start something in the machine learning space. I was working on machine learning infrastructure at GitHub, and so actually, the first iteration of what we were working on was a machine learning platform. And we pivoted the company from that to working on an open source project and focused on the stream processing library. And, I mean, the reasoning behind that was it was somewhat of a convoluted space, the hosted machine learning model space. We didn't feel it was meeting as big of a need as we first anticipated, and we also kept running into people who wanted to be able to build data processing against data streams in Python. We had a good start to that, so we kept working on Bytewax as the open source project.
[00:03:38] Unknown:
And as far as the overall use cases for it and the end users that you have in mind as you're building the project and iterating on what the commercial road map looks like, who are the target users of the Bytewax library and framework, and what are the specific problem areas that you're trying to address for those users?
[00:04:00] Unknown:
To be honest, we keep learning about more use cases. But I'll start with the users. When we've been developing Bytewax, the users we have in mind are data engineers and data scientists or machine learning engineers. Quickly after we announced the project, we had a lot of interest from the machine learning community to use this for feature transformations. So oftentimes in machine learning, you train a model on some large feature set that you create, then you want to run some inference in real time, and you need to be able to do whatever you did to the data in the training step to do that in the inference step. And so there seemed to be a nice overlap in abilities and requirements there. Coming back to the use cases, it is a generalized data processing framework, but what we are seeing as great use cases are complex analysis on streams of data. So that's your, like, more machine learning esque type things. But in addition, people are using Bytewax for moving data around and doing small transformations. So maybe you're moving data from Kafka to Snowflake.
You can use a Kafka connector, but let's say that you actually wanna augment that data in some way before you move it to Snowflake. Say you wanna join it with some data from another database or turn an IP address into a geolocation or something like that. So instead of needing to do that in two hops, you know, one for the augmentation then back to Kafka and then another one using Kafka Connect, you could do that in one step, which is a slightly different use case than initially intended, but I think it's awesome that people found the utility there.
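The enrichment step described above can be sketched as a plain Python map function. This is a toy illustration, not Bytewax code: the `GEO_TABLE` lookup and the event shape are hypothetical stand-ins for whatever GeoIP database or service you would actually query before sinking to Snowflake.

```python
# Hypothetical in-memory geolocation table; a real pipeline would
# query a GeoIP database or an external service here.
GEO_TABLE = {
    "203.0.113.0": {"country": "US", "city": "Seattle"},
    "198.51.100.7": {"country": "DE", "city": "Berlin"},
}

def enrich_with_geo(event):
    """Map step: attach a geolocation to an event based on its IP."""
    geo = GEO_TABLE.get(event.get("ip"), {"country": "unknown", "city": "unknown"})
    # Return a new dict rather than mutating the input record.
    return {**event, "geo": geo}

record = {"user": "a1", "ip": "203.0.113.0"}
print(enrich_with_geo(record)["geo"]["city"])  # Seattle
```

Running a function like this inside the stream is what collapses the two-hop augment-then-connect pattern into a single step.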
[00:05:38] Unknown:
And in terms of the ways that people have been solving this problem up till now, you mentioned one in the Kafka use case, where you might be using, you know, Kafka Connect, or you might be using Kafka Streams, or something like the Faust project from Robinhood to be able to pull the data out from a topic, manipulate it, write it back into another topic, and then use Kafka Connect to write it out into the destination system. In the kind of stream processing world, there are a whole bunch of options out there. I'm just wondering, as you've seen people starting to adopt Bytewax, what are some of the ways that they're factoring that into their existing architectures? Or for the case where maybe they're doing a greenfield project with it, what are some of the alternative structures that they were considering before they settled on Bytewax as the solution to their problem?
[00:06:26] Unknown:
Yeah. I think you covered most of the alternative solutions there. Yeah. You have Flink, Spark Structured Streaming. In the Kafka ecosystem, you have Kafka Streams, Kafka Connect. And those are largely a heavy lift for operations and, you know, getting started, and they're mostly Java focused, so there's extra overhead for Python folks to adopt that Java based ecosystem of tools. As far as the architecture where you would potentially replace this with Bytewax, or where it would be a good place to get started with Bytewax, oftentimes, as a Python developer, you wanna consume some data from Kafka. It is fairly easy to take the Python Kafka library or the Confluent Kafka library and start consuming data and processing it. But very quickly, when you start to work with partitioned data and you need to aggregate things or window things, it becomes increasingly difficult, or if you want to scale your consumer group. And that's where people have reached for Faust in the past in Python. I would say this is a good place to introduce Bytewax, because you can write your dataflows to do aggregations and windowing, and you don't have to worry about the complexity of managing the multiprocessing and exchanging data across the processes.
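To make the windowing and aggregation concepts concrete, here is a toy, in-memory version of a keyed tumbling-window count in plain Python. This deliberately ignores everything that makes the real problem hard, such as out-of-order data, state recovery, and scaling across workers, which is exactly the complexity a framework like Bytewax is meant to abstract away.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Count events per key inside fixed-size (tumbling) windows.

    events: iterable of (timestamp, key) pairs.
    Returns {(window_start, key): count}.
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Each timestamp falls into exactly one non-overlapping window.
        window_start = (ts // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(0, "a"), (3, "b"), (7, "a"), (12, "a"), (14, "b")]
print(tumbling_window_counts(events, 10))
# {(0, 'a'): 2, (0, 'b'): 1, (10, 'a'): 1, (10, 'b'): 1}
```

Hand-rolling this over a partitioned Kafka consumer group, with state that survives restarts, is where the difficulty the answer describes really begins.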
[00:07:45] Unknown:
All of that kind of is abstracted away. In the Python ecosystem, a couple of other projects that people might be using for adjacent use cases are things like Dask or Ray. And I'm wondering how you think about the ways that Bytewax either complements those tools and their ecosystems or some of the ways that Bytewax might be an alternative to them for specific use cases.
[00:08:06] Unknown:
Yeah. I think those are both, like, really great tools, and they have a lot of integrations built in, you know, the data frame library, or they're very Python esque and great for batch processing or handling scaling across multiple machines when you wanna scale out your ability to respond to API requests. I think they're great tools. Where we're focused is more on streaming data, and part of this is due to what we leverage underneath. Underneath Bytewax is a project called Timely Dataflow, which is a Rust library. And Timely Dataflow uses a dataflow programming model. So when you write a Bytewax dataflow and you have all these operations that are sequential, those all happen on one worker. And that lends itself well to parallelizing across streams, because you can have the input and output parallelized, and you don't need to have the same orchestration or scheduling going on that you might have with a centralized execution framework. And so that's kind of where we see the complementary systems: one for, you know, that kind of parallelized IO situation, and another one, Dask and Ray, etcetera, is great for when you wanna farm out one thing to a bunch of workers and have it executed.
[00:09:21] Unknown:
As far as the Bytewax project itself, you mentioned a bit about the fact that it's relying on Timely Dataflow as its core kind of data management layer. I'm wondering if you can just talk a bit more about some of the architectural and implementation details of Bytewax and some of the ways that it was designed to be able to address some of these data processing and data manipulation and scalable data workflows.
[00:09:45] Unknown:
Yeah. I have to also mention, when we talk about the implementation of Bytewax, PyO3. It's a framework for bridging Python and Rust, and it's a really great project. And, you know, for any Python developer who's looking for performance benefits, I think it's a really good place to look. You know, we got a lot of the capabilities of Bytewax from Timely Dataflow. Timely Dataflow is a project, or the library itself, that was authored by someone whose name is Frank McSherry, and I won't, like, try to speak exactly to it. I'll let him do that. There's a lot of great material about Timely Dataflow that he's written and recorded. He works on the Materialize streaming database.
That's the company he founded. And with Timely Dataflow, it was out of Microsoft Research that they came up with this idea to kind of change the execution model. And I think that there's a good paper on it that I won't try to summarize here, that Frank wrote, called Scalability! But at what COST? It just talks about how there's so much overhead with these distributed systems sometimes that until you reach a certain scale, you don't actually get the benefits of the parallelization, because of, like, network overhead or serialization overhead. And using that dataflow programming model, they were able to, I think he uses a single core machine to match the capabilities of, like, a hundred Spark nodes or something like that. So Timely Dataflow gives us a lot of benefit there for performance. And also, the library has all these operators built in. So it manages the communication between the workers, spinning up the workers, etcetera, and some of the operators. So the work we've done is turning that into something that is less academic and more that can be used in production. So working on persisting state in a way that the dataflows can be recovered, adding integrations like with Kafka, etcetera, and other semantics that work for the Python library.
[00:11:45] Unknown:
With Timely Dataflow as the core engine for being able to manage some of that data distribution and communication across the different processes and workers as you scale out horizontally, I imagine that for being able to actually bring that data in, initially, you're relying a lot on some of the ecosystem of Python tooling for database connectors and client libraries for different data systems. I'm curious how you have worked through maybe some of the scalability considerations around that, where you might be bottlenecked on IO and figuring out how do I now, you know, segment these data loads and parallelize across these workers, and just some of the coordination that is involved in actually managing the ingress and egress of data into the system until you hand it off to Timely Dataflow to manage some of the, you know, internal operations of data distribution as you're processing it.
[00:12:39] Unknown:
So the nice thing about using PyO3 and having access to writing things in Rust is we can move some of those down to the Rust layer. So as an example, we wrote a Kafka input config helper. So it's essentially like pushing the Kafka integration to the Rust layer, and that helps the IO. You know, there's gonna be many opportunities, I think, for moving things to the Rust layer and then providing a binding to the Python layer to help with performance.
[00:13:13] Unknown:
The other interesting element of the Bytewax project is that you are exposing it as this Python layer, and so there are a lot of design considerations that go into: how do I make this idiomatic for people who are working in that Python space? How do I bring in appropriate abstractions for some of the data considerations, being able to work with streaming environments, you know, how do I understand what size windows to use for being able to manage these, you know, streaming partitions? And just some of the overall design aspects of figuring out how do I actually take this very complicated but valuable space and map it into something that is accessible for people who are operating across different ranges of experience and different backgrounds, and provide sort of a cohesive experience, so that somebody who is working in the application development ecosystem can understand it and make use of it, and also be able to hand that off to a data engineer or a machine learning engineer and be able to map those same concepts into a shared problem space?
[00:14:18] Unknown:
Yeah. That is, like, the crux of designing the right API and making something usable that has the right key abstractions. I mean, we're kind of operating under the premise that, okay, we're gonna screw this up at least once, probably many times, and we'll have to continue to iterate until we find that right abstraction layer. Yeah. There are certain concepts that exist in Timely Dataflow that, as ported into Python, are maybe not natural for most of us, for all of us, for me, around this aspect of time that we call epochs. This project is constantly being changed right now, and we're trying to figure out what is the best or the happy medium. But that's one area where it's a good example of, like, how much do we expose to the user? How much do we abstract? And right now, it's very low level.
So a user can, you know, be in total control of how data is processed in the dataflow. You know, that's what the time, in the sense of Timely Dataflow, or our epochs, are used for. And whether or not we'll keep that low level, or hide it and still expose it in some way, shape, or form, those are sort of the things we're working through. Because right now, you know, it's so open ended. You can connect to anything as long as there's a Python library to do that, and you can make windows using epochs. And then you can be in control of advancing the dataflow and emitting a new epoch. And all that can get really confusing, and so, you know, we continue operating like, let's give some of the lower level access. We'll try abstractions. And when they don't work, we'll go back to the drawing board and then work on them again. And one place we've been kind of working on this is on the inputs.
Providing tumbling windows or other windowing semantics or ordering semantics, and how those work and how users will use them, I think that's probably one of the hardest parts of marrying up the ideas in Timely Dataflow and what Python developers are used to. It looks like a functional programming paradigm, and it's different. Ray uses decorators, and, like, what we're used to as Python developers maybe isn't totally translated to Bytewax, so there's some learning curve there. And I think that's partially because we're kind of exposing that from Timely Dataflow. I'm really interested to see how that changes.
Our, like, method right now is, you know, put something out there and learn from it, and then try and iterate on it to make it more user friendly and a better, like, broader experience when it comes to the API.
[00:16:57] Unknown:
Another interesting element of the specific semantics of streaming data that come in when you're trying to process it as it traverses the system and as it kind of flies by is some of the ways that you have to think about the algorithmic aspects differently. So one of the canonical examples that I always think about in this space is if you're just trying to count something. If you have a batch approach, you have all of the data available for a given time distribution. You can just count across all of it, and then you're done. But as soon as you start having to manage counts across different, you know, events as they traverse the system, and they potentially come in out of order, you know, then you start having to think about, okay, I'm going to use a HyperLogLog so I can take an approximate count over this time distribution.
And then, eventually, I'm going to be accepting this sort of margin of error in the actual specific value at a given point in time. And I'm wondering how you've been thinking about how much of those algorithms and considerations to build into Bytewax natively versus building in recommendations of: if this is the type of thing that you're trying to do, here is a library that implements this well, you can pull this into your workflow.
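For readers unfamiliar with the HyperLogLog mentioned here, the sketch below is a compact textbook-style implementation of the idea, not anything from Bytewax: distinct items are hashed, a few bits pick a register, and each register remembers the longest run of leading zeros it has seen, from which the number of distinct items can be estimated within a few percent using constant memory.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=8):
        self.p = p
        self.m = 1 << p              # 2**p registers
        self.registers = [0] * self.m

    def add(self, item):
        # Stable 64-bit hash of the item.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)     # first p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # Small-range (linear counting) correction.
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return raw
```

With `p=8` (256 registers, about 256 bytes of state), the estimate for a stream of a thousand distinct keys typically lands within roughly 6 or 7 percent of the true count, which is the accuracy-for-memory trade the question is describing.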
[00:18:08] Unknown:
I've been writing a bunch of, like, blog posts on how to use Bytewax and some tutorials and trying to incorporate other libraries. And there's one library in the Python ecosystem called River that is a really great library, and they have a ton of these algorithms that are for working on streams, both ML and other algorithms. And I think right now, we lean on tools like that. But we have come across some instances where it would be great to have a standard library sort of support in Bytewax for some of these algorithms. Just at the end of June, I was working on a blog post that would look at network data and try to detect if something was funky. And I wanted to use the entropy approach.
It was stuff I didn't really know about. So when I was doing research, I found this entropy approach, the Shannon entropy calculation, to kind of score whether things were random or not random. I found, like, one paper that had an incremental algorithm that would calculate Shannon entropy. And I kind of reformatted that to work with Bytewax, and it was further reinforcing that I think we should probably add a standard library where you can do things like that. Sliding HyperLogLog is another one. I have a tab open right now with a paper that has sliding HyperLogLog, and I would like to make a tutorial or a blog post; that would be another instance when it would be great to have it in there. The Apriori algorithm is another one. All of these incremental versions of the static or batch compute versions would be great to include, I think, at some point. We have a lot to get right first before adding that would be more of a value add, I think.
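An incremental Shannon entropy calculator in the spirit described here can be quite small. This is a generic sketch, not the algorithm from the blog post: it keeps a running sum S = Σ cᵢ·log₂(cᵢ) alongside the total count n, so each event updates the entropy H = log₂(n) − S/n in O(1) instead of recomputing over all counts.

```python
import math
from collections import defaultdict

class StreamingEntropy:
    """Incrementally maintain Shannon entropy of a symbol stream."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.n = 0        # total symbols seen
        self.s = 0.0      # running sum of c_i * log2(c_i)

    def update(self, symbol):
        c = self.counts[symbol]
        if c:                          # remove the stale c*log2(c) term
            self.s -= c * math.log2(c)
        c += 1
        self.counts[symbol] = c
        self.s += c * math.log2(c)     # add the refreshed term
        self.n += 1

    def entropy(self):
        if self.n == 0:
            return 0.0
        return math.log2(self.n) - self.s / self.n
```

Feeding the stream `"aabb"` gives 1.0 bit (two equally likely symbols), while a constant stream gives 0.0; a sudden jump in entropy over network field values is the kind of "something funky" signal the blog post was after.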
[00:19:48] Unknown:
As far as the kind of module design as well, there's also the question of: do I offer this as a standard library built into the Bytewax package? Do I create this as an add on library that is included by default in a Bytewax distribution but is available to other users in the ecosystem as a standalone package? It's always interesting to see sort of where those dividing lines end up making sense. Yeah. We have a little bit of a natural one if we implement it in the Rust layer. Bytewax is like a binary.
[00:20:18] Unknown:
So if it's written in the Rust layer, it'll probably be easier, well, it will be packaged up with Bytewax. Yeah. That's interesting too. We thought about that when we were doing input and output too. It's like, okay, Bytewax IO could be a separate thing, and Bytewax Algo could be a separate package, or it could be part of it. And I think leading the decision is where it happens right now: at the Rust layer or at the Python layer.
[00:20:46] Unknown:
Need to automate your Python code in the cloud? Want to avoid the hassle of setting up and maintaining infrastructure? Shipyard is the premier orchestration platform built to help you quickly launch, monitor, and share Python workflows in a matter of minutes with 0 changes to your code. Shipyard provides powerful features like webhooks, error handling, monitoring, automatic containerization, syncing with GitHub, and more. Plus, it comes with over 70 open source, low code templates to help you quickly build solutions with the tools you already use. Go to pythonpodcast.com/shipyard today to get started automating with a free developer plan.
Going back to some of the kind of evolution of the project, you mentioned that your initial goal was actually to build a machine learning platform, and then you ended up shifting direction into where you are now. I'm wondering if you can talk to some of the sort of early ideas and goals that you had and how you ended up moving in this direction instead and just some of the ways that the project, as it stands now, has evolved from some of those early ideas and experiments.
[00:21:49] Unknown:
So when we set out to build a machine learning platform, a serverless machine learning platform, we wanted to enable smaller teams to have the equivalent infrastructure of, you know, some of what the larger companies have, teams that might not have the operational capacity or the people to manage this. We also wanted to provide a way for people to manage real time inference, including pre and post processing. So what we developed was, like, a beta version of this, and what we tried out with some, you know, early customers was we built a Python SDK where you could decorate Python functions. And then when you deploy them, they would be built into a graph.
So it was sort of like this Python based directed graph for real time processing, but it was leveraging a queuing system called NATS that runs on Kubernetes, and then we would run the steps, the individual processes, as pods on top of Kubernetes. And this was all hosted. And we quickly learned from that that we would need to add an additional state layer to really provide what people wanted to do with some of this processing that was coming through the pipeline, and, you know, that was part of the impetus. But with respect to going open source, what we found was in these smaller companies, I guess, wrongly, we made the assumption that they wouldn't have the same requirements for data privacy and security, and they would be more open to using a hosted platform, because the value add of being able to get going quickly and not having to maintain it would outweigh security concerns.
And that wasn't the case. So there was a lot of friction there. We still wanted people to be able to run Bytewax as easily as possible. And so we ended up going the open source route, and we've added this, like, deployment tool called waxctl that allows you to easily deploy Bytewax on most of the clouds, on a VM, or on top of Kubernetes. So we're kind of doing as much as we can to allow people to run this in their own networks if they need to, you know, check those data privacy and security boxes, but still be able to try it out with very little additional operational complexity.
[00:24:12] Unknown:
And so as far as the core Bytewax package, you mentioned that it's able to scale these workflows across multiple processes, and I know that it can also scale across multiple machines. I'm wondering if you can just talk about some of the axes that are available for being able to manage some of those scaling operations, both in terms of, you know, multithreading versus multiprocess versus multi-machine, scaling up in terms of sort of instance sizing or scaling horizontally as far as number of cores, and just some of the ways that that problem manifests as you work through some of these different data processing workflows and some of the requirements that are involved as far as accuracy versus latency, etcetera?
[00:24:55] Unknown:
Bytewax has two dimensions. There are workers, which are threads, and then there are processes, which are processes that you can run a dataflow and scale it across. And the way that, I mean, I often talk about it, just depends on what your workload is bound to. So if you're processing things and you require a lot of CPU, then you're probably gonna want to parallelize it with more processes and fewer threads. If it's more just IO bound, like lifting and shifting, then you can probably get away with using threads. It'll save some of the performance knock of moving things around across processes. In terms of scaling the machines, we actually have some work to do here. So right now, we're working on this state recovery across multiple workers. And a next step on that is how you would add a single new worker or a single new process to a dataflow, so you can scale the number of parallel dataflows running. You know, it's not a trivial problem to solve, because you have to reshuffle state across these machines. But once we've solved the ability to add another worker or process and we have state recovery, then you should be able to have the ability to, you know, change the node, change a machine type, or increase the amount of CPU or memory and scale that way as well.
[00:26:18] Unknown:
For people who are building on top of Bytewax, can you talk through some of the overall design and development workflow of actually building a solution with Bytewax for being able to manage some of these data processing use cases? Just sort of going from, I have this idea or I have this requirement of, I need to, you know, move this data, perform this transformation on it, or, you know, continuously manage these transformations, and then actually implementing that in the API that's exposed and getting that deployed onto a running system?
[00:26:49] Unknown:
I think this is where using Timely Dataflow and Bytewax is a great experience, because you can run it locally really easily. You don't have to have a centralized cluster to send your work to. So getting started looks like, you know, pip install bytewax, whatever your required credentials are to connect to your IO or your input and output sources, and you can write a Python file with your dataflow using the operators that are available to you. And then you can just run it locally as a Python file, python my_file.py, with three workers or whatever it might be. So that's great because you can run it locally, and you can also write tests.
So as long as you can either connect to some staging environment for your input and output sources, or recreate them in some way, shape, or form, you can write tests that you can include in your CI/CD process. The next step is deployment, and that's where we're continually working on waxctl, "wax control". That's the tool that allows you to deploy on Kubernetes or on a single VM in the cloud. The next step is how you configure that to work with whatever system you're using for version control, or whatever you're using in that sense, and then you'll deploy your dataflow. The next part of that process, observability and so on, is kind of open ended today. Whether you're using Prometheus and Grafana or you have some other tool, that is left to the developer's imagination today.
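Since dataflow steps are ordinary Python callables, one low-friction way to get tests into CI/CD without touching real input or output systems is to unit-test the transformation functions on their own. The `mask_email` step below is hypothetical, not part of Bytewax; it just shows the pattern.

```python
# A hypothetical dataflow step, tested in isolation from any
# running flow or external system.
def mask_email(record):
    """Keep only the domain of the user's email address."""
    user = dict(record)
    _, _, domain = user["email"].partition("@")
    user["email"] = f"***@{domain}" if domain else "***"
    return user

def test_mask_email_keeps_domain():
    out = mask_email({"id": 1, "email": "alice@example.com"})
    assert out["email"] == "***@example.com"
    assert out["id"] == 1  # other fields untouched

def test_mask_email_handles_missing_at_sign():
    assert mask_email({"id": 2, "email": "bogus"})["email"] == "***"

# Run the checks directly; pytest would discover them by name.
test_mask_email_keeps_domain()
test_mask_email_handles_missing_at_sign()
print("ok")
```

Integration tests against a staging Kafka topic or database, as described above, then only need to cover the wiring, not the logic.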
But we'll hopefully learn as we go there and be able to add some additional tooling or some integrations to help with that next step of observability.
[00:28:38] Unknown:
In terms of actually building in testing or quality checks, what are some of the considerations, beyond just standard Python practices, for being able to validate the work that you're doing with a Bytewax workflow? And what are some ways to think about scaling the logical complexity as you move across multiple different operations on a particular unit of data as it goes from source through to destination?
[00:29:06] Unknown:
So: Python best practices for writing tests and writing your code. By design, you can't really shoot yourself in the foot with scaling and having the right data on the right worker, because when there's an operator that is stateful, like stateful map or reduce epoch, you're required to have your data in the format of a tuple with a key and then the data. That key is used for exchanging data across workers. Because of that, as you scale from testing to production, you're somewhat protected from ending up with the wrong data on the wrong workers. It could still happen potentially, but if you're aggregating, you won't. I think that's nice in terms of the scalability story. Once again, this is partly Timely Dataflow; it's great for that scalability story of running locally and then scaling it across multiple processes
[00:30:10] Unknown:
or workers. And then there are the organizational aspects of the code. Say I start with my initial implementation: I have a user record that is being processed, and in my first operation I want to obfuscate the email address so it doesn't flow through into the destination system. Say I erase everything before the at sign, so all I know is the domains that people are using for their email. And then, okay, in the next step I actually need to check whether there is any Social Security information in this user record. As you start to add in different requirements and different operations, what are the best practices or common approaches that you've seen people use to manage the logical complexity of these transformations as they scale beyond one, into a network of multiple transformations where there might be forking and merging of different operations as data goes from point A to point B?
[00:31:16] Unknown:
That is probably one of the complexities people will face when building dataflows: what the data looks like at each step. It's something we all already deal with when working with data: what does the data actually look like right now? The first thing I'd say is that there's an inspect operator that allows you to look at what the data looks like at a certain point. You can drop it into the dataflow at any point, and that inspect operator can log stuff or print stuff. It doesn't do anything to the data; it's just a way to peek at what's going on in the stream.
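The essence of an inspect-style operator is easy to show in plain Python: observe each item, pass it through untouched. This is just the idea, not Bytewax's actual implementation.

```python
# A pass-through "inspect" step: it records (or could print) each
# item that flows past, and returns it unchanged.
seen = []

def inspect(item):
    seen.append(item)   # peek at the stream at this point
    return item         # hand the item on, unmodified

data = [{"user": "a"}, {"user": "b"}]
out = [inspect(x) for x in data]

print(out == data)  # True: the data is unmodified
print(len(seen))    # every item was observed
```

Dropping such a step between any two operators is a cheap way to answer "what does the data actually look like right now?" while developing.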
So that's helpful when developing, so you can see what the data looks like at that step. Managing the data, recreating it, having test sets, and stuff like that is the same problem outside of Bytewax as it is
[00:32:03] Unknown:
using Bytewax too. Moving into the business aspect of what you're building at Bytewax, I'm wondering how you're thinking about the governance and sustainability aspects of the open source project, and what the commercialization options are for turning this into a sustainable business on top of the open source.
[00:32:24] Unknown:
You know, there are many opinionated people on how to do this correctly. At the end of the day, we want to make sure that the core open source product, Bytewax, is something that can be taken and used by a single developer or even a small team. It's not until you are a larger organization that you'll require some of the features that sit outside of that open source product: some of the security features, enhanced collaboration features for dataflows across teams, managing multiple teams using dataflows, and other kinds of maintenance and support.
So that's what we think of in terms of commercialization. The core product will be open source, and whenever there is something developed for an enterprise that would also help a single developer or a small team use Bytewax, we push it down to the open source project. If it's something that we don't think is necessarily required by smaller teams and single users, then we may not push it down. In terms of commercialization and the sustainability of the business, the enterprise product would be licensed and come with support and maintenance, as well as some of these peripheral features like a user interface and integration with other enterprise systems. It will be licensed and paid for, and that will be the commercial aspect of the business outside of the core open source product.
[00:33:46] Unknown:
As you have been working with some of your early users and people in the community, what are some of the most interesting or innovative or unexpected ways that you've seen Bytewax applied?
[00:33:56] Unknown:
Yeah. I think one of the most interesting ways was using Bytewax for batch processing. You can process any data with Bytewax; we've just focused on streaming data because we thought that's where the gap was. It's a specific use case: sort of a lift and shift from one data store, like a data warehouse that is not supposed to be used in production and doesn't have the latency requirements, over to a database that is more suitable for production and has lower latency. That was an interesting use case because it was unexpected, but once we looked at it and went through the effort of what it would look like, it made sense, because you could split the offline store. Let's call it the data warehouse.
Let's say it's Parquet files in S3. You can hand a list of files to Bytewax, split them all over the workers or processes, parallelize that, and then write it into this production database. So that was an interesting use case for Bytewax that wasn't fully expected. We largely had in mind these more advanced processing use cases: complex business processing, machine learning type use cases.
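The batch pattern described above, splitting a list of files across workers, can be sketched in a few lines. The bucket paths and the round-robin scheme are illustrative, not Bytewax's actual partitioning logic.

```python
# Deal a list of input files out across N workers, round-robin.
def partition(paths, n_workers):
    piles = [[] for _ in range(n_workers)]
    for i, path in enumerate(paths):
        piles[i % n_workers].append(path)
    return piles

# Hypothetical warehouse layout: 7 Parquet files in S3.
paths = [f"s3://warehouse/events/part-{i:03}.parquet" for i in range(7)]
piles = partition(paths, 3)

for worker, pile in enumerate(piles):
    # Each worker would read only its own files and write rows to
    # the low-latency production database, all in parallel.
    print(worker, len(pile))
```

The key property is that the input is splittable up front, so each worker gets an independent slice of the work; that is what makes the lift-and-shift parallelize.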
[00:35:08] Unknown:
In your experience of building and growing this open source project and the business around it, and figuring out what the actual useful problems are that people are experiencing and how to address them, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:35:26] Unknown:
One of the most interesting things that I've come across is that you end up very close to the library you're working on, and you have this predisposition about how it should be used. Then someone grabbed Bytewax and put it in a Flask application so that they could host it on a web server framework and run the dataflow: they actually had a route where you would say, run my dataflow, and it would kick off this dataflow that was ingesting data from Kinesis on AWS and doing something with it. I did not expect Bytewax to be used in that way, and it was just kind of interesting. Potentially we have this idea that everything ends up in a web server; you always end up with a FastAPI or a Flask service running your thing, and it just became, okay, cool, I guess Bytewax can also run in a web service that way.
It was just interesting to me how we're all kind of biased towards making things into web services.
[00:36:28] Unknown:
And for people who are interested in being able to scale out repeatable dataflows or data manipulation workflows, what are the cases where Bytewax is the wrong choice and they're better suited to just using a Pandas workflow or throwing something into Spark? What are some of the cases where Bytewax is not the right fit and you'd be better suited with a different existing tool?
[00:36:52] Unknown:
Well, I think we all have to be reasonable, and if your organization is a Spark organization and you already have support for Spark, sometimes it's best not to fight that in order to use a new tool. You brought up Pandas: if you're doing Pandas workflows, those parallelize so well on some of the other frameworks, like Dask and Ray, and we haven't built the support into Bytewax to do that at the same scale. So that's another case where something else is a better fit. And if you cannot split the input in order to parallelize on it, it might not be the best use case. If it's not a list of files but a single file, and you have to load and read it all into one worker, and only after processing it there could the work be parallelized, then Bytewax is probably not the best tool today.
[00:37:43] Unknown:
And as you continue to iterate on the project and the business, what are some of the things you have planned for the near to medium term or any particular problem areas that you're excited to dig into?
[00:37:54] Unknown:
As I mentioned earlier, adding additional workers. We're also still working on adding integrations for input and output, and we have an upcoming release that will have the first of that work and the first of the state recovery work. I'm really excited about adding more input and output integrations, because that will help people a lot; it's just a lot easier to get started. I'm also excited about, at some point in the future, adding some of these incremental algorithms so they can be used off the shelf and you don't have to understand how the math works or reimplement it yourself from a paper. That's pretty exciting. And in addition to all of that, the work around how Bytewax is deployed and how it fits into the rest of the data ecosystem is also really interesting. We have a lot to learn there, and I think we still have a lot of work to do to make that all play nicely together.
[00:38:48] Unknown:
Are there any other aspects of the Bytewax project, or the overall problem space of managing performant and scalable data workflows in the Python ecosystem, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:39:03] Unknown:
Yes. We didn't mention Streamz. Streamz is a Python project that is very similar to Bytewax; it's meant for processing iterable objects. It was started by the founder of Coiled, who created Dask, actually, and it's now maintained by someone else. It's a cool project, and we've looked at it multiple times. I think it has great integrations for visualizing streams, which is a big space that has not been addressed very well in the Python ecosystem. The main difference between Bytewax and Streamz is that ability to parallelize the IO.
It has a scatter and, I can't remember what they call it, a scatter and combine, reduce type of flow. But yeah, that's another one worth mentioning for sure.
[00:39:49] Unknown:
For anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. My pick this week is actually going to be a bike rack that I picked up for hauling bikes on my car, because the one I had been using before was a pain in the butt to fit all the bikes onto. I picked one up from a place called AltaRacks, and it's got a hanging configuration: you just lift your front tire up into a little basket, and then the whole bike hangs there, so it's easy to fit a bunch of bikes on the car. The one I got will actually fit up to 6 bikes, so you can have a whole crew of people with you. Definitely worth checking out if you're looking for an easy way to get the whole family out on the trail. So with that, I'll pass it to you, Zander. Do you have any picks this week? Nice. Yeah. Well, why not, since we're on bikes? I'm a big mountain biker. I love to go mountain biking in my spare time; I've done it my whole life. And I am obsessed with this bike. I don't have this bike. I would love to have this bike. There's a family, their last name is Atherton, and they've all been, like, World Cup downhillers,
[00:40:53] Unknown:
crazy bikers. And a few years back they started a line of bikes called Atherton Bikes. I'm obsessed with these bikes because they're 3D printed titanium lugs with carbon fiber tubes glued in between. They have a really cool suspension design, and one day I dream of having one. They 3D print them, and they're fully customizable. So the price point is
[00:41:16] Unknown:
out of my reach, but maybe one day. Alright, I'll definitely have to take a look at those. So, definitely appreciate you taking the time today to join me and share the work that you've been doing with Bytewax. It's a very interesting and useful project, and it's always great to see more of these entrants into the Python ecosystem for managing some of these scalable data workflows that have, up till now, been largely the domain of Java. I appreciate all the time and energy that you and your team have been putting into that, and I hope you enjoy the rest of your day. Thanks, Tobias. It was great to be here. Thank you so much for having me. Thank you for listening. Don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest on modern data management, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you learned something or tried out a project from the show, then tell us about it. Email hosts@pythonpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Zander Matheson and Bytewax
The Genesis of Bytewax
Target Users and Use Cases for Bytewax
Comparing Bytewax with Other Streaming Solutions
Architectural and Implementation Details
Designing an Idiomatic Python API
Evolution from Machine Learning Platform to Bytewax
Scaling Bytewax Workflows
Building and Deploying Bytewax Solutions
Governance and Commercialization of Bytewax
Interesting Use Cases and Lessons Learned
Future Plans and Integrations for Bytewax
Closing Remarks and Picks