Summary
The speed of Python is a subject of constant debate, but there is no denying that it is not the optimal tool for compute-heavy work. Rather than making you rewrite or rearchitect your data-oriented applications, the team at Bodo wrote a compiler that does the optimization for you. In this episode Ehsan Totoni explains how they are able to translate pure Python into massively parallel processes that are optimized for high performance computing systems.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Ehsan Totoni about Bodo, an inferential compiler for Python that automatically parallelizes your data oriented projects
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what Bodo is and the story behind it?
- What are some of the use cases that it is being applied to?
- What are the motivating factors for something like Dask or Ray as compared to Bodo?
- What are the software patterns that contribute to slowdowns in data processing code?
- What are some of the ways that the compiler is able to optimize those operations?
- Can you describe how Bodo is implemented?
- How does Bodo process the Python code for compiling to the optimized form?
- What are the compilation techniques for understanding the semantics of the code being processed?
- How do you manage packages that rely on C extensions?
- What do you use as an intermediate representation for translating into the optimized output?
- What is the workflow for applying Bodo to a Python project?
- What debugging utilities does it provide for identifying any errors that occur due to the added parallelism?
- What kind of support does Bodo have for optimizing a machine learning project? (e.g. using PyTorch/Tensorflow/MxNet/etc.)
- When working with a workflow orchestrator such as Dagster or Airflow, what would the integration process look like for being able to take advantage of the optimized Bodo output?
- What are the most interesting, innovative, or unexpected ways that you have seen Bodo used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Bodo?
- When is Bodo the wrong choice?
- What do you have planned for the future of Bodo?
Keep In Touch
Picks
- Tobias
- Ehsan
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Bodo
- University of Illinois Urbana-Champaign
- HPC
- MPI
- Elastic Fabric Adapter
- All-to-All Communication
- Dask
- Ray
- Pandas Extension Arrays
- GeoPandas
- Numba
- LLVM
- scikit-learn
- Horovod
- Dagster
- Airflow
- IPython Parallel
- Parquet
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Ehsan Totoni about Bodo, an inferential compiler for Python that automatically parallelizes your data oriented projects. So Ehsan, can you start by introducing yourself? Hi, Tobias. Thanks for having me. I'm Ehsan Totoni, cofounder and CTO of Bodo.ai. And do you remember how you first got introduced to Python?
[00:01:18] Unknown:
A long story, I think it's worth digging into. I was a PhD student at the University of Illinois, Urbana-Champaign, working on high performance computing on large supercomputers. So we had these codes with MPI and Fortran and C++, very low level code, hand optimized, running on hundreds of thousands of cores sometimes. But you see everywhere, even in an HPC oriented school like the University of Illinois, Urbana-Champaign, all the scientists and data scientists and computational scientists were using scripting languages because they are much simpler. It was usually MATLAB or Python or sometimes R. So that's where I got interested in scripting languages in general and started thinking about how to bring the power of high performance computing and parallel computing to scripting languages, to make it easier for regular programmers to take advantage of these compute resources and scale their code and run their code as fast as possible.
So after getting my PhD, I really wanted to bring this compute power to programmers of scripting languages, so I joined Intel Labs. We started trying to optimize Julia at first. The promise of Julia was that it's a scripting language that is fast and lets you run code with the speed of C. Simplicity of a scripting language, but speed of C. So we were working with the Julia team at MIT. We built several packages, compiler technologies to parallelize Julia, optimize Julia, but we saw that Python was taking over and becoming the standard for all kinds of programmers. So we brought our compiler technologies to Python instead of Julia.
And it was a nice surprise for me how nice and easy to use Python is, both for us developing compiler technologies in Python and for the user in terms of abstractions. So we thought some of the ideas that we had in Julia could be applied to Python, and that's what happened. In terms of some of those ideas, the main idea in Julia is the fact that you can compile individual functions based on the types of their inputs. So you don't have to compile everything like C and lose some of the flexibility of dynamic and scripting languages, but you can compile individual functions and have different versions of functions based on the types of the inputs that go into the function.
So the first time you call the function, the system compiles it, puts it in some table, and reuses it afterwards as this function is being called. So this is sort of the architecture of making a dynamic language compilable. And we saw that, basically, in Python, the Numba project does that for us. And that's how we got started in Python and brought ideas from Julia to Python. And I believe this is the right way to have the power of a dynamic scripting language, but at the same time, take advantage of compilation and optimization and get the speed of C, essentially, for your programs.
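As a concrete sketch of this type-specialized compilation model, here is a minimal Numba example; the function and its inputs are illustrative, not from the episode:

```python
import numba
import numpy as np

@numba.njit
def total(arr):
    # Compiled to native code the first time it is called for a given
    # argument type; the compiled version is cached and reused afterwards.
    s = 0.0
    for x in arr:
        s += x
    return s

total(np.arange(10, dtype=np.float64))  # triggers compilation for float64[:]
total(np.arange(10, dtype=np.int64))    # compiles a second, int64[:] version
print(total.signatures)                 # one entry per compiled type signature
```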
[00:05:09] Unknown:
And so in terms of the utility of Bodo for Python applications, I'm wondering, what are the motivating factors for using something like Bodo? And what are some of the elements of the Python language and runtime that cause it to be slow for these data heavy workloads?
[00:05:31] Unknown:
Maybe I can start by introducing what Bodo is. Bodo is a new inferential compiler technology. What that means is that Bodo can infer the structure of the program and convert the function to an optimized and parallelized version. It's as if an HPC expert wrote the code for you in MPI and C++, but it's done automatically and transparently, in real time. This was the result of 4 years of research at Intel Labs and Carnegie Mellon University, where I was. And when the research was successful, I joined forces with my cofounder, Behzad Nasre, who is an experienced business executive, and started Bodo in mid 2019. So Bodo is the first compiler system that can automatically parallelize code. This has been one of the holy grails of computer science for decades that we have achieved for the first time, and Python has made it possible, as I will explain in a minute. In terms of what we are building at Bodo, it is a platform for data engineers and data experts to scale their code and run it at HPC speed on thousands of cores. It scales linearly from a single core on your laptop, to 8 cores of your laptop, then a single cloud instance, and on to thousands of cores. We have results from enterprise customers up to 10,000 cores, and we have published some of that. So that's what Bodo is. We think we can improve very challenging data problems like ETL and data prep and speed them up by orders of magnitude, and increase their reliability and predictability, and lower the cost, all at the same time with this technology.
So the motivating factor for Bodo was that we saw data workloads are growing. Big data emerged basically in the 2000s, and there was no good way to scale code to big data, because Moore's law slowed down in terms of sequential performance of CPUs. You had to use more CPUs to process larger datasets. And the industry went towards libraries like MapReduce, then Hadoop, then Spark, where these libraries provide some high level APIs. It started with MapReduce; now there are data frame APIs that take your code, break it up into tasks, go across the cluster, run those tasks, and come back. So the driver program is a single process.
It's not actually parallelized, but compute power is delivered through these tasks. However, we saw that compared to native parallel code with things like MPI, which has been the standard of parallel computing for decades, these systems like Spark are orders of magnitude less efficient. So where some MPI code takes a few seconds, Spark can take hours. And this was a huge shock for us, and that's what motivated me to create Bodo and replace all these inefficiencies of these big data frameworks. But back to Python and how Bodo delivers performance for Python.
Let's take a step back. Why is Python slow to begin with? In C, the program has all the data types that translate to machine code. So you have integers, you have floating point values, classes that are made up of those things, and the compiler knows all the details of those data types to take the program and generate a binary that runs on bare metal. In Python, we have an interpreter, typically the CPython interpreter, that takes the code and runs the instructions one by one. It means that for every variable, it has to figure out the data type of that variable and call the appropriate method of that variable.
So if you have, like, the addition of two scalars, integers, in the worst case it becomes: check the data type of that value, go to the method for that addition, call that method. A lot of overhead. And that's why sometimes Python is hundreds of times slower than something like C. However, we can fix that. And Numba showed that for a certain class of applications, we can convert the Python code to something like C code under the hood and compile it with LLVM to a binary, because we can infer the data types of the program in some situations and compile it as if the code were C. So you don't have all this interpreter overhead.
The other approach that has existed is: what if we create high level APIs in Pandas and NumPy, and instead of running them on top of the interpreter, when you call that API, some C library actually runs it? That's how the code is run. So you have, let's say, two NumPy arrays. You add them. It doesn't happen through a loop in the interpreter; the loop is actually a native code loop. Because if you do the loop thing in the interpreter, the interpreter has a representation of the program at a very high level. It's bytecode, stack based bytecode: pop the instruction off the stack, figure out what it is and what the variables are, run that. If you do a loop like that, it's very slow. So that was one of the approaches, and in addition to bringing performance, it brings simplicity as well, because there is motivation for building these high level APIs that are simpler for the programmer. Everyone knows that Pandas is very simple for dealing with data. So this is the status of the Python ecosystem: there are some ways to compile Python. It's not perfect, but with Numba, there is some infrastructure.
And we have high level APIs that are more understandable for the programmer, and it turns out those APIs are more understandable for the compiler as well. And that's what made Bodo possible, because Bodo can take those APIs and understand their semantics. Let's say you're joining two data frames: it can figure out if the data structures can be distributed, do the distribution for you, and do the right communication across processors to make the computation possible and deliver it to you. So you don't have to worry about those aspects, and you don't have to write low level code to make that happen. So Python and this approach of using high level APIs is very instrumental in enabling Bodo and this new approach to scaling computation and data intensive applications.
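As a hedged sketch of what that looks like in user code, built on Bodo's documented @bodo.jit decorator (the file names and join key below are made up for illustration):

```python
import bodo
import pandas as pd

@bodo.jit
def merge_example():
    # Bodo infers that both dataframes can be distributed: each process
    # loads its own chunk of the Parquet files, and the compiler inserts
    # the shuffle communication the join needs.
    orders = pd.read_parquet("orders.parquet")
    customers = pd.read_parquet("customers.parquet")
    joined = orders.merge(customers, on="customer_id")
    return joined["amount"].sum()

print(merge_example())
```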
[00:13:22] Unknown:
In terms of the sort of scalability of Python data intensive operations, there are a few other systems that have started to gain popularity, most notably Dask and Ray and some other systems that layer on top of those or have similar approaches to horizontally scaling out the computation and having largely compatible APIs to some of the bigger packages that are used in the system, in particular, NumPy and Pandas. And I'm wondering what are the sort of deciding factors where somebody might go with Dask or Ray as compared to the use cases that Bodo is uniquely suited for?
[00:13:59] Unknown:
Systems like Dask and Ray are very similar to Spark at the architectural level and different from Spark in terms of use cases and some of the actual implementation. So, basically, the idea behind all the different frameworks out there, as far as I know, for big data is that the program runs on the driver process, and there are some framework APIs that are backed by a distributed system library, which is, you know, you call this dataframe.add or dataframe.join, some library figures out some tasks that need to be done, sends those tasks across the cluster, and comes back to the driver.
So this approach is a task scheduling, job scheduling type approach that we believe is not true parallel computing. So if you gave the same code to a parallel computing expert, what would they do? They would write code in MPI, where the program actually runs on all the processes. There are no tasks like that. We have processes. There is no driver. There is no scheduler. So the MPI expert would divide up the data across the processes and add communication code for the parallel portion so processes can talk to each other, and figure out what data structures should be replicated across these processes, the temporary small data structures, versus what is the big data that needs to be distributed.
So that's what the parallel computing expert would do. Bodo automates that process and gives you true high performance computing and parallel computing performance and scalability. These other frameworks, we believe, are trying to approximate parallel computing using distributed computing. Distributed computing is not fit for data workloads. For distributed computing, there is the assumption that the tasks are fully independent. You have a web server or web application, so the tasks of different people coming to your website are independent. That's one assumption.
The other assumption is that the system components are heterogeneous. You have a mobile device or a laptop, and then you have a server in the back end doing the work of the website. And your network is slow and unreliable, going through the Internet with TCP. All these assumptions are baked into these distributed libraries, and these are all wrong for parallel computing. In parallel computing, the compute tasks have dependencies, so you can't fire them off and forget about them. That's one assumption. The other wrong assumption is heterogeneity. In parallel computing, typically things work in a bulk synchronous manner, where at the same time, all processors are doing the same thing. You don't have one set of processors doing something and, as I said, another set doing something else. It's usually one thing, one computation. You are parallelizing one algorithm.
So the compute is homogeneous, and the hardware is homogeneous as well. You have a cluster in the cloud, and all the cores are some sort of CPU cores that are the same. You typically want your cluster to be homogeneous. So that heterogeneity assumption is wrong as well. The other assumption that's wrong is the network. In the cloud, you don't have to use TCP. You don't have to assume you're running over the Internet and pay the overhead. Even today, on AWS, you have access to better network protocols with things like the Elastic Fabric Adapter, EFA, of AWS, which is much faster than running TCP code on AWS.
And on Azure, you have InfiniBand, which is, you know, a high performance network protocol with capabilities like processes actually writing into each other's memory. So the performance profile of a true parallel computing and high performance computing solution like Bodo is very different from Spark, Dask, Ray, all of these distributed task scheduling libraries. And we believe that parallel computing has existed for decades. There is a lot of research, a lot of development, and we should reuse what has come before; we should stand on the shoulders of giants.
More specifically, we believe we should use MPI for data workloads, because MPI has all the primitives that you would need. Let me share an example. A key bottleneck of data workloads is shuffling data, where you're, let's say, joining two tables: all the keys with the same value have to be on the same process for the join to happen algorithmically. So in these task scheduling frameworks, they fire off these tasks, and the tasks are asynchronous, and they can't communicate with each other. So they use workarounds, and the main workaround is writing shuffle files, in the case of Spark. And they have spent many years optimizing that. So you write shuffle files through tasks, and some other tasks come and read the files and merge them together to achieve your shuffle.
However, this is a well known parallel computing pattern, and you can look at parallel computing resources and research papers from the eighties describing how this should be done in parallel. The operation is called all-to-all communication. So in the MPI world, there are no tasks. These are processes, and they are working in a bulk synchronous manner. They go through this all-to-all collective operation, and processes exchange messages in an optimal way. So without writing files, without crashing the JVM, without your Linux hitting file descriptor limits, all those issues, communication happens much faster than in these frameworks, and we believe this is the right way. And by the way, there has been a lot of research on a lot of different ways to do this more efficiently, and we are just reusing it.
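A small mpi4py sketch of that all-to-all collective: each of the N processes sends one value to every other process in a single collective call, with no intermediate shuffle files. The script name in the launch command is illustrative.

```python
# run with: mpiexec -n 4 python alltoall_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# sendbuf[i] is the value this rank sends to rank i.
sendbuf = np.full(size, rank, dtype=np.int64)
recvbuf = np.empty(size, dtype=np.int64)
comm.Alltoall(sendbuf, recvbuf)

# recvbuf[i] now holds the value received from rank i.
print(f"rank {rank} received {recvbuf}")
```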
And we are not reinventing the wheel. But back to Dask and Ray more specifically: we don't believe the use cases are similar. Spark and Bodo are built for large data engineering, ETL, and data prep tasks, but Dask and Ray, in our experience, cannot even perform those tasks. And we have benchmarks that we will be sharing where scaling them beyond a single node is not even practical for these types of data heavy workloads, with large joins and things like that. The difference is hundreds of times when they do work. So Bodo is hundreds of times faster.
[00:21:41] Unknown:
In terms of the software patterns that are common in these data processing workflows, what are some of the elements that contribute to the sort of scaling complexity and the algorithmic slowdowns that you're able to optimize in Bodo?
[00:21:58] Unknown:
So performance, I usually divide into two orthogonal dimensions that are related. One dimension is sequential performance: the compute that's happening on each core, how fast is it? The other dimension is parallel performance: how is communication happening across processors, is it done efficiently, and is the code scaling well, with linear scalability? On the sequential performance side, a key pattern that slows down data workloads is whenever you have custom code. In real applications, very often you have to, say, process some string value, see where it fits, and return some code number out of it, just as one example.
So these sorts of things are typically written as user defined functions in Pandas, let's say dataframe.apply, or some SQL user defined function. In these cases, there is no library to run the code for you because it's custom code. It can't be pushed into some C library, and you're limited by the interpreter speed, which is very slow. And every time, for every row, the interpreter is calling your function, so you're paying a function call overhead plus the overhead of running the function. Bodo's compiler capability to translate this to native code can give you hundreds of times speedup on these sorts of complex custom codes. So that's one key pattern.
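A sketch of that UDF pattern, with a made-up per-row categorization function; in plain pandas the interpreter calls it once per row, while a JIT compiler like Bodo's can translate it to native code (whether this exact apply pattern compiles is version dependent and assumed here):

```python
import bodo
import pandas as pd

def categorize(msg):
    # Custom string logic: no single pandas/C library call covers this,
    # so plain pandas pays interpreter overhead on every row.
    if msg.startswith("ERR"):
        return 1
    elif msg.startswith("WARN"):
        return 2
    return 0

@bodo.jit
def add_codes(df):
    df["code"] = df["msg"].apply(categorize)
    return df

df = pd.DataFrame({"msg": ["ERR disk full", "WARN slow", "INFO ok"]})
print(add_codes(df))
```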
There are some other patterns, such as expressions. If you run your expressions in regular NumPy or pandas, the code is essentially calling library method after library method on these data frames and arrays. And, you know, you load data, store something, load data, store something; you lose the cache efficiency of your processor. Bodo is able to fuse these operations and optimize them. So that's the other pattern, improving cache efficiency. And in general, for computations on strings, dates, times, things like that, Bodo is able to optimize the computation, eliminate the interpreter overhead, restructure the computation in a good way for cache efficiency, avoid the object overheads of Python, and things like that. So that's the sequential performance dimension.
The other dimension is parallel performance. In data workloads, a key pattern is shuffling data whenever you have a join, group by, any of these operations. And this is where a lot of the problems happen: the application slows down, something fails. Bodo, as I mentioned, uses MPI, which is very robust for these cases. It can scale using MPI very well, and you don't have any of these issues. For example, at a large enterprise company, the IT department really liked using the TeraSort benchmark, which sorts a large random dataset. And this small benchmark doesn't have any complex code for Bodo to optimize. It's just one library call to load Parquet files and one call to sort, DataFrame.sort_values.
But Bodo was 8 times faster than their highly tuned Spark implementation on 400 to 500 cores on the AWS infrastructure they had, because of this parallel aspect and the fact that Bodo uses MPI to scale computation. So no tuning necessary; it scaled to 400 to 500 cores. They literally changed the number of cores in their launch command from one core up to 400 to 500 without tuning it, just because when you do parallel computing the right way, you scale easily and much more efficiently, and you can take advantage of better network protocols like EFA to scale. So for both dimensions, sequential performance and parallel performance, Bodo addresses the challenges of data processing code. By the way, loading data in parallel is also a major problem. Bodo is storage agnostic and can load data from all sorts of storage systems, such as Parquet on S3, or data warehouses like Snowflake.
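A hedged reconstruction of what that benchmark code plausibly looks like under Bodo's JIT; the path and the sort key are made up:

```python
import bodo
import pandas as pd

@bodo.jit
def terasort(path):
    df = pd.read_parquet(path)     # each process loads its chunk in parallel
    return df.sort_values("key")   # parallel sort; MPI handles the shuffle

result = terasort("s3://bucket/teragen.pq")  # hypothetical dataset location
```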
So when you have a parallel computing infrastructure like Bodo, it's easier to optimize
[00:26:52] Unknown:
loading of data, another pain point for large scale data workloads that need to load terabytes of data at the same time. And now digging more into Bodo itself, and in particular the compiler aspects of it, I'm wondering what are some of the challenges that are inherent in being able to parse the semantics and structure of Python code, given its dynamic capabilities, and turn it into something that you're able to statically compile and optimize?
[00:27:26] Unknown:
So the main challenge is data types. For any compiler solution, you have to have data types to do anything. Without data types, the compiler can't do anything. So that's the key challenge. And in this case, the main data structures for data processing workloads are data frames. And for data frames, the data type of each column and the names of the columns are also part of the data type. So for every operation, Bodo has to know what the data frame input is and what the data frame output is. So we have a lot of compiler techniques to figure out the data types. That's one of our daily challenges, since, you know, Python and Pandas are not necessarily designed to be compiled.
There are these patterns users write. For example, a very common thing: you have a data frame and you assign a new column to it, you add a new column to it. That's changing the data type, and it can break the compiler system. So how can we avoid that? We automate all of these patterns. However, there are cases where we can't do it: the data type changes based on some condition in the program. And in those cases, we have to throw the right error for the user to know what's going on, and point to exactly where the problem happens and where they need to slightly change their code for the compiler to work.
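To make the pattern concrete, here is a plain-pandas sketch of the two cases described: a schema change a compiler can track statically, and one whose type depends on a runtime value (the column names and flag are illustrative):

```python
import pandas as pd

def stable(df):
    # The output schema is known at compile time: one new numeric column.
    df["total"] = df["price"] * df["qty"]
    return df

def unstable(df, flag):
    # The type of "extra" depends on a runtime condition, so it cannot
    # be inferred statically; a JIT compiler would raise an error here.
    if flag:
        df["extra"] = 1            # int64 column...
    else:
        df["extra"] = "missing"    # ...or a string column
    return df
```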
So throwing the right errors, with a good user experience, is one of the challenges we are focused on, on a daily basis, to make it easier for Python developers to get up to speed with Bodo
[00:29:16] Unknown:
and understand this JIT technology and take advantage of it. For things like the Pandas data types, I'm wondering how things like the extension arrays capability that was added somewhat recently impact your ability to understand the data type, particularly if the data type is a complex object, for instance with the GeoPandas plugin that adds support for geographical information like, you know, shape files or plots of lat and long coordinates, and being able to optimize for that in something that might be using that extension to a pandas data frame?
[00:29:52] Unknown:
Bodo has its own internal representation of the various types of data in these arrays. So we have, obviously, numerical arrays, NumPy-like arrays. We have more complex arrays like nested arrays, where each element of the column is an array or list by itself. We have map arrays, where each element is like a dictionary, and things like that. So Bodo has optimized representations of these arrays. And these new array types of pandas are actually making our job easier. You mentioned some of the more complex ones, and I also mentioned some examples, but a recent one was just nullable integer arrays. It made our job so much easier, because in pandas, if you have an integer array and it's backed by a NumPy array, when you have nulls in your column, when you introduce nulls, it all of a sudden becomes float, which has other problems like numerical accuracy and all those issues.
And Pandas would convert to float and assign NaN as the sentinel value, as the indicator of null. So this was causing a lot of issues for Bodo, this type change throughout the computation. Now you can have a single optimized data structure; nullable integer arrays solve that problem. So this advancement of pandas in the direction of having proper data structures, and not just relying on NumPy, actually helps Bodo. But for any of these, we have to add internal support to Bodo to understand these data structures, and that's our usual kind of expansion of Bodo's functionality based on Pandas advancements.
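A small pandas example of the difference being described: a null in a NumPy-backed integer column silently turns the whole column into float64 with NaN as the sentinel, while the nullable Int64 extension dtype keeps the integers and a proper missing value.

```python
import pandas as pd

s = pd.Series([1, None, 3])
print(s.dtype)  # float64: the null forced a NumPy float array with NaN

t = pd.Series([1, None, 3], dtype="Int64")
print(t.dtype)  # Int64: nullable extension array keeps integers, null is <NA>
```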
[00:31:53] Unknown:
Stepping through more of the actual compilation process, I'm wondering if you can talk about the sort of parsing and the ways that you understand some of the semantics of the code to know when you're able to parallelize certain operations and when they need to be executed sequentially and how the sort of output of the compilation step interacts with the original Python code?
[00:32:19] Unknown:
Yeah. Definitely. The compilation process is actually very interesting. So when a Bodo JIT function is called, starting from the very beginning: the function is being called with some arguments. First, it goes to Numba. So Numba takes the function that Python's interpreter has parsed. It's in bytecode format, the stack based bytecode that I mentioned. So Numba takes that and converts it to an internal intermediate representation that corresponds directly to that bytecode. And Numba has facilities for type inference and various things that we take advantage of. So Bodo takes that, and it goes through the program. I'm overly simplifying it, but it looks at all the operations that it understands.
So it understands a large portion of pandas, the common APIs used for data processing. It understands a lot of NumPy, the data processing portion of NumPy. It understands scikit-learn, and our documentation has a reference for all of these. For these operations, Bodo knows how to parallelize them. In the next stages of the pipeline, the compiler, using these parallelization semantics, has an algorithm to figure out how the program has to be structured: what data structures have to be distributed across processors, what data structures have to be replicated, and the corresponding computations that need to be distributed.
So Bodo, based on those semantics of the Pandas APIs, can do that automatically, and then Bodo translates the code to distributed code, which means each process works on a chunk of the data: it will load only that portion of the data from the file, do the computation on that portion, and so on and so forth. And there are some data structures that need to be replicated across processors, smaller data structures, and Bodo will insert the communication necessary for the computation to happen. For example, if you have a join of data frames, Bodo inserts the shuffle steps and all the other things around the join so that the join is done correctly.
So Bodo finishes this distributed parallel computing transformation and then goes back to Numba. Numba translates this representation into LLVM IR, and LLVM translates that representation into actual assembly and binary. So given the Python function, an optimized binary comes out. The way this new function is called is that the Python arguments are so-called unboxed into their native representation, in Numba framework terminology, and the native function is called. Python is great in that it can interact with native code and C code very easily.
The function finishes, and the return value of the function is so-called boxed into a Python object and goes back to regular Python. So the workflow of this JIT optimized code is very seamless in terms of interaction with regular Python. The goal is to provide the Python experience, but the speed of high performance computing, in this process.
[00:36:01] Unknown:
You mentioned that Python is able to very easily interact with C libraries and other native code. And I'm wondering, for packages such as NumPy and Pandas that do rely on those native compiled extensions to the Python library, how is Bodo able to interact with those types of systems? Or is it able to simply ignore them and just focus on the actual Python bytecode, with all of the native binaries treated the same by the Python interpreter?
[00:36:31] Unknown:
So Bodo looks at the APIs of Pandas and NumPy and makes a decision in terms of how to implement them. We may ignore the implementation of pandas and NumPy altogether, because Bodo can do something more optimized and parallelized. A lot of the time, that's what happens. But sometimes there is an optimized implementation of something complicated in a C library somewhere, and we just call that from Bodo. We can call any Python or C code inside this JIT context. Calling into Python has a little bit of the boxing and unboxing overhead that I mentioned, but a lot of the time it can be negligible.
This is a great advantage for us, because a lot of the operations, things like date and time processing, are already done in native code and C code in pandas, and we just take advantage of it. We don't need to reinvent the wheel. So interaction with those extensions is easy. And anytime there's a custom piece of code that Bodo doesn't understand, the user can just come back to regular Python and run that. And this generality helps to deal with all the complicated situations that can happen in real world applications.
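One concrete mechanism for this kind of escape hatch is Numba's objmode context manager, which Bodo builds on; whether a given Bodo version exposes the same hook, and the parse_version helper itself, are assumptions here:

```python
from numba import njit, objmode

def parse_version(s):
    # Arbitrary interpreted Python, e.g. code using a C extension library
    # the compiler does not understand.
    return int(s.split(".")[0])

@njit
def major_version(s):
    # Box the argument, call back into the regular interpreter, and
    # unbox the declared int64 result into native code.
    with objmode(v="int64"):
        v = parse_version(s)
    return v

print(major_version("3.11.2"))
```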
[00:37:50] Unknown:
And so for people who are using Bodo, can you talk through the overall workflow of starting with a Python project that might already be fully functional but maybe doesn't run as fast as they want it to, then applying Bodo to it and determining the levels of speedup that it's getting, and the overall process of debugging any errors that might come up as they're either changing the program or applying Bodo to it, and challenges that come about because of the increased parallelism?
[00:38:23] Unknown:
So, typically, the process is that a data engineer prototypes the workload in regular Python, and there's no need for using Bodo for initial development. So all the testing of the actual computation can happen in regular Python tools. Other frameworks don't provide that because they can't handle the native APIs. But afterwards, for using Bodo, typically they have to look at where the actual computation happens and where the large data structures are. We factor those pieces of the application into functions that can be jitted. So the Pandas and NumPy work on large data structures, terabytes of data, is put in functions, and the Bodo decorator is put on top of those. And sometimes there may be some typing issues that need to be refactored for the code to work in Bodo. But with Bodo code, you can actually comment out the JIT decorator, or even disable it from an environment variable; it's regular Python code. Even after the refactoring for JIT, we can still test it in regular Python and make sure it works.
From that point on, once it works on a single core with JIT, you can test it on 2 cores and 3 cores and 4 cores. Typically, there are no issues. Once it works, you can scale it on any number of processors without any problem. So we haven't had too many issues with the parallelization per se. It's the first step, converting code to JIT, that needs attention and work, but afterwards it's very easy because Bodo takes care of the rest. Sometimes there are structural issues in the program, where the parallel APIs are not used properly by the programmer. Bodo automatically figures those situations out and throws warnings:
this function could not be parallelized; look at these diagnostics for why the operations you are using are not parallelizable. And it points to exactly the line. So we have extensive diagnostics for parallelization capability. Afterwards, the code works, it's parallel, there are no issues, and the deployment is a standard MPI deployment. And there are so many resources on MPI that if there were issues, typically with a simple search you can find various resources for MPI. So that's the process of using Bodo. Basically, it's about the JIT aspect, jitting the code, and making sure there are no parallel issues based on the compiler messages.
Afterwards, it's done. There are no other steps that are necessary. And for debugging, you have access to print, and so on, in terms of logs and printing and all of those things, but you also have access to all the tools in the Python world and the MPI world for debugging at the same time, if necessary. We haven't seen many parallel issues. It's about making the code work sequentially on a single core; from that point on, it almost always works perfectly on any number of processors.
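A sketch of that workflow: the same jitted script runs unmodified under plain Python for testing and under mpiexec for scaling (the file name, data path, and column names are illustrative):

```python
# etl.py
#   python etl.py                  # 1 core, plain run for debugging
#   mpiexec -n 4 python etl.py     # 4 cores on a laptop
#   mpiexec -n 96 python etl.py    # same code on a cluster host list
import bodo
import pandas as pd

@bodo.jit  # comment this decorator out to fall back to regular pandas
def etl(path):
    df = pd.read_parquet(path)                 # each process loads its chunk
    print(df.groupby("region")["sales"].sum())

etl("sales.pq")
```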
[00:42:01] Unknown:
You've mentioned a few times the use of Bodo for data engineering workflows, and I'm wondering what are the opportunities for being able to apply it to machine learning or deep learning projects using frameworks such as PyTorch or TensorFlow or MXNet.
[00:42:15] Unknown:
So our focus is on large scale data processing, ETL, data prep, featurization, and Bodo is often used in machine learning applications in general. The last step of the pipeline is obviously training a machine learning model, and we have support for scikit-learn for the more classical machine learning algorithms. But for deep learning, we don't compile the APIs of the deep learning frameworks, because they already do that and they can optimize for the GPUs. We integrate with them. So Bodo, because it's MPI based, works with Horovod natively, and we have support for transferring the data from the CPU after you do the processing in Bodo, transferring the data to GPUs, and managing those GPUs, but the rest of the compute is the same. It's your usual PyTorch and TensorFlow code. So that's how the machine learning and deep learning pipelines are built with Bodo today.
It's through integration with existing tools, and Bodo takes care of scaling beyond a single node and single GPU for you using Horovod.
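A hedged sketch of the shape of that handoff: Bodo preprocesses the data on MPI processes, and Horovod, which is also MPI based, runs distributed PyTorch training on the same processes. The column names and model are made up, and this is not necessarily Bodo's exact integration API:

```python
import bodo
import pandas as pd
import torch
import horovod.torch as hvd

@bodo.jit
def featurize(path):
    df = pd.read_parquet(path)
    # Normalize a feature in parallel across all MPI processes.
    df["x"] = (df["raw"] - df["raw"].mean()) / df["raw"].std()
    return df[["x", "y"]]

features = featurize("train.pq")  # hypothetical training data

hvd.init()  # reuse the same MPI processes for training
model = torch.nn.Linear(1, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
opt = hvd.DistributedOptimizer(opt, named_parameters=model.named_parameters())
```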
[00:43:35] Unknown:
In terms of being able to integrate with something like a workflow orchestrator such as Dagster or Airflow, what is the process for being able to use one of those tools to drive the execution of Bodo, or a program that was compiled with Bodo where you're then using the jitted output?
[00:43:55] Unknown:
So Bodo's workflow is standard Python, and Bodo is installed as a standard Python package, so any environment that supports Python can run Bodo. The difference with Bodo is that instead of a single Python process, you're launching many more, potentially 1,000. And you can launch it through the command line with mpiexec. So instead of saying python this file, you say mpiexec python this file, and MPI just needs a host list to make that happen. So we have been able to launch all the jobs in any environment, on premises, in the cloud, standard tools, nonstandard tools, because the requirements are so simple. And it's the same for the workflow management tools as well. All of them support Python or just running a command line, and that's the simplest way to launch Bodo in any environment.
But there are cases where you want more integration, where there are notebooks and interactive jobs and things like that, and those become more interesting. We are working with the open source community and developing IPython Parallel for those cases, the more interactive cluster cases where just launching a command line may not be enough, and that's going very well. I encourage Python developers to check out IPython Parallel. It's a very cool project and makes parallel notebooks very easy and interesting.
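A small sketch of that interactive path using IPython Parallel's documented Cluster API (version 7+); launching MPI-backed engines this way is what makes MPI-based code usable from a notebook. The worker function is illustrative:

```python
import ipyparallel as ipp

# Start 4 engines with the MPI launcher and connect a client to them.
cluster = ipp.Cluster(engines="mpi", n=4)
rc = cluster.start_and_connect_sync()

def report_rank():
    from mpi4py import MPI
    return MPI.COMM_WORLD.Get_rank()

# Run on all engines; each MPI process answers with its own rank.
print(rc[:].apply_sync(report_rank))
cluster.stop_cluster_sync()
```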
[00:45:35] Unknown:
In terms of your experience of working with Bodo and seeing people apply it to their different data engineering challenges, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:45:48] Unknown:
So for Bodo, something that was very interesting and unexpected to me was that, given this new powerful tool, data engineers basically change the architecture and just use a different stack altogether to solve the problem. So I saw that a lot of SQL based systems are being replaced with Bodo, or graph databases, or they even change the storage. If you can run Python at scale on Parquet files, you may not want a SQL based system. And if your Python runs in real time and very fast, you may not want a Presto installation and deployment. These were very interesting and unexpected to me. I didn't foresee Bodo being used for such a diverse set of applications and changing the stack so much, just because Bodo is so much simpler and so much faster. I didn't know you could decide to change to a different type of storage for these kinds of use cases that need real time, or use Bodo for these complex visualization pipelines or business intelligence and those sorts of workflows.
These were unexpected to me how much impact Bodo could have on the compute environment.
[00:47:16] Unknown:
And in your experience of building the Bodo project and working with the Python community to understand its use cases and capabilities. What are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:47:29] Unknown:
So a key challenge for us in introducing Bodo as a new platform to the world is that we are making Python faster, right? And whenever we say the word Python, everybody thinks about slow computing on small data. It has been very challenging to explain that we bring the simplicity of Python, but the performance of high performance computing, and we are here to solve bigger data engineering problems. We are not here to solve the small data problems of Python on your laptop per se. You could use Bodo to make data science workloads on a laptop faster, but we are here mainly to solve the bigger challenges that are causing a lot of issues for you on terabytes of data and thousands of cores.
So this understanding that Python can be used for large problems, and Python can be used for production workflows, that has been challenging, and I didn't expect it to be this difficult to convince various personas that Python should be a first class citizen and the default language for production. You don't need to rewrite your code in Scala and C, things like that. It's difficult to convince buyers, especially IT leaders, that you shouldn't make your programmers who love Python rewrite the code in Java and Scala, and that Python can deliver the results that you would expect.
Python is very productive, very simple to use. The challenge is performance, and we have solved it. And there is a challenge of data types and robustness, which we have solved as well because we compile the code and check the types and throw the right compiler errors so that we don't have failures in production. So we have solved that problem as well. So we think Python should be the language for production.
[00:49:39] Unknown:
For people who are interested in Bodo and looking to optimize and accelerate their Python projects, what are the cases where Bodo is the wrong choice and somebody might need to actually go down the path of writing an optimized native extension for, you know, a tight loop for some computation heavy aspect, or explore some of these scale out architectures such as Dask or Ray?
[00:50:05] Unknown:
Our focus is on data intensive workloads in data engineering, ETL, data prep, and those sorts of things, and we think Bodo is the best solution for those. If the workload doesn't fit the typical data problems, for example, it's a web application or web server, it clearly doesn't fit Bodo, because it's not a data problem. But sometimes there are things in between. Things like scientific workloads can look similar to data problems, and they may not be data problems. And we don't target some of those scientific workloads that are not data problems. So anything that's working on a lot of data, things like pandas data frames, billions of rows, ETL, joins of tables, processing tables, processing rows, those are great for Bodo.
If it's truly a distributed computing problem, where there are independent things that need to be done across the system and it's not a parallel computing problem, that's where Dask and Ray are the preferred solution, and Bodo is not the right answer. But whenever it's large scale compute and data, terabytes of data, that's where Bodo
[00:51:32] Unknown:
shines. And as you continue to iterate on the Bodo project, optimize its capabilities, and explore how to expand its capabilities and compatibility, what are some of the things you have planned for the near to medium term?
[00:51:46] Unknown:
So our goal is to bring the compute power of high performance computing to as many programmers as possible. We want to lower the barrier to this kind of compute for regular Python programmers and help them become data engineers, essentially, with as little learning and as few barriers as possible. So in the short term, we are working on improving the user experience of Bodo: basically, automatically compiling as much as possible without changes in the code, where that's actually practical, and when it's not possible to compile, throwing the right errors and having the right documentation to explain what could be done in the program. So the programmer experience is our number one focus in the short term. We are also developing the Bodo platform to be as complete as possible.
We are fully storage agnostic, so we support Parquet on object storage like S3 or ADLS. We support various data warehouses like Snowflake, and we want to make this storage support as seamless and as efficient and scalable as possible. But we are also building a managed cloud platform on AWS and Azure to have a lot of the right features in terms of managing clusters, notebooks, and managing users. And we have a new notebook experience that I'm very excited about. In a few months, it will be out, and I'm very excited to see what kind of applications and what kind of users this new experience can really enable, because it's a much lower barrier for programmers than before.
I believe that any Python programmer can become a data engineer with Bodo, with a much lower amount of learning and fewer barriers than something like Spark. And they can basically, hopefully, double their salary without learning all of these complex big data frameworks.
[00:54:08] Unknown:
So that's our goal. And are there any other aspects of the work that you're doing on Bodo or the overall space of using Python for data engineering and accelerating its computational speed that we didn't discuss yet that you'd like to cover before we close out the show? I think
[00:54:24] Unknown:
the biggest aspect is the mindset of various members of the data team. So at a high level, IT leaders need to pay attention to the needs of the developers, because IT leaders care about time to business value. And if we explain that Python should be the default language to reduce that time, that's where the value for IT leaders is. So that's one aspect to focus on, and the programmers using Bodo can really argue that we can reduce development time and time to production. That's one key aspect to focus on. The other aspect is the cost of infrastructure for organizations.
If some computation is 10 times faster, and we have had 100 times, but let's say 10 times faster with Bodo, it means that your cloud bill is 90% lower. So that's a key aspect that IT leaders are focused on, but I think developers have to be focused on it as well, because the cost grows significantly; datasets are growing on a daily basis. So we think this change of mindset in terms of time to value and the cost of the infrastructure can really change the landscape of analytics, and Bodo is the catalyst to make this transformation happen and democratize data and machine learning for all enterprises.
[00:55:59] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose something I recently started experimenting with: building paracord bracelets with different interesting combinations of survival tools in them, just a fun thing to do away from the computer. So I've been enjoying that. Definitely worth taking a look at for just a fun little quick project you can do. So with that, I'll pass it to you, Ehsan. Do you have any picks this week? The NBA season is starting soon.
[00:56:36] Unknown:
So I'm addicted to basketball, and that's something I'm looking forward to.
[00:56:44] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Bodo. It's definitely a very interesting project and an interesting product, and I'm excited to see the capabilities that it unlocks for the Python community and ecosystem, and excited to see how it grows along with the seemingly never ending growth of the Python language. So thank you for all the time and effort you've put into that, and I hope you enjoy the rest of your day. Thank you, Tobias. It was great speaking with you, and I encourage the Python community
[00:57:12] Unknown:
to check out our community version and let us know what they think. We are listening, and we would like to know where this project should go to address their needs.
[00:57:25] Unknown:
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Ehsan's Journey to Python
Introduction to Bodo
Why Python is Slow and How Bodo Fixes It
Comparison with Other Systems
Optimizing Data Processing Workflows
Challenges in Compiling Python Code
Bodo's Compilation Process
Applying Bodo to Existing Projects
Bodo for Machine Learning and Deep Learning
Integration with Workflow Orchestrators
Interesting Use Cases of Bodo
Challenges and Lessons Learned
When Bodo is Not the Right Choice
Future Plans for Bodo
Final Thoughts and Closing Remarks