Summary
A majority of the work that we do as programmers involves data manipulation in some manner. This can range from large scale collection, aggregation, and statistical analysis across distributed systems to something as simple as making a graph in a spreadsheet. In the middle of that range is the general task of ETL (Extract, Transform, and Load), which has its own range of scale. In this episode Romain Dorgueil discusses his experiences building ETL systems and the problems that he routinely encountered that led him to creating Bonobo, a lightweight, easy to use toolkit for data processing in Python 3. He also explains how the system works under the hood, how you can use it for your projects, and what he has planned for the future.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- I would like to thank everyone who supports us on Patreon. Your contributions help to make the show sustainable.
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your awesome app. And now you can deliver your work to your users even faster with the newly upgraded 200 GBit network in all of their datacenters.
- If you’re tired of cobbling together your deployment pipeline then it’s time to try out GoCD, the open source continuous delivery platform built by the people at ThoughtWorks who wrote the book about it. With GoCD you get complete visibility into the life-cycle of your software from one location. To download it now go to podcastinit.com/gocd. Professional support and enterprise plugins are available for added peace of mind.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com.
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Your host as usual is Tobias Macey and today I’m interviewing Romain Dorgueil about Bonobo, a data processing toolkit for modern Python
Interview
- Introductions
- How did you get introduced to Python?
- What is Bonobo and what was your motivation for creating it?
- What is the story behind the name?
- How does Bonobo differ from projects such as Luigi or Airflow?
[RD] After I explain why those are totally different things, a good follow up would be to ask about differences from other data streaming solutions, like Apache Beam or Spark.
- How is Bonobo implemented and how has its architecture evolved since you began working on it?
- What have been some of the most challenging aspects of building and maintaining Bonobo?
- What are some extensions that you would like to have but don’t have the time to implement?
- What are some of the most interesting or creative uses of Bonobo that you are aware of?
- What do you have planned for the future of Bonobo?
Keep In Touch
- Bonobo Project
- Romain
- Website
- @rdorgueil on Twitter
- hartym on GitHub
Picks
- Tobias
- Romain
Links
- Bonobo
- RedHat
- Anaconda Installer
- ETL
- Pentaho
- RDC.ETL
- DAG (Directed Acyclic Graph)
- Luigi
- Airflow
- NamedTuple
- Jupyter
- OAuth
- Graphviz
- Dask
- Data Engineering Podcast
- Dask Interview
- Selenium
- Zapier
- IFTTT (If This Then That)
- FPGA
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. I would like to thank everyone who supports the show on Patreon. Your contributions help to make the show sustainable. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your app. And now you can deliver your work to your users even faster with the newly upgraded 200 gigabit network in all of their data centers. If you're tired of cobbling together your deployment pipeline, then it's time to try out GoCD, the open source continuous delivery platform built by the people at ThoughtWorks who wrote the book about it. With GoCD, you get complete visibility into the life cycle of your software from one location. To download it now, go to podcastinit.com/gocd. Professional support and enterprise plugins are available for added peace of mind. You can visit the site at podcastinit.com to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions, I would love to hear them. You can reach me on Twitter at podcastinit or email me at host@podcastinit.com.
To help other people find the show, please leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. Your host as usual is Tobias Macey. And today, I'm interviewing Romain Dorgueil about Bonobo, a data processing toolkit for modern Python. So, Romain, could you please introduce yourself?
[00:01:32] Unknown:
Hi, Tobias. So first, thank you very much for having me on the podcast. So I'm Romain Dorgueil. I'm the creator and main developer of Bonobo ETL, a Python library to extract, transform, and load data. Sometimes I define myself as a Python hacker, but I mostly tend to see technology as a means more than an end. And I like building solutions and seeing how they can change how humans use things and do things, more than the actual technology behind it. And do you remember how you first got introduced to Python? Yeah. So that's quite funny, because I grew up in, let's say, an early Linux powered house.
We had Linux boxes at home around, like, '95, something like this, mostly thanks to my father. The first distributions we used were Red Hat distributions. And at some point, they used this thing called Anaconda, which was the installation system, not the Anaconda we know in the Python distribution world. So some people told me, yeah, the Anaconda thing is made with Python. I didn't know about Python then, and they told me about it in very nice terms. So a few years later, I had to start doing some system engineering work in a team of software engineers. I had to help them raise the quality of what they were delivering, the deliverables, and I started using, quite naturally, Python for that, and that was pretty much my first encounter with it. It was probably bad code I was writing back then, but that's the first real professional encounter I had with the language.
And after that, I started building companies at some point. And while we had a bunch of PHP software, the obvious choice I had for handling data processing was Python. Yeah. And it's at this time I started to work on ETL libraries that would, at some point, be rewritten in Python 3.5 as Bonobo.
[00:03:26] Unknown:
And that leads us nicely into a brief sort of generational evolution up to the point of having Bonobo, where there might have been some previous incarnations.
[00:03:45] Unknown:
Yes. Mostly, I hit my head against the wall a lot of times about transforming data. Yeah. It was kind of an iterative process. So just to tell what Bonobo is: it's an extract, transform, load framework. Maybe not everybody is familiar with this term, so I'll try to explain it in simple terms. For me, ETL is just a fancy term for a system that moves and transforms a bunch of data from one place to another. I say it's an ETL framework because the goal here is not to reinvent the wheel every time about how we will organize things, how we will structure the files and the data. It puts in place some guidelines for teams of engineers about that.
And it also helps with running the thing efficiently, so that the teams can just be about writing jobs, writing data transformation jobs, instead of reinventing all that. If we jump back in time, like you said, in, like, 2011, I think, when I started to get interested in building companies, I had to find a way to integrate a lot of catalog data at the time that came from a lot of different systems into our databases and data sources. I mean, we are working with computers, but out there, it's pretty artisanal. So each of our partners had their own ways to store the catalogs, to store information, and even the partners that used the same vendor software had differences between them. So we had to deal with XML, JSON, CSV, probably every format, I don't remember. Some would provide the pictures associated with the catalog items on FTP servers.
Sometimes they would push to our own FTP server. Some would make them available over HTTP. For some, we even had to scrape the ecommerce website because they didn't have the pictures and they didn't know how to get them. And there were some structural properties differing from one system to another, especially regarding the identity of the product. And then we would eventually have to match the taxonomies with our internal taxonomy. So, yeah, there was a bunch of things to do, a lot of things. And as you add different partners, it becomes harder and harder to do it properly each time. So, at the time, we started using a graphical ETL, because ETL is not new, the software category is decades old, and we used something called Pentaho Data Integration, which is a Java IDE for ETL. And it's all graphical, so you just point and click and draw boxes and draw the data path between the boxes. So it did the job at first, but it required a lot of copy pasting between jobs, because we had partner 1, then partner 2, then partner 3. And 90% of the job was the same from partner to partner, but the 10% left was the real problem here, because once you start copy pasting things, you want to fix a bug sometime, and it's just not possible because you have to fix it in 20 different files.
So that's the moment I wrote my first ETL library. It was Python 2.7. It was open source also. It's still available, but it's not really good code quality. Mostly, we had to be pragmatic. We were building a business, so we cut some corners. We just tried to get the job done. And we used it for quite some time, like a few years. And then, what took weeks to integrate at first took just a few hours to integrate a new partner. So it was pretty great for us, and it went great because we could not humanly ask for high setup fees from our clients. We needed it to be easy and quite cheap to integrate. Yeah. It did the job, and we were probably integrating something around 20 million catalog items a day with this system, only using one computer. It was not distributed across more than one machine, but it was sufficient for our needs. Then probably last year or a bit before, I started to have similar work to do again.
Not at all for catalog items, but for different things we use here, mostly contacts and personal data, personal information. This is the point where I decided to rewrite it from scratch, still as open source, using the same concepts that proved efficient in my previous job, but much more accessible and much more mass market than it was before. It's always good to start from a clean sheet, because you can take drastic decisions without having to worry about anything. So it was pretty easy to drop Python 2 support. I could wipe all the code that was either flaky or not necessary.
I could fix some real, deeply tied structural defects that lived in the old code base. That was basically when Bonobo was born. And at that point, I was not really certain I could go further than just something very confidential and not really used. But then I got featured on the front page of Hacker News, and that drew a lot of traffic to the website and a lot of comments, a lot of users giving their opinion, either because they'd seen the software, or just because they read the message and said, oh yeah, that sounds great, or that doesn't sound great, or you don't have a lot of information, or things about monkeys, or anything else that was in their head. And that really gave me a lot of energy to work on making it a real thing, a really serious project. I've basically worked a lot since then, but that's really thanks to all the interaction and all the feedback I got at that point.
And since then, I have met a lot of different companies that are reinventing the wheel again and again in the ETL field, each one for themselves, probably hidden somewhere, not really open source or only partly open source, not really documented. Just talking with them confirmed to me that the problem exists, is real, and that people really need some piece of software to do this. So, yeah, that is a real need, and that's indeed a big part of the motivation and all the energy I have to put into this project.
[00:10:23] Unknown:
And what led you to choose the name Bonobo for the project? Is there any particular backstory behind that?
[00:10:29] Unknown:
That's a bit of a story. The previous one was called RDC.ETL. If that taught me something, it's that cryptic names using acronyms nobody understands are not really good for a project. So I started digging for a name that was symbolic, that had a meaning. And I really liked the idea of, like, an army of tiny monkeys just working on your data one packet at a time and passing it to the next monkey, a kind of assembly line of data. Among all the names available on PyPI, Bonobo was there. At the time, I didn't really know that bonobos are not monkeys, but apes. In my defense, in the French language there is no difference; we have only one word for both.
So you could say that I was just not informed at all and had no attention to detail, because it's not a monkey. But I guess that's not really important. But, yeah, that's the story behind it. Now it's just Bonobo, and I'm pretty happy with the name.
[00:11:33] Unknown:
Some of the other tools in the Python space for doing ETL jobs that people might be familiar with are things such as Luigi or Airflow. So I'm wondering if you can just take some time to do a bit of compare and contrast between the use cases that Bonobo is targeting and what people are using Luigi and Airflow for?
[00:11:56] Unknown:
Yes. Definitely. In fact, that's a very good question, because I get asked that quite often. I think they're very different kinds of projects, but people easily get confused comparing them because both use the DAG word to describe their data structures. DAG means directed acyclic graph. It's a kind of basic data structure for computing, but we each use it for different reasons. Also, Luigi and Airflow are used for ETL, so it's easy to say, yeah, what's the difference? As far as I understand it, Airflow and Luigi are more like GNU Make with superpowers.
They're tools that allow you to schedule tasks and to describe the dependencies between tasks. So for example, if you have c that depends on b that depends on a, you create a graph that says, okay, I have tasks with those kinds of dependencies. And if you want to build c, then the system knows that it should first build b, which needs a to be built first. And so it's really: okay, wait for all my dependencies to be built before I start this thing, and then aggregate all the monitoring data, like how much time did it take, what were the errors, etcetera.
Bonobo, on the other hand, is more like a data flow or data streaming solution. It uses graphs also to describe what you want to do, but it uses them to say, okay, whenever something goes out of a, then it should be put in b, and whatever goes out of b should go in c. But it won't wait. I mean, a and b and c in Bonobo are really tied together and work as a whole. It's not: when a is finished, do b. It's streaming between a and b. I don't know if that's a clear explanation of the difference.
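(A minimal sketch of that streaming idea, in the style of Bonobo's documented quickstart; the node bodies and data are placeholders, so treat the details as illustrative rather than canonical.)

```python
import bonobo

def extract():
    # a -- produces rows one at a time
    yield 'hello'
    yield 'world'

def transform(row):
    # b -- receives each row as soon as extract() emits it
    return row.title()

def load(row):
    # c -- consumes rows as they stream out of transform()
    print(row)

# Chain a -> b -> c; the nodes run concurrently and data streams between them.
graph = bonobo.Graph(extract, transform, load)

if __name__ == '__main__':
    bonobo.run(graph)
```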
[00:13:59] Unknown:
No, that definitely makes sense, particularly given that Luigi and Airflow are also more focused on large distributed task graphs that are going to be spanning multiple systems, where you're pulling things out of a set of flat files, doing some manipulations on them, putting them into maybe a Hadoop cluster and then triggering a MapReduce job, waiting for that to complete, and then taking the output of that to put it into a database somewhere, for instance. Whereas from what I was able to gather about Bonobo, the use case is more focused on being able to do in-process ETL jobs where you're taking the output of a SQL query, doing some manipulation on that, and then maybe uploading that to S3 as a set of flat files or something along those lines?
[00:14:45] Unknown:
Yeah, definitely. And as a result, I'm really looking forward to having integration between Bonobo and Airflow, especially since it joined the Apache incubator, because it would really make sense, for some tasks, to have Bonobo just handle them, then have Airflow grab the errors that may come out of a task, and whenever a task is done, do some other thing that may or may not involve Bonobo. I think both systems could work together quite well.
[00:15:13] Unknown:
And some of the other tools that are more focused on doing data streaming processing are things like Apache Spark or Apache Beam, which are implemented on top of the JVM but are often interfaced from Python programs. So I'm wondering if you can do a bit of compare and contrast on the use cases for those solutions versus Bonobo.
[00:15:35] Unknown:
Yes. So the value proposition is much closer, so it's tougher to answer this one. But mostly, Bonobo focuses on being like a Swiss army knife for data, while systems like Beam or Spark are mostly tools designed by huge organizations to handle big data first. So there are different scales of data, and I think, if you're a GAFA and you have hundreds of terabytes of data, Bonobo is definitely not the right tool for that. And there are tools like Beam and Spark that are really good at doing that. But being big-data-first and GAFA-focused comes with a price. And the options you have to start with Beam or Spark today are of two different kinds. You can either tie yourself to a cloud vendor. You all know them; you can choose Amazon, you can choose Google.
Just have them run it for you, and then it would be not quite easy. It can be very hard, or just hard, to say, okay, I'm leaving Google for Amazon, or I'm leaving Amazon for Microsoft, or anything. And the other kind of option you have, if you want to run it locally on your own infrastructure, is to install and maintain a cluster infrastructure. If you just want to work with some web services, work with some SQL, etcetera, you probably don't want to pay the price of that cluster infrastructure just because you want a small tool that you can run locally. So I'd say it's really made for different scales of data. I want at some point to explore a bit the bottom of the big data world with Bonobo. But mostly, it's not the same goal or the same scale at all. Also, you said that Beam and Spark are based on the JVM.
I think that's a difference too, because even if there are Python interfaces, the code and design of the different tools are really coming from the Java world. And even, for example, Beam is still tied to Python 2.7. I think that's normal; when you have tens of thousands of employees, you're not moving fast towards the most modern versions of Python. But if I'm just building a company tomorrow, I want to take advantage of the most modern things I can. So that's also a difference. And, of course, Beam at some point will go to Python 3, just because Python 2 will expire at some point. But, yeah, the look and feel of the code is quite different too. And, of course, if you have data sets of big data scale, you need to leverage clusters of workers, and, yeah, the Hadoop ecosystem has way better solutions than Bonobo for you. And it's worth calling out the fact that, as you said, Bonobo is Python 3 only, and if I'm not mistaken, Python 3.5 and forward only. So I'm wondering what it is in the 3.5 release
[00:18:33] Unknown:
that you are taking advantage of that wouldn't be available in prior releases.
[00:18:38] Unknown:
Okay. I don't remember when the end of life of Python 3.4 is, but I think it's not so far away. There are two points that made me choose Python 3.5. First, it's the release that's available on Debian. Debian is known as a kind of conservative Linux distribution, and so I thought, if Debian thinks the stable Python 3 version is 3.5, I can safely assume that most people will do the same. That's one reason. The second reason is that 3.5 is the first one that has the async/await keywords built in.
Before that, you needed to use asyncio. Even if Bonobo today is not leveraging asyncio at all, I think in the future, for some kinds of nodes, not everything, but for some kinds of nodes, it would be really great to leverage an event loop and have asynchronous nodes within a graph. So that doesn't exist today, but it's basically the reason why I said 3.5 and following.
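(For context, a tiny illustration of the 3.5 syntax being referred to; the coroutine and its body are made up for the example.)

```python
import asyncio

# 'async def' and 'await' became keywords in Python 3.5, replacing the older
# @asyncio.coroutine / 'yield from' style of writing coroutines.
async def fetch(url):
    await asyncio.sleep(0.1)      # stand-in for a non-blocking network call
    return url

loop = asyncio.get_event_loop()   # asyncio.run() only arrived later, in 3.7
print(loop.run_until_complete(fetch('http://example.com')))
```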
[00:19:41] Unknown:
And digging into how Bonobo works, I'm wondering if you can describe what the internal architecture looks like and how that's evolved since you first began working on it, and, you know, maybe what are some of the lessons that you brought into it from your previous attempts at ETL tools?
[00:19:58] Unknown:
So how is Bonobo implemented today? Mostly, the goal of Bonobo is to help you create graphs, directed acyclic graphs, of regular Python callables. And once you build graphs, it provides tools to execute them in parallel without having to worry about all the problems that parallel programming can bring. The current execution implementation is quite simple, at least to explain. Each of the callables you add into graph instances is executed in a separate thread. All those threads have input and output queues, and everything that comes into the input queue is used as arguments to call your callable, and everything you return or yield from this callable is put on the output queue. Then the graph is used to understand which outputs need to connect to which inputs, and Bonobo just passes the bits of data from one node to another. There are a few advantages of doing that. First, you don't have to worry about any blocking calls that happen in one node. For example, if you're relying on HTTP services, or slow network services, or databases that yield a lot of data, you don't have to worry about the impact of this call blocking all the other nodes, because all the nodes will still execute in their own thread. Second, you don't have to worry about parallelism. Parallelism is quite complicated and can bring deadlock problems and things like this.
And as the only communication mechanism is queues that are managed by Bonobo, simple thread-safe queues from the Python standard library, you really don't have to worry about that. I think that's a good advantage for Python developers who may not be familiar with all the problems induced by threading and parallelism. And even if you are familiar with them, maybe you don't want to spend all your time worrying about that; you want to use your time to create value. Third, you're writing Python callables, so it's just very standard programming. It means you don't have to learn fancy, complicated stuff. You're just defining something, returning things, and handling your data one bit at a time. Finally, and that's a property I really like for data processing, it's very resilient to errors in the input data, because if a row fails when transforming at some step, you can easily skip just that row and continue processing. In some systems, you would prefer to stop whenever an error happens. But when it comes to integrating data, you often have a million rows, and if 10 rows fail somewhere, you just don't want to stop the whole system because of those 10 rows. You just want to skip them for now, have the developers fix the code and the edge cases, and on the next deploy you'll be able to handle the 10 faulty rows. You don't want to stop everything just because of 10 rows out of a million. So, yeah, basically that's the current implementation and why I think it's a good way to do things.
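(A deliberately simplified, non-Bonobo sketch of the queue-per-node execution model described above; the real engine also handles services, error channels, and orderly shutdown, so treat this purely as an illustration.)

```python
import queue
import threading

END = object()  # sentinel marking the end of the stream

def run_node(node, inbox, outbox):
    # Each node runs in its own thread: read items from the input queue,
    # call the node, and push everything it yields onto the output queue.
    for item in iter(inbox.get, END):
        for result in node(item):
            outbox.put(result)
    outbox.put(END)

def upper(row):
    yield row.upper()

def printer(row):
    print(row)
    yield row

a, b, c = queue.Queue(), queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=run_node, args=(upper, a, b)),
    threading.Thread(target=run_node, args=(printer, b, c)),
]
for t in threads:
    t.start()
for row in ('hello', 'world'):
    a.put(row)
a.put(END)
for t in threads:
    t.join()
```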
Yeah, and you asked about the architectural changes that happened in Bonobo, right? I think the biggest change happened before Bonobo was born: it's the architecture change that happened between RDC.ETL and Bonobo. The biggest defect it had back then was that the framework would add side effects to every node you would put in a graph. So it was usable if you knew about it, but, for example, you could not reuse one callable from one graph to another, because the framework would add some dirty stuff there. Basically, it would taint everything that passed through the framework with state. It was really hard to explain to people why it was like this, because it really did not feel right to do it like this. So that's not really within Bonobo, but that's probably the technical explanation of why Bonobo, and why a rewrite from scratch with architectural changes. The second huge problem I had, quite recently, was that, in fact, well, there were two things, and it happened just after PyCon.
I realized by talking with users that I had introduced too much magic in Bonobo. People were not understanding why we needed a CLI, a command line interface, why Bonobo was detecting the graph instances in files, and it didn't feel really Pythonic. At some point, someone even told me, let's not try to turn Python into JavaScript. And I have a very good opinion of JavaScript, don't get me wrong, but they were obviously right. I was trying to reinvent wheels in a way that was not really understandable by people. So I worked a lot to suppress all this magic, and the release that just happened removed all that. Now you're just running regular Python scripts, using the main blocks within Python scripts. So you don't need the CLI anymore. You can just use the Python interpreter and run Bonobo jobs like you would run anything else in Python. And the second problem was really a tough one.
As you probably know, dictionaries in Python 3.6 are now ordered, meaning that they preserve the insertion order of items within them. And it has nice side effects, like preserving the order of keyword arguments in a call. It's documented in the Python documentation as only a side effect of the new dict implementation, which is a compact dict. I don't really know the details about that, but I felt it was a nice side effect. It was really tempting to start relying on this 3.6 feature, which is documented as not really a feature, but maybe a bit of a feature still. And I started to leverage this property for a lot of things in the development of the 0.6.
Then I had a lot of problems, like the field order would shuffle when using Python 3.5. The CSV column order would change when you converted one file to another, etcetera. I had to think a lot about this, probably for two or three weeks. I didn't get anywhere because I did not have any solution for this problem. So, yeah, I thought a lot about it. One option, honestly, was to drop Python 3.5, but I was not really happy with that, mostly because the reason I told you before was still true, and I had users using Python 3.5, so probably they would be upset. And even if I dropped Python 3.5, there was some underlying implementation, like the data structure used to pass data within the queues from one node to another, where complexity was starting to build up. The code was like hundreds of lines, and that was not really feeling right or correct. So, finally, the solution was to remove anything relying on dicts to pass data. Instead, I used an approach based on named tuples, or a kind of named tuples. Now every input row is some kind of named tuple, and it allowed me to remove all the complex code around those row data structures. I probably deleted, like, 500 lines of code here, and, yeah, removing complex code always feels right. It also completely solved the problem of ordering fields, whether they have names, like CSV columns have names, or, for example, if you have lists of lists, where you just have indexes, which are integers. So you can order integer-indexed values or name-indexed values, and also store those field names much more efficiently in memory, because you keep them once per queue instead of duplicating them in every packet. So overall, when I started to do this U-turn, everything started to feel more correct, and I was pretty confident to release this code. And, also, there was a backwards compatibility break here, so I worked with users on the upgrade process using an alpha version.
Until today, everything pretty much went great for everybody. And just an anecdote here: the fun thing is that when I started to read the standard library's Python implementation of named tuples, I said, wow, that's ugly, because it relies on generating code. It generates the Python code as a string, then executes it using exec, and then uses that as the type of your data. When you just read this, you say, why is it implemented like this? But it's really funny, because when you really dive into each detail of the implementation, it still looks ugly, but you understand why that was the only possible way they had to do it. And in fact, it's really smart, because it's really efficient on the memory side. So I'm pretty happy that we went with this change.
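(To illustrate the code-generation trick being described: in the CPython versions current at the time, collections.namedtuple built the class source as a string and exec'd it. The snippet below is a heavily simplified, hypothetical imitation of that idea, not the real stdlib code.)

```python
# Hypothetical, stripped-down imitation of the approach: build source code as a
# string, exec it, and hand back the generated class.
_TEMPLATE = """
class {name}(tuple):
    def __new__(cls, {args}):
        return tuple.__new__(cls, ({args},))
"""

def tiny_namedtuple(name, fields):
    source = _TEMPLATE.format(name=name, args=', '.join(fields))
    namespace = {}
    exec(source, namespace)       # generate the class at runtime
    return namespace[name]

Point = tiny_namedtuple('Point', ['x', 'y'])
print(Point(1, 2))                # -> (1, 2), an ordinary tuple subclass
```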
[00:29:25] Unknown:
And one of the things coming from that, which also differentiates it from things like Luigi and Airflow, is the fact that all of the nodes in your processing graph are just standard Python callables, whether that's a function or a class. Whereas with the Luigi and Airflow frameworks, you really need to sort of conform your application to fit their implementation target. So with Luigi, there are certain classes that you need to subclass in order to be able to create different nodes in the graph, as opposed to just being able to do free-form Python development, which is where I think Bonobo really has an advantage, because it allows you to work in the same way that you have been working, but still take advantage of the underlying infrastructure that Bonobo provides to link the different nodes together and create that flow of data from one to the next?
[00:30:18] Unknown:
Yes, definitely. And probably Luigi and Airflow have this need for one more level of abstraction, which is great because it's very flexible and you can use a lot more systems with it, but it also comes with the cost of having to adapt your code to that level of abstraction.
[00:30:37] Unknown:
And you've spoken a bit about some of the different challenges that you've faced as you've been working on evolving and growing the Bonobo library. But are there any other aspects of that process that you'd like to call out as being particularly challenging, whether it's related to the software itself or just the marketing or interactions with the community?
[00:31:01] Unknown:
Yeah, it's nice that you mentioned that, because I think the most challenging aspects of building and maintaining Bonobo are mostly everything that is nontechnical. I mean, the technical part we discussed a bit before is not really simple. Everything needs to be backward compatible, parallelism must be transparent, you must think about developer experience, how the API looks, etcetera. All that is challenging, but as software engineers, that's pretty much what we do daily. It's our job to solve those kinds of problems, and we know how to solve them. The really hard part is, to quote Peter Thiel, going from zero to one. From you have nothing, to there is something that exists. As you said, it involves building a community around the software, both of users and of contributors.
It involves finding ways to maintain and enhance the software without, like, burning myself out. It involves documenting the library, making it accessible to the user base. It involves designing the developer experience, or the APIs, not based on the implementation but on how you would like the community to use it. Yeah, as you said, it's kind of marketing the open source software. Marketing for engineers is pretty tough; it's not our strongest skill, and so it's even harder. And that's funny, because in the Hacker News thread, someone told me that the word marketing was written in some commit message, and isn't marketing evil? And I strongly disagree with that, because for me, marketing is all the things you do to raise the awareness of a given target market about a given solution or a given product.
Documentation is marketing. Talking at conferences is marketing. Writing examples is marketing, etcetera. Talking on a podcast is marketing. It's not a bad thing if you're using it to promote something that you think is good. Of course, if you're using marketing to promote selling atomic bombs, it can probably be considered evil. But I think it's a necessary task, not the most pleasant task for technical people, but a very necessary one for pretty much anything, if you want it to be bigger than just something confidential you were using for yourself. Basically, it's a bit like a startup company: you're starting with some flaky product and trying to bring it into a viable state at some point where it cannot die anymore. And that's really the point here. All this marketing I'm trying to do, and all this explanation about what I'm doing, is trying to get the project to the point where it's not relying on me only anymore and it can live by itself. As in a startup company, the most difficult task you have is that on day one, you're fighting really hard against the fact that, in fact, nobody gives a damn about what you're doing. And you have somehow to convince them that it can solve problems for them, it can help them do their job, and so they should care about what you're doing.
Yeah, that's definitely the biggest challenge of building and maintaining any open source software. Of course, if you're Google and you're just releasing something, everybody will care because you're already Google. But if you're just some developer trying to release an open source software, it's very important to try to find a way to make it sustainable, not in the sense of, yeah, I should make money with this thing, etcetera, but just to the point where the software is viable. I mean, if I can't at some point get it into a viable state, then probably it will die. And my task, more than developing the system, is to try to find a way for it to be viable even if at some point I'm not here anymore, or I'm just trying to take a vacation or anything. I'm a bit confused in this explanation, but, in fact, a few years ago I read a book that said something like: you need to spend at least half the time you work on something actually explaining why and how you work on that thing. And probably that's the best advice I ever got, because, yeah, you can make the best product in the world, but if nobody knows about it, then nobody cares; it's not important. And so I think that it's actually much more valuable to do fewer things and explain them better than the opposite.
[00:35:35] Unknown:
And that also goes back to one of the ideas in the Zen of Python, that if you can't explain an idea, then it's probably not a very good idea. And I'm paraphrasing there. But by being able to actually talk about the project that you're doing and explain to people what it is and why it's useful, you can help provide a concrete understanding for yourself about what it is that you're trying to do, and potentially even suggest some ways that you might improve it or new directions that you might take it, because by virtue of teaching people about it, you gain a deeper understanding of what it is yourself.
[00:36:09] Unknown:
Yeah. Definitely.
[00:36:10] Unknown:
One of the things that you briefly alluded to earlier is the fact that Bonobo has some built-in extensions and converters to simplify some of the tasks in ETL pipelines, such as converting from CSV to JSON as an example. And I'm wondering if you could just briefly outline some of the different extensions that are already built in, and talk about what are some of the extensions that you would like to have but don't have the time or inclination or understanding to implement?
[00:36:42] Unknown:
So for now, there are different kinds of extensions. There are extensions that live in the main Bonobo repository, and there are extensions that live outside of it. For now, in the main Bonobo repository, you have some basic tools to work with standard data formats like CSV and JSON, different kinds of JSON. There will soon be things to read Excel files, to read XML files, things like this. I only include in the main framework what really 90% of people have already dealt with in terms of file formats, because there is probably an infinity of file formats, and you can find in the Python ecosystem pretty much anything you need to read pretty much any kind of file format. So if it's not there, it's pretty easy to implement, and you have examples of how to implement it correctly, so it will be straightforward work. And, of course, I tend to encourage people to contribute the readers and the writers for the formats they use, because, of course, I can't do everything. Apart from the core nodes and the core readers and writers that are provided with the framework, there are also two plugins, the console plugin and the Jupyter plugin, which provide displays for interactive terminals and for Jupyter notebooks, which probably most of your audience already knows. The idea is that Bonobo should provide the tools to really understand what's happening while you execute things.
Step by step, we're trying, especially with the Jupyter notebook extension, to provide additional tools to work interactively with your graphs. So, yeah, it's basically tooling around the ETL and the graphs of Bonobo. There are also, since the 0.6, some drafts of integration for Google APIs. So you can, for example, work with spreadsheets, you can work with Google contacts, etcetera, so working with OAuth systems. I have a few extensions to work with open data services also. And, yeah, as you said, there is the SQLAlchemy extension, which is an external extension that can help you work with databases. Oh, and what I forgot just before is that the 0.6 also released the first Django integration. So you can also very easily integrate Bonobo ETL jobs within a Django management command, and just manage data through your Django manage.py, like you would do in Django, but using Bonobo.
And you have some bits of integration between them, so it makes it really easy to use the Django ORM too. I mean, if you're out of the Django world, you would probably use SQLAlchemy for the SQL access, but if you're in the Django world and using the Django ORM, you can now use it with Bonobo quite easily. So, yeah, we talked before about Airflow and Luigi, so I won't expand more on that. One core thing that I'd love to have is implementing new execution strategies. I'll probably work on a subprocess-based strategy in the next few months, because it's important to have more than one real-world strategy for execution, to be sure that I didn't forget anything in the interface for that. And, for example, and I probably won't have any time to do that in the next month, but I would love to leverage Dask and Dask Distributed, so that someone who uses Bonobo for a small-scale thing that could become big knows that at some point they can switch the strategy and, with very few adjustments, leverage a cluster of worker machines to do the exact same job. I don't know if you've seen Dask and Dask Distributed, but they're really doing a great job of that, so I'd really like to have Bonobo leverage them. If you want to go further than just one machine and have a cluster of 100 machines, that would really be the way to go. And, also, but that's probably a whole project in itself, I'd love at some point to have a graphical user interface on top of it, probably something built with Electron and a lot of JavaScript.
That would make Bonobo a lot more accessible for less technical users, and yet still leverage the software engineering practices that I'm trying to inject into this ETL development and data engineering development. So, yeah, that's probably more long term, and it's really open; we have a Slack channel where we discuss quite a bunch of things. But those are probably the two big things I'd really love to have in the, I don't know, next year of Bonobo. They would be huge jumps forward.
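(As a small illustration of the bundled readers and writers mentioned above, a hedged sketch of a CSV-to-JSON job: the file names are made up, and the service wiring shown is one plausible way to configure the filesystem service, so check the Bonobo documentation for the exact options of your version.)

```python
import bonobo

# Chain Bonobo's bundled CSV reader into its JSON writer; the file names here
# are placeholders for the example.
graph = bonobo.Graph(
    bonobo.CsvReader('partners.csv'),
    bonobo.JsonWriter('partners.json'),
)

if __name__ == '__main__':
    # File nodes resolve their paths through the 'fs' service; pointing it at
    # the working directory is an assumption about a typical setup.
    bonobo.run(graph, services={'fs': bonobo.open_fs()})
```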
[00:41:27] Unknown:
Yeah, the work that Matthew Rocklin has been doing with Dask is definitely very impressive, and I actually interviewed him for an episode of my other podcast, the Data Engineering Podcast, which I'll put a link to in the show notes. Okay, that also answers one of the questions that I had about what would be involved in allowing Bonobo to span across multiple machines in the same sort of way that things like Luigi and Airflow do. So that's definitely an interesting direction that I think would be cool to see Bonobo taken in. And one of the other things that I believe is already built in as well is the capability of being able to generate a Graphviz visualization of the way that the task graph is tied together, so that you can visually see what are all the different steps in a given ETL pipeline that you've built with Bonobo?
[00:42:16] Unknown:
Yes. Bonobo leverages Graphviz to generate a simple graph image for that. It's pretty basic. But,
[00:42:23] Unknown:
yeah, you can visualize what you build so you can understand graphically what the inputs and outputs look like in your graph. Yeah. Anytime you get beyond a, you know, three or four node graph, I think that would be very useful, particularly for somebody coming in to see what you're doing in your code, not having to necessarily delve through all the different lines just to see at a high level, okay, these are the different inputs and outputs, and this is the end result.
[00:42:48] Unknown:
Yes. And this is now used as a default visualization in the Jupyter integration too, meaning that if you create a graph instance in a Jupyter Notebook and you have Graphviz installed, when you just display it, just type graph, you will see the actual Graphviz image of your graph. So it's pretty easy, when you're using a Jupyter Notebook to work on graphs, to really understand in real time what you're doing, because you can just visualize it
[00:43:15] Unknown:
really easily. And what are some of the most interesting or creative uses of Bonobo that you've seen?
[00:43:22] Unknown:
So on my side, in our job here, we are doing a lot of Selenium-based web scraping of JavaScript-intensive websites, like things built with AngularJS or React, which cannot really be scraped with things like Scrapy or just requests. So we're basically having a browser, a Firefox browser run with Selenium, that yields things from different pages, and then we have all this graph that takes this data and interprets it for different usages. I've seen some people use it for financial stuff and some cryptocurrency-related stuff, to just get things from APIs and do things in their systems. I've seen some replacements of solutions that were previously based on IFTTT or Zapier, which are really small-scale solutions, like there are triggers and actions.
It's pretty easy to replace some cloud-based solutions like this with Bonobo. But, yeah, the creativity award, the most creative use award, definitely goes to some people who told me they are stress testing an FPGA-based database system. FPGAs are, like, programmable processors. Honestly, I don't even know how Bonobo can be of any use there, but they just told me they used it for that, and they sounded quite happy about it. I can't say how you could use it for that because I don't understand what they're doing, but that's probably just the most creative thing I heard about it. Yeah, that's very cool. I think that they could potentially be using it for being able to orchestrate large batches of reads and writes as a way to try and determine what some of the shortcomings are as far as handling the IO of multiple different tasks
[00:45:10] Unknown:
running concurrently. So that is definitely very interesting and cool. And you've talked a bit about some of the plans that you have in the near future for Bonobo, but I'm wondering if you can just talk a bit more about what you envision Bonobo becoming in the future and some of the different features that you have in mind for future versions?
[00:45:30] Unknown:
Yeah. So, for the quite near future, like the next two versions, which will be 0.7 and 0.8, probably the lesson learned from the last release is that I really need to calm down a bit and try to simplify things. So I expect the 0.7 version to be not really an exciting version, mostly trying to simplify everything, test everything more, stabilize, to try to get the foundation more solid. That's not the most exciting thing, but it's really necessary here. But once that is done, I have plans for the 0.8, which is: as we are using functions in the graph, it's pretty easy to decorate functions, so I want to provide tools within Bonobo to enhance node behaviors. I will give some examples because it's quite abstract, and a lot of users told me this is pretty much what they expect from an ETL framework. For example, one behavior we could add to pretty much any node would be adding a cache. So you have some key-matching algorithm that says, okay, if this key is already there, then I do nothing.
If the key doesn't exist, I will do the transformation and keep, for example, the result for 24 hours. You can think about rate-limiting nodes. So if you're trying to use an API that only allows one call per minute, you could just say, okay, this node can only run once per minute, and whatever it has on the input queue just waits, and it runs once every minute. You can implement retry strategies on nodes. So for example, you can say, okay, this thing can fail; try it, like, 10 times with a 60 second delay between calls. If after 10 calls it didn't do anything other than raise an exception, just give up, but, yeah, try it ten times. There is also something I briefly introduced before, by talking about the asyncio thing, but we could introduce non-FIFO parallel execution for some nodes. For example, if you have an HTTP API for which you don't really care about the first-in-first-out property, you can say, okay, this one will make four calls at a time. If it's, let's say, some geocoding API that just takes an address and gives you coordinates, and you don't really care about the order of the output, you could just hammer this four at a time and, if it's the bottleneck, basically speed up your transformation by a factor of four. I'm thinking also about tools to introduce step debugging, probably based on the Jupyter integration, where you could have interactive widgets allowing the user to manually send one row of data at a time to help them develop and enhance their nodes.
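(These behaviors are described as future plans, so the snippet below is purely a hypothetical plain-Python sketch of the retry idea, not Bonobo API: a decorator that re-runs a failing node a fixed number of times with a delay between attempts, then gives up.)

```python
import functools
import time

def retry(attempts=10, delay=60):
    """Hypothetical retry behavior: re-run a failing callable up to `attempts`
    times, sleeping `delay` seconds between tries, then re-raise."""
    def decorator(node):
        @functools.wraps(node)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return node(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise            # give up after the last attempt
                    time.sleep(delay)
        return wrapper
    return decorator

@retry(attempts=10, delay=60)
def call_flaky_api(row):
    ...  # any regular callable used as a graph node could be wrapped this way
```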
And something which is not really decorator-pattern based, but that I'd really like to have, is to introduce some kind of recursive pattern. We talked about directed acyclic graphs before, but in some cases it's probably safe to introduce cycles in the ETL. For example, when you're writing web crawlers, it's useful, if you have an exit condition, to have some kind of recursive crawler, because every time you read a page you find links, and you want to read all the links and follow those links up to a certain point. Combined with the retry and rate limit behaviors, that would allow you to write crawlers in really a few lines with really powerful behavior. So those are all small features, but they would add a lot of value for a lot of use cases. Yeah, for the short term, I think those are basically the plans. For the longer term, I'm also looking forward to businesses built around Bonobo and Python ETL in general. I cannot say very much about it yet, and, of course, the framework is and will stay 100% open source, but there are probably a lot of value-added services I could imagine around that. So I'm thinking a lot about this, and that would also be a way to get towards sustainability of the project and maybe a way to finance development of the project. And overall, I'd really like to see Bonobo become some kind of Swiss Army knife for everyday data. It should feel like the natural way to go, and if you are at a terminal, then I'd like Bonobo to come to your mind when you manipulate data. And something that I did not mention before, but which is really important, is that there are a lot of use cases now where you can use Bonobo without even writing one line of Python. So it's not a tool that is reserved for people who know how to write Python. There is a convert command in Bonobo that can take any file format Bonobo understands and convert it into any other file format Bonobo understands. In my job here, there is a Ruby programmer who just uses it to convert files, and he doesn't write Python, but as you can use it from the terminal, he is still able to get some value from Bonobo. So, yeah, I'd really like it to become a tool like this in the future. Are there any other topics that we didn't talk about yet that you think we should discuss before we close out the show? Yeah, there is definitely one topic, which is not really a discussion, but I'd really like to thank the 20 contributors that are in the commit logs of the last release. I mean, Bonobo is around one year old today.
One year ago, there was one contributor, and today there are 20. So probably,
[00:50:52] Unknown:
there would not be a Bonobo if they were not here. So I really need to thank them before we close this. And for anybody who wants to follow the work that you're doing and keep up to date with Bonobo and any other projects that you work on, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. And my pick today is going to be an episode of the Data Skeptic podcast that discusses some of the different finer points of quantum computing, and the guest that he has on for that episode is very knowledgeable in the space and does a great job explaining a lot of the underlying principles of quantum computing and dispelling some of the hype around it, talking about the time horizons on which we can start to expect some of the different capabilities that are talked about. So it's a really interesting and informative episode.
I learned a lot more about quantum computing than I knew beforehand, so I definitely recommend people go take a look at that if that's something you're interested in. And with that, I'll pass it to you, Romain. Do you have any picks for us this week? Yeah. Well, first, I'll definitely check out
[00:51:58] Unknown:
the episode about quantum computing you just mentioned, because that's really an interesting topic. For my picks today, one of them is a bit egoist, which is Medikit. I just wanted to introduce it because it's a project I also initiated, and it's how we manage all the different projects, either closed source or open source, the same way. It's what generates the Makefiles, it's what generates the setup.py, it's what makes the releases on PyPI. So, if you have release engineering to do, just check it out. I also want to talk a bit about Rocker, which is a project by the Grammarly company. It's a piece of software written in Go which helps you build Docker images, but it's way better than just a Dockerfile. It looks similar to a Dockerfile but adds some commands. So, for example, you can generate artifacts and export or import them, you can generate more than one image, you can have variables, you can have mounts. So you can, for example, have the pip cache mounted on the host.
So if you're using Docker images and you want more than what Dockerfiles provide, just check out Rocker, because it's amazing.
[00:53:06] Unknown:
Those are definitely projects that I'll be taking a closer look at myself. So with that, I'd like to thank you very much for taking the time today to join me and discuss the work that you've been doing with Bonobo. It's a very interesting library and one that I've actually started using myself for some of my work. So thank you for that, and I hope you enjoy the rest of your day. Yeah, thank you very much. Thank you for having me. It was really, really great.
Introduction and Sponsor Messages
Interview with Romain Dorgueil
Romain's Introduction to Python
Evolution of Bonobo ETL
Choosing the Name Bonobo
Comparing Bonobo with Luigi and Airflow
Comparing Bonobo with Apache Spark and Beam
Internal Architecture of Bonobo
Challenges in Developing Bonobo
Extensions and Future Plans for Bonobo
Interesting Uses of Bonobo
Future Vision for Bonobo
Closing Remarks and Picks