Summary
Jupyter notebooks are a dominant tool for data scientists, but they lack a number of conveniences for building reusable and maintainable systems. For machine learning projects in particular, there is a need to be able to pivot from exploring a particular dataset or problem to integrating that solution into a larger workflow. Rick Lamers and Yannick Perrenet were tired of struggling with one-off solutions when they created the Orchest platform. In this episode they explain how Orchest allows you to turn your notebooks into executable components that are integrated into a graph of execution for running end-to-end machine learning workflows.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Rick Lamers and Yannick Perrenet about Orchest, a development environment designed for building data science pipelines from notebooks and scripts.
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by giving an overview of what Orchest is and the story behind it?
- Who are the users that you are building Orchest for and what are their biggest challenges?
- What are some examples of the types of tools or workflows that they are using now?
- What are some of the other tools or strategies in the data science ecosystem that Orchest might replace? (e.g. MLFlow, Metaflow, etc.)
- What problems does Orchest solve?
- Can you describe how Orchest is implemented?
- How have the design and goals of the project changed since you first started working on it?
- What is the workflow for someone who is using Orchest?
- What are some of the sharp edges that they might run into?
- What is the deployable unit once a pipeline has been created?
- How do you handle verification and promotion of pipelines across staging and production environments?
- What are the interfaces available for integrating with or extending Orchest?
- How might an organization incorporate a pipeline defined in Orchest with the rest of their data orchestration workflows?
- How are you approaching governance and sustainability of the Orchest project?
- What are the most interesting, innovative, or unexpected ways that you have seen Orchest used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building Orchest?
- When is Orchest the wrong choice?
- What do you have planned for the future of the project and company?
Keep In Touch
- Rick
- ricklamers on GitHub
- @RickLamers on Twitter
- Yannick
- yannickperrenet on GitHub
Picks
- Tobias
- Rick
- Yannick
Links
- Orchest
- Geoffrey Hinton
- Yann LeCun
- CoffeeScript
- Vim
- GAN == Generative Adversarial Network
- Git
- SQL
- BigQuery
- Software Carpentry
- Google Colab
- Airflow
- Kedro
- nbdev
- Papermill
- MLFlow
- Metaflow
- DVC
- Andrew Ng
- Kubeflow
- Lua
- Caddy
- Traefik
- DAG == Directed Acyclic Graph
- Jupyter Enterprise Gateway
- Streamlit
- Kubernetes
- Dagster
- DBT
- GitLab
- Spark
- ETL
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host, as usual, is Tobias Macey. And today, I'm interviewing Rick Lamers and Yannick Perrenet about Orchest, a development environment designed for building data science pipelines from notebooks and scripts. So, Rick, can you start by introducing yourself? Yeah. Thank you for having us. My name is Rick Lamers. And together with Yannick, we started the Orchest project.
[00:01:15] Unknown:
I studied computer science, got obsessed with deep learning. Geoffrey Hinton, Yann LeCun were, like, inventing all these algorithms.
[00:01:22] Unknown:
Yeah. Just really excited to be working on a project that can help data scientists kinda streamline their workflow. I'm very excited to be here. And, Yannick, how about yourself? Yeah. Thanks again also for having us. So a quick introduction on myself. I think it's more interesting, actually, to talk a bit about my interest in Python, since this is a Python podcast. It's good to say that I've been very much interested in the Python ecosystem for quite some time now, which came from the earlier days where I'd been doing a lot of programming on the side: websites, robotics, NXT Mindstorms for the ones who know it. Great to share.
[00:01:56] Unknown:
And going back to you, Rick, do you remember how you first got introduced to Python? Actually, I do. And it's funny because I had prior experience with CoffeeScript.
[00:02:03] Unknown:
And for those who aren't familiar, it's a whitespace sensitive dialect of JavaScript. I felt like, okay, if Python also has this whitespace thing, I'm not gonna like it. And in fact, it was just a really bad implementation on CoffeeScript's part, because in Python it never bothered me. And I remember thinking when I was first trying out Python, this has been created by a Dutch guy, so I kinda owe it to the Dutch, you know, society to try out this language. And I remember
[00:02:29] Unknown:
very vividly that the first time working in it, it felt natural straight off the bat, which I can't say of any other language. So I was kinda hooked right from the get go. Yeah. When I was getting ready for the interview, I noticed that you were both Dutch. So I was gonna ask if that was just sort of a national requirement, that everybody who's from there has to learn Python.
[00:02:48] Unknown:
Yeah. Pretty much, I would say. I think Vim is also written by a Dutch guy. So it's like all this stuff produced by Dutch computer scientists, like the A* algorithm, all of that, required learning.
[00:03:00] Unknown:
And, Yannick, do you remember how you first got introduced to Python? I think the first time I really got started using Python was from a university course. I was studying mathematics at the time. So it started with a bit of Python there, some questions to do computations and stuff, alongside MATLAB. And ever since then, I've been using Python for pretty much
[00:03:18] Unknown:
everything, I'd say. Digging into Orchest now, can you give a bit of an overview about what it is that you've built and some of the story behind it?
[00:03:26] Unknown:
We worked as data scientists ourselves. We were in university, and we had to, like, make some money on the side. So we were doing a lot of projects. Some of them out of curiosity; I was working on some GAN networks, some deep learning techniques, but also some more basic ETL processing. And what we found was that a lot of the people we worked with on these data science projects were not necessarily really good at software engineering, even though they understood concepts like statistics and mathematical models. And a lot of them were kinda coming up with really funky, hacky solutions to a lot of engineering problems.
And we were just like, hey, there's this thing called Git. You can use Git. Right? It's really nice. And we were talking about CI/CD stuff and how they could maybe abstract over some of these lower level infrastructure pieces that you have when you're working in the cloud. And somehow they didn't really embrace those things themselves, maybe because they weren't really coming from a software engineering background. And we just figured, like, we really have a good way of working, because this is becoming a complete mess if you have, like, these huge monolithic notebooks with no structure, no code reuse, no general common sense of engineering practices being applied. And so we figured, what's the best way to go about this? Should we, like, start teaching all of the data scientists in the world about these practices and, like, convert them into software engineers?
Or can we maybe abstract and hide some of these ideas? And it really came from, like, an inspiration source: the fact that SQL is a really powerful DSL that allows you to work with databases in a way where you don't have to understand a lot about what's going on behind the scenes; replication, sharding, and scaling are, like, automatically handled by something like Google BigQuery. And we felt that these kinds of abstractions needed to be built for data scientists in terms of tooling as well. We didn't really go with the DSL approach, but I think we take a lot of inspiration from that, you know, hiding complexity behind abstraction still.
[00:05:16] Unknown:
Your mention of deciding whether or not you wanted to have to teach software engineering best practices to data scientists puts me in mind of the Software Carpentry project, and I'm wondering if you've had any experience with that, or seen some of the ways that they go about teaching those different concepts to people who are coming from more of a statistical background, and any inspiration that you might have drawn in terms of how you've structured the interfaces for Orchest or designed the overall workflow around, you know, gradual introduction of those concepts?
[00:05:51] Unknown:
I think there's still a requirement for data scientists to become better software engineers. It's not like it's either or. They still need to have a bit more abstract thinking or code reuse in mind. But what we try to do is make the defaults, and everything that happens by default behind the scenes, align with those general software engineering practices that are prescribed. So for example, when you create a new project in Orchest, it is automatically created as a Git repository. And when you want to compare two versions, we, under the hood, can use Git to do those diffs.
So we kind of sneak data scientists into becoming better software engineers by encouraging these principles. But I think, especially when they're writing their own Python and R code, there's no real substitute for some general principles for them to apply as well. Quite some time ago, I actually taught a Python course.
[00:06:41] Unknown:
And the first iteration I did was taught to data scientists in companies who were not that familiar with writing code. So first, I figured I'd start with a bit of the Python internals, just the basics, and later go into the PyData stack and just make some visualizations, look at the data. And it actually turned out that starting right off the bat doing data visualization was way easier for people to get started with. So I think this is also important for Orchest, because we want to emphasize that data scientists do not per se need to have data or software engineering skills.
So that's also what we want to make easier with the tool: gaining insight from your data instead of having to do infrastructure and other software engineering tasks.
[00:07:28] Unknown:
Given that you have the experience of working as data scientists, and you're building this product for data scientists to help overcome some of these issues around software engineering best practices, making sure that projects are maintainable over the long run, and that you don't just end up with a monolithic notebook, as you mentioned, that has a working experiment but ends up needing to be thrown away or reengineered to actually go into production. I'm wondering what you see as being some of the biggest challenges that users of Orchest are facing in being able to build these types of projects.
[00:08:04] Unknown:
What we've noticed a lot is that even simple things like having a lot of constants defined in your code makes it harder to run the same notebook for different, you know, data locations or different sorts of hard coded parameters. We've made it a first class feature within Orchest to be able to substitute that information using parameters and environment variables. So that makes pluggability a lot easier when it comes to notebooks and scripts. And then second, I think the concept of pipelines is really central, where it's a much smaller step to be able to split up a monolithic or larger notebook into multiple steps. And typically, you already find that a larger notebook has kind of sections or a table of contents. And so if you break up these notebooks into smaller concerns, you can more easily think about each step on its own. Like, you can reason about the full notebook more easily if the notebooks are smaller.
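To make that idea concrete, here is a minimal sketch in plain Python of what breaking a monolithic notebook into smaller, single-concern steps can look like. This is an illustration of the concept only, not the Orchest API; the function and step names are made up for the example.

```python
# Hypothetical sketch: a monolithic notebook split into small,
# single-concern steps wired together as a tiny linear pipeline.
# (Illustrative only -- not the actual Orchest API.)

def load_data():
    # Stand-in for reading a CSV, querying a database, etc.
    return [3, 1, 2]

def clean_data(rows):
    # Each step does one thing, so it can be reasoned about in isolation.
    return sorted(r for r in rows if r is not None)

def summarize(rows):
    # A final reporting step, easy to swap out or reuse elsewhere.
    return {"count": len(rows), "total": sum(rows)}

def run_pipeline(steps):
    """Run steps in order, feeding each step's output to the next."""
    result = None
    for step in steps:
        result = step(result) if result is not None else step()
    return result

print(run_pipeline([load_data, clean_data, summarize]))
# -> {'count': 3, 'total': 6}
```

Because each function has a single concern, any step can be tested, replaced, or reused on its own, which is the maintainability win being described here.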
And then you can make it easier to see those abstractions that allow you to generalize and reuse a notebook step or script for multiple purposes. So it's about making more modular notebooks, and then having the ability to substitute and abstract out information in terms of parameters and environments already saves, like, a ton in terms of how hard it is to reuse notebooks and scripts. And I think this also translates into
[00:09:24] Unknown:
an effect for collaboration. If you're able to modularize your work better into pipelines, it's also way easier to collaborate on your work versus just sharing your notebook right off the bat or versioning your notebook, which I think many people know is not the best.
[00:09:46] Unknown:
What are some examples of the tools or workflows that they might be using now, and some of the ways that Orchest can either augment or replace some of those utilities?
[00:10:01] Unknown:
So a good example of a tool would be Google Colab, where there's this very simple UI to set up, like, a quick experiment. And a lot of people like the fact that it's so easy to set up and use. And then you get this GUI where you don't have to actually touch, like, a whole bunch of command line tools in order to get going. A contrast to that would be something like Airflow, where there's a little bit more work required from the data scientist to get up and running, or another tool like Kedro, where you have to first get into the documentation and start understanding a bit more about the framework in order to get some benefits from it. But then the issue with something like Google Colab is that you run into this: okay, I have a notebook, but now what? Like, what's the next step that comes after? I think, typically, you see that there's a pretty hard cutoff point today where you're either in a notebook, and that's becoming a bit too wild because it grows in size, or you are in a production context where you have a full system that's fully versioned and has CI/CD.
And, like, the middle ground in between is typically: I take my working experiment or prototype and I hand it off to a software engineer that actually turns it into real, quote, unquote, software. And I think we want to kind of expand the space between that gap, where if you're working inside of a notebook, we want to make it easier to remain in that kind of iterative, experimental workflow, but not necessarily force you to go to frameworks and APIs instead of your more intuitive notebook
[00:11:33] Unknown:
development workflow, if that makes sense. Yeah. And this puts me in mind too of a couple of different tools for notebook oriented workflows. One being nbdev, which is for being able to convert notebook code into a Python library. And the other one is Papermill, which is more along the data science and data engineering use case, where you're able to parameterize these notebooks, which helps work towards these modular notebooks that you were mentioning for being able to build out these pipelines. And I'm wondering how you view Orchest either in relation to or in contrast with some of those other projects.
[00:12:12] Unknown:
nbdev is a really interesting one, because it actually helps with creating a more documented and well tested pipeline in Orchest, too. So if you think about nbdev as a tool that lets you export your notebook code, lets you reuse that code, and then also have automated testing and documentation generation on top of that, that's something that directly integrates with Orchest, because if you combine nbdev with Orchest, you get a pipeline that has those features. People can go into the various steps of the pipeline and observe, like, the reasoning and inline documentation that notebooks offer about, you know, what's happening in that particular step in the pipeline, and you get the added benefit of having a tested pipeline. So we see the ecosystem of notebook tools really providing solutions for some of notebooks' shortcomings. Because we are really focused around notebooks, you can really combine those approaches cleanly and get kind of the best of both worlds. So nbdev, I would say, is a very nicely integrated tool that you can use directly within Orchest. And then Papermill, I think we replace Papermill, because we also take care of the parameterization.
We do it through a simple Python SDK. So with just a framework import, you get access to parameters that can be used within the script. I would say it's a very similar semantic construct of having notebooks that can be parameterized during execution.
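As a rough illustration of that construct, the following is a hypothetical sketch of step parameterization, not the actual Orchest SDK: the helper name `get_params` and the `STEP_PARAMETERS` variable are made up for this example. The idea is simply that parameters arrive from outside the script and override hard-coded defaults.

```python
import json
import os

# Hard-coded constants become overridable defaults.
DEFAULTS = {"data_path": "data/raw.csv", "threshold": 0.5}

def get_params(env_var="STEP_PARAMETERS"):
    """Merge JSON parameters from the environment over the defaults.

    Hypothetical helper for illustration; a real SDK would hide this.
    """
    overrides = json.loads(os.environ.get(env_var, "{}"))
    return {**DEFAULTS, **overrides}

# Same notebook or script, different run configuration -- no code edits.
os.environ["STEP_PARAMETERS"] = '{"threshold": 0.8}'
params = get_params()
print(params["data_path"], params["threshold"])  # data/raw.csv 0.8
```

The same notebook can then be executed with different data locations or thresholds purely by changing the injected parameters, which is the pluggability being described above.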
[00:13:38] Unknown:
And so in addition to some of the specific tools that are oriented around notebooks and just the overall workflows that data scientists might fall into, what are some of the other tools or strategies in the broader data science ecosystem that Orchest might replace? I'm thinking in terms of things like MLflow or Metaflow or DVC for being able to handle some of the overall life cycle and pipelining of the data science process and versioning of the code and data?
[00:14:07] Unknown:
I think the question then also becomes a bit about what your objective is for a particular project. Orchest is very much focused around the phase of a project where there's a lot of uncertainty, because what Orchest is really good at is letting you iterate quickly on ideas. Andrew Ng, in his ML course and a lot of the other interesting work that he's done recently, highlighted in his newsletter the distinction between the phase in which you really wanna get to a lot of insights and answers quickly, and then a next phase where you're really optimizing robustness, reliability, and scaling. And I think it's important to differentiate which tools need to be used in which phase. With Orchest, we're really focused on the first phase, where there's a lot of uncertainty and that's why you need a lot of iteration. But if you really know exactly what you want to build and you just need to, like, really build it reliably and robustly, then I think you start drifting towards tools like Metaflow or something like Kubeflow, where you can really start rigidly defining the full system, have some good understanding about its scaling dynamics, and integrate it into an existing stack. So I think that's, like, one answer. And then there are tools that kind of fall in between, like something like DVC, which you could use in a prototyping context to, like, store your data, but in a production context can also be used as kind of this layer that moves data around.
There, I think you can just mix and match the tools. And that's why we're also focusing on integrating open source tools. So tools that kinda cover both phases, I think, can typically be integrated and used in both phases. Digging more into Orchest itself, can you talk through how it's implemented and some of the ways that the goals and design of the project have evolved since you first began working on it? So, how Orchest is currently implemented:
[00:15:50] Unknown:
From the start, we felt that, yes, data scientists have powerful machines, but they would probably love to run their workloads on the cloud. So that was one aspect, running things on the cloud, as well as being cross platform. We felt for adoption it was important that we did not only say, you can run Orchest, for example, on your Linux machine. We wanted to be cross platform, and so we kind of ended up with a Dockerized microservices implementation of Orchest, where we have different services. For example, the web UI you see and the back end we run are all Dockerized into their own microservices. These different microservices then talk to each other using some Flask REST APIs we've made, and basically everything is implemented, at least in the back end, using Python. And of course the front end uses some React, HTML, and CSS as well.
And when it comes to the changes along the way, the design is basically pretty much the same. Of course, that's also because it's still a young project, and so we're still listening to the community to see what features we should prioritize over other features. I would say that we have started to create
[00:17:07] Unknown:
more delineation around what we are and what we are not doing, because we've gotten requests from people asking, like, hey, can I do this in Orchest? And sometimes we have to say no in order to protect the scope. And I feel like a lot of open source projects are in this position where they're being pulled in various directions and they have to, for the consistency of the product, make some decisions. One particular example in the case of Orchest is about batch processing versus streaming processing. If we had accepted the requests from people who want to implement streaming processing within Orchest as well, in addition to batch processing workloads, we would have had to force a very different way of working onto data scientists, and it would be a lot trickier to integrate notebooks.
So we kind of clearly said, like, okay, we don't wanna go towards streaming workloads. We think tools like Kafka and Flink, tools really focused on this area, will be the best choice. And as a result, if you have batch workloads, you can productionize your pipelines within Orchest. But if you don't, typically it's not really a good fit, and we also don't really do a lot of things to make that easier at this point.
[00:18:10] Unknown:
In terms of the actual ways that it's implemented, you mentioned that it's Dockerized and that it's oriented around these microservices. And I was looking through the repository and saw some of the dividing lines, and there are a few challenges that usually crop up once you start going the microservices route. One being management of deployment, which Docker obviously helps with. The other one is figuring out what are those dividing lines and what are the interfaces that you build for being able to ensure that all of the different components are able to work together while being self contained enough to operate in isolation. And I'm curious what you use as sort of a guiding principle of where to draw those boundaries, and how that has helped you to evolve the system, since you've only been working on it for such a short time?
[00:18:58] Unknown:
It's a really good question, because it's been one of the key engineering challenges. And I think anyone that decides to not build a monolith, but a distributed system with microservices, can relate to this problem, because what we are constantly dealing with is state: where do we keep state, how do we keep state consistent, and how do we handle various failure modes. I think there's really not a one-size-fits-all answer. What has helped for us is to come up with kind of this sentinel microservice that doesn't really exist, but can help us reason about state. So in our case, we're really prioritizing the web based interface right now because we wanna make it, like Google Colab, really easy to use Orchest fully from a browser. But then actually asking, like, okay, what if we had a CLI that talks to Orchest as well, with which you'd be able to do roughly the same things as you can from the web interface? Then you start getting answers about where state should be, because if you want the CLI to be able to perform a certain action, the state cannot really reside in, for example, the running web server. It needs to be in some sort of engine. And so we've also looked at projects like Docker itself, where they have this engine API and various interfaces and SDKs and ways of interacting with the engine or the core of something.
And I feel like this kind of sentinel reasoning is extremely useful in, like, state distribution. And then in terms of error handling, I think you just have to really follow a lot of best practices about transactional behavior in software. If a transaction fails, you have to do some sort of rollback, and we implemented some abstractions. Maybe you've seen it in the code; it was something called a two-phase executor, where we have side effects isolated from transactional behavior. So only if the database transactions succeed do we perform certain side effects that need to happen, like calling out to a different microservice.
And having these kinds of abstractions really made the code a lot easier to reason about, and it's been a lot easier to handle errors and failure modes.
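The two-phase executor idea described here can be sketched as follows. This is an assumed structure for illustration, not Orchest's actual implementation: side effects are registered during the transactional phase and only executed once the transaction succeeds, so a failed transaction never leaves half-performed calls to other services behind.

```python
# Sketch of a "two-phase executor": phase 1 does transactional work and
# registers side effects; phase 2 runs the side effects only on commit.
# (Assumed structure for illustration, not Orchest's actual code.)

class TwoPhaseExecutor:
    def __init__(self):
        self._side_effects = []

    def defer(self, fn, *args):
        """Register a side effect (e.g. a call to another microservice)."""
        self._side_effects.append((fn, args))

    def run(self, transaction):
        try:
            result = transaction(self)       # phase 1: transactional work
        except Exception:
            self._side_effects.clear()       # rollback: drop pending effects
            raise
        for fn, args in self._side_effects:  # phase 2: committed side effects
            fn(*args)
        return result

calls = []

def create_job(ex):
    # Database writes would happen here; on success, notify another service.
    ex.defer(calls.append, "notify-scheduler")
    return "job-1"

ex = TwoPhaseExecutor()
print(ex.run(create_job), calls)  # job-1 ['notify-scheduler']
```

If `create_job` raised before returning, the deferred notification would be discarded along with the rolled-back transaction, which is the isolation being described.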
[00:20:52] Unknown:
Another interesting challenge that comes out of the entire system being Dockerized and web oriented is how you handle the local developer experience, where if you're running in a cloud or hosted environment, it's obviously easy to set up a web server and handle all the proxying and routing. But when you're in a local environment, how do you make sure that it's easy for somebody to just download the project and get started, and have all the necessary services and web interfaces available for building their project?
[00:21:25] Unknown:
What we've actually noticed is that as long as you provide a simple, single entry point, where you just have to run a single command to get it up and running, then it works relatively fine. Because right now, the only requirement to run Orchest is you install Docker, you clone the repo, and you run one command. That's been working quite well for us, because we've tested it in very, very different local setups. It always kinda works because of how Docker abstracts over a lot of these local development issues. And I would say that the Nginx proxy that we use to handle all of the internal communication between all the microservices is really made possible by something that's a bit more recent, which is scripted Nginx programming, like these config files that contain Lua code. We've been able to set up clever routing of microservices based communication all within a single Nginx Docker container.
And it just never let us down once. So we'd really recommend this Nginx proxy. It's really simple, I think much simpler than, like, heavyweight Kubernetes native solutions for proxying that you also see being used. Exactly. I think for most of the communication, we can just basically say, Docker,
[00:22:40] Unknown:
here you go. And until now, everything seems to work pretty well, actually. Credits to Docker. Yeah.
[00:22:47] Unknown:
And this might be a little too far down the rabbit hole, but on the proxy layer, have you also looked at things like Caddy or Traefik for their cloud native capabilities, being able to generate the config from their internal API and using just a native JSON config as well, at least in the case of Caddy? I think Traefik was the one that was the biggest contender to Nginx.
[00:23:10] Unknown:
At the end, we thought Nginx is very simple, robust, and proven, and it worked. And in general, we like to choose technologies in our stack that have proven themselves over time. As a startup with an open source project, you really have to think about what solves the problem instead of chasing the fancy or chasing the shiny. And so for us, whenever there's a solution that really works and is proven, we tend to opt for that one. Boring technology is a good thing when it's what's running your business. Yes. It's also for our users, right? We wanna make sure we can focus on the right value adding features instead of, like, messing with Traefik all day. For somebody who is using Orchest, can you talk through the workflow?
[00:23:54] Unknown:
We've talked about how you get set up with it. But once it's running, either locally or in a hosted context, what is the workflow for somebody who's using Orchest to actually build a project, and what are the ways that Orchest helps to guide them through structuring it in a way that is able to help them scale and maintain it? I think this is one of the most interesting points of differentiation
[00:24:17] Unknown:
of Orchest versus other pipelining tools, because we allow you to very subtly transition your existing workflow into one that's based around pipelines and abstractions. So a typical workflow would be: you have an existing project, you put it somewhere, even on your own disk, or maybe it's in a Git repo. You just take that project and open it in Orchest, and you basically get access to all of the files, all of the scripts, and then you can slowly, incrementally start building a pipeline. So the first thing you could do is, you probably have an entry point, and you just create a single step in your pipeline, which is the execution of your entry point file, so a Python file or notebook, and it can even be a bash file if you wanna, like, integrate some other languages. And so then you can slowly start building out a pipeline that performs various tasks. So you could even have it as kind of a script runner initially, and then actually start taking advantage of some of the data passing capabilities in Orchest that let you abstract over the data passing. You can start parameterizing out some things using either environment variables or the Python SDK for passing parameters. So it's this very, like, creeping process that we've noticed works extremely well, because you don't have to buy into Orchest that much at first to kind of dip your toes in. And that's a real big advantage compared to, let's say, a framework where you really have to start integrating the API all over your code right from the get go.
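The data-passing capability mentioned here can be sketched roughly like this. These are hypothetical helpers for illustration, not the actual Orchest SDK, which abstracts this away: each step serializes its output to a shared location, and downstream steps read it back, so steps exchange files rather than sharing in-memory state and can live in completely separate scripts or notebooks.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sketch of inter-step data passing (not the Orchest SDK):
# a shared directory stands in for whatever storage a real system uses.
DATA_DIR = Path(tempfile.mkdtemp())

def output(name, data):
    """Write a step's named output for downstream steps to pick up."""
    (DATA_DIR / f"{name}.json").write_text(json.dumps(data))

def get_input(name):
    """Read a named output produced by an upstream step."""
    return json.loads((DATA_DIR / f"{name}.json").read_text())

# Step 1: produce an output.
output("features", {"rows": 100, "columns": ["age", "income"]})

# Step 2: consume it, potentially in a completely separate script
# or notebook that knows nothing about step 1's internals.
features = get_input("features")
print(features["rows"])  # 100
```

Because the steps only agree on the output name, either side can be rewritten or replaced independently, which is what makes the incremental adoption path described above possible.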
[00:25:40] Unknown:
For people who are using Orchest, they've started to onboard their workflow. They're starting to maybe dig into some of the various features and capabilities of the platform. What are some of the sharp edges that they should watch out for, or some of the potential points of confusion that they might come up against, as they're onboarding into this new way of working?
[00:26:01] Unknown:
What we've noticed is that, because of the integrations we have, for example with the Jupyter Enterprise Gateway, we are dependent on something like the Jupyter Docker Stacks images, because the platform kind of depends on them to function. So it can be a bit tricky, because when you just start out with Orchest, you have to download these huge images, which is an unfamiliar way of using an application. So while something like Streamlit would just be pip install streamlit, it's more like you're pulling some sort of cloud-built application onto your local machine, and that can be a bit weird at first. Like, it feels a bit strange, because that's not typically how you use your dev tools on your machine.
But once you've used it for a while, you can start to see the advantages, because it can actually do a lot more. For example, if you're using the environments in Orchest, you can actually fine-tune Docker images just by running some bash commands. So we allow you to run a bash script to fine-tune existing Docker images. So you don't actually have to write or bring any of your own Docker images. You just run a few commands that you would typically run at the top of a notebook, like a pip install. And that allows you to, directly in the browser, without touching a terminal whatsoever, fine-tune these Docker images into static environments that have all the requirements.
So there's no messing around with virtualenvs anymore, and that kind of solves a lot of issues. Another thing I would say is that if you are using Orchest and you want to integrate it with other services that you're running, that can be a bit tricky at the moment, because we don't have a stable API to interface with from a networking perspective. So we don't have, like, this stable API to trigger a pipeline job or anything of the sort. We have seen people use the internal APIs, which are non-stable, so that's a bit hacky, and, yeah, I would consider that to be one of the sharper edges of Orchest at the moment.
[00:27:49] Unknown:
And that brings me to one of my next questions, which is the available interfaces and extension points for being able to integrate Orchest into a broader workflow or a preexisting set of pipelining tools, such as Airflow or Dagster, or being able to hook it into various data sources, such as your application databases or your cloud data warehouse, for pulling in the data that you're running these experiments against.
[00:28:17] Unknown:
The integrations with external data sources are, I think, pretty neatly organized by having just regular networking and environment variables. We've noticed, when people are connecting to, like, a Postgres or some other data store, maybe S3, you typically just use what you would normally do, which is some sort of SDK from a Python script, and you plug in the credentials using our secrets, which go through environment variables. When it comes to integrating with other tools like Airflow or Kubeflow, the full pipeline and all the code and scripts is just a single project.
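A minimal sketch of that credentials pattern: the secrets are injected as environment variables, and the step builds an ordinary connection string for whatever client library you would normally use (psycopg2, SQLAlchemy, boto3, and so on). The variable names, host, and database are invented for illustration.

```python
import os

def postgres_dsn() -> str:
    """Build a Postgres connection string from environment variables
    that the platform injects as secrets."""
    user = os.environ["PG_USER"]          # required secrets
    password = os.environ["PG_PASSWORD"]
    host = os.environ.get("PG_HOST", "localhost")
    db = os.environ.get("PG_DATABASE", "analytics")
    return f"postgresql://{user}:{password}@{host}:5432/{db}"

# e.g. engine = sqlalchemy.create_engine(postgres_dsn())
```

Because nothing here is specific to any one tool, the same script works unchanged on a laptop, in a container, or under a scheduler, as long as the variables are set.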
So it's very simple to just call a bunch of these scripts and notebooks from something like Kubeflow or Airflow. So you could even have these projects exist simultaneously, where you can work in Orchest and then translate that into an Airflow DAG without breaking the Orchest part. So I think the integrations are basically similar to how you would integrate something like Airflow into an existing workflow. Although, down the line, we're looking to compile an Orchest pipeline to something like Kubeflow, where we can just run it directly on top of a larger cluster. And we would allow you to export it, which is something that's not there at the moment, but the way Orchest was built, that's easily added.
[00:29:35] Unknown:
Another aspect of building these pipelines is how you handle going from your local development environment into a staging or production flow, and what the actual deployable unit is for a project that's built with Orchest.
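As a side note on the integration point above: because the project is just files on disk, an external scheduler (an Airflow task, a cron job) can execute the same scripts directly. This is a hypothetical sketch of such a runner; the script names and the helper functions are invented, not an Orchest or Airflow API.

```python
import subprocess
import sys

def run_step(script: str) -> None:
    """Execute one pipeline step as a plain Python script."""
    result = subprocess.run(
        [sys.executable, script],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"step {script} failed:\n{result.stderr}")

def run_pipeline(steps: list[str]) -> None:
    """Run steps sequentially, stopping at the first failure."""
    for step in steps:
        run_step(step)

# e.g. run_pipeline(["ingest.py", "clean.py", "train.py"])
```

In Airflow, each `run_step` call would typically become its own task (for example via a BashOperator or PythonOperator), but the point stands: the scripts stay usable from Orchest and from the external orchestrator at the same time.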
[00:29:53] Unknown:
The natural flow, I would say, is that if you, for example, have a local-to-cloud development workflow, where you may be iterating on ideas on your own machine, and then there's some sort of cloud environment in which you want to productionize it, you can actually easily have two instances of Orchest, one that's local, one that's remote. And you can just sync up the projects using your regular Git synchronization. Right? You can just git pull the latest from some environment, and then you can productionize something on the Orchest instance that's running in the cloud. So that's one way, if you have local and cloud. And then if you want to differentiate between various environments, that's where environment variables come in. So if you have some sort of modularity in the pipeline itself, where you're either pulling data from a different endpoint or pushing to a different, you know, Lambda service that's taking, let's say, the trained weights of a model that was trained by the pipeline, you could just differentiate using different environment variables. And that's actually very, very simple. You just create another job and change the environment variable to, let's say, "production", and it takes care of the differentiation that way. It's implemented in a very generic way, so we also allow you to move between staging, production, and local development, depending on how you do that, because that's very different for a lot of people.
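That environment-variable switch might look like the following sketch. The variable name, the environment names, and the endpoint URLs are all invented for illustration; the idea is simply that the same pipeline code selects its targets based on one injected value.

```python
import os

# One table of per-environment settings; the pipeline code itself
# never hard-codes an endpoint. URLs are hypothetical.
ENDPOINTS = {
    "development": "http://localhost:8000/models",
    "staging": "https://staging.example.com/models",
    "production": "https://api.example.com/models",
}

def model_endpoint() -> str:
    """Pick the upload target for trained model weights
    based on a single environment variable."""
    env = os.environ.get("DEPLOY_ENV", "development")
    return ENDPOINTS[env]
```

Creating a "production job" then amounts to running the identical pipeline with `DEPLOY_ENV=production` set, which matches the workflow described above.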
[00:31:10] Unknown:
You mentioned that Orchest is an open source project and that you're also building a business around it. And that brings up the question of governance and ongoing sustainability of the open source aspects of it. I'm curious how you're handling some of those questions, and your overall thoughts on community growth and management, and how to ensure that even if the business doesn't succeed in the long run, people who have oriented their workloads around Orchest are still able to use it into the future.
[00:31:42] Unknown:
Yeah. I think that's really the power of an open source project: if the company does not end up surviving or sticking around for long enough, you as a user of that technology are not necessarily negatively impacted by that. Like, in the long term, obviously, the development of the project may stall, so that's not good. But at least you can rely on the things you've built. And so Orchest is built as an open core company, similar to how MongoDB or Elastic function. So we expect a lot of the code contributions to come from people employed by Orchest the company, and that's probably going to be the case for the full duration. But then there's this open core that's licensed under the AGPL 3.0 license that will be there indefinitely. So that's something that's part of the community and can never be retracted. So in a way, that guarantees some sort of stability.
And then when it comes to governance, I think we welcome all community contributions, and we want people to get involved if they're interested. And if you compare and contrast this to something like Kubernetes, where it's really kind of an alliance or consortium driving an open source project, I think it's good to differentiate between those types of projects and projects like Orchest. You look at a project like Dagster, where there's a company behind it. I think another example would be dbt from the Fishtown Analytics folks. I think it's very different. Right? If there's a single company mainly driving a project, then, realistically speaking, much of the decision making will be influenced by the core team working on the project.
And sustainability is really guaranteed just by the business model of the company backing it. Right? And so in the case of MongoDB or Elastic, how are they able to afford all the engineering hours required to build out the product? It's because they built, next to the open core, some sort of paid or enterprise version, which typically handles organizational complexity, something less of interest to the individual user, like a technical data scientist. And those features will incentivize companies to get a paid version. And in turn, that will create a healthy company behind the project that can sustain the open core for the long term. Even companies like GitLab have shown that this can really work to the benefit of an open source community, because they're sustaining the people behind the project.
[00:33:59] Unknown:
For people who are using Orchest, or for some of the ways that you're using it yourselves, I'm curious what you have seen as some of the most interesting or innovative or unexpected ways that it's being employed.
[00:34:09] Unknown:
A cool example that came up in our Slack community was a Chinese gaming company, a relatively large company, that is building their Spark ETL workloads using Orchest. I just find it extremely cool to know that there's this project that we started 12 months ago, and now there are games in China presumably being improved because the team gets to streamline their workflow using something like Orchest. And I think the cool thing is that if you think about data engineering and, like, Airflow-based data engineering workloads or workflows, it can feel a bit static. And I think that there's room for ideation and experimentation during the actual development.
You can get to a better place much sooner. You can discover issues much sooner. What we were actually surprised by, but found out later, was this: we knew that data processing and wrangling is a huge part of a lot of data workflows, because you typically have to, like, mangle and wrangle the data to get it into shape before you can do the actual thing. And what we've noticed is that if that can be interactive, where you can visualize the values or the distributions of columns, and if you can do that iteratively, it's a lot less painful, because you don't have to kick off a job, wait, look at the output, say, oh, this data actually contained some null values, and then rinse and repeat that tedious process. With Orchest, you can kind of inspect, using the interactivity, between each step in the data pipeline.
And so we've noticed that it's very easy to pinpoint where it goes wrong and kind of spot those issues before they require, like, a full run of the whole pipeline. So, like, partial running of a subset of the DAG is really kind of baked into Orchest. We didn't expect that, because we designed Orchest for data scientists: ML modeling, predictive modeling, and inference. But actually, for a lot of data ETL workloads, it turns out to work really, really well. So that was kind of unexpected.
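Partial runs of a DAG boil down to executing a chosen step plus its transitive dependencies, skipping the rest of the pipeline. This is a toy illustration of that idea in plain Python, not Orchest's implementation; the step names are invented.

```python
def ancestors(dag: dict[str, list[str]], target: str) -> list[str]:
    """Return the target step plus everything it depends on,
    in a valid execution order (depth-first post-order)."""
    order: list[str] = []
    seen: set[str] = set()

    def visit(node: str) -> None:
        if node in seen:
            return
        seen.add(node)
        for dep in dag.get(node, []):  # run dependencies first
            visit(dep)
        order.append(node)

    visit(target)
    return order

# Each step maps to the steps it depends on.
dag = {"clean": ["load"], "train": ["clean"], "report": ["train", "clean"]}
print(ancestors(dag, "train"))  # → ['load', 'clean', 'train']; "report" is skipped
```

Re-running only this subset is what makes the iterate-inspect-fix loop described above cheap compared to a full pipeline run.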
[00:36:01] Unknown:
In your experience of building the Orchest project and building a business around it, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:36:10] Unknown:
I think the difference between software engineering in theory and in practice. In theory, everyone says, write tests, write tests. We'd love to write some tests for Orchest, but everything is Dockerized, it's in microservices, and the tests should also run on GitHub Actions instead of only locally. Then we basically have to port it in its entirety to a cloud VM, which is then shut down and turned on again, and so we actually lose everything that was pulled into that instance. So doing full integration testing is not as easy as one might think. To me, that's been the biggest challenge: theory and practice are very different.
[00:36:53] Unknown:
You mentioned this a little bit earlier when you were talking about batch versus streaming, but what are some of the other cases where Orchest is the wrong choice, and somebody might be better served with a different pipelining tool, or something like Papermill for parameterizing and breaking up notebooks?
[00:37:08] Unknown:
I think it depends a bit on the requirements of what you're integrating. Let's say you have a very well-defined existing stack, and you really want to productionize something that's going to be part of that existing setup. I think then it makes sense to use more vertical, library-style tools that you can integrate in a piece-by-piece fashion, where you pick and choose and kind of build your own stack. With Orchest, we try to be a bit more end-to-end, and that's really nice for data scientists who kind of want this 80/20 scenario, where, like, 80% of what people work on is about 20% of the things that they could potentially do. And we wanna make sure that that 20% is very, very easy, very simple; you don't have to think about anything. But if you're in more of a custom, edge case scenario, where you're doing, like, complex edge-based inference, where you need to deploy to kind of a distributed, I don't know, Cloudflare-based worker scenario.
I think the more custom it gets, the less likely you'll benefit from Orchest. And then, in addition, I think you have to be a bit of a fan of notebooks. Like, if notebooks don't appeal to you, if you just wanna have code only, I think Orchest is also not a great fit, because a lot of the benefits we provide come when you use Orchest in conjunction with notebooks. And I think that's pretty much it. Like, if you really like notebooks and you can live with batch processing, I think it's very hard to find something that's easier to use than Orchest.
[00:38:38] Unknown:
As you continue to build on and iterate on the Orchest project and continue to build out the business, what are some of the things that you have planned for the near to medium term?
[00:38:47] Unknown:
We actually want to make it even easier to use Orchest by offering a cloud-hosted version. We've just focused on the open source right now, and people can run it locally. But we've noticed that a lot of people, when they actually try the project, are really surprised by the simplicity and kind of how it ties together many concepts that they might be familiar with, like the DAG concept or notebooks. And so, in order to get that kind of click moment, where they really get, like, oh, I can then do this, this, and this from within one simple environment, without them having to make the up-front investment of downloading the project from GitHub themselves and setting up Docker on their machine, even though they might not be convinced of the advantages yet, we really want to make sure that we have this hosted version. So that's something that's coming up. And we're still, like, a super tiny team of three people at the moment. It's just Yannick, me, and Jacopo, who is a fantastic engineer that joined us a few months ago. And so we want to also build out the team and get more contributors involved. So we're working with educational institutions and programs, people working in data science, because we think we're especially useful in the beginner context, where a lot of people are still kind of figuring things out. And then, if you can offer an environment that doesn't overwhelm as much, that can be hugely beneficial.
[00:40:04] Unknown:
Are there any other aspects of the Orchest project or the overall space of notebooks and data science and pipelining that we didn't discuss yet that you'd like to cover before we close out the show? I think the stigma of notebooks is good to cover. I think a lot of things have been said about the advantages and disadvantages of notebooks.
[00:40:22] Unknown:
And I think rather than kind of attacking the shortcomings of notebooks (yes, the encoding as JSON makes it hard to version them), we should try to see the merits and address the shortcomings. And I think someone like Jeremy Howard, who does actual work investing in tools like nbdev, is coming up with solutions that make Jupyter notebooks better. Just saying Jupyter notebooks have shortcomings, and therefore we should not use them, I think is a very short-sighted way of looking at it. So I would encourage the community to think about why data scientists love notebooks so much, what's so great about them, and how we can make them even better. I think with that attitude, we can provide better tools and have better workflows, because there's a reason why a lot of data scientists cling to their notebooks: they really get a lot of benefits. It's not that they just wanna say, I like notebooks; it really provides them value.
So I applaud all the efforts that make them easier to use. In our case, we make it easier to embed them in a pipeline, and nbdev makes it easier to make them testable and use them as modules, right, to let you import code from notebooks. I think these efforts are great, and the knee-jerk reaction of "notebooks are not great, notebooks have shortcomings", I think, is a bit harmful to the ecosystem.
[00:41:38] Unknown:
Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. And this week, I'm going to choose getting some freshly baked bagels because they're delicious, and I recently found a nice place that's actually pretty close to where I am now. So I was pretty excited about that. So, you know, just the opportunity to have a well baked bagel in the morning, maybe with some smoked salmon and cream cheese or whatever other toppings you enjoy, definitely worth a shot. And they're sort of a universal food that you can find pretty much anywhere. So with that, I'll pass it to you, Rick. Do you have any picks this week?
[00:42:14] Unknown:
One recommendation I would have for people is to check out a library called Vaex. Not sure if people are already familiar with it, but it's a library that intends to improve on the performance of pandas using out-of-core processing. And I think it's a really interesting and maybe under-appreciated library. I think it's really high quality, and I would recommend people check it out. V-a-e-x.
[00:42:38] Unknown:
And, Yannick, how about yourself? So my pick is also gonna be bug codes. So, actually, this week, I was toying a bit with my Python environment. It's like, yeah, how am I going to manage different versions? What if I start a new Python project? I want to get up and running quickly. So then I ran into a project called cookiecutter,
[00:42:56] Unknown:
and I am using cookiecutter together with pyenv and Poetry now to set up my Python environments, and I'm liking it very much so far. Yeah. Project templating tools are definitely a good thing to invest in. Another interesting one that I found recently and had on the show was the Copier project. I've actually started using that one myself for some of my projects. Well, thank you both for taking the time today to join me and share the work that you're doing on Orchest. It's definitely an interesting project and one that solves a growing need in the overall community, because managing data science projects is something that is still an unsolved problem in a lot of ways. So I definitely appreciate the time and energy you've put into that, and I hope you have a good rest of your day. Same to you. Thank you very much, and have a good day.
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Overview
Meet Rick Lamers and Yannick Perrenet
The Genesis of Orchest
Teaching Software Engineering to Data Scientists
Challenges in Data Science Projects
Orchest vs. Other Tools
Implementation and Evolution of Orchest
Microservices and State Management
User Workflow with Orchest
Integration with Other Tools
Open Source Governance and Sustainability
Innovative Uses of Orchest
Lessons Learned in Building Orchest
When Orchest is Not the Right Choice
Future Plans for Orchest
The Stigma of Notebooks
Closing Remarks and Picks