Summary
The notebook format exemplified by the IPython/Jupyter project has gained popularity among data scientists. While the existing formats have proven their value, they still suffer from difficulties with collaboration and maintainability. Scott Ernst created the Cauldron notebook to be testable, production ready, and friendly to version control. This week we explore the capabilities, use cases, and architecture of Cauldron and how you can start using it today!
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- I would like to thank everyone who supports us on Patreon. Your contributions help to make the show sustainable.
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at www.podastinit.com/linode?utm_source=rss&utm_medium=rss and get a $20 credit to try out their fast and reliable Linux virtual servers for running your awesome app.
- Visit the site to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media.
- Your host as usual is Tobias Macey and today I’m interviewing Scott Ernst about Cauldron, a new notebook format built with software engineering best practices in mind.
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by explaining what Cauldron is and what problem you were trying to solve when you created it?
- In the documentation it mentions that you can use any editor for creating the content of the notebook. Can you describe a typical workflow of authoring the various files and cells and viewing the output?
- How does Cauldron compare to the Jupyter notebook format and what factors would lead someone to choose one over the other?
- Does Cauldron support running languages other than Python? If not then what would be involved in adding that capability?
- Cauldron notebooks support unit tests of individual cells. How does that process work and what are the limitations?
- The option for running the notebook in the context of a task workflow tool appears to be a powerful capability. What are some of the considerations that are necessary when writing a notebook to be run in that manner?
- What are some of the most interesting or unexpected projects that you have seen people using Cauldron for?
- What do you have planned for the future of Cauldron?
Keep In Touch
Picks
- Tobias
- Scott
Links
- When I Work
- IPython Interview
- Spark
- R2Py
- Bokeh
- Luigi
- Airflow
- Digital Paleontology
- A16 Project
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. When you're ready to launch your next project, you'll need somewhere to deploy it. So you should check out Linode at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for running your app or trying out something you hear about on the show. You can visit our site at www.podcastinit.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. Your host as usual is Tobias Macey. And today, I'm interviewing Scott Ernst about Cauldron, a new notebook format built with software engineering best practices in mind. So, Scott, could you please introduce yourself? Yeah. Hi. I'm Scott. I'm currently a data architect and data scientist at a startup called When I Work. My background is I got a PhD in computational physics
[00:01:11] Unknown:
and got into the startup scene after that. I've done some other things, but I'm kinda in the data science, data engineering space.
[00:01:17] Unknown:
And do you remember how you first got introduced to Python?
[00:01:20] Unknown:
Yeah. It's been a long time. So it was about a decade ago. I was working at a startup. I was moonlighting there while I was in grad school, and we were doing some, like, web social media stuff in kind of the mid-to-late 2000s. We were doing PHP because that was kind of the thing that people did at the time. And we wanted to do, like, image processing and some video processing in our app. And PHP was kind of a terrible choice for doing that kind of stuff, and so we started playing around with Python and got really excited about it. And then Django kind of hit the scene not too long after that. And Pyramid actually was the framework that really hooked me. Pyramid plus SQLAlchemy, that combination really, like, grabbed me. And so I kind of ditched PHP for Python and have been using it ever since.
[00:02:10] Unknown:
Yeah. I've heard a lot of really great things about the Pyramid framework, and I keep meaning to find a project that'll let me work with it, but it has not yet come to fruition.
[00:02:18] Unknown:
Yeah. No. It's a really great framework. I like the kind of flexibility to plug in things. I mean, there's a real value to Django and its batteries included, but I've always loved SQLAlchemy as an ORM and that expression language side of it too. So
[00:02:35] Unknown:
Yeah. Absolutely. A lot of my experience has been with Flask and using SQLAlchemy in conjunction with that. So I imagine that Pyramid would be an interesting experiment to work with. From what I understand, it's sort of halfway between Flask and Django in terms of the overall weight and breadth of the framework. Yeah. I think that's a pretty good way to frame it. So you created this project called Cauldron. So I'm wondering if you can start by explaining a bit about what it is and the problem that you were trying to solve when you first created it.
[00:03:06] Unknown:
Yeah. So the way you do data science and data analysis, I mean, there's a lot of coding involved, but it just doesn't have the same kind of process that you'd have when you're trying to build an app. And so the idea of a notebook and kind of notebook style editing, where you can flexibly change things and rerun them in real time as you're going through, has taken off in the data science space. That's kind of a primary tool for that. And so in my history of this, I've been doing a lot of that kind of data science work, but it always frustrated me what I had to give up, in order to use that format, that you would have when you're doing normal app development. Things like version control with separate files for code and being able to separate things out like that easily, being able to develop in the IDE of my choice, and a lot of the processes that go around software development.
And I really wanted a solution where I could have the best of both worlds. And so I just, for a long time, kind of waited for it to happen, and it didn't happen. And so I just got to the point where, like, you know what? I'm gonna build it myself because I just want it. So that was kind of how Cauldron was born. And that's what it is. It is a data analysis environment. So it's an interactive notebook where you can, you know, edit code and run it and keep it going. It's live, so it's keeping the current stuff in memory. And so you're working on that kind of, like, live data and stuff. You don't have to start from scratch and, like, rerun it every single time. But at the same time, you're developing it in a way that feels like you're developing normal software. You've got the files, and you can work in whatever IDE you want, structured in kind of a normal Python development sense. So that's the kind of basics of it.
[00:05:03] Unknown:
Yeah. The ability to work in your own editor is definitely appealing because a lot of my experience of working with Jupyter Notebooks has been through the Emacs IPython Notebook plugin, which lets you actually interact with the IPython/Jupyter server from your Emacs editor. And for the most part, it works pretty well. But recently, with a lot of changes they've been doing, it seems that the plugin author hasn't been able to keep up. So it doesn't seem to work quite reliably anymore. And so trying to do the text editing in the browser is always a little frustrating because my fingers are so used to the key combinations of my editor.
[00:05:38] Unknown:
Yep. Exactly. Yes. You know, everyone's got their editor, and it's kind of their helper in the background. Right? And it responds in the way you're used to, and it helps you remember the things that you normally forget and that kind of stuff. And so, yeah, going into that browser window and just typing in the text box is kind of frustrating. Yeah. So for the Cauldron notebook, what does the workflow look like when you're actually working in your editor and interacting with the notebook format itself as far as being able to make some changes and then view the output? Yeah. So there's two ways you can use Cauldron.
One way is straight off of a command line. So I have a full CLI shell, and so after you've got Cauldron installed, after you've pip installed it, you can just type cauldron, and it will pop up with a shell. And in that case, you have a series of commands you can use, you know, create a new step, rename a step, run a step, run all the steps, that kind of stuff. And there's a show command, and when you run that show command, it will pop up with a browser window for you, and it will show you that kind of notebook display, the current state of what you've done, and that's kind of the low level way to get into it. The other way is that I've built a UI, so it's an actual application that you download and install, and then that becomes kind of the notebook runner and viewer. And so then the workflow in that case is you're adding your steps and setting all that stuff up and actually running the steps inside of that viewer application, which is showing you the display in real time right there. So you don't have to have a separate browser window open to see it. It's actually got the format right there. And then you've got a second window open, which is your editor.
And that's open like you normally would. You've got a kind of standard folder with your files in it. If you add a step in the Cauldron editor, you will see that as a new file in that folder, and then you can open it up and edit it. You know, you can add all the different stuff that you want there, different libraries, that kind of stuff. There's that two window aspect of it. There's the running and the viewing on one side, and that's separate from the editor, which is where you're choosing your own IDE to actually develop in. And how does
[00:07:42] Unknown:
the Cauldron format compare with the capabilities of the Jupyter Notebook? And what are some of the factors that would lead someone to choose one over the other? Yeah. So there's a lot of,
[00:07:53] Unknown:
similarity in terms of, like, what you can do with it. You know, both of them kind of try to make the best of Python available as much as possible. I think the real key differences between them, like, one is definitely the code editing. Jupyter is really designed to be inside of that notebook, and you're editing in that space. Whereas this has been designed from, you know, the ground up to use a separate editor. I think another one is the source files. Right? So when Jupyter was created, and the IPython notebook predecessor was created, people were sharing code through email. That was, like, a common thing you did. And so this idea of, like, well, let's put it all into one file, it was a really great idea because you wanted to be able to share stuff easily. Like, hey. I worked on this analysis. Let me email it to you. Right? But today, version control systems are how you share code, not email. And the idea of a version control system is you've got a repository. That's your unit, and it's filled with files, and you can structure them however you want. And so Cauldron's designed around that paradigm where you are breaking things out into those separate files. And so if you send it into a version control system, it's very easy to do code reviews and PRs and all of those kinds of things that you would see in a traditional software engineering environment. And so, to me, that's a key difference between Cauldron and Jupyter: what you want to see in your version control system. Whether you just wanna have that notebook file with all of the code kind of mangled into it or whether you want it to be in those separate files and kind of manage it in a software way. I think another one would be display functionality.
Jupyter's got a lot of kind of basic stuff built in, but they went the plugin route to try to get as many people as possible to develop stuff for them. And so if you want to do some really cool stuff in Jupyter, I think a lot of that requires getting those plugins and adding them and customizing them, which surprisingly few people do. And so my approach with Cauldron was to try to take a little bit more of Django's batteries included style. And so I try to pull in more things that give you that advanced functionality that a lot of people are looking for without them having to go and find the plugins to do it. Another one is, the kind of data science that I've done, a lot of it you want to go into production. And the notebook has traditionally been for, you do your exploratory analysis in it. You prep stuff, and then you take that, and then you convert it into something that you can use in production. And I don't like that translation process. It's a chance for bugs. It's a chance for all kinds of mischief to happen in addition to the extra work that you're doing. And so I wanted something where I was honing from exploration straight into something that, when I was done, I'm ready for production, and I can just go and hit it and go. And so Cauldron's designed really to do that. I guess the final one I would say is being able to share. I work with a lot of people that are kind of on the nontechnical side, and I wanted something that was really easy to just, like, here's my analysis.
Here it is. Right? And so when I was thinking of Cauldron, I was trying to think of how to share it, and I really settled on, I like the idea of the PDF reader. Everyone's kind of got that PDF reader, and anyone can just open a PDF file and view it. And so it's a great format to just send something in. And I wanted an equivalent for Cauldron, but I wanted it to retain all of the interactivity that you would get in the notebook in terms of, like, graphs with interactive overlays and all those great kind of features. And so I built a second app called the Cauldron reader, which does one thing, which is just open these Cauldron files, the CDF files, I call them, and view them. So anyone can just install that. They can go to the website, download it, install it. It's as easy to install as, like, Adobe Reader, and then they can just open those files, and they see exactly what I created for my analysis because it uses the exact same rendering engine. So there's gonna be no cross browser, oh, it's not working in here. None of that, or like a PDF or a PowerPoint format where, like, you've lost all the interactivity or it doesn't fit right. You know what they're viewing is exactly what you created.
[00:12:10] Unknown:
So there are a lot of different questions from there that I wanted to dig into. One of them being in terms of the language support. I know that one of the appeals of the Jupyter project is that you can have different kernels for different languages, and I'm curious if Cauldron has that capability. And if not, is there any plan to add that capability?
[00:12:29] Unknown:
I'm resistant to that at the moment. And the reason for that is I think when you try to generalize it to all languages, you spend a lot of time doing that generalization, and you give up a lot of the optimizations that you have if you focus on a single language. And so, like, Cauldron doesn't even support Python 2. I didn't even wanna try to support the legacy aspect of that. I picked 3.5, and I said, okay. I'm gonna start at 3.5, and I'm gonna support everything going forward. And I did that so that I could take advantage of, like, all of the stuff that Python 3.5 forward has to offer. And it gives me a lot of opportunity to create something that is really coupled to the benefits of Python. And so I'm pushing in that direction right now. And one of the reasons I think more and more I can get away with that, and I think we can, is that a lot of the data stuff that's coming out now is fairly language agnostic. Right? So, like, Spark, for example, has moved off kind of the RDD framework, and they've moved to the data frame style framework. And Spark 2 is just a major shift there. And if you look at the performance stuff, you don't have to go into Scala or Java to kinda get the performance. You can do it in Python. And so the need to be able to switch to, like, Scala or Java to get the performance and then switch back to Python or something like that is just not as important. And then I think there are things like R2Py, where you can actually go in and do R in that way, that are still available in Cauldron because they're available in Python. But to try to get that, like, a first class kernel of its own would just take focus away from the optimizations I wanna do in terms of making it the best possible environment for Python.
[00:14:15] Unknown:
And another thing that you were mentioning is the plugin capabilities of Jupyter. I'm wondering if there is any capability for that in Cauldron, and, again, whether you plan to have any if you don't already support plugins.
[00:14:27] Unknown:
I don't have official plugin support at this time. I'm definitely considering it going forward. I think right now, with the batteries included aspect of it, I'm more excited by the idea of people submitting pull requests with additional functionality. There's a size you get to at which it makes more sense to do plugins because you just don't wanna have everything in there. But Cauldron is still kind of early and growing. And at this point, I think, you know, getting the best ideas in there and available so that everyone has access to them immediately when they download it is, to me, the more exciting avenue. There's so many plugins for Jupyter that are out there that just don't get the usage really that I think they deserve just because there's the lack of discovery and people just not wanting to spend all the effort of getting those plugins installed in their system and making sure that that's consistent across all of the people they're collaborating with. So it's a someday thing. And if it is a someday thing, I wanna do it in a way that makes plugin discovery
[00:15:36] Unknown:
better. But right now, I really want a batteries included approach to it. And then on the display format piece, one of the things that I was wondering about is that if you do send somebody a display format, I'm assuming that all of the data is embedded into that file. But one thing that I'm curious about is whether somebody who does receive that display format and then wants to do some of their own analysis based on what you sent them, is there a way to sort of flip the read only bit and then be able to actually edit the original content or even maybe just extract the data from the display format and then do your own analysis?
[00:16:12] Unknown:
Oh, that's an interesting question. There is no way to kind of flip that back to the editing because the editing isn't a single file. To me, if you wanna share the editing process with someone, the way to do that is through version control, to just say, hey. Here's my repo. Download it, and you can run with it. But in terms of just getting access to some of that data, yeah, I don't have at the moment an easy way to just, like, extract data out of that, like, download this as a CSV or something like that. But that's an interesting thought. I had not considered that. I might think about that going forward.
[00:16:46] Unknown:
Yeah. I can see how that would be potentially useful. For instance, if you share a report with somebody or even if you just publish it online and then somebody retrieves it and then wants to do some of their own additional analysis based off of what you already did, being able to then just extract the values and continue your work might be an interesting way to interact with the overall process.
[00:17:05] Unknown:
Yeah. I agree. That is interesting.
[00:17:07] Unknown:
And while I was going through the documentation, I was noticing that one of the other sort of software engineering capabilities that you built into Cauldron is the idea of unit tests for the individual cells. I'm curious if you can explain what that process looks like and any limitations that there might be in terms of the testability of those individual cells.
[00:17:27] Unknown:
Yeah. So the way it works is, if you're doing unit testing in Python, normally, you're gonna import the unittest library, and then you're gonna create that test case class and then put your tests in there. The only real thing that you have to do differently for Cauldron is that there is a special test case class available inside of the Cauldron library that you need to subclass to create your tests. And what that special class does is, during the setup and the teardown of the tests, it opens the project and sets it to kind of an empty state so that you can interact with Cauldron in the same way you would if you're doing it in the notebook format. And then at the end of that, it tears it down. And so it gives you the opportunity to test it in exactly the same environment that it would run in. But it is a unit test, so it's not running all of the previous steps to get to where you are. So if you run step 4, it's just gonna run step 4 in isolation. So you do still have to think about it from the unit testing standpoint of, I need to get whatever global data would be needed in order for step 4 to run properly there in some kind of mocked way. So I don't really see that as a limitation, just as something that you need to account for. Just in the way you'd write normal testable code, you wanna think about how to write your cells in a way that they're going to be reasonably testable, or easily testable and, you know, not something that's just gonna explode and be really hard for you to test. There are certain capabilities that are different inside of Cauldron because if you have a cell, you can actually define functions or classes or whatever you want inside of that cell.
And so when you're running the unit test, I actually grab all of the local variables inside of that cell and make them available to you. So even if you don't, for example, export a function, you have a function that just lives inside of that cell, when you run a unit test, you can get access to that and run just that function if you wanna test that function in a particular unit test. So there's kind of expanded capabilities in that way, but you can't just import a cell like you would a normal Python module file because of the way it works and the way it functions. And so to get around that, you just kind of have to come at it from a slightly different angle. But all of the same capabilities are there. And does it work with pytest as well? Yes. Yeah. Oh, yeah. So it runs in exactly the standard way. So you can use pytest or nosetests or whatever you want to run that. And, you know, I'll use PyCharm often, and I can run those tests directly through setting up pytest inside of PyCharm and still have that nice UI output that shows me that everything passed or what failed and that kind of stuff. So it's completely compatible with the standard Python testing.
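As a rough illustration of the testing approach described above, here is a minimal sketch that uses only the standard library. It is not Cauldron's actual test class: `IsolatedStepTestCase`, `run_step`, `STEP_4_SOURCE`, and the `shared` stand-in are hypothetical names invented for this example. It shows the three ideas from the conversation: a fresh empty state around each test, mocking the data that earlier steps would have shared, and reaching a function that only lives inside the step.

```python
import unittest
from types import SimpleNamespace

# Hypothetical source of "step 4"; in a real project this would be a file.
STEP_4_SOURCE = """
def normalize(values):
    total = sum(values)
    return [v / total for v in values]

shared.weights = normalize(shared.raw_counts)
"""


class IsolatedStepTestCase(unittest.TestCase):
    """Hypothetical base class: empty shared state around every test."""

    def setUp(self):
        # Stand-in for opening the project in an empty state.
        self.shared = SimpleNamespace()

    def tearDown(self):
        self.shared = None

    def run_step(self, source):
        """Run one step alone and return the names it defined."""
        namespace = {"shared": self.shared}
        exec(source, namespace)
        return namespace


class TestStepFour(IsolatedStepTestCase):
    def test_weights_sum_to_one(self):
        # Earlier steps do not run, so mock the data they would have shared.
        self.shared.raw_counts = [1, 1, 2]
        self.run_step(STEP_4_SOURCE)
        self.assertAlmostEqual(sum(self.shared.weights), 1.0)

    def test_local_function_is_reachable(self):
        self.shared.raw_counts = [1]
        step_locals = self.run_step(STEP_4_SOURCE)
        # `normalize` was never exported, but the test can still call it.
        self.assertEqual(step_locals["normalize"]([2, 2]), [0.5, 0.5])


def run_tests():
    """Run the example tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestStepFour)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the tests are ordinary `unittest.TestCase` subclasses, a runner such as pytest would collect them in the usual way, which matches the compatibility point made in the answer.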
[00:20:20] Unknown:
And another thing that I found interesting while reading through the documentation is the fact that the variables that are assigned within the cells aren't automatically global to the entire notebook and instead need to be explicitly shared between the various cells. Yes. So I'm wondering if you can explain a bit about how that works and what your reasoning was behind creating that particular functionality.
[00:20:43] Unknown:
Yeah. So I've seen a lot of problems with notebooks where someone defines x up here at the top of their notebook, and then later they define x as something else, and then they go back up to rerun a cell in the middle, and they get the wrong output. They don't realize it. They send it to someone. The analysis is all wrong. There's a reason why software engineering has come to the conclusion that global variables are generally bad. And I think right now, the notebook space is learning that lesson the hard way all over again. And so before Cauldron, when I was working in other notebooks, I would do everything I could to try to limit that from happening, but it's so hard when everything just goes straight to the global space. So that was one of the first things when I was speccing out Cauldron: I didn't want that. I wanted to be able to explicitly declare things as global if they're gonna be global and to keep that kind of separated.
So the way that works in Cauldron is that the cauldron library has a shared object that you can attach variables to. So, like, if you define x in the first cell and you wanna share that with others, you would say cauldron.shared.x equals, and that will assign it to that global shared object. And then in a later cell, you can say x equals cauldron.shared.x to pull it out of that global object and then use it. And that gives you, as the author of the analysis, the control of when you want to be assigning stuff and just not having it accidentally get assigned. It gets really important in large, long notebooks where you reuse something that you had defined way up above and you just didn't realize it. It can cause all kinds of problems. I think one thing that's unique to this kind of analysis is, because you can keep doing stuff interactively, this global variable problem actually compounds because a lot of times you'll find when people have made that mistake that the analysis works. They get done in the end. They're happy with it. They shut everything down, and they come back to open the notebook and run it again later, and it doesn't run. They get an error, and then they have to go and figure out why there was an error. And the reason was because they were doing these kind of global variables, and they removed one up above that they had already had assigned down below. And so it works because the state of their environment allowed it, but it wasn't actually a functioning program from beginning to end if you were to rerun it. So I wanted to eliminate that as much as possible.
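The explicit-sharing idea described above can be sketched without the library itself. The following is a simplified stand-in, not Cauldron's implementation: `run_step` and the `STEP_*` strings are hypothetical, and a plain `SimpleNamespace` plays the role of the shared object. Each step runs in its own namespace, so an ordinary assignment stays local unless it is attached to `shared`.

```python
from types import SimpleNamespace


def run_step(source, shared):
    """Execute one step in a fresh namespace; only `shared` is injected."""
    namespace = {"shared": shared}
    exec(source, namespace)
    return namespace


STEP_1 = """
x = 42            # local to this step
shared.x = x      # explicitly published for later steps
"""

STEP_2 = """
x = shared.x      # explicitly pulled back out
doubled = x * 2
"""

STEP_3 = """
result = x        # x was never pulled from shared here
"""

shared = SimpleNamespace()
run_step(STEP_1, shared)
step_2_locals = run_step(STEP_2, shared)
print(step_2_locals["doubled"])

try:
    run_step(STEP_3, shared)
    leaked = True
except NameError:
    # The plain variable from step 1 did not leak into step 3.
    leaked = False
print("x leaked between steps:", leaked)
```

The failure in the third step is the point of the design: a step that silently depends on a name from another step breaks immediately instead of working only because of leftover interactive state.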
[00:23:17] Unknown:
And can you dig a bit into how the Cauldron library itself is built and architected?
[00:23:23] Unknown:
Yeah. So there's the rendering aspect of it, which is basically, there's a top level kind of display package, which is how you access the display. So you would do cauldron.display dot, and then whatever you wanted to render. You can do arbitrary HTML, like cauldron.display.html, and just throw some HTML right onto the notebook display. You could do markdown, and it will render markdown that way. There's tables, you know, like, all the kinds of stuff you would expect to be able to put up on a notebook display is there. I created a bunch of convenience functions for, like, if you wanna use matplotlib or you wanna use Bokeh or something like that, you just do cd.display.bokeh.
And you pass in that Bokeh figure you've created. There's no, like, magic, I need to set this Bokeh up to run inside of Cauldron. You just pass that in to the display function, and the display function takes care of all of that. And I wanted to do it that way because I didn't really wanna be patching other people's libraries or forcing libraries to function other than they normally do. So, for example, in matplotlib, if you are running in Cauldron and you say plt.show(), right, to show a figure, that will run in the way matplotlib normally runs it, where it opens it up in a separate window for display, because I'm not touching the matplotlib library. If you want that to appear in the Cauldron display, you would do cauldron.display.pyplot and then pass in that figure, and it will render it into the notebook for you. So it's not destructive in the way that it's trying to change libraries and keep up with that. But there's a lot of convenience functions like that. So a lot of the Cauldron library itself goes into the rendering aspect of it.
Another piece of it is the command line interface that I told you about. I wanted a really nice kind of easy to use command line interface, something that has, you know, autocomplete support and is broken down so you can run the various commands. And it's easy to repeat the kinds of steps you'd wanna use with the UI. So there's that CLI aspect of it. And then to run it as a kernel, it has a Flask application. And so if you run it in kernel mode, which is what you're doing if you're using the user interface, then that exposes itself as a Flask application that's passing JSON REST API information back and forth to communicate between the kernel and whatever display output there is.
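To make the display idea above more concrete, here is a small sketch of a display object whose methods each append a rendered fragment to the notebook output. It is an assumption-laden stand-in rather than Cauldron's code: the `Display` class, its method names, and its HTML output are invented for illustration, and it only covers raw HTML, plain text, and tables.

```python
from html import escape


class Display:
    """Hypothetical notebook display that collects HTML fragments."""

    def __init__(self):
        self.body = []

    def html(self, markup):
        """Append raw HTML exactly as given."""
        self.body.append(markup)

    def text(self, value):
        """Append plain text as an escaped paragraph."""
        self.body.append(f"<p>{escape(str(value))}</p>")

    def table(self, headers, rows):
        """Append a simple table built from header names and row lists."""
        head = "".join(f"<th>{escape(str(h))}</th>" for h in headers)
        lines = [f"<tr>{head}</tr>"]
        for row in rows:
            cells = "".join(f"<td>{escape(str(c))}</td>" for c in row)
            lines.append(f"<tr>{cells}</tr>")
        self.body.append("<table>" + "".join(lines) + "</table>")

    def render(self):
        """Join everything displayed so far into one document body."""
        return "\n".join(self.body)


display = Display()
display.html("<h1>Weekly report</h1>")
display.text("conversion < 5%")
display.table(["day", "visits"], [["Mon", 10], ["Tue", 12]])
print(display.render())
```

The step code only calls display functions and hands them objects; nothing in the plotting or data libraries is patched, which is the non-destructive property the answer emphasizes.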
[00:26:02] Unknown:
And when you're writing the actual notebook, if I understood correctly from the documentation, each separate cell is contained within a given file, and then you have the metadata file to determine what the ordering of those are? Correct. Yep. As you mentioned earlier, one of the other capabilities that Cauldron has is the option of running the notebook in the context of a task workflow tool such as Luigi or Airflow so that you can productionize your code, which definitely seems like a very powerful and interesting use case for a notebook. So I'm wondering what are some of the considerations that authors should be thinking about when they are writing a notebook to be run in that production context?
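The layout confirmed at the start of this exchange, one file per cell plus a metadata file that fixes the order, can be sketched as follows. This is only an illustration: the file name `project.json`, the `steps` key, and the `run_project` helper are assumptions made for the example and are not taken from the Cauldron documentation.

```python
import json
import tempfile
from pathlib import Path
from types import SimpleNamespace


def run_project(folder):
    """Run each step file in the order listed in the metadata file."""
    folder = Path(folder)
    settings = json.loads((folder / "project.json").read_text())
    shared = SimpleNamespace()
    for name in settings["steps"]:
        source = (folder / name).read_text()
        # Each step gets its own namespace; only `shared` carries over.
        exec(compile(source, name, "exec"), {"shared": shared})
    return shared


with tempfile.TemporaryDirectory() as folder:
    root = Path(folder)
    # The metadata file, not the file names, decides the execution order.
    (root / "project.json").write_text(
        json.dumps({"steps": ["load.py", "total.py"]})
    )
    (root / "load.py").write_text("shared.values = [1, 2, 3]\n")
    (root / "total.py").write_text("shared.total = sum(shared.values)\n")
    shared = run_project(root)

print(shared.total)
```

Keeping every step in its own plain `.py` file is what makes the project friendly to diffs, code review, and any editor, as discussed earlier in the interview.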
[00:26:42] Unknown:
So I think that some of those things that change are, you really wanna think about your notebook in terms of it having inputs and outputs. And if you're working as part of a Luigi or Airflow kinda task data pipeline, then unless it's a self contained piece, it's gonna have to pull those inputs from somewhere, and it's gonna have to deliver those outputs somewhere else. Right? So it's not going to be a completely self contained notebook by itself. It probably has to interact with both the input and the output. And so one thing you wanna think about is, in terms of the input, how you're getting the input into your notebook, and you wanna do it in a way that you can actually develop the notebook and then put it in the pipeline and not have to change code to do that process. I really like the idea with Cauldron of writing something up, putting it into production, pulling it back out, making changes as you improve your models or change the way you wanna do analysis, putting it back in, pulling it out when you need to, and not having to go through the retranslation process.
So one of the ways you can get at the input side of it is that the cd.shared object has a fetch function. So you can do, like, cd.shared.fetch, and you can put a default value in so that when you're, like, editing it yourself and you're running in the notebook, it will use that default value. But when you run it in a pipeline context, you can actually pass that information in during the actual start of the execution. And so it will populate that shared object with that data prior to running the first step. And so you can essentially treat that like function arguments and pass stuff in to change it inside of your pipeline without actually having to change your code at all. So that's one side of it. And then the other side of it is you wanna think about the outputs that you're putting out at the end.
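The fetch-with-default pattern described here can be sketched in plain Python. This is only an illustration under stated assumptions: `SharedStore` and `load_inputs` are hypothetical stand-ins written for this example, not Cauldron's own classes, and the paths are made up.

```python
class SharedStore:
    """Hypothetical stand-in for a notebook's shared variable object."""

    def __init__(self, **initial):
        # A pipeline runner would pass its values in here, before any step runs.
        self._data = dict(initial)

    def fetch(self, key, default=None):
        # Return the stored value, or the default when nothing was passed in.
        return self._data.get(key, default)


def load_inputs(shared):
    """First notebook step: read settings the way function arguments are read."""
    return {
        "input_path": shared.fetch("input_path", "data/sample.csv"),
        "row_limit": shared.fetch("row_limit", 1000),
    }


# Interactive editing: nothing was passed in, so the defaults apply.
interactive = load_inputs(SharedStore())

# Pipeline run: the caller populates the store up front, the step code is unchanged.
pipeline = load_inputs(SharedStore(input_path="s3://example-bucket/full.csv"))
```

The step code is identical in both runs; only the values placed in the store before the first step differ.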
And so one of the things Cauldron does is that, within the library, you can tell whether you're running it in a production environment or in an interactive notebook environment. And so in this case, if you want to turn off certain outputs because you're not in the pipeline, you can have that part only run in production and just kind of have an if statement or a block statement or some way of turning that on and off. And you can use Cauldron to figure out what environment and what context you're running in. And then, say, you're in Airflow or you're in Luigi and you want to deliver something to an Amazon S3 bucket or something like that, which you don't wanna do when you're editing it as a notebook yourself. You just wanna see the results then. You can just have, like, that if in production, write this out. Otherwise, do nothing. Right? And so you can still have all of that functionality in there without having to really change your code to fit the different environments.
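A minimal sketch of that "if in production, write this out, otherwise do nothing" idea follows. The `is_production` flag and the `deliver` callable are assumptions made for this example; in a real notebook the flag would come from whatever Cauldron reports about its running context, and `deliver` would be something like an S3 upload.

```python
def final_step(results, is_production, deliver):
    """Last notebook step: always summarize, only deliver when in production."""
    summary = f"{len(results)} rows processed"
    if is_production:
        # Only the pipeline run pushes data to an external destination.
        deliver(results)
    return summary


# A list stands in for the external destination so the example is self-contained.
delivered = []

# Interactive run: the summary is shown and nothing is delivered.
interactive_summary = final_step([1, 2, 3], False, delivered.append)

# Production run: same code path, but the results are also handed off.
production_summary = final_step([1, 2, 3], True, delivered.append)
```

Keeping the delivery behind one conditional means the same step file can move in and out of the pipeline without edits.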
[00:29:30] Unknown:
And what are some of the most interesting or unexpected uses that you've seen people using Cauldron for?
[00:29:37] Unknown:
So that's an interesting question because, a little story on my background. For the last 5 years until a couple months ago, I was working as a digital paleontologist for the Swiss government. There's a project called the A16 project, and what happened was they were putting in a highway, the A16 highway in western Switzerland, about 15 years ago. And what they uncovered while they were doing that was one of the largest numbers of dinosaur footprints ever found in a single location, and it totaled something around 16,000, all done. And it was exciting because normally with dinosaur footprints, you find, like, one or two, maybe 10. It was really exciting.
So to have so much data to try to figure out what's going on was really exciting. And so they put together this project and this research directive for us to kind of analyze that stuff. And that's the project that I was working on when I created Cauldron. It was that environment where I wanted to be doing all this really interesting analysis stuff, and we were working with very different people all over the world, researchers in many different time zones, trying to collaborate together to figure this stuff out. And that's where the tooling was really falling apart and starting to frustrate me. And so Cauldron came out of that. So it came itself out of what I think was a pretty unusual and interesting use case. But for me, the most rewarding thing has been to see people using it in more just normal use cases. Right? Like, I get excited when, you know, just a data scientist at, like, a Fortune 500 company is like, yeah, I'm using this at work, and it's great. Right? I guess that's where I get really excited because it just shows the breadth of potential uses for it.
[00:31:27] Unknown:
Are there any particular areas that you're looking for help or contribution with the project?
[00:31:31] Unknown:
I have a number of people who are kind of like my beta testers. They're always kinda testing the new stuff that I'm putting out. And then I have a colleague, Dan Mayhew, who is kind of the one who's getting out there and getting it in front of people and, like, testing it and getting feedback. He's a real great kind of UI, UX feedback kinda guy, which I think is really important for a project like this because it's really easy to just develop in a bubble and not really get a sense of, you know, how people are using it and, you know, what ways it can be better.
So he's not contributing to code, but he contributes a lot of fantastic feedback. And so I think in terms of contribution, right now, everything is great. I think there's a lot of really cool stuff that can be done in terms of, like, the rendering side, especially to get kind of the batteries included. Some weaknesses I have right now that I have on my road map and wanna fix: I wanna make some of the more advanced matplotlib stuff work better. And especially, I try to make sure that all of the notebook stuff is fully responsive, so that you're not getting, like, a small plot on a large screen, that it's actually gonna fill the screen. And so that introduces a few challenges in matplotlib, and I definitely could use some help there. But any of the rendering functions, there's so many great things you can do that, I mean, that would be fantastic. But I try to document it really well. It's very well unit tested, and so there's that aspect of documentation. So I hope it's a project that people can open up and start to understand how it works and find a way to contribute if they're interested. I'm certainly happy to have more people help on it.
[00:33:21] Unknown:
And do you have anything in particular planned for the future of Cauldron?
[00:33:28] Unknown:
Yeah. I'm getting close to the next big release, and the key piece for this one is that right now, I think one of the key limitations for Cauldron is that it's designed to work on a local Python kernel running on your machine. With this next release, which is kind of in the beta testing phase at this point, that has been eliminated, and you can run through Docker containers or remote hosts or wherever. And so right now, I'm testing both in a Docker container environment where I'm running the UI through to a Docker container that's got all of the Cauldron and Python stuff set up in there, so I can bring it up and tear it down as I want. But I also am doing a lot of stuff where I'm running the UI locally on my computer, and I have a VM in the cloud where Cauldron's actually running, and I can get access to a lot more horsepower for that. And for the big data use cases, that's really great, doing stuff in the Spark world and that kind of thing. So that's coming soon.
[00:34:33] Unknown:
Yeah. It's definitely useful being able to run your analysis collocated with the data that it's doing the analysis against because data itself has a lot of gravity and having to ship it around various places can be quite cumbersome and expensive.
[00:34:46] Unknown:
Yep. Exactly. Yeah. So stay tuned for that because that's that's the big feature that I'm I'm kind of tying the knot right now, getting ready to go.
[00:34:56] Unknown:
And are there any topics or questions that we didn't cover that you think we should bring up before we start to close out the show? No. I don't think so. I think you've got it pretty well covered. Okay. And with that, I'll bring us to the picks. So for my pick this week, I'm going to choose the Tiffany Aching series of adventures by Terry Pratchett as part of his Discworld books. I've been listening to them with my wife over the past few days, and I had already read them before, but they're still entertaining a second time through. So for anybody who enjoys Terry Pratchett or humorous fiction stories, I definitely recommend giving that a read or a listen. And with that, I'll pass it to you. Do you have any picks for us, Scott?
[00:35:35] Unknown:
Yeah. Mine's a little more techie. So Apache does their big data conference every year, and it's coming up next week. And I know everyone in this audience is probably getting excited about PyCon coming up, but I just wanted to give a shout out to that one as well. And, at least they did last year, they put up all of the keynotes. And so if you go to YouTube, you can check out the keynotes there. And there were some really fantastic talks. And so I'm really excited to see in the coming weeks the keynote stuff from this year's conference.
[00:36:04] Unknown:
Yeah. That definitely sounds interesting. By the time this episode gets released, the conference will have already come and gone, but people can keep an eye out for those recordings to see the interesting bits and then start planning for attending next year. Alright. Well, I appreciate you taking the time out of your evening to share your work with Cauldron. It's definitely a very interesting project and one that I think will probably gain a lot of adoption as more and more people start doing more complex analysis with their data and start to run up against the limitations of some of the existing notebook libraries.
[00:36:32] Unknown:
You know, it's funny. I guess, as, like, a last thought, I look at the data science world right now and kind of where the state of it is, and it reminds me so much of the web development era in the dotcom bubble, where it's so wild, wild west, and, you know, there's new tools coming out, and nobody knows how it functions, and everybody's doing everything differently. And, like, I see Cauldron as, like, let's bring some order to this just a little bit, right? Let's not have to learn the hard way the lessons that web development had to go through, where there's the crash and, like, okay, we need to figure out how to do this stuff reliably and, like, let's introduce all this process and stuff and mature as a discipline. And so I know we're gonna see that in data science, and I'm hoping that Cauldron can be one of the pieces of that puzzle. Well, thank you again, and I hope you enjoy the rest of your evening. You too. Thanks a lot.
Introduction and Guest Introduction
Scott Ernst's Background and Journey with Python
Introduction to Cauldron
Cauldron Workflow and Features
Language Support and Plugin Capabilities
Unit Testing in Cauldron
Productionizing Notebooks with Cauldron
Interesting Use Cases and Contributions
Future Plans for Cauldron
Picks and Closing Remarks