Summary
The experimentation phase of building a machine learning model involves a lot of trial and error. One of the factors limiting how many experiments you can run is the time required to train the model, which can be on the order of days or weeks. To reduce the time required to test different iterations, Rolando Garcia Sanchez created FLOR, a library that automatically checkpoints training epochs and instruments your code so that you can bypass early training cycles when you want to explore a different path in your algorithm. In this episode he explains how the tool speeds up your experimentation phase and how to get started with it.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Rolando Garcia about FLOR, a suite of machine learning tools for hindsight logging that lets you speed up model experimentation by checkpointing training data
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what FLOR is and the story behind it?
- What is the core problem that you are trying to solve for with FLOR?
- What are the fundamental challenges in model training and experimentation that make it necessary?
- How do machine learning researchers and engineers address this problem in the absence of something like FLOR?
- Can you describe how FLOR is implemented?
- What were the core engineering problems that you had to solve for while building it?
- What is the workflow for integrating FLOR into your model development process?
- What information are you capturing in the log structures and epoch checkpoints?
- How does FLOR use that data to prime the model training to a given state when backtracking and trying a different approach?
- How does the presence of FLOR change the costs of ML experimentation and what is the long-range impact of that shift?
- Once a model has been trained and optimized, what is the long-term utility of FLOR?
- What are the opportunities for supporting e.g. Horovod for distributed training of large models or with large datasets?
- What does the maintenance process for research-oriented OSS projects look like?
- What are the most interesting, innovative, or unexpected ways that you have seen FLOR used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on FLOR?
- When is FLOR the wrong choice?
- What do you have planned for the future of FLOR?
Keep In Touch
- rlnsanz on GitHub
- @rogarcia_sanz on Twitter
Picks
- Tobias
- Rolando
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- FLOR
- UC Berkeley
- Joe Hellerstein
- MLOps
- RISE Lab
- AMP Lab
- Clipper Model Serving
- Ground Data Context Service
- Context: The Missing Piece Of The Machine Learning Lifecycle
- Airflow
- Copy on write
- ASTor
- Green Tree Snakes: Python AST Documentation
- MLFlow
- Amazon Sagemaker
- Cloudpickle
- Horovod
- Ray Anyscale
- PyTorch
- Tensorflow
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L I N O D E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Rolando Garcia about Flor, a suite of machine learning tools for hindsight logging that lets you speed up model experimentation by checkpointing training data. So, Rolando, can you start by introducing yourself?
[00:01:11] Unknown:
Of course. Thank you so much, Tobias. So I'm a PhD candidate at UC Berkeley, which means that I'm very close to graduating and getting my PhD. I'm advised by Joe Hellerstein; he's an expert in databases. This will be my fifth year. I came to Berkeley from Arizona State, and I have had the opportunity to collaborate with companies like Amazon, as well as recently working with a startup in a similar space. Throughout my studies at the PhD level at Berkeley, I have focused on the model management life cycle, which recently acquired a more convenient acronym: MLOps.
[00:01:54] Unknown:
Yeah. MLOps is definitely a term that's been gaining a lot of ground recently. Do you remember how you first got introduced to Python?
[00:02:00] Unknown:
When I was learning Python, the other languages that I knew were Java and C#. I came across Python as a consequence of wanting to learn AI. I was trying to teach it to myself. I was a student at Arizona State, and I wanted to learn the latest and greatest AI tools and techniques at the time. What I had access to were online courses, MOOCs from Stanford and Berkeley. I don't know if your audience would be familiar with the Pac-Man project that Berkeley hosts on edX or Coursera. I started participating in that, doing the Pac-Man project activities, and that's how I started to learn Python. This was around 2015.
[00:02:46] Unknown:
That brings us now to the project that you've been working on recently, FLOR, for Fast Low Overhead Recovery. I'm wondering if you could describe a bit about what it is, some of the story behind how it came to be, and why that was a problem worthy of your attention.
[00:03:02] Unknown:
So, Flor, the project here: when I entered my graduate studies in 2017, I joined the RISE Lab, which was the successor to the AMP Lab at UC Berkeley. It was directed by Ion Stoica, Raluca Ada Popa, Joe Hellerstein, and Joey Gonzalez. Jointly, they were experts in systems, security, databases, and ML systems, so it was a really fortunate combination of experts at Berkeley. I came in, and I was advised by Joe Hellerstein and was working very closely with Joey Gonzalez, an ML person. When I came into the lab, there was a project called Clipper, headed by Joey Gonzalez, that was meant for low latency prediction serving.
So the question that Joey put to me during the interview was: assume that ML works and AI is working as it is, so we've made enough progress. How do we get it out there? How do we, like, deploy it and maintain it? Kind of moving beyond the basic questions of model training. So Clipper allowed model developers to deploy their models and serve predictions, kind of in the ML-as-a-service spirit. And the next step was pipeline automation: building pipelines that train models and are then able to automatically deploy and maintain those models, because machine learning models have degrading performance over time. Their quality can degrade.
So they were working on a system called Jarvis that would help model developers train models and then upload or update them as necessary to maintain service level objectives. That was a system being headed at the time that I joined. Another project being developed was called Ground, which was a data context service. The aim of Ground was to manage the lineage, provenance, usage, history, and logs of data. I think we've seen several headline stories about, like, FiveThirtyEight misreporting some result as a consequence of a lab accidentally clobbering a dataset that they didn't really understand where it came from. So when Joe and Joey discussed together what the problems were around model management and also this context management, we found that the problems that we saw with data management were exacerbated with model development and model management. And not long after that, we published a paper on context, the missing piece of the machine learning life cycle.
And so the idea was: can we build a system that allows model developers to train a model, that allows them to flow easily between the development and production environments, and, long after production, allows developers to answer questions about failures that happened months ago, because we know that machine learning has these long failure horizons. So Flor is flower in Spanish; it came about as, like, sprouting from Ground. At first, the data context service was such that model developers had to write a DAG or workflow, kind of like you would in Airflow, for what the execution was. So you had a data preprocessing step, a model training step, a validation step, and so on. And when we met with sponsors at, like, poster sessions, one of the comments that I repeatedly heard was that people were not interested in rewriting existing pipelines, mostly because of trust and reliability.
One thing that a person said that really stood out to me was that they had a model training pipeline written in Perl 7 years ago, and the engineer who had written that pipeline was no longer at the company. And so they said, this is something that runs routinely. We're not interested in rewriting it in Flor. How can you help us? So that's kind of where we started, and then the journey changed pretty drastically from there to answer the question: what's the most tooling or the most support we can provide to a model developer without asking or relying on any kind of cooperation from them?
So this could be because we have a data scientist who's a domain expert and a nontechnical user, or it could be someone who's working at a company against a deadline, and so they don't have time for these, like, best practices. So what does the tooling look like? And one of the first lines of research for Flor involved program instrumentation, to be able to provide this kind of tooling without the user having to have that direct manual control.
[00:07:50] Unknown:
You mentioned that Flor came out of the Ground project in order to help with this question of context, particularly when building and training machine learning models. And I'm wondering if you can just talk to some of the fundamental challenges that people experience when they're going through that model training and experimentation process and some of the utilities that Flor provides to help remediate those issues?
[00:08:14] Unknown:
Yeah. So there are two problems that we focus on with Flor in model training. I guess the more fundamental problem is that engineering best practices are at odds with exploration flexibility and agility. When people are exploring properly, it's almost like playing in a sandbox, and the last thing they wanna do is be very methodical and meticulous about record keeping practices. So I guess the challenge is one of human behavior: how do you ask people to record and keep track of everything that they're trying, so that they can learn from previous mistakes and, you know, build a coherent theory about how machine learning is working, without having this burden slow down their speed of iteration? Because we know from machine learning that the more alternatives you can try in some allotted block of time, the better your performance is going to get. It's a very empirical science.
So that was the first, kind of, fundamental problem. And the more immediate one that was kind of irritating for us in the lab was that there were folks rerunning model training jobs. And in a lab at Berkeley, that could be something that runs overnight or can take as long as a week to train. And they would see that the model wasn't learning anything. It was flatlining, and they wouldn't understand why. And then when they would return and look at their code, they would realize that they were missing a logging statement. Missing a logging statement in training is not like in systems, where you can sometimes just do that cyclic debugging pattern.
It takes very long. So it was one of those things where the execution is long running enough and the data is valuable enough that we need to have some tooling to recover it. In the absence of something like Flor, what are some of the ways that machine learning researchers and engineers might address that problem, and some of the strategies
[00:10:05] Unknown:
that are, you know, maybe Band-Aids on the solution, but are ultimately too cumbersome to be a standard practice?
[00:10:13] Unknown:
The work that we've done in Flor, we evaluated, and we have an accompanying publication in VLDB. Your question reminds me of something a reviewer asked. What we ended up doing in the paper, one of the baselines that we compared against for Flor, was a hand-tuned or expert approach. So everything that Flor does can be done by hand with proper methodology. If you take periodic checkpoints and you are serializing the model parameters and the optimizer parameters, you're probably gonna be fine. The way that you use those to answer questions, I feel like once you understand the post hoc analysis, the hindsight logging way of doing things, you can do yourself. Actually, I think that's infinitely more valuable than having a tool, if we're able to distill what these minimal best practices are that someone could reproduce even without adopting any tool. So people already follow checkpointing practices.
What they do with those checkpoints, I think they can be more creative about. So, usually, the regular checkpointing that people do, they might use it for warm starting model training, or they might use it for fault tolerance. Like, if a model training run dies after 178 epochs of training, they might wanna pick it up from 179 and continue. But with hindsight logging, if you missed a logging statement and you wanna reproduce the entire execution, this is a great opportunity for parallelism. You could have one worker do the job from 0 to 178 and the other worker pick it up from 179 and continue.
So you know that you can use this for data recovery, these kinds of checkpoint-resume patterns, by dispatching parallel workers simultaneously. So that would be one example.
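To make the parallel replay idea concrete, here is a minimal sketch that assumes PyTorch-style checkpoints of the model and optimizer state; the file names, the toy model, and the 300-epoch range are illustrative, not taken from Flor.

```python
# Illustrative sketch (not Flor's API): replay two disjoint epoch ranges in
# parallel from a periodic checkpoint, e.g. to recover a logging statement
# that was never emitted on the first run.
import torch
import torch.nn as nn
import torch.multiprocessing as mp

def make_state():
    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    return model, optimizer

def train_one_epoch(model, optimizer):
    x, y = torch.randn(64, 10), torch.randn(64, 1)
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()

def replay_range(ckpt_path, start_epoch, end_epoch):
    """Restore the state saved at `start_epoch` (if any) and re-run up to `end_epoch`."""
    model, optimizer = make_state()
    if ckpt_path is not None:
        state = torch.load(ckpt_path)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
    for epoch in range(start_epoch, end_epoch):
        loss = train_one_epoch(model, optimizer)
        # The hindsight logging statement that was missing on the first run:
        print(f"epoch={epoch} loss={loss:.4f}")

if __name__ == "__main__":
    # Worker 1 replays epochs 0-178 from scratch; worker 2 resumes from the
    # checkpoint taken after epoch 178 and continues to the end.
    jobs = [
        mp.Process(target=replay_range, args=(None, 0, 179)),
        mp.Process(target=replay_range, args=("ckpt_178.pt", 179, 300)),
    ]
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()
```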
[00:12:04] Unknown:
In terms of the Flor project itself, can you talk through some of the ways that it's implemented and some of the design considerations that you built in to make it low overhead in terms of implementation, but also low overhead in terms of actual execution?
[00:12:20] Unknown:
Flor is entirely in Python. I've actually had the fortune of working with many undergraduate students at Berkeley, some of whom went on to receive their master's degrees working on the system. The first student, Eric Liu, worked on background materialization. And the idea was: how can we have the main thread or the main process focus on model training, and have the other cores, of which there are many in these clusters, help with serializing the data and then writing it to disk, just to further reduce logging overhead? And this actually turned out to be a very Python-specific problem, because anywhere else, you would just solve this with multithreading.
You would solve this problem with multithreading and do that kind of concurrency control. Python has a global interpreter lock, and so that wasn't a real option for us. We then started thinking about, well, do we have multiple processes that communicate with each other and, you know, pass the tensors around? Eric evaluated multiple different alternatives, and the reason why they were eliminated was that serialization was very costly in our setting. Serialization ended up being 4 times more expensive than writing to a file.
And so if we have to serialize something to put it into PyArrow, or if we have to serialize data to put it into a queue that then gets message passed, that's gonna lead to too much overhead. It's almost not worth it. So the way that we ended up solving that was with fork and copy-on-write semantics. This is data that is going to get written once, so fork serves as a kind of one-shot, one-way IPC. We copy the data, and the child process does the serialization, so we're able to get some savings on overhead there. One of the ideas that was explored was writing low-level code in something like C or C++ for this kind of multithreaded background logging. But, again, we saw that it wasn't really necessary with fork, so we were able to just stick with vanilla Python.
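A minimal sketch of the fork-based background materialization idea described here, with the structure assumed for illustration rather than taken from Flor's internals: the parent keeps training while a forked child, which sees a copy-on-write snapshot of process memory, pays the serialization and disk-write cost.

```python
# Fork as one-shot, one-way IPC (Unix only): the child inherits a COW view of
# the parent's state at fork time and does the expensive write, then exits.
import os
import pickle

def background_checkpoint(obj, path):
    """Fork; the child serializes `obj` and exits, the parent returns immediately."""
    pid = os.fork()
    if pid == 0:  # child process
        try:
            with open(path, "wb") as f:
                pickle.dump(obj, f)
        finally:
            os._exit(0)  # never fall back into the parent's code path
    return pid  # parent: continue training without blocking on serialization

if __name__ == "__main__":
    state = {"epoch": 42, "weights": [0.1] * 1_000_000}
    child = background_checkpoint(state, "/tmp/ckpt_42.pkl")
    # ... the parent would keep doing model training work here ...
    os.waitpid(child, 0)  # reap the child at a convenient point
```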
And then for the other pieces, there's adaptive checkpointing, and there's also the instrumentation piece, which rewrites user code to do this automatic checkpointing and automatic resume; for that we use the standard AST library as well as ASTor. And I also wanna give a shout-out to Green Tree Snakes, which is documentation for AST parsing and transformation in Python that I think is very valuable. I heard about it from Jonathan Ragan-Kelley; I think he's now a professor at MIT. That was very early in my graduate career, and it's been one of the more useful documentation sites that I've used.
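For readers unfamiliar with AST rewriting, here is a toy illustration of the kind of source-to-source transformation the standard ast module supports (astor fills the unparser role on older Pythons). The `checkpoint()` hook name is hypothetical, and this is far simpler than Flor's actual transformer.

```python
# Toy source rewriting with the standard library: append a call to a
# hypothetical checkpoint() hook at the end of every for-loop body.
import ast

SRC = """
for epoch in range(num_epochs):
    loss = train(model, optimizer)
"""

class InjectCheckpoint(ast.NodeTransformer):
    def visit_For(self, node):
        self.generic_visit(node)                    # also rewrite nested loops
        hook = ast.parse("checkpoint()").body[0]    # an ast.Expr wrapping the call
        node.body.append(hook)
        return node

tree = InjectCheckpoint().visit(ast.parse(SRC))
ast.fix_missing_locations(tree)
print(ast.unparse(tree))  # astor.to_source(tree) on Pythons older than 3.9
```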
[00:15:03] Unknown:
In terms of the evolution of the project, you mentioned that early on, you had one particular direction and then ended up shifting focus. I'm wondering if you could just talk to some of the initial ideas that you had about the problem space and some of the ways that your thinking on the potential solutions has changed and how that has been reflected in terms of the evolution of the implementation of Flor?
[00:15:26] Unknown:
I think in the very early days, we ended up looking a lot like something like Amazon SageMaker or MLflow, for example, where you can specify pipelines that later execute. So the goal here was to reduce some of the friction between development and training. We were aware that data scientists, you know, who may be economists or physicists, might write some of these models, but then they need to train at scale. Someone has to take them from Python and rewrite them in C++ and then make them capable of distributed training, and there were other issues like that. And so we thought that by making the language match something that was more production oriented from the beginning, we could avoid some of those problems. But that just ended up becoming a barrier to adoption, so we started to move away from asking people, like, well, tell me what the inputs are, what the computation is, what the computation produces, and then where it writes it to. You can see that a lot of that information is already encoded in the program, just from the way that programming languages work. A lot of that structure is already there.
So a function name can tell you, you know, roughly what the function is trying to do. You know how many inputs it has. You can do some kind of duck typing, like, runtime checks to see what data's coming in, do some hash checksums to see how it's being transferred and connected. And so we started kind of moving towards a direction of: what's the most we can infer ourselves without involving that direct intervention from the user?
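A hypothetical sketch of that "infer it from the program" idea, not Flor code: a decorator records a step's name and content hashes of its inputs and outputs at runtime instead of asking the user to declare a pipeline up front.

```python
# Hypothetical illustration: observe pipeline structure via introspection and
# content hashes rather than a user-written DAG.
import functools
import hashlib
import pickle

def fingerprint(value):
    """Best-effort content hash so reuse of the same data can be detected."""
    try:
        return hashlib.sha256(pickle.dumps(value)).hexdigest()[:12]
    except Exception:
        return f"<unhashable {type(value).__name__}>"

def observed(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        record = {
            "step": fn.__name__,                       # names are usually descriptive
            "inputs": [fingerprint(a) for a in args],  # how data flows between steps
        }
        out = fn(*args, **kwargs)
        record["output"] = fingerprint(out)
        print(record)  # in practice this would go to a context/lineage store
        return out
    return wrapper

@observed
def preprocess(raw):
    return [x * 2 for x in raw]

@observed
def train(features):
    return sum(features) / len(features)

train(preprocess([1, 2, 3]))
```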
[00:17:09] Unknown:
As far as the core engineering problems that you had to solve for in Flor, and maybe some of the API design to make it approachable for people who didn't wanna dig into the specifics of what the system is actually doing under the hood: wondering if you can just talk through some of those considerations and some of the challenging elements of actually building this library and making it accessible for people who just wanna get their job done.
[00:17:34] Unknown:
I think the core engineering contributions would be the background materialization and adaptive checkpointing. So the model developer can choose the kind of logging overhead that they're willing to tolerate. The default is at 6%, and then the system automatically throttles the period with which the model is checkpointed to make sure that that target is met. We do that through runtime checks, runtime analysis. And then in addition to that, I think the core of the implementation, the engineering challenge, was instrumenting Python code. People who work with Python and people who work with program analysis know that the same things that make Python so easy to start using off the bat make it very difficult for a static analysis engine. You know? So you have to make some kinds of assumptions.
For Flor, one of the things we wanna do with memoization is, if we're skipping a block of Python code during replay for speedups, we wanna make sure that we capture all of the relevant side effects. And when you have Python calling functions, calling functions, and then later they go out into C or something else, that side effect estimation is non-negligible. So we do a side effect estimation. It's an overapproximation of the side effects of a block of code, and that gets selected for checkpointing. Getting that right was a technical challenge. Determining what the side effects are, it's not possible to do a confirmation or a verification that you were able to checkpoint the state correctly without incurring large overhead.
And so one of the ways that we got around that was because of our specific logging scenario. In the first execution, during model training, we assume that the model developers log some data such as the training loss. And we assume that on replay, they log that same information. So that serves almost as a fingerprint to compare the replay fidelity to the record case. I know another challenge, for example, was just kind of the expectations: in record replay in other communities, you can brag about, like, 24x overhead, like if it's JavaScript that you're trying to run on a computer instead of on a mobile device, such cases.
But in our case, with model training, you can always just rerun training. So we have this strict less-than-2x ceiling on overhead. So, kind of, being very aggressive about favoring a system that doesn't lead to these burdens but is still able to provide significant speedups at replay time.
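The adaptive throttling described above can be sketched as follows. The 6% default and the "measure at runtime, adjust the period" policy come from the episode; the code itself is an assumed illustration, not Flor's implementation.

```python
# Sketch of adaptive checkpoint throttling: measure epoch time and checkpoint
# cost at runtime, then pick the largest checkpoint frequency that keeps the
# amortized overhead under a target budget (default 6%).
import time

def choose_period(epoch_seconds, ckpt_seconds, budget=0.06):
    """Checkpoint every k-th epoch so that ckpt_seconds / (k * epoch_seconds) <= budget."""
    epoch_seconds = max(epoch_seconds, 1e-9)
    k = 1
    while ckpt_seconds / (k * epoch_seconds) > budget:
        k += 1
    return k

def train_with_adaptive_checkpoints(num_epochs, run_epoch, save_checkpoint, budget=0.06):
    period = 1  # checkpoint the first epoch, then adapt from measurements
    for epoch in range(num_epochs):
        t0 = time.perf_counter()
        run_epoch(epoch)
        epoch_seconds = time.perf_counter() - t0

        if epoch % period == 0:
            t0 = time.perf_counter()
            save_checkpoint(epoch)
            ckpt_seconds = time.perf_counter() - t0
            period = choose_period(epoch_seconds, ckpt_seconds, budget)

if __name__ == "__main__":
    train_with_adaptive_checkpoints(
        num_epochs=10,
        run_epoch=lambda e: time.sleep(0.05),        # stand-in for real training
        save_checkpoint=lambda e: time.sleep(0.01),  # stand-in for serialization
    )
```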
[00:20:14] Unknown:
And so for somebody who is building their model training and wants to be able to take advantage of the utilities that Flor offers, what's the process for actually integrating it into that model development process and then being able to speed up the iteration cycles for successive experimentation runs and changes in model behavior?
[00:20:35] Unknown:
We have two modes. We kind of have, like, a hands-free mode where, from the command line, you call Flor, and it instruments your training script for logging. We also have kind of, like, an expert mode or an expert API, which is the one that I prefer to use right now. It makes debugging easier. I guess I can return to that point later, but, you know, debugging instrumented code is kind of its own hassle. With the manual API, you import Flor and you tell us what the main loop is. So there's an iterator, kind of like for epoch; the main loop is the one that iterates the model training epoch by epoch.
And then you memoize the nested training loop to checkpoint the model and the optimizer periodically. And so that gets you the background logging, the adaptive checkpointing, and the fast replay.
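Here is a self-contained sketch of the record-and-replay pattern just described, written with plain PyTorch and an environment flag rather than Flor's actual API (check Flor's own documentation for its real interface): mark the main epoch loop, checkpoint the model and optimizer, and on replay skip finished epochs by restoring the checkpoint instead of re-running them.

```python
# Hypothetical re-implementation of the workflow, not Flor's API.
import os
import torch
import torch.nn as nn

REPLAY = os.environ.get("REPLAY") == "1"   # record on the first run, replay later
CKPT = "ckpt_epoch_{}.pt"

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def train_one_epoch():
    x, y = torch.randn(64, 10), torch.randn(64, 1)
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()

for epoch in range(5):                      # the "main loop" you point the tool at
    path = CKPT.format(epoch)
    if REPLAY and os.path.exists(path):     # memoized block: skip work already done
        state = torch.load(path)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        loss = state["loss"]
    else:                                   # record: do the work and checkpoint it
        loss = train_one_epoch()
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "loss": loss}, path)
    print(f"epoch={epoch} loss={loss:.4f}")  # a hindsight logging statement added later
```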
[00:21:29] Unknown:
And as far as the checkpointing and log structures that you implement for being able to, you know, jump to a particular point in the training cycle, what's the actual information that you are storing there to be able to rehydrate the model at a particular point in the training cycle?
[00:21:48] Unknown:
We store some metadata, like the static identifier and the runtime identifier (since something can get called multiple times in the same execution), and the name that is given at the program level, which is usually pretty descriptive; it's something like loss or model. And then we serialize the data. And, again, serialization in Python is not something so simple. For developers who are listening, we use cloudpickle. Of all the alternatives that we've tried, it's been the one that is the most robust. I think the only thing that it can't handle is serializing generators, but other than that, it does a fairly good job.
And fortunately for us, again, we're kind of building on existing practices. TensorFlow and PyTorch already provide their own serialization primitives. So Flor is just able to detect that it's dealing with a Torch object or a TensorFlow object, and it's able to call the appropriate dictionary serialization routines and then serialize those. So what ends up being written in one checkpoint is a collection of log records, and each log record will have the metadata, the name, and then the serialized value.
And at replay time, it's able to restore those values the same way.
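As an illustration of what one checkpoint's log records might look like, here is a sketch with illustrative field names: metadata plus a serialized value, using cloudpickle as the general-purpose serializer and the framework's own state_dict for model and optimizer objects, as described above.

```python
# Illustrative log-record layout (field names are assumptions, not Flor's schema).
import cloudpickle
import torch
import torch.nn as nn

def serialize(value):
    """Prefer the framework's serialization primitives, fall back to cloudpickle."""
    if isinstance(value, (nn.Module, torch.optim.Optimizer)):
        return cloudpickle.dumps(value.state_dict())
    return cloudpickle.dumps(value)

def make_log_record(static_id, runtime_id, name, value):
    return {
        "static_id": static_id,    # where in the source code the value was named
        "runtime_id": runtime_id,  # which dynamic occurrence in this execution
        "name": name,              # the descriptive program-level name, e.g. "model"
        "value": serialize(value),
    }

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
checkpoint = [
    make_log_record("train.py:42", 0, "model", model),
    make_log_record("train.py:43", 0, "optimizer", optimizer),
    make_log_record("train.py:57", 0, "loss", 0.731),
]
```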
[00:23:04] Unknown:
And as far as the integration aspect, the machine learning ecosystem for Python has been growing and exploding over recent years, particularly in the deep learning space, with PyTorch and TensorFlow being the dominant players, but there have been a lot of other entrants into the ecosystem. I think JAX is maybe one of the more recent ones. And I'm wondering if you can talk to what the level of support is in Flor for each of these different libraries, or maybe some of the ways that you have engineered it in order to remain agnostic to the specific machine learning framework that you're operating on?
[00:23:41] Unknown:
We definitely want to be agnostic and supplementary to these systems. We want to be able to work with logging libraries like TensorBoard or Weights and Biases. So that's kind of the natural ecosystem. If people are using the manual version of Flor that doesn't do instrumentation, then they'll be able to integrate it into their Python workflows the way that they would any Python library, like NumPy or Pandas. If they're using the more advanced instrumentation library, that might have some side effects that we need to think about more carefully. So the manual mode of Flor is going to be, you know: if you know how to use TensorFlow and you know how to use PyTorch, you're gonna use Flor in a way that integrates those properly.
For the automated hands-free mode, for the side effect estimation, I think that's where we do rely on PyTorch semantics, just because we're able to specialize. And we haven't tried it with TensorFlow. I know that the TensorFlow lazy syntax is different, so I don't think we would be compatible with that. But the more eager syntax of TensorFlow is definitely something that, even for the instrumentation library, should be a simple patch. It's just something that was out of scope at the time of publication.
[00:24:58] Unknown:
And so for people who start to adopt Flor as a component of their machine learning experimentation process, what do you see as some of the long-range impact in terms of their productivity or the quality of the models or the types of experimentation, or maybe just shifts in terms of how they think about building and experimenting on models?
[00:25:21] Unknown:
I think the first consequence is just more data, or more of a record, available for analysis after the fact. I'm not sure what all the different implications of that might be. I have some examples from students that I think are quite creative. I mentioned earlier that one of the horror stories in the lab was that people were training models, finding a missing logging statement, and then rerunning that whole execution. But I think the truth, or the more common case, was even scarier than that, which is: you forgot the logging statement, it's gonna be too expensive to rerun, so you just kind of make a guess and move on. And I think that attitude is around. Again, I haven't quantified it, but that's my impression.
And I would like to know, you know, how much that contributes to our current conception of ML and AI as this very foggy, obscure thing. And I think part of it is because we're not really building this record that we can improve and iterate on and use to form a theory about why things are going wrong, or to confirm our hypotheses. We might say, well, the reason why the model training failed was because there was a dead ReLU, or maybe it failed because of exploding or vanishing gradients, or it might have been reward hacking, one of these particular problems. But if you're not able to get to the specific cause of that problem because it's too expensive to collect that data, you're not really learning from your mistakes.
So I think kind of creating a culture where, long after a model has been deployed, people can revisit training and ask these in-depth questions, or where, after something goes wrong, people are able to recover the data and do this analysis instead of just skipping it, can help with our theorizing and hypothesizing
[00:27:08] Unknown:
about how machine learning works and how we get better at it. Beyond the point where the model has been trained and you've settled on the specific architecture or the specific solution to the problem that you're trying to solve for, what is the long-term utility of Flor? Is it that it's only useful in the context of doing the initial training and experimentation, or is it something where you would keep Flor as a component of your model training as you go through successive iterations of accounting for model drift in production and, you know, dealing with some of the long-term maintenance of models to ensure that it is operating efficiently and that you don't have to worry about, you know, shifts in concept and the various kinds of productionization
[00:27:54] Unknown:
concerns that go along with it? So the lab that I'm a part of, the RISE Lab, is named that way for real-time, intelligent, secure, and explainable, and Flor is definitely a project in the explainability realm of the lab. So as far as development, training, and deployment goes, the long-term vision of Flor is to aid with that explainability, to aid with the analysis. And you're right to point out that training is only part of the story, especially in machine learning, more so than in other areas. Failures are not localized.
You can have a model that trains, and it fails because it didn't reach an accuracy higher than 75%. Or it might actually not fail on your data and only fail on data that it sees post deployment, just, like you said, because of distribution drift or other reasons. And so it's possible that even though you had no reason to look into the training in depth at the time, you might need to return to this data, not a day, but maybe a week or a month after the fact. And so it's very important that all of the context that was there at the time of training is still available post hoc, so that the model developers can answer their questions and get to the root of the problem.
[00:29:14] Unknown:
And so in terms of actually maintaining that contextual information that is generated during those training cycles, what are some of the useful strategies as far as being able to store that over the long run and being able to categorize it so that you can retrieve it and reanalyze it when you hit the point where you say, okay, this model trained and it hit 90% accuracy on this dataset, now I'm gonna put it into production, and, oh shoot, now it's only operating at 60% because my training data wasn't representative of the real world? And sort of managing those longer-horizon workflows to be able to reanalyze that context, and just some of the cataloging aspect that goes along with it? Yeah, hearing you now reminds me of kind of this
[00:29:58] Unknown:
ideal that has been going around, which is to make work self-documenting, so that, you know, ideally, you kind of just do your own work, and the relevant records are entered automatically. And then you can just ask questions of this oracle about what happened in the past. Maybe I'm a pessimist. I think that might be too hard for us to achieve; at least, I've wrestled too much with managing context. And so I tend to believe in something a little weaker, but I think it's almost just as good, which is: you should have the freedom to put in the work to annotate things at any time, and that context shouldn't degrade. So it's one of those things where, like you say, how do you capture stuff almost in the raw, but enough of it so that someone can come in with a highlighter later and tag it and give it the appropriate meaning?
That's actually leaning into the work that's ongoing in Flor, some of the next steps, where it starts to look a little bit more like a database that we interact with, you know, by inserting logging statements and posing queries, and then the record replay serves almost as a query execution engine underneath. So some of the things that we're exploring: what are the things that you need? The code, the source code, you need to version automatically. So that's currently a challenge. Auto versioning is something that we tried very early on in the project, 5 years ago, and we kind of kept taking it out, you know, putting it on the back burner and then returning to it. And the reason is that it's easy to do poorly.
There's a lot that model developers work with as far as data. It can be spec files. It can be differences in the Python environment, the virtual environment that they're using, and different users have different preferences. So right now, we're focusing on versioning the code on execution and focusing just on that code. And the checkpoints enable us to replay that execution efficiently after the fact. As for what other context we need in order to be able to answer those questions, you know, at deployment time, we're gonna need records of the inferences that the model made. That's if I can speak in general terms about what information would be necessary, as well as the information that Flor collects.
Flor is definitely, right now, focused on the narrower piece of context, which is the one that revolves around training.
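As a minimal sketch of what "versioning the code on execution" could look like, assuming a simple copy-plus-content-hash scheme (Flor's actual mechanism may differ), the running script can be snapshotted at the start of a run so the exact source can be recovered and replayed later:

```python
# Assumed approach, not Flor's implementation: archive the running script
# alongside a content hash at the start of each training run.
import hashlib
import pathlib
import shutil
import sys
import time

def snapshot_source(archive_dir=".run_versions"):
    script = pathlib.Path(sys.argv[0]).resolve()
    digest = hashlib.sha256(script.read_bytes()).hexdigest()[:12]
    dest = pathlib.Path(archive_dir) / f"{int(time.time())}_{digest}_{script.name}"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(script, dest)
    return dest  # record this path alongside the run's checkpoints and logs

if __name__ == "__main__":
    print("archived this run's code at", snapshot_source())
```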
[00:32:31] Unknown:
The other interesting aspect is that for most people, as they're going through the initial experimentation process, they might just be running on their laptop. But depending on the quantity and scale of the data that you're working with or the overall size of the model, you might hit a tipping point where you actually need to start thinking about distributed training using something like Horovod or some of the other projects that are out there. And I'm curious what level of support exists in Flor for those distributed training use cases. And if it doesn't have anything now, what do you see as some of the necessary work to be able to grow to support that, or is that just completely out of scope for the problem that you're trying to solve for?
[00:33:12] Unknown:
It's not out of scope, but it is a difficult question. I think on replay, we auto-parallelize an execution, and part of the reason why that's so simple for us is that it's a single chain of instructions that we're parallelizing. So for people familiar with the record replay problem in systems and software engineering, they know that, like, if you're trying to find the root cause of some failure in an Oracle database, you're gonna have to trace logs or dumps of a service that was running, you know, continuously for many months. Replaying the conditions that led to that bug is arduous, and you need very high fidelity.
And what that problem ends up looking like is turning a multiprocess, distributed execution into a single thread, so that it facilitates reasoning. It could be like a topological sort of a distributed execution. In our case, it was almost the inverse of that. We took this thread, this chain of instructions, and then we parallelized it, which is more in line with the tradition of transaction recovery, like the work with ARIES or transaction logging. So when we consider distributed training cases and we consider finding a temporal ordering of the logs, in that case, it's nontrivial.
So it's not something that we support yet. We have some partners who work at companies like Meta and Google, and, you know, we maintain open lines of communication there. It is definitely an extension for distributed training. Logging on distributed training is in scope, but it's just not something that we've been able to complete to this point.
[00:35:01] Unknown:
Another aspect of the work that you're doing is that you are a researcher. You're doing your graduate studies, and Flor is an open source project that you're maintaining as part of that work. And I'm wondering if you could just talk to some of the aspects of maintenance of open source projects when the core purpose of it is research-oriented goals, and some of the ways that that might contrast with other types of open source that people might be used to, where maybe it's a hobby project from somebody who's doing it in their free time, or it's a project that is open sourced as part of somebody's day-to-day work at a corporation?
[00:35:41] Unknown:
So I'm fortunate to be a member of the RISE Lab, and we have a pretty good track record as far as commercializing projects goes. Our predecessor lab, the AMP Lab, led to the launching of Spark and Databricks, now the company that maintains it. And the RISE Lab led to Ray and Anyscale; I know you mentioned Horovod, and there might be other systems in that space. I think we definitely have the support for, you know, rigorous software engineering and development. A lot of the time, though, it's a matter of team size. And so in this case, to the extent that it's a single graduate student working on the project, the maintenance cycle or latency is gonna be wider than for a project that has more maintainers. But the project is open source, and so, as a researcher, I learn the most from people who use the system.
When people report problems, maybe bringing to my attention some assumption that I got wrong or some use case that I didn't consider, bringing that back to us is extremely helpful. I can provide support to that particular person; like, we will do that as soon as we're able to get to it. And then if people want to collaborate and start contributing to the open source project, then that would definitely make some of this feature adoption, and turning the project into something more product scale, possible.
[00:37:08] Unknown:
And in your experience of building and experimenting with Flor and helping people adopt it for their own research and experimentation purposes, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:37:22] Unknown:
This semester, I worked with an undergraduate student, Alexis Wan, who was being advised by Koushik Sen, a programming languages professor at Berkeley. The way that she was using Flor: she was generating traces to serve almost as training data for another model that would then later do code summarization or code recommendation, kind of like you might hear about with Copilot from Microsoft. And so here we have Flor as a system that provides versioning of our model training executions. And what she did was she enlisted in a Kaggle competition, and she tried to beat the leaderboard scores.
And in that process, she would generate a repository with 300 versions of explorations. She would backtrack and check out a new branch and then continue that exploration. And so what she's generating is this extremely rich record of executions of all of the things that she tried, which, because she has Flor, she's able to go back into and insert logging statements after the fact. But instead of inserting logging statements, she can add program analysis instrumentation for, like, dynamic analysis; dynamic analysis systems, for example, do similar things.
And so she's able to produce these really rich traces to accompany the code, which can then be fed to a model so that it has more signal for making recommendations. And one of the applications that she was looking into is: can you detect when the documentation of a function diverges from the implementation using those means? So I guess using Flor as a means of generating data for a deep learning model was something surprising.
[00:39:16] Unknown:
In your experience of building the Flor project and maintaining it and helping to use it for research, and doing research on it in your own studies, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:39:31] Unknown:
I think I answered that question before, but one of the surprises here was to see this record replay of a Python execution being something that was within the proper domain of database theory. So when I was explaining to the database community how I replay a Python program efficiently, just really fast replay of a Python program, I was able to explain it to them using a transaction model and base it on contributions from the eighties with ARIES. So even back then you see parallel transaction recovery, which is something that was kind of similar.
And so that connection was not something that was immediately obvious to me. It also seemed to kind of explain why the overhead constraints were so strict in our use case but seemed to be so different in other papers that did record replay. And that was more because we were taking a simpler execution and turning it into something more complex, rather than the other way around as people in software engineering or systems might be doing.
[00:40:44] Unknown:
And for people who are dealing with some of these challenges of being able to quickly manage rapid experimentation cycles for their machine learning models, what are the cases where Flor is the wrong choice?
[00:40:59] Unknown:
Our parallelism gains rely on the number of epochs that you have. So model training jobs where you only have a single epoch, like scikit-learn models, or some that have very few epochs. Sometimes we've heard of cases in robotics and reinforcement learning where the model might be trained for 1 or 5 epochs, but each epoch takes a very long time to train; then it's not such a good case, because there's just not that much speedup you can get from parallelizing.
[00:41:34] Unknown:
And as you continue to iterate on the Flor project and as you're nearing the end of your studies, what are some of the things you have planned for the near to medium term, and what do you view as the long-range future of the project once you do complete your studies and move on to whatever is next?
[00:41:51] Unknown:
Yeah. So the paper that I'm working on right now, the extension of the work that I'm focusing on, is an extension over time, so that we're able to quantify and answer questions involving change over time. What that might look like is the refactoring case: someone had a model training pipeline that they wrote in TensorFlow and then another one that they would like to write in PyTorch or Hugging Face. When it's time for them to ensure that the refactor succeeds, it's a question about how two executions compare.
Another diagnostics pattern is that people don't just try something and then do analysis to see what went wrong. They try something, then they try something else, then they try something else. And once they've exhausted all their hypotheses and they don't know what's going wrong, then they kind of drill down. So by the time that they drill down, they've collected a set of executions, maybe 10 or 15. And so they would like to ask questions over that set, questions like: when did this go wrong? Had the segmentation masks been fractured all along, or is it something that I introduced when I added this particular change?
What would have happened if I had eliminated this hyperparameter modification? So posing those kinds of questions and answering them is ongoing work. And we want that to be able to scale with the questions, not with the number of executions, of which there will be many. So we want the interface to be such that the user can pose a query, maybe in something like SQL or Pandas, and insert a logging statement in the latest version of their code, and then use software patching techniques to take that logging statement and push it into every version in history. We've been doing the automatic version control, we have the checkpointing, and so then the system picks an intelligent re-execution schedule to collect the data and answer the questions. It could be in the aggregate, or it could be approximate query processing.
So the next extension for Flor is kind of lengthening the horizon so that we get some visibility across how things change, not just the latest execution or model change.
[00:44:05] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, we'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. And this week, I'm going to choose a movie I watched over the weekend. We saw The Batman, the latest reboot in the long series of reboots for the Batman franchise. It runs almost 3 hours, but it was actually a very well done movie: some good acting, a good take on the character. So definitely recommend that for folks who are looking for something to keep them occupied over a long evening. So with that, I'll pass it to you, Rolando. Do you have any picks this week? Well, one pick I'll highlight, based on a TV show since you were speaking about a movie: my wife and I have really enjoyed the show Severance
[00:44:47] Unknown:
on Apple TV. I think just one season has been out right now, but it was definitely a binge-worthy show, so I'd definitely recommend it. And as far as the tech goes, I've become a really huge fan of Codespaces with GitHub. That has made it really easy for me to take my prototype and hand it to undergraduates, and to see the work that they're doing, be able to inspect their work, and also manage their environments, because, as people who have TA'd computer science classes know, you spend most of office hours setting that up. So check out Codespaces. It might help you manage your IDEs. And if you're working with other people and managing their IDEs, that'll also make your life easier.
[00:45:28] Unknown:
Awesome. Well, thank you very much for taking the time today to join me and share the work that you've been doing on Flor. It's definitely a very interesting project and an interesting problem domain. So I appreciate all of the time and energy that you've put into helping make it a bit more of a tractable problem. I appreciate that, and I hope you enjoy the rest of your day. Thank you so
[00:45:48] Unknown:
much.
[00:45:49] Unknown:
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Rolando Garcia's Background and Journey
Introduction to FLOR Project
Development and Challenges in FLOR
Challenges in Model Training and Experimentation
Implementation and Design of FLOR
Evolution and Early Ideas of FLOR
Core Engineering Problems and API Design
Integrating FLOR into Model Development
Framework Support and Agnosticism
Long-term Impact and Productivity Gains
Utility of FLOR Beyond Initial Training
Maintaining Contextual Information
Support for Distributed Training
Maintaining Open Source Projects for Research
Interesting Uses of FLOR
Lessons Learned from Building FLOR
When FLOR is Not the Right Choice
Future Plans for FLOR
Picks and Recommendations