Summary
The experimentation phase of building a machine learning model involves a lot of trial and error. One of the factors limiting how many experiments you can run is the time required to train the model, which can be on the order of days or weeks. To reduce the time required to test different iterations, Rolando Garcia Sanchez created FLOR, a library that automatically checkpoints training epochs and instruments your code so that you can bypass early training cycles when you want to explore a different path in your algorithm. In this episode he explains how the tool speeds up your experimentation phase and how to get started with it.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Rolando Garcia about FLOR, a suite of machine learning tools for hindsight logging that lets you speed up model experimentation by checkpointing training data
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what FLOR is and the story behind it?
- What is the core problem that you are trying to solve for with FLOR?
- What are the fundamental challenges in model training and experimentation that make it necessary?
- How do machine learning researchers and engineers address this problem in the absence of something like FLOR?
- Can you describe how FLOR is implemented?
- What were the core engineering problems that you had to solve for while building it?
- What is the workflow for integrating FLOR into your model development process?
- What information are you capturing in the log structures and epoch checkpoints?
- How does FLOR use that data to prime the model training to a given state when backtracking and trying a different approach?
- How does the presence of FLOR change the costs of ML experimentation and what is the long-range impact of that shift?
- Once a model has been trained and optimized, what is the long-term utility of FLOR?
- What are the opportunities for supporting e.g. Horovod for distributed training of large models or with large datasets?
- What does the maintenance process for research-oriented OSS projects look like?
- What are the most interesting, innovative, or unexpected ways that you have seen FLOR used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on FLOR?
- When is FLOR the wrong choice?
- What do you have planned for the future of FLOR?
Keep In Touch
- rlnsanz on GitHub
- @rogarcia_sanz on Twitter
Picks
- Tobias
- Rolando
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- FLOR
- UC Berkeley
- Joe Hellerstein
- MLOps
- RISE Lab
- AMP Lab
- Clipper Model Serving
- Ground Data Context Service
- Context: The Missing Piece Of The Machine Learning Lifecycle
- Airflow
- Copy on write
- ASTor
- Green Tree Snakes: Python AST Documentation
- MLFlow
- Amazon Sagemaker
- Cloudpickle
- Horovod
- Ray Anyscale
- PyTorch
- Tensorflow
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L I N O D E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Rolando Garcia about Flor, a suite of machine learning tools for hindsight logging that lets you speed up model experimentation by checkpointing training data. So, Rolando, can you start by introducing yourself?
[00:01:11] Unknown:
Of course. Thank you so much, Tobias. So I'm a PhD candidate at UC Berkeley, which means that I'm very close to graduating and getting my PhD. I'm advised by Joe Hellerstein; he's an expert in databases. This will be my fifth year. I came to Berkeley from Arizona State, and I have had the opportunity to collaborate with companies like Amazon, as well as recently working with a startup in a similar space. Throughout my studies at the PhD level at Berkeley, I have focused on the model management life cycle, which recently acquired a more convenient acronym: MLOps.
[00:01:54] Unknown:
Yeah. MLOps is definitely a term that's been gaining a lot of ground recently. Do you remember how you first got introduced to Python?
[00:02:00] Unknown:
When I was learning Python, the other languages that I knew were Java and C#. I came across Python as a consequence of wanting to learn AI. I was trying to teach it to myself. I was a student at Arizona State, and I wanted to learn the latest and greatest AI tools and techniques at the time. What I had access to were online courses, MOOCs from Stanford and Berkeley. I don't know if your audience would be familiar with the Pac-Man project that Berkeley hosts on edX or Coursera. I started participating in that, doing the Pac-Man project activities, and that's how I started to learn Python. This was around 2015.
[00:02:46] Unknown:
That brings us now to the project that you've been working on recently, FLOR, for Fast Low Overhead Recovery. I'm wondering if you could describe a bit about what it is, some of the story behind how it came to be, and why that was a problem worthy of your attention.
[00:03:02] Unknown:
So, Flor, the project here: when I entered my graduate studies in 2017, I joined the RISE Lab, which was the successor to the AMP Lab at UC Berkeley. It was directed by Ion Stoica, Raluca Ada Popa, Joe Hellerstein, and Joey Gonzalez. Jointly, they were experts in systems, security, databases, and ML systems, so it was a really fortunate combination of experts at Berkeley. I came in, and I was advised by Joe Hellerstein and was working very closely with Joey Gonzalez, an ML person. When I came into the lab, there was a project called Clipper, headed by Joey Gonzalez, that was meant for low latency prediction serving.
So the question that Joey put to me during the interview was: assume that ML works and AI is working as it is, so we've made enough progress. How do we get it out there? How do we, like, deploy it and maintain it? Kind of moving beyond the basic questions of model training. So Clipper allowed model developers to deploy their models and serve predictions, kind of in the ML-as-a-service spirit. And the next step was pipeline automation: building pipelines that train models and are then able to automatically deploy and maintain those models, because machine learning models have degrading performance over time. Their quality can degrade.
So they were working on a system called Jarvis that would help model developers train models and then upload or update them as necessary to maintain service level objectives. That was a system being headed at the time that I joined. Another project being developed was called Ground, which was a data context service. The aim of Ground was to manage the lineage, provenance, usage, history, and logs of data. I think we've seen several headline stories about, like, FiveThirtyEight misreporting some result as a consequence of a lab accidentally clobbering a dataset that they didn't really understand where it came from. So when Joe and Joey discussed together what the problems were around model management and also this context management, we found that the problems that we saw with data management were exacerbated with model development and model management. And not long after that, we published a paper on context, the missing piece of the machine learning life cycle.
And so the idea was: can we build a system that allows model developers to train a model, that allows them to flow easily between the development and production environments, and, long after production, allows developers to answer questions about failures that happened months ago, because we know that machine learning has these long failure horizons. So Flor is flower in Spanish; it came about as, like, sprouting from Ground. At first, the data context service was such that model developers had to write a DAG or workflow, kind of like you would in Airflow, for what the execution was. So you had a data preprocessing step, a model training step, a validation step, and so on. And when we met with sponsors at, like, poster sessions, one of the comments that I repeatedly heard was that people were not interested in rewriting existing pipelines, mostly because of trust and reliability.
One thing that a person said that really stood out to me was that they had a model training pipeline written in Perl 7 years ago, and the engineer who had written that pipeline was no longer at the company. And so they said, this is something that runs routinely. We're not interested in rewriting it in Flor. How can you help us? So that's kind of where we started, and then the journey changed pretty drastically from there to answer the question: what's the most tooling or the most support we can provide to a model developer without asking or relying on any kind of cooperation from them?
So this could be because we have a data scientist who's a domain expert and a nontechnical user, or it could be someone who's working at a company against a deadline, and so they don't have time for these, like, best practices. So what does the tooling look like? And one of the first lines of research for Flor involved program instrumentation, to be able to provide this kind of tooling without the user having to have that direct manual control.
[00:07:50] Unknown:
You mentioned that Flor came out of the Ground project in order to help with this question of context, particularly when building and training machine learning models. And I'm wondering if you can just talk to some of the fundamental challenges that people experience when they're going through that model training and experimentation process and some of the utilities that Flor provides to help remediate those issues?
[00:08:14] Unknown:
Yeah. So there are two problems that we focus on with Flor in model training. I guess the more fundamental problem is that engineering best practices are at odds with exploration flexibility and agility. When people are exploring properly, it's almost like playing in a sandbox, and the last thing they wanna do is be very methodical and meticulous about record keeping practices. So I guess the challenge is one of human behavior: how do you ask people to record and keep track of everything that they're trying, so that they can learn from previous mistakes and, you know, build a coherent theory about how machine learning is working, without having this burden slow down their speed of iteration? Because we know from machine learning that the more alternatives you can try in some allotted block of time, the better your performance is going to get. It's a very empirical science.
So that was the first, kind of, fundamental problem. And the more immediate one that was kind of irritating for us in the lab was that there were folks rerunning model training jobs. And in a lab at Berkeley, that could be something that runs overnight or can take as long as a week to train. And they would see that the model wasn't learning anything. It was flatlining, and they wouldn't understand why. And then when they would return and look at their code, they would realize that they were missing a logging statement. Missing a logging statement in training is not like in systems, where you can sometimes just do that cyclic debugging pattern.
It takes very long. So it was one of those things where the execution is long running enough and the data is valuable enough that we need to have some tooling to recover it. In the absence of something like Flor, what are some of the ways that machine learning researchers and engineers might address that problem, and some of the strategies
[00:10:05] Unknown:
that are, you know, maybe Band-Aids on the solution, but are ultimately too cumbersome to be a standard practice?
[00:10:13] Unknown:
The work that we've done in Flor, we evaluated, and we have an accompanying publication in VLDB. Your question reminds me of something a reviewer asked. What we ended up doing in the paper, one of the baselines that we compared against for Flor, was a hand-tuned or expert approach. So everything that Flor does can be done by hand with proper methodology. If you take periodic checkpoints and you are serializing the model parameters and the optimizer parameters, you're probably gonna be fine. The way that you use those to answer questions, I feel like once you understand the post hoc analysis, the hindsight logging way of doing things, you can do yourself. Actually, I think that's infinitely more valuable than having a tool, if we're able to distill what these minimal best practices are that someone could reproduce even without adopting any tool. So people already follow checkpointing practices.
What they do with those checkpoints, I think they can be more creative about. So, usually, the regular checkpointing that people do, they might use it for warm starting model training, or they might use it for fault tolerance. Like, if a model training run dies after 178 epochs of training, they might wanna pick it up from 179 and continue. But with hindsight logging, if you missed a logging statement and you wanna reproduce the entire execution, this is a great opportunity for parallelism. You could have one worker do the job from 0 to 178 and the other worker pick it up from 179 and continue.
So you know that you can use this for data recovery, these kinds of checkpoint-resume patterns, by dispatching parallel workers simultaneously. So that would be one example.
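To make the parallel replay idea concrete, here is a minimal sketch that assumes PyTorch-style checkpoints of the model and optimizer state; the file names, the toy model, and the 300-epoch range are illustrative, not taken from Flor.

```python
# Illustrative sketch (not Flor's API): replay two disjoint epoch ranges in
# parallel from a periodic checkpoint, e.g. to recover a logging statement
# that was never emitted on the first run.
import torch
import torch.nn as nn
import torch.multiprocessing as mp

def make_state():
    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    return model, optimizer

def train_one_epoch(model, optimizer):
    x, y = torch.randn(64, 10), torch.randn(64, 1)
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()

def replay_range(ckpt_path, start_epoch, end_epoch):
    """Restore the state saved at `start_epoch` (if any) and re-run up to `end_epoch`."""
    model, optimizer = make_state()
    if ckpt_path is not None:
        state = torch.load(ckpt_path)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
    for epoch in range(start_epoch, end_epoch):
        loss = train_one_epoch(model, optimizer)
        # The hindsight logging statement that was missing on the first run:
        print(f"epoch={epoch} loss={loss:.4f}")

if __name__ == "__main__":
    # Worker 1 replays epochs 0-178 from scratch; worker 2 resumes from the
    # checkpoint taken after epoch 178 and continues to the end.
    jobs = [
        mp.Process(target=replay_range, args=(None, 0, 179)),
        mp.Process(target=replay_range, args=("ckpt_178.pt", 179, 300)),
    ]
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()
```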
[00:12:04] Unknown:
In terms of the Flor project itself, can you talk through some of the ways that it's implemented and some of the design considerations that you built in to make it low overhead in terms of implementation, but also low overhead in terms of actual execution?
[00:12:20] Unknown:
Flor is entirely in Python. I've actually had the fortune of working with many undergraduate students at Berkeley, some of whom went on to receive their master's degrees working on the system. The first student, Eric Liu, worked on background materialization. And the idea was: how can we have the main thread or the main process focus on model training, and have the other cores, of which there are many in these clusters, help with serializing the data and then writing it to disk, just to further reduce logging overhead? And this actually turned out to be a very Python-specific problem, because anywhere else, you would just solve this with multithreading.
You would solve this problem with multithreading and do that kind of concurrency control. Python has a global interpreter lock, and so that wasn't a real option for us. We then started thinking about, well, do we have multiple processes that communicate with each other and, you know, pass the tensors around? Eric evaluated multiple different alternatives, and the reason why they were eliminated was that serialization was very costly in our setting. Serialization ended up being 4 times more expensive than writing to a file.
And so if we have to serialize something to put it into PyArrow, or if we have to serialize data to put it into a queue that then gets message passed, that's gonna lead to too much overhead. It's almost not worth it. So the way that we ended up solving that was with fork and copy-on-write semantics. This is data that is going to get written once, so fork serves as a kind of one-shot, one-way IPC. We copy the data, and the child process does the serialization, so we're able to get some savings on overhead there. One of the ideas that was explored was writing low-level code in something like C or C++ for this kind of multithreaded background logging. But, again, we saw that it wasn't really necessary with fork, so we were able to just stick with vanilla Python.
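A minimal sketch of the fork-based background materialization idea described here, with the structure assumed for illustration rather than taken from Flor's internals: the parent keeps training while a forked child, which sees a copy-on-write snapshot of process memory, pays the serialization and disk-write cost.

```python
# Fork as one-shot, one-way IPC (Unix only): the child inherits a COW view of
# the parent's state at fork time and does the expensive write, then exits.
import os
import pickle

def background_checkpoint(obj, path):
    """Fork; the child serializes `obj` and exits, the parent returns immediately."""
    pid = os.fork()
    if pid == 0:  # child process
        try:
            with open(path, "wb") as f:
                pickle.dump(obj, f)
        finally:
            os._exit(0)  # never fall back into the parent's code path
    return pid  # parent: continue training without blocking on serialization

if __name__ == "__main__":
    state = {"epoch": 42, "weights": [0.1] * 1_000_000}
    child = background_checkpoint(state, "/tmp/ckpt_42.pkl")
    # ... the parent would keep doing model training work here ...
    os.waitpid(child, 0)  # reap the child at a convenient point
```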
And then for the other pieces, there's adaptive checkpointing, and there's also the instrumentation piece, which rewrites user code to do this automatic checkpointing and automatic resume; for that we use the standard AST library as well as ASTor. And I also wanna give a shout-out to Green Tree Snakes, which is documentation for AST parsing and transformation in Python that I think is very valuable. I heard about it from Jonathan Ragan-Kelley; I think he's now a professor at MIT. That was very early in my graduate career, and it's been one of the more useful documentation sites that I've used.
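For readers unfamiliar with AST rewriting, here is a toy illustration of the kind of source-to-source transformation the standard ast module supports (astor fills the unparser role on older Pythons). The `checkpoint()` hook name is hypothetical, and this is far simpler than Flor's actual transformer.

```python
# Toy source rewriting with the standard library: append a call to a
# hypothetical checkpoint() hook at the end of every for-loop body.
import ast

SRC = """
for epoch in range(num_epochs):
    loss = train(model, optimizer)
"""

class InjectCheckpoint(ast.NodeTransformer):
    def visit_For(self, node):
        self.generic_visit(node)                    # also rewrite nested loops
        hook = ast.parse("checkpoint()").body[0]    # an ast.Expr wrapping the call
        node.body.append(hook)
        return node

tree = InjectCheckpoint().visit(ast.parse(SRC))
ast.fix_missing_locations(tree)
print(ast.unparse(tree))  # astor.to_source(tree) on Pythons older than 3.9
```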
[00:15:03] Unknown:
In terms of the evolution of the project, you mentioned that early on, you had one particular direction and then ended up shifting focus. I'm wondering if you could just talk to some of the initial ideas that you had about the problem space and some of the ways that your thinking on the potential solutions has changed and how that has been reflected in terms of the evolution of the implementation of Flor?
[00:15:26] Unknown:
I think in the very early days, we ended up looking a lot like something like Amazon SageMaker or MLflow, for example, where you can specify pipelines that later execute. So the goal here was to reduce some of the friction between development and training. We were aware that data scientists, you know, who may be economists or physicists, might write some of these models, but then they need to train at scale. Someone has to take them from Python and rewrite them in C++ and then make them capable of distributed training, and there were other issues like that. And so we thought that by making the language match something that was more production oriented from the beginning, we could avoid some of those problems. But that just ended up becoming a barrier to adoption, so we started to move away from asking people, like, well, tell me what the inputs are, what the computation is, what the computation produces, and then where it writes it to. You can see that a lot of that information is already encoded in the program, just from the way that programming languages work. A lot of that structure is already there.
So a function name can tell you, you know, roughly what the function is trying to do. You know how many inputs it has. You can do some kind of duck typing, like, runtime checks to see what data's coming in, do some hash checksums to see how it's being transferred and connected. And so we started kind of moving towards a direction of: what's the most we can infer ourselves without involving that direct intervention from the user?
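A hypothetical sketch of that "infer it from the program" idea, not Flor code: a decorator records a step's name and content hashes of its inputs and outputs at runtime instead of asking the user to declare a pipeline up front.

```python
# Hypothetical illustration: observe pipeline structure via introspection and
# content hashes rather than a user-written DAG.
import functools
import hashlib
import pickle

def fingerprint(value):
    """Best-effort content hash so reuse of the same data can be detected."""
    try:
        return hashlib.sha256(pickle.dumps(value)).hexdigest()[:12]
    except Exception:
        return f"<unhashable {type(value).__name__}>"

def observed(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        record = {
            "step": fn.__name__,                       # names are usually descriptive
            "inputs": [fingerprint(a) for a in args],  # how data flows between steps
        }
        out = fn(*args, **kwargs)
        record["output"] = fingerprint(out)
        print(record)  # in practice this would go to a context/lineage store
        return out
    return wrapper

@observed
def preprocess(raw):
    return [x * 2 for x in raw]

@observed
def train(features):
    return sum(features) / len(features)

train(preprocess([1, 2, 3]))
```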
[00:17:09] Unknown:
As far as the core engineering problems that you had to solve for in Flor, and maybe some of the API design to make it approachable for people who didn't wanna dig into the specifics of what the system is actually doing under the hood: wondering if you can just talk through some of those considerations and some of the challenging elements of actually building this library and making it accessible for people who just wanna get their job done.
[00:17:34] Unknown:
I think the core engineering contributions would be the background materialization and adaptive checkpointing. So the model developer can choose the kind of logging overhead that they're willing to tolerate. The default is at 6%, and then the system automatically throttles the period with which the model is checkpointed to make sure that that target is met. We do that through runtime checks, runtime analysis. And then in addition to that, I think the core of the implementation, the engineering challenge, was instrumenting Python code. People who work with Python and people who work with program analysis know that the same things that make Python so easy to start using off the bat make it very difficult for a static analysis engine. You know? So you have to make some kinds of assumptions.
For Flor, one of the things we wanna do with memoization is, if we're skipping a block of Python code during replay for speedups, we wanna make sure that we capture all of the relevant side effects. And when you have Python calling functions, calling functions, and then later they go out into C or something else, that side effect estimation is non-negligible. So we do a side effect estimation. It's an overapproximation of the side effects of a block of code, and that gets selected for checkpointing. Getting that right was a technical challenge. Determining what the side effects are, it's not possible to do a confirmation or a verification that you were able to checkpoint the state correctly without incurring large overhead.
And so one of the ways that we got around that was because of our specific logging scenario. In the first execution, during model training, we assume that the model developers log some data such as the training loss. And we assume that on replay, they log that same information. So that serves almost as a fingerprint to compare the replay fidelity to the record case. I know another challenge, for example, was just kind of the expectations: in record replay in other communities, you can brag about, like, 24x overhead, like if it's JavaScript that you're trying to run on a computer instead of on a mobile device, such cases.
But in our case, with model training, you can always just rerun training. So we have this strict less-than-2x ceiling on overhead. So, kind of, being very aggressive about favoring a system that doesn't lead to these burdens but is still able to provide significant speedups at replay time.
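The adaptive throttling described above can be sketched as follows. The 6% default and the "measure at runtime, adjust the period" policy come from the episode; the code itself is an assumed illustration, not Flor's implementation.

```python
# Sketch of adaptive checkpoint throttling: measure epoch time and checkpoint
# cost at runtime, then pick the largest checkpoint frequency that keeps the
# amortized overhead under a target budget (default 6%).
import time

def choose_period(epoch_seconds, ckpt_seconds, budget=0.06):
    """Checkpoint every k-th epoch so that ckpt_seconds / (k * epoch_seconds) <= budget."""
    epoch_seconds = max(epoch_seconds, 1e-9)
    k = 1
    while ckpt_seconds / (k * epoch_seconds) > budget:
        k += 1
    return k

def train_with_adaptive_checkpoints(num_epochs, run_epoch, save_checkpoint, budget=0.06):
    period = 1  # checkpoint the first epoch, then adapt from measurements
    for epoch in range(num_epochs):
        t0 = time.perf_counter()
        run_epoch(epoch)
        epoch_seconds = time.perf_counter() - t0

        if epoch % period == 0:
            t0 = time.perf_counter()
            save_checkpoint(epoch)
            ckpt_seconds = time.perf_counter() - t0
            period = choose_period(epoch_seconds, ckpt_seconds, budget)

if __name__ == "__main__":
    train_with_adaptive_checkpoints(
        num_epochs=10,
        run_epoch=lambda e: time.sleep(0.05),        # stand-in for real training
        save_checkpoint=lambda e: time.sleep(0.01),  # stand-in for serialization
    )
```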
[00:20:14] Unknown:
And so for somebody who is building their model training and wants to be able to take advantage of the utilities that Flor offers, what's the process for actually integrating it into that model development process and then being able to speed up the iteration cycles for successive experimentation runs and changes in model behavior?
[00:20:35] Unknown:
We have two modes. We kind of have, like, a hands-free mode where, from the command line, you call Flor, and it instruments your training script for logging. We also have kind of, like, an expert mode or an expert API, which is the one that I prefer to use right now. It makes debugging easier. I guess I can return to that point later, but, you know, debugging instrumented code is kind of its own hassle. With the manual API, you import Flor and you tell us what the main loop is. So there's an iterator, kind of like for epoch; the main loop is the one that iterates the model training epoch by epoch.
And then you memoize the nested training loop to checkpoint the model and the optimizer periodically. And so that gets you the background logging, the adaptive checkpointing, and the fast replay.
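Here is a self-contained sketch of the record-and-replay pattern just described, written with plain PyTorch and an environment flag rather than Flor's actual API (check Flor's own documentation for its real interface): mark the main epoch loop, checkpoint the model and optimizer, and on replay skip finished epochs by restoring the checkpoint instead of re-running them.

```python
# Hypothetical re-implementation of the workflow, not Flor's API.
import os
import torch
import torch.nn as nn

REPLAY = os.environ.get("REPLAY") == "1"   # record on the first run, replay later
CKPT = "ckpt_epoch_{}.pt"

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def train_one_epoch():
    x, y = torch.randn(64, 10), torch.randn(64, 1)
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()

for epoch in range(5):                      # the "main loop" you point the tool at
    path = CKPT.format(epoch)
    if REPLAY and os.path.exists(path):     # memoized block: skip work already done
        state = torch.load(path)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        loss = state["loss"]
    else:                                   # record: do the work and checkpoint it
        loss = train_one_epoch()
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "loss": loss}, path)
    print(f"epoch={epoch} loss={loss:.4f}")  # a hindsight logging statement added later
```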
[00:21:29] Unknown:
And as far as the checkpointing and log structures that you implement for being able to, you know, jump to a particular point in the training cycle, what's the actual information that you are storing there to be able to rehydrate the model at a particular point in the training cycle?
[00:21:48] Unknown:
We store some metadata, like the static identifier and the runtime identifier (since something can get called multiple times in the same execution), and the name that is given at the program level, which is usually pretty descriptive; it's something like loss or model. And then we serialize the data. And, again, serialization in Python is not something so simple. For developers who are listening, we use cloudpickle. Of all the alternatives that we've tried, it's been the one that is the most robust. I think the only thing that it can't handle is serializing generators, but other than that, it does a fairly good job.
And fortunately for us, again, we're kind of building on existing practices. TensorFlow and PyTorch already provide their own serialization primitives. So Flor is just able to detect that it's dealing with a Torch object or a TensorFlow object, and it's able to call the appropriate dictionary serialization routines and then serialize those. So what ends up being written in one checkpoint is a collection of log records, and each log record will have the metadata, the name, and then the serialized value.
And at replay time, it's able to restore those values the same way.
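As an illustration of what one checkpoint's log records might look like, here is a sketch with illustrative field names: metadata plus a serialized value, using cloudpickle as the general-purpose serializer and the framework's own state_dict for model and optimizer objects, as described above.

```python
# Illustrative log-record layout (field names are assumptions, not Flor's schema).
import cloudpickle
import torch
import torch.nn as nn

def serialize(value):
    """Prefer the framework's serialization primitives, fall back to cloudpickle."""
    if isinstance(value, (nn.Module, torch.optim.Optimizer)):
        return cloudpickle.dumps(value.state_dict())
    return cloudpickle.dumps(value)

def make_log_record(static_id, runtime_id, name, value):
    return {
        "static_id": static_id,    # where in the source code the value was named
        "runtime_id": runtime_id,  # which dynamic occurrence in this execution
        "name": name,              # the descriptive program-level name, e.g. "model"
        "value": serialize(value),
    }

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
checkpoint = [
    make_log_record("train.py:42", 0, "model", model),
    make_log_record("train.py:43", 0, "optimizer", optimizer),
    make_log_record("train.py:57", 0, "loss", 0.731),
]
```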
[00:23:04] Unknown:
And as far as the integration aspect, the machine learning ecosystem for Python has been growing and exploding over recent years, particularly in the deep learning space, with PyTorch and TensorFlow being the dominant players, but there have been a lot of other entrants into the ecosystem. I think JAX is maybe one of the more recent ones. And I'm wondering if you can talk to what the level of support is in Flor for each of these different libraries, or maybe some of the ways that you have engineered it in order to remain agnostic to the specific machine learning framework that you're operating on?
[00:23:41] Unknown:
We definitely want to be agnostic and supplementary to these systems. We want to be able to work with logging libraries like TensorBoard or Weights and Biases. So that's kind of the natural ecosystem. If people are using the manual version of Flor that doesn't do instrumentation, then they'll be able to integrate it into their Python workflows the way that they would any Python library, like NumPy or Pandas. If they're using the more advanced instrumentation library, that might have some side effects that we need to think about more carefully. So the manual mode of Flor is going to be, you know: if you know how to use TensorFlow and you know how to use PyTorch, you're gonna use Flor in a way that integrates those properly.
For the automated hands-free mode, for the side effect estimation, I think that's where we do rely on PyTorch semantics, just because we're able to specialize. And we haven't tried it with TensorFlow. I know that the TensorFlow lazy syntax is different, so I don't think we would be compatible with that. But the more eager syntax of TensorFlow is definitely something that, even for the instrumentation library, should be a simple patch. It's just something that was out of scope at the time of publication.
[00:24:58] Unknown:
And so for people who start to adopt Flor as a component of their machine learning experimentation process, what do you see as some of the long-range impact in terms of their productivity or the quality of the models or the types of experimentation, or maybe just shifts in terms of how they think about building and experimenting on models?
[00:25:21] Unknown:
I think the first consequence is just more data, or more of a record, available for analysis after the fact. I'm not sure what all the different implications of that might be. I have some examples from students that I think are quite creative. I mentioned earlier that one of the horror stories in the lab was that people were training models, finding a missing logging statement, and then rerunning that whole execution. But I think the truth, or the more common case, was even scarier than that, which is: you forgot the logging statement, it's gonna be too expensive to rerun, so you just kind of make a guess and move on. And I think that attitude is around. Again, I haven't quantified it, but that's my impression.
And I would like to know, you know, how much that contributes to our current conception of ML and AI as this very foggy, obscure thing. And I think part of it is because we're not really building this record that we can improve and iterate on and use to form a theory about why things are going wrong, or to confirm our hypotheses. We might say, well, the reason why the model training failed was because there was a dead ReLU, or maybe it failed because of exploding or vanishing gradients, or it might have been reward hacking, one of these particular problems. But if you're not able to get to the specific cause of that problem because it's too expensive to collect that data, you're not really learning from your mistakes.
So I think kind of creating a culture where, long after a model has been deployed, people can revisit training and ask these in-depth questions, or where, after something goes wrong, people are able to recover the data and do this analysis instead of just skipping it, can help with our theorizing and hypothesizing
[00:27:08] Unknown:
about how machine learning works and how we get better at it. Beyond the point where the model has been trained and you've settled on the specific architecture or the specific solution to the problem that you're trying to solve for, what is the long-term utility of Flor? Is it that it's only useful in the context of doing the initial training and experimentation, or is it something where you would keep Flor as a component of your model training as you go through successive iterations of accounting for model drift in production and, you know, dealing with some of the long-term maintenance of models to ensure that it is operating efficiently and that you don't have to worry about, you know, shifts in concept and the various kinds of productionization
[00:27:54] Unknown:
concerns that go along with it? So the lab that I'm a part of, the RISE Lab, is named that way for real-time, intelligent, secure, and explainable, and Flor is definitely a project in the explainability realm of the lab. So as far as development, training, and deployment goes, the long-term vision of Flor is to aid with that explainability, to aid with the analysis. And you're right to point out that training is only part of the story, especially in machine learning, more so than in other areas. Failures are not localized.
You can have a model that trains, and it fails because it didn't reach an accuracy higher than 75%. Or it might actually not fail on your data and only fail on data that it sees post deployment, just, like you said, because of distribution drift or other reasons. And so it's possible that even though you had no reason to look into the training in depth at the time, you might need to return to this data, not a day, but maybe a week or a month after the fact. And so it's very important that all of the context that was there at the time of training is still available post hoc, so that the model developers can answer their questions and get to the root of the problem.
[00:29:14] Unknown:
And so in terms of actually maintaining that contextual information that is generated during those training cycles, what are some of the useful strategies as far as being able to store that over the long run and being able to categorize it so that you can retrieve it and reanalyze it when you hit the point where you say, okay, this model trained and it hit 90% accuracy on this dataset, now I'm gonna put it into production, and, oh shoot, now it's only operating at 60% because my training data wasn't representative of the real world? And sort of managing those longer-horizon workflows to be able to reanalyze that context, and just some of the cataloging aspect that goes along with it? Yeah, hearing you now reminds me of kind of this
[00:29:58] Unknown:
ideal that has been going around, which is to make work self-documenting, so that, you know, ideally, you kind of just do your own work, and the relevant records are entered automatically. And then you can just ask questions of this oracle about what happened in the past. Maybe I'm a pessimist. I think that might be too hard for us to achieve; at least, I've wrestled too much with managing context. And so I tend to believe in something a little weaker, but I think it's almost just as good, which is: you should have the freedom to put in the work to annotate things at any time, and that context shouldn't degrade. So it's one of those things where, like you say, how do you capture stuff almost in the raw, but enough of it so that someone can come in with a highlighter later and tag it and give it the appropriate meaning?
That's actually leaning into the work that's ongoing in Flor, some of the next steps, where it starts to look a little bit more like a database that we interact with, you know, by inserting logging statements and posing queries, and then the record replay serves almost as a query execution engine underneath. So some of the things that we're exploring: what are the things that you need? The code, the source code, you need to version automatically. So that's currently a challenge. Auto versioning is something that we tried very early on in the project, 5 years ago, and we kind of kept taking it out, you know, putting it on the back burner and then returning to it. And the reason is that it's easy to do poorly.
There's a lot that model developers work with as far as data. It can be spec files. It can be differences in the Python environment, the virtual environment that they're using, and different users have different preferences. So right now, we're focusing on versioning the code on execution and focusing just on that code. And the checkpoints enable us to replay that execution efficiently after the fact. As for what other context we need in order to be able to answer those questions, you know, at deployment time, we're gonna need records of the inferences that the model made. That's if I can speak in general terms about what information would be necessary, as well as the information that Flor collects.
Flor is definitely, right now, focused on the narrower piece of context, which is the one that revolves around training.
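As a minimal sketch of what "versioning the code on execution" could look like, assuming a simple copy-plus-content-hash scheme (Flor's actual mechanism may differ), the running script can be snapshotted at the start of a run so the exact source can be recovered and replayed later:

```python
# Assumed approach, not Flor's implementation: archive the running script
# alongside a content hash at the start of each training run.
import hashlib
import pathlib
import shutil
import sys
import time

def snapshot_source(archive_dir=".run_versions"):
    script = pathlib.Path(sys.argv[0]).resolve()
    digest = hashlib.sha256(script.read_bytes()).hexdigest()[:12]
    dest = pathlib.Path(archive_dir) / f"{int(time.time())}_{digest}_{script.name}"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(script, dest)
    return dest  # record this path alongside the run's checkpoints and logs

if __name__ == "__main__":
    print("archived this run's code at", snapshot_source())
```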
[00:32:31] Unknown:
The other interesting aspect is that for most people, as they're going through the initial experimentation process, they might just be running on their laptop. But depending on the quantity and scale of the data that you're working with or the overall size of the model, you might hit a tipping point where you actually need to start thinking about distributed training using something like Horovod or some of the other projects that are out there. And I'm curious what level of support exists in Flor for those distributed training use cases. And if it doesn't have anything now, what do you see as some of the necessary work to be able to grow to support that, or is that just completely out of scope for the problem that you're trying to solve for?
[00:33:12] Unknown:
It's not out of scope, but it is a difficult question. I think on replay, we auto-parallelize an execution, and part of the reason why that's so simple for us is that it's a single chain of instructions that we're parallelizing. So for people familiar with the record replay problem in systems and software engineering, they know that, like, if you're trying to find the root cause of some failure in an Oracle database, you're gonna have to trace logs or dumps of a service that was running, you know, continuously for many months. Replaying the conditions that led to that bug is arduous, and you need very high fidelity.
And what that problem ends up looking like is turning a multiprocess, distributed execution into a single thread, so that it facilitates reasoning. It could be like a topological sort of a distributed execution. In our case, it was almost the inverse of that. We took this thread, this chain of instructions, and then we parallelized it, which is more in line with the tradition of transaction recovery, like the work with ARIES or transaction logging. So when we consider distributed training cases and we consider finding a temporal ordering of the logs, in that case, it's nontrivial.
So it's not something that we support yet. We have some partners who work at companies like Meta and Google, and, you know, we maintain open lines of communication there. It is definitely an extension for distributed training. Logging on distributed training is in scope, but it's just not something that we've been able to complete to this point.
[00:35:01] Unknown:
Another aspect of the work that you're doing is that you are a researcher. You're doing your graduate studies, and Flor is an open source project that you're maintaining as part of that work. And I'm wondering if you could just talk to some of the aspects of maintenance of open source projects when the core purpose of it is research-oriented goals, and some of the ways that that might contrast with other types of open source that people might be used to, where maybe it's a hobby project from somebody who's doing it in their free time, or it's a project that is open sourced as part of somebody's day-to-day work at a corporation?
[00:35:41] Unknown:
So I'm fortunate to be a member of the RISE Lab, and we have a pretty good track record as far as commercializing projects goes. Our predecessor lab, the AMP Lab, led to the launching of Spark and Databricks, now the company that maintains it. And the RISE Lab led to Ray and Anyscale; I know you mentioned Horovod, and there might be other systems in that space. I think we definitely have the support for, you know, rigorous software engineering and development. A lot of the time, though, it's a matter of team size. And so in this case, to the extent that it's a single graduate student working on the project, the maintenance cycle or latency is gonna be wider than for a project that has more maintainers. But the project is open source, and so, as a researcher, I learn the most from people who use the system.
When people report problems, maybe bringing to my attention some assumption that I got wrong or some use case that I didn't consider, bringing that back to us is extremely helpful. I can provide support to that particular person; like, we will do that as soon as we're able to get to it. And then if people want to collaborate and start contributing to the open source project, then that would definitely make some of this feature adoption, and turning the project into something more product scale, possible.
[00:37:08] Unknown:
And in your experience of building and experimenting with Flor and helping people adopt it for their own research and experimentation purposes, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:37:22] Unknown:
This semester, I worked with an undergraduate student, Alexis Wan, who was being advised by Koushik Sen, a programming languages professor at Berkeley. The way that she was using Flor: she was generating traces to serve almost as training data for another model that would then later do code summarization or code recommendation, kind of like you might hear about with Copilot from Microsoft. And so here we have Flor as a system that provides versioning of our model training executions. And what she did was she enlisted in a Kaggle competition, and she tried to beat the leaderboard scores.
And in that process, she would generate a repository with 300 versions of explorations. She would backtrack and check out a new branch and then continue that exploration. And so what she's generating is this extremely rich record of executions of all of the things that she tried, which, because she has Flor, she's able to go back into and insert logging statements after the fact. But instead of inserting logging statements, she can add program analysis instrumentation for, like, dynamic analysis; dynamic analysis systems, for example, do similar things.
And so she's able to produce these really rich traces to accompany the code, which can then be fed to a model so that it has more signal for making recommendations. And one of the applications that she was looking into is: can you detect when the documentation of a function diverges from the implementation using those means? So I guess using Flor as a means of generating data for a deep learning model was something surprising.
[00:39:16] Unknown:
In your experience of building the Flor project and maintaining it and helping to use it for research, and doing research on it in your own studies, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:39:31] Unknown:
I think I answered that question before, but one of the surprises here was to see this record replay of a Python execution being something that was within the proper domain of database theory. So when I was explaining to the database community how I replay a Python program efficiently, just really fast replay of a Python program, I was able to explain it to them using a transaction model and base it on contributions from the eighties with ARIES. So even back then you see parallel transaction recovery, which is something that was kind of similar.
And so that connection was not something that was immediately obvious to me. It also seemed to kind of explain why the overhead constraints were so strict in our use case but seemed to be so different in other papers that did record replay. And that was more because we were taking a simpler execution and turning it into something more complex, rather than the other way around as people in software engineering or systems might be doing.
[00:40:44] Unknown:
And for people who are dealing with some of these challenges of being able to quickly manage rapid experimentation cycles for their machine learning models, what are the cases where Flor is the wrong choice?
[00:40:59] Unknown:
Our parallelism gains rely on the number of epochs that you have. So model training jobs where you only have a single epoch, like scikit-learn models, or some that have very few epochs. Sometimes we've heard of cases in robotics and reinforcement learning where the model might be trained for 1 or 5 epochs, but each epoch takes a very long time to train; then it's not such a good case, because there's just not that much speedup you can get from parallelizing.
[00:41:34] Unknown:
And as you continue to iterate on the Flor project and as you're nearing the end of your studies, what are some of the things you have planned for the near to medium term, and what do you view as the long-range future of the project once you do complete your studies and move on to whatever is next?
[00:41:51] Unknown:
Yeah. So the paper that I'm working on right now, the extension of the work that I'm focusing on, is an extension over time, so that we're able to quantify and answer questions involving change over time. What that might look like is the refactoring case: someone had a model training pipeline that they wrote in TensorFlow and then another one that they would like to write in PyTorch or Hugging Face. When it's time for them to ensure that the refactor succeeds, it's a question about how two executions compare.
Another diagnostics pattern is that people don't just try something and then do analysis to see what went wrong. They try something, then they try something else, then they try something else. And once they've exhausted all their hypotheses and they don't know what's going wrong, then they kind of drill down. So by the time that they drill down, they've collected a set of executions, maybe 10 or 15. And so they would like to ask questions over that set, questions like: when did this go wrong? Had the segmentation masks been fractured all along, or is it something that I introduced when I added this particular change?
What would have happened if I had eliminated this hyperparameter modification? So posing those kinds of questions and answering them is ongoing work. And we want that to be able to scale with the questions, not with the number of executions, of which there will be many. So we want the interface to be such that the user can pose a query, maybe in something like SQL or Pandas, and insert a logging statement in the latest version of their code, and then use software patching techniques to take that logging statement and push it into every version in history. We've been doing the automatic version control, we have the checkpointing, and so then the system picks an intelligent re-execution schedule to collect the data and answer the questions. It could be in the aggregate, or it could be approximate query processing.
So the next extension for Flor is kind of lengthening the horizon so that we get some visibility across how things change, not just the latest execution or model change.
[00:44:05] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, we'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. And this week, I'm going to choose a movie I watched over the weekend. We saw The Batman, the latest reboot in the long series of reboots for the Batman franchise. It runs almost 3 hours, but it was actually a very well done movie: some good acting, a good take on the character. So definitely recommend that for folks who are looking for something to keep them occupied over a long evening. So with that, I'll pass it to you, Rolando. Do you have any picks this week? Well, one pick I'll highlight, based on a TV show since you were speaking about a movie: my wife and I have really enjoyed the show Severance
[00:44:47] Unknown:
on Apple TV. I think just one season has been out right now, but it was definitely a binge-worthy show, so I'd definitely recommend it. And as far as the tech goes, I've become a really huge fan of Codespaces with GitHub. That has made it really easy for me to take my prototype and hand it to undergraduates, and to see the work that they're doing, be able to inspect their work, and also manage their environments, because, as people who have TA'd computer science classes know, you spend most of office hours setting that up. So check out Codespaces. It might help you manage your IDEs. And if you're working with other people and managing their IDEs, that'll also make your life easier.
[00:45:28] Unknown:
Awesome. Well, thank you very much for taking the time today to join me and share the work that you've been doing on Flor. It's definitely a very interesting project and an interesting problem domain. So I appreciate all of the time and energy that you've put into helping make it a bit more of a tractable problem. I appreciate that, and I hope you enjoy the rest of your day. Thank you so
[00:45:48] Unknown:
much.
[00:45:49] Unknown:
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Rolando Garcia's Background and Journey
Introduction to FLOR Project
Development and Challenges in FLOR
Challenges in Model Training and Experimentation
Implementation and Design of FLOR
Evolution and Early Ideas of FLOR
Core Engineering Problems and API Design
Integrating FLOR into Model Development
Framework Support and Agnosticism
Long-term Impact and Productivity Gains
Utility of FLOR Beyond Initial Training
Maintaining Contextual Information
Support for Distributed Training
Maintaining Open Source Projects for Research
Interesting Uses of FLOR
Lessons Learned from Building FLOR
When FLOR is Not the Right Choice
Future Plans for FLOR
Picks and Recommendations