Summary
Deep learning frameworks encourage you to focus on the structure of your model ahead of the data that you are working with. Ludwig is a tool that uses a data-oriented approach to building and training deep learning models so that you can experiment faster based on the information that you actually have, rather than spending all of your time manipulating features to make them match your inputs. In this episode, Travis Addair explains how Ludwig is designed to improve the adoption of deep learning for more companies and a wider range of users. He also explains how the Horovod framework plugs in easily to allow for scaling your training workflow from your laptop out to a massive cluster of servers and GPUs. The combination of these tools allows for a declarative workflow that starts off easy but gives you full control over the end result.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Travis Addair about building and training machine learning models with Ludwig and Horovod
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what Horovod and Ludwig are?
- How do the projects work together?
- What was your path to being involved in those projects and what is your current role?
- There are a number of AutoML libraries available for frameworks such as scikit-learn, etc. What are the challenges that are introduced by applying that workflow to deep learning architectures?
- What are the use cases that Ludwig is designed to enable?
- Who are the target users of Ludwig?
- How do the workflows change/progress for the different personas?
- How is the underlying framework architected?
- What are the available extension points to provide a progressive exposure of complexity?
- How have the goals and design of the project changed or evolved as it has gained more widespread adoption beyond Uber?
- What was the motivation for migrating the core of Ludwig from TensorFlow to PyTorch?
- Can you describe the workflow of building a model definition with Ludwig?
- How much knowledge of neural network architectures and their relevant characteristics is necessary to use Ludwig effectively?
- What are the motivating factors for adding Horovod to the process?
- What is involved in moving from a single machine/single process training loop to a multi-core or multi-machine distributed training process?
- The combination of Ludwig and Horovod provide a shallower learning curve for building and scaling model training. What do you see as their potential impact on the availability and adoption of more sophisticated ML capabilities across organizations of varying scale?
- What do you see as other significant barriers to widespread use of ML functionality?
- What are the most interesting, innovative, or unexpected ways that you have seen Ludwig and/or Horovod used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Ludwig and Horovod?
- When is Ludwig and/or Horovod the wrong choice?
- What do you have planned for the future of both projects?
Keep In Touch
- @TravisAddair on Twitter
- tgaddair on GitHub
Picks
- Tobias
- Travis
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Ludwig
- Horovod
- Predibase
- Uber
- Michelangelo
- TensorFlow
- PyTorch
- Gradient Boosted Trees
- XGBoost
- CatBoost
- LightGBM
- PyCaret
- HyperBand
- scikit-optimize
- Keras
- Vision Transformer Architecture
- HuggingFace
- Jax
- DeepSpeed
- AllReduce
- Nvidia Collective Communications Library (NCCL)
- Training Epoch
- ElasticDL
- Raft Consensus Algorithm
- TorchScript
- Transfer Learning
- Gordon Bell Prize
- Anyscale
- Ray
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's l I n o d e, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Travis Addair about building and training machine learning models with Ludwig and Horovod. So, Travis, can you start by introducing yourself? Thanks for having me on today. So I'm Travis. I'm currently the CTO of a stealth-ish company called Predibase,
[00:01:15] Unknown:
and I'm also the lead maintainer on the Horovod open source project and 1 of the core maintainers of the Ludwig project. Previously spent about 4 and a half years at Uber being a tech lead and manager for the Michelangelo machine learning platform there, specifically leading the deep learning training team. And do you remember how you first got introduced to Python? Yeah. So that was quite a while ago, I think. But it was actually in the kind of data science and machine learning regime, in fact. So I used to work as an applied machine learning engineer at Google. And at the time, Python was 1 of the 3 main languages that we used. So I got to get a good bit of experience there as well as when I was doing kind of my master's work in machine learning.
[00:02:00] Unknown:
And so that brings us to the Horovod and Ludwig projects. I'm wondering if you can just discuss a bit about what those are
[00:02:07] Unknown:
and maybe how they relate to each other and how you got involved with each of them. Yeah. Yeah. So Horovod and Ludwig are both projects that came out of the Uber AI Group. So this is, you know, the part of Uber that's focused on building machine learning, like, kind of general machine learning technology. And so Horovod was originally started by a good friend and coworker of mine, Alex Sergeev. And the purpose was to kinda find ways to scale up the training of, at the time, TensorFlow models, but then also we made it work with PyTorch as well because the existing solutions that were out there at the time were very limited, very rudimentary, and kind of were still kind of stuck in some older ways of thinking about how to do distributed training.
So I was the 2nd person to work on that project. And then after Alex left to become the CTO of a robotics company destroying weeds with lasers, I took over the project. And Ludwig is kind of a similar story in terms of how I got involved. So I was 1 of the early people working on that project, which was originally started by my co-founder, Piero Molino, as part of the work that he was doing at Uber AI. And my involvement there was in large part in figuring out ways that we could integrate solutions like Horovod into the project to help scale it and productionize it and put it in the hands of more people.
[00:03:27] Unknown:
My understanding is that the Ludwig project is oriented around being a sort of low code or AutoML style tool chain around deep learning frameworks or at least specifically the TensorFlow framework, and we'll dig more into the specifics of which framework later. But wondering if you can talk through some of the ways that the deep learning aspect of that problem statement is maybe 1 of the more interesting or challenging aspects of building a tool like Ludwig?
[00:03:57] Unknown:
Yeah. I think that there are lots of technologies that exist in this AutoML space that largely revolve around gradient boosted trees and the early successes of applying kind of standard operating procedures, if you will, to tabular data, like well structured data. And so you can see that in even, you know, XGBoost being a popular framework, CatBoost, LightGBM, you know, Databricks and Spark have built a lot of nice integrations on top of tree based models. And there are, you know, even higher level frameworks like, you know, PyCaret that also tackle similar problems. But deep learning tends to be a very different beast in a lot of ways. So it's still very much a research problem. So the best model architectures haven't been fully solved and commoditized yet. And the modalities of data that we work with are often very different in structure. Right? So instead of just having a bunch of rows and columns in a database or a CSV file, you often have, you know, documents of text data or images of, you know, whatever, videos, audio samples.
So there's lots of different types of data that we work with in deep learning. And so as a result, you need something that ends up being a lot more flexible and a lot more kind of thinking about the problem differently, I guess you could say, such that you can rapidly incorporate all of these new modalities of data and all of these new types of techniques as well, kind of as they come out of the research community. And that was kind of, I think, the primary motivating factor behind Ludwig was, could we think about structuring a deep learning project differently so that we aren't having to write these bespoke models from scratch in TensorFlow or PyTorch every time we wanna apply them to a new type of data. So to kind of give the high level pitch, I guess, in a way is that I think most machine learning frameworks for deep learning in particular are very model centric where they want you to think first and foremost about what's my model architecture? How do I reshape the data to make it work in this model?
Whereas Ludwig takes a more data centric approach where we say, what is the type of data that you're working with? Is it numerical or categorical or image or whatever? What do you wanna predict? And then we will provide the model architecture that will work for that and allow you to configure it to your requirements as well. It's kind of a second order operation.
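To make that data-centric, declarative style concrete, here is a minimal sketch of what a Ludwig model definition might look like when driven from Python. The column names are invented for illustration, and the exact type names vary between Ludwig releases (older versions spell the numeric type "numerical" rather than "number"):

```python
from ludwig.api import LudwigModel

# Minimal, data-centric model definition: declare the columns and their types,
# and Ludwig supplies an architecture for them. Column names are hypothetical.
config = {
    "input_features": [
        {"name": "ticket_text", "type": "text"},      # free-form text column
        {"name": "trip_duration", "type": "number"},  # numeric column
        {"name": "city", "type": "category"},         # categorical column
    ],
    "output_features": [
        {"name": "ticket_type", "type": "category"},  # the column to predict
    ],
}

model = LudwigModel(config)
results = model.train(dataset="support_tickets.csv")  # CSV path is a placeholder
```

The same configuration can equally be written as a YAML file and passed to the `ludwig train` command line interface instead of the Python API.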
[00:06:22] Unknown:
As far as the main use cases that the Ludwig project is being applied to and maybe some of the motivating use cases of what brought about its creation. I'm wondering if you can talk through sort of the domain and problem requirements that make a tool like Ludwig useful and more applicable in sort of the general case than maybe that more model centric approach that you were describing?
[00:06:47] Unknown:
Yeah. Absolutely. So I think particularly when you're working with what we would call multimodal data or that would be specifically any data that combines images and text or text and tabular data or a tabular and audio, any of those sorts of combinations. I think Ludwig becomes an extremely compelling choice. And when we look at sort of how the framework developed, that was exactly what we saw happening a lot at Uber. So 1 of the first applications of Ludwig was on this project called the Customer Obsession Ticket Assistant, or COTA, model. And the idea behind that project was that we had customer support tickets coming in from Uber customers who were saying things like, you know, something about the trip was incorrect or the trip was recorded as ending, you know, much later than it actually ended or something like that.
And, you know, oftentimes this just becomes a very heavy burden on, you know, human operators who have to triage all these tickets and figure out what's happening. So naturally, we wanted to apply machine learning to this to make it easier to automate this process of triaging the tickets and coming up with appropriate resolutions. And the interesting thing about this data is that, yes, you have the raw text of the ticket that the user is submitting, but you also have information about, you know, the user and the trip and all these other things that are kind of in this more structured or tabular type of data that you also can use to make an informed decision about what type of ticket it is and what kind of resolution is appropriate.
And so something like Ludwig then becomes a very natural choice where each of these new features that you want to incorporate just becomes 1 additional column, if you will, in the data as opposed to, you know, having to think about grappling with all these different types of data in a custom bespoke solution. Right? So that was kind of the motivation was scenarios like that. And so when we see usage of Ludwig in the wild, we see a lot of people applying it in just that exact same way where, you know, they want to very quickly be able to incorporate any type of data into a machine learning project and not have to worry about what are the downstream implications for how I'm going to fuse these things together into a training process.
[00:09:02] Unknown:
And I know that 1 of the very active areas in the deep learning community and some of the research that's being performed right now is in the space of these sort of hybrid or multitask models where you, as you said, have multiple different types of data that you wanna be able to train and infer on. And I'm wondering sort of how that relates to the use cases that Ludwig is being applied to or if there are any sort of specific requirements or complexities that are involved in those multitask models where maybe you want to combine both image and audio processing or image and text processing to be able to support, you know, both a natural
[00:09:48] Unknown:
language is very well situated for. So you raise a good point here is that it's not just about the multimodality of the input, but also the multimodality of the output as well. So you can have a single model that, you know, predicts the intent of the user from some text, but that also then generates an automated response using a text decoder, right? Something like that. Or you can combine image and text and then like do some sort of captioning model or something like that. So all these things are certainly possible. And 1 thing that I think Ludwig does that's very novel in this area is also thinking about task dependencies or data dependencies that exist within the prediction. So you might have a particular field that you want to predict that depends on the output of another part of the model. Right? So so coming back to the, you know, ticket classification thing. So based on the intent of the user, you might then make a different decision about what sort of response you want to automatically generate.
Right? And so you can define these sorts of relationships very naturally in Ludwig without having to write any code. It's all declarative and all driven by configuration. So, yeah, I absolutely think that the future of machine learning is multimodal and multitask. And so frameworks like Ludwig that provide the higher level abstractions to enable you to solve those sorts of problems, this is, I think, where things are gonna be headed in the next couple of years.
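As a rough illustration of the output dependencies described here, a Ludwig configuration can declare that one output feature conditions another. The column names below are hypothetical and the schema details depend on the Ludwig version:

```python
# Hypothetical multi-task config: predict the intent of a ticket and also generate
# a suggested response whose decoder is conditioned on the predicted intent via
# the "dependencies" field of the output feature.
config = {
    "input_features": [
        {"name": "ticket_text", "type": "text"},
    ],
    "output_features": [
        {"name": "intent", "type": "category"},
        {
            "name": "suggested_response",
            "type": "text",
            "dependencies": ["intent"],  # use the intent prediction as an input here
        },
    ],
}
```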
[00:11:13] Unknown:
As you started to become involved with the Ludwig project and as you have been exploring it and working with it more over the past few years, I'm wondering what you have seen as the sort of target persona for an end user of Ludwig and maybe what that range happens to be and how you have approached the sort of gradations of workflow for being able to accommodate that different sort of range of personas.
[00:11:41] Unknown:
Yeah. That's a great question. So I think broadly, we see that most users of Ludwig today fall into 1 of 2 camps. 1 is the more data science oriented persona. So this is someone who, you know, understands data science and, you know, works very frequently in, like, comparing models, but where the help they want from a framework is on things like, you know, working with deep learning specifically, because oftentimes, you know, writing deep learning models is a different sort of problem than, you know, working with tree based models or other types of solutions like that. And so Ludwig, I think, provides a very accessible experience to people who are not deep learning researchers, but still understand the data science and understand, you know, what to look for, and provides a very nice interface to do so.
The other type of user that we often see is what I would call more of the citizen data scientist or even like a software engineer, more strictly speaking. Right? So we had a number of people like this that were interested in using Ludwig at Uber, where you aren't necessarily, like, tasked with solving a particular machine learning problem as your job. You're maybe tasked with maintaining the payments platform as a back end service or, you know, any number of types of other traditional software engineering roles. But as part of that, you have some kind of, like, P2 machine learning problem where, you know, you say, oh, wouldn't it be great if we could use machine learning to make this 1 part of our system more efficient or, you know, easier for us to develop, etcetera.
And the problem with those sorts of tasks is that in most organizations, this means going to the data science group and saying, can you get me resources to help me solve this problem? And they'll say, no, we're too busy solving the problems that, you know, affect our bottom line or our top line in some, like, very measurable way. Right? And so realistically, these problems never get solved. But with Ludwig, it kind of empowers these people who, you know, understand their data and understand how to use, you know, Python packages more broadly to just be able to apply state of the art deep learning very quickly to those tasks without needing to involve, like, an expert data scientist. Right? And so I call this the long tail of machine learning problems. Right? So this isn't the types of problems that you staff an entire organization to solve, but these are the problems that, you know, individually are not necessarily going to make or break your business, but when you combine them all together, have a very significant impact on the overall health of your organization.
[00:14:25] Unknown:
And I'm wondering if you can talk through the way that the Ludwig project is architected and implemented and some of the sort of progressive exposure of complexity for being able to support those different user personas of going from, I understand the data. I don't wanna dig into the deep learning. I just wanna solve this problem to, I understand the deep learning, and I know how to tune it and tweak it. And I want to be able to sort of get the best possible performance and accuracy out of this model, but I don't want to have to do all of the sort of discovery and feature selection from scratch.
[00:15:00] Unknown:
I love this question because I think this really gets to the heart of what makes Ludwig different from most other AutoML-ish projects out there. So the way that I kind of think about it is that most AutoML solutions provide a black box where you try it and if it doesn't work, you maybe have a few knobs you can turn. And if that doesn't work, then, okay, it's time to write PyTorch or TensorFlow code. And what I like about Ludwig as a structure is that it is meant to be this glass box, right? Where it can provide all the automation for you as a starting point. And if at any point along the way you want to introduce some domain knowledge, like you know something about how your data should be encoded or represented or how you should handle missing values or whatever, then Ludwig gives you the tools to do that through this configuration language. Right? So the way that you specify models, it's a model-as-config kind of structure where you have a YAML file that describes what your features are and at a bare minimum just what your inputs and outputs are and what types they are. And then from there you can get a lot more expressive as you start to move from that, you know, non machine learning expert type persona into the more expert regime. And that's where you can start specifying things like, you know, what's my learning rate? What kind of early stopping do I want to use? How do I want to pre process my data? Do I want to encode this text feature using a BERT encoder or using, like, RoBERTa or something else, all the way down to hyperparameter optimization?
And do I wanna use HyperBand or do I wanna use, you know, scikit-optimize? And then figuring out from there all of those nitty gritty details of how you want to perform your training operation. And so the nice thing is that we provide these reasonable defaults for everything. Right? So if you don't specify it, we'll just do whatever we think makes the most sense. And if you do specify, we honor your choice. And that's how we think about creating this gradient essentially from, you know, entry level to power user for the user so that no matter where you fall on that spectrum, there's value that you can get out of Ludwig.
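The progression from a bare-minimum config to a power-user config might look something like the sketch below. The section and parameter names are indicative only, since they have shifted between releases (for example, the training section became `trainer` around the 0.5 release), and anything left out falls back to Ludwig's defaults:

```python
# The same kind of config, grown with domain knowledge: encoder choice, preprocessing
# overrides, trainer settings, and a hyperparameter search space. Keys are indicative.
config = {
    "input_features": [
        {
            "name": "ticket_text",
            "type": "text",
            "encoder": "bert",                      # swap in a pretrained encoder
            "preprocessing": {"lowercase": True},   # override a preprocessing default
        },
        {
            "name": "trip_duration",
            "type": "number",
            "preprocessing": {"missing_value_strategy": "fill_with_mean"},
        },
    ],
    "output_features": [{"name": "ticket_type", "type": "category"}],
    "trainer": {
        "learning_rate": 1e-4,   # explicit choice instead of the default
        "early_stop": 5,         # stop after 5 epochs without improvement
    },
    "hyperopt": {
        "parameters": {
            "trainer.learning_rate": {
                "space": "loguniform", "lower": 1e-5, "upper": 1e-3,
            },
        },
    },
}
```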
[00:17:15] Unknown:
As you mentioned, the Ludwig project originated at Uber to be able to solve some of the problems that the data science team there was encountering as far as being able to just accelerate their pace of experimentation and innovation. And I'm wondering how the exposure to the sort of broader ecosystem and the problem space outside of the bounds of Uber has shifted or changed the overall goals and design of the project as it has evolved to be able to meet that broader set of goals and needs?
[00:17:44] Unknown:
Definitely, I would say when you look at the types of features that Ludwig supports and here, I mean, strictly like the features in terms of data types and things like that, very many of them I think have come not from specific problems that we had at Uber, but from just broader kind of things that people thought would be useful for their own problems out in the outside world. Right? So while initially a lot of problems for Uber were focused on text features, categorical, numerical, binary, there are also lots of people that have data types that could be represented as a sequence or a bag or audio features or time series features. Right?
And so all of these things, date and datetime features, all of these things have been kind of incrementally added as the community kind of sees that, oh, like my type of data is different from, you know, what you were working with at Uber. Like, how can we find ways to support that? File formats is another 1 as well. So we've, you know, put in a lot of effort to make sure that every type of data structure that we can think of, we can support. We've even worked with several kind of other organizations on coming up with standards around how do you represent a bounding box for, you know, doing kind of like object detection tasks and things like that. So these are all things that we've been actively working with the community to solve.
And scalability is another 1 as well. So, you know, at Uber, when the project was originally created, we had our own way of thinking about how do we distribute things, how do we scale it up. Oftentimes, the challenges that we see in industry are very different. Users have their own kind of clusters that they're running on that have their own limitations. And so making it easier to scale
[00:19:28] Unknown:
has been a big kind of community component to Ludwig as well. And that brings us to the Horovod project, which you have also been involved with and I know is, broadly speaking, a way to be able to parallelize your machine learning training jobs across a cluster of CPU and GPU resources. And I'm wondering if you can talk to some of the ways that you have approached the design of Horovod and its integration with Ludwig to be able to make that as transparent as possible to the person who's using it so that they can go from, I have this simple model. I was able to train it on my laptop, but it's taking too long, or I wanna throw more data on it to I'm now running this across, you know, 500 GPUs on a cluster that's taking me from 10 hours to train down to 5 minutes?
[00:20:16] Unknown:
I would say that the way that we typically think about distributed training, particularly for Ludwig, is that oftentimes people have this expectation that it should be as simple as, you know, I just add more machines and I get better throughput and then my model just trains faster. And that's certainly the goal. But historically, that has not been true. Right? And so oftentimes, you know, making distributed training work has been kind of an optimization problem unto itself that you need to solve kind of before you can actually get good model performance.
And so Horovod provides the low level set of tools to make the training work and make it work efficiently. And then what we try to do with Ludwig is additionally provide the layers on top to make it also produce good models. Right? And so, you know, for example, at a very base level, when you want to scale up, you need to think about, you know, how am I going to partition my data? How am I going to ingest it in parallel among these different workers? How am I going to adjust my learning rate as I scale up the number of workers? How am I going to maybe adjust the learning rate schedule as well? Right? Because oftentimes you'll want to do some warm up followed by some decay and things like that.
All these things vary quite considerably based on, you know, to what degree you're scaling up the number of workers. And these are things that we, you know, have worked very hard to abstract from the user as part of Ludwig. So that's where I see the interaction coming in, is that, you know, we have already spent the time to figure out what needs to be done to use Horovod well. But again, for the people who say, you know, this isn't my primary concern, I'm more interested in just getting a good model, Ludwig provides this all-inclusive package that not only abstracts away the complexity of the model, but the complexity of the infrastructure as well. And that infrastructure abstraction is where Horovod fits into the picture.
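A minimal sketch of what Horovod asks you to add to a single-process PyTorch loop, covering the pieces mentioned here (data partitioning, learning rate scaling, gradient averaging, and broadcasting the initial state); the model and dataset are toy placeholders, and the script would typically be launched with something like `horovodrun -np 4 python train.py`:

```python
import torch
import horovod.torch as hvd

hvd.init()                                        # start the Horovod context
torch.cuda.set_device(hvd.local_rank())           # pin each worker process to one GPU

# toy stand-ins for a real model and dataset
model = torch.nn.Linear(10, 1).cuda()
dataset = torch.utils.data.TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))

# partition the data so each worker reads a different shard
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=64, sampler=sampler)

# common heuristic: scale the learning rate by the number of workers
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# wrap the optimizer so gradients are averaged across workers on every step
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# make sure every replica starts from the same parameters and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```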
[00:22:19] Unknown:
I'm definitely interested in digging a bit more into Horovod. But before we go too far afield from Ludwig, I'm also interested in discussing some of the recent changes that have been underway to migrate from TensorFlow as the core framework over to PyTorch and what were some of the motivating issues or the desired benefits that you were aiming at when that project started, and maybe some of the ways that this abstraction layer has been able to insulate the end users from this migration so that they don't have to necessarily change all of their models and code and update all their repositories to be able to take advantage of this shift? Yeah. Absolutely. So that is, I think, 1 of the nice things about Ludwig is that we do think of the framework as an implementation detail. So
[00:23:04] Unknown:
for most users of Ludwig, there may be some power users that were very intimately familiar with the TensorFlow parts, but for most users of Ludwig, you know, upgrading to the new 0.5 release that will be coming out later this month, you know, should be just a matter of, like, pip install, and then you don't even have to think about it. Right? And in terms of our relationship with TensorFlow and PyTorch, so when the Ludwig project was started by Piero, you know, PyTorch didn't exist at the time. Right? So TensorFlow was kind of the first to market, and so TensorFlow was what was used. Over time, it started to become apparent that more and more people in the research community were gravitating towards PyTorch.
But then around the same time, TensorFlow 2 came out. And so we, you know, put a big effort in to migrate to TensorFlow 2 and to see what effect that had. And certainly, I think that from a developer standpoint, TensorFlow 2 was a big step up. But we still had a lot of issues with performance as well as with keeping up with changes that they were making to the API. So for example, TensorFlow's relationship with their kind of modeling API Keras has always been pretty interesting. So Keras used to be a separate package, then it was part of TensorFlow, then now it's a separate package again. And these all have a lot of implications for us as, you know, users of that API. But primarily, I think my thought process was around looking at kind of how the ecosystem around each project was evolving and seeing that there were a lot of really cool new packages coming out for PyTorch that we wanted to incorporate into Ludwig.
And it was becoming increasingly difficult to find the equivalence for TensorFlow to just use off the shelf. Right? So a good example here, you know, Google, I believe it was DeepMind specifically or Google Research, created the Vision Transformer architecture. And 1 of the things we wanted to do was say, okay, can we put a Vision Transformer pre trained encoder into Ludwig so that, you know, we can use that to make really good embeddings of images to use as features in the models without having to do any kind of heavyweight training first? And surprisingly, there wasn't a really good, certainly not Google produced, TensorFlow Vision Transformer implementation.
Instead, you know, Hugging Face and the PyTorch community had created such a thing. Right? And the original implementation of the Vision Transformer that was open source was written in JAX, which is, you know, another framework that's in some ways competing with TensorFlow, but also created by the folks at Google, so in some ways they kind of cannibalized their own ecosystem a little bit with that. And as a result, it just seemed that in order to be able to take advantage of some of these more interesting developments in the community, PyTorch certainly had the momentum behind it. Another example here is Microsoft and OpenAI have recently invested heavily in PyTorch, and particularly as a Horovod maintainer, I'm also particularly interested in frameworks like DeepSpeed that attempt to solve problems around training very large models on multiple GPUs.
And, you know, for us, we've always been very GPU centric and I feel that increasingly PyTorch has a very good amount of momentum behind it on GPU specifically that, you know, we are increasingly keen to take advantage of. So those are a few of the major reasons for us. And so far we've definitely had very minimal headaches with PyTorch as we've integrated into Ludwig.
[00:26:40] Unknown:
Discussing more about the Horovod project and the overall problem of being able to scale the training and inference across a suite of machines, I'm wondering if you can start by talking about some of the ways that that is a distinct and unique problem as compared to what a lot of people might be looking to for being able to do scale out computation in more of the data processing domain where they might use Spark or Dask and just some of the differences in terms of the resource scheduling and data distribution and some of the communication overhead requirements for being able to maintain those different types of workflows?
[00:27:20] Unknown:
So I think people often compare distributed training to distributed data processing for good reason because they both are coming at this problem of my data is too big. How do I fan out instead of scaling up? Right? And to me, what I often point to as being the big differentiator here is that most parallel data processing is embarrassingly parallel. Like, there might be some occasions where you need to do some aggregations. So that would typically be like the reduce in a map reduce or something like that. But by and large, most of it is very easy to just kind of fan out and you don't have to worry about interactions between the different workers that are working on their own little partition of the data. Deep learning is a bit different, however, because you also have this state, the statefulness inherent in the training process that comes from the model parameters themselves.
And so every time a worker processes some partition of the data, it also needs to do an update to the model parameters. And those model parameters need to be kept globally in sync among the workers or else you get a kind of drift that occurs where a particular replica is trying to move the model in a direction that no longer makes sense because it's too far behind the other workers. Right? So you need to keep them as close and sync as possible. And that's essentially what the entire role of Horovod in the process is, is that every time you compute these update steps on the model, Horovod does the work of aggregating them together and typically something like an average and then applying those synchronously to all of the model replicas.
And so, you know, in the early days before Horovod, this was facilitated by what you would call a parameter server that acts as like a central repository for all of the either model parameters or gradient updates. But this ended up being a pretty significant bottleneck in a lot of cases where, you know, all of the data has to flow through this 1 node. And so as you add more and more workers to the process, you very quickly get bottlenecked. There were ways that people thought to address this by sharding the model parameter server, but then that introduces its own overhead. So what people started to do, and by people here I mean Baidu originally, being the ones who were the first to publish on this if I remember correctly, was point out that there was a paradigm from the high performance computing community, specifically a technique called all reduce, where you have a decentralized aggregation that occurs in what is typically a ring but can also be a tree, where, you know, the different workers pass the updates to 1 another and send them along the way. And then by the time you do a couple of loops around the ring, everyone has all of the updates from all the other workers. And this turned out to be, for the ring all reduce operation, the bandwidth optimal way of doing this particular summation.
And another nice advance that came out was NVIDIA introduced a library called NCCL, pronounced "nickel", that's specifically designed to do this kind of fast aggregation on the GPU memory directly, so you don't have to do any copying between GPU and host memory. You just do it all using remote direct memory access on the GPUs. And so it can be extremely fast in practice. And this has become, in a large way, the dominant kind of paradigm of doing distributed training today.
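For a sense of what that looks like at the API level, Horovod exposes the aggregation directly: each worker hands in its local tensor and gets back the average, with NCCL performing the ring (or tree) all reduce on GPU memory when it is available. The tensor here is just a stand-in for a gradient:

```python
import torch
import horovod.torch as hvd

hvd.init()
local_grad = torch.randn(4, device="cuda")           # stand-in for one worker's gradient
global_avg = hvd.allreduce(local_grad, name="grad")  # averaged across all workers
```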
[00:30:48] Unknown:
As far as the scalability characteristics, anytime you have to move across machines, you add additional overhead as far as the networking and communication. And so that means that you're always going to be scaling sublinearly. And I'm wondering what you have seen as some of the real world capabilities and limitations of being able to use Horovod and this MPI style communication pattern to be able to scale out computation, whether that's in terms of the number of nodes or the volume of data or, you know, the limitation as far as the floor of training time as soon as you have to scale out across these instances?
[00:31:25] Unknown:
So what we've observed is that in traditional, kind of, data center environments, you're typically going to be bounded by your network bandwidth between the nodes. And I think for distributed training, you have this very complex interdependency between all these different resources. So you not only need to think about your network, but also your GPU itself, like how fast it can actually compute the gradients that need to be transmitted across the network, as well as how fast is your data pipeline that's feeding data into the GPU, right? So this might also have some CPU operations, some IO, some, you know, in memory caching.
So you pretty much are bringing to bear the full set of resources to solve this problem efficiently. And it's absolutely the case that at a certain point, you're going to get bottlenecked by 1 thing or another. So whether it's going to be your GPU or going to be your network, etcetera. Oftentimes what we see is you add more and more nodes. There does come a point where things start to drop off. And so this in particular was a whole body of work that was done by 1 of my good friends at NVIDIA, Josh Romero, who's a key contributor to Horovod. So he worked with some folks at Oak Ridge National Lab running on the Summit supercomputer, which at the time was the largest supercomputer in the world, to try to scale up deep learning to what we call, like, the exascale.
Right? So running at, like, a very high level of floating point operations per second. I think it was, like, 1.1 exaflops or something like that was what they were achieving. And there were a lot of optimizations that needed to be made to multiple parts of Horovod to be able to get to scaling at about 90% scaling efficiency. So when I say 90%, I mean, you know, relative to what would be the ideal if you got linear performance improvements for each incremental node that you added. Right? And most of that came down to a couple of different components. 1 was in Horovod, we have this control plane that decides which tensors are ready to be aggregated. And so there's some metadata exchange that happens between the workers.
So there was a lot of optimization done around this component so that, you know, you very efficiently figure out what's the right tensor to all reduce at any given point in time by transmitting very small amounts of data around. Right? And in so doing, you save yourself the trouble of having to inefficiently send large amounts of data, right? So you kind of optimize the when component of sending the data. The other aspect of this, which is pretty related in terms of optimizing, you know, what's the right time to aggregate, was on what we call grouped all reduce, where, Josh in particular implemented this, we essentially decide, like, what are the optimal groups of tensors that should be all reduced together, because different tensors are going to be ready at different points in time as they're coming out of the GPU from doing the gradient computation and then going into the all reduce portion.
So optimizing that grouping as well turned out to be a very major step towards getting to that 90% scaling efficiency at tens of thousands of GPUs. Now 1 caveat I would say is that most people don't have quite the level of performance that a supercomputer does in terms of, like, interconnectivity. Right? So oftentimes, you might get nodes placed, you know, not on the same rack even. Right? Like in potentially very different parts of the data center or the availability zone. Right? When you're running in sort of these cloud computing environments. And so you can't always get the level of bandwidth that you need in order to achieve these very high scales.
A couple of the most common techniques that we use to kind of address this in Horovod, 1 is to do what we call gradient compression. So we reduce the overall size of the gradient by instead of having it be 32 bit precision floating point going down to 16 or even lower in some cases, and then doing the aggregation on that. Or what I personally have found to be the most effective is what we call local gradient aggregation, where instead of every single time you do an update step on the model, you aggregate the gradients. Instead, you have a local cache that aggregates the gradients locally for some number of steps. And then only after, say, k number of steps have elapsed, then you do the all reduce on the total amount of data. And the result is that, you know, you can reduce the number of times that you have to do this all reduce step by however many times you do the local gradient aggregation. So these 2 things in particular tend to be very effective in kind of the day to day sense for not hitting those sorts of, like, inefficiencies that you mentioned.
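Both mitigations map onto options of Horovod's `DistributedOptimizer`. A hedged sketch with a toy model, assuming the rest of the training loop already exists:

```python
import torch
import horovod.torch as hvd

hvd.init()
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters(),
    compression=hvd.Compression.fp16,   # send 16-bit gradients over the network
    backward_passes_per_step=4,         # aggregate locally for 4 steps per all reduce
)
```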
[00:36:09] Unknown:
Inevitably, when you're dealing with these sorts of very parallelized and scale-out operations where you're dealing with multiple nodes, and particularly when it's very stateful and trying to optimize for time and latency characteristics, I'm wondering how you handle the situation of a node failure where you're trying to scale out across, you know, even 10 or 20 instances with GPUs and possibly even multiple GPUs per node, and 1 of those instances has a hardware failure and drops out of the pool, how you're able to recover that state and be able to continue the training operation without having to start everything over from scratch and then try again.
[00:36:49] Unknown:
Historically, the kind of, like, v 1 implementation of this was just a very standard model checkpointing process. So, you know, when you're doing training, you typically are operating on these epochs, you know, where every epoch is defined as a full pass through the data. Right? And so what in practice you would do is once the epoch has completed, you save out the current state of the model to disk. And so if a failure occurs, worst case scenario, you can recover from that state and then continue from there. However, this ended up being impractical for certain situations.
1 of the main kind of impracticalities here is that as you scale up more and more and more, the probability of having a node failure increases at any given point in time. Right? Because you only need to have 1 to bring the whole system down. And additionally, as the dataset size increases more and more, oftentimes what you'll find is that the time it takes to get through a single epoch is just so great that the cost of having to recover from that checkpoint is, you know, potentially like could cost you minutes, tens of minutes, maybe even an hour, depending on what your system is, how much you're able to scale out. Right? And so what we introduced, this was 1 of the last things I worked on at Uber before kind of jumping off into the startup realm was a project called Elastic Horovod.
And the idea behind Elastic Horovod is to gracefully allow workers to come and go from the system without having to incur this stop the world penalty that you described. And at a high level, the way that it works is that you essentially have an additional node that acts as a coordinator, kind of like in Spark, you have the driver node. And the purpose of this coordinator node is to inform all the other nodes in the system of, you know, what nodes are available and to update the kind of environment if a node fails and you have to recover. Right? So of course, it requires some amount of coordination from the individual workers as well. So hypothetically, you imagine that a training process is going on. And then at some point during the training process, 1 of the workers just dies. Right? So what can you do? Well, 1 thing that we can do is if we detect it before we get to a collective operation, like an all reduce, then the coordinator can send a message to all the workers telling them, hey, a worker died, you should stop what you're doing and kind of, you know, let's reassess and like reform the ring essentially, right?
Another thing that can happen is that you're not so lucky and the failure occurs in the middle of an all reduce, in which case all the workers will raise an exception and they'll need to detect that that exception was, you know, caused by a network failure of some kind. And then they'll report back to the coordinator themselves and say, hey, I just had a failure. What's going on? Should I stop or should I, you know, like reform the ring as a different rank than I was before potentially. So after the coordinator tells them all, okay, now, you know, this is your new place in the process, continue from where you left off. There's this whole process of kind of restoring the state back to a good point. Right? So we have a kind of outer loop that runs that will do this synchronization where we'll say, okay, like from the last good state, here was where we're at in terms of which batch number we were reading, You know, what the current parameters of the model were. Let's make sure we're all in sync collectively. And then once that's defined, we just proceed from where we left off. Right?
And 1 of the really nice things about this is that this kind of framework not only works very nicely for fault tolerance, but also has an added property of being very useful if you're running, say, on an on premise cluster where there's a finite number of GPU resources And you'd like to be able to scale up and down dynamically based on demand. And so we've also seen a lot of successes from projects like ElasticDL from Ant Financial that applied Elastic Horovod to doing this type of workload as well.
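The Elastic Horovod pattern roughly follows the sketch below: the training loop is wrapped in a function that Horovod can restart when workers join or leave, and the model, optimizer, and progress counters live in a synchronized state object that is committed at known-good points. This is a simplified outline under those assumptions, not the full recovery logic:

```python
import torch
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

@hvd.elastic.run
def train(state):
    # on (re)start, state.model and state.optimizer have been restored and
    # re-broadcast from a healthy rank, and training resumes at state.epoch
    for epoch in range(state.epoch, 10):
        ...  # one epoch of training on this worker's shard of the data
        state.epoch = epoch + 1
        state.commit()  # mark a known-good point to roll back to on failure

state = hvd.elastic.TorchState(model=model, optimizer=optimizer, epoch=0)
train(state)
```

Elastic jobs are typically launched through `horovodrun` with minimum and maximum worker counts and a host discovery mechanism, so the pool of workers can actually grow and shrink while the loop above keeps running.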
[00:41:02] Unknown:
Of course, anytime you add a coordination point to a system, that then becomes a point of failure. And so I'm sure that there are potentials for being able to, you know, add additional layers and layers and layers to, you know, guard against that, you know, 0.001% of failure cases. And I'm wondering what you see as kind of the balance point of optimizing for being able to recover from failure, you know, given certain classes of failure conditions and, you know, just the amount of complexity and operational overhead to be able to manage these systems.
[00:41:30] Unknown:
So it was certainly the case that early on, there was a lot of discussion around, do we want to make sure that every single component in the system is fully fault tolerant? And, you know, that started getting into discussions of, like, okay, do we implement, like, Raft or something like that, which is a consensus protocol for making sure that there is, you know, always an available coordinator node? And I do think there certainly is still room to implement something like that for certain extreme cases. But I would say that, you know, from the practical standpoint of like, what was the immediate problem that we were looking to solve? There were a couple of reasons why we were okay going with having a coordinator as a single point of failure in the initial implementation.
So 1 is that in cloud environments where you're doing this sort of training, you can be a little bit clever about what types of nodes you use for having a coordinator versus having the workers. So for example, coordinator node can be a dedicated instance that has a guaranteed SLA of like, you know, we will do everything we can to make sure this node doesn't go down. Right? Whereas the worker nodes, you can potentially run on spot or some other pool of instances where the costs are going to be much, much lower. But, you know, there's no guarantee that you're going to be able to keep that node around. Right? It may be removed at any given point in time.
The other aspect of this is if we come back to the, you know, what's the probability of failure problem from before. So I mentioned that elastic becomes particularly important as you scale up the number of nodes because the probability of individual node failure increases. But if you only have 1 coordinator that can fail, then the probability of that failing is essentially fixed, right? So you scale up and down the number of workers and you're not increasing the likelihood that that coordinator is going to fail necessarily. And so you don't have to worry about what's the kind of, you know, big O complexity of failure, right, when you're dealing with that kind of regime.
So that's why in practice we haven't butted heads against this problem when people have tried it out in the wild. But I'm really curious to see if someone does run across this problem, in which case, I will look forward to working on the RFC with them for adding consensus protocols to Horovod.
[00:43:42] Unknown:
Yes. The the complexity never goes away. You just have to pick where to put it.
[00:43:48] Unknown:
Exactly.
[00:43:49] Unknown:
And so this combination of Ludwig to be able to simplify the onboarding process of being able to go from, I have some data. I want to generate a prediction or build a model and have this disclosure of complexity to be able to get as deep in the weeds as you want to tune and tweak things as far as you go. And Horovod to be able to go from, I have things running on my laptop. I just wanna throw this on a cluster and not have to worry about all this overhead, but it does have this very sophisticated capability of being able to manage these failure cases and scalability considerations.
I'm wondering what you see as the potential impact of that combination on the overall availability and adoption of sophisticated machine learning capabilities in organizations at various levels of scale and sophistication.
[00:44:36] Unknown:
Absolutely. So this has been kind of broadly the theme of, I think, my entire career to date is, like, how do we make machine learning and state of the art deep learning more accessible to people? And, you know, I think like to kind of take a high level perspective for a second, the problem that I see in general is that we have lots and lots of occurrences of organizations and individual data scientists reinventing the wheel, right, in terms of having to figure out how to operationalize a new research paper or something like that into the specific domain problem that they're solving.
And to me that seems like unnecessary because, you know, the architectures that people are using are becoming increasingly standardized, and the data infrastructure that's being used to manage all of the data that they're training on is becoming increasingly standardized as well. And so it feels natural that there should be just like a standard system that can hook into your data infrastructure and then apply state of the art deep learning based on whatever the task is that you wanna solve. And so you can imagine this is very much what our startup Predibase is trying to do. And so this is how we imagine fitting systems like Ludwig and Horovod into the enterprise and to like other organizations is through this idea of deeply embedding it with your data infrastructure and applying state of the art deep learning so that you don't have to become this expert or be the person to, like, reinvent all these things over and over again. Like our vision would be that if you do have a new model architecture that you think is very cool and very general and could be applied to a lot of different problems, instead of writing it in your own private code base, you know, you contribute it to Ludwig, and then it becomes available to the broader community and everyone can benefit.
[00:46:27] Unknown:
Going back briefly to Ludwig and some of its implementation details and keying off of what you were saying about the availability of different packages and prebuilt capabilities in the PyTorch community for being able to solve for different data types or different problem domains. And I know 1 of the large areas of excitement right now is in graph machine learning and using things like PyTorch Geometric. And I'm wondering how you have approached the kind of modularity boundaries in the Ludwig project to be able to guard against some of the dependency resolution conflicts and sort of the dependency hell situation for people who are going from, you know, I'm comfortable with using Ludwig to build this simple model, but now I want to go and build this very complex and customized model definition using some of these more advanced and maybe out of the box capabilities from this collection of libraries and being able to just manage that sort of gradation of complexity?
[00:47:29] Unknown:
Yeah. It's a good question. I mean, certainly, there are a lot of components to Ludwig, and certainly, you know, there are a lot of dependencies in the Python ecosystem that don't always play well together in practice. What I would say on that front is that we've done a good job, I think, in Ludwig of isolating dependencies into different verticals. So if you are interested in using certain text decoders, you know, we have optional dependencies that you can install. Whereas if you are only interested in other types of data or other systems, you know, you don't have to install those dependencies. So it kind of reduces the scope of conflict a little bit there. But to the broader point, I think about how you kind of make this integrate together nicely. So 1 thing we've been trying to do a lot is make Ludwig compilable into TorchScript as well, as much as possible, for once you want to serve the model. So you can usually solve these dependency problems at training time. But then once you want to serve it, then the dependencies become, like, I think, a much bigger problem in practice because, you know, something that serves could be long lived. How do you update the packages?
Like, how much memory overhead is gonna be incurred by containerizing this particular deployment, this virtual environment. Right? And so that's where we've looked to things like TorchScript in the hope that we can compile this down to a standard set of operations, both in terms of pre processing as well as training for the model. And then have that be servable in a runtime that doesn't depend on all of these, you know, different packages. So you can benefit from not having to, you know, keep all of these things in sync indefinitely. Right? That even though Ludwig is this very broad and expansive framework that has all these different features, the thing you ultimately get out of it is limited to whatever is the core functionality that you need for your model.
I think the 1 other thing I would add on to that that I think is relevant is that we also have been investing a lot in Ludwig's ability to do transfer learning, that you could take a trained Ludwig model, potentially compiled into this TorchScript format so you don't have the dependency hell that you were describing as a bottleneck, and then use that as an encoder or as a component in another Ludwig model for doing a different task. Right? And so this is another area that we think that we could potentially address that and provide more modularity to users.
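As a generic illustration of why TorchScript helps here (this uses plain PyTorch rather than Ludwig's own export path, and the module is a toy example), a trained module can be compiled and serialized so that the serving runtime only needs the saved artifact, not the original Python class or its dependency stack:

```python
import torch

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.linear(x), dim=-1)

scripted = torch.jit.script(TinyModel())  # compile the module to TorchScript
scripted.save("model.pt")                 # ship this artifact to the serving runtime

reloaded = torch.jit.load("model.pt")     # no model class or extra packages needed
print(reloaded(torch.randn(1, 8)))
```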
[00:50:00] Unknown:
Beyond just the challenges of being able to manage the technical complexity of building and training machine learning models and being able to scale them out, what do you see as some of the other significant barriers to more widespread adoption of machine learning functionality and deep learning in particular?
[00:50:18] Unknown:
Yeah. I think absolutely the biggest barrier at the moment, which is thankfully being rapidly addressed, is on the data infrastructure side. So, you know, I think we've all heard a lot about kind of data lakes versus data warehouses. And if you follow Databricks at all, you've heard a lot about lakehouses and kind of where all of this is converging. Right? And so my kind of view of the state of the world is that today, the big bottlenecks to broader adoption of advanced machine learning, like state of the art deep learning technology that we're providing, come from the fact that you may have your tabular data structured in a nice warehouse like Snowflake, but then your text data, your image data, your unstructured data lives in some sort of other lake or, you know, maybe even, like, a swamp, right? That is only accessible to the, you know, high powered data scientists who know how to wrangle that data. Right?
And so what I'm really looking forward to over the next couple of years is seeing how the data infrastructure community can tackle this problem of essentially structuring unstructured data, right, of making it so that you can use the same kinds of interfaces to query and access and work with unstructured data like text and images that you currently have available to you for, you know, running SQL queries on top of structured data.
[00:51:44] Unknown:
In your experience of building and working with and using and interacting with the community around the Ludwig and Horovod projects, I'm wondering what are some of the most interesting or innovative or unexpected ways that you've seen either or both of them applied.
[00:51:58] Unknown:
Yeah. So certainly for Horovod, I think the most interesting thing that I've seen is the Oak Ridge project that ran Horovod on 27,000 V100 GPUs, and they ended up winning a Gordon Bell prize from the ACM for that work. So, certainly, that was one of the things that most amazed me that people could pull off with Horovod. In terms of Ludwig, I've seen some really amazing examples of doing really advanced computer vision tasks with Ludwig, and it fascinated me to see because it was not something that we had invested very heavily in doing with Ludwig at Uber. This again was something that primarily came out of the open source side. But seeing people take those features and run with them to do these amazing kinds of computer vision tasks proved to me, I think, the value proposition in a big way: that this isn't just something that appeals to people who do things the Uber way. This is a very general tool that can be applied to all sorts of problems.
[00:53:01] Unknown:
And in your experience of working with and contributing to the Ludwig and Horovod projects, and as you have begun building a business around the capabilities to make them more accessible, I'm wondering what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:53:19] Unknown:
Yeah. Certainly, I think that one of the big challenges that I've seen in terms of explaining the value proposition of something like Ludwig is that, in a lot of ways, it's a bit of a paradigm shift for people: no longer thinking about the model, but thinking about the data and, you know, thinking about what you can do with multimodal data. Right? One of the barriers, to come back to the data infrastructure side, is that oftentimes people think they're boxed into, I can only use tabular data because that's the only thing that my data warehouse gives me access to. Right?
And so what we want to try to do is make the case of saying, you know, you can train a model on that tabular data, but you also have all of this rich data, like all this text and all these images and all this audio, and if you were to add those as features into the model, you'd be able to get even better performance. And so figuring out how to wrangle that from the data infrastructure side is certainly one of the big barriers. The other, I think, is in terms of making this appeal not just to the data scientists, but to people up the stack like data analysts, who frequently work in the regime of, you know, the data warehouse, for example. Oftentimes, they think about these problems very differently from how a data scientist would. And so all these different personas are looking at the problem differently and looking at the features differently.
And so while some people may care a lot about the operational side of, like, how do I productionize these models? How do I serve them? How do I retrain them? Other people primarily care about how to measure the performance of these models. Like, what are the different metrics? How do I visualize them? And other people care a lot about the explainability side. Right? Of, like, why is the model making this prediction? Or how do I ensure that there's no ingrained bias in the model that's going to bite me down the road? How do I do causal analytics to understand the why behind why a certain event in my data happened in the first place? Right? And so reconciling all of these different requirements, because Ludwig is a very broad framework that tries to cast a wide net, is another very big challenging aspect to this, and we definitely appreciate engaging with the community to help solve some of those problems.
[00:55:41] Unknown:
For people who are interested in being able to more easily adopt or scale their machine learning capacity, what are the cases where Ludwig and/or Horovod are the wrong choice? There certainly are times where, you know, it's more appropriate to use one thing or another. So
[00:55:57] Unknown:
for Horovod specifically, one thing that we don't really support well yet is model parallelism, which is something that frameworks like DeepSpeed or FairScale, from Microsoft and Facebook respectively, tend to do really well. So while Horovod has a very good data parallel framework, if you're one of the people training, you know, GPT-4 or whatever, then there are different frameworks that can help you solve those sorts of highly model parallel tasks. But I will say that Horovod can be combined with a lot of these model parallel frameworks to give you the best of both worlds, and we've also done some work on hybrid parallelism where you can combine the two together in interesting ways. But that would be one area. And then for Ludwig, I would say that because Ludwig is a very high level framework, there are certainly people that want very low level control, right?
And in those cases, you know, if you wanted to write a custom training loop or you're trying to train a GAN or something like that today, Ludwig isn't properly set up to be able to do those sorts of things. We're very prescriptive about how the training loop is structured. Certainly, we want to be more flexible in those sorts of areas in the future, but that would be one area where, for people who have very low level requirements, a framework like PyTorch might be more appropriate.
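For a sense of what Horovod's data parallel approach looks like in practice, here is a minimal sketch of a Horovod-on-PyTorch training loop. The model and data are placeholders, and this is generic Horovod API usage rather than Ludwig's internal training loop.

```python
import torch
import horovod.torch as hvd

hvd.init()  # one process per worker; typically one worker per GPU
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across all workers via allreduce.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)

# Make sure every worker starts from the same initial weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

loss_fn = torch.nn.MSELoss()
for step in range(100):
    # In a real job each worker would read its own shard of the dataset here.
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
```

A job like this would typically be launched across processes with something like `horovodrun -np 4 python train.py`; the same script runs unmodified on one worker or many.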
[00:57:18] Unknown:
As you continue to work on and with both of these projects and the community and the business that you're building around them, what are some of the things you have planned for the near to medium term future or any particularly interesting projects that you're excited to dig into?
[00:57:32] Unknown:
Yeah. There are too many, so I wish we could hire more people faster. But for Horovod in particular, a few really stand out to me. I would say the biggest limitation in distributed training today is around getting consistently good model performance, in terms of model quality, at any scale. And so as part of that, we've been doing a lot of work on figuring out ways that we can better achieve good convergence of model training through a combination of hyperparameter search and being able to incrementally scale up the number of workers in the training process: as some of the hyperparameter trials finish or get pruned, you can reallocate those resources to better performing trials. So that's one area that we are very keen on continuing to invest in. Another is on the elastic components. So fleshing out elastic training to better support iterable data loaders, which currently is one area that's very complex with elastic training, as well as dynamic schedules for how many workers are optimal at any given point in time, in order to not just get the best throughput, but to generate the best model.
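As a rough sketch of the elastic training being described, Horovod's elastic API wraps the training function so that workers can join or leave mid-run and training resumes from the last committed state. The model and data below are placeholders, and the launch configuration (host discovery, minimum and maximum worker counts) is omitted.

```python
import torch
import torch.nn.functional as F
import horovod.torch as hvd

hvd.init()
model = torch.nn.Linear(10, 1)
optimizer = hvd.DistributedOptimizer(
    torch.optim.SGD(model.parameters(), lr=0.01),
    named_parameters=model.named_parameters(),
)


@hvd.elastic.run
def train(state):
    # If the worker set changes, Horovod restores everything from the last commit.
    for state.epoch in range(state.epoch, 10):
        for _ in range(100):
            x, y = torch.randn(32, 10), torch.randn(32, 1)
            optimizer.zero_grad()
            F.mse_loss(model(x), y).backward()
            optimizer.step()
        state.commit()  # checkpoint in memory so a rescale can roll back safely


# Track whatever needs to survive a rescale: model, optimizer, and the epoch counter.
state = hvd.elastic.TorchState(model, optimizer, epoch=0)
train(state)
```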
So those are both things that are broadly on the Horovod side of the equation. And then for Ludwig, we're doing quite a bit on AutoML. So, you know, you can consider Ludwig an AutoML framework to a large extent already, but we're doing a lot more to really provide best in class, hands free kind of performance through that framework. And so we have an experimental API that's already up there in the repo today. But we want this to be a very different AutoML experience from what most people see today. So instead of it again being a black box, we want it to be a system where you just give it your data, say I want to predict this, and it generates a Ludwig config.
You can then maybe constrain it in different ways and say, oh, I actually only care about this, give it back to the system, and have it continue to refine its query plan, if you will, for training the model based on your specifications. So continuing to invest in that AutoML component and making Ludwig an even more hands free kind of system is a big area for us. Adding more tasks that Ludwig can support is another big area. So being able to do more in terms of natural language processing, like part of speech tagging or things like that. These are all things that we do, to some extent, in Ludwig today, but we'd like to be able to improve the scope of what we can do there. And finally, I would say self supervised and semi supervised learning is another big one that we're starting to look into very seriously. So if you have a mix of labeled and unlabeled data, being able to use that unlabeled data to further improve the results of the training process through some sort of pre-training or semi supervised learning process.
That's particularly exciting to me right now, and we've already started to make a little bit of progress in that direction with some preliminary work.
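To ground the declarative, give-it-your-data-and-say-what-to-predict workflow, here is a small sketch using Ludwig's Python API. The column names and CSV path are hypothetical, and the exact config schema can differ between Ludwig versions, so treat this as an illustration rather than a copy-paste recipe.

```python
import pandas as pd
from ludwig.api import LudwigModel

# Declarative config: describe the columns you have and the one you want to predict,
# rather than the layers of a network.
config = {
    "input_features": [
        {"name": "review_text", "type": "text"},            # hypothetical column
        {"name": "product_category", "type": "category"},   # hypothetical column
    ],
    "output_features": [
        {"name": "recommended", "type": "binary"},           # hypothetical target
    ],
}

df = pd.read_csv("reviews.csv")  # hypothetical dataset

model = LudwigModel(config)
results = model.train(dataset=df)        # encoders, combiner, and decoder are chosen by Ludwig
predictions = model.predict(dataset=df)  # run the trained model back over the data
```

Everything not spelled out in the config, such as which encoder to use for the text column or how long to train, falls back to defaults, which is what leaves room for an AutoML layer to fill in and refine those choices.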
[01:00:35] Unknown:
Are there any other aspects of the work that you're doing on Ludwig and Horovod and the overall space of making machine learning and deep learning more tractable that we didn't discuss yet that you'd like to cover before we close out the show? The one other thing that I would like to point out is our very close collaboration with the Anyscale
[01:00:52] Unknown:
folks and the Ray project. So we've worked very closely with them for some time, dating back to my time at Uber. And we've made a decision in Ludwig to go kind of all in on Ray so that we can very easily provide an abstraction for users that does distributed data processing with Dask, distributed training with Horovod on Ray, and parallel hyperparameter search with Ray Tune, and that requires them to only provision a single piece of infrastructure, which is the Ray cluster. And then the experience you get is that the same code that runs on your local laptop, whatever Ludwig code you want to run, will automatically scale up to the Ray cluster without any code changes or config changes whatsoever. And that work was preliminarily released as a beta version back in v0.4.
But for the upcoming v0.5 release, we've done a lot to improve that, particularly around the integration between data processing and training. So in Ray, they introduced the Ray Datasets API in one of their most recent versions. And the really novel aspect here is that we can do the feature processing and feature engineering in Dask, spill it to Ray's in-memory object store, and then, directly from that in-memory representation, shard out the data to different Horovod workers to do distributed training. And this will be a big part of a blog post that we're putting together in the next couple of weeks. But the overall observation here is that we've observed very significant performance improvements from moving in this direction, and we're very excited to continue collaborating on future integrations to Ludwig and Horovod.
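To give a rough sense of how that Ray integration surfaces to users, the backend section of the Ludwig config is what routes preprocessing to Dask and training to Horovod on Ray. The keys shown here follow Ludwig's documented backend options, but treat them as an assumption and check the current docs; the dataset location is also hypothetical.

```python
from ludwig.api import LudwigModel

config = {
    "input_features": [{"name": "review_text", "type": "text"}],
    "output_features": [{"name": "recommended", "type": "binary"}],
    # Assumed backend schema: switching this section is intended to be the only change
    # needed to go from a laptop run to a Ray cluster run.
    "backend": {
        "type": "ray",
        "trainer": {"num_workers": 4, "use_gpu": True},
    },
}

model = LudwigModel(config)
model.train(dataset="s3://example-bucket/reviews.parquet")  # hypothetical dataset location
```

The design goal is that the training code itself does not change; only the backend declaration and the size of the Ray cluster do.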
[01:02:29] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing and contribute to any of these projects, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose a music group called Zeal & Ardor. They're an interesting combination of genres. Broadly speaking, you could put them in the heavy metal camp, but they've got a lot of interesting influences and a different take on the overall space, so definitely worth checking out. Hard to put into words, so give it a listen and see if it fits your tastes. So with that, I'll pass it to you, Travis. Do you have any picks this week? So I can definitely stick with the theme of heavy metal recommendations.
[01:03:09] Unknown:
I'll put a plug in for my favorite band, Opeth, if you like very eclectic progressive music. Progressive in the sense of not knowing what to expect next, so I'd definitely recommend them. And also, as we're getting into fall, I'm a big fan of some of the very ambient black metal out there. So
[01:03:28] Unknown:
Agalloch, from the Pacific Northwest, which is where I live, is also a good recommendation if you're into that sort of thing. I'll definitely have to take a look at that one. Opeth I'm very familiar with, but I'll have to take a look at the other group you mentioned. And so I definitely appreciate all of the time you've taken today to join me and share the work that you're doing on Ludwig and Horovod and the business that you're starting to build around them. So thank you for all of the time and effort you've put into those projects and that community, and I hope you enjoy the rest of your day. Thanks. You too. And I really appreciate you taking the time to put this together, and I really enjoyed the conversation. So yeah, thank
[01:04:05] Unknown:
you.
[01:04:06] Unknown:
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Travis Addair's Background and Experience
Overview of Horovod and Ludwig Projects
Challenges in Deep Learning and AutoML
Use Cases and Applications of Ludwig
Target Users and Personas for Ludwig
Evolution of Ludwig Beyond Uber
Horovod's Role in Distributed Training
Migrating Ludwig from TensorFlow to PyTorch
Scaling Distributed Training with Horovod
Handling Node Failures in Distributed Training
Impact of Ludwig and Horovod on Machine Learning Adoption
Managing Dependencies and Complexity in Ludwig
Barriers to Adoption of Machine Learning
Interesting Applications of Ludwig and Horovod
When Ludwig and Horovod Are Not the Right Choice
Future Plans for Ludwig and Horovod
Collaboration with AnyScale and Ray Project
Closing Remarks and Picks