Preamble
This is a cross-over episode from our new show The Machine Learning Podcast, the show about going from idea to production with machine learning.
Summary
Deep learning is a revolutionary category of machine learning that accelerates our ability to build powerful inference models. Along with that power comes a great deal of complexity in determining what neural architectures are best suited to a given task, engineering features, scaling computation, etc. Predibase is building on the successes of the Ludwig framework for declarative deep learning and Horovod for horizontally distributing model training. In this episode CTO and co-founder of Predibase, Travis Addair, explains how they are reducing the burden of model development even further with their managed service for declarative and low-code ML and how they are integrating with the growing ecosystem of solutions for the full ML lifecycle.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great!
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. And now you can launch a managed MySQL, Postgres, or Mongo database cluster in minutes to keep your critical data safe with automated backups and failover. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host is Tobias Macey and today I’m interviewing Travis Addair about Predibase, a low-code platform for building ML models in a declarative format
Interview
- Introduction
- How did you get involved in machine learning?
- Can you describe what Predibase is and the story behind it?
- Who is your target audience and how does that focus influence your user experience and feature development priorities?
- How would you describe the semantic differences between your chosen terminology of "declarative ML" and the "autoML" nomenclature that many projects and products have adopted?
- Another platform that launched recently with a promise of "declarative ML" is Continual. How would you characterize your relative strengths?
- Can you describe how the Predibase platform is implemented?
- How have the design and goals of the product changed as you worked through the initial implementation and started working with early customers?
- The operational aspects of the ML lifecycle are still fairly nascent. How have you thought about the boundaries for your product to avoid getting drawn into scope creep while providing a happy path to delivery?
- Ludwig is a core element of your platform. What are the other capabilities that you are layering around and on top of it to build a differentiated product?
- In addition to the existing interfaces for Ludwig you created a new language in the form of PQL. What was the motivation for that decision?
- How did you approach the semantic and syntactic design of the dialect?
- What is your vision for PQL in the space of "declarative ML" that you are working to define?
- Can you describe the available workflows for an individual or team that is using Predibase for prototyping and validating an ML model?
- Once a model has been deemed satisfactory, what is the path to production?
- How are you approaching governance and sustainability of Ludwig and Horovod while balancing your reliance on them in Predibase?
- What are some of the notable investments/improvements that you have made in Ludwig during your work of building Predibase?
- What are the most interesting, innovative, or unexpected ways that you have seen Predibase used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Predibase?
- When is Predibase the wrong choice?
- What do you have planned for the future of Predibase?
Contact Info
- tgaddair on GitHub
- @travisaddair on Twitter
Parting Question
- From your perspective, what is the biggest barrier to adoption of machine learning today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Predibase
- Horovod
- Ludwig
- Support Vector Machine
- Hadoop
- Tensorflow
- Uber Michelangelo
- AutoML
- Spark MLlib
- Deep Learning
- PyTorch
- Continual
- Overton
- Kubernetes
- Ray
- Nvidia Triton
- Whylogs
- Weights and Biases
- MLFlow
- Comet
- Confusion Matrices
- dbt
- Torchscript
- Self-supervised Learning
The intro and outro music is from Hitman’s Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, and dedicated CPU and GPU instances. And now you can launch a managed MySQL, Postgres, or Mongo database cluster in minutes to keep your critical data safe with automated backups and failover. Go to pythonpodcast.com/linode today to get a $100 credit to try out their new database service. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey, and this month, I'm running a series about Python's use in machine learning.
If you enjoy this episode, you can explore further on my new show, The Machine Learning Podcast, which helps you go from idea to production with machine learning. To find out more, you can go to themachinelearningpodcast.com. Your host is Tobias Macey. And today, I'm interviewing Travis Addair about Predibase, a low code platform for building ML models in a declarative format. So, Travis, can you start by introducing yourself? Thanks for having me on today, Tobias. So I'm Travis. I'm the CTO of Predibase. Predibase is
[00:01:31] Unknown:
a low code platform designed to make machine learning more accessible and more useful to the enterprise. Before that, I was a tech lead manager for Uber's machine learning platform, leading a team focused on deep learning training. I was one of the lead maintainers on the Horovod open source project and also a core contributor to the Ludwig project as well, which is one of the foundational technologies for Predibase.
[00:01:58] Unknown:
And do you remember how you first got started working in the area of machine learning?
[00:02:02] Unknown:
Absolutely. Yeah. So it goes back a bit to 2011 or so. I was working for Lawrence Livermore National Lab on processing about 10 terabytes of seismic data. And our goal was to try to do some analysis of it to detect nuclear weapons tests, interestingly enough. But what we found was that a lot of the data, over 50% or so, was noise, and we had no good way to detect it. So I started pulling out some of my undergrad AI textbooks, started implementing some support vector machines, ran it on top of Hadoop, and got just really excited about the whole thing. Published an article in the Computers and Geosciences journal and decided to go to grad school, get my master's in machine learning, and become more active in the industry.
And, yeah, that's how it all started.
[00:02:54] Unknown:
And now that has brought you to where you are today with the Predibase business and the project, and I'm wondering if you can describe a bit more about what it is that you're building there and some of the story behind how you decided that this was a problem space that you wanted to spend your time and focus on. When I was at Uber, I actually started off as a machine learning engineer. So working on kind of the vertical problems of ML.
[00:03:18] Unknown:
And what I found was that there were a lot of things I wanted to do. Like, I wanted to, you know, try deep learning and try training on large datasets and multi GPU and all these sorts of things. But there just wasn't a lot of good tooling available at the time to do that. There was TensorFlow, and if we wanted to run TensorFlow scalably, there were a whole lot of hoops you had to jump through. And then integrating that with something like Spark that we were using for data processing, it was just like, forget about it. So what I realized was that if I wanted to solve ML problems, I really needed to start with the ML tooling and infrastructure. And so I joined the Michelangelo deep learning team.
And while I was there working on this horizontal problem, you know, we worked with a lot of customers that had very similar patterns emerge to mine, where we often found that there was the struggle to get something that was productionizable, like, at scale. Right? And there was also a repeating pattern of there just being frankly more ML problems than there was ML expertise at the company. And so from the kind of horizontal platform standpoint of Uber AI, tasked with figuring out, you know, how can we help get models into production faster, we kind of realized that there was a need for better abstractions, better infrastructure more generally.
And so the tool Ludwig that my co-founder Piero put together ended up being a perfect encapsulation of that vision of being able to say, you know, let's let researchers build state of the art models and put them in as components into this framework. And then that gives the vertical teams in the company this very easy declarative interface to just kind of swap in and out different components for their data without having to rewrite, you know, huge swaths of Python code every time. And we realized that this was, like, a very successful pattern for Uber. And at the same time, you know, Ludwig became open source, we saw that it resonated very strongly with the community, and realized that there was, like, a very real need for this kind of better abstraction layer, if you will, in the industry, and decided to form Predibase with the intent of kind of pushing the state of the art forward in terms of what kinds of tools data science and machine learning teams have available to them to make them more productive and kind of decrease the time to value of machine learning in the enterprise.
[00:05:43] Unknown:
In terms of the audience that you're focused on, particularly given that you're very early in your journey of being a new company and starting to work with some of the first sets of customers, I'm curious how you think about the priorities that you're trying to support and how that influences the areas of focus that you're investing in and the user experience and feature development that you're prioritizing?
[00:06:08] Unknown:
Yeah. So we like to say that we don't expect that Predibase is gonna be where you train your first machine learning model ever. Right? So oftentimes, we're coming into organizations that have a lot of machine learning problems that they wanna solve. Maybe they have some kind of horizontal team that's focused on trying to build out a platform for doing machine learning. And maybe they've tried using some AutoML tools in the past and have struggled with getting them into production. And so what we identify is, you know, we see these teams that have struggled in this way, and the value proposition that we wanna bring to them is to say, you know, if you've used, like, you know, some traditional ML systems in the past, like Spark MLlib or what have you, and you're struggling to kind of uplevel to deep learning and more state of the art techniques in the industry, you know, we provide, like, a platform that gives you those capabilities in a form factor that's, like, much more familiar to you. And if you're struggling to kinda keep up with the amount of problems that the organization has, like, maybe you have teams of engineers that have ML problems that maybe are not the top priority for the whole company, but a very important priority for that team.
Predibase provides a platform that allows those teams to be unblocked and allows everyone in the organization to collaborate together towards building these solutions. And so this focus on, you know, collaboration and kind of mixed modality, like, you know, a very broad set of tasks that people might wanna solve, those are very core focuses for us when we look at companies that we wanna partner with at this stage.
[00:07:42] Unknown:
One of the interesting things that you mentioned is that you're working with companies who have a lot of machine learning problems to solve. And I'm wondering if you can talk to what that really means. Like, how you can identify that a problem that you have is a machine learning problem, or whether machine learning is the right approach to being able to provide value and utility for a given objective that you're trying to achieve?
[00:08:06] Unknown:
I would say that it comes in a few different flavors. Like, on the one hand, you have kind of traditional data warehouse type systems that have tables, that have historical data or transactional data. And so very often, the story there is people wanna do some kind of predictive analytics. Right? So we know who churned last month. You know, we wanna predict who's gonna churn next month. So that kind of forward looking predictive capability is, like, one type of problem that we see a lot with companies that fits very nicely into machine learning. So you have a lot of data in your database. You wanna be able to, you know, predict or, like, make forward looking statements about that data.
That's one area where we slot in. But beyond that kind of structured problem, there's also this question of unstructured data as well. Right? And so you have a lot of companies that have text data or image data or audio data sitting around that they've collected, and they don't really know what to do with it. And it's maybe not so much a question of, you know, I have data that says what customers submitted in support tickets in the past, I wanna predict what support tickets are gonna say in the future. It's not anything like that, but it's more about just understanding semantically what's going on in the data and then how that can be used to better inform the predictive forward looking models that we want to build on that more transactional data. So this idea of unlocking the power of unstructured data is another really core one. And so one of the things I think is very unique about Predibase and Ludwig is the ability to kind of bridge this gap between structured and unstructured data. So the platform is very flexible. It's data oriented in such a way that if you have some transactional tabular data and you also have some unstructured image or audio or text data, those can be combined together into a unified machine learning model in a very simple and straightforward way. And so we can unblock organizations that have all this disparate data and they wanna derive value from it, but they haven't figured out how in an effective way,
[00:10:04] Unknown:
that's where we think there's a lot of power for machine learning, and particularly Predibase, to kind of slot in. Yeah. One of the ways that I've seen those different applications categorized is the difference between predictive analytics, which is the first category that you mentioned, versus prescriptive analytics of this is what you should do, and then descriptive analytics to say, I just wanna understand what this is trying to tell me. Right. Right. Absolutely. I think that descriptive component
[00:10:31] Unknown:
is one that not a lot of people have tapped into.
[00:10:34] Unknown:
In terms of the way that you have formulated the product that you're building at Predibase, you're positioning it as declarative ML. And you've mentioned earlier that they may have had experience trying to use the category of tools called AutoML. And I'm wondering if you can just talk to the differences in the nomenclature as far as what that really means and how the sort of expectations are different between an AutoML category of tool and a declarative ML category of tool.
[00:11:07] Unknown:
Absolutely. Yeah. I think this is, like, a very key differentiator between how we're thinking about the problem and how a lot of other companies out there are. So the way that we think about it is that at a high level, there are very similar capabilities in terms of being able to put ML in the hands of non experts at kind of the starting point. But we believe that declarative ML provides a more principled and flexible path forward. So whereas a lot of AutoML solutions, I think in a degenerate state, AutoML often becomes this kind of kitchen sink style approach where the system will throw everything it can think of at the problem. And if something works, then great. You know, you kinda take that baton and run with it. If something doesn't work, then there's not really a lot of options that you have in terms of what's going to happen next. Like, how someone who is maybe a domain expert or a data expert can come in and kind of help out with, you know, unblocking things.
Where we think declarative ML provides a difference here is that because it gives you this very complete specification, you can start at something very high level and say, you know, I just wanna predict this target given these input variables. And you can get a baseline from there, but it's not the end of the story. And so because it's very explicit about saying, here's everything that the system did. And you can modify and customize any aspect of this down to, you know, individual, you know, layers of a neural network. Right? It allows people to then iterate on these systems, on these implementations over time and kind of build towards a working solution in a more principled way. So for example, they can say, oh, well, I initially tried building this model with this set of parameters.
And then for v2, I swapped out this model architecture, and I changed the learning rate from this to this. And it gives you that audit trail of being able to say, here are all the things I tried. Here's what changed, and here is the effect that that had on model performance. And so we think that this is also very powerful for enabling collaboration. So if, you know, someone who is an engineer maybe wants to train their first model in Predibase, they can do that without having to know a lot of details about the how of what's going on under the hood. But if they get a solution and then they want to maybe have a more expert data scientist take a look and say, you know, what do you think we should try in order to get better performance? They can take a look at the config and say, oh, well, you know, I see that you're using this parameter here. Let's maybe try swapping that out, see what happens. And so, you know, it gives you that ability to make these incremental changes in a very simple way that, if you were kind of going down to just pure low level tools like PyTorch or TensorFlow, would be much more difficult, because you'd be having to ship over entire, you know, Jupyter Notebooks and Python libraries and, you know, what's the execution environment for all this. It's much more difficult for someone to just quickly take a look and kind of provide feedback and kind of next steps.
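To make that kind of declarative specification concrete, here is a minimal sketch using the open source Ludwig Python API that Predibase builds on; the dataset and column names are hypothetical, and exact config options vary by Ludwig version.

```python
from ludwig.api import LudwigModel

# Minimal declarative config: predict a target given a few input features.
# Iteration happens by editing this config (e.g. swapping an encoder or
# changing the learning rate) rather than rewriting training code.
config = {
    "input_features": [
        {"name": "customer_age", "type": "number"},
        {"name": "plan_type", "type": "category"},
        {"name": "support_ticket_text", "type": "text"},
    ],
    "output_features": [
        {"name": "churned", "type": "binary"},
    ],
    "trainer": {"learning_rate": 0.001},
}

model = LudwigModel(config)
train_stats, _, output_dir = model.train(dataset="customers.csv")
```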
And we also believe this ties in very nicely to our version of AutoML, which we would call iterative ML or iterML, where we see it being much more of a conversation that you're having with the system, where you try something out, the system can propose some new things to modify in the specification for the model, you can choose to either accept or reject any of those things, train for a little bit more, and then use the results of that previous run to then inform what you're gonna try next. So it becomes a very in-the-loop, back and forth process that progresses in a way that we think is much more like how traditional software development is done. Right? Where you have a git repo, make some code commits, and over time, you can see the code evolve and change to kind of better conform to the end state, as opposed to just trying to get it all out there at once and, you know, not have any way of knowing, you know, what was the history, what was the effect of every change that we made. In this category of declarative ML,
[00:15:11] Unknown:
another company that I've seen using that terminology is Continual, which is based on being able to build machine learning pipelines on top of your data warehouse so that you can just treat your machine learning workflow as SQL, effectively. I'm wondering if you can just characterize the relative strengths and use cases of what you're building at Predibase versus what they're building at Continual.
[00:15:35] Unknown:
Yeah. Absolutely. So I guess the first thing I'd say is that, you know, in general, we're very glad to see other companies kind of validating the idea behind declarative ML, you know, from following the work that Tristan and the folks at Continual have been doing. It's always been very nice to see that they've referenced our work on Ludwig and Overton. So one of our co-founders, Chris Ré, had a company called Lattice that had a product called Overton that was acquired by Apple, which was another early declarative ML system. And so I think in general, there's, like, a really good shared vision of kind of moving the conversation forward about better abstractions in ML. So I think there's definitely an element of, you know, a rising tide, you know, raising all ships to it. Where I think there are differences would be, they definitely are very leaned in to the kind of modern data stack operation side of things. I think that their value proposition resonates very nicely with people who are active dbt users, for example. That's a big part of kind of how they're approaching the problem, which I think is a totally valid way to think about it. For us, we definitely think that we can do a lot, not just on the operation side, but on the model development side as well. So with Ludwig, we provide a framework that is also pushing forward the state of the art of what ML models can do. Right? And so that's a big part of the story for us, is trying to figure out how do we help users get good models in the first place and do it in a way that has very low barriers to entry, but very high performance and a high ceiling. Right? We also believe the operations component is a big part of that, but it's not the only part. I think there's still a lot of work that needs to be done on just getting to a good model that you wanna put in production.
And so that's where I think Ludwig is a bit different from what some of the other tools out there provide, in that we're also tackling that aspect of the problem in a big way. And as far as
[00:17:27] Unknown:
the implementation and architecture of what you've built at Predibase, can you talk to the overall system design and the ways that you've thought about the architecture of how to approach this problem of making declarative ML accessible and easy to operate, so that teams who don't necessarily want to invest in building the entire MLOps stack can be able to pick it up and run with it and start to gain value very quickly.
[00:17:55] Unknown:
Predibase is built as a multi cloud native platform. We're built on top of Kubernetes. And so, you know, we have deployments that run on AWS, Azure, GCP, and also on some on premise Kubernetes clusters. So we believe that that's, like, a very core part to make it flexible, so that wherever your data happens to live, you know, we can push the compute down to be as close to that data as possible to minimize the latency and minimize the egress costs and all those things. So that's a very core part of how it's architected. We also have a separation between what we would call our control plane and data plane of the system. Our data plane is built on top of the Ludwig on Ray open source work that we've done. So we use Ray for doing the distributed aspect of, you know, scaling to large datasets and parallelizing the work. We use Horovod for doing distributed data parallel training.
And then we also have a serving layer as well that's built on top of that. And then we also have a separate control plane that provides a serverless abstraction layer on top of this data plane so that, you know, from the user perspective, they don't need to be as concerned about provisioning Ray clusters and, like, how to right size them for the workload and whether I wanna use this GPU or that GPU. So that's a big aspect of what we provide on the infra side: this kind of intelligent provisioning and life cycle management of the compute resources, and making sure that these long running training and prediction workloads can be processed end to end in an efficient way. And then, of course, there's a whole other serving stack as well that we're building out that's built on top of NVIDIA Triton, and we'll hopefully have a lot to say in terms of our work there, with some blog posts coming out in the future. But that's something that we're also looking to push into the open source to some extent, as well as some of the serving capabilities for Ludwig that we're bringing to the enterprise.
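For reference, the Ludwig on Ray work mentioned here is also configured declaratively in open source Ludwig. The sketch below assumes Ludwig's documented Ray backend section (field names can differ between versions), and the dataset path and columns are hypothetical.

```python
from ludwig.api import LudwigModel

# Same declarative style, but asking Ludwig to run preprocessing and
# training on a Ray cluster (with Horovod handling data parallel training).
config = {
    "input_features": [{"name": "support_ticket_text", "type": "text"}],
    "output_features": [{"name": "churned", "type": "binary"}],
    "backend": {"type": "ray"},  # distribute the workload across the cluster
}

model = LudwigModel(config)
model.train(dataset="s3://example-bucket/customers.parquet")  # hypothetical path
```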
[00:19:55] Unknown:
As you started to go down the path of starting to build out this platform and explore the capabilities that you wanted to offer, I'm wondering how the initial design and ideas and vision around where you wanted to end up have shifted and evolved and some of the directions that you have moved in order to be able to accommodate some of the early feedback that you've gotten as you worked with design partners and just some of the overall evolution of the platform as you started to dig deeper into this space?
[00:20:27] Unknown:
That's a great question. So definitely, we had a certain set of assumptions coming in about what the market was looking for and kind of what level the user wanted to think about the problem at. Right? Whether this was a production problem first and foremost for them, a research problem, or somewhere in between. Right? And so we definitely had a very early focus on thinking a lot about the analyst use case and how, you know, there are people who have data but don't have a background in ML who want to be, you know, up leveled with that capability. And so we thought a lot about, you know, making it kind of operations and production oriented to begin with. What we found in kind of the early working with early customers is that there's still a lot of interest in, you know, the model development aspect. And, you know, you're never going to, at least, you know, with how AI and machine learning work today, get that perfect model every time, like, without any kind of manual intervention or kind of domain expertise.
And so we definitely, from very early on working with customers, realized that having a teaching element of the platform was very important as well: explaining, you know, here are the options you have available to you in this declarative specification. Here's how you should think about using one option versus the other, what the appropriate ranges are, you know, how you should go about doing model development in terms of starting with, you know, a really complex model or a baseline, understanding how the model performed kind of in a post hoc way and saying, you know, these were the features that contributed the most to the model's predictions, or these were fields that maybe, you know, were imbalanced in some way, and so there are other corrections we need to make. So definitely that aspect of iteration and instilling kind of machine learning best practices is something that has been a learned experience. And so, you know, one of the most recent additions we made to the platform just before coming out of stealth was investing really heavily in a Python SDK that's very similar to what the Ludwig Python SDK does, but with some more enterprise features that really make it well integrated into a data science stack, where you're able to experiment with data, experiment with models, iterate, as opposed to just going straight for the production model from day one. Right? So that was definitely something we learned early on in the process of working with customers.
[00:22:49] Unknown:
And it's also interesting to think about which areas of the overall machine learning problem and life cycle you're looking to be able to facilitate, because there are, you know, boundless capabilities where, you know, there's the experiment tracking and model tracking. There's the model monitoring to be able to understand, you know, what concept drift is happening once it's in production. You know, there's the kind of pipelining. So there are definitely a number of different areas that you could try to focus on, a number of different directions you could try to push into. I'm curious, given the fact that you were starting with Ludwig as the kind of core building block, how that helped to shape your overall consideration of the appropriate scope for what you're trying to build, and how you were thinking about what are the maybe boundaries and interfaces that you want to incorporate to be able to let Predibase fit into the overall workflow and life cycle of machine learning in an organization, while being able to be very opinionated and drive the conversation around the areas that you wanted to own? It's a really good question, I think, because
[00:24:03] Unknown:
this, I think, is at the core of the problem of startups in the space trying to define their category. Right? Because I think when you look at the space in general, what you see is that there are a lot of really good tools that are what you might call point solutions that are solving one aspect of the problem, whether it's explainability, model training, model serving, what have you. Right? But really, what organizations need is something that is end to end. Right? That is actually a platform that is fundamentally delivering business value, not just, you know, model training or something like that. So the way that we think about it broadly is that we want to be able to provide a story that goes from data to deployment for the user. So connecting data from your data warehouse or database, providing best in class model training at scale with the serverless infrastructure, and then providing a really clean and simple path to deployment that can be either a REST API for low latency real time prediction or the PQL SQL-like language that provides batch prediction capabilities to the user.
And starting with that core vision of, like, this is the journey for the user. There are, as you said, a lot of other aspects to it as well, like, you know, model explainability, data preparation and data quality and data versioning, model monitoring, model drift detection. And the way we're thinking about these things today is pretty similar to how we think about them actually on the open source side with Ludwig, where we want to try to be as integrated with the community as possible. Right? So on Ludwig, we have integrations with Comet ML, Weights and Biases, whylogs, MLflow, and Aim that we're working on. And so these tools, you know, provide, like, different capabilities at different parts of the process. Right? Like experiment tracking or model monitoring.
But the way we wanna think about it is, if you already have a tool that you like that solves these problems, like, we don't want to have to say Predibase is a rip and replace solution for you. Right? We wanna be well integrated. So if you wanna use Weights and Biases or Comet, you know, you just give us an API key, and we'll log things there and then have a nice way to link back and forth between the two. Or if you're using whylogs slash WhyLabs for doing model monitoring, you know, we're thinking about ways that we can integrate there to do automated model retraining based on triggers that come from WhyLabs. So that's the way we're thinking about the problem today: let's integrate as much as possible in the parts of the platform where we don't feel we're providing strong differentiated value or where we couldn't provide, like, a best in class value proposition, while still telling the user, like, hey. If you are starting from scratch, right, and you don't have an ML platform today, Predibase isn't a point solution. It's something that will get you end to end from the data to a deployed model that can start delivering value. And then you can layer on more tooling on top of that, you know, as we see fit.
[00:27:06] Unknown:
And so in terms of that workflow, in the case where somebody is greenfielding, they say, I want to adopt ML. This is my first foray into that. I'm going to use Predibase to be able to experiment with how can I take this data that I have and turn it into something useful that I can do with it? I was wondering if you can just talk through that kind of end to end workflow of starting with the data and ending with, I have a model running in production, and I'm doing something with it. In the tool, there are different ways that users can do it. We do have a web UI that people can use
[00:27:37] Unknown:
to take all the actions. Everything that you can do in the platform can also be done through our SQL-like language PQL as well as the Python SDK. So we have many different views, depending on the persona that's using the platform, that do the same thing. But regardless of which entry point you choose to use, the steps are largely the same, as you first start with the data. So if you have your data in Snowflake or S3 or BigQuery or whatever, you just give us some credentials, point us to what table or what bucket you're interested in working with, and then we can start with any data that is structured in some kind of table like form. Right? So that can be an actual database table.
That can also be a Parquet file, a CSV, anything like that. And then, you know, maybe a question that comes after that is, what if I wanna use unstructured data like images or audio? So the way we think about that today is that you give us the URLs to those images or those audio files as columns in your tabular data, and then we can pick those up and we'll join all that together into, like, a single flat tabular view for training. Right? Once you've pointed us to the data that you wanna work with, we automatically do all the metadata extraction and schema extraction from the data for you. So we know, you know, what data types the data is and that sort of thing. And then you can start creating models. So you, you know, go into the model builder UI or use the SDK to build a model in a way that's very similar to how you do it in Ludwig.
And all you need to specify to get started is just the target or targets, since we support multitask learning, that you wanna predict. From there, you can customize any aspect of the training, through either layering on full, kind of, like, hyperparameter optimization, AutoML suggested, like, configurations for everything, or starting with just a very simple baseline. Either way, you can go to, like, any level of the extremes, right, or any level of customization in between, and then start training a model. Once you start training a model, it will be sent to one of the, what we call, engines. So that's one of our serverless clusters that does the computation for you, that lives, you know, wherever your data happens to live, in, you know, the same kind of region. Right? The model will get trained. And from there, you can start using PQL or the Python SDK to validate. We also provide a full set of visualizations for the user to explore in terms of, you know, understanding the explainability of, like, feature importance. We also have calibration plots, all sorts of other things like confusion matrices, etcetera, that you can dig into. And from there, you can either iterate on the model, continue to develop it in kind of an incremental way, with a fully kind of versioned and lineaged process.
And then once you're happy with the model, there's kind of a one click deployment that we have, where you can deploy it to a REST endpoint and then start curling it with, you know, JSON objects as you see fit. And then if you'd like to, you know, retire the model or replace it with a new model version, it's a similar one click kind of deployment process as well. And then there's, of course, ways that you can automate this as well, to do retraining as well as do validation to determine when you want to trigger redeployment. Right? So if you say I have a held out test dataset, I only wanna redeploy when the, you know, new model does better than the old model on that dataset.
You know, that's something that you can configure within the platform as well. And at a high level, that's the journey that we see as being the core flow that the user wants to go through. So connect data, train model, deploy. And then there's a lot in between, of course, that kind of fills in the gaps, but that's fundamentally what the platform provides.
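As a rough illustration of what the "curl it with JSON" step might look like from Python, here is a hedged sketch; the endpoint URL, authentication header, field names, and response shape are purely hypothetical and not the actual Predibase API.

```python
import requests

# Hypothetical call to a deployed model's REST endpoint. Everything below
# (URL, token, payload, response format) is illustrative only.
response = requests.post(
    "https://api.example.com/deployments/churn-model/predict",
    headers={"Authorization": "Bearer <api-token>"},
    json={
        "customer_age": 42,
        "plan_type": "pro",
        "support_ticket_text": "My invoice looks wrong",
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())  # e.g. {"churned": {"prediction": false, "probability": 0.12}}
```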
[00:31:16] Unknown:
And you mentioned the PQL dialect a couple of times. And I noticed that when I was going through some of the blog posts and some of the early material that you have about what you're building at Predibase. And I'm wondering if you can speak to some of the motivation behind creating this new dialect and this new, I guess, language you could call it, and some of the ways that you think about the semantic and syntactic design of it. Yeah. So with PQL, we think it's a very natural
[00:31:44] Unknown:
way to extend the declarative idea, because I think it plays very nicely into the way that we think about ML systems today compared to databases a few decades back, right, where, in the early days, you had low level languages like COBOL that people would be writing to interact with databases, and then SQL comes along and provides this very nice declarative way of expressing all sorts of complex data analysis that you might wanna do. And we see PQL as being the natural extension of that idea to the ML domain. Because you already have this declarative specification that provides a very tight semantic link between your data fields, the fields of your dataset, and the fields that are the inputs and outputs of your model, and then everything that happens in between,
PQL provides a very natural way to express, you know, the model prediction request that you might wanna do. So what I think is very powerful about PQL is that you can do something as complex as a batch prediction over, you know, a 10 terabyte dataset using some model, that you then wanna write out to a downstream table. Normally, you would end up writing, like, an ETL job in Spark to do something like this. But that's just a one line PQL query, which would be predict target given select star from data, or whatever. Right? And you can, of course, then do all sorts of more complex things from there in terms of joining tables across different datasets, filtering them, doing sliced analysis. You have ways of doing what we call hypothetical queries that are kind of similar to what you might do for real time prediction, where you want to take, like, an entirely new data point and then express it as a query that then can be predicted on. And so I think, you know, certainly one powerful use case of PQL is this idea of a more efficient way of doing batch prediction that fits in nicely with other tools that do ELT. Like, dbt is a really good example there, where we already have a dbt integration that we've written that some of our customers are using. And so if you want to be able to express your prediction pipeline as SQL, essentially, PQL provides a very natural way to do that. But we also think that PQL is a very powerful enabler of putting ML inference in front of, and, like, letting people interact with and understand the model, stakeholders who might not today ever really interact with an ML model. Right? So anyone now who understands SQL can start making predictions and start to play around with understanding, like, what do I have to change in this input to make the model predict something else? Really, it's a very fun, kind of just interactive process that users can go through.
And these sorts of PQL queries are a very nice sharing point as well. So if the data scientist has a model that they've trained and the analyst wants to play around with it or wants to see the result of some prediction on some slice of data, you just need to share that PQL query with them and then, you know, say, hey. Go run this and, you know, let me know how it goes, instead of having to ship whole notebooks and Python files and whatever with it. So that's kind of where we see the value of PQL: batch prediction, as well as kind of pipelining and doing, you know, ELT type workloads, as well as this kind of shareability and making ML inference more accessible to the broader organization.
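Based on the one-line example described above, a PQL batch prediction would look roughly like the string below; the table and column names are made up, and the client call is a hypothetical sketch rather than the actual Predibase SDK.

```python
# Hypothetical PQL batch prediction, following the
# "predict target given select ..." form described in the conversation.
query = """
PREDICT churned
GIVEN SELECT * FROM customers
"""

# Hypothetical client usage; the real SDK entry points may differ.
# results = client.execute(query)
# results.to_dataframe()
```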
[00:35:03] Unknown:
Noting the PQL acronym, my initial thought when I first read that in the blog post was, oh, obviously, the P stands for Predibase, but it stands for predictive. And so I'm wondering if you can talk to what your overall vision is for this syntax, and if you intend for it to be something that is maybe adopted outside of Predibase as kind of a general standard for this means of interacting with machine learning, and just some of the overall vision there?
[00:35:34] Unknown:
So the interesting thing there is that the name PQL actually predated the name Predibase. So we had the name PQL in mind before we came up with Predibase. But to your point, I definitely do see PQL being something larger than Predibase in a lot of ways. Like, we want to see more folks in the industry adopt it. And so the vision for PQL is that, you know, you do have a lot of the BI tools today that have very tight integration with SQL. And we'd like to be able to see, you know, very nice integration with PQL in a lot of these tools in the future as well. You know, thinking about how we can make it a standard that the larger community embraces.
I think there is a lot of value to that. And so we do work very closely with some companies in the BI space through the Linux Foundation, actually, where Ludwig and Horovod, the projects that we maintain, are hosted. And so we do have a collaboration there with the AI plus BI committee that is working on exactly this problem of integrating, you know, machine learning prediction into BI systems. And that's where I think things can go if, you know, the standard ends up becoming well adopted in the future.
[00:36:44] Unknown:
And another interesting element of the overall ML space is the question of collaboration. You mentioned PQL allows you to say, I've got this model. I wanna pass it off to this analyst to be able to play with it and experiment with its capabilities, maybe provide some feedback on ways that I should tweak it to, you know, make it more powerful for a certain use case. And I'm just curious how you think about that collaboration aspect of Predibase and how you've designed the platform to be able to be kind of idiomatic and recognizable for different roles and stakeholders across the organization who are interacting with the different capabilities of the model and the overall workflow?
[00:37:24] Unknown:
So I do think that collaboration is very core to what we're doing, because we see this as being a tool not just for an individual data scientist or engineer, but a tool for an organization. Right? And so we do have different metaphors that we think about that relate to different stakeholders, and you can see kind of visions of each of them in the platform. So for folks who are more on the analytics side, we do have a query editor built into the UI that lets you just start writing PQL or even ordinary SQL queries. The parser is expressive enough to kinda support both in the editor, and kind of playing around with things as you would if you were using Superset or some other, like, BI slash analytics tool. But we also then, for the kind of data scientist and the engineer personas, have a lot of tools that kind of adhere to more of, like, a GitHub style workflow, where, you know, folks will be able to kind of incrementally update models in a way that is, you know, versioned. And so you can kind of diff between different models and then have this ability to do experiments in separate branches. And then once you're happy with how the experiment is doing, saying, oh, this experiment is now doing better than what is currently in production.
Let's, you know, merge this back into the main line, similar to how you would in Git. And then, you know, that kind of becomes a concept very similar to a pull request, where people can comment and kind of say, hey. I don't agree with this particular parameter choice. Can we maybe revisit this? So there are different ways that we've thought about making it more approachable to people, by having those callouts that hearken back to things that they are already familiar with, but at the same time, giving them something that's net new. Right? Like, I think the problem with just, you know, using Git for ML today is that Git doesn't provide a story around the non source code artifacts. Right? And so you need to use external tools for that, and things are not super tightly integrated. And that's where something like Predibase can slot in to fill that gap for that particular persona. Right? So it's in large part about providing the right metaphors for the right person that's using the tool.
[00:39:29] Unknown:
Digging into the Ludwig aspect of what you're building, as you mentioned, it's an open source project. It predates the business. You have used it as sort of the core building block of what you're providing. I'm curious if you can talk to some of the ways that you're thinking about the governance of the open source project and how you identify which pieces of the engineering that you're doing on and around Ludwig are part of the business and which parts belong with the open source project. And along with that, some of the ways that your work on Predibase has fed back into the Ludwig project.
[00:40:04] Unknown:
From the governance standpoint, we have been making a concerted effort to get more folks involved, and, you know, we hold regular monthly meetings with the community. Talk about the road map, get buy in from different people about what features are important. So right now, one thing that we've been working on on the open source side, because a lot of other companies have been interested in it, is a model hub that provides, you know, some ability to share different trained Ludwig models and configurations. So that's something that's definitely been a community driven effort today. And then I would say that in terms of how we see the relationship between what's Predibase and what's Ludwig, we do have a very substantial part of the engineering team that works almost exclusively on the open source. And so it's very important to us that, you know, we not just take Ludwig as something that we consume downstream, but that we also actively, you know, are investing back into Ludwig and making it better, and that will make Predibase better by default. Right? But a really good example there is the work we've done on scalability recently. You know, we have some customers that we've worked with that are training on larger datasets, you know, terabyte plus.
And so we've had to think a lot about, you know, what are the bottlenecks in Ludwig. You know, Ludwig is a very complex system in a lot of ways. We deal with every type of modality of data at the same time, potentially. Right? We need to have an efficient way to pipeline it all, in terms of both the data processing and the model training and the prediction. And so we've invested quite a lot in building that out, specifically to improve Predibase. But the nice thing is, all of those features then, you know, become ultimately part of Ludwig, because that's where the core of those capabilities live. I'd say two other big features that are coming to Ludwig, driven by requirements on the Predibase side, have been: one would be the improved AutoML capabilities that we've been investing in in the tools. So this would be kind of suggesting configurations and suggesting hyperparameter search ranges based on the data, based on past trials, trainings, and things like that. And then the other is on the serving side. One thing we definitely found on Predibase is that there's a very strong need to make sure that the serving environment is isolated and doesn't have tons of external dependencies that blow up the deployments and, you know, add to your overhead.
Since moving from TensorFlow to PyTorch last year, we've invested quite a lot in building out a TorchScript layer for doing serving, which allows us to strip out all of the Python dependencies on Ludwig at serving time and provide a very low latency end to end servable that does not only model inference, but also the preprocessing, so the data transformation, as well as the post processing. And this is something that, you know, was very core to what we're doing at Predibase, but we've made all of that open source as part of Ludwig as well. So now the community can take advantage of it as well.
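As a rough sketch of what consuming that kind of exported artifact looks like on the serving side, the snippet below loads a TorchScript module with plain PyTorch and calls it; the file name, input dictionary, and exact calling convention are assumptions for illustration, not the documented Ludwig export interface.

```python
import torch

# Load an exported TorchScript servable; no Python-level Ludwig
# dependencies are needed at this point, just PyTorch.
module = torch.jit.load("churn_model.pt")  # hypothetical export path
module.eval()

with torch.no_grad():
    # Hypothetical input: the exported module is described above as bundling
    # preprocessing, inference, and postprocessing, so raw feature values are
    # passed in; the dict shape here is an assumption.
    outputs = module({"support_ticket_text": ["My invoice looks wrong"]})

print(outputs)
```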
[00:43:01] Unknown:
In terms of the early applications of the Predibase platform that you're building and how you've been working with some of your early design partners, I'm wondering what are some of the most interesting or innovative or unexpected ways that you've seen the platform used.
[00:43:16] Unknown:
Yeah. There have definitely been some that surprised us. So we definitely expected that tabular was going to be a very important use case for folks, and so we invested a lot in making sure that we had state of the art architectures and capabilities on tabular data. And that turned out to be true, but we've also found there are quite a lot of really interesting unstructured datasets that people have been working with as well, where they're trying to predict, you know, anomalies in, like, image data, very large image datasets, or doing kind of really interesting mixed modality training with, like, text and tabular.
We've also found that there are a lot of situations where users wanna do machine learning training without a lot of labeled data. And that's, I think, a particularly interesting one, because it's been leading us to invest a lot more heavily in building out self supervised learning capabilities in Ludwig. And so, you know, one thing that we're working on actively right now is building out a really sleek pre training API for Ludwig, so that you can, without needing to specify, you know, a target column or anything like that, do some initial training to learn a good representation of the data that you can then apply downstream to a lot of different tasks. And so that's one that has definitely been informed by what we've been seeing from customers as being a very critical need for them. That's now informing a lot of the product roadmap.
[00:44:39] Unknown:
In your own experience of going from working at Uber and helping to solve the problems that they have for machine learning and producing these useful open source projects that have been available to the community, and then turning that into building a business around those capabilities, and from the lessons that you learned at Uber, I'm wondering if you can just talk to some of the most interesting or unexpected or challenging lessons that you've learned in the process of building Predibase.
[00:45:07] Unknown:
Yeah. So I would say that there have been some really interesting problems that mirror a lot of problems that we encountered at Uber. So I think that when you look back at my time at Uber, the story there was that I was very keen on unification of infrastructure. And so one of the things that I was really heavily pushing towards the end of my time there was on moving away from a kind of Spark plus random bespoke training architecture, built on top of Horovod and some other things, towards using Ray as a unified infrastructure layer. And so that very heavily informed the direction that we took with Predibase in terms of building out our training system, of being this single compute cluster that is capable of doing the preprocessing, the training, batch prediction, kind of the whole thing end to end. That's worked out really well. And then, you know, when we were starting to build Predibase, we had to take this kind of data plane that came from all these years of working at Uber and, like, all the lessons that we learned along the way, and then think about, okay, now how are we gonna make this into a truly serverless enterprise experience. Right? And so we did a lot in terms of the early days of, like, building out the control plane layer. I think there were quite a lot of lessons we learned along the way about how you should think about coupling in these sorts of big, complex distributed systems, where, you know, we had early on some, like, interface boundaries between the control plane and the data plane that were not particularly well defined. You know, there was a lot of tight coupling. So sometimes failures would occur, and certain things that should not have failed would fail because there was too much coupling in there. And what we've done over time is rearchitect the platform to be much better isolated, so that we use a more kind of event driven architecture.
So more message brokers and things like that that kind of make things very cleanly separated. And that's been a very big learning in building an enterprise platform: you know, how important it is to really define the service boundaries well between the different components of the system. And overall, you know, we found that reliability, robustness, stability, these have been, like, concerns that when you start building the company, you don't initially think, oh, yeah, these are gonna be the top things that I'm gonna put on the road map. Right? But now that's definitely, like, top of mind for us at all times: you know, how do we build the platform in a way where we account for as many things going wrong as possible and have a story around making sure that at the end of the day, the user gets a very clean and a very responsive experience, right, that doesn't fail in some weird unexpected way.
[00:47:50] Unknown:
Because of the fact that you are running a large and scalable and multi cloud system with a lot of distributed systems going on, I'm curious how you have approached the kind of testing and validation, so that as you iterate on the product, you're able to very quickly get feedback as to whether a change has caused a regression in terms of your, you know, ability to quickly recover, or being able to identify potential issues with fault tolerance, and just how you're able to think about managing forward progress and iterative development on the platform while ensuring that you maintain those principles of stability and scalability and fault tolerance?
[00:48:31] Unknown:
Yeah. That has been, I'd say, one of the more difficult challenges to solve. I'd say that we're still figuring out the right way to think about some of these things. But we've definitely invested quite a lot in ensuring, like, the benchmarking on the Ludwig side. And so there's an active project from one of our employees working on building out an entire benchmarking pipeline for Ludwig, so that every time a change happens, we can, you know, validate it against different workloads and make sure that model performance is good, GPU utilization is good, memory utilization is good, that all the kind of metrics that we care about for the workload are there at that level, so that we know that, okay, it's not a change in Ludwig that is causing memory to spike or something like that after this change is made. So that's, I think, the first aspect we have to get right: making sure that the open source is very stable and, you know, meets the requirements that we have set. And then from there, we have quite a lot that we've done on the platform side in terms of building out continuous integration and different tiers of deployments for the whole system to make sure that it's all well tested before we do a release.
So we do have a regular release cadence that we've set up with our customers. Every change we make goes into a live staging environment that we test out internally, and goes through a full battery of integration tests that actually run on a Kubernetes cluster, on live compute resources, making sure that all the different models we regularly test are working correctly and not failing in any unexpected ways. We've also invested a lot on the observability side, in terms of making sure that we know, okay, if this workload used to take a minute to run and now it takes 5 minutes, what's the part of the system that's suddenly taking longer? What's the part of the system that's suddenly taking more memory? Being able to see what that trend line looks like and where the inflection point was. That's been a big area of focus for us lately, because it's very important that, as we get more and more people contributing to the codebase and more and more moving parts, we identify as quickly as possible when something changes and can go back and address it. So having every single commit go through the full CI process has been critical to that. We also have pretty good policies in place where we make sure we don't commit anything to the mainline if the tests aren't in a good state, and we always prioritize stability and bug fixes above new feature development. All of those best practices, I think, are key to getting it right. It's still something we're learning as we go.
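To make the benchmarking idea concrete, here is a rough sketch of the kind of regression gate such a pipeline might apply in CI. The metric names, baseline values, and tolerances are assumptions for illustration, not Predibase's actual thresholds.

```python
# Baseline metrics recorded from a previous known-good run (illustrative values).
BASELINE = {"runtime_s": 60.0, "peak_gpu_mem_gb": 8.0, "val_accuracy": 0.91}
# Allowed relative change before a run is flagged as a regression.
TOLERANCE = {"runtime_s": 1.2, "peak_gpu_mem_gb": 1.1, "val_accuracy": 0.99}

def check_regression(current: dict) -> list[str]:
    """Return a description of every metric that regressed beyond its tolerance."""
    failures = []
    for metric, baseline in BASELINE.items():
        ratio = current[metric] / baseline
        # For quality metrics, lower is worse; for cost metrics, higher is worse.
        worse = ratio < TOLERANCE[metric] if metric == "val_accuracy" else ratio > TOLERANCE[metric]
        if worse:
            failures.append(f"{metric}: baseline={baseline}, current={current[metric]}")
    return failures

# Example: a commit that makes the workload 5x slower and slightly less accurate.
if failures := check_regression({"runtime_s": 300.0, "peak_gpu_mem_gb": 7.9, "val_accuracy": 0.90}):
    raise SystemExit("benchmark regression detected:\n" + "\n".join(failures))
```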
[00:51:15] Unknown:
And so for individuals or organizations that are looking to accelerate the rate at which they're able to experiment with and adopt machine learning to address some of the organizational and product problems they're trying to solve, what are the cases where Predibase is the wrong choice?
[00:51:33] Unknown:
I mean, that's, I think, a very valid question, and there are definitely times when it might not be the right choice for your organization. When we think about how the market segments, you can kind of think of it as four quadrants, or really two axes. On one axis, you have organizations that have low data versus organizations that have high data. And on the other axis, you have organizations that have high ML experience versus low ML experience. Right? So definitely, the bread-and-butter customer for us would be a company that's very high in terms of data volume and quantity, but not as high in terms of having a big, sophisticated ML team. They can certainly have an ML team, but I wouldn't necessarily say that, like, Google Research would be a target customer.
And then on the flip side, you have organizations that maybe don't have a lot of data at all. Certainly, there are companies out there thinking about ways they can bring ML to companies that don't have data, for specialized use cases, using pre-trained models and things like that. But that's not what we're currently looking at. We're definitely still thinking about companies that have a lot of data and don't quite know how to get enough value out of it. That's very core to what we do well. Right? I would also say that it's important for a Predibase customer to have some variety of use cases that they wanna solve. It's definitely not a prerequisite, but when you look at the market, there are companies that only do fraud detection, or only do computer vision, or something like that. And I wouldn't necessarily wanna say that Predibase is gonna beat all of them all the time on every task. What I would say is that we provide a very good solution for time to value relative to those other platforms, if you have a good variety of different things you wanna do in the ML space.
So certainly, if you wanna do, say, computer vision and NLP, then from a purely cost-benefit standpoint, I think we have a much stronger value proposition than if you were to try to adopt point solutions for each of those things. So that's the other aspect that's maybe less of a hard requirement, but still an important differentiator, I think.
[00:53:53] Unknown:
As you continue to iterate on the product, and now that you have come out of stealth and are starting to accept new customers onto the platform, what are some of the things you have planned for the near to medium term, and are there any particular areas of focus or new features that you're excited to dig into?
[00:54:10] Unknown:
I'm certainly very excited about having a fully SaaS version of the product that people can try out. Right now, we're in a closed beta, so we're certainly really excited when people come to us and say they wanna try it out, and we'll set up some time to do a pilot with them. But I'm very excited about the possibility of having a website people can just log in to and start using, without any commitment. That's something we're working on right now, thinking about how we can put it in people's hands. From a product standpoint, there's also a lot that we're thinking about now. I mentioned the self-supervised learning work before.
There's also some work that we're doing with the open source community around better support for custom components, kind of user-defined functions, if you will. With Ludwig and Predibase, there's quite a lot of flexibility in terms of the degree to which you can specify every parameter of the model. But if you wanna add new model architectures, it's possible today, but we wanna make that experience even easier, so that there's just a very lightweight interface you implement, and then you can register that component as just another option in your config within Predibase that other people in the organization can use.
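A rough sketch of the registration pattern being described follows, assuming a simple name-based registry. The `register_encoder` decorator and `LocaleEncoder` class here are hypothetical names for illustration, not Ludwig's or Predibase's actual interface.

```python
# Hypothetical registry mapping config names to component classes.
ENCODER_REGISTRY: dict[str, type] = {}

def register_encoder(name: str):
    """Make a custom component available by name inside a declarative config."""
    def wrap(cls: type) -> type:
        ENCODER_REGISTRY[name] = cls
        return cls
    return wrap

@register_encoder("my_locale_encoder")
class LocaleEncoder:
    """User-defined component implementing the lightweight interface."""
    def __init__(self, embedding_size: int = 32):
        self.embedding_size = embedding_size

    def encode(self, values):
        # Placeholder for a real embedding lookup.
        return [hash(v) % self.embedding_size for v in values]

# The config can now reference the component by name, like any built-in option.
config = {"input_features": [{"name": "locale", "type": "category",
                              "encoder": "my_locale_encoder"}]}
encoder_cls = ENCODER_REGISTRY[config["input_features"][0]["encoder"]]
print(encoder_cls(embedding_size=16).encode(["en_US", "fr_FR"]))
```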
And then there's also the concept of the model hub slash model registry, which I'm very excited about because it will provide benefits for both the open source users and the commercial users. You can do things like define canonical components that you wanna use in your organization. So if there's a feature that gets used all the time in different models, for example, I remember at Uber we had some features related to customers and locales that were used in all different types of models, you can have canonical encoders for those that are maybe even pre-trained on very large datasets, so there's a very low cost to fine-tuning them. I'm very excited about building out that capability as well.
[00:56:05] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption for machine learning today.
[00:56:19] Unknown:
So definitely, I think there is a very big barrier to adoption that comes from just not having good enough abstractions to start getting value out of machine learning. The analogy I like to draw here is really about software, and what's enabled software to eat the world, as the famous Wall Street Journal article once said. It really comes down to this idea of modularity and being able to stand on the shoulders of giants. Instead of having to reimplement every great new idea that comes along, you just download a library and use that software. I think ML hasn't had this abstraction before, and that's been a very big inhibitor to people actually being able to adopt it: a great new idea comes out of research, but companies aren't able to productionize it and actually get it to deliver value, because they're too busy reinventing the wheel, reinventing the infrastructure, figuring out how to get data from one place to another, and cleaning up their data.
So definitely, I think having better abstractions and better canonical sources of data are the two biggest barriers, in my opinion. Once you get to the point where all the data is clean and sitting in standard data warehouse systems, ready for machine learning, and you have very powerful abstractions like Predibase that allow you to take best-in-class models and run them right on that nice, clean, canonical data source, then you'll have a very fast path to production. We definitely think we can move the needle on the modeling side, and I think companies like dbt, Snowflake, and others are doing a great job on the data side. Once these two things converge, then hopefully we'll be able to really start delivering more value. But that's definitely where I think companies struggle the most today.
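To make the "clean warehouse table plus declarative config" picture concrete, here is roughly what that workflow looks like with Ludwig, the open source framework Predibase builds on. The dataset, file path, and column names are made up for illustration, and the exact return values are sketched rather than guaranteed.

```python
import pandas as pd
from ludwig.api import LudwigModel  # open source declarative ML framework

# Assume the warehouse already exposes a clean, analysis-ready table;
# here it has been exported to a local Parquet file (path is illustrative).
training_data = pd.read_parquet("customer_churn.parquet")

# The entire modeling problem is declared as configuration rather than code.
config = {
    "input_features": [
        {"name": "plan_type", "type": "category"},
        {"name": "monthly_spend", "type": "number"},
        {"name": "support_tickets", "type": "number"},
    ],
    "output_features": [{"name": "churned", "type": "binary"}],
}

model = LudwigModel(config)
train_stats, _, output_dir = model.train(dataset=training_data)
print(output_dir)
```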
[00:58:06] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Predibase. It's definitely a very interesting platform and product that you're building there, and I'm excited to see where you go from here. So thank you again for all the time and energy that you and your team are putting into making it easier for organizations to get onboarded with ML, experiment with it, and gain some of the value from its capabilities. Thank you again for that, and I hope you enjoy the rest of your day. Awesome. Thank you, Tobias. I really appreciate it, and you as well. Thank you for listening. Don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest on modern data management, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you learned something or tried out a project from the show, then tell us about it. Email hosts@pythonpodcast.com with your story.
And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Chapters
Introduction and Episode Overview
Interview with Travis Addair: Introduction and Background
Predibase: Concept and Development
Target Audience and Use Cases
Understanding Machine Learning Problems
Declarative ML vs. AutoML
System Design and Architecture
Evolution and Customer Feedback
Workflow and Integration
PQL: Predictive Query Language
Collaboration and User Roles
Open Source and Ludwig
Customer Use Cases and Feedback
Lessons Learned and Challenges
Testing, Validation, and Stability
When Predibase is Not the Right Choice
Future Plans and Features
Biggest Barriers to ML Adoption
Conclusion and Closing Remarks