Preamble
This is a cross-over episode from our new show The Machine Learning Podcast, the show about going from idea to production with machine learning.
Summary
Deep learning is a revolutionary category of machine learning that accelerates our ability to build powerful inference models. Along with that power comes a great deal of complexity in determining what neural architectures are best suited to a given task, engineering features, scaling computation, etc. Predibase is building on the successes of the Ludwig framework for declarative deep learning and Horovod for horizontally distributing model training. In this episode CTO and co-founder of Predibase, Travis Addair, explains how they are reducing the burden of model development even further with their managed service for declarative and low-code ML and how they are integrating with the growing ecosystem of solutions for the full ML lifecycle.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great!
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. And now you can launch a managed MySQL, Postgres, or Mongo database cluster in minutes to keep your critical data safe with automated backups and failover. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host is Tobias Macey and today I’m interviewing Travis Addair about Predibase, a low-code platform for building ML models in a declarative format
Interview
- Introduction
- How did you get involved in machine learning?
- Can you describe what Predibase is and the story behind it?
- Who is your target audience and how does that focus influence your user experience and feature development priorities?
- How would you describe the semantic differences between your chosen terminology of "declarative ML" and the "autoML" nomenclature that many projects and products have adopted?
- Another platform that launched recently with a promise of "declarative ML" is Continual. How would you characterize your relative strengths?
- Can you describe how the Predibase platform is implemented?
- How have the design and goals of the product changed as you worked through the initial implementation and started working with early customers?
- The operational aspects of the ML lifecycle are still fairly nascent. How have you thought about the boundaries for your product to avoid getting drawn into scope creep while providing a happy path to delivery?
- Ludwig is a core element of your platform. What are the other capabilities that you are layering around and on top of it to build a differentiated product?
- In addition to the existing interfaces for Ludwig you created a new language in the form of PQL. What was the motivation for that decision?
- How did you approach the semantic and syntactic design of the dialect?
- What is your vision for PQL in the space of "declarative ML" that you are working to define?
- Can you describe the available workflows for an individual or team that is using Predibase for prototyping and validating an ML model?
- Once a model has been deemed satisfactory, what is the path to production?
- How are you approaching governance and sustainability of Ludwig and Horovod while balancing your reliance on them in Predibase?
- What are some of the notable investments/improvements that you have made in Ludwig during your work of building Predibase?
- What are the most interesting, innovative, or unexpected ways that you have seen Predibase used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Predibase?
- When is Predibase the wrong choice?
- What do you have planned for the future of Predibase?
Contact Info
- tgaddair on GitHub
- @travisaddair on Twitter
Parting Question
- From your perspective, what is the biggest barrier to adoption of machine learning today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Predibase
- Horovod
- Ludwig
- Support Vector Machine
- Hadoop
- Tensorflow
- Uber Michelangelo
- AutoML
- Spark MLlib
- Deep Learning
- PyTorch
- Continual
- Overton
- Kubernetes
- Ray
- Nvidia Triton
- Whylogs
- Weights and Biases
- MLFlow
- Comet
- Confusion Matrices
- dbt
- Torchscript
- Self-supervised Learning
The intro and outro music is from Hitman’s Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, and dedicated CPU and GPU instances. And now you can launch a managed MySQL, Postgres, or Mongo database cluster in minutes to keep your critical data safe with automated backups and failover. Go to pythonpodcast.com/linode today to get a $100 credit to try out their new database service. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey, and this month, I'm running a series about Python's use in machine learning.
If you enjoy this episode, you can explore further on my new show, The Machine Learning Podcast, which helps you go from idea to production with machine learning. To find out more, you can go to themachinelearningpodcast.com. Your host is Tobias Macey. And today, I'm interviewing Travis Addair about Predibase, a low code platform for building ML models in a declarative format. So, Travis, can you start by introducing yourself? Thanks for having me on today, Tobias. So I'm Travis. I'm the CTO of Predibase. Predibase is
[00:01:31] Unknown:
a low code platform designed to make machine learning more accessible and more useful to the enterprise. Before that, I was a tech lead manager for Uber's machine learning platform, leading a team focused on deep learning training. I was one of the lead maintainers on the Horovod open source project and also a core contributor to the Ludwig project as well, which is one of the foundational technologies for Predibase.
[00:01:58] Unknown:
And do you remember how you first got started working in the area of machine learning?
[00:02:02] Unknown:
Absolutely. Yeah. So it goes back a bit to 2011 or so. I was working for Lawrence Livermore National Lab on processing about 10 terabytes of seismic data. And our goal was to try to do some analysis of it to detect nuclear weapons tests, interestingly enough. But what we found was that a lot of the data, over 50% or so, was noise, and we had no good way to detect it. So I started pulling out some of my undergrad AI textbooks, started implementing some support vector machines, ran it on top of Hadoop, and got just really excited about the whole thing. Published an article in the Computers and Geosciences journal and decided to go to grad school, get my master's in machine learning, and become more active in the industry.
And, yeah, that's how it all started.
[00:02:54] Unknown:
And now that has brought you to where you are today with the Predibase business and the project, and I'm wondering if you can describe a bit more about what it is that you're building there and some of the story behind how you decided that this was a problem space that you wanted to spend your time and focus on. When I was at Uber, I actually started off as a machine learning engineer. So working on kind of the vertical problems of ML.
[00:03:18] Unknown:
And what I found was that there were a lot of things I wanted to do. Like, I wanted to, you know, try deep learning and try training on large datasets and multi GPU and all these sorts of things. But there just wasn't a lot of good tooling available at the time to do that. There was TensorFlow, and if we wanted to run TensorFlow scalably, there were a whole lot of hoops you had to jump through. And then integrating that with something like Spark that we were using for data processing, it was just like, forget about it. So what I realized was that if I wanted to solve ML problems, I really needed to start with the ML tooling and infrastructure. And so I joined the Michelangelo deep learning team.
And while I was there working on this horizontal problem, you know, we worked with a lot of customers that had very similar patterns emerge to mine, where we often found that there was the struggle to get something that was productionizable, like, at scale. Right? And there was also a repeating pattern of there just being frankly more ML problems than there was ML expertise at the company. And so from the kind of horizontal platform standpoint of Uber AI, tasked with figuring out, you know, how can we help get models into production faster, we kind of realized that there was a need for better abstractions, better infrastructure more generally.
And so the tool Ludwig that my co-founder Piero put together ended up being a perfect encapsulation of that vision of being able to say, you know, let's let researchers build state of the art models and put them in as components into this framework. And then that gives the vertical teams in the company this very easy declarative interface to just kind of swap in and out different components for their data without having to rewrite, you know, huge swaths of Python code every time. And we realized that this was, like, a very successful pattern for Uber. And at the same time, you know, Ludwig became open source, we saw that it resonated very strongly with the community, and realized that there was, like, a very real need for this kind of better abstraction layer, if you will, in the industry, and decided to form Predibase with the intent of kind of pushing the state of the art forward in terms of what kinds of tools data science and machine learning teams have available to them to make them more productive and kind of decrease the time to value of machine learning in the enterprise.
[00:05:43] Unknown:
In terms of the audience that you're focused on, particularly given that you're very early in your journey of being a new company and starting to work with some of the first sets of customers, I'm curious how you think about the priorities that you're trying to support and how that influences the areas of focus that you're investing in and the user experience and feature development that you're prioritizing?
[00:06:08] Unknown:
Yeah. So we like to say that we don't expect that Predibase is gonna be where you train your first machine learning model ever. Right? So oftentimes, we're coming into organizations that have a lot of machine learning problems that they wanna solve. Maybe they have some kind of horizontal team that's focused on trying to build out a platform for doing machine learning. And maybe they've tried using some AutoML tools in the past and have struggled with getting them into production. And so what we identify is, you know, we see these teams that have struggled in this way, and the value proposition that we wanna bring to them is to say, you know, if you've used, like, you know, some traditional ML systems in the past, like Spark MLlib or what have you, and you're struggling to kind of uplevel to deep learning and more state of the art techniques in the industry, you know, we provide, like, a platform that gives you those capabilities in a form factor that's, like, much more familiar to you. And if you're struggling to kinda keep up with the amount of problems that the organization has, like, maybe you have teams of engineers that have ML problems that maybe are not the top priority for the whole company, but a very important priority for that team.
Predibase provides a platform that allows those teams to be unblocked and allows everyone in the organization to collaborate together towards building these solutions. And so this focus on, you know, collaboration and kind of mixed modality, like, you know, a very broad set of tasks that people might wanna solve, those are very core focuses for us when we look at companies that we wanna partner with at this stage.
[00:07:42] Unknown:
One of the interesting things that you mentioned is that you're working with companies who have a lot of machine learning problems to solve. And I'm wondering if you can talk to what that really means. Like, how you can identify that a problem that you have is a machine learning problem, or whether machine learning is the right approach to being able to provide value and utility for a given objective that you're trying to achieve?
[00:08:06] Unknown:
I would say that it comes in a few different flavors. Like, on the one hand, you have kind of traditional data warehouse type systems that have tables, that have historical data or transactional data. And so very often, the story there is people wanna do some kind of predictive analytics. Right? So we know who churned last month. You know, we wanna predict who's gonna churn next month. So that kind of forward looking predictive capability is, like, one type of problem that we see a lot with companies that fits very nicely into machine learning. So you have a lot of data in your database. You wanna be able to, you know, predict or, like, make forward looking statements about that data.
That's one area where we slot in. But beyond that kind of structured problem, there's also this question of unstructured data as well. Right? And so you have a lot of companies that have text data or image data or audio data sitting around that they've collected, and they don't really know what to do with it. And it's maybe not so much a question of, you know, I have data that says what customers submitted in support tickets in the past, I wanna predict what support tickets are gonna say in the future. It's not anything like that, but it's more about just understanding semantically what's going on in the data and then how that can be used to better inform the predictive forward looking models that we want to build on that more transactional data. So this idea of unlocking the power of unstructured data is another really core one. And so one of the things I think is very unique about Predibase and Ludwig is the ability to kind of bridge this gap between structured and unstructured data. So the platform is very flexible. It's data oriented in such a way that if you have some transactional tabular data and you also have some unstructured image or audio or text data, those can be combined together into a unified machine learning model in a very simple and straightforward way. And so we can unblock organizations that have all this disparate data and they wanna derive value from it, but they haven't figured out how in an effective way,
[00:10:04] Unknown:
that's where we think there's a lot of power for machine learning, and particularly Predibase, to kind of slot in. Yeah. One of the ways that I've seen those different applications categorized is the difference between predictive analytics, which is the first category that you mentioned, versus prescriptive analytics of this is what you should do, and then descriptive analytics to say, I just wanna understand what this is trying to tell me. Right. Right. Absolutely. I think that descriptive component
[00:10:31] Unknown:
is one that not a lot of people have tapped into.
[00:10:34] Unknown:
In terms of the way that you have formulated the product that you're building at Predibase, you're positioning it as declarative ML. And you've mentioned earlier that they may have had experience trying to use the category of tools called AutoML. And I'm wondering if you can just talk to the differences in the nomenclature as far as what that really means and how the sort of expectations are different between an AutoML category of tool and a declarative ML category of tool.
[00:11:07] Unknown:
Absolutely. Yeah. I think this is, like, a very key differentiator between how we're thinking about the problem and how a lot of other companies out there are. So the way that we think about it is that at a high level, there are very similar capabilities in terms of being able to put ML in the hands of non experts at kind of the starting point. But we believe that declarative ML provides a more principled and flexible path forward. So whereas a lot of AutoML solutions, I think in a degenerate state, AutoML often becomes this kind of kitchen sink style approach where the system will throw everything it can think of at the problem. And if something works, then great. You know, you kinda take that baton and run with it. If something doesn't work, then there's not really a lot of options that you have in terms of what's going to happen next. Like, how someone who is maybe a domain expert or a data expert can come in and kind of help out with, you know, unblocking things.
Where we think declarative ML provides a difference here is that because it gives you this very complete specification, you can start at something very high level and say, you know, I just wanna predict this target given these input variables. And you can get a baseline from there, but it's not the end of the story. And so because it's very explicit about saying, here's everything that the system did. And you can modify and customize any aspect of this down to, you know, individual, you know, layers of a neural network. Right? It allows people to then iterate on these systems, on these implementations over time and kind of build towards a working solution in a more principled way. So for example, they can say, oh, well, I initially tried building this model with this set of parameters.
And then for v2, I swapped out this model architecture, and I changed the learning rate from this to this. And it gives you that audit trail of being able to say, here are all the things I tried. Here's what changed, and here is the effect that that had on model performance. And so we think that this is also very powerful for enabling collaboration. So if, you know, someone who is an engineer maybe wants to train their first model in Predibase, they can do that without having to know a lot of details about the how of what's going on under the hood. But if they get a solution and then they want to maybe have a more expert data scientist take a look and say, you know, what do you think we should try in order to get better performance? They can take a look at the config and say, oh, well, you know, I see that you're using this parameter here. Let's maybe try swapping that out, see what happens. And so, you know, it gives you that ability to make these incremental changes in a very simple way that, if you were kind of going down to just pure low level tools like PyTorch or TensorFlow, would be much more difficult, because you'd be having to ship over entire, you know, Jupyter Notebooks and Python libraries and, you know, what's the execution environment for all this. It's much more difficult for someone to just quickly take a look and kind of provide feedback and kind of next steps.
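To make that kind of declarative specification concrete, here is a minimal sketch using the open source Ludwig Python API that Predibase builds on; the dataset and column names are hypothetical, and exact config options vary by Ludwig version.

```python
from ludwig.api import LudwigModel

# Minimal declarative config: predict a target given a few input features.
# Iteration happens by editing this config (e.g. swapping an encoder or
# changing the learning rate) rather than rewriting training code.
config = {
    "input_features": [
        {"name": "customer_age", "type": "number"},
        {"name": "plan_type", "type": "category"},
        {"name": "support_ticket_text", "type": "text"},
    ],
    "output_features": [
        {"name": "churned", "type": "binary"},
    ],
    "trainer": {"learning_rate": 0.001},
}

model = LudwigModel(config)
train_stats, _, output_dir = model.train(dataset="customers.csv")
```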
And we also believe this ties in very nicely to our version of AutoML, which we would call iterative ML or iterML, where we see it being much more of a conversation that you're having with the system, where you try something out, the system can propose some new things to modify in the specification for the model, you can choose to either accept or reject any of those things, train for a little bit more, and then use the results of that previous run to then inform what you're gonna try next. So it becomes a very in-the-loop, back and forth process that progresses in a way that we think is much more like how traditional software development is done. Right? Where you have a git repo, make some code commits, and over time, you can see the code evolve and change to kind of better conform to the end state, as opposed to just trying to get it all out there at once and, you know, not have any way of knowing, you know, what was the history, what was the effect of every change that we made. In this category of declarative ML,
[00:15:11] Unknown:
another company that I've seen using that terminology is Continual, which is based on being able to build machine learning pipelines on top of your data warehouse so that you can just treat your machine learning workflow as SQL, effectively. I'm wondering if you can just characterize the relative strengths and use cases of what you're building at Predibase versus what they're building at Continual.
[00:15:35] Unknown:
Yeah. Absolutely. So I guess the first thing I'd say is that, you know, in general, we're very glad to see other companies kind of validating the idea behind declarative ML, you know, from following the work that Tristan and the folks at Continual have been doing. It's always been very nice to see that they've referenced our work on Ludwig and Overton. So one of our co-founders, Chris Ré, had a company called Lattice that had a product called Overton that was acquired by Apple, which was another early declarative ML system. And so I think in general, there's, like, a really good shared vision of kind of moving the conversation forward about better abstractions in ML. So I think there's definitely an element of, you know, a rising tide, you know, raising all ships to it. Where I think there are differences would be, they definitely are very leaned in to the kind of modern data stack operation side of things. I think that their value proposition resonates very nicely with people who are active dbt users, for example. That's a big part of kind of how they're approaching the problem, which I think is a totally valid way to think about it. For us, we definitely think that we can do a lot, not just on the operation side, but on the model development side as well. So with Ludwig, we provide a framework that is also pushing forward the state of the art of what ML models can do. Right? And so that's a big part of the story for us, is trying to figure out how do we help users get good models in the first place and do it in a way that has very low barriers to entry, but very high performance and a high ceiling. Right? We also believe the operations component is a big part of that, but it's not the only part. I think there's still a lot of work that needs to be done on just getting to a good model that you wanna put in production.
And so that's where I think Ludwig is a bit different from what some of the other tools out there provide, in that we're also tackling that aspect of the problem in a big way. And as far as
[00:17:27] Unknown:
the implementation and architecture of what you've built at Predibase, can you talk to the overall system design and the ways that you've thought about the architecture of how to approach this problem of making declarative ML accessible and easy to operate, so that teams who don't necessarily want to invest in building the entire MLOps stack can be able to pick it up and run with it and start to gain value very quickly.
[00:17:55] Unknown:
Predibase is built as a multi cloud native platform. We're built on top of Kubernetes. And so, you know, we have deployments that run on AWS, Azure, GCP, and also on some on premise Kubernetes clusters. So we believe that that's, like, a very core part to make it flexible, so that wherever your data happens to live, you know, we can push the compute down to be as close to that data as possible to minimize the latency and minimize the egress costs and all those things. So that's a very core part of how it's architected. We also have a separation between what we would call our control plane and data plane of the system. Our data plane is built on top of the Ludwig on Ray open source work that we've done. So we use Ray for doing the distributed aspect of, you know, scaling to large datasets and parallelizing the work. We use Horovod for doing distributed data parallel training.
And then we also have a serving layer as well that's built on top of that. And then we also have a separate control plane that provides a serverless abstraction layer on top of this data plane so that, you know, from the user perspective, they don't need to be as concerned about provisioning Ray clusters and, like, how to right size them for the workload and whether I wanna use this GPU or that GPU. So that's a big aspect of what we provide on the infra side: this kind of intelligent provisioning and life cycle management of the compute resources, and making sure that these long running training and prediction workloads can be processed end to end in an efficient way. And then, of course, there's a whole other serving stack as well that we're building out that's built on top of NVIDIA Triton, and we'll hopefully have a lot to say in terms of our work there, with some blog posts coming out in the future. But that's something that we're also looking to push into the open source to some extent, as well as some of the serving capabilities for Ludwig that we're bringing to the enterprise.
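For reference, the Ludwig on Ray work mentioned here is also configured declaratively in open source Ludwig. The sketch below assumes Ludwig's documented Ray backend section (field names can differ between versions), and the dataset path and columns are hypothetical.

```python
from ludwig.api import LudwigModel

# Same declarative style, but asking Ludwig to run preprocessing and
# training on a Ray cluster (with Horovod handling data parallel training).
config = {
    "input_features": [{"name": "support_ticket_text", "type": "text"}],
    "output_features": [{"name": "churned", "type": "binary"}],
    "backend": {"type": "ray"},  # distribute the workload across the cluster
}

model = LudwigModel(config)
model.train(dataset="s3://example-bucket/customers.parquet")  # hypothetical path
```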
[00:19:55] Unknown:
As you started to go down the path of starting to build out this platform and explore the capabilities that you wanted to offer, I'm wondering how the initial design and ideas and vision around where you wanted to end up have shifted and evolved and some of the directions that you have moved in order to be able to accommodate some of the early feedback that you've gotten as you worked with design partners and just some of the overall evolution of the platform as you started to dig deeper into this space?
[00:20:27] Unknown:
That's a great question. So definitely, we had a certain set of assumptions coming in about what the market was looking for and kind of what level the user wanted to think about the problem at. Right? Whether this was a production problem first and foremost for them, a research problem, or somewhere in between. Right? And so we definitely had a very early focus on thinking a lot about the analyst use case and how, you know, there are people who have data but don't have a background in ML who want to be, you know, up leveled with that capability. And so we thought a lot about, you know, making it kind of operations and production oriented to begin with. What we found in kind of the early working with early customers is that there's still a lot of interest in, you know, the model development aspect. And, you know, you're never going to, at least, you know, with how AI and machine learning work today, get that perfect model every time, like, without any kind of manual intervention or kind of domain expertise.
And so we definitely, from very early on working with customers, realized that having a teaching element of the platform was very important as well: explaining, you know, here are the options you have available to you in this declarative specification. Here's how you should think about using one option versus the other, what the appropriate ranges are, you know, how you should go about doing model development in terms of starting with, you know, a really complex model or a baseline, understanding how the model performed kind of in a post hoc way and saying, you know, these were the features that contributed the most to the model's predictions, or these were fields that maybe, you know, were imbalanced in some way, and so there are other corrections we need to make. So definitely that aspect of iteration and instilling kind of machine learning best practices is something that has been a learned experience. And so, you know, one of the most recent additions we made to the platform just before coming out of stealth was investing really heavily in a Python SDK that's very similar to what the Ludwig Python SDK does, but with some more enterprise features that really make it well integrated into a data science stack, where you're able to experiment with data, experiment with models, iterate, as opposed to just going straight for the production model from day one. Right? So that was definitely something we learned early on in the process of working with customers.
[00:22:49] Unknown:
And it's also interesting to think about which areas of the overall machine learning problem and life cycle you're looking to be able to facilitate, because there are, you know, boundless capabilities where, you know, there's the experiment tracking and model tracking. There's the model monitoring to be able to understand, you know, what concept drift is happening once it's in production. You know, there's the kind of pipelining. So there are definitely a number of different areas that you could try to focus on, a number of different directions you could try to push into. I'm curious, given the fact that you were starting with Ludwig as the kind of core building block, how that helped to shape your overall consideration of the appropriate scope for what you're trying to build, and how you were thinking about what are the maybe boundaries and interfaces that you want to incorporate to be able to let Predibase fit into the overall workflow and life cycle of machine learning in an organization, while being able to be very opinionated and drive the conversation around the areas that you wanted to own? It's a really good question, I think, because
[00:24:03] Unknown:
this, I think, is at the core of the problem of startups in the space trying to define their category. Right? Because I think when you look at the space in general, what you see is that there are a lot of really good tools that are what you might call point solutions that are solving one aspect of the problem, whether it's explainability, model training, model serving, what have you. Right? But really, what organizations need is something that is end to end. Right? That is actually a platform that is fundamentally delivering business value, not just, you know, model training or something like that. So the way that we think about it broadly is that we want to be able to provide a story that goes from data to deployment for the user. So connecting data from your data warehouse or database, providing best in class model training at scale with the serverless infrastructure, and then providing a really clean and simple path to deployment that can be either a REST API for low latency real time prediction or the PQL SQL-like language that provides batch prediction capabilities to the user.
And starting with that core vision of, like, this is the journey for the user. There are, as you said, a lot of other aspects to it as well, like, you know, model explainability, data preparation and data quality and data versioning, model monitoring, model drift detection. And the way we're thinking about these things today is pretty similar to how we think about them actually on the open source side with Ludwig, where we want to try to be as integrated with the community as possible. Right? So on Ludwig, we have integrations with Comet ML, Weights and Biases, whylogs, MLflow, and Aim that we're working on. And so these tools, you know, provide, like, different capabilities at different parts of the process. Right? Like experiment tracking or model monitoring.
But the way we wanna think about it is, if you already have a tool that you like that solves these problems, like, we don't want to have to say Predibase is a rip and replace solution for you. Right? We wanna be well integrated. So if you wanna use Weights and Biases or Comet, you know, you just give us an API key, and we'll log things there and then have a nice way to link back and forth between the two. Or if you're using whylogs slash WhyLabs for doing model monitoring, you know, we're thinking about ways that we can integrate there to do automated model retraining based on triggers that come from WhyLabs. So that's the way we're thinking about the problem today: let's integrate as much as possible in the parts of the platform where we don't feel we're providing strong differentiated value or where we couldn't provide, like, a best in class value proposition, while still telling the user, like, hey. If you are starting from scratch, right, and you don't have an ML platform today, Predibase isn't a point solution. It's something that will get you end to end from the data to a deployed model that can start delivering value. And then you can layer on more tooling on top of that, you know, as we see fit.
[00:27:06] Unknown:
And so in terms of that workflow, in the case where somebody is greenfielding, they say, I want to adopt ML. This is my first foray into that. I'm going to use Predibase to be able to experiment with how can I take this data that I have and turn it into something useful that I can do with it? I was wondering if you can just talk through that kind of end to end workflow of starting with the data and ending with, I have a model running in production, and I'm doing something with it. In the tool, there are different ways that users can do it. We do have a web UI that people can use
[00:27:37] Unknown:
to take all the actions. Everything that you can do in the platform can also be done through our SQL-like language PQL as well as the Python SDK. So we have many different views, depending on the persona that's using the platform, that do the same thing. But regardless of which entry point you choose to use, the steps are largely the same, as you first start with the data. So if you have your data in Snowflake or S3 or BigQuery or whatever, you just give us some credentials, point us to what table or what bucket you're interested in working with, and then we can start with any data that is structured in some kind of table like form. Right? So that can be an actual database table.
That can also be a Parquet file, a CSV, anything like that. And then, you know, maybe a question that comes after that is, what if I wanna use unstructured data like images or audio? So the way we think about that today is that you give us the URLs to those images or those audio files as columns in your tabular data, and then we can pick those up and we'll join all that together into, like, a single flat tabular view for training. Right? Once you've pointed us to the data that you wanna work with, we automatically do all the metadata extraction and schema extraction from the data for you. So we know, you know, what data types the data is and that sort of thing. And then you can start creating models. So you, you know, go into the model builder UI or use the SDK to build a model in a way that's very similar to how you do it in Ludwig.
And all you need to specify to get started is just the target or targets, since we support multitask learning, that you wanna predict. From there, you can customize any aspect of the training, through either layering on full, kind of, like, hyperparameter optimization, AutoML suggested, like, configurations for everything, or starting with just a very simple baseline. Either way, you can go to, like, any level of the extremes, right, or any level of customization in between, and then start training a model. Once you start training a model, it will be sent to one of the, what we call, engines. So that's one of our serverless clusters that does the computation for you, that lives, you know, wherever your data happens to live, in, you know, the same kind of region. Right? The model will get trained. And from there, you can start using PQL or the Python SDK to validate. We also provide a full set of visualizations for the user to explore in terms of, you know, understanding the explainability of, like, feature importance. We also have calibration plots, all sorts of other things like confusion matrices, etcetera, that you can dig into. And from there, you can either iterate on the model, continue to develop it in kind of an incremental way, with a fully kind of versioned and lineaged process.
And then once you're happy with the model, there's kind of a one click deployment that we have, where you can deploy it to a REST endpoint and then start curling it with, you know, JSON objects as you see fit. And then if you'd like to, you know, retire the model or replace it with a new model version, it's a similar one click kind of deployment process as well. And then there's, of course, ways that you can automate this as well, to do retraining as well as do validation to determine when you want to trigger redeployment. Right? So if you say I have a held out test dataset, I only wanna redeploy when the, you know, new model does better than the old model on that dataset.
You know, that's something that you can configure within the platform as well. And at a high level, that's the journey that we see as being the core flow that the user wants to go through. So connect data, train model, deploy. And then there's a lot in between, of course, that kind of fills in the gaps, but that's fundamentally what the platform provides.
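As a rough illustration of what the "curl it with JSON" step might look like from Python, here is a hedged sketch; the endpoint URL, authentication header, field names, and response shape are purely hypothetical and not the actual Predibase API.

```python
import requests

# Hypothetical call to a deployed model's REST endpoint. Everything below
# (URL, token, payload, response format) is illustrative only.
response = requests.post(
    "https://api.example.com/deployments/churn-model/predict",
    headers={"Authorization": "Bearer <api-token>"},
    json={
        "customer_age": 42,
        "plan_type": "pro",
        "support_ticket_text": "My invoice looks wrong",
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())  # e.g. {"churned": {"prediction": false, "probability": 0.12}}
```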
[00:31:16] Unknown:
And you mentioned the PQL dialect a couple of times. And I noticed that when I was going through some of the blog posts and some of the early material that you have about what you're building at Predibase. And I'm wondering if you can speak to some of the motivation behind creating this new dialect and this new, I guess, language you could call it, and some of the ways that you think about the semantic and syntactic design of it. Yeah. So with PQL, we think it's a very natural
[00:31:44] Unknown:
way to extend the declarative idea, because I think it plays very nicely into the way that we think about ML systems today compared to databases a few decades back, right, where, in the early days, you had low level languages like COBOL that people would be writing to interact with databases, and then SQL comes along and provides this very nice declarative way of expressing all sorts of complex data analysis that you might wanna do. And we see PQL as being the natural extension of that idea to the ML domain. Because you already have this declarative specification that provides a very tight semantic link between your data fields, the fields of your dataset, and the fields that are the inputs and outputs of your model, and then everything that happens in between,
PQL provides a very natural way to express, you know, the model prediction request that you might wanna do. So what I think is very powerful about PQL is that you can do something as complex as a batch prediction over, you know, a 10 terabyte dataset using some model, that you then wanna write out to a downstream table. Normally, you would end up writing, like, an ETL job in Spark to do something like this. But that's just a one line PQL query, which would be predict target given select star from data, or whatever. Right? And you can, of course, then do all sorts of more complex things from there in terms of joining tables across different datasets, filtering them, doing sliced analysis. You have ways of doing what we call hypothetical queries that are kind of similar to what you might do for real time prediction, where you want to take, like, an entirely new data point and then express it as a query that then can be predicted on. And so I think, you know, certainly one powerful use case of PQL is this idea of a more efficient way of doing batch prediction that fits in nicely with other tools that do ELT. Like, dbt is a really good example there, where we already have a dbt integration that we've written that some of our customers are using. And so if you want to be able to express your prediction pipeline as SQL, essentially, PQL provides a very natural way to do that. But we also think that PQL is a very powerful enabler of putting ML inference in front of, and, like, letting people interact with and understand the model, stakeholders who might not today ever really interact with an ML model. Right? So anyone now who understands SQL can start making predictions and start to play around with understanding, like, what do I have to change in this input to make the model predict something else? Really, it's a very fun, kind of just interactive process that users can go through.
And these sorts of PQL queries are a very nice sharing point as well. So if the data scientist has a model that they've trained and the analyst wants to play around with it or wants to see the result of some prediction on some slice of data, you just need to share that PQL query with them and then, you know, say, hey. Go run this and, you know, let me know how it goes, instead of having to ship whole notebooks and Python files and whatever with it. So that's kind of where we see the value of PQL: batch prediction, as well as kind of pipelining and doing, you know, ELT type workloads, as well as this kind of shareability and making ML inference more accessible to the broader organization.
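Based on the one-line example described above, a PQL batch prediction would look roughly like the string below; the table and column names are made up, and the client call is a hypothetical sketch rather than the actual Predibase SDK.

```python
# Hypothetical PQL batch prediction, following the
# "predict target given select ..." form described in the conversation.
query = """
PREDICT churned
GIVEN SELECT * FROM customers
"""

# Hypothetical client usage; the real SDK entry points may differ.
# results = client.execute(query)
# results.to_dataframe()
```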
[00:35:03] Unknown:
Noting the PQL acronym, my initial thought when I first read that in the blog post was, oh, obviously, the P stands for Predibase, but it stands for predictive. And so I'm wondering if you can talk to what your overall vision is for this syntax, and if you intend for it to be something that is maybe adopted outside of Predibase as kind of a general standard for this means of interacting with machine learning, and just some of the overall vision there?
[00:35:34] Unknown:
So the interesting thing there is that the name PQL actually predated the name Predibase. So we had the name PQL in mind before we came up with Predibase. But to your point, I definitely do see PQL being something larger than Predibase in a lot of ways. Like, we want to see more folks in the industry adopt it. And so the vision for PQL is that, you know, you do have a lot of the BI tools today that have very tight integration with SQL. And we'd like to be able to see, you know, very nice integration with PQL in a lot of these tools in the future as well. You know, thinking about how we can make it a standard that the larger community embraces.
I think there is a lot of value to that. And so we do work very closely with some companies in the BI space through the Linux Foundation, actually, where Ludwig and Horovod, the projects that we maintain, are hosted. And so we do have a collaboration there with the AI plus BI committee that is working on exactly this problem of integrating, you know, machine learning prediction into BI systems. And that's where I think things can go if, you know, the standard ends up becoming well adopted in the future.
[00:36:44] Unknown:
And another interesting element of the overall ML space is the question of collaboration. You mentioned PQL allows you to say, I've got this model. I wanna pass it off to this analyst to be able to play with it and experiment with its capabilities, maybe provide some feedback on ways that I should tweak it to, you know, make it more powerful for a certain use case. And I'm just curious how you think about that collaboration aspect of Predibase and how you've designed the platform to be able to be kind of idiomatic and recognizable for different roles and stakeholders across the organization who are interacting with the different capabilities of the model and the overall workflow?
[00:37:24] Unknown:
So I do think that collaboration is very core to what we're doing, because we see this as being a tool not just for an individual data scientist or engineer, but a tool for an organization. Right? And so we do have different metaphors that we think about that relate to different stakeholders, and you can see kind of visions of each of them in the platform. So for folks who are more on the analytics side, we do have a query editor built into the UI that lets you just start writing PQL or even ordinary SQL queries. The parser is expressive enough to kinda support both in the editor, and kind of playing around with things as you would if you were using Superset or some other, like, BI slash analytics tool. But we also then, for the kind of data scientist and the engineer personas, have a lot of tools that kind of adhere to more of, like, a GitHub style workflow, where, you know, folks will be able to kind of incrementally update models in a way that is, you know, versioned. And so you can kind of diff between different models and then have this ability to do experiments in separate branches. And then once you're happy with how the experiment is doing, saying, oh, this experiment is now doing better than what is currently in production.
Let's, you know, merge this back into the main line, similar to how you would in Git. And then, you know, that kind of becomes a concept very similar to a pull request, where people can comment and kind of say, hey. I don't agree with this particular parameter choice. Can we maybe revisit this? So there are different ways that we've thought about making it more approachable to people, by having those callouts that hearken back to things that they are already familiar with, but at the same time, giving them something that's net new. Right? Like, I think the problem with just, you know, using Git for ML today is that Git doesn't provide a story around the non source code artifacts. Right? And so you need to use external tools for that, and things are not super tightly integrated. And that's where something like Predibase can slot in to fill that gap for that particular persona. Right? So it's in large part about providing the right metaphors for the right person that's using the tool.
[00:39:29] Unknown:
Digging into the Ludwig aspect of what you're building, as you mentioned, it's an open source project. It predates the business. You have used it as sort of the core building block of what you're providing. I'm curious if you can talk to some of the ways that you're thinking about the governance of the open source project and how you identify which pieces of the engineering that you're doing on and around Ludwig are part of the business and which parts belong with the open source project. And along with that, some of the ways that your work on Predibase has fed back into the Ludwig project.
[00:40:04] Unknown:
From the governance standpoint, we have been making a concerted effort to get more folks involved, and, you know, we hold regular monthly meetings with the community. Talk about the road map, get buy in from different people about what features are important. So right now, one thing that we've been working on on the open source side, because a lot of other companies have been interested in it, is a model hub that provides, you know, some ability to share different trained Ludwig models and configurations. So that's something that's definitely been a community driven effort today. And then I would say that in terms of how we see the relationship between what's Predibase and what's Ludwig, we do have a very substantial part of the engineering team that works almost exclusively on the open source. And so it's very important to us that, you know, we not just take Ludwig as something that we consume downstream, but that we also actively, you know, are investing back into Ludwig and making it better, and that will make Predibase better by default. Right? But a really good example there is the work we've done on scalability recently. You know, we have some customers that we've worked with that are training on larger datasets, you know, terabyte plus.
And so we've had to think a lot about, you know, what are the bottlenecks in Ludwig. You know, Ludwig is a very complex system in a lot of ways. We deal with every type of modality of data at the same time, potentially. Right? We need to have an efficient way to pipeline it all, in terms of both the data processing and the model training and the prediction. And so we've invested quite a lot in building that out, specifically to improve Predibase. But the nice thing is, all of those features then, you know, become ultimately part of Ludwig, because that's where the core of those capabilities live. I'd say two other big features that are coming to Ludwig, driven by requirements on the Predibase side, have been: one would be the improved AutoML capabilities that we've been investing in in the tools. So this would be kind of suggesting configurations and suggesting hyperparameter search ranges based on the data, based on past trials, trainings, and things like that. And then the other is on the serving side. One thing we definitely found on Predibase is that there's a very strong need to make sure that the serving environment is isolated and doesn't have tons of external dependencies that blow up the deployments and, you know, add to your overhead.
Since moving from TensorFlow to PyTorch last year, we've invested quite a lot in building out a TorchScript layer for doing serving, which allows us to strip out all of the Python dependencies on Ludwig at serving time and provide a very low latency end to end servable that does not only model inference, but also the preprocessing, so the data transformation, as well as the post processing. And this is something that, you know, was very core to what we're doing at Predibase, but we've made all of that open source as part of Ludwig as well. So now the community can take advantage of it as well.
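As a rough sketch of what consuming that kind of exported artifact looks like on the serving side, the snippet below loads a TorchScript module with plain PyTorch and calls it; the file name, input dictionary, and exact calling convention are assumptions for illustration, not the documented Ludwig export interface.

```python
import torch

# Load an exported TorchScript servable; no Python-level Ludwig
# dependencies are needed at this point, just PyTorch.
module = torch.jit.load("churn_model.pt")  # hypothetical export path
module.eval()

with torch.no_grad():
    # Hypothetical input: the exported module is described above as bundling
    # preprocessing, inference, and postprocessing, so raw feature values are
    # passed in; the dict shape here is an assumption.
    outputs = module({"support_ticket_text": ["My invoice looks wrong"]})

print(outputs)
```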
[00:43:01] Unknown:
In terms of the early applications of the Predibase platform that you're building and how you've been working with some of your early design partners, I'm wondering what are some of the most interesting or innovative or unexpected ways that you've seen the platform used.
[00:43:16] Unknown:
Yeah. There have definitely been some that surprised us. So we definitely expected that tabular was going to be a very important use case for folks, and so we invested a lot in making sure that we had state of the art architectures and capabilities on tabular data. And that turned out to be true, but we've also found there are quite a lot of really interesting unstructured datasets that people have been working with as well, where they're trying to predict, you know, anomalies in, like, image data, very large image datasets, or doing kind of really interesting mixed modality training with, like, text and tabular.
We've also found that there are a lot of situations where users wanna do machine learning training without a lot of labeled data. And that's, I think, a particularly interesting one, because it's been leading us to invest a lot more heavily in building out self supervised learning capabilities in Ludwig. And so, you know, one thing that we're working on actively right now is building out a really sleek pre training API for Ludwig, so that you can, without needing to specify, you know, a target column or anything like that, do some initial training to learn a good representation of the data that you can then apply downstream to a lot of different tasks. And so that's one that has definitely been informed by what we've been seeing from customers as being a very critical need for them. That's now informing a lot of the product roadmap.
[00:44:39] Unknown:
In your own experience of going from working at Uber and helping to solve the problems that they have for machine learning and producing these useful open source projects that have been available to the community, and then turning that into building a business around those capabilities, and from the lessons that you learned at Uber, I'm wondering if you can just talk to some of the most interesting or unexpected or challenging lessons that you've learned in the process of building Predibase.
[00:45:07] Unknown:
Yeah. So I would say that there have been some really interesting problems that mirror a lot of problems that we encountered at Uber. So I think that when you look back at my time at Uber, the story there was that I was very keen on unification of infrastructure. And so one of the things that I was really heavily pushing towards the end of my time there was on moving away from a kind of Spark plus random bespoke training architecture, built on top of Horovod and some other things, towards using Ray as a unified infrastructure layer. And so that very heavily informed the direction that we took with Predibase in terms of building out our training system, of being this single compute cluster that is capable of doing the preprocessing, the training, batch prediction, kind of the whole thing end to end. That's worked out really well. And then, you know, when we were starting to build Predibase, we had to take this kind of data plane that came from all these years of working at Uber and, like, all the lessons that we learned along the way, and then think about, okay, now how are we gonna make this into a truly serverless enterprise experience. Right? And so we did a lot in terms of the early days of, like, building out the control plane layer. I think there were quite a lot of lessons we learned along the way about how you should think about coupling in these sorts of big, complex distributed systems, where, you know, we had early on some, like, interface boundaries between the control plane and the data plane that were not particularly well defined. You know, there was a lot of tight coupling. So sometimes failures would occur, and certain things that should not have failed would fail because there was too much coupling in there. And what we've done over time is rearchitect the platform to be much better isolated, so that we use a more kind of event driven architecture.
So more message brokers and things like that that kind of make things very cleanly separated. And that's been a very big learning in building an enterprise platform: you know, how important it is to really define the service boundaries well between the different components of the system. And overall, you know, we found that reliability, robustness, stability, these have been, like, concerns that when you start building the company, you don't initially think, oh, yeah, these are gonna be the top things that I'm gonna put on the road map. Right? But now that's definitely, like, top of mind for us at all times: you know, how do we build the platform in a way where we account for as many things going wrong as possible and have a story around making sure that at the end of the day, the user gets a very clean and a very responsive experience, right, that doesn't fail in some weird unexpected way.
[00:47:50] Unknown:
Because of the fact that you are running a large and scalable and multi cloud system with a lot of distributed systems going on, I'm curious how you have approached the kind of testing and validation, so that as you iterate on the product, you're able to very quickly get feedback as to whether a change has caused a regression in terms of your, you know, ability to quickly recover, or being able to identify potential issues with fault tolerance, and just how you're able to think about managing forward progress and iterative development on the platform while ensuring that you maintain those principles of stability and scalability and fault tolerance?
[00:48:31] Unknown:
Yeah. That has been, I'd say, one of the more difficult challenges to solve. I'd say that we're still figuring out the right way to think about some of these things. But we've definitely invested quite a lot in ensuring, like, the benchmarking on the Ludwig side. And so there's an active project from one of our employees working on building out an entire benchmarking pipeline for Ludwig, so that every time a change happens, we can, you know, validate it against different workloads and make sure that model performance is good, GPU utilization is good, memory utilization is good, that all the kind of metrics that we care about for the workload are there at that level, so that we know that, okay, it's not a change in Ludwig that is causing memory to spike or something like that after this change is made. So that's, I think, the first aspect we have to get right: making sure that the open source is very stable and, you know, meets the requirements that we have set. And then from there, we have quite a lot that we've done on the platform side in terms of building out continuous integration and different tiers of deployments for the whole system to make sure that it's all well tested before we do a release.
So we do have a regular release cadence that we've set up with our customers. Every change we make goes into a live staging environment that we test out internally, and goes through a full battery of integration tests that actually run on a Kubernetes cluster, on live compute resources, making sure that all the different models we regularly test are working correctly and not failing in any unexpected ways. We've also invested a lot on the observability side, in terms of making sure that we know, okay, if this workload used to take a minute to run and now it takes 5 minutes, what's the part of the system that's suddenly taking longer? What's the part of the system that's suddenly taking more memory? Being able to see what that trend line looks like and where the inflection point was. That's been a big area of focus for us lately, because it's very important that, as we get more and more people contributing to the codebase and more and more moving parts, we identify as quickly as possible when something changes and can go back and address it. So having every single commit go through the full CI process has been critical to that. We also have pretty good policies in place where we make sure we don't commit anything to the mainline if the tests aren't in a good state, and we always prioritize stability and bug fixes above new feature development. All of those best practices, I think, are key to getting it right. It's still something we're learning as we go.
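To make the benchmarking idea concrete, here is a rough sketch of the kind of regression gate such a pipeline might apply in CI. The metric names, baseline values, and tolerances are assumptions for illustration, not Predibase's actual thresholds.

```python
# Baseline metrics recorded from a previous known-good run (illustrative values).
BASELINE = {"runtime_s": 60.0, "peak_gpu_mem_gb": 8.0, "val_accuracy": 0.91}
# Allowed relative change before a run is flagged as a regression.
TOLERANCE = {"runtime_s": 1.2, "peak_gpu_mem_gb": 1.1, "val_accuracy": 0.99}

def check_regression(current: dict) -> list[str]:
    """Return a description of every metric that regressed beyond its tolerance."""
    failures = []
    for metric, baseline in BASELINE.items():
        ratio = current[metric] / baseline
        # For quality metrics, lower is worse; for cost metrics, higher is worse.
        worse = ratio < TOLERANCE[metric] if metric == "val_accuracy" else ratio > TOLERANCE[metric]
        if worse:
            failures.append(f"{metric}: baseline={baseline}, current={current[metric]}")
    return failures

# Example: a commit that makes the workload 5x slower and slightly less accurate.
if failures := check_regression({"runtime_s": 300.0, "peak_gpu_mem_gb": 7.9, "val_accuracy": 0.90}):
    raise SystemExit("benchmark regression detected:\n" + "\n".join(failures))
```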
[00:51:15] Unknown:
And so for individuals or organizations that are looking to accelerate the rate at which they're able to experiment with and adopt machine learning to address some of the organizational and product problems they're trying to solve, what are the cases where Predibase is the wrong choice?
[00:51:33] Unknown:
I mean, that's, I think, a very valid question, and there are definitely times when it might not be the right choice for your organization. When we think about how the market segments, you can kind of think of it as four quadrants, or really two axes. On one axis, you have organizations that have low data versus organizations that have high data. And on the other axis, you have organizations that have high ML experience versus low ML experience. Right? So definitely, the bread-and-butter customer for us would be a company that's very high in terms of data volume and quantity, but not as high in terms of having a big, sophisticated ML team. They can certainly have an ML team, but I wouldn't necessarily say that, like, Google Research would be a target customer.
And then on the flip side, you have organizations that maybe don't have a lot of data at all. Certainly, there are companies out there thinking about ways they can bring ML to companies that don't have data, for specialized use cases, using pre-trained models and things like that. But that's not what we're currently looking at. We're definitely still thinking about companies that have a lot of data and don't quite know how to get enough value out of it. That's very core to what we do well. Right? I would also say that it's important for a Predibase customer to have some variety of use cases that they wanna solve. It's definitely not a prerequisite, but when you look at the market, there are companies that only do fraud detection, or only do computer vision, or something like that. And I wouldn't necessarily wanna say that Predibase is gonna beat all of them all the time on every task. What I would say is that we provide a very good solution for time to value relative to those other platforms, if you have a good variety of different things you wanna do in the ML space.
So certainly, if you wanna do, say, computer vision and NLP, then from a purely cost-benefit standpoint, I think we have a much stronger value proposition than if you were to try to adopt point solutions for each of those things. So that's the other aspect that's maybe less of a hard requirement, but still an important differentiator, I think.
[00:53:53] Unknown:
As you continue to iterate on the product, and now that you have come out of stealth and are starting to accept new customers onto the platform, what are some of the things you have planned for the near to medium term, and are there any particular areas of focus or new features that you're excited to dig into?
[00:54:10] Unknown:
I'm certainly very excited about having a fully SaaS version of the product that people can try out. Right now, we're in a closed beta, so we're certainly really excited when people come to us and say they wanna try it out, and we'll set up some time to do a pilot with them. But I'm very excited about the possibility of having a website people can just log in to and start using, without any commitment. That's something we're working on right now, thinking about how we can put it in people's hands. From a product standpoint, there's also a lot that we're thinking about now. I mentioned the self-supervised learning work before.
There's also some work that we're doing with the open source community around better support for custom components, kind of user-defined functions, if you will. With Ludwig and Predibase, there's quite a lot of flexibility in terms of the degree to which you can specify every parameter of the model. But if you wanna add new model architectures, it's possible today, but we wanna make that experience even easier, so that there's just a very lightweight interface you implement, and then you can register that component as just another option in your config within Predibase that other people in the organization can use.
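A rough sketch of the registration pattern being described follows, assuming a simple name-based registry. The `register_encoder` decorator and `LocaleEncoder` class here are hypothetical names for illustration, not Ludwig's or Predibase's actual interface.

```python
# Hypothetical registry mapping config names to component classes.
ENCODER_REGISTRY: dict[str, type] = {}

def register_encoder(name: str):
    """Make a custom component available by name inside a declarative config."""
    def wrap(cls: type) -> type:
        ENCODER_REGISTRY[name] = cls
        return cls
    return wrap

@register_encoder("my_locale_encoder")
class LocaleEncoder:
    """User-defined component implementing the lightweight interface."""
    def __init__(self, embedding_size: int = 32):
        self.embedding_size = embedding_size

    def encode(self, values):
        # Placeholder for a real embedding lookup.
        return [hash(v) % self.embedding_size for v in values]

# The config can now reference the component by name, like any built-in option.
config = {"input_features": [{"name": "locale", "type": "category",
                              "encoder": "my_locale_encoder"}]}
encoder_cls = ENCODER_REGISTRY[config["input_features"][0]["encoder"]]
print(encoder_cls(embedding_size=16).encode(["en_US", "fr_FR"]))
```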
And then there's also the concept of the model hub slash model registry, which I'm very excited about because it will provide benefits for both the open source users and the commercial users. You can do things like define canonical components that you wanna use in your organization. So if there's a feature that gets used all the time in different models, for example, I remember at Uber we had some features related to customers and locales that were used in all different types of models, you can have canonical encoders for those that are maybe even pre-trained on very large datasets, so there's a very low cost to fine-tuning them. I'm very excited about building out that capability as well.
[00:56:05] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption for machine learning today.
[00:56:19] Unknown:
So definitely, I think there is a very big barrier to adoption that comes from just not having good enough abstractions to start getting value out of machine learning. The analogy I like to draw here is really about software, and what's enabled software to eat the world, as the famous Wall Street Journal article once said. It really comes down to this idea of modularity and being able to stand on the shoulders of giants. Instead of having to reimplement every great new idea that comes along, you just download a library and use that software. I think ML hasn't had this abstraction before, and that's been a very big inhibitor to people actually being able to adopt it: a great new idea comes out of research, but companies aren't able to productionize it and actually get it to deliver value, because they're too busy reinventing the wheel, reinventing the infrastructure, figuring out how to get data from one place to another, and cleaning up their data.
So definitely, I think having better abstractions and better canonical sources of data are the two biggest barriers, in my opinion. Once you get to the point where all the data is clean and sitting in standard data warehouse systems, ready for machine learning, and you have very powerful abstractions like Predibase that allow you to take best-in-class models and run them right on that nice, clean, canonical data source, then you'll have a very fast path to production. We definitely think we can move the needle on the modeling side, and I think companies like dbt, Snowflake, and others are doing a great job on the data side. Once these two things converge, then hopefully we'll be able to really start delivering more value. But that's definitely where I think companies struggle the most today.
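To make the "clean warehouse table plus declarative config" picture concrete, here is roughly what that workflow looks like with Ludwig, the open source framework Predibase builds on. The dataset, file path, and column names are made up for illustration, and the exact return values are sketched rather than guaranteed.

```python
import pandas as pd
from ludwig.api import LudwigModel  # open source declarative ML framework

# Assume the warehouse already exposes a clean, analysis-ready table;
# here it has been exported to a local Parquet file (path is illustrative).
training_data = pd.read_parquet("customer_churn.parquet")

# The entire modeling problem is declared as configuration rather than code.
config = {
    "input_features": [
        {"name": "plan_type", "type": "category"},
        {"name": "monthly_spend", "type": "number"},
        {"name": "support_tickets", "type": "number"},
    ],
    "output_features": [{"name": "churned", "type": "binary"}],
}

model = LudwigModel(config)
train_stats, _, output_dir = model.train(dataset=training_data)
print(output_dir)
```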
[00:58:06] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Predibase. It's definitely a very interesting platform and product that you're building there, and I'm excited to see where you go from here. So thank you again for all the time and energy that you and your team are putting into making it easier for organizations to get onboarded with ML, experiment with it, and gain some of the value from its capabilities. Thank you again for that, and I hope you enjoy the rest of your day. Awesome. Thank you, Tobias. I really appreciate it, and you as well. Thank you for listening. Don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest on modern data management, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you learned something or tried out a project from the show, then tell us about it. Email hosts@pythonpodcast.com with your story.
And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Chapters
Introduction and Episode Overview
Interview with Travis Addair: Introduction and Background
Predibase: Concept and Development
Target Audience and Use Cases
Understanding Machine Learning Problems
Declarative ML vs. AutoML
System Design and Architecture
Evolution and Customer Feedback
Workflow and Integration
PQL: Predictive Query Language
Collaboration and User Roles
Open Source and Ludwig
Customer Use Cases and Feedback
Lessons Learned and Challenges
Testing, Validation, and Stability
When Predibase is Not the Right Choice
Future Plans and Features
Biggest Barriers to ML Adoption
Conclusion and Closing Remarks