Summary
Building a machine learning model is a process that requires well-curated, cleaned data and a lot of experimentation. Doing it repeatably and at scale with a team requires a way to share your discoveries with your teammates. This has led to a new set of operational ML platforms. In this episode Michael Del Balso shares the lessons that he learned from building the platform at Uber for putting machine learning into production. He also explains how the feature store is becoming the core abstraction for data teams to collaborate on building machine learning models. If you are struggling to get your models into production, or to scale your data science throughput, then this interview is worth a listen.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Do you want to get better at Python? Now is an excellent time to take an online course. Whether you’re just learning Python or you’re looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top-notch course for you. If you’re just getting started, be sure to check out the Python for Absolute Beginners course. It’s like the first year of computer science that you never took compressed into 10 fun hours of Python coding and problem solving. Go to pythonpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That’s pythonpodcast.com/talkpython, and don’t forget to thank them for supporting the show.
- Python has become the default language for working with data, whether as a data scientist, data engineer, data analyst, or machine learning engineer. Springboard has launched their School of Data to help you get a career in the field through a comprehensive set of programs that are 100% online and tailored to fit your busy schedule. With a network of expert mentors who are available to coach you during weekly 1:1 video calls, a tuition-back guarantee that means you don’t pay until you get a job, resume preparation, and interview assistance there’s no reason to wait. Springboard is offering up to 20 scholarships of $500 towards the tuition cost, exclusively to listeners of this show. Go to pythonpodcast.com/springboard today to learn more and give your career a boost to the next level.
- Your host as usual is Tobias Macey and today I’m interviewing Mike Del Balso about what is involved in operationalizing machine learning, and his work at Tecton to provide that platform as a service
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what is encompassed by the term "Operational ML"?
- What other approaches are there to building and managing machine learning projects?
- How do these approaches differ from operational ML in terms of the use cases that they enable or the scenarios where they can be employed?
- How would you characterize the current level of maturity for the average organization or enterprise in terms of their capacity for delivering ML projects?
- What are the necessary components for an operational ML platform?
- You helped to build the Michelangelo platform at Uber. How did you determine what capabilities were necessary to provide a unified approach for building and deploying models?
- How did your work on Michelangelo inform your work on Tecton?
- How does the use of a feature store influence the structure and workflow of a data team?
- In addition to the feature store, what are the other necessary components of a full pipeline for identifying, training, and deploying machine learning models?
- Once a model is in production, what signals or metrics do you track to feed into the next iteration of model development?
- One of the common challenges in data science and machine learning is managing collaboration. How do tools such as feature stores or the Michelangelo platform address that problem?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building operational ML platforms?
- What advice or recommendations do you have for teams who are trying to work with machine learning?
- What do you have planned for the future of Tecton?
Keep In Touch
Picks
- Tobias
- Sandman graphic novel series by Neil Gaiman
- Mike
- At Home: A Short History of Private Life by Bill Bryson
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Tecton
- Michelangelo
- sklearn
- Pandas
- Data Engineering Podcast Episode About StreamSQL
- Feature Store
- Master Data Management
- Amundsen
- Jupyter
- Algorithmia
- Unix philosophy
- Feast feature store
- Kubeflow
- Andreessen Horowitz Post On Emerging Data Architectures
- What is a feature store? post on the Tecton blog
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's l i n o d e, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. And do you want to get better at Python? Now is an excellent time to take an online course. Whether you're just learning Python or you're looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top notch course for you. If you're just getting started, be sure to check out the Python for Absolute Beginners course. It's like the first year of computer science that you never took compressed into 10 fun hours of Python coding and problem solving.
Go to pythonpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That's pythonpodcast.com/talkpython. And don't forget to thank them for supporting the show. Your host as usual is Tobias Macey. And today, I'm interviewing Mike Del Balso about what is involved in operationalizing machine learning and his work at Tecton to provide that platform as a service. So, Mike, can you start by introducing yourself?
[00:01:49] Unknown:
Hi. Thanks for having me. Yeah. I'm Mike. I'm the CEO and cofounder of a company called Tecton. Before starting Tecton, I spent a bunch of time working at Google on the machine learning systems that power the ads auction there, and then joined Uber, where I helped start the machine learning team and built out a platform called Michelangelo, which is an end to end machine learning platform. And from that, I worked on a bunch of stuff that really inspired Tecton, which is the company we're working on now, and I'd love to share about that as well today.
[00:02:20] Unknown:
And do you remember how you first got introduced to Python?
[00:02:23] Unknown:
I first got started in Python when I was back in undergrad, and I had been doing a machine learning course actually. And we were trying to predict (this is such a silly project) the stock market using Twitter feeds and the Twitter fire hose, and so we were using sklearn. I'm sure Pandas was around at that time, but we had never heard of it. And I remember my friend just showing me Python and thinking, wow, this is way simpler than all the other languages I had used up to this point. But that project was actually quite funny because we had come up with some good results with this model that we were building. We were trying to do some NLP stuff on tweets and trying to predict, you know, which stocks would go up or down. But it's actually, like, a great example of how building and deploying an ML application typically involves far larger hurdles than just building the ML code itself. And so we had this model that we were happy with, but we had no way to test it in production, no way to hook it up to a production dataset or build, like, the application around it. It was a good example of testing offline performance with no way to validate that it would work online. And so I think, ultimately, we may have won some award for that project, but we never really trusted our methods sufficiently to really put our college money on the line with this thing.
[00:03:46] Unknown:
Yeah. It's definitely interesting how the early generations of data tooling and capabilities were very robust in terms of being able to deliver impressive results, but there was never a very strong focus on how to actually operationalize that and put that into production and have some sort of complete development cycle and feedback loop to make sure that there was the option for being able to make sure that these projects are able to be long lived.
[00:04:12] Unknown:
That's totally true. And it's also something that we hadn't even really recognized that that was actually the problem at the time. It was just like, oh, well, we don't know how to put this in production. But because the tools weren't there and it wasn't, like, a nice path to production provided by the infrastructure that we were using. But, also, that was kind of a silly project. Like, I'm sure that model wasn't even a good model in the first place.
[00:04:37] Unknown:
People here are pretty familiar with the concept of machine learning and being able to use things like deep learning, but there's also the term operational ML that has been floating around lately. And that's 1 of the things that you're focusing on with your company at Tecton and with your prior work on Michelangelo at Uber. I'm wondering if you can just start by describing what is encompassed by that term when somebody says operational ML.
[00:05:01] Unknown:
I mean, operational ML, we think of it as, you know, machine learning built into a production or customer facing application. And it refers to not just the machine learning algorithm, but really to treating the ML system as a complete operational data application. And so a common distinction that can be helpful is this notion of analytic machine learning, where a lot of the use cases where ML exists in the enterprise today are really for kind of internal consumption. So you might have your analytical datasets in your warehouse, and then you're building machine learning models that are used to power internal analyses, deliver insights, or even, you know, internal 1 off forecasts or something like that, and really focused on powering kinda human decision making, human actions. That's really, like, analytical machine learning. We think of that as quite different from operational machine learning, which is really driving production automated decisions and actions in your product. You know, an example of this is like a fraud detection system, real time pricing algorithm, personalization system, you know, product recommendations, ad bidding, stuff like that.
And these operational machine learning systems tend to be battle tested and fully productionized. They power use cases that need decisions now, you know, not in a report next week kind of thing, and tend to be really high stakes and affect the bottom line. So there's, you know, serious software engineering projects that need a different level of care and productionization than, like, a standard ML project you might work on kind of in the lab or kind of to do some 1 off analysis within a company.
[00:06:44] Unknown:
There's a fairly robust set of tooling available for traditional application delivery for somebody who wants to put a web app into production or be able to automate deployment of cloud resources. I'm wondering what the complications are and the additional difficulties are for people who want to be able to bring these intensive applications and real time machine learning workloads into production and some of the complexities and challenges that exist on that path.
[00:07:12] Unknown:
There's a lot of differences. But, ultimately, we should be using those tools. Right? And the tricky thing is that people just aren't, because these applications are not just code that we're writing, but artifacts from some data that we have. So that's a model, right? It's an artifact from running some code on some data, but there are also live data pipelines that power these models, and they need to be monitored in a different way when they're operating in production. And so the kind of artifacts that we're deploying to production are slightly different than standard software engineering projects. But, secondly, there's a totally different set of people, with different skill sets, involved in building and deploying these projects. You have sometimes analysts, but very frequently data scientists who come from a very broad range of skill sets, who are coming up with these models and coming up with the features that power these models. And the productionization process for these models, but also for the feature pipelines that power these models, is quite different.
[00:08:14] Unknown:
The feature pipelines aspect is something that is worth calling out. And I had another conversation with somebody on my other podcast talking about some of the concepts of feature engineering and feature stores. I'm wondering if you can discuss a bit more about the level of importance of that capability for operationalizing machine learning and some of the ways that the definition of features impact the capability of the model, particularly once it gets to production, but also for the teams who are trying to iterate on and experiment on building a model that is going to be useful for their particular goals.
[00:08:51] Unknown:
To kind of recap, you know, features. What are features? They are the signals that power your machine learning, the inputs to your machine learning models. Right? So if I have a recommender system, you know, an important feature might be, has this user purchased an item from this vertical before, right, or something like that. And so these signals are really, like, the critical pieces of input that power our machine learning models, but also, like, affect these machine learning models' performance. And so feature engineering is the process of coming up with these signals, evaluating them, and then there's this element of putting these signals into production, which is a completely separate challenge. And so the concept of a feature store has arisen, and it kind of came from the stuff that we developed at Uber when we built the Michelangelo platform.
We came up with the notion of a feature store, which is really kind of like a central hub for the definitions, transformations, and the data that power these feature pipelines and all of the infrastructure to serve these features in production. So that has become, like, a really critical element there. So maybe we should take a minute to kind of define what a feature store is and what it does. So it's something we developed at Michelangelo. Feature store is really a data system built for supporting the data side of ML workflows. And so what feature stores do is they operate data pipelines that generate feature values. They persist and manage the feature data themselves, and they serve this feature data consistently for training and inference purposes.
It's kind of like a central hub for feature data and metadata across an ML project's life cycle. So how does this affect my workflows? Well, I use this data for kind of feature exploration and to power my feature engineering processes. I use it while I'm iterating on my model and training and debugging my model. When I'm discovering new features, I wanna see what other features are out there. And then there's actually serving my feature values to my models in production. And so what are the core problems that this solves? Really, productionizing feature pipelines. So how do I go from a model prototype that we built offline to a fully operationalized ML application in production extremely quickly? The feature store provides kind of like a path of least resistance to get into production as quickly as possible.
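To make that concrete, here is a minimal sketch of the interface a feature store exposes to a data scientist: register a feature definition once, then read the same feature back as historical values for training and as the latest values for online inference. The class and method names and the in-memory storage are hypothetical stand-ins, not Tecton's or Feast's actual API; a real system backs this with batch and streaming pipelines, a warehouse for the offline data, and a low-latency key-value store for online serving.

```python
import pandas as pd


class ToyFeatureStore:
    """Hypothetical in-memory stand-in for a feature store."""

    def __init__(self):
        self.registry = {}  # feature name -> definition/metadata
        self.offline = {}   # feature name -> timestamped history (DataFrame)
        self.online = {}    # (feature name, entity id) -> latest value

    def register_feature(self, name, description, history):
        """Register a feature and backfill its values.
        `history` columns: entity_id, event_timestamp, value."""
        self.registry[name] = {"description": description}
        ordered = history.sort_values("event_timestamp")
        self.offline[name] = ordered
        # Keep only the most recent value per entity for online serving.
        latest = ordered.groupby("entity_id").last()
        for entity_id, row in latest.iterrows():
            self.online[(name, entity_id)] = row["value"]

    def get_historical_features(self, name):
        """Offline path: full timestamped history, used to build training sets."""
        return self.offline[name]

    def get_online_features(self, name, entity_id):
        """Online path: latest value for one entity, used at prediction time."""
        return self.online[(name, entity_id)]
```

The important design point is that both paths read from the same registered definition, which is what keeps training data and production serving consistent.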
But secondly, it really enables this element of collaboration. Now a data scientist can do this at scale. They can share their feature pipelines, which was previously a really challenging thing to do, and search for and reuse feature pipelines that others have created across the organization. Does that make sense? Yeah. That definitely makes sense. And that's 1 of the common challenges
[00:11:42] Unknown:
that I've heard from people who are working on machine learning is that for data scientists, the iteration cycle is generally very solitary where there's 1 person who's looking at the data. They're exploring. They're experimenting with different parameters. They're changing some of the weights of the features, and they're maybe using different algorithms to generate outputs. And once they have something that works, then they're able to then say, okay. Well, this is what I did. But the actual process of getting from the idea to the delivery, there's not really a lot of opportunity for other people to be able to contribute to that because of the tooling and because of the nature of the work being done.
And my understanding with feature stores, as you mentioned, it helps to open up that possibility for collaboration where rather than 1 person doing all of the discovery and experimentation of different features, then that can be a shared resource. It can be sort of the artifact storage for other machine learning engineers and data scientists to not have to start all the way back at 0. They can actually start at, you know, step 2 and then iterate from there to figure out additional algorithms or weights to apply to those different features for determining what the next iteration of that model is going to look like.
[00:12:59] Unknown:
That's a pretty good description of it. And, you know, it might be helpful to share kind of, like, an anecdote that motivated this from Uber. So, you know, when we were building this Michelangelo system, we had a ton of teams that were trying to put machine learning into production. There were a lot of machine learning use cases in the company, and there were a lot of data scientists who were trying to solve these use cases. We had the standard problems that organizations have: data silos, disconnected tools. There wasn't really a path to production. And so it was really hard for data scientists to get things past the finish line and really get into production. As we were building Michelangelo, we kind of, like, had this internal focus on outcomes. Like, does the project actually get into production and launch on time, end to end, or not? Not really, did a model get built or did another revision of a model happen, but did this thing get into production and deliver business impact? And so, ultimately, the Michelangelo system had a pretty big impact on Uber. So today, there's tens of thousands of models in production.
We went from 0 to, you know, tens of thousands of models in about 2 and a half years. And so some of the things that we did uniquely well there were really about our focus on the machine learning application as a whole, not just the models. So, for example, getting, like, a recommender or an ETA system into production wasn't just about building and deploying a model, but building and deploying the feature and data pipelines that supported this model as well. And so the data side of things tends to be some of the harder problems to solve, but we really solved this with a feature store. And so, you know, we did this by making it as easy as possible for data scientists and engineering teams to get ML applications into production as quickly as possible. This is where we had, like, a 1 click deploy kind of thing. And that would deploy your model and set up all the production pipelines to calculate the right features either in batch or in real time and feed that data to your model at prediction time. That was kind of element 1. Like, how do you get this stuff into production? But the second thing that really kinda led to a Cambrian explosion of machine learning at Uber was making it as easy as possible for teams to reuse the work of others.
So what ended up happening was this would allow everything that someone has built that ends up going into production, whether it be a feature or a model, to be reusable by someone else on the team or across the organization. So when I, as a data scientist, am building, you know, the new fraud model, fraud model number 6, solving fraud in a specific decision making area, I'm not starting from scratch. I can say, there's all of these signals that have already been defined, productionized, and vetted in the organization. I can go and use this common library of features.
So I can pick out, you know, the features I wanna get started with: all of the user features, all of the transaction features, all of the recent activity features. So I can hot start my modeling efforts with a bunch of signals that basically get me 90% of the way there. And that really lowered the barrier of entry to get new models built. And then also these systems had a really fast path to production, and so that allowed for scaling machine learning really quickly across the organization. Like, the rate of new models getting into production really took off then.
This was really kind of driven by this concept of having a feature store. It allowed teams to register these features in a standardized way, have all of the regular calculation of these features be orchestrated and handled by the platform, and then have everything be linked up properly into production. So when you register a feature, it's automatically productionized. And it's really impactful because that solves some of the workflow challenges that we saw at Uber, and that we see with companies that we work with all the time today, where it's not really clear who owns what part of the ML workflow between data scientists and engineers.
And we often see data scientists looking to engineers to productionize different features or different models that they build. Right? And that requires the data scientists to basically submit a ticket or a request to the engineers for every single modification they wanna make to production. And what we were really focused on doing in the Michelangelo system, and what feature stores accomplish on the data side, was how do we decouple those things so data scientists, the modelers, can operate at their own speed and iterate on their own, either offline or in production for experimentation, in a way that doesn't require other people to be in that core iteration loop. Right? It doesn't require me to wait for someone to build a whole bunch of pipelines that are going to deliver this new data to my model in production. It doesn't require me to send a request to the engineering team and wait around for a couple weeks for them to fulfill it.
That's really powerful and impactful for these data science teams and, frankly, empowering for them, because now they're able to own their work end to end. Right? They're responsible for their models that are in production because they're the ones who deployed them to production. It's also turned out to be super impactful for the engineering teams that they were working with, because the engineering teams aren't playing this role of, okay, I'm just fulfilling requests from the data science teams and rewriting Python code into Java or something like that, you know, just productionizing code that I don't even understand. And we talk to a lot of teams where the engineering teams are kind of being overwhelmed. You know, their charter is to support data science teams across the organization, and they are in a position where they're being bombarded by these requests to put something in production, to help operationalize something. They feel like they're taking on tech debt in every single iteration. And so I talk to companies that say, look, we have 5 models in production. There's no way we even wanna take on number 6. We don't even wanna build number 6 because this is gonna be too much for us. And so providing the right tools that enable this workflow, that allow data scientists to go end to end with their work and allow engineers to support them in a well defined way, really brings the right patterns to allow these teams to work together and allows engineers to scale as resources supporting data science teams. So that was, like, 1 of the most impactful parts of bringing in the concept of a feature store, developing it at Uber, and what we're seeing with some of our customers as well. It allows the engineers who are tasked with, you know, supporting machine learning and building an ML platform for the company to actually support a ton of machine learning use cases rather than being like a miniature professional services group for their internal customers.
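As a sketch of the reuse workflow described here, this is roughly what hot starting a new fraud model from already-registered features might look like. The registry object and its get_feature_values method, and the feature names themselves, are hypothetical stand-ins for whatever feature platform holds the shared definitions; only pandas and scikit-learn are real libraries here.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Signals someone else already defined, productionized, and vetted (made-up names).
reusable_features = [
    "user_account_age_days",
    "user_purchases_last_7d",
    "txn_amount_zscore",
]


def hot_start_training_frame(registry, features, labels):
    """Assemble a training frame from existing shared features instead of
    building new pipelines. `labels` columns: entity_id, is_fraud."""
    frame = labels
    for name in features:
        # Hypothetical call: returns columns entity_id, <feature name>.
        feature_df = registry.get_feature_values(name)
        frame = frame.merge(feature_df, on="entity_id", how="left")
    return frame


# Usage sketch: "fraud model number 6" starts from shared signals, not from scratch.
# train = hot_start_training_frame(registry, reusable_features, labeled_events)
# model = LogisticRegression().fit(train[reusable_features], train["is_fraud"])
```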
[00:20:00] Unknown:
Yeah. Having that clear handoff point and the delineation of responsibilities is absolutely a critical piece of being able to allow teams to interoperate. Because as you said, if the data engineers have to be able to redo everything that the data scientists are doing, then you're just duplicating effort for very little gain. I kind of think about the use of the feature store in some ways along the same lines of what master data management does for business intelligence, where you have this common set of definitions that everybody can look to and understand the semantics around a particular set of data and be able to then use that effectively rather than having 5 different versions of the customer ID or, you know, 5 different ways of calculating the, you know, annual revenue that don't all agree with each other and so that nobody really knows what the canonical representation is supposed to be, where the feature store kind of does that for machine learning, where everybody can coalesce around the canonical representation of a particular feature and not have to try and reinvent it every time.
[00:21:05] Unknown:
I think that's true. And I'm just recalling a conversation I had with someone who was building, like, a metrics platform at 1 point, when I was building the feature store in Michelangelo, talking through what each of us was building. And we were like, wait, I do that too. There's a lot of overlap and a lot of similarities in what each of us is building, and it reminded me of that meme of Spider-Man pointing at the other Spider-Man. Like, hey, you're doing the same thing as me. We ended up diving into that conversation a little bit deeper and also identifying some of the differences. And some of the important differences that we think of in a feature store are that the feature store is not just a metadata catalog. So it's not just something like Amundsen, where it provides a lot of good stats about what data is there and how you can use it. The feature store also operationally runs these data flows, has your data, and delivers your data to the right place at the right time. This could be using this feature data to generate a training dataset in your model development process.
And there's a whole bunch of special things that, you know, feature stores do to make that a really good experience for data scientists, like kind of built in time travel to ensure you can find the exact right value of a feature at a specific point in time. But, also, serving these feature values and monitoring these feature values in production, and calculating these feature values either in batch or in real time. So the core components of these feature stores are really, like, this transformation component, feature storage, feature serving, monitoring, and then a central registry, which is kind of like that metadata catalog, which says these are the features, these are their definitions, these are their owners, this is how they're defined.
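The time travel piece deserves a concrete example: when you build a training set, each row should only see feature values as they existed at that row's timestamp, never values from the future. Assuming the feature history and the labeled events are plain timestamped DataFrames, pandas' merge_asof does this kind of backward-looking, point-in-time join for a single feature; a feature store automates the same idea across many features at production scale.

```python
import pandas as pd

# Labeled prediction events: when each prediction was (or would have been) made.
events = pd.DataFrame({
    "user_id": ["a", "b", "a"],
    "event_timestamp": pd.to_datetime(["2020-10-01", "2020-10-03", "2020-10-05"]),
    "label": [0, 0, 1],
})

# Historical values of one feature, as they were recorded over time.
feature_history = pd.DataFrame({
    "user_id": ["a", "b", "a"],
    "event_timestamp": pd.to_datetime(["2020-09-28", "2020-10-01", "2020-10-04"]),
    "purchases_last_30d": [2, 7, 3],
})

# Point-in-time join: for each event, take the most recent feature value at or
# before that event's timestamp (direction="backward"), never a later one.
training_set = pd.merge_asof(
    events.sort_values("event_timestamp"),
    feature_history.sort_values("event_timestamp"),
    on="event_timestamp",
    by="user_id",
    direction="backward",
)
print(training_set)
```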
And so at Tecton, we're taking a very composable, pluggable approach, which is similar to some of the lessons we learned building Michelangelo, actually, which might be interesting to share. But the goal here is, you know, we want someone to be able to pick and choose which parts of this they need, with Tecton coming with built in, almost best practice, data management workflows for machine learning applications. But if you have your own transformation system or your own serving layer, we should plug right into that. We are building this system to reuse as much of your existing data infrastructure as possible and just integrate right in. And so I was just mentioning, you know, this kind of composable, pluggable approach that we converged on at Michelangelo. And, you know, it's worth mentioning that when we built Michelangelo at first, we really built this kind of monolithic end to end ML platform. It's 1 system that does your feature engineering, your training, your model evaluation, deployment, monitoring, and production.
And the tricky thing there was every data science use case has different requirements. And when we coupled all of those elements together too tightly, it led to some challenges where, you know, maybe 1 of those stages wasn't the perfect solution for that use case, and then the whole solution was kind of a take it or leave it, all or nothing kind of thing. And so we, over time, had a bunch of teams that would say, hey, look, I wanna use all of this, but I'm working on self driving cars, and I deploy my model in a totally different way. So I wanna bring my own serving system or pick and choose different parts. So over time, we actually defined the interfaces really well between the different components and treated those components like different building blocks that could be used to put together production ML applications.
And so this would be like the Michelangelo training system, serving system, feature store, and a number of elements like that, where you could pick and choose. You could say, hey, I wanna use all of these; it's just a very powerful toolkit. Or, I just wanna use this 1 thing because I'm doing a very special kind of, you know, unique research thing, and I only need to reuse your distributed training system. And so that's the approach that we're bringing with the feature store also, as there are a number of components within the feature store: calculating your features in real time and in batch, serving features, storing them, all those things I just mentioned. And teams have problems in different areas, so they sometimes need some of these components and sometimes don't need others. And so it's really important to be maximally pluggable, to fit right into their existing infrastructure, and to be almost, like, gradually adoptable as well. So that's a core design principle as we build out Tecton and deploy it with our customers.
[00:25:35] Unknown:
The discussion of moving into production is also interesting because production for 1 person might mean, like, as you said, a self driving car, where it might be running in an embedded context, and there might be limited or no connectivity for being able to pull in additional information as it's executing. Production for another system might mean being embedded into a web application with an API that responds to requests, where it's doing real time evaluation within the machine learning model for those HTTP calls. Whereas for another system, it might mean 2 different machine learning programs talking to each other, maybe doing generative adversarial networks or something like that. And so I'm curious how those differences in deployment models factor into the overall conception of how to operationalize machine learning and deliver it into that production environment, and then how to do things like monitoring or metrics gathering to understand when to deploy a new version of a model or redo the training because of model drift, or understand when a model is not fulfilling its intended goals and you need to retire it and build something else?
[00:26:44] Unknown:
And the actual answer is that no one's figured this out fully end to end right now. No one's figured out, like, the 1 system that is the perfect solution for every ML use case. So what the industry is converging to is kind of best in breed components for all of these different building blocks that I was just talking about. So, you know, you have a common set of components that is also used when you're doing, like, ML research. And so this is, you know, your notebook platform, you know, Jupyter, Colab, Databricks, and experiment tracking, and maybe model tuning, and your deep learning framework, your distributed processing framework. Right? And then there's a separate part of your stack that tends to be used more for the deployment and operationalization: a feature store to productionize your feature pipelines and centralize your definitions, you know, a model registry to organize and manage those models, serve them, and then model monitoring. There are more components that are kind of more specialized, but what I think we're going to see over time is different blueprints for different modes of deploying machine learning operationally. There are, like, IoT use cases. There are kind of centralized cloud based real time prediction use cases. There are different kinds of embedded use cases, batch use cases.
And they all have different requirements. And so it's, 1, really scary to go all in on, like, a single end to end platform right now. But, 2, I think we're gonna see more best of breed solutions focusing on single components. So, you know, we're gonna see people, like, focusing just on model serving, let's say. But then we're gonna see these blueprints emerge for these common use case types that say, hey, this is the stack. If you wanna do a recommender system, you're gonna use, you know, Algorithmia to do your model serving, you know, Feast or Tecton is a great feature store to power that, and you might do your training in another system. And so these kinds of blueprints are probably the right way, because we don't really have these kinds of well defined stacks today, and the right stacks are kind of emerging over time. What we are seeing is, like, specialization at the individual component layer right now. Does that make sense?
[00:29:03] Unknown:
Yeah. That definitely makes sense. And that's a trend that I've been seeing in other areas of the data landscape as well, where people are moving away from monolithic solutions that are fully vertically integrated for a particular industry or use case and instead leveraging best of breed technologies that are swappable as different components either improve or get replaced by the next generation. And I think that that kind of plays into a lot of the original concepts of how software systems should be built using the UNIX philosophy of doing 1 thing and doing it well and having very well defined interfaces.
[00:29:49] Unknown:
Totally. And so I mentioned, you know, that was a core thing, a core realization and kind of journey for us building and operating Michelangelo over a number of years. Interestingly, when we started building Michelangelo, there really just weren't good tools for this. It makes sense that in different domains, you see kind of these monolithic systems come up at first and then the specialization over time. And there was certainly, like, a ton of internal demand for different levels of specialization and modularity that we encountered. So I think that that's, like, a trend that you will continue to see. And I think, you know, today, if there were another company like Uber starting to figure out how to do ML for the company, there's no way that it would make sense to, 1, build a system completely internally, and, 2, build a monolithic system. It's a pretty tricky ROI calculation today to invest in building a whole stack on your own.
And what we see is, you know, now that there's enough momentum behind these specialized tools, they're moving extremely fast. So even if you are building something on your own, you're not gonna outpace a company that's specializing just in model serving or feature stores or monitoring or whatever it is. The kind of, like, ROI calculation on internal implementations for a lot of this stuff just isn't there. And so what we see a lot of teams starting to do, and this is what I recommend for a lot of companies that come to us just starting out and trying to figure out, you know, how do we do our ML infrastructure, is to find the best in breed components that make sense for your use case. Figure out what components you need for your use case and work with the best of breed. Make them all plug into 1 another rather than align on 1 end to end system whose inflexibility you might be hostage to in the future and that could be extremely costly to migrate off of. I really prioritize pluggability, extensibility, that kind of attitude towards putting an internal solution together today.
[00:31:52] Unknown:
With all the discussion about using machine learning and putting it into production and the utility of deep learning or artificial intelligence models, it's easy to forget that there are a lot of companies who have yet to even start down the path of trying to build their first model, let alone put it into production. And in your work at Uber and with Tecton, I'm wondering what you are seeing as some of the common challenges that organizations or enterprises are facing in trying to understand the potential benefits of machine learning and the steps necessary to be able to actually bring those capabilities into a production context to be able to realize the value?
[00:32:33] Unknown:
I think a lot of this stuff is not surprising. Right? So there's a talent problem where it's really hard to find people who really know machine learning well, and especially people who have engineering skills and the kind of data science and analytics skills at the same time. It's just super hard for a lot of organizations to find those folks, hire them, and make use of them in their organization. There are also kind of organizational level challenges where even if you have the right person who knows how to build these systems, just getting internal alignment and convincing that executive that, hey, this is something we can actually do, and it'll be a huge opportunity if we invest in this, that's really tricky. I think we had a really fortunate experience building out Michelangelo at Uber, where the leadership was a 100% bought into the strategic importance of machine learning to the company. And there are certainly companies that buy into the strategic importance of machine learning, but then don't know how to allocate resources to make that successful.
And then there's kinda like the technical component, which is, you know, you definitely need the ML infrastructure, the ML platforms, to support the data scientists in building these things, but you also need the right data platforms to power these data science applications. Right? Your data scientists need to be able to find the right data, access the right data, and then get that data available in the right place, both when you're building your model and when you're productionizing it. So there's a lot of technical challenges involved along the way. And the trickiest part is that these organizations don't just have 1 of these. It usually will be an organizational problem kinda mixed in with the technical problem. And the technical solution is somewhere in the organization, but you just have to convince that person in the other organization that they should invest a little bit of time into collaborating with you on this thing. And there's just all these little frictions here and there that make this new technology quite challenging to put into production. And so this is the kind of stuff that, like, slows down ML projects quite a bit. And the very common mistake that we see is teams investing in tools that will help them build a model offline and thinking that that's the finish line, and then coming to the point of, alright, let's put this in production, and running into all kinds of challenges to actually get their ML into practice. So, like, a lot of challenges all around.
[00:35:06] Unknown:
As you mentioned, there are a number of different capabilities that are necessary where for building a web application that you wanna be able to experiment with, you can throw a team of 2 or 3 people at it. They can have an MVP up in a couple of months, and you can start iterating on that from there. Whereas for machine learning, as you mentioned, there is a whole bunch of upfront work that needs to happen before you can even start working on the project where you have to have access to the data. You need to make sure that it's reliable and cleaned and that you can discover it and that everybody understands what its semantics are before you can even start to think about writing the code to begin on that model. And then, as you said, being able to bring it into production. I'm wondering what you've seen as some of the common pitfalls or, you know, specific stumbling blocks that people run into when they are trying to go down that path, either because they don't have the existing data infrastructure in place or even if they do have it, they don't necessarily have the appropriate scope for understanding how to go all the way through to that finish line.
[00:36:09] Unknown:
Yeah. Some of the, like, interesting things that hold back the success of a lot of these projects have come up. Like, 1 thing we noticed is that in a lot of cases, the production environment imposes, like, pretty limiting constraints on what a data scientist can actually do. So, you know, our production environment doesn't have this data in it. So if you're gonna build a model that depends on, you know, user data or something, we just don't have that data in production, so your model is never gonna be able to get that data there. It's just not there. And so then the engineering teams are telling the data science teams, don't build models that use the user data, as an example. And it's just kinda backwards to have the engineering or the production teams deciding or imposing really significant constraints on the data science groups, which really should be defining what the behavior of the application is in production. Right? Like, how does my model make decisions? It should go the other way around. Right? The data science team should be saying, we want our model to make decisions in these ways. It should be able to make use of this kind of data, recent activity data, user data, transaction data, etcetera, and they should be able to convey that to the engineers or be able to deploy that on their own without worrying too much about, like, the production infrastructure. And so there are a number of elements of that, where there are, like, weird constraints that come in in different ways, and then the data science teams kind of have to work around that. Sometimes they don't even recognize that they're taking in these constraints in their design processes.
But they know that there are a lot of ways that they could build much more accurate models offline and deliver a lot more value if they were just able to get these models into production. So it's not uncommon that we see data science teams where they're on, you know, version 11, version 12, version 20 of their model, and they've put in requests to their partner engineering teams for the requirements: you know, this model needs this dataset in production, or this feature, or, you know, this package, this framework in production. And the model that's actually running in production was deployed 18 months ago, and it's still, you know, the version 2 model.
And so that's why it's so important to have a really tight collaboration between the data science groups and the engineering groups, and also to empower the data scientists to own their work end to end. Again, across the industry there are a lot of tools trying to, like, help build up to that point. That makes sense.
[00:38:57] Unknown:
Python has become the default language for working with data, whether as a data scientist, data engineer, data analyst, or machine learning engineer. Springboard has launched their School of Data to help you get a career in the field through a comprehensive set of programs that are 100% online and tailored to fit your busy schedule. With a network of expert mentors who are available to coach you during weekly 1 on 1 video calls, a tuition back guarantee that means you don't pay until you get a job, resume preparation, and interview assistance, there's no reason to wait. Springboard is offering up to 20 scholarships of $500 towards the tuition cost exclusively to listeners of this show. Go to pythonpodcast.com/springboard today to learn more and give your career a boost to the next level.
In your work at Tecton, I'm curious what were some of the lessons that you learned in your prior work with Michelangelo at Uber that informed your overall approach to the design of the product itself, but also the overall user experience and the baseline functionality that you wanted to be able to target in this new product?
[00:40:04] Unknown:
I think there are kind of 2 types of lessons we got from that. There are certainly a lot of lessons about what kinds of capabilities are really powerful to help teams put ML into production and organize their work. And so, you know, examples of this are: it's really important to have very clear and easily accessed definitions of features, make features reusable, have the right monitoring, be super easy to apply all of the features, and have a really good way to surface any implications or consequences of any changes you're making to any features. So if I wanna change a feature, I should know if it's gonna affect any models. These are, like, tactical or practical things where we learned about specific capabilities that are really important to have in a feature store or an enterprise feature store. But on a little bit more of a meta level, you know, we certainly made a lot of mistakes building Michelangelo. We got a lot of things wrong before we got a lot of things right. And 1 of the things that became a principle that we stuck to developing this internal platform over time was: every new capability that we're building, we have to have someone, like a customer, that's directly asking for it and that can give us concrete requirements. So this is kind of a general platform product management kind of thing, but that was super helpful in ensuring that we were able to kinda insulate ourselves from the executive that was saying, you know, it'd be cool if you guys did this, or the product manager who would kind of have their own ideas and go off coming up with their own specs.
And always focusing on a specific customer really guided us with concrete requirements. And so now, bringing a new kind of product to market, there's definitely an interesting balance between building something to your vision and building something that is based on listening to exactly what customers are asking for. And the interesting thing here is that productionizing machine learning is very new for many people. A lot of people don't really know how machine learning is properly productionized, and so it's hard for them to provide requirements. They know they have this problem, but they don't know how to describe the solution. They don't know exactly what their requirements are. And so there's a mix between bringing in lessons from a previous life and sticking to, like, kind of core product development principles of working with the customer and listening.
[00:42:52] Unknown:
Are there any additional pieces of advice or recommendations or reference material that you recommend?
[00:42:58] Unknown:
Yeah. I think there are kind of 2 pieces of advice that I commonly go back to when talking to platform teams that are just getting started. And the first thing is to start simple. This is just common advice for every machine learning project, whether it's an individual modeling project or an infrastructure thing, and it's probably super valid beyond that. But in machine learning especially, it's super easy to jump ahead and think that you need deep learning for something, or focus on, like, a much more complex solution than you might need at first. And in almost every case, it's always better to get the simplest solution running end to end first before you invest in a more complicated solution. And I've seen so many projects kinda go awry by not focusing on that principle.
And the second element is don't reinvent the wheel. When we built Michelangelo, there just weren't a lot of good tools out there, so we ended up having to build a lot of this stuff. But today, there's not really a need to staff up a 35 person ML infrastructure team and have them work for a couple of years to rebuild a bunch of stuff that you can get off the shelf from vendors or from open source. And as we discussed, teams are specializing in these different areas, and these tools are moving way faster than a generic internal platform can keep up with. So I really encourage teams who are taking on these ML infrastructure and general ML in production challenges today to reuse as much as possible of what's out there. And before you build internal infrastructure, talk to the folks who have already taken on these problems.
There's likely a lot more complexity than is obvious on the surface that is solved by some of these existing solutions.
[00:44:42] Unknown:
And are there any particular projects or overall industry trends that you're keeping an eye on in the ML and operationalizing it space?
[00:44:51] Unknown:
You know, I spend a lot of time focused on the feature store area, and the most dominant open source feature store is Feast. So that's a really interesting, great community over there. I think it's a component in Kubeflow, and it's something that has a lot of similarities to a lot of the stuff that we built internally at Uber. And I like the design there, and I like their approach. So that's 1 of the projects that I've been following quite closely, specifically in the feature store space. And, you know, Tecton is also, like, an enterprise feature store, which I obviously encourage folks to check out. On operational ML more broadly, I think 1 thing that is quite interesting is a blog post that the Andreessen Horowitz folks published maybe last week, you know, in the mid October time frame, which is called something like emerging data architectures or a guide to emerging data architectures.
And, Tobias, I don't know if you've interviewed any of those folks before, but it contains, like, these really interesting blueprints that show how these components most commonly plug in together, kinda like clusters or archetypical infrastructures. And that's something that's quite illuminating, I think, for a number of people, because it will show, you know, stuff that will be very familiar to you and then data architectures that represent use cases that you're not super familiar with. And so there's a lot to learn there. I recommend checking that out. Yeah. That was definitely a really great post, and I appreciated, as you said, the way that they laid out the kind of sections of the overall architecture
[00:46:28] Unknown:
and where different concerns fit within the life cycle of the data. It's something that is very difficult to put together concisely, and I think that they did 1 of the best jobs I've seen of being able to actually lay that out in a way that's understandable without being overly simplistic.
[00:46:43] Unknown:
I'm sure they were working on that for a long time, because it's a long post and there's a lot of work that must have gone into it. So I think they did a really good job. Yeah. I think at the end of the post, they said something about having worked on it for about 6 months. So I'll definitely have to try and have them on for 1 of my podcasts.
[00:47:02] Unknown:
And so as you look to the future of the work that you're doing on Tecton, what are some of the features that you have planned for it, or the direction that you wanna take the business?
[00:47:11] Unknown:
Yeah. So, Tecton is a cloud service. We are on AWS today, and we are gonna be on GCP and Azure next year, so that's a really important project for us. And we are going to have an open source announcement soon, which will be a very big deal for us and the business as well. So those are kind of 2 really big things coming up for Tecton. I do recommend that if anyone is, you know, interested in learning more about feature stores generally, we have a blog post called What is a feature store? that just walks through what problems a feature store solves, what the components of a feature store are, and how that helps solve some of these core problems and really transform the workflows for putting machine learning into production. And you can find that on tecton.ai/blog, I think.
That's actually coauthored with the folks who made Feast, which is that open source feature store I just mentioned a moment ago. And so that's probably a pretty good resource for people to check out if they're interested in feature stores generally.
[00:48:16] Unknown:
Are there any other aspects of the overall workflow of operationalizing ML and bringing it into production, or the concept of feature stores and what you're building at Tecton, that we didn't discuss that you'd like to cover before we close out the show?
[00:48:30] Unknown:
I think we talked a lot in this conversation about, you know, how feature stores are these pipelines that deliver feature values to your models. And, you know, we really see feature stores as the heart of the data flow for an operational ML application. So, you know, while feature stores generate feature data and serve feature data for predictions, there are other really important data flows in operational ML applications, like logging the feature values that are served, and monitoring and logging various operational metrics or data quality metrics that can be used for offline analyses or for, you know, live monitoring.
And in the future, we think there are gonna be additional data flows, like tracking cost analytics and surfacing metrics about feature usage and the value of different features, that are all going to live within this concept of a feature store. So we really see it as kind of like the data hub for an operational machine learning application. I think it's just worthwhile to think of feature stores as occupying that space.
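As one small sketch of that logging data flow: wrap the online serving call so that every served feature vector is recorded with a timestamp, which is what later lets you compare what the model actually saw in production against the offline training data (training/serving skew, drift, and so on). The store object and its get_online_features call are hypothetical stand-ins for a feature store client.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
feature_log = logging.getLogger("served_features")


def serve_and_log(store, feature_names, entity_id):
    """Fetch online feature values and log exactly what was served."""
    values = {
        # Hypothetical feature-store call returning the latest value per feature.
        name: store.get_online_features(name, entity_id)
        for name in feature_names
    }
    feature_log.info(json.dumps({
        "entity_id": entity_id,
        "served_at": datetime.now(timezone.utc).isoformat(),
        "features": values,
    }))
    return values
```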
[00:49:33] Unknown:
For anybody who does want to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'm going to move us into the picks. And this week, I'm going to choose the Sandman series of graphic novels by Neil Gaiman. It's a really great series, and it's interesting because each of the different entries is drawn by a different artist. So it has a lot of variance in the visual style, but the storytelling throughout is just great. It's an excellent story. They actually turned it into an audiobook recently, which I'm excited to listen to. So definitely recommend checking that out if you're looking for something to read. And so with that, I'll pass it to you, Mike. Do you have any picks this week?
[00:50:11] Unknown:
For me, I'll recommend a book called At Home. I think it's by Bill Bryson, and it's called At Home: A Short History of Private Life. It's a history book that, unlike other history books which might focus on some very important moment in history or very important people, focuses just on, like, what the average person's life was like over time. And this is just a book that I recommend to different people all the time, because it was just super fascinating for me learning all the different nuances of what it was like to just be at home as a normal person, as the concept of a home changed throughout history.
[00:50:49] Unknown:
So, I recommend people check that out as well. That's At Home by Bill Bryson. I'll definitely have to take a look at that 1. Well, thank you very much for taking the time today and sharing your experience of building feature stores and helping to operationalize machine learning pipelines. It's definitely a very interesting and challenging area of work. So I appreciate all the time and energy you've put into that, and I hope you enjoy the rest of your day. Thanks, Tobias. Take care. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Mike Del Balso's Background and Introduction to Python
Operational ML and Its Importance
Challenges in Bringing ML to Production
Feature Engineering and Feature Stores
Michelangelo at Uber and the Role of Feature Stores
Feature Stores vs. Master Data Management
Different Deployment Models for ML
Common Challenges in Adopting ML
Lessons from Michelangelo and Building Tecton
Advice for ML Infrastructure Teams
Future Directions for Tecton and Feature Stores