Summary
Netflix uses machine learning to power every aspect of their business. To do this effectively they have had to build extensive expertise and tooling to support their engineers. In this episode Savin Goyal discusses the work that he and his team are doing on the open source machine learning operations platform Metaflow. He shares the inspiration for building an opinionated framework for the full lifecycle of machine learning projects, how it is implemented, and how they have designed it to be extensible to allow for easy adoption by users inside and outside of Netflix. This was a great conversation about the challenges of building machine learning projects and the work being done to make it more achievable.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- This portion of Python Podcast is brought to you by Datadog. Do you have an app in production that is slower than you like? Is its performance all over the place (sometimes fast, sometimes slow)? Do you know why? With Datadog, you will. You can troubleshoot your app’s performance with Datadog’s end-to-end tracing and in one click correlate those Python traces with related logs and metrics. Use their detailed flame graphs to identify bottlenecks and latency in that app of yours. Start tracking the performance of your apps with a free trial at datadog.com/pythonpodcast. If you sign up for a trial and install the agent, Datadog will send you a free t-shirt.
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host as usual is Tobias Macey and today I’m interviewing Savin Goyal about Netflix’s infrastructure for machine learning
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing the work you are doing at Netflix to support their machine learning workloads?
- How are you addressing the impedance mismatch of machine learning/data science work between local experimentation and production deployment?
- What was the motivation for building Metaflow?
- How does Metaflow compare to other tools in the ecosystem such as MLFlow?
- What was missing in the other available tools that made Metaflow necessary?
- workflow for someone using Metaflow
- How do you approach the design of the developer interface to make it approachable to machine learning engineers?
- level of coupling with overall Netflix data stack
- How is Metaflow implemented?
- How has the architecture and design of the system evolved since you first began working on it?
- supporting infrastructure/integration points
- motivation/benefits of releasing it as open source
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while building infrastructure and tooling for machine learning?
- When is Metaflow the wrong choice?
- What do you have planned for the future of Metaflow?
Keep In Touch
- @savingoyal on Twitter
- savingoyal on GitHub
Picks
- Tobias
- Savin
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Metaflow
- OCaml
- EC2
- S3
- Data Lake
- PyTorch
- Tensorflow
- Netflix Data Stack
- Spinnaker
- Chaos Engineering
- Chaos Monkey
- Netflix Simian Army
- Netflix Titus
- AWS Batch
- Netflix Meson
- Dataflow Programming
- DAG == Directed Acyclic Graph
- MLFlow
- DVC (Data Version Control)
- CML (Continuous Machine Learning)
- AWS Step Functions
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try out a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode today. That's L-I-N-O-D-E, and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Savin Goyal about Netflix's infrastructure for machine learning, including his work on Metaflow. So, Savin, can you start by introducing yourself? Hi. I'm Savin, and I'm super stoked to be here at this podcast today. I'm an engineer on the machine learning infrastructure team at Netflix and,
[00:01:13] Unknown:
excited to talk about some of the work that my team has been doing over the last couple of years. And do you remember how you first got introduced to Python? Oh, yes. So, actually, my very first introduction to programming surprisingly was when I, went to university. So Python was, I think, the 3rd language that I got introduced to. I started off, learning functional programming with OCaml, moved to Java, and then I moved to Python.
[00:01:41] Unknown:
So that was kinda like my beginning to Python. That's interesting that OCaml was your introductory language. It's not something that I typically hear from people.
[00:01:49] Unknown:
Yes. Yes. Indeed. And, I'm actually, you know, really thankful that I started off with OCaml because, that really helped me to build a functional mindset when it comes to just programming.
[00:02:02] Unknown:
And as you mentioned, you're working on the machine learning infrastructure team at Netflix. So I'm wondering if you can just start by describing a bit about the work that you're doing there to support their machine learning workloads and some of the challenges that they're facing.
[00:02:15] Unknown:
Sure. So, you know, Netflix uses machine learning in a whole variety of different ways. Our recommendation systems are kinda like our crown jewel when it comes to our machine learning efforts. And as you can imagine, when you are doing any sort of data science at that scale, there is an increased need for engineering, not just in terms of putting those machine learning models in production, but also in terms of prototyping those machine learning models in the very first place. And that's where my team steps in. So we built the infrastructure so that data scientists and machine learning practitioners at Netflix can very easily prototype their ideas. They can interact with huge amounts of data. They can build many different types of models. And once they deem that certain models are promising, then they can very easily roll them out into production, see how they are behaving, and if there are, like, any further changes that they need to make. So all of these activities at the end of the day, they need a good amount of infrastructure,
[00:03:17] Unknown:
and that's essentially what my team tries to do. And as far as the infrastructure, are you talking about things like the underlying EC2 instances that they're running on, or some of the tooling that they rely on? I'm just curious what the overall scope is of the types of infrastructure that you're providing. Sure. Excellent question. You know, when when we talk about machine learning infrastructure,
[00:03:38] Unknown:
it's a very thick stack of infrastructure at the end of the day. At the lower levels of the stack, you have these primitives around storage, compute, networking, so, you know, EC2 instances, using S3 as your data lake, so on and so forth. And then as you sort of, like, move up the stack, there are a lot of other elements that you sort of, like, add into the mix. Right? If, let's say, you are orchestrating your jobs or if you're running your jobs on EC2 instances, then you need to have a way to orchestrate that compute in the first place. So some sort of a job scheduler, so to speak. Right?
Then if you are interested in running machine learning workloads, then given that machine learning is a very iterative process, you need some capability where, you know, users have some notion of semantic versioning of their workflows, their training processes, their models. And, ultimately, towards the top of that stack, you have tools that directly interface with data scientists, like, you know, IDEs, for example, Jupyter Notebooks or RStudio, or algorithm implementations when it comes to, say, TensorFlow, PyTorch, or feature stores, for example. So when it comes to just the scope of machine learning infrastructure, there are a lot of different pieces, a lot of different moving pieces, so to say. And, uniquely, my team's viewpoint has been that there has been an incredible amount of work that a lot of excellent people have been doing in this community. And we want to make sure that we are able to pick the best of the tools and sort of like provide them in an end to end package so that our data scientists, they can spend their time and effort focusing on data science while we can place our opinions on the lower levels of the stack.
More specifically, you know, how to launch your compute on EC2 instances, how to store your data on S3, how to do versioning. These are some of the things that the data scientists would much rather have the platform take care of, while when it comes to what sort of machine learning can I do, what sort of algorithms can I use, those are the kind of opinions that the data scientists would want to exercise on their own. And that's inherently at the end of the day what Metaflow is. It's an opinionated Python library that places its opinions on the systems engineering part of machine learning infrastructure and allows machine learning practitioners
[00:05:59] Unknown:
to exercise their freedom of choice, at the top levels of this machine learning stack. 1 of the common things that I've heard from people who are working in data science and machine learning is that there's this impedance mismatch between the iterative nature of the experimentation phase and figuring out what is the machine learning model going to do, how are you going to achieve these desired outcomes, And then the work of actually putting that into production and the types of tooling that are used in each of those different phases of the process, I'm curious how you're addressing that impedance mismatch between the local experimentation and then putting things into production and adding things like monitoring and model drift and being able to handle the overall life cycle of those models?
[00:06:46] Unknown:
Good question. This is definitely, you know, one of the biggest questions that data science as well as software engineering teams who wish to introduce machine learning in their systems constantly struggle with. Data scientists, they are good at prototyping these models, often on a limited set of data. But then at the end of the day, those models need to be used in a particular context. Right? And at that point in time, they would usually work with software engineers who understand engineering well to graduate these models into production. But then at that handover point, the data scientists, they lose control over these models.
They're not necessarily sure how these models are actually being used in production in terms of, say, you know, scoring on these models. And that's a big problem at the end of the day. And with Metaflow, we essentially try to make sure that our data scientists ultimately, at the end of the day, can be full stack data scientists where they are able to prototype easily on datasets that might be stored locally on their laptop. And then with our integrations with the cloud, they can very easily and very seamlessly scale out that compute by, say, leveraging Amazon's S3 or launching their jobs on much bigger instances in the cloud using EC2 instances. And then at the end of it, once they are happy with the model that they have trained, then they can deploy that model. And deploying a model has various different connotations when it comes to data science. Right? Deploying a model could mean that, hey, I have this model training workflow, and now I want to deploy this model training workflow to execute in an autonomous manner, and Metaflow will do that automatically for the end user. Deploying a model could also mean that you're running some sort of inferencing on the model. So, usually, you have some sort of a function that you have defined on the model, and that can do either batch inferencing or real time inferencing.
And given that the function as a service ecosystem has now matured quite a bit, it's relatively simple and straightforward for us to provide capabilities to our users where they can just write these very simple, Pythonic functions, and then we can host those functions as services or as batch endpoints on their behalf. And that helps us in addressing this impedance mismatch, so to say, between production and prototyping.
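As a rough sketch of the two deployment connotations described above (nothing here is a Metaflow or Netflix API; the model format, file name, and feature shape are hypothetical), the same plain Python scoring function could back either a scheduled batch job or a real-time endpoint hosted on a function-as-a-service platform:

```python
import pickle

def train(path="model.pkl"):
    # Stand-in for a real training workflow producing a model artifact.
    model = {"threshold": 0.5}
    with open(path, "wb") as f:
        pickle.dump(model, f)

def predict(features, path="model.pkl"):
    # The same function could be mapped over a dataset for batch inference
    # or exposed behind an HTTP endpoint for real-time inference.
    with open(path, "rb") as f:
        model = pickle.load(f)
    return int(sum(features) > model["threshold"])

if __name__ == "__main__":
    train()
    print([predict(row) for row in [[0.2, 0.1], [0.9, 0.3]]])
```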
[00:09:07] Unknown:
And then in terms of the actual consumers of Metaflow, there are the data scientists and machine learning engineers who are creating the models, but there are also the infrastructure engineers who are providing the underlying storage and compute for getting it into production, as well as providing test and sample data sets for the original iteration process. I'm curious how you've approached the interfaces for the Metaflow package
[00:09:33] Unknown:
to make it accessible to each of these different stakeholders in the overall process? Mhmm. So Metaflow, ultimately, at the end of the day, is targeted towards data scientists. Now, Netflix is an AWS shop. We run all of our workloads on top of AWS. And the great thing about AWS is that when it comes to these components like storage, compute, networking, for all of these, at the end of the day, the operations are managed by AWS, and they are fantastic at that. So we don't ever have to worry about managing any of these resources. We just simply go to our AWS console, and we are able to spin up new buckets on Amazon S3 or launch new instances on EC2. And that sort of, like, has helped us in reducing the scope of Metaflow significantly.
And then the biggest stakeholders for us remain data engineers, who are responsible for producing these datasets that are then consumed by data scientists, and then software engineers on the other side of the equation, who are responsible for interfacing with the datasets that might be stored in any sort of data store. Primarily inside of Netflix, our data lake is built on top of Amazon S3. So we ship with utilities that allow our data scientists to really efficiently grab all the data that lives in S3 in a very high throughput manner and interact with that data on instances which, you know, might have resources in terms of, say, memory, GPU, CPU much larger than what their local workstation can provide. And then that sort of, like, allows them to be very much self-serve in terms of building models at scale. And once they have built a model, once they are happy with the model that has been generated, then at the end of it, they can very easily expose an interface to that model through a function as a service platform, where at the end of the day, they are writing a function, and then the software engineering teams can then very simply just ping those API endpoints that we stand up on their behalf.
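A minimal sketch of that high-throughput data access, using the S3 client that ships with Metaflow, might look like the following; the bucket, prefix, and key names are hypothetical:

```python
from metaflow import S3

# Hypothetical bucket and keys; in practice these would point at your own data lake.
with S3(s3root="s3://my-data-lake/events/") as s3:
    # get_many downloads the objects in parallel and returns handles whose
    # .path attribute points at a local temporary copy of each object.
    objects = s3.get_many([
        "day=2020-06-01/part-0.parquet",
        "day=2020-06-01/part-1.parquet",
    ])
    for obj in objects:
        print(obj.key, obj.path)
```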
[00:11:48] Unknown:
And with Netflix, I know that there is a fairly extensive ecosystem of tooling and libraries and an overall data stack that Metaflow ties into. I'm curious how that factored into the process of building Metaflow in a way that's accessible to people outside of Netflix, given that it is an open source library and that you're intending for other people to be able to use it within their own ecosystems and their own environments?
[00:12:18] Unknown:
So if you look at the way Netflix operates, like one of our biggest cultural principles is freedom and responsibility. What that means at the end of the day is that every single team has the freedom to choose the tooling that works for them. And in practice, what ends up happening is that you have all of these different teams who are working on a wide variety of problems. And for each of these problems, they will necessarily end up choosing a diverse set of tooling. And Metaflow is intended to work with all of these diverse sets of tooling, at the end of the day. And that was really helpful for us in our open source strategy as well.
We don't place any assumption in terms of what specific machine learning tooling you are using or any sort of specific data store that you are interacting with. That's the case internally at Netflix as well. We have data which is stored in a bunch of different subsystems, and we wanted to make sure that our data scientists can very easily and seamlessly, just within the scope of Metaflow, interact with all of these different data stores and can then execute or schedule their compute on different compute frameworks as well. So when we were planning to open source Metaflow, given that we had a good amount of experience with AWS technology, it made natural sense to have open source integrations with some of the underlying storage and compute systems that AWS offers.
But the architecture of the code is such that you can very easily integrate with, say, Google Cloud or, Microsoft Azure and, things things will just work very seamlessly.
[00:13:58] Unknown:
Another element of the work that you're doing with Metaflow, and the timing of the release, is that it's interesting how it closely coincides with a lot of other projects that are aimed at solving similar or tangential problems. I'm thinking in particular of MLflow from Databricks, and then there's the team at Iterative who are building the data version control and continuous machine learning projects. I'm curious what the timeline was for Metaflow in terms of how long it's been around and some of the motivation for building it versus leaning on some of the other existing tooling that might have been around at that particular point in time? So 3 years ago, when we first started building Metaflow,
[00:14:39] Unknown:
the machine learning infrastructure landscape looked very, very different. Some of these projects either didn't exist or they were just starting out. So it didn't really make much sense for us to sort of, like, buy something, and we decided to rather invest our efforts in building Metaflow. One thing to note here is that with Metaflow, we are not trying to reinvent the wheel. Metaflow at the end of the day is an opinionated library which strings together all of these excellent infrastructure pieces that already exist in the world. So the overarching focus or the overarching philosophy that we had with Metaflow was that if you look at a data scientist, right, they can still get their work done. They can train their model. They can train their model at scale. They can deploy their models, but none of that is easy enough. They end up spending a lot of time on engineering concerns rather than data science concerns.
And, I think it's safe to say that in many companies, data scientists are the most precious resources. And, for us, it was rather important that we were providing a framework that was very human centric, that was completely focused on removing, the small little pain points that our data scientists were facing every single day so that they could just focus their efforts, their attention on data science, the thing that they enjoy doing the most. And that that was the entire philosophy around going ahead and building Metaflow.
[00:16:08] Unknown:
And as far as releasing it as an open source project, what have you seen as being the benefits to Netflix and the engineers working on Metaflow as well as the overall machine learning infrastructure community? And what were some of the changes that were necessary in the code to make it ready for being released as an open source project with the intent of it being usable outside of Netflix?
[00:16:33] Unknown:
Internally, at Netflix, we use Amazon S3 as our data lake, and we use Titus, which is our container orchestration platform, for compute, and Netflix Meson for scheduling all of our batch ETL processes. So if you are a data scientist inside of Netflix, then essentially you'll use Metaflow to store all of your data in S3, execute all of your training jobs and feature engineering jobs on Titus, and schedule this entire workflow end to end on top of Netflix Meson. In the open source community, if you are already a customer of AWS, then you have access to Amazon S3. Unfortunately, you wouldn't have access to either Titus or Netflix Meson. So it was important for us to create open source integrations for compute as well as scheduling. So we integrated with AWS Batch, which is, at a very high level, a job queue in front of Amazon EC2 instances, as well as AWS Step Functions for scheduling. And actually, that's the integration that we released just a couple of weeks ago. Now, when we were implementing Metaflow internally, at that point in time, we decided to go with a plugin based architecture. At the end of the day, Metaflow needs a data store. So there is no tight coupling with Amazon S3. You can very easily write a similar integration with GCS, and people have actually forked the Metaflow code base, and they have those integrations as well. For compute, it's rather straightforward to have an integration with Kubernetes as well. There are instances where there are certain organizations who were using Kubernetes rather than AWS Batch or who were in the Google Cloud world, and they went ahead and they wrote their own plugins.
So this plugin framework has been really helpful for us in terms of managing our codebase. So the code base that you see is exactly the same code that we use internally at Netflix. It's the same core that sort of, like, manages all of our machine learning pipelines internally as well as in the open source ecosystem. In terms of adoption of Metaflow, our initial target audience were people who were already doing data science, but were looking for some framework that would take care of all the operational concerns of building and managing these pipelines.
Netflix has been doing machine learning for a number of years. And because of that, my team specifically has been in a unique position to learn from all of these efforts and codify all of our learnings. And at the end of the day, Metaflow is a result of all of these accrued learnings that we hope people are finding useful.
[00:19:26] Unknown:
This portion of Podcast.__init__ is brought to you by Datadog. Do you have an app in production that is slower than you like? Is its performance all over the place, sometimes fast and sometimes slow? Do you know why? With Datadog, you will. You can troubleshoot your app's performance with Datadog's end to end tracing and, in one click, correlate those Python traces with related logs and metrics. Use their detailed flame graphs to identify bottlenecks and latency in that app of yours. Start tracking the performance of your apps with a free trial at pythonpodcast.com/datadog. If you sign up for a trial and install the agent, Datadog will send you a free t-shirt to keep you comfortable while you keep track of your apps.
And in terms of Metaflow itself, can you talk through the overall workflow of a data scientist or machine learning engineer who's building a model and the changes that they would need to make to their existing code to make it compatible with Metaflow and then the work for the infrastructure engineers to use Metaflow and integrate it with their existing systems?
[00:20:30] Unknown:
Sure. So Metaflow, ultimately, at the end of the day, is a Python library. So you can get started by just doing a pip install of Metaflow, and it introduces a dataflow programming model. So you have this notion of a directed acyclic graph. And in this graph, the user is essentially specifying their steps as well as transitions between different steps. And once the user has sort of like specified this entire graph, then they can execute this graph using Metaflow. Now there are a lot of machine learning frameworks that espouse this notion of converting the user code into a DAG for execution.
So there's nothing new here. But then what Metaflow allows the user to do is, for each of these steps, and these steps are essentially at the end of the day Python functions, they can very easily annotate these steps to execute them in different execution environments. So for example, say you have a workflow, and say it's a very simple example, a 3 step workflow. You pull the data from a data store. You clean that data, generate some features. Then in the next step, you are training a model on, say, you know, some GPUs. And then in the very last step, you are storing that model somewhere. Right? Now our users, they love prototyping on their laptops, and that was something that we wanted to enable. So they can very easily just pip install Metaflow.
They write these 3 functions, and they put dependencies amongst these 3 functions. And then, now the interesting thing is what happens when your data volume increases so much that it can no longer fit in your laptop's memory? At that point in time, you can very easily just annotate your step with resources. So we have a @resources Python decorator. So you can just go @resources, CPU equals 8, memory equals 100 GB. And then Metaflow will take that function, and it will execute that function in the cloud on behalf of you, and it will pull the results back to your laptop so that then the other steps can execute on your local machine. And to the end user, essentially, what ended up happening was as soon as they added this @resources decorator, all of a sudden, their laptop got 8 CPU cores and 100 gigs of RAM. They can do a similar thing with their model training process. If their model training needs GPUs, then they can just specify, hey, I need 2 GPUs.
And then Metaflow will take care of moving their compute and their data to the cloud, executing that compute, and then getting back all the results to their laptop so that then they can proceed forward. And that's really powerful because now the end user, they are writing code in idiomatic Python, the way they know, and they get the entire flexibility of the cloud. And there are a lot of other features that we provide as well. So for example, now when you're running multiple of these steps in the cloud, how do you pass state between these steps? And we make it very idiomatic where the user doesn't have to care about how that is happening.
All the variables, all the state that was available to them before the execution of any given step, that's directly made available to them. So that simplifies a lot of concerns that they have. And then every single execution of their workflow is logged and tracked by Metaflow. All of the code is stored. They can specify specific dependencies for each of the steps. So they can go like, hey, I need to execute my feature engineering step in an environment that only has pandas and nothing else. And with Metaflow, they can very easily do that. For their model training step, they can be like, hey, this model training step needs to execute in an environment that has TensorFlow and nothing else, and they can very easily mix and match that. And that has been really powerful because now our end users, our data scientists, they can use the latest and the greatest of the research that's happening, all the releases that happen in the PyTorch, TensorFlow, scikit-learn world, without actually relying on any infrastructure engineer or anybody to sort of like provision an environment for them. On the other side, for infrastructure engineers, for them to enable Metaflow for their end users, it's actually a very straightforward deployment, as I said before. Right?
Metaflow is a Python library that's installable from PyPI, from conda. And to configure Metaflow, what it needs is just access to an S3 bucket, if you want to use S3 as your data store, and access to an AWS Batch job queue, something that is very easy to set up, and that's about it. So you just, like, set some environment variables within the library, and off you go, you have exactly the same setup that we use internally at Netflix.
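To make that workflow concrete, here is a minimal sketch of a three step flow with the @resources and @conda annotations described above; the flow name, artifact names, and library version are made up for illustration:

```python
from metaflow import FlowSpec, step, resources, conda

class TrainingFlow(FlowSpec):
    """A hypothetical flow: pull data, train a model, store the result."""

    @step
    def start(self):
        # In a real flow this step would pull data from a data store and build features.
        self.features = [[0.1, 0.2], [0.3, 0.4]]
        self.next(self.train)

    # When run with `--with batch`, this step executes remotely with 8 CPUs and
    # roughly 100 GB of memory (the value is in MB); locally the annotation is a no-op.
    @resources(cpu=8, memory=100000)
    # Isolates this step's dependencies; the version is only an example, and the
    # decorator takes effect when the flow is run with the conda environment enabled.
    @conda(libraries={"scikit-learn": "0.22.1"})
    @step
    def train(self):
        # Placeholder "model"; anything assigned to self is snapshotted by Metaflow.
        self.model = {"weights": [sum(row) for row in self.features]}
        self.next(self.end)

    @step
    def end(self):
        print("trained model:", self.model)

if __name__ == "__main__":
    TrainingFlow()
```

Run locally with `python training_flow.py run`, push the annotated steps to the cloud with `python training_flow.py run --with batch`, or schedule the whole graph with `python training_flow.py step-functions create` once the AWS integrations are configured.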
[00:25:26] Unknown:
Another challenge in the data science and machine learning space is that of collaboration between machine learning engineers and data scientists. I'm curious how Metaflow addresses that problem or if that's something that's left to other tooling to handle collaboration and things like data versioning and data lineage tracking. Excellent question. So the notion of collaboration
[00:25:49] Unknown:
is very tightly linked to the notion of reproducibility and repeatability. As an end user, I can collaborate effectively if I can repeat the work and reproduce the work that my colleagues have done. And that's incredibly important at the end of the day for us because it also instills a greater trust in the systems that have been built. And Metaflow, by default, what it will do is it will snapshot the code that is being executed, and it will store that in S3. It will also snapshot all the intermediate data and the intermediate state for every single execution, and that also gets stored in S3 in a content-addressed store. And we also go one step further. As I mentioned just before, Metaflow allows people to declare what sort of libraries, what sort of environment they want to execute their code in. So we'll actually snapshot the entire environment as well for them. And then this gives us a very strong guarantee that at any time in the future, anybody can go in and be like, hey, this machine learning pipeline that you executed a few months ago, I want to re-execute it. And I want to make sure that I'm getting the exact same set of results. One good thing is that because we use Amazon S3, and cloud storage is becoming cheaper and cheaper, for all the larger datasets that the data engineers have generated, if we assume immutability of those datasets, then Metaflow provides all the other immutability guarantees on the executions that have happened before to sort of like guarantee this very, very strong notion of reproducibility.
And that's really helpful. Like, we have instances where, you know, people will execute their pipelines and they will generate these intermediate datasets, which will then be consumed by other Metaflow workflows. So an example could be that, within the scope of your machine learning workflow, maybe you are generating some embeddings and then you are training a model based on those embeddings. And now some other user might want to use those embeddings. And with Metaflow, they have a very easy and simple way to reference those embeddings. They can be like, hey, this machine learning workflow that executed a week ago, I want the embeddings generated by that, and then Metaflow will make sure that that exact same dataset is available to them, with all the data versioning and lineage tracking built in.
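A small hypothetical sketch of that cross-workflow reference using the Metaflow client API (the flow name and the embeddings artifact are invented) might look like:

```python
from metaflow import Flow

# Grab the artifacts from the most recent successful run of a hypothetical flow.
run = Flow("EmbeddingFlow").latest_successful_run
embeddings = run.data.embeddings   # pulled from the versioned datastore (e.g. S3)
print("reusing embeddings from run", run.id)
```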
[00:28:06] Unknown:
And then as far as Metaflow itself, can you discuss how it's actually implemented and architected and some of the design changes that have happened as you have evolved the code base and brought on more use cases and more disparate users outside of Netflix? So Metaflow,
[00:28:23] Unknown:
at the end of the day, is a pure Python library. So the way we have implemented Metaflow is that it has a local scheduler that allows you to declare your jobs, or declare your machine learning workflow, as a directed acyclic graph, and then that local scheduler would be responsible for executing Metaflow on your laptop. And this local scheduler, through a variety of different plugins, can make sure that, you know, the backing data store can be either your local laptop or it could be some cloud store. It can execute every single step as a local process on your laptop, or it can farm out that compute to a cloud job orchestration system. So that's been really helpful for us as we iterated on Metaflow, and that is something that has allowed us to very easily cater to internal needs as well as our OSS use cases. Now in terms of the evolution, every single feature of Metaflow has been informed by a pain point that we have directly observed. And one of the big reasons why we open sourced Metaflow was just so that we could expand our worldview as well and identify what are some of the pain points that people outside of Netflix are facing, so that then both our internal users as well as users in the open source community can benefit from those learnings.
[00:29:47] Unknown:
Another element of the machine learning life cycle is after you've got it into production, you need to be able to monitor the model for performance and drift in terms of the outputs that it produces. And I'm curious what you have found to be some of the useful ways to collect metrics for the model and some of the types of information that you're looking at to determine its overall efficacy and when it needs to be retrained and redeployed?
[00:30:15] Unknown:
That's a very interesting question. So if I have to unpack this, there are a bunch of different dimensions. So one is just around collecting these metrics. So Metaflow, as I said before, snapshots all of your internal state, and it stores that in the datastore and makes it widely accessible, which means that you can very easily, within your code, log all of the information. You don't even need to actually log anything explicitly. Metaflow will just do that behind the scenes for you. Because what we found in practice was that when things fail, when people are debugging their workflows, at that point in time there's always this point where they were like, hey, if only, you know, I had logged this extra variable, then it would have made life super easy for me. So we went with the philosophy that, okay, we'll just log everything. We compress everything that we store. We make sure that, you know, none of the information is duplicated in our data store so that we are being cost efficient about it. And what that helps with is that at any time in the future, the user can go in and be like, hey, how did my model behave? What were some of the input values to my model? And they can compare across different executions as well very easily. Now the next question is that, okay, what is this interface through which they can monitor these models once they have this data?
And is there a specific UI that we can build, that we can ship, that sort of, like, allows people to monitor their machine learning workflows in a very cohesive manner? And the answer, or the conclusion, that we came to was that it's actually very difficult to build one UI that rules all of these different use cases. The kinds of things that you would want to monitor for, say, computer vision use cases are going to be very different from natural language processing use cases. But the interesting thing is that our users, they are very well familiar with the notebook ecosystem.
And with Metaflow, we provide a client that allows our users to very easily interact with all of these data artifacts and metrics that have been stored automatically by Metaflow within a notebook so that they can very easily create these custom dashboards according to their needs. And they can monitor how their training processes are behaving, how multiple different models are performing against one another. And then they can make a judgment call whether to retrain a model or not. And if, let's say, you have deployed your machine learning training workflow onto a scheduler like Netflix Meson or AWS Step Functions, then it's very easy to even automate that feedback loop at the end of the day. So that's been our philosophy.
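A notebook dashboard of the kind described here can boil down to a few client calls; in this sketch the flow name and the accuracy artifact are assumptions, standing in for whatever a team actually tracks:

```python
from metaflow import Flow

# Compare a metric across the ten most recent executions of a hypothetical flow.
for run in list(Flow("TrainingFlow"))[:10]:
    if run.successful:
        print(run.id, run.finished_at, run.data.accuracy)
```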
Internally at Netflix, we have an excellent hosted notebook ecosystem. So our users, they can publish these dashboards that are backed by notebooks, and these dashboards are essentially constructed using the Metaflow client, so that it's not just them. But if they have any external stakeholders, they can also very easily visualize
[00:33:22] Unknown:
how the, machine learning systems are behaving. And for people who are interested in extending Metaflow or adding new integration points for it, what does the interface for that look like and some of the ways that you're helping to support engineers in broadening the overall ecosystem beyond the overall ecosystem beyond things like just AWS, which I know you mentioned you also had the ability to support other clouds. I'm just curious what the developer, what what the developer experience looks like for being able to add those new integrations and if you are working to foster an overall ecosystem to make those new integrations easy to find and implement? Yes. So ultimately, at the end of the day, Metaflow has this plug in architecture.
[00:34:04] Unknown:
So for all of the steps in Metaflow, you can modify or mutate their behavior using these Python decorators. And if you are interested in writing a new plugin, you'll essentially write a plugin that obeys some of the lifecycle endpoints that Metaflow exposes. And we do have a bunch of in-flight plugins that people are working on. And if people are interested in contributing more plugins or if they feel that there are other integration points that would be really helpful within their organizations, my team would be really glad to have a conversation with them.
[00:34:42] Unknown:
So please feel free to reach out to us on any of our chat channels or via GitHub issues. For people who are using Metaflow, what are some of the most interesting or innovative or unexpected ways that you've seen it employed?
[00:34:56] Unknown:
Excellent question. You know, ever since we launched in the open source world, I've come across many, many different use cases. We have instances where there are car manufacturers trying to optimize engine efficiency, and they're trying to build models, and they found Metaflow really useful for those purposes. There have been instances where, you know, there are companies doing cancer research, and they need to train a lot of models on a lot of data. And they found that Metaflow was able to scale out effortlessly. We have also come across people using Metaflow in the scope of algorithmic trading as well. Now the fact that, at the end of the day, Metaflow is not necessarily placing any assumptions on the kind of machine learning that you do, the kind of libraries that you get to use, means that there's a wide diversity of use cases. And when we launched, our target audience was essentially companies which have machine learning talent, but who don't want to invest a lot in machine learning platforms.
Big tech companies like Netflix or Facebook or Google can afford to spin up dedicated engineering teams who get to work on building these machine learning platforms. But that might not necessarily be the case for many other companies, outside of Silicon Valley specifically. And that was also one of the big reasons why we wanted to open source Metaflow, so that we could learn from these very diverse set of use cases as well. And in your own experience
[00:36:26] Unknown:
of building and using Metaflow and interacting with the community of users inside and outside of Netflix, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:36:36] Unknown:
This happens to be my first open source project. So, I guess, you know, like managing and maintaining an open source community. That's that's a lot of work. But at the end of the day, it's also something very enjoyable because you see a lot of people using, what the team has worked on tirelessly over the last couple of years, and that's that's always motivating. Yeah. I mean, for us, it's still very early days. Metaflow has been in the open source community
[00:37:05] Unknown:
for all of 6 months now, so we are still learning. And then for people who are considering using Metaflow or are reevaluating the way that they're handling their machine learning workloads, what are the cases where Metaflow is the wrong choice and they would be better suited either using an existing off the shelf tool or building their own internal tooling?
[00:37:25] Unknown:
Very good question. So Metaflow is available as a Python library, and we also do have bindings in the R language for Metaflow. So, if you're not a Python or an R user, then maybe Metaflow might not really be a great match. We definitely have use cases where people use Metaflow for creating these Pythonic workflows and they embed, say, you know, C++ code as part of every single step. Besides that, at the end of the day, you know, Metaflow is trying to address challenges which come at scale. And by scale, I don't just necessarily mean high throughput, low latency use cases, but also high diversity use cases.
Those were the cases that we saw internally at Netflix. So there could be scenarios where, you know, as a company, you have a very specific machine learning problem that you're trying to solve, and you have already built highly bespoke infrastructure for that. And maybe at that point in time, it might not be necessary for you to sort of, like, move away from that infrastructure. But in case, you know, you are curious to try out Metaflow, we do offer sandbox environments. So when we were open sourcing Metaflow, at that point in time, we recognized that it might not be easy for people to set up, say, a private S3 bucket or set up an AWS Batch job queue. So we ended up spinning up these sandboxes that people can request if they follow the directions in our documentation, where we'll grant them AWS resources at no cost to them for a period of a week so that they can actually
[00:39:00] Unknown:
experience the cloud scale, that Metaflow has to offer. And for your work on Metaflow, as you plan on next steps or new features, what do you have in mind for the future both as far as helping to grow the ecosystem and contribute to the state of the art for machine learning infrastructure, as well as for specifics of the directions that you're planning on taking the code base of Metaflow? So very recently, we launched integrations
[00:39:27] Unknown:
with AWS Step Functions as well as our R bindings for Metaflow. And we are also in conversations with various external partners who have adopted Metaflow within their organizations to identify these pain points, and that would sort of, like, help us in formulating our roadmap in the future. So some obvious things are, you know, like integrations with more cloud providers. There are certain things, certain features that we have internally that we would want to also provide to our external users in the open source community. We are also working on a graphical user interface for Metaflow workflows. So this will allow people to very easily look at the state of the world and just, like, track workflow executions.
They'll still go to notebooks when they have to look at the specifics of their machine learning models themselves, but it's still a really great option when you just want to know how your training pipelines are actually executing. So those are some of the things in the near term for us. As I said before, this is a learning opportunity for the team. We are trying to learn as much from the external community as possible.
[00:40:39] Unknown:
Are there any other aspects of your work on Metaflow or your work on supporting machine learning at Netflix or the overall space of machine learning infrastructure that we didn't discuss yet that you'd like to cover before we close out the show? I think,
[00:40:53] Unknown:
the biggest thing would be that this area is rapidly evolving. I'm constantly amazed by the amount of progress that various organizations and people in the OSS community are making in this area. And that's super exciting at the end of the day because you never know what's coming next. It's really difficult to plan for a future in that sort of a world, but that's sort of like the excitement of it as well. So no specific plans, aside from the fact that the entire ecosystem is trying to figure out how to make data scientists more and more productive, how to sort of, like, move from this notion where data scientists have traditionally operated in teams of one to actually having data scientists collaborate very, very effectively. And that's honestly
[00:41:41] Unknown:
the most interesting aspect of my job personally to me. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing on Metaflow and the other projects, I'll have you add your preferred contact information to the show notes. And so with that, I'm going to move us into the picks. And this week, I'm going to choose a project called vdist that I started using recently. It's a toolchain for being able to package up your Python projects as an OS package, either Debian or RPM or what have you. And it leans on Docker and FPM for being able to provide reproducible builds. And the thing that I like about it is that it actually packages up the entire Python binary at a specific version along with your code so that you can release that as a single artifact without having to rely on the version of Python that is running on the target operating system. So it's just a good way to package everything in a hermetically sealed box and ship it along without necessarily having to lean on Docker for production. So definitely worth a look if you're trying to figure out the best way to deploy your Python code. I've been happy with it so far and plan to dig some more into it. And so with that, I'll pass it to you, Savin. Do you have any picks this week? Yeah. I mean, because I've been staying home
[00:42:54] Unknown:
all day long, due to coronavirus, so I have tried to pick up this new skill around repairing vintage watches. So that's been something that's been taking a good amount of my time. I highly recommend that activity to people looking for something which is at the intersection of art, design, and technology. Highly, highly invigorating.
[00:43:18] Unknown:
Yeah. Definitely sounds like an interesting, hobby to pick up and something that would, keep you fairly well occupied with all the little intricate mechanics. It's something that I'm sure is, its own whole set of rabbit holes that you can go down.
[00:43:31] Unknown:
Exactly. I mean, you know, it it stretches your level of patience.
[00:43:35] Unknown:
Alright. Well, thank you very much for taking the time today to join me and discuss the work that you're doing on Metaflow and supporting machine learning at Netflix. It's definitely a very interesting project and something that appears to be well engineered and is a very, as you said, rapidly evolving problem domain. So I'm excited to see where you take it and some of the other work in the area. So I appreciate all of your time and effort on that front, and I hope you enjoy the rest of your day. Thank you. Thanks for having me. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Savin Goyal's Background and Introduction to Python
Netflix's Machine Learning Infrastructure
Addressing the Impedance Mismatch in Machine Learning
Metaflow's Target Audience and Interfaces
Metaflow's Development and Open Source Strategy
Workflow and Integration of Metaflow
Collaboration and Reproducibility in Metaflow
Implementation and Architecture of Metaflow
Monitoring and Metrics in Machine Learning
Extending Metaflow and Developer Experience
Lessons Learned and Community Engagement
Future Directions for Metaflow
Closing Remarks and Picks