Summary
Building a machine learning application is inherently complex. Once it becomes necessary to scale the operation or training of the model, or to introduce online re-training, the process becomes even more challenging. In order to reduce the operational burden on AI developers, Robert Nishihara helped to create the Ray framework that handles the distributed computing aspects of machine learning operations. To support the ongoing development and simplify adoption of Ray he co-founded Anyscale. In this episode he re-joins the show to share how the project, its community, and the ecosystem around it have grown and evolved over the intervening two years. He also explains how the techniques and adoption of machine learning have influenced the direction of the project.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Robert Nishihara about his work at Anyscale and the Ray distributed execution framework
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what Anyscale is and the story behind it?
- How has the Ray project and ecosystem evolved since we last spoke? (2 years ago)
- How has the landscape of AI/ML technologies and techniques shifted in that time?
- What are the main areas where organizations are trying to apply ML/AI?
- What are some of the issues that teams encounter when trying to move from prototype to production with ML/AI applications?
- What are the features of Ray that help to mitigate those challenges?
- With the introduction of more widely available streaming/real-time technologies the viability of reinforcement learning has increased. What new challenges does that approach introduce?
- What are some of the operational complexities associated with managing a deployment of Ray?
- What are some of the specialized utilities that you have had to develop to maintain a large and multi-tenant platform for your customers?
- What is the governance model around the Ray project and how does the work at Anyscale influence the roadmap?
- What are the most interesting, innovative, or unexpected ways that you have seen Anyscale/Ray used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Ray and Anyscale?
- When is Anyscale/Ray the wrong choice?
- What do you have planned for the future of Anyscale/Ray?
Keep In Touch
- robertnishihara on GitHub
- @robertnishihara on Twitter
- Website
Picks
- Tobias
- Robert
- Production RL Summit
- Project Hail Mary by Andy Weir
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Ray
- Anyscale
- UC Berkeley
- Matlab
- Deep Learning
- Pandas
- NumPy
- Horovod
- XGBoost
- Modin
- Dask
- Ray Datasets
- Reinforcement Learning
- Production Reinforcement Learning Summit
- AlphaGo
- Databricks
- Snowflake
- TPU == Tensor Processing Unit
- Weights and Biases
- MLflow
- RLlib
- Ray Serve
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host, as usual, is Tobias Macey. And today, I'm interviewing Robert Nishihara about his work at Anyscale and the Ray distributed execution framework. So, Robert, welcome back. And for the folks who haven't listened to the previous episode that you did, can you give a brief introduction?
[00:01:11] Unknown:
Yes. I will. And thank you so much for having me here. So I'm Robert. I'm 1 of the cofounders and CEO of Anyscale, and also 1 of the creators of Ray, which is an open source project that we created to make distributed computing easy and to make it easy to scale and productionize AI applications. Before starting this company, I did a PhD in machine learning at UC Berkeley, and actually some of the challenges we ran into, you know, doing machine learning research are what motivated us to start Ray. And that's how all this got started.
[00:01:42] Unknown:
And do you remember how you first got introduced to Python?
[00:01:47] Unknown:
Yes. You know, I had previously been doing machine learning research as an undergrad in college, and we were all using MATLAB to, you know, do sampling and other machine learning algorithms. That was right around the time when deep learning was starting to take off, and everyone just started transitioning to Python around 2012.
[00:02:09] Unknown:
In terms of the Ray project, as I mentioned, you've been on the show before. We did a pretty deep dive into the technical underpinnings of that project and some of its history, but it was right around the time that you and your cofounders were building the Anyscale company to help support that project and grow its ecosystem. I'm wondering if you can describe a bit about what it is that you're building at Anyscale and some of the story behind how you decided to turn that into a business and why you wanted to spend your time and energy focused on this particular problem space.
[00:02:42] Unknown:
Let me start with why we started Ray and then why it made sense to turn that into a company. Like I mentioned, my cofounders and I, we were doing machine learning and machine learning research at Berkeley. And 1 of the challenges that we were running into over and over was just that in order to do machine learning, there's often a huge infrastructure investment. You're often spending a lot of time on the, you know, distributed systems engineering or, you know, low level systems engineering to speed things up or to scale things up. And that requires a ton of expertise. It's super hard. Right? So you have all these people trying to do machine learning who end up going on this detour and gaining all of this expertise in distributed systems and infrastructure, just so they can build the tools they need to do machine learning.
That's incredibly costly. It's expensive. It's hard. You know, we felt there was an opportunity. We were going through this, as were tons of other people around us, and we were all kind of doing it independently. And we realized, you know, there's an opportunity here to build significantly better tools. If we can build tools that sort of take care of all of this infrastructure work for you and make it easy, then that could potentially, you know, really accelerate progress in machine learning and just enable developers to be much more productive. So we started Ray as an open source project to try to do that. We had no ambitions to start a company. I think if you asked us at the time whether we would start a company, we probably would have been fairly dismissive of the idea. And, you know, our focus was on making the open source project useful for people. We started working on it, trying to grow the project, get more users. We were running around Berkeley, trying to get our friends to try it out and, you know, watching over their shoulders as they installed it and started using it. Of course, they'd run into issues, and then we'd try to, you know, debug those issues on the fly and get them to try it out again. And this evolved over time. While at Berkeley, we were running boot camps, running tutorials to train people how to use Ray. We started collaborating with companies that were using Ray in production, giving talks, you know, and running training sessions. As the project started to take off, we ended up, you know, asking ourselves, what would it take to really solve this problem and really try to, you know, build something super useful here? To take something all the way to where companies and everybody can rely on it in production, where it's part of the infrastructure that companies are, you know, relying on for their mission critical workloads, that's a long journey. Right? There's a lot of energy that needs to go into that, a lot of effort that needs to go into hardening and productionizing things. And, of course, we felt this was an important problem.
More and more applications were becoming distributed. AI, you know, has the potential to really just reshape every industry, you know, but it's hard because of the infrastructure investment. That's 1 of the things making it difficult. And so if you ask, like, what will it take to really enable just every organization to succeed with AI, to get value out of AI? And how do we, you know, solve that infrastructure challenge for a company so they can succeed with AI? That is something where we weren't gonna be able to achieve all of this with just, you know, a few grad students working together at Berkeley. We really needed to be able to move much more quickly and put much more energy behind it. That led us to start the company. So basically, we started the company because, a, we felt it was an important problem. Right, that everything's becoming distributed.
Distributed computing is hard and is necessary for doing AI. At the same time, you know, we thought it was a timely problem. Right? This was happening all around us. And the open source, you know, Ray was taking off. We had a group of people that were working really well together, and we, you know, managed to convince ourselves that it could make sense as a business, in addition to all of that. So those are some of the factors that we were thinking through and that led us to start the company.
[00:06:32] Unknown:
And in the couple of years since you were on the podcast last and the couple of years of building and growing the Anyscale business, I'm wondering if you can talk to some of the particular areas of focus and ways that the project and the business have evolved or changed, and ways that the ecosystem has grown up around them both?
[00:06:53] Unknown:
Yeah. Absolutely. So let me start with Ray. You know, what's changed with Ray? What's new with how we think about Ray? So Ray really has 2 parts to it. There is the core, you know, underlying distributed system, which is fairly low level. It lets you take regular Python code built out of functions, Python functions, Python classes, and kind of translate those into the distributed setting so that you can take pretty much an arbitrary Python application and find a way to scale it. So it's super general, super flexible, but it is fairly low level. Right? So if you want to build powerful applications on top of the core Ray system, you can do it, but you have to build a lot of stuff. So the second part of Ray is that on top of this core system, there's an ecosystem of libraries.
And, you know, they're higher level tools. They're easier to use. They just work off the shelf. Right? This could be for, you know, scaling deep learning training. It could be for scaling, you know, your hyperparameter tuning and machine learning experimentation. It could be for your data loading and preprocessing right before you do the training. It could be for deploying the models. It could be for reinforcement learning. You know, there are many libraries like this you can use and scale on top of Ray. 1 of the things that's most notable over the past couple of years is this shift in focus to really emphasize the library ecosystem and really try to create the best possible ecosystem of scalable libraries on top of Ray. And the reason this is important for us is there's 2 reasons. 1 is that the libraries are just easier to use, you know, because a lot of logic is already built that you don't have to rebuild. There's an analogy there with Python. Right? Python, you know, itself is very general, fairly low level. And then on top of Python, you have a bunch of great libraries like pandas and NumPy and so on. And of course, a lot of the value in Python comes from the library ecosystem.
And we think the same thing will be true in the distributed setting. That's 1 reason: it makes things easier for users. The second thing is that there's this network effect where a lot of the value of a given library comes from the fact that you can use it along with the rest of the ecosystem. So maybe I wanna do training. You know, I wanna train my model, say my TensorFlow model, and I wanna scale that. But chances are, in order to do that, I'm also going to need to load some data, maybe transform that data as it's being fed into the training. And then maybe, as I'm training the model, I'm gonna wanna train multiple models and do this, like, hyperparameter search experimentation. And then maybe I'm gonna wanna take that model that I trained and deploy it, you know, serve it in production. Without Ray, each of these steps would be a separate distributed system.
And with Ray, you know, this whole end to end thing can just be a Python script. And so, you know, as we build this ecosystem, every new library is gonna add value for all of the users of the existing libraries. And so that's something we've really tried to double down on. And as far as, you know, how we're thinking about Ray, that's something we're super excited about.
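For listeners who haven't seen the core API Robert is describing, here is a minimal sketch of it, assuming a stock `ray` install; the function and class names are hypothetical. Plain Python functions become distributed tasks, and plain classes become stateful actors, with one decorator:

```python
import ray

ray.init()  # starts Ray locally; on a cluster this would connect to it

# A regular Python function becomes a distributed "task".
@ray.remote
def square(x):
    return x * x

# A regular Python class becomes a stateful "actor".
@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

# .remote() returns futures immediately; ray.get() blocks for the results.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]

counter = Counter.remote()
print(ray.get(counter.increment.remote()))  # 1
```

The libraries Robert goes on to describe are built out of exactly these primitives, which is why they compose into a single application.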
[00:09:56] Unknown:
And to your point, each of those different pieces being its own distributed system, 1 of the things that came to mind is the Horovod project that was spawned out of Uber and has been adopted by the community. And I'm curious what you see as the overlap between projects such as Horovod and the capabilities of Ray, and maybe ways that the 2 can operate in concert.
[00:10:17] Unknown:
That's a fantastic question. So Uber created Horovod to be able to scale deep learning training. Uber has been, you know, using Horovod to train their deep learning models. Now the interesting thing there is that it's not a Ray or Horovod kind of question. Uber is now today training Horovod models on top of Ray. Right? Scaling the training with Ray. And, like, given that they already had something working, you know, why port Horovod to run on top of Ray? Well, the value there for them, there were sort of 2 benefits. 1, like I mentioned, was integration with the rest of the ecosystem. So there's the potential to use Horovod along with the other libraries, whether that's for data ingest or hyperparameter tuning and whatnot. The second thing, though, is that when you build a library on top of Ray, you inherit a lot of the benefits of Ray. Ray solves a lot of distributed systems challenges, whether that's elasticity, you know, auto scaling, fault tolerance, you know, scheduling, things like that. And so when they were deploying Horovod on top of Ray, they inherited that elasticity, the fault tolerance, and things like that. And so that is, you know, 1 of the reasons we see developers scaling these libraries and sort of integrating them as part of the Ray ecosystem.
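Horovod's Ray integration works roughly as follows; this is a sketch based on the `horovod.ray.RayExecutor` API, with a placeholder training function, and details vary by Horovod version:

```python
import ray
from horovod.ray import RayExecutor

def train_fn():
    # Ordinary Horovod training code would go here: hvd.init(),
    # wrap the optimizer, run the training loop.
    import horovod.torch as hvd
    hvd.init()
    print(f"worker {hvd.rank()} of {hvd.size()}")

ray.init()
settings = RayExecutor.create_settings(timeout_s=30)
executor = RayExecutor(settings, num_workers=2, use_gpu=False)
executor.start()        # Ray schedules the Horovod workers as actors
executor.run(train_fn)  # runs train_fn on every worker
executor.shutdown()
```

Because the workers run as Ray actors, the elasticity and fault tolerance Robert mentions come along with the port.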
[00:11:38] Unknown:
And the same thing is true with libraries like XGBoost and other popular libraries as well. Yeah. I know that another library that is taking advantage of the Ray ecosystem is Modin, to be able to abstract the distributed capabilities and provide that Pandas interface to be able to scale from local to distributed compute.
[00:11:53] Unknown:
That's absolutely right. You can run Modin on top of Ray to scale pandas. You know, you can run Dask on top of Ray to scale, you know, Dask data frames, Dask arrays, and things like that. You know, there's also Ray Datasets, which is a library for data loading and preprocessing for training.
[00:12:09] Unknown:
Yeah. I was gonna ask about Dask because I know that last time we spoke, I asked about it, and they are the 2 frameworks that are widely viewed as the way to scale out your Python compute. So it's interesting to see that Dask is now available as an API layer on top of Ray as the underlying engine.
[00:12:35] Unknown:
Yeah. In a lot of ways. Dask has great libraries for data frames and arrays and datasets and things like that. And you can use those together with Ray's libraries for machine learning, you know, and use them together to build a single application, and it all runs on top of Ray.
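As a concrete sketch of both integrations, assuming Modin's Ray engine and the Dask-on-Ray scheduler that ships in `ray.util.dask`:

```python
import ray

ray.init()

# Modin: the pandas API, executed on Ray.
import modin.pandas as pd  # uses Ray as the engine when Ray is available
df = pd.DataFrame({"a": range(1_000_000)})
print(df["a"].sum())

# Dask on Ray: Dask collections, scheduled by Ray.
import dask.array as da
from ray.util.dask import enable_dask_on_ray
enable_dask_on_ray()  # route Dask task graphs through Ray's scheduler
arr = da.ones((1000, 1000), chunks=(100, 100))
print(arr.sum().compute())
```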
[00:12:52] Unknown:
And as you mentioned, the overall space of machine learning and AI has exploded in the past decade with the widespread adoption of deep learning. And even in the past couple of years, there has been a lot of activity in terms of new model types and new approaches to transfer learning. And I know that 1 of the areas of focus for Ray in particular is the space of reinforcement learning. I'm just wondering if you can talk to some of the ways that the overall landscape of AI and ML technologies has shifted and evolved over the past couple of years, some of the ways that you are looking to take advantage of that evolution with the Ray engine, and some of the ways that you're supporting that transformation in the space?
[00:13:34] Unknown:
Reinforcement learning is a good example of this. You know, the way that a lot of people do machine learning today is that you have 2 distinct stages where there's the model development, you train the model, and then there's a handoff, and it's put in production and served. Right? And so what we are seeing is that more and more, you're gonna close the loop. Basically, you're going to have machine learning models that are interacting with the world, you know, making decisions, taking actions, getting feedback, and then learning based on that experience.
Right? And so this is something we've seen companies do very successfully in the context of recommendation systems. We've seen it in other domains as well. And, you know, reinforcement learning is 1 sort of formulation of this. Online learning is another sort of way to formulate this. The fact is that you can do the deployment on top of Ray. You can do the serving on top of Ray. You can do the training on top of Ray. You can potentially do the data processing or streaming on top of Ray. As that emerges more and more as a sort of, you know, standard way to do machine learning, Ray will be really well positioned to support these kinds of online learning and reinforcement learning use cases. 1 thing I've been impressed with is the number of reinforcement learning success stories that I've seen at companies. We're actually hosting a conference, like a 1 day summit on production reinforcement learning, at the end of this month, to really highlight some of these use cases. But you see people doing this successfully in the gaming industry. You know, you see people doing this for recommendations. You see people doing this for, like, controlling machinery in, like, industrial applications.
And that's something that, if you can get it right, reinforcement learning can be applied very broadly to a lot of problems.
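A small sketch of what scaling reinforcement learning on Ray looks like with RLlib, using Ray Tune to drive a PPO experiment; module paths and config keys vary by Ray version, and the environment here is just a toy example:

```python
import ray
from ray import tune

ray.init()
tune.run(
    "PPO",                        # RLlib's PPO algorithm, by registered name
    config={
        "env": "CartPole-v0",     # toy Gym environment for illustration
        "num_workers": 2,         # rollout workers, scaled out as Ray actors
        "framework": "torch",
    },
    stop={"episode_reward_mean": 150},
)
```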
[00:15:25] Unknown:
When organizations and teams are trying to adopt and apply machine learning and AI capabilities to their products and their organizations, what are some of the common challenges and issues that they encounter when they're trying to move from initial prototypes into a production environment?
[00:15:45] Unknown:
There are a few big challenges that stand out to me. 1 is about scaling, and this is really a lot of our bread and butter with Ray and Anyscale. But, you know, the amount of compute required to do machine learning, you've probably seen these plots, is just growing incredibly quickly. It's not just reinforcement learning and things like AlphaGo. It's also, you know, large language models or large computer vision models. It's all over the place. You know, to do machine learning seriously, depending on the domain, you often have no choice but to scale things. And when I say scale, I mean, you know, across many machines, perhaps in the cloud. And the problem, of course, is that scaling these applications requires a big infrastructure investment. You're often building an infrastructure team and then relying on teams to maintain and manage these kinds of platforms for scaling.
And so that is an area where 1 of the core motivations for what we're building, you know, with Anyscale is to try to take that infrastructure investment off of the critical path for doing AI. But, of course, that is a big challenge today. And often we see, when companies are evaluating using Ray or Anyscale, the alternative is often building their own distributed systems. So the first challenge around scaling, you can think of that as, like, I know how to do stuff on my laptop, but how do I really go from my laptop to the cloud? Or how do I make that a coherent, you know, kind of seamless experience? The second 1 is not about going from the laptop to the cloud, but really about productionization.
Right? It's going from prototype to production or development to production. And there, in many companies, that might be a handoff to a different team. It might be a different software stack. It might be a rewrite of everything. And so can you enable the same people who are developing and prototyping to also put stuff in production, and to do that easily and potentially without relying on other teams? And there are a lot more considerations around production, whether that's monitoring, you know, observability, you know, just requirements like high availability and having SLAs and things like that, which don't come up during development.
And, you know, that is, again, 1 of the main things we're trying to solve with Anyscale. So 2 of the main things we're trying to solve with Anyscale are making it easy to scale from your laptop to the cloud and then making it easy to go from development to production. So instead of being a rewrite when you go from development to production, really 1 is just a sort of extension of the other. You know, the third challenge, which is related to all of this, is just the expertise required to do all of this, whether that is, you know, expertise in infrastructure for scaling and production and things like that.
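The laptop-to-cloud point is easiest to see in code: the application itself doesn't change, only how `ray.init()` connects. A sketch, with a hypothetical cluster address:

```python
import ray

# On a laptop:
#   ray.init()
# Against a running Ray cluster (e.g. one started with `ray up cluster.yaml`),
# using Ray Client; the hostname below is a placeholder:
ray.init(address="ray://head-node-hostname:10001")

@ray.remote
def work(i):
    return i * i

# Identical code either way; only the resources behind it differ.
print(ray.get([work.remote(i) for i in range(100)]))
```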
[00:18:19] Unknown:
Along with the productionization and the utility of reinforcement learning, it also generally requires a real time feed of data to be able to continuously execute that training and feedback loop in real time. And I'm wondering what are some of the additional challenges that are brought about by trying to manage that real time machine learning flow, beyond just the kind of batch based model development and deployment?
[00:18:48] Unknown:
So there are many challenges around data. As a company and with Ray, we are very focused on the compute side of the story. And so, you know, our customers or Ray users, you know, their data may live in Databricks or their data may live in Snowflake or, you know, in S3. There are a variety of different options that we integrate with there. And so, you know, fortunately, we don't have to solve everything, but, you know, we integrate with all of these different tools. But some of the challenges that people see are, you know, the usual challenges you might expect around scaling, around performance.
Some of these challenges we do have to solve ourselves because they're so tightly intermixed with, you know, the rest of the application. So for example, say you're doing training, but as you're doing the training, or between training epochs, you want to shuffle your data. That may be a common thing to do. And if your data is hundreds of terabytes or petabytes of data, you know, having a system like Ray, which can potentially, you know, perform that kind of scalable shuffle along with being able to do the training, is super valuable. So that's 1 of the challenges. I think balancing that also with, you know, the real time and streaming nature of a lot of applications.
Right? Being able to do serving and respond to queries in a very low latency setting, that's another challenge. And there are a variety of things like that.
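For the shuffle-between-epochs case Robert describes, Ray Datasets expose a distributed shuffle directly; a minimal sketch, with a hypothetical S3 path:

```python
import ray

ds = ray.data.read_parquet("s3://my-bucket/training-data/")  # placeholder path

for epoch in range(3):
    # random_shuffle() is a distributed shuffle across the cluster, so it
    # works even when the dataset doesn't fit on a single machine.
    shuffled = ds.random_shuffle()
    for batch in shuffled.iter_batches(batch_size=1024):
        pass  # feed each batch into the training step
```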
[00:20:13] Unknown:
Particularly in the deep learning space, I know that another 1 of the concerns that a lot of organizations are dealing with is, particularly recently, the general availability of GPUs for being able to do fast and heavyweight computation for model training, but also, in order to maybe circumvent some of that, being able to do network pruning to scale down the number of connections and nodes within the model so that you can actually run it on CPUs at high speed. And I'm wondering how much of that applies to the reinforcement learning space as well, some of the work that you and your team have seen or some of the ways that the community has been orienting around being able to improve the overall efficiency of executing these machine learning models, and maybe some of the ways that Ray is able to contribute to some of that network optimization in the models themselves?
[00:21:11] Unknown:
So there's a few things there to touch on. 1 is that a lot of this work on, you know, things like model pruning or, you know, making performance optimizations for neural networks, a lot of the work being done there is being done in the deep learning frameworks themselves, whether that is TensorFlow or PyTorch or JAX or other tools like that. And so because Ray integrates and is compatible with all of these different frameworks, we inherit the benefits of all of these things. So if somebody comes up with a way to, you know, discretize your neural net weights or, you know, prune the neural network, you know, to improve the performance, then you can scale that with Ray or you can serve it with Ray. You can do training with Ray. And, you know, Ray will handle all of the kind of, like, distributed computing aspects of it, but you sort of get the benefits of all of these things together. So, you know, a lot of the focus of Ray is on the infrastructure and sort of the compute layer, and then we integrate with all of the other, you know, tools that everyone is building. But we do see that come up certainly with training.
It's a common theme to see, you know, a lot of these training algorithms bottlenecked by communication between the machines. And that's an area where 1 of the things we make sure to do is to have all of the latest and, you know, best algorithms in the Ray training libraries so that we can get all of the benefits of the, you know, advances that are coming out from other frameworks. The other thing I wanna touch on there, you mentioned, like, accelerators like GPUs. As you know, GPUs are essential for doing deep learning. And you're gonna see more and more of these kinds of accelerators. It's not just GPUs.
It's TPUs, and there will be other, you know, specialized accelerators coming out as well. This is super complementary with what we're doing with Ray because the experience that we want people to have is, you know, you can just define your neural network or you can just define your application logic. That's all you have to worry about. And then you can run that with Ray or with Anyscale. And if you want to use a TPU or a GPU, you know, you can just sort of say that you want 8 GPUs or say that you want, you know, 4 TPUs. And those resources will just appear, right, in kind of a serverless fashion. The resources are just added to the cluster, and then your code runs on them and can take advantage of them. You didn't have to, you know, install CUDA or whatever on all the machines, get the right AMI, figure out which instance type to choose, and then, you know, figure out how to place your code there. All you really did was write your Python logic. And so that's the experience we want for people. It's as if you have this infinite machine with all possible accelerators on it. And you just say, you know, this training task uses 8 GPUs. And then when it's done, the GPUs go away. And if you need a 100 GPUs, they just appear. That's the kind of experience we want for developers.
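In open source Ray, that resource-request experience looks like the sketch below; the task names are hypothetical, and "TPU" is a custom resource label the cluster would have to define:

```python
import ray

ray.init()

@ray.remote(num_gpus=8)
def train_model(config):
    # Scheduled only where 8 GPUs are free; with the autoscaler,
    # matching nodes are added on demand and released afterwards.
    ...

@ray.remote(num_cpus=4, resources={"TPU": 4})
def train_on_tpu(config):
    # Custom resources like "TPU" are just labels the scheduler matches.
    ...
```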
[00:24:06] Unknown:
On that note, I'm wondering if you can talk to some of the operational characteristics of deploying and managing a Ray environment and some of the complexities that are associated with just that productionization of Ray as an infrastructure component?
[00:24:22] Unknown:
If you're using Ray, Anyscale should be the best way to run Ray. Right? So maybe I didn't say this earlier, but Anyscale is providing a managed Ray service. Right? So if you don't wanna be in the business of managing infrastructure, then, you know, Anyscale can be a good choice. Of course, you can always run Ray yourself. And 1 of the things that people love about Ray is just how you can run it anywhere. Right? Whether that is on your laptop, on any cloud provider, on a Kubernetes cluster. And, you know, the way you would deploy Ray on Kubernetes is similar to how you might deploy other distributed frameworks like Spark on top of Kubernetes.
We have a Kubernetes operator that people find useful. But, of course, a lot of the value, you know, and what people like about Anyscale, is the way in which we make production easier and make some of these operational considerations easier, whether that is providing, you know, monitoring, out of the box dashboards, and things like that.
[00:25:23] Unknown:
As somebody who is providing the infrastructure and providing the managed experience for Ray to your end users, I'm wondering what are some of the specialized utilities and additional capabilities that you've had to develop to be able to manage those environments, scale them, handle multi tenancy, and optimize the performance and experience of end users who
[00:25:52] Unknown:
just want to use the Ray API, get their work done, and not have to worry about all of the underlying machines that are actually doing the compute. So there are many, many things we've had to think about. You know, since you mentioned multi tenancy, 1 thing I'll add there that we think a lot about is security. Right? And, you know, if you think about remote code execution, well, like, we are running arbitrary code from our customers and users. Right? So isolation is something we think about a lot. Right? And we generally really invested from very early on in the company in building a strong security team.
So that's 1 of the considerations. You know, a lot of what we've built, and things we've tried to do to make it easier, you know, to add value for our customers, is not just about developing Ray, but about developing some of the things adjacent to Ray. So for instance, you know, 1 thing we give a lot of thought to is how to make deployment easy. How to, like, let people build the right environment and deploy their applications in that environment in a reproducible way, in, like, a deterministic way, you know, in a way that is easy to specify, where you don't have to go deep into, you know, building Docker images or things like that. So you can think about the kind of build farm infrastructure around that that's been important.
And some of the things we've been building are around collaboration, around, like, giving organizations more insight into costs and being able to control those costs. And then, of course, the monitoring and production side of the story as well.
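Anyscale's build infrastructure itself isn't shown here, but open source Ray's runtime environments illustrate the deployment idea Robert describes: the environment is declared in code instead of hand-built into images. A sketch, with hypothetical dependencies:

```python
import ray

ray.init(
    runtime_env={
        "pip": ["torch==1.12.0", "pandas"],   # placeholder pinned deps
        "env_vars": {"MODEL_STAGE": "prod"},  # placeholder variable
    }
)
# Tasks and actors now run inside this environment on every node,
# without manually building and distributing a container image.
```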
[00:27:25] Unknown:
In your experience of running Ray in this production context, at a scale that's beyond what most people are going to have to deal with, I'm wondering what are some of the
[00:27:36] Unknown:
sharp edges that you ran into in the Ray project that you've had to smooth over and just some of the ways that the work at Anyscale has fed back into the open source? I mean, at Anyscale, we are, you know, developing Ray very heavily. A lot of our engineering effort goes into trying to make Ray really good. We always run into limitations of Ray, whether that is around pushing the limits of scalability. You know, I think 1 of the big efforts is around, like I mentioned, making production easier. A lot of the requirements around production and deployments have really influenced, you know, how we think about the Ray roadmap.
And performance as well. I think there are a lot of requirements around just making things faster or making things perform better, and we've done a lot of work there as well.
[00:28:26] Unknown:
And on the note of the balance between the open source and the corporate work that you're doing, I'm curious if you can talk to the governance model around the Ray project, some of the ways that you think about sustainability of the project and the ecosystem, and how you think about the dividing lines between what to make commercial and what to keep in the open source?
[00:28:50] Unknown:
You asked about the governance model around Ray. So, you know, of course, we do a ton of Ray development at Anyscale. We collaborate very closely with a number of companies who are all, you know, developing or contributing to Ray. And that's something where we've had, you know, really productive collaboration. The way this works, we have a group of committers, like Ray committers. There are people at Anyscale, people at other companies. And, of course, we are regularly adding new committers. This is done based on discussion among existing committers, and it's essentially based on the contributions that the person has made to the project. And that's worked super well so far. You know, I think 1 of the important things here is just being very communicative with the community and sharing proposals, soliciting proposals, you know, getting feedback on design docs
[00:29:42] Unknown:
and things like that. And that's been working very well for us so far. As far as the roadmap for the Ray project, I'm wondering what you see as being the main driving factors as to where Anyscale and the broader community focus their efforts, and what you see as being some of the major areas for improvement and areas that could use a bit of additional polish to make the overall Ray experience more accessible and productive for everybody.
[00:30:15] Unknown:
There's always a lot of room for improvements. And I like the way you phrased it because, you know, when we think about the value that we're adding, a lot of it is around developer productivity. It's taking away a lot of the kind of low level infrastructure work or systems work and letting people iterate faster at a large scale. And so, you know, the core areas that we wanna focus on are really making it easy to scale, right, from your laptop to the cloud and making it really easy to move from development to production. And so, of course, these are areas we've been working on for a while, and the production 1 is an especially large focus right now. I think we see a lot of room for improvement in performance, especially at scale. We see a lot of room for improvement in, you know, the serverless experience, making it so that developers don't have to think about clusters, don't have to configure clusters or instance types, so that they can just, as much as possible, focus on their application logic. So that's something we're really trying to double down on.
[00:31:21] Unknown:
1 of the other things that's interesting to cover is maybe the ways that you think about engagement with the broader community and how the ecosystem has developed, as to sort of which pieces should integrate closely with Ray and in which ways Ray should kind of meet the rest of the community where it's at. You know, over time,
[00:31:44] Unknown:
people and, like, the industry will converge on a tech stack for AI. And right now, that's pretty wide open. But we see Ray emerging as the compute layer for that stack. And, of course, there are many pieces to the stack. Right? There's everything from, you know, the data component to the model registries and feature stores, experiment tracking. And so I think all of these components of this eventual stack should integrate together. And, of course, we want it to be really seamless for our users if, you know, if your data is stored somewhere and whether it's, you know, Databricks or Snowflake, and then you're, you know, doing training or serving with Ray, it's critical to have seamless integrations there.
Similarly, we, you know, integrate with experiment tracking tools like Weights and Biases or MLflow. Like, 1 of the things that people like about Ray, I mentioned you can run Ray anywhere, but Ray is also super framework agnostic. Like, you can use it with any other tools out there. Right? Certainly with any Python libraries or deep learning frameworks like PyTorch and TensorFlow or scikit learn, it's very compatible with what you're trying to do. So I don't know if there are any of these tools in the space, you know, where we don't think there could be a good integration story.
[00:32:57] Unknown:
And I know that the last time we talked, towards the end of the conversation, I was asking about the language ecosystem aspect of Ray, where, initially, it was very focused on Python because of the strong focus of the Python community on machine learning and AI. And I was asking about the capabilities of Ray to be able to support some of the other language runtimes, and I know that you had said that there were at least some initial goals of being able to do that. So I'm curious what the current state of affairs is with regard to being able to use Ray for scaling compute across other language runtimes. Things may not have changed there so much since the last time we chatted. We are very focused on Python as,
[00:33:41] Unknown:
you know, the main use case where we're trying to make it super successful. That said, people also use Ray to scale Java, and, you know, supporting a couple different languages like that has, you know, in a good way, forced us to design the whole system in a pretty language agnostic way. Right? So that eventually, you'll be able to, you know, use Ray to scale computation in many different languages. There's a little bit of work right now happening on Rust, scaling Rust, but by far the majority of work is on Python. And, of course, part of Ray is written in Python, but a lot of Ray is written in C++, and that turns out to be a pretty good combination for building performant systems.
[00:34:22] Unknown:
In your experience of working with the Ray community, building the framework, building the business around it, and helping to grow that ecosystem, what are some of the most interesting, innovative, or unexpected ways that you have seen it used?
[00:34:38] Unknown:
You know, this isn't entirely unexpected, but 1 of the things that has been great to see is, you know, seeing more and more organizations either evaluating reinforcement learning for their use cases or actually, you know, really succeeding with reinforcement learning in ways that, you know, they wouldn't have been able to do without reinforcement learning. We've seen that certainly in gaming and recommendations, and we're starting to see that as well in, like, industrial applications and then potentially medical applications. So there's a lot of promise there. I think that's 1 of the main unexpected things.
You know, I think we're always surprised by just how broadly machine learning can be used in just every industry and across, you know, every different use case. In your own experience of
[00:35:27] Unknown:
building the Ray framework and growing the AnyScale business around that, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:35:35] Unknown:
For building the business? It's been exciting. You know, my background is more on the engineering side. Right? Like, actually doing the software engineering to build Ray and things like that. You know, I gravitate toward APIs, the user experience, and, you know, 1 of the things that I really care about is just how can we create a magical experience for developers. You know, that said, in addition to the engineering side and thinking about the product, there are many different aspects to running a business, many of which are things that I had essentially no exposure to beforehand. And so everything's surprising to me. You know, I've certainly been learning a lot along with my cofounders. I've had a really fantastic time doing this. For anybody who is looking to
[00:36:21] Unknown:
accelerate or scale their adoption of machine learning and AI, what are the cases where Anyscale or Ray are the wrong choice, and they might be better suited with a different execution framework, a different operating environment, or just some homegrown capabilities?
[00:36:39] Unknown:
Good question. I think, you know, for scaling AI, that really is, you know, our bread and butter, like, where we are strong. 1 thing we don't do, for instance, is more on the data side, like, whether that is, you know, large scale SQL queries or other kinds of data analytics like that. The extent to which we do data processing in Ray, it's really targeted at the data processing you need in order to do training and other parts of the machine learning life cycle. So there are many, you know, data scientists whose use cases would not be a good fit for Ray or Anyscale. But when it comes to, you know, deep learning or the compute behind training or the experimentation or, you know, the deployments, those are all things where I think, you know, Ray and Anyscale are a pretty good choice.
[00:37:25] Unknown:
As you continue to grow the business and the technology stack around AnyScale and Ray, what are some of the things you have planned for the near to medium term or any projects or initiatives that you're excited to dig into?
[00:37:39] Unknown:
I would say for Ray, 1 of the things we're super excited about is just building a great library ecosystem on top of Ray. And there's sort of 2 components to that. Right? There's having best in class libraries so that, you know, the best way to do training or the best way to do reinforcement learning or serving is to use a library on top of Ray. And then there's also the component of making sure these things work seamlessly together. Right? It's not just a collection of libraries, but really a coherent ecosystem, a coherent seamless experience for users.
So that's something we are excited about with Ray. With Anyscale, it's really about, you know, the serverless experience, making it so that, you know, you just don't think about configuring clusters. As a developer, you just focus on the application logic, and everything else just works out. If you need more resources, you have them. If you don't need them, they go away. That's true for any, you know, type of accelerator or memory or compute, you know, CPUs, GPUs. And, like, the kind of experience you wanna have is, like, if you know how to program on your laptop, then you can take advantage of all the benefits of the cloud without becoming an expert.
[00:38:46] Unknown:
Are there any other aspects of the Ray project and its ecosystem, the work that you're doing at Anyscale, or just the overall space of being able to adopt and scale artificial intelligence and machine learning applications that we didn't discuss yet that you would like to cover before we close out the show? You know, I would say
[00:39:04] Unknown:
1 thing that I could say a little bit more about is a few of the interesting directions we've seen with AI workloads. So 1 is for serving. We all know about deploying machine learning models and, you know, taking your model and putting it behind some endpoint. But the complexity can really increase in a few different scenarios. So 1 is if you need to scale. Another is if you start to have many models or multiple models that you're composing together. As an example, you know, imagine you're trying to deploy a service that reads street signs or something like that. You might want to have a first model that detects where the street signs are, and then, if there are some street signs, a second model that, like, reads the signs. Right? And so that's just an example. But there are a lot of natural scenarios where you wanna compose multiple models together.
A 3rd case is where your machine learning, you know, it's not just a model. Right? It's actually machine learning models combined with other business or application logic. Right? And so when these things start to get intermixed, you need more flexibility. And sort of a 4th challenge that a lot of people run into is, you know, doing model serving in a very framework agnostic way. So some of your developers are using PyTorch. Some of them are using TensorFlow. Some are just writing Python code. Some are using scikit learn. You know, how do you have infrastructure that makes all of these things work easily? And, you know, these are all challenges that Ray Serve, the serving library on top of Ray, is solving, and that we've seen people doing pretty successfully. But I think if there's 1 direction that is, you know, introducing a lot of complexity, but also, you know, we see it as a really growing use case, it's the composition of models together to build applications. 1 thing I'll say about reinforcement learning is that, you know, reinforcement learning can be hard to do. And 1 of the challenges, like, when you think about traditional reinforcement learning, you might be thinking, you know, do I really want to do trial and error in production? Like, that doesn't sound like a good idea. What we've seen is more of a focus on what's called offline reinforcement learning, where essentially you don't have to do trial and error in production.
You can actually take data that's been collected, you know, that you've logged from previous interactions, and learn from that, like, train a reinforcement learning model that's sort of learning from observing what was done in the past. And so this can be safer to deploy. It can be easier to deploy, and it can kind of fit into the existing paradigm for how machine learning models are trained and deployed. So that's something that we think has the potential to make reinforcement learning much more, you know, accessible and sort of open it up to a lot more applications.
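The street-sign composition Robert describes maps onto Ray Serve's deployment handles. A hedged sketch following the Serve 1.x-era handle pattern; the model classes are hypothetical stand-ins, and exact APIs vary by Ray version:

```python
from ray import serve

serve.start()

@serve.deployment
class SignReader:
    def __call__(self, crop):
        return "STOP"  # placeholder for the second model's inference

@serve.deployment
class SignDetector:
    def __init__(self):
        # Async handle to the already-deployed reader.
        self.reader = serve.get_deployment("SignReader").get_handle(sync=False)

    async def __call__(self, request):
        crops = ["crop-1", "crop-2"]  # placeholder for a detection model's output
        # await handle.remote() yields an ObjectRef; awaiting that yields the value.
        return [await (await self.reader.remote(c)) for c in crops]

SignReader.deploy()
SignDetector.deploy()
```

Each deployment scales independently, which is part of what makes the cheap-model/expensive-model cascades discussed next practical.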
[00:41:44] Unknown:
Yeah. Definitely interesting considerations. And particularly the composition of models and being able to manage the sort of stages of execution, I can see as being very valuable and unlocking a lot of additional applications of machine learning that would otherwise be unwieldy or impractical for organizations to try and adopt.
[00:42:09] Unknown:
Yeah. There are many reasons you might wanna do it. I mean, 1 is that you might have a model that's cheaper to evaluate but less accurate and another model that's more expensive but more accurate. And you may want to do multiple passes to kind of conserve resources but get to the best decision. And so that's something we see all the time. And if you're interested in, you know, this model composition stuff or some of the challenges around model deployment, Ray Serve is the library on top of Ray that's trying to solve these challenges.
[00:42:35] Unknown:
Yeah. It's definitely interesting because I can see that a number of organizations would likely just try to fit all of the capabilities into a single model and explode the complexity, ending up with a reduced capacity and, you know, degraded overall experience.
[00:42:51] Unknown:
And it gets hard. It's 1 thing to build the infrastructure yourself when you're deploying 1 model, and we see that, you know, that's very normal to do. But we have customers that are deploying hundreds of models. You know, some of them are these, you know, tiny, inexpensive models to evaluate. Some of them have very high memory footprints and are very computationally intensive.
[00:43:12] Unknown:
And so figuring out how to place these on different machines or how to, you know, scale these in different ways, you end up starting to build a lot of infrastructure. And that's something we simplify a lot. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. This week, I'm going to choose a book that I've been reading with my son called Beyond the Deepwoods, and it's the first book that was written as part of the Edge Chronicles, which is actually a fairly expansive series of stories. But the first 1 that we've been going through so far has just been a lot of fun, a really interesting story, great characters, and it has a lot of really cool illustrations in it. But it's a chapter book, so it's got a lot of good sort of meat to it. It's a great story to read with, you know, anyone from, you know, upper elementary school up to, you know, ancient adults. I've been having a lot of fun with it. So definitely recommend checking that out if you're looking for something to read. And with that, I'll pass it to you, Robert. What do you have for picks this week? 2 things that come to mind. So 1, Ray related,
[00:44:17] Unknown:
is that at the end of this month, we are having a 1 day RL summit on, like, production reinforcement learning. And that's something that I'm excited about, interested to see how a wide range of companies are deploying reinforcement learning in production and what that looks like. That's gonna highlight a lot of use cases. The other is a science fiction book that I read recently, which I enjoyed a lot, called Project Hail Mary, and it's about, you know, saving the planet from, I guess, extinction.
[00:44:46] Unknown:
Alright. I'll have to check that 1 out. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Anyscale and on the Ray project and ecosystem. It's definitely a very interesting and valuable project and set of capabilities that you're offering to people to support their machine learning and AI capacities. So I appreciate all of the time and energy that you and your team have put into that, and I hope you enjoy the rest of your day. Thanks so much. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Welcome
Robert Nishihara's Background and Introduction to Ray
Building Anyscale and the Motivation Behind It
Evolution of Ray and Anyscale
Ray's Integration with Other Libraries
AI and ML Landscape Changes
Challenges in Real-Time Machine Learning
Deep Learning and GPU Utilization
Operational Characteristics of Ray
Ray's Production Challenges and Improvements
Ray's Roadmap and Community Engagement
Unexpected Lessons and Reinforcement Learning
Future Plans for Ray and Anyscale
AI Workloads and Model Composition
Reinforcement Learning and Offline Learning