Summary
Distributed computing is a powerful tool for increasing the speed and performance of your applications, but it is also a complex and difficult undertaking. While performing research for his PhD, Robert Nishihara ran up against this reality. Rather than cobbling together another single-purpose system, he built what ultimately became Ray to make scaling Python projects to multiple cores and across machines easy. In this episode he explains how Ray allows you to scale your code easily, how to use it in your own projects, and his ambitions to power the next wave of distributed systems at Anyscale. If you are running into scaling limitations in your Python projects for machine learning, scientific computing, or anything else, then give this a listen and then try it out!
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Robert Nishihara about Ray, a framework for building and running distributed applications and machine learning workloads in Python
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what Ray is and how the project got started?
- How did the environment of the RISE lab factor into the early design and development of Ray?
- What are some of the main use cases that you were initially targeting with Ray?
- Now that it has been publicly available for some time, what are some of the ways that it is being used which you didn’t originally anticipate?
- What are the limitations for the types of workloads that can be run with Ray, or any edge cases that developers should be aware of?
- For someone who is building on top of ray, what is involved in either converting an existing application to take advantage of Ray’s parallelism, or creating a greenfield project with it?
- Can you describe how Ray itself is implemented and how it has evolved since you first began working on it?
- How does the clustering and task distribution mechanism in Ray work?
- How does the increased parallelism that Ray offers help with machine learning workloads?
- Are there any types of ML/AI that are easier to do in this context?
- What are some of the additional layers or libraries that have been built on top of the functionality of Ray?
- What are some of the most interesting, challenging, or complex aspects of building and maintaining Ray?
- You and your co-founders recently announced the formation of Anyscale to support the future development of Ray. What is your business model and how are you approaching the governance of Ray and its ecosystem?
- What are some of the most interesting or unexpected projects that you have seen built with Ray?
- What are some cases where Ray is the wrong choice?
- What do you have planned for the future of Ray and Anyscale?
Keep In Touch
- Website
- @robertnishihara on Twitter
- robertnishihara on GitHub
Picks
- Tobias
- D&D Castle Ravenloft board game
- One Deck Dungeon
- Robert
- The Everything Store
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Ray
- Anyscale
- UC Berkeley
- RISELab
- MATLAB
- Deep Learning
- Theano
- TensorFlow
- PyTorch
- Philip Moritz
- Reinforcement Learning
- Hyperparameter Tuning
- IPython Parallel
- AMPLab
- Apache Spark
- Actor Model
- Horovod
- Flink
- Spark Streaming
- Dask
- gRPC
- Tune
- Rust
- C++
- C
- Apache Arrow
- Wes McKinney
- Databricks
- MongoDB
- Elastic
- Confluent
- Embarrassingly Parallel
- Ant Financial
- Flame Graph
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With 200 gigabit private networking, node balancers, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models or running your CI/CD pipelines, they've got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Robert Nishihara about Ray, a framework for building and running distributed applications and machine learning workloads in Python.
So, Robert, can you start by introducing yourself?
[00:01:11] Unknown:
Yeah. Well, first of all, thanks so much for speaking with me. So I am one of the creators of an open source project called Ray, and one of the co-founders and CEO of Anyscale, which is commercializing Ray. Before starting this company about half a year ago, I did a PhD in machine learning and distributed systems at UC Berkeley. And of course, Ray was a project we began several years ago at UC Berkeley during the PhD program, to address problems that we were running into in our own research. And that has since transitioned into Anyscale. And do you remember how you first got introduced to Python? So I first started using Python in college. When I started out doing machine learning research as an undergraduate, the dominant tool at that time was MATLAB. Everyone was using MATLAB to run experiments and do regressions and things like that. And at some point, everyone started transitioning to Python.
And once deep learning really took off around 2013, with frameworks like Theano and, later on, TensorFlow and PyTorch, that really solidified Python's dominance in machine learning. And so you mentioned
[00:02:25] Unknown:
that Ray is a distributed task execution framework, and that it started when you were doing your PhD research at UC Berkeley. And my understanding is that you were part of RISE Lab at the time. I'm wondering if you can just give a bit more background on the project and the motivations for starting it, some of the challenges that you were facing in your research that led to it, and how the overall environment of the RISE Lab factored into some of the early design and development decisions of the project? That's a great question. So the research that I
[00:02:54] Unknown:
started out doing, and that Philip Moritz, one of my co-founders and also one of the creators of Ray, was doing, was in more theoretical machine learning: trying to design better algorithms for reinforcement learning, for optimization, for learning. And if you walk into the AI research building at Berkeley, or at any research university, one thing you'll notice is that you have all these machine learning researchers with backgrounds in math and statistics. They're trying to design better algorithms for machine learning, but what they actually spend a ton of their time doing is low-level engineering to scale things up and to speed things up, because the experiments in machine learning are very computationally intensive.
They can be quite time consuming to run. And so speeding things up, or scaling things up to run on a cluster instead of on a laptop, can be the difference between an experiment taking a week versus a day versus an hour. And so you have all these graduate students and machine learning practitioners who are not only designing and implementing their machine learning algorithms, they're also building all of this infrastructure and scaffolding underneath to run their algorithms at a large scale. And, you know, we did this ourselves, of course, but we felt that there had to be a better way, one that could let us just focus on the algorithms and the applications that we were trying to implement, and not have to become experts in distributed computing. So we set out to try to build better tools. And, of course, we had used many of the existing tools out there. Even back as an undergraduate, when I was doing machine learning research a number of years ago, I had used a tool called IPython Parallel, which was one of the first tools for doing parallel and distributed Python. Of course, that made my life a lot easier. But when it came to deep learning and more modern machine learning applications, the tools just weren't there yet. And so these researchers were constantly building their own ad hoc tooling and new infrastructure. So that was the scenario that we found ourselves in, where to do machine learning, you really had to build a lot of infrastructure.
And that's why we set out to try to build tools to make this much, much easier. And you asked about the RISE Lab. Well, when I started the PhD program, I was part of the AMPLab at Berkeley, which, you know, created Spark. And that transitioned into the RISE Lab. In both of these settings, I was surrounded not just by experts in machine learning, but also by experts in distributed computing. And as I was starting to get interested in how to build tools and build systems, I was able to learn from them, to learn from their experience with Spark and other open source systems. It wouldn't have been possible if I hadn't been in that environment.
And that really helped us avoid some of the pitfalls.
[00:05:58] Unknown:
It helped us, you know, really move much more quickly in terms of building a great framework. And one of the common themes in a lot of the projects that I've had on the show, and people that I've spoken with, is that they set out to solve one problem and, on the way, discover that they actually have to solve some completely orthogonal problem first. And half the time, it seems that they never actually get back to the original problem they were trying to deal with. And I'm curious what your experience was: whether Ray just consumed all of your attention and you never actually made it back to your original research, or whether you were able to at least get back to what you were originally trying to accomplish after you got to a working state with Ray? That's funny that you ask. Yeah. When we started Ray, it was a little bit of a detour. We were doing machine learning research. It became clear to us that the tooling and the infrastructure was the bottleneck. And so we set out to build something better there. And, of course, the goal from the start with building Ray was to build useful open source software. It was to try to solve this pain point where people really struggle to scale up their applications from one machine to a cluster.
[00:07:00] Unknown:
And when we started out, we had an idea of the API that we wanted and the behavior that we wanted, and we figured it would take us about a month to finish. And, of course, in a month we did have something. We had a prototype, and we had something working. But to really make something useful for people, to give companies confidence that they can run it in production, requires much more than a prototype. You have to really take it all the way. And so even if you can get some initial prototype up and running pretty quickly, to really polish it and make it great, there's a lot of extra work involved, and so it takes longer than you initially expect. Now, one interesting thing related to what you asked is that there haven't been too many changes to the scope of what we've been building. We started out quite broad. But over time, we encountered more and more use cases that we realized we could support that we didn't initially realize we could support. And I can talk more about these later, but as one example, Ray actually started out as purely functional and stateless.
Essentially, it started out as a way to scale up Python functions. But then we realized a lot of the applications we really cared about were stateful and required managing mutable state in the distributed setting. And so we introduced the ability to scale up not just Python functions but also Python classes, by introducing actors. And that was one API change that really opened up a lot of new use cases in a really powerful way. Yeah. When I was looking through the documentation
[00:08:43] Unknown:
and seeing the difference between just a remote task versus a remote actor, at first glance it wasn't really clear to me the significance of the minor syntax difference, because you still just use a decorator; it's just a matter of whether you decorate a class versus a function that determines whether or not statefulness is incorporated. So I think that the API and the user experience that you landed on is fairly elegant and very minimal, while still providing a significant amount of power to the end user. That's exactly something we very much strive for,
[00:09:15] Unknown:
to keep it simple, to have only a few concepts that are powerful in general.
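To make the task/actor distinction discussed here concrete, below is a minimal sketch using the standard Ray API; the `square` function and `Counter` class are just illustrative names.

```python
import ray

ray.init()  # on a cluster, this connects to the running Ray runtime

# Decorating a function gives a stateless remote task.
@ray.remote
def square(x):
    return x * x

# Decorating a class gives a stateful actor.
@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

# Tasks run in parallel on workers; .remote() returns a future.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]

# The actor keeps its mutable state between method calls.
counter = Counter.remote()
print(ray.get([counter.increment.remote() for _ in range(3)]))  # [1, 2, 3]
```

The only visible difference from ordinary Python is the decorator and the `.remote()` call syntax, which is exactly the minimalism being discussed.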
[00:09:21] Unknown:
And so you mentioned that some of the original work you were doing was on some of these advanced machine learning algorithms, such as reinforcement learning, and that was one of the initial use cases that you were targeting. I'm curious, now that it's been publicly available for some time and it's been out in the wild, what are some of the other ways that it's being used which you didn't originally anticipate? Yeah. One use case that we didn't initially anticipate was serving, that is, inference: putting machine learning models in production.
[00:09:48] Unknown:
But it turns out that Ray is actually a pretty natural fit for this use case. And the powerful thing here, the exciting thing, is that when you have a system that can support not just the serving and inference side of the equation but also the training side, you can build online learning applications that do both, where you have algorithms and applications that are continuously interacting with the real world and making decisions, but also learning from those interactions and updating themselves, updating the models. So you have applications, and we have companies doing this in production, where you have data streaming in, you're incrementally processing this data in a streaming fashion and training new recommendation models, and then serving recommendations from those models back to the users, which then affects the data that you're collecting, and there's this tight loop. Normally, when companies implement this kind of online learning application, they're actually stitching together several different distributed systems: one for stream processing, one for training machine learning models, and then one for serving. And here, we've had companies that are able to take that kind of setup, with several different distributed systems stitched together, and rebuild it all on top of Ray. It becomes simpler because it's on top of one framework; they don't have to learn how to use a ton of different distributed systems. It can become faster because you aren't moving data between different systems, which can sometimes be expensive. And it generally gets simpler because these different distributed systems are usually not designed to interface with each other. So if you're using Horovod or distributed TensorFlow for training, and you're using Flink or Spark Streaming for the stream processing, Flink and Horovod were not really designed to compose together.
But if you can do these things on top of Ray, then it can be as natural as just using Python and importing different libraries like NumPy and Pandas, and you can use all of these things together.
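As a rough sketch of the pattern described here, the toy actor below combines incremental training and serving in one process; `RecommenderActor` and its methods are hypothetical stand-ins for a real streaming, training, and serving pipeline.

```python
import ray

ray.init()

# Hypothetical actor combining incremental training and serving.
@ray.remote
class RecommenderActor:
    def __init__(self):
        self.weights = {}  # stand-in for real model state

    def train_on_batch(self, batch):
        # Incrementally update the "model" from a batch of events.
        for user, item in batch:
            self.weights[user] = item
        return len(self.weights)

    def recommend(self, user):
        # Serve a prediction from the same, continuously updated state.
        return self.weights.get(user, "default-item")

model = RecommenderActor.remote()

# A "stream" of interaction events, processed as they arrive...
for batch in [[("alice", "book"), ("bob", "film")], [("alice", "song")]]:
    model.train_on_batch.remote(batch)

# ...while serving requests against the always-current model.
print(ray.get(model.recommend.remote("alice")))  # -> "song"
```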
[00:11:52] Unknown:
Another project in the Python ecosystem that is tackling the distributed computation need that I'm also aware of is Dask. And I know that it was originally built for more scientific computation and research workloads. But I'm wondering what your thoughts are on some of the trade-offs of when somebody might be looking to use Dask versus Ray, or ways that they can work together. That's a great question.
[00:12:14] Unknown:
And Dask has some great libraries for DataFrames and distributed array computation, which we haven't been focused on so much. With Ray, we've been very focused on scalable machine learning, everything from training machine learning models to reinforcement learning to hyperparameter search to serving. And the kinds of applications that we've been focused on really benefit from the ability to do stateful computation with actors. In terms of what Dask provides at a low level, it's very similar to the Ray remote function API: the ability to take Python functions and scale them up.
But a lot of the scalable machine learning libraries that we're building really require stateful computation. So there's some difference in emphasis there. And of course, I think Dask began its life focused on large multicore machines and really optimizing for high performance on a single multicore machine. We are focused on performance very much in the large multicore machine setting as well as the cluster setting, and really trying to match the performance of low-level tools. If you were just using gRPC on top of Kubernetes and things like that, then using Ray should ideally be
[00:13:44] Unknown:
just as fast. For the types of workloads that somebody might think of for Ray, what are some of the limitations that they should be aware of, or edge cases that they might run into as they're developing the application, or any specific design considerations that you have found to be beneficial for successful implementations?
[00:14:02] Unknown:
There are a couple different things to be aware of. So one, the Ray API can be pretty low level. And so to implement all of the features and all of the things you want for your application, you may need to build quite a bit of stuff. For example, if you're looking to do distributed hyperparameter search, well, we provide a great library for scalable hyperparameter search, so you can do that out of the box on top of Ray, and it works well. On the other hand, if you want to do something like large-scale stream processing, we currently don't have a great library for stream processing. And so while it should be possible, in principle, to do that on top of Ray, doing so would require first building your own stream processing library, or some subset of one, before you can really do that well.
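The hyperparameter search library mentioned here is Tune. Below is a minimal sketch of what out-of-the-box search looks like with Tune's function-based API; the toy objective and search space are illustrative, and the exact API surface has shifted across Ray releases.

```python
from ray import tune

def trainable(config):
    # Toy objective standing in for a real training run.
    loss = (config["lr"] - 0.05) ** 2
    tune.report(loss=loss)  # hand the metric back to Tune

analysis = tune.run(
    trainable,
    config={"lr": tune.grid_search([0.001, 0.01, 0.1])},
)
print(analysis.get_best_config(metric="loss", mode="min"))
```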
[00:14:52] Unknown:
And if somebody is trying to convert an existing application to be able to scale out some subset of its functionality, what does the workflow look like for doing that? And how does it compare to somebody who's approaching this in a greenfield system? Yeah. So this is actually an area where Ray really shines.
[00:15:13] Unknown:
So with Ray, if you want to scale up an existing Python application that you already wrote, you can often, not always, but often, just add a few lines, change a few lines, add a few annotations, and take that code and run it anywhere from your laptop to a large cluster. And the important thing here is not just the number of lines of code. It's nice if you only have to add seven lines of code. But the important thing is that you don't have to restructure your code. You don't have to re-architect it and change the abstractions and change the interfaces to coerce it into using Ray. It should really be something you can do in place. And the reason for this has to do with the abstractions that Ray provides. For example, if you want to scale up an arbitrary Python application using Spark, well, the core abstraction that Spark provides is a large-scale distributed dataset. And so if you have an application where the core abstraction is a large dataset and you're manipulating that dataset, then Spark is the perfect choice. On the other hand, if you're trying to do something like hyperparameter search or inference and serving, the main abstraction is not really a dataset. It might be something like a machine learning experiment, or it might be something like a neural network that you're trying to deploy. Then to scale that up with Spark, you have to coerce it into this dataset abstraction, which is not a natural fit, and that can lead you to having to re-architect your application. On the other hand, with Ray, of course we have higher level libraries, but the core Ray API is not introducing any new concepts. It's not introducing these higher level abstractions.
It's just taking the existing concepts of Python functions and classes. And of course, all Python applications are pretty much built with functions and classes. It's taking those two concepts and translating them into the distributed setting. So you have a way of taking any application that's built with functions and classes and then, ideally, with just a few modifications, running it in the distributed setting. So this is something where Ray actually can feel very lightweight, and it's something where we've had a number of different users tell us that they love Ray because they can get started with it so quickly. They can tell people on their team how to use it in 30 minutes, and then they're off and running, and they don't have to really change their code much. They can just take their existing code and scale it up. In terms of Ray itself, I'm wondering if you can discuss how it's actually implemented and some of the ways that it's evolved since you first began working on it. One thing we've done is we've actually rewritten it from scratch several times, as we got new ideas and thought of improved ways to architect it. We had an initial prototype that was written in Rust.
And the implementation after that was in C++, with lots of multithreading. Then we switched to an implementation in C, actually, where all of the different components, like the scheduler and the object store and so on, had single-threaded event loops. And from there, more recently, we switched back to C++, using gRPC as the underlying RPC framework. And we've continued to re-architect it quite a bit, both to improve performance, to improve stability, and to simplify the design. I think the people working on Ray have done a great job of making these bold changes and really not being afraid to tear things down and rewrite them when it makes sense to do so. Now, how does it actually work?
There are a number of key components. One is the scheduler, and the scheduler is actually a distributed concept, where we have a different scheduler process on each machine. The scheduler is responsible for things like starting new worker processes, assigning tasks to different workers, and doing the bookkeeping for what resources are in use, what CPUs or GPUs or memory are currently allocated. There's also an object store process, and this is an important performance optimization for machine learning applications.
A lot of machine learning applications rely heavily on large numerical arrays and large numerical data. And when you have a bunch of different worker processes, if they're all using the same dataset or the same neural network weights or things like that, it can get quite expensive to send copies of this data around to each worker process, and for each process to have its own copy of the data. So an important optimization is to allow each worker, instead of having its own copy, to just have a pointer into shared memory to access the data. What happens is we have this object store process, which can store arbitrary Python objects, but the important part is objects containing large arrays of numbers. And then all the different worker processes can access that data just by having a pointer into shared memory, instead of having to create their own copy.
And then you can avoid a lot of expensive serialization and deserialization. This is something where we first developed the object store as part of Ray, and then we contributed it to the Apache Arrow project, which was started by Wes McKinney; that's a project we've collaborated with very closely. And now people are using the object store as a standalone thing, independent of Ray. I would say those are some of the key architectural pieces. There's more, of course. There's the Ray autoscaler, which is a cluster launcher. One of the pain points with doing distributed computing is starting and stopping clusters, managing the servers, and installing all of your libraries and dependencies on the VMs. What the Ray autoscaler does is it lets you just run a command from your laptop, like `ray up`, and that'll start a cluster. You can do it on AWS, or GCP, or Azure, any cloud provider.
And you can essentially configure and manage these clusters on any cloud provider just by running a single command. So that's another important part of what we're building.
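Here is a small sketch of the shared-memory object store described above, using `ray.put` to store a large array once and hand a reference to many tasks; the array and the `array_mean` task are illustrative.

```python
import numpy as np
import ray

ray.init()

# Store a large array in the shared-memory object store once.
big_array = ray.put(np.zeros(10**7))

@ray.remote
def array_mean(arr):
    # Workers on the same machine read the array out of shared memory
    # (zero-copy for numpy arrays) rather than each holding a copy.
    return float(arr.mean())

# Passing the object reference avoids shipping the array to each task.
print(ray.get([array_mean.remote(big_array) for _ in range(4)]))
```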
[00:21:53] Unknown:
Yeah. The operational characteristics of distributed computation frameworks are often one of the hardest parts and one of the biggest barriers to entry. So I was pretty impressed when I was reading through the documentation and saw that you had that built in as a first-class concern. And just briefly going back for a moment, I was surprised to hear that you ended up going from Rust to C++, because it's usually the other direction that I've heard people going. So I'm curious what the state of the Rust language was at the time that you made that decision, or any constraints that it had, because I've usually heard from people who are going to it because of its memory safety capabilities. Yeah. That's a great question. And I think, you know, if we had stuck with Rust, it would have been a fine choice
[00:22:35] Unknown:
and would have had a lot of advantages. And, of course, it was a lot of fun to write, and the memory safety is a big advantage. At the time, we were less sure about the trajectory of Rust and the maturity of the ecosystem. For some of the third-party libraries that we were integrating, like gRPC and Arrow, the C++ support was very mature. And it was, in some sense, the conservative choice to use C++. In addition, we wanted it to be something that a lot of people in the community could contribute to. And, of course, a lot of people do know Rust, but there are perhaps more people who are able to contribute in C++. That said, we do care a lot about memory safety, and we do follow best practices when using C++ to try
[00:23:30] Unknown:
to be as careful as possible. Yeah. The ecosystem is generally one of the biggest motivators for choosing any particular project or language, because if you go with something that's brand new and very promising, it can be exciting. But after you get started for a little bit, you realize how much you're leaning on the rest of the ecosystem and the rest of the work that everybody is contributing to help you move faster. So I can definitely sympathize with the motivation for going back to C++. And then going back to the operational characteristics, I think that it's great that you have this autoscaler built in. And one of the questions that I was having is the thought of being able to manage dependencies in the instances that you're deploying, and how you manage populating the code that you're sending to these instances, along with any dependent libraries that are necessary for them to be able to execute.
[00:24:24] Unknown:
Yeah. And that's an ongoing area that we're working on, so I don't think it's fully solved yet. One thing that's always been a challenge with distributed computing is that, if you build some large-scale distributed application and you're running it on a cluster, it's quite challenging to actually share that application with someone else in a way that they can easily reproduce it. And part of the reason for that is that it's not enough to just share the code. There's quite a bit more in terms of the cluster configuration, the libraries you're depending on, maybe some data that lives somewhere. And in machine learning, people often spend more time trying to reproduce other people's experiments than they do running their own experiments. And this is something we've seen quite a bit.
So one thing we are doing is trying to make it easy for people to share their applications, and then for others to just run those applications. And I think we can make it as easy as just clicking a button, essentially. Part of that is that if you're building an application with Ray, it's a very portable API. It's something that you can run anywhere from your laptop to any cloud provider or your Kubernetes cluster. And part of it is making sure that we are specifying all the relevant dependencies, whether those are Python libraries or instance types or data.
And that's something we're developing in the open source project to try to encapsulate all these dependencies.
[00:25:56] Unknown:
As you have built out some more of the capabilities of Ray, I know that you've also added these other layers and libraries in the form of RLlib and the Tune library that you mentioned for being able to do distributed hyperparameter search. And I'm wondering what you found to be some of the most interesting or challenging or complex aspects of the work of building and maintaining Ray and its associated ecosystem?
[00:26:18] Unknown:
Yeah. So let me say a little bit more about all the different pieces that we're building. There are at least two parts to Ray. There's the core runtime, which has the remote decorator and the ability to scale up Python functions and classes. And then on top of that, we're building an ecosystem of libraries targeting scalable machine learning. That includes things like, as you mentioned, hyperparameter tuning, training, serving, reinforcement learning, and more. And these are all things where, to do reinforcement learning and to do training, you typically have to distribute the computation to get it done in a reasonable amount of time. So those are libraries that we're very excited about, and you can use them together seamlessly.
So what is the challenge? Well, in some sense, there are several different projects happening here that are being developed in the same code base. And so it can be a little challenging to keep everything organized and to make it easy for people to contribute to, say, RLlib without having to know anything about the Ray core, or without having to know anything about the serving library. It can be a little harder for people to navigate the code base because there are just more directories and more things going on. But there are also a lot of advantages to having things in the same codebase. In particular, you don't have to worry about incompatible versions of, say, the reinforcement learning library and the tuning library and the core Ray library.
And that's a pretty big advantage.
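For a flavor of how these libraries sit on top of the core, here is a minimal RLlib sketch using the trainer API from roughly this era; the class path and config keys shown have been renamed in later Ray releases, so treat this as illustrative rather than current.

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()

# PPO on CartPole; RLlib distributes rollout collection across workers.
trainer = PPOTrainer(env="CartPole-v0", config={"num_workers": 2})

for _ in range(3):
    result = trainer.train()  # one training iteration
    print(result["episode_reward_mean"])
```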
[00:27:52] Unknown:
And another aspect of this overall effort is, as you mentioned, the Anyscale company that you and some of your collaborators have formed around Ray to help support its future development. I'm wondering if you can discuss a bit about the business model of the company and some of the ways that you're approaching the governance of the Ray project and its associated ecosystem.
[00:28:15] Unknown:
Great question. So we don't have a product yet; we haven't announced any product plans. But if you look at other companies that are commercializing open source projects, Databricks, for example, or GitHub, and you could also think of MongoDB or Elastic or Confluent, having a great open source project is a great starting point, but it's not enough to create a successful business. In addition to having a great open source project, you also need to have a great product, one that adds a lot of value on top of the open source project. Git, of course, is an open source project, and GitHub is much more than just Git. It also includes tools for code reviews, tools for continuous integration, tools for sharing your applications with other people and collaborating on large projects. And so similarly, we plan on building products that add a lot of value for users of the open source project and that make life easier for developers.
One way to think about what we're trying to do, stepping back a little bit: the main premise of this company is that distributed computing is becoming the norm, that distributed computing is becoming increasingly essential for all types of applications, but especially applications that touch machine learning. And one of the challenges is that doing distributed computing is actually quite difficult, and it requires a lot of expertise. So what we're trying to do with the company is to make it as easy to program clusters of machines as it is to program on your laptop.
You could think about the products we're trying to offer as essentially the experience of being able to develop as if you're on your laptop, but you have an infinite laptop: your laptop has infinite compute and infinite memory, and you don't have to know anything about distributed systems. One of the elements
[00:30:21] Unknown:
that's associated with that is the idea of being able to design your software in a way that it can be processed in either an idempotent or an embarrassingly parallel way. And I'm curious what your experience has been with people onboarding onto Ray and trying to build an application, and then accidentally coding themselves into a corner because they have something that doesn't actually work when it's spread out across multiple machines, and is actually much better as a single-process computation.
[00:30:51] Unknown:
So to scale up an application with Ray, for that to make sense, there does have to be some opportunity for parallelism. There have to be some things that can happen at the same time. And not every application is like that, but a good number are, and that includes more than just embarrassingly parallel applications. For example, hyperparameter tuning: you can write it in an embarrassingly parallel way. But when you get to more sophisticated hyperparameter search algorithms, it's no longer embarrassingly parallel. You're actually looking at the experiments as they're running and stopping the poorly performing ones early. You're potentially devoting more resources to the ones that are doing well. You're sharing information from some of the experiments to choose parameters for starting new experiments. And, of course, you're running multiple experiments in parallel, and each experiment might itself be distributed, might be doing distributed reinforcement learning or distributed training. And so you have this nested parallelism.
So even a conceptually simple thing like hyperparameter search can get quite sophisticated, and the simple embarrassingly parallel story on its own is not enough for supporting hyperparameter search well. So you're right, it does take some amount of sophistication to scale up applications and do it in the most efficient possible way. But if you are implementing a hyperparameter search library, or if you are implementing one of these applications, you often have a good understanding of the structure of the computation and what kinds of things could potentially happen in parallel.
So we found that people generally have the right intuition for how to scale up their applications if they've developed the applications.
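One way to picture the nested parallelism described above is tasks launching further tasks, as in the illustrative sketch below; the scoring logic is a stand-in for real training runs.

```python
import ray

ray.init()

@ray.remote
def evaluate(params, seed):
    # Inner unit of work, e.g. one training run with one random seed.
    return params["lr"] * seed  # stand-in for a real score

@ray.remote
def run_experiment(params):
    # Tasks can launch further tasks: each experiment fans out
    # its own evaluations, giving nested parallelism.
    scores = ray.get([evaluate.remote(params, s) for s in range(3)])
    return sum(scores) / len(scores)

# Experiments run in parallel, and each one parallelizes internally.
configs = [{"lr": 0.01}, {"lr": 0.1}]
print(ray.get([run_experiment.remote(c) for c in configs]))
```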
[00:32:42] Unknown:
What are some of the most interesting or unexpected ways that you have seen Ray used, and some of the types of applications that people have built with it? You know, it's all over the place. I think one of the most exciting applications I've seen is this
[00:32:56] Unknown:
large-scale online learning application that was used by Ant Financial, which is the largest fintech company in the world. They're using Ray for a number of different use cases, ranging from fraud detection to recommendations. These are applications that are really taking advantage of Ray's generality, because they combine everything from processing streaming data to training models to serving models, doing hyperparameter search, launching tasks from other tasks, and being able to do both batch computations and interactive computations. So these are some of the most stressful workloads. And they were running these applications in production during Double 11, which is the biggest shopping festival in the world. So that's one of the most exciting applications.
We are also really excited to see a number of companies using our libraries, for reinforcement learning and tuning and distributed training, to build their own internal machine learning platforms, as well as to scale up their machine learning applications. So this is everything from trading stocks to designing airplane wings to,
[00:34:13] Unknown:
you know, optimizing supply chains and things like that. What do you have planned for the future of Ray and its associated ecosystem?
[00:34:21] Unknown:
We think, like I mentioned, that distributed computing is becoming increasingly important and will be essential for many different types of applications. And at the same time, we think that there's no good way to do it today, that there's a big barrier to entry to doing distributed computing. For example, if you're a biologist, you might know Python, and you can write some Python scripts to solve your biology problems. But if you have a lot of data and you need to run things faster, to scale things up, then you not only have to be an expert in biology, but you also have to be an expert in distributed computing. And that's just too much of a barrier to entry for most people. So the goal really is to make it as easy to program clusters of machines as it is to program on your laptop, to let people take advantage of these kinds of cluster resources and take advantage of all the advances in machine learning, but without having to really know anything beyond just Python. And given how many people are learning Python these days and how rapidly Python is growing, we think that'll be a lot of people. And also, given the fact that the core elements of Ray are written in C++, in terms of the capabilities of the task distribution
[00:35:36] Unknown:
and the actual execution, I imagine that there's the possibility for expanding beyond Python as well. Oh, yeah. I forgot to mention that.
[00:35:45] Unknown:
We already support Java, and there are companies doing distributed Java through Ray in production. Python, of course, is our biggest focus, but we're also working on support for distributed C++ through Ray. And there are even some applications of Ray that are combining both Python and Java. A lot of companies have their machine learning in Python, but their business logic in Java. They want to be able to call their machine learning from the application side, and they're able to do that using Ray, which is pretty exciting. And, of course, Ray is designed in a language-agnostic way, and there's the potential to add support for many more languages.
[00:36:28] Unknown:
Are there any other aspects of the Ray project itself, the associated libraries, the overall space of distributed computation, or the work that you're doing with Anyscale that we didn't discuss yet and that you'd like to cover before we close out the show?
[00:36:41] Unknown:
So one answer to that, and there are a lot of things we're working on, is that while we are trying to make it as easy as possible to develop distributed applications, developing distributed applications is not enough on its own, because once you develop your application and you run it and you run into some bug, you have to debug it. And debugging distributed applications is notoriously difficult. So one thing we're focused on, and that is very important to us, is to make the experience of debugging cluster applications as easy as or easier than debugging applications on your laptop. And, of course, a big part of that is thinking about the user experience.
It's about providing the right visualizations and surfacing the right information so that developers don't have to go looking for that information; it's just right there. It's about providing great tools for profiling and understanding the performance characteristics of your application. And one thing we've developed, which is still an experimental feature but is really exciting, is that through the Ray dashboard, if you're running a live Ray application, you can essentially click a button to profile a running actor or a running task, and get a flame graph showing you where time is being spent in that actor and what the computational bottlenecks are.
Normally, to profile your code, even on your laptop, you have to look up the documentation for some profiling library, instrument your code, and rerun it. It's a bit involved. And so we're trying to make it so that even if you didn't think ahead of time that you would want to profile your code, after you've already started running it, you can just click a button and
[00:38:25] Unknown:
get the profiling information. And things like this have the potential to provide a really great debugging experience. Yeah. That's definitely a pretty impressive and important element of the overall user experience, so I'm glad that you're considering that and trying to integrate it into the core runtime. Yeah. We think it'll be pretty fantastic. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing or get involved with the project, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to follow up a recent pick on the topic of board games with a couple of specifics. I recently picked up the Dungeons and Dragons Castle Ravenloft board game, which has a simplified mechanism; it doesn't have the whole dungeon master and everything. It's something that you can play with just a couple of people, and it's pretty enjoyable, so I had fun playing that recently with my kids. And I also picked up another game called One Deck Dungeon, which you can play with one or two people. It's just a card game that you can quickly go through for a quick little dungeon crawl, and survive through to the boss. So there's an interesting game mechanic there, and I recommend picking up either of those if you need something to do with some spare time. And with that, I'll pass it to you, Robert. Do you have any picks this week? I recently read The Everything Store, which is
[00:39:43] Unknown:
a history of Amazon and some stories about Amazon
[00:39:48] Unknown:
during the early days, and really enjoyed reading that. Yeah. I'll definitely have to take a look at that as well. So thank you very much for taking the time today to join me and discuss the work that you're doing with Ray and Anyscale and the associated libraries that you're building on top of all of that. It's definitely a very interesting project that I've been hearing a lot about recently, so I'm happy to have had the chance to talk to you, and I appreciate all the work that you're doing on it. So thank you again for that, and I hope you enjoy the rest of your day. Thanks so much, and you too. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your
[00:40:46] Unknown:
friends and coworkers.
Introduction and Guest Introduction
Robert Nishihara's Background and Introduction to Python
The Origin and Motivation Behind Ray
Challenges and Evolution of Ray
Use Cases and Applications of Ray
Implementing and Scaling Applications with Ray
Technical Implementation and Architecture of Ray
Managing Dependencies and Operational Characteristics
Building and Maintaining Ray's Ecosystem
AnyScale and Commercializing Ray
Debugging and Profiling Distributed Applications
Closing Remarks and Picks