Summary
Distributed computing is a powerful tool for increasing the speed and performance of your applications, but it is also a complex and difficult undertaking. While performing research for his PhD, Robert Nishihara ran up against this reality. Rather than cobbling together another single-purpose system, he built what ultimately became Ray to make scaling Python projects to multiple cores and across machines easy. In this episode he explains how Ray allows you to scale your code easily, how to use it in your own projects, and his ambitions to power the next wave of distributed systems at Anyscale. If you are running into scaling limitations in your Python projects for machine learning, scientific computing, or anything else, then give this a listen and then try it out!
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Robert Nishihara about Ray, a framework for building and running distributed applications and machine learning workloads in Python
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what Ray is and how the project got started?
- How did the environment of the RISE lab factor into the early design and development of Ray?
- What are some of the main use cases that you were initially targeting with Ray?
- Now that it has been publicly available for some time, what are some of the ways that it is being used which you didn’t originally anticipate?
- What are the limitations for the types of workloads that can be run with Ray, or any edge cases that developers should be aware of?
- For someone who is building on top of ray, what is involved in either converting an existing application to take advantage of Ray’s parallelism, or creating a greenfield project with it?
- Can you describe how Ray itself is implemented and how it has evolved since you first began working on it?
- How does the clustering and task distribution mechanism in Ray work?
- How does the increased parallelism that Ray offers help with machine learning workloads?
- Are there any types of ML/AI that are easier to do in this context?
- What are some of the additional layers or libraries that have been built on top of the functionality of Ray?
- What are some of the most interesting, challenging, or complex aspects of building and maintaining Ray?
- You and your co-founders recently announced the formation of Anyscale to support the future development of Ray. What is your business model and how are you approaching the governance of Ray and its ecosystem?
- What are some of the most interesting or unexpected projects that you have seen built with Ray?
- What are some cases where Ray is the wrong choice?
- What do you have planned for the future of Ray and Anyscale?
Keep In Touch
- Website
- @robertnishihara on Twitter
- robertnishihara on GitHub
Picks
- Tobias
- D&D Castle Ravenloft board game
- One Deck Dungeon
- Robert
- The Everything Store
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Ray
- Anyscale
- UC Berkeley
- RISELab
- MATLAB
- Deep Learning
- Theano
- TensorFlow
- PyTorch
- Philip Moritz
- Reinforcement Learning
- Hyperparameter Tuning
- IPython Parallel
- AMPLab
- Apache Spark
- Actor Model
- Horovod
- Flink
- Spark Streaming
- Dask
- gRPC
- Tune
- Rust
- C++
- C
- Apache Arrow
- Wes McKinney
- Databricks
- MongoDB
- Elastic
- Confluent
- Embarrassingly Parallel
- Ant Financial
- Flame Graph
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With 200 gigabit private networking, node balancers, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models or running your CI/CD pipelines, they've got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Robert Nishihara about Ray, a framework for building and running distributed applications and machine learning workloads in Python.
So, Robert, can you start by introducing yourself?
[00:01:11] Unknown:
Yeah. Well, first of all, thanks so much for speaking with me. So I am one of the creators of an open source project called Ray, and one of the co-founders and CEO of Anyscale, which is commercializing Ray. Before starting this company about half a year ago, I did a PhD in machine learning and distributed systems at UC Berkeley. And of course, Ray was a project we began several years ago at UC Berkeley during the PhD program, to address problems that we were running into in our own research. And that has since transitioned into Anyscale. And do you remember how you first got introduced to Python? So I first started using Python in college. When I started out doing machine learning research as an undergraduate, the dominant tool at that time was MATLAB. Everyone was using MATLAB to run experiments and do regressions and things like that. And at some point, everyone started transitioning to Python.
And once deep learning really took off around 2013, with frameworks like Theano and, later on, TensorFlow and PyTorch, that really solidified Python's dominance in machine learning. And so you mentioned
[00:02:25] Unknown:
that Ray is a distributed task execution framework, and that it started when you were doing your PhD research at UC Berkeley. And my understanding is that you were part of RISE Lab at the time. I'm wondering if you can just give a bit more background on the project and the motivations for starting it, some of the challenges that you were facing in your research that led to it, and how the overall environment of the RISE Lab factored into some of the early design and development decisions of the project? That's a great question. So the research that I
[00:02:54] Unknown:
started out doing, and that Philip Moritz, one of my co-founders and also one of the creators of Ray, was doing, was in more theoretical machine learning: trying to design better algorithms for reinforcement learning, for optimization, for learning. And if you walk into the AI research building at Berkeley, or at any research university, one thing you'll notice is that you have all these machine learning researchers with backgrounds in math and statistics. They're trying to design better algorithms for machine learning, but what they actually spend a ton of their time doing is low-level engineering to scale things up and to speed things up, because the experiments in machine learning are very computationally intensive.
They can be quite time consuming to run. And so speeding things up, or scaling things up to run on a cluster instead of on a laptop, can be the difference between an experiment taking a week versus a day versus an hour. And so you have all these graduate students and machine learning practitioners who are not only designing and implementing their machine learning algorithms, they're also building all of this infrastructure and scaffolding underneath to run their algorithms at a large scale. And, you know, we did this ourselves, of course, but we felt that there had to be a better way, one that could let us just focus on the algorithms and the applications that we were trying to implement, and not have to become experts in distributed computing. So we set out to try to build better tools. And, of course, we had used many of the existing tools out there. Even back as an undergraduate, when I was doing machine learning research a number of years ago, I had used a tool called IPython Parallel, which was one of the first tools for doing parallel and distributed Python. Of course, that made my life a lot easier. But when it came to deep learning and more modern machine learning applications, the tools just weren't there yet. And so these researchers were constantly building their own ad hoc tooling and new infrastructure. So that was the scenario that we found ourselves in, where to do machine learning, you really had to build a lot of infrastructure.
And that's why we set out to try to build tools to make this much, much easier. And you asked about the RISE Lab. Well, when I started the PhD program, I was part of the AMPLab at Berkeley, which, you know, created Spark. And that transitioned into the RISE Lab. In both of these settings, I was surrounded not just by experts in machine learning, but also by experts in distributed computing. And as I was starting to get interested in how to build tools and build systems, I was able to learn from them, to learn from their experience with Spark and other open source systems. It wouldn't have been possible if I hadn't been in that environment.
And that really helped us avoid some of the pitfalls.
[00:05:58] Unknown:
It helped us, you know, really move much more quickly in terms of building a great framework. And one of the common themes in a lot of the projects that I've had on the show, and people that I've spoken with, is that they set out to solve one problem and, on the way, discover that they actually have to solve some completely orthogonal problem first. And half the time, it seems that they never actually get back to the original problem they were trying to deal with. And I'm curious what your experience was: whether Ray just consumed all of your attention and you never actually made it back to your original research, or whether you were able to at least get back to what you were originally trying to accomplish after you got to a working state with Ray? That's funny that you ask. Yeah. When we started Ray, it was a little bit of a detour. We were doing machine learning research. It became clear to us that the tooling and the infrastructure was the bottleneck. And so we set out to build something better there. And, of course, the goal from the start with building Ray was to build useful open source software. It was to try to solve this pain point where people really struggle to scale up their applications from one machine to a cluster.
[00:07:00] Unknown:
And when we started out, we had an idea of the API that we wanted and the behavior that we wanted, and we figured it would take us about a month to finish. And, of course, in a month we did have something. We had a prototype, and we had something working. But to really make something useful for people, to give companies confidence that they can run it in production, requires much more than a prototype. You have to really take it all the way. And so even if you can get some initial prototype up and running pretty quickly, to really polish it and make it great, there's a lot of extra work involved, and so it takes longer than you initially expect. Now, one interesting thing related to what you asked is that there haven't been too many changes to the scope of what we've been building. We started out quite broad. But over time, we encountered more and more use cases that we realized we could support that we didn't initially realize we could support. And I can talk more about these later, but as one example, Ray actually started out as purely functional and stateless.
Essentially, it started out as a way to scale up Python functions. But then we realized a lot of the applications we really cared about were stateful and required managing mutable state in the distributed setting. And so we introduced the ability to scale up not just Python functions but also Python classes, by introducing actors. And that was one API change that really opened up a lot of new use cases in a really powerful way. Yeah. When I was looking through the documentation
[00:08:43] Unknown:
and seeing the difference between just a remote task versus a remote actor, at first glance it wasn't really clear to me the significance of the minor syntax difference, because you still just use a decorator; it's just a matter of whether you decorate a class versus a function that determines whether or not statefulness is incorporated. So I think that the API and the user experience that you landed on is fairly elegant and very minimal, while still providing a significant amount of power to the end user. That's exactly something we very much strive for,
[00:09:15] Unknown:
to keep it simple, to have only a few concepts that are powerful in general.
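To make the task/actor distinction discussed here concrete, below is a minimal sketch using the standard Ray API; the `square` function and `Counter` class are just illustrative names.

```python
import ray

ray.init()  # on a cluster, this connects to the running Ray runtime

# Decorating a function gives a stateless remote task.
@ray.remote
def square(x):
    return x * x

# Decorating a class gives a stateful actor.
@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

# Tasks run in parallel on workers; .remote() returns a future.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]

# The actor keeps its mutable state between method calls.
counter = Counter.remote()
print(ray.get([counter.increment.remote() for _ in range(3)]))  # [1, 2, 3]
```

The only visible difference from ordinary Python is the decorator and the `.remote()` call syntax, which is exactly the minimalism being discussed.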
[00:09:21] Unknown:
And so you mentioned that some of the original work you were doing was on some of these advanced machine learning algorithms, such as reinforcement learning, and that was one of the initial use cases that you were targeting. I'm curious, now that it's been publicly available for some time and it's been out in the wild, what are some of the other ways that it's being used which you didn't originally anticipate? Yeah. One use case that we didn't initially anticipate was serving, that is, inference: putting machine learning models in production.
[00:09:48] Unknown:
But it turns out that Ray is actually a pretty natural fit for this use case. And the powerful thing here, the exciting thing, is that when you have a system that can support not just the serving and inference side of the equation but also the training side, you can build online learning applications that do both, where you have algorithms and applications that are continuously interacting with the real world and making decisions, but also learning from those interactions and updating themselves, updating the models. So you have applications, and we have companies doing this in production, where you have data streaming in, you're incrementally processing this data in a streaming fashion and training new recommendation models, and then serving recommendations from those models back to the users, which then affects the data that you're collecting, and there's this tight loop. Normally, when companies implement this kind of online learning application, they're actually stitching together several different distributed systems: one for stream processing, one for training machine learning models, and then one for serving. And here, we've had companies that are able to take that kind of setup, with several different distributed systems stitched together, and rebuild it all on top of Ray. It becomes simpler because it's on top of one framework; they don't have to learn how to use a ton of different distributed systems. It can become faster because you aren't moving data between different systems, which can sometimes be expensive. And it generally gets simpler because these different distributed systems are usually not designed to interface with each other. So if you're using Horovod or distributed TensorFlow for training, and you're using Flink or Spark Streaming for the stream processing, Flink and Horovod were not really designed to compose together.
But if you can do these things on top of Ray, then it can be as natural as just using Python and importing different libraries like NumPy and Pandas, and you can use all of these things together.
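As a rough sketch of the pattern described here, the toy actor below combines incremental training and serving in one process; `RecommenderActor` and its methods are hypothetical stand-ins for a real streaming, training, and serving pipeline.

```python
import ray

ray.init()

# Hypothetical actor combining incremental training and serving.
@ray.remote
class RecommenderActor:
    def __init__(self):
        self.weights = {}  # stand-in for real model state

    def train_on_batch(self, batch):
        # Incrementally update the "model" from a batch of events.
        for user, item in batch:
            self.weights[user] = item
        return len(self.weights)

    def recommend(self, user):
        # Serve a prediction from the same, continuously updated state.
        return self.weights.get(user, "default-item")

model = RecommenderActor.remote()

# A "stream" of interaction events, processed as they arrive...
for batch in [[("alice", "book"), ("bob", "film")], [("alice", "song")]]:
    model.train_on_batch.remote(batch)

# ...while serving requests against the always-current model.
print(ray.get(model.recommend.remote("alice")))  # -> "song"
```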
[00:11:52] Unknown:
Another project in the Python ecosystem that is tackling the distributed computation need that I'm also aware of is Dask. And I know that it was originally built for more scientific computation and research workloads. But I'm wondering what your thoughts are on some of the trade-offs of when somebody might be looking to use Dask versus Ray, or ways that they can work together. That's a great question.
[00:12:14] Unknown:
And Dask has some great libraries for DataFrames and distributed array computation, which we haven't been focused on so much. With Ray, we've been very focused on scalable machine learning, everything from training machine learning models to reinforcement learning to hyperparameter search to serving. And the kinds of applications that we've been focused on really benefit from the ability to do stateful computation with actors. In terms of what Dask provides at a low level, it's very similar to the Ray remote function API: the ability to take Python functions and scale them up.
But a lot of the scalable machine learning libraries that we're building really require stateful computation. So there's some difference in emphasis there. And of course, I think Dask began its life focused on large multicore machines and really optimizing for high performance on a single multicore machine. We are focused on performance very much in the large multicore machine setting as well as the cluster setting, and really trying to match the performance of low-level tools. If you were just using gRPC on top of Kubernetes and things like that, then using Ray should ideally be
[00:13:44] Unknown:
just as fast. For the types of workloads that somebody might think of for Ray, what are some of the limitations that they should be aware of, or edge cases that they might run into as they're developing the application, or any specific design considerations that you have found to be beneficial for successful implementations?
[00:14:02] Unknown:
There are a couple different things to be aware of. So one, the Ray API can be pretty low level. And so to implement all of the features and all of the things you want for your application, you may need to build quite a bit of stuff. For example, if you're looking to do distributed hyperparameter search, well, we provide a great library for scalable hyperparameter search, so you can do that out of the box on top of Ray, and it works well. On the other hand, if you want to do something like large-scale stream processing, we currently don't have a great library for stream processing. And so while it should be possible, in principle, to do that on top of Ray, doing so would require first building your own stream processing library, or some subset of one, before you can really do that well.
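The hyperparameter search library mentioned here is Tune. Below is a minimal sketch of what out-of-the-box search looks like with Tune's function-based API; the toy objective and search space are illustrative, and the exact API surface has shifted across Ray releases.

```python
from ray import tune

def trainable(config):
    # Toy objective standing in for a real training run.
    loss = (config["lr"] - 0.05) ** 2
    tune.report(loss=loss)  # hand the metric back to Tune

analysis = tune.run(
    trainable,
    config={"lr": tune.grid_search([0.001, 0.01, 0.1])},
)
print(analysis.get_best_config(metric="loss", mode="min"))
```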
[00:14:52] Unknown:
And if somebody is trying to convert an existing application to be able to scale out some subset of its functionality, what does the workflow look like for doing that? And how does it compare to somebody who's approaching this in a greenfield system? Yeah. So this is actually an area where Ray really shines.
[00:15:13] Unknown:
So with Ray, if you want to scale up an existing Python application that you already wrote, you can often, not always, but often, just add a few lines, change a few lines, add a few annotations, and take that code and run it anywhere from your laptop to a large cluster. And the important thing here is not just the number of lines of code. It's nice if you only have to add seven lines of code. But the important thing is that you don't have to restructure your code. You don't have to re-architect it and change the abstractions and change the interfaces to coerce it into using Ray. It should really be something you can do in place. And the reason for this has to do with the abstractions that Ray provides. For example, if you want to scale up an arbitrary Python application using Spark, well, the core abstraction that Spark provides is a large-scale distributed dataset. And so if you have an application where the core abstraction is a large dataset and you're manipulating that dataset, then Spark is the perfect choice. On the other hand, if you're trying to do something like hyperparameter search or inference and serving, the main abstraction is not really a dataset. It might be something like a machine learning experiment, or it might be something like a neural network that you're trying to deploy. Then to scale that up with Spark, you have to coerce it into this dataset abstraction, which is not a natural fit, and that can lead you to having to re-architect your application. On the other hand, with Ray, of course we have higher level libraries, but the core Ray API is not introducing any new concepts. It's not introducing these higher level abstractions.
It's just taking the existing concepts of Python functions and classes. And of course, all Python applications are pretty much built with functions and classes. It's taking those two concepts and translating them into the distributed setting. So you have a way of taking any application that's built with functions and classes and then, ideally, with just a few modifications, running it in the distributed setting. So this is something where Ray actually can feel very lightweight, and it's something where we've had a number of different users tell us that they love Ray because they can get started with it so quickly. They can tell people on their team how to use it in 30 minutes, and then they're off and running, and they don't have to really change their code much. They can just take their existing code and scale it up. In terms of Ray itself, I'm wondering if you can discuss how it's actually implemented and some of the ways that it's evolved since you first began working on it. One thing we've done is we've actually rewritten it from scratch several times, as we got new ideas and thought of improved ways to architect it. We had an initial prototype that was written in Rust.
And the implementation after that was in C++, with lots of multithreading. Then we switched to an implementation in C, actually, where all of the different components, like the scheduler and the object store and so on, had single-threaded event loops. And from there, more recently, we switched back to C++, using gRPC as the underlying RPC framework. And we've continued to re-architect it quite a bit, both to improve performance, to improve stability, and to simplify the design. I think the people working on Ray have done a great job of making these bold changes and really not being afraid to tear things down and rewrite them when it makes sense to do so. Now, how does it actually work?
There are a number of key components. One is the scheduler, and the scheduler is actually a distributed concept, where we have a different scheduler process on each machine. The scheduler is responsible for things like starting new worker processes, assigning tasks to different workers, and doing the bookkeeping for what resources are in use, what CPUs or GPUs or memory are currently allocated. There's also an object store process, and this is an important performance optimization for machine learning applications.
A lot of machine learning applications rely heavily on large numerical arrays and large numerical data. And when you have a bunch of different worker processes, if they're all using the same dataset or the same neural network weights or things like that, it can get quite expensive to send copies of this data around to each worker process, and for each process to have its own copy of the data. So an important optimization is to allow each worker, instead of having its own copy, to just have a pointer into shared memory to access the data. What happens is we have this object store process, which can store arbitrary Python objects, but the important part is objects containing large arrays of numbers. And then all the different worker processes can access that data just by having a pointer into shared memory, instead of having to create their own copy.
And then you can avoid a lot of expensive serialization and deserialization. This is something where we first developed the object store as part of Ray, and then we contributed it to the Apache Arrow project, which was started by Wes McKinney; that's a project we've collaborated with very closely. And now people are using the object store as a standalone thing, independent of Ray. I would say those are some of the key architectural pieces. There's more, of course. There's the Ray autoscaler, which is a cluster launcher. One of the pain points with doing distributed computing is starting and stopping clusters, managing the servers, and installing all of your libraries and dependencies on the VMs. What the Ray autoscaler does is it lets you just run a command from your laptop, like `ray up`, and that'll start a cluster. You can do it on AWS, or GCP, or Azure, any cloud provider.
And you can essentially configure and manage these clusters on any cloud provider just by running a single command. So that's another important part of what we're building.
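Here is a small sketch of the shared-memory object store described above, using `ray.put` to store a large array once and hand a reference to many tasks; the array and the `array_mean` task are illustrative.

```python
import numpy as np
import ray

ray.init()

# Store a large array in the shared-memory object store once.
big_array = ray.put(np.zeros(10**7))

@ray.remote
def array_mean(arr):
    # Workers on the same machine read the array out of shared memory
    # (zero-copy for numpy arrays) rather than each holding a copy.
    return float(arr.mean())

# Passing the object reference avoids shipping the array to each task.
print(ray.get([array_mean.remote(big_array) for _ in range(4)]))
```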
[00:21:53] Unknown:
Yeah. The operational characteristics of distributed computation frameworks are often one of the hardest parts and one of the biggest barriers to entry. So I was pretty impressed when I was reading through the documentation and saw that you had that built in as a first-class concern. And just briefly going back for a moment, I was surprised to hear that you ended up going from Rust to C++, because it's usually the other direction that I've heard people going. So I'm curious what the state of the Rust language was at the time that you made that decision, or any constraints that it had, because I've usually heard from people who are going to it because of its memory safety capabilities. Yeah. That's a great question. And I think, you know, if we had stuck with Rust, it would have been a fine choice
[00:22:35] Unknown:
and would have had a lot of advantages. And, of course, it was a lot of fun to write, and the memory safety is a big advantage. At the time, we were less sure about the trajectory of Rust and the maturity of the ecosystem. For some of the third-party libraries that we were integrating, like gRPC and Arrow, the C++ support was very mature. And it was, in some sense, the conservative choice to use C++. In addition, we wanted it to be something that a lot of people in the community could contribute to. And, of course, a lot of people do know Rust, but there are perhaps more people who are able to contribute in C++. That said, we do care a lot about memory safety, and we do follow best practices when using C++ to try
[00:23:30] Unknown:
to be as careful as possible. Yeah. The ecosystem is generally one of the biggest motivators for choosing any particular project or language, because if you go with something that's brand new and very promising, it can be exciting. But after you get started for a little bit, you realize how much you're leaning on the rest of the ecosystem and the rest of the work that everybody is contributing to help you move faster. So I can definitely sympathize with the motivation for going back to C++. And then going back to the operational characteristics, I think that it's great that you have this autoscaler built in. And one of the questions that I was having is the thought of being able to manage dependencies in the instances that you're deploying, and how you manage populating the code that you're sending to these instances, along with any dependent libraries that are necessary for them to be able to execute.
[00:24:24] Unknown:
Yeah. And that's an ongoing area that we're working on, so I don't think it's fully solved yet. One thing that's always been a challenge with distributed computing is that, if you build some large-scale distributed application and you're running it on a cluster, it's quite challenging to actually share that application with someone else in a way that they can easily reproduce it. And part of the reason for that is that it's not enough to just share the code. There's quite a bit more in terms of the cluster configuration, the libraries you're depending on, maybe some data that lives somewhere. And in machine learning, people often spend more time trying to reproduce other people's experiments than they do running their own experiments. And this is something we've seen quite a bit.
So one thing we are doing is trying to make it easy for people to share their applications, and then for others to just run those applications. And I think we can make it as easy as just clicking a button, essentially. Part of that is that if you're building an application with Ray, it's a very portable API. It's something that you can run anywhere from your laptop to any cloud provider or your Kubernetes cluster. And part of it is making sure that we are specifying all the relevant dependencies, whether those are Python libraries or instance types or data.
And that's something we're developing in the open source project to try to encapsulate all these dependencies.
[00:25:56] Unknown:
As you have built out some more of the capabilities of Ray, I know that you've also added these other layers and libraries in the form of RLlib and the Tune library that you mentioned for being able to do distributed hyperparameter search. And I'm wondering what you found to be some of the most interesting or challenging or complex aspects of the work of building and maintaining Ray and its associated ecosystem?
[00:26:18] Unknown:
Yeah. So let me say a little bit more about all the different pieces that we're building. There are at least two parts to Ray. There's the core runtime, which has the remote decorator and the ability to scale up Python functions and classes. And then on top of that, we're building an ecosystem of libraries targeting scalable machine learning. That includes things like, as you mentioned, hyperparameter tuning, training, serving, reinforcement learning, and more. And these are all things where, to do reinforcement learning and to do training, you typically have to distribute the computation to get it done in a reasonable amount of time. So those are libraries that we're very excited about, and you can use them together seamlessly.
So what is the challenge? Well, in some sense, there are several different projects happening here that are being developed in the same code base. And so it can be a little challenging to keep everything organized and to make it easy for people to contribute to, say, RLlib without having to know anything about the Ray core, or without having to know anything about the serving library. It can be a little harder for people to navigate the code base because there are just more directories and more things going on. But there are also a lot of advantages to having things in the same codebase. In particular, you don't have to worry about incompatible versions of, say, the reinforcement learning library and the tuning library and the core Ray library.
And that's a pretty big advantage.
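For a flavor of how these libraries sit on top of the core, here is a minimal RLlib sketch using the trainer API from roughly this era; the class path and config keys shown have been renamed in later Ray releases, so treat this as illustrative rather than current.

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()

# PPO on CartPole; RLlib distributes rollout collection across workers.
trainer = PPOTrainer(env="CartPole-v0", config={"num_workers": 2})

for _ in range(3):
    result = trainer.train()  # one training iteration
    print(result["episode_reward_mean"])
```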
[00:27:52] Unknown:
And another aspect of this overall effort is, as you mentioned, the Anyscale company that you and some of your collaborators have formed around Ray to help support its future development. I'm wondering if you can discuss a bit about the business model of the company and some of the ways that you're approaching the governance of the Ray project and its associated ecosystem.
[00:28:15] Unknown:
Great question. So we don't have a product yet; we haven't announced any product plans. But if you look at other companies that are commercializing open source projects, Databricks, for example, or GitHub, and you could also think of MongoDB or Elastic or Confluent, having a great open source project is a great starting point, but it's not enough to create a successful business. In addition to having a great open source project, you also need to have a great product, one that adds a lot of value on top of the open source project. Git, of course, is an open source project, and GitHub is much more than just Git. It also includes tools for code reviews, tools for continuous integration, tools for sharing your applications with other people and collaborating on large projects. And so similarly, we plan on building products that add a lot of value for users of the open source project and that make life easier for developers.
One way to think about what we're trying to do, stepping back a little bit: the main premise of this company is that distributed computing is becoming the norm, that distributed computing is becoming increasingly essential for all types of applications, but especially applications that touch machine learning. And one of the challenges is that doing distributed computing is actually quite difficult, and it requires a lot of expertise. So what we're trying to do with the company is to make it as easy to program clusters of machines as it is to program on your laptop.
You could think about the products we're trying to offer as essentially the experience of being able to develop as if you're on your laptop, but you have an infinite laptop: your laptop has infinite compute and infinite memory, and you don't have to know anything about distributed systems. One of the elements
[00:30:21] Unknown:
that's associated with that is the idea of being able to design your software in a way that it can be processed in either an idempotent or an embarrassingly parallel way. And I'm curious what your experience has been with people onboarding onto Ray and trying to build an application, and then accidentally coding themselves into a corner because they have something that doesn't actually work when it's spread out across multiple machines, and is actually much better as a single-process computation.
[00:30:51] Unknown:
So to scale up an application with Ray, for that to make sense, there does have to be some opportunity for parallelism. There have to be some things that can happen at the same time. And not every application is like that, but a good number are, and that includes more than just embarrassingly parallel applications. For example, hyperparameter tuning: you can write it in an embarrassingly parallel way. But when you get to more sophisticated hyperparameter search algorithms, it's no longer embarrassingly parallel. You're actually looking at the experiments as they're running and stopping the poorly performing ones early. You're potentially devoting more resources to the ones that are doing well. You're sharing information from some of the experiments to choose parameters for starting new experiments. And, of course, you're running multiple experiments in parallel, and each experiment might itself be distributed, might be doing distributed reinforcement learning or distributed training. And so you have this nested parallelism.
So even a conceptually simple thing like hyperparameter search can get quite sophisticated, and the simple embarrassingly parallel story on its own is not enough for supporting hyperparameter search well. So you're right, it does take some amount of sophistication to scale up applications and do it in the most efficient possible way. But if you are implementing a hyperparameter search library, or if you are implementing one of these applications, you often have a good understanding of the structure of the computation and what kinds of things could potentially happen in parallel.
So we found that people generally have the right intuition for how to scale up their applications if they've developed the applications.
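One way to picture the nested parallelism described above is tasks launching further tasks, as in the illustrative sketch below; the scoring logic is a stand-in for real training runs.

```python
import ray

ray.init()

@ray.remote
def evaluate(params, seed):
    # Inner unit of work, e.g. one training run with one random seed.
    return params["lr"] * seed  # stand-in for a real score

@ray.remote
def run_experiment(params):
    # Tasks can launch further tasks: each experiment fans out
    # its own evaluations, giving nested parallelism.
    scores = ray.get([evaluate.remote(params, s) for s in range(3)])
    return sum(scores) / len(scores)

# Experiments run in parallel, and each one parallelizes internally.
configs = [{"lr": 0.01}, {"lr": 0.1}]
print(ray.get([run_experiment.remote(c) for c in configs]))
```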
[00:32:42] Unknown:
What are some of the most interesting or unexpected ways that you have seen Ray used, and some of the types of applications that people have built with it? You know, it's all over the place. I think one of the most exciting applications I've seen is this
[00:32:56] Unknown:
large-scale online learning application that was used by Ant Financial, which is the largest fintech company in the world. They're using Ray for a number of different use cases, ranging from fraud detection to recommendations. These are applications that are really taking advantage of Ray's generality, because they combine everything from processing streaming data to training models to serving models, doing hyperparameter search, launching tasks from other tasks, and being able to do both batch computations and interactive computations. So these are some of the most stressful workloads. And they were running these applications in production during Double 11, which is the biggest shopping festival in the world. So that's one of the most exciting applications.
We are also really excited to see a number of companies using our libraries, for reinforcement learning and tuning and distributed training, to build their own internal machine learning platforms, as well as to scale up their machine learning applications. So this is everything from trading stocks to designing airplane wings to,
[00:34:13] Unknown:
you know, optimizing supply chains and things like that. What do you have planned for the future of Ray and its associated ecosystem?
[00:34:21] Unknown:
We think, like I mentioned, that distributed computing is becoming increasingly important and will be essential for many different types of applications. And at the same time, we think that there's no good way to do it today, that there's a big barrier to entry to doing distributed computing. For example, if you're a biologist, you might know Python, and you can write some Python scripts to solve your biology problems. But if you have a lot of data and you need to run things faster, to scale things up, then you not only have to be an expert in biology, but you also have to be an expert in distributed computing. And that's just too much of a barrier to entry for most people. So the goal really is to make it as easy to program clusters of machines as it is to program on your laptop, to let people take advantage of these kinds of cluster resources and take advantage of all the advances in machine learning, but without having to really know anything beyond just Python. And given how many people are learning Python these days and how rapidly Python is growing, we think that'll be a lot of people. And also, given the fact that the core elements of Ray are written in C++, in terms of the capabilities of the task distribution
[00:35:36] Unknown:
and the actual execution, I imagine that there's the possibility for expanding beyond Python as well. Oh, yeah. I forgot to mention that.
[00:35:45] Unknown:
We already support Java, and there are companies doing distributed Java through Ray in production. Python, of course, is our biggest focus, but we're also working on support for distributed C++ through Ray. And there are even some applications of Ray that are combining both Python and Java. A lot of companies have their machine learning in Python, but their business logic in Java. They want to be able to call their machine learning from the application side, and they're able to do that using Ray, which is pretty exciting. And, of course, Ray is designed in a language-agnostic way, and there's the potential to add support for many more languages.
[00:36:28] Unknown:
Are there any other aspects of the Ray project itself, the associated libraries, the overall space of distributed computation, or the work that you're doing with Anyscale that we didn't discuss yet and that you'd like to cover before we close out the show?
[00:36:41] Unknown:
So one answer to that, and there are a lot of things we're working on, is that while we are trying to make it as easy as possible to develop distributed applications, developing distributed applications is not enough on its own, because once you develop your application and you run it and you run into some bug, you have to debug it. And debugging distributed applications is notoriously difficult. So one thing we're focused on, and that is very important to us, is to make the experience of debugging cluster applications as easy as or easier than debugging applications on your laptop. And, of course, a big part of that is thinking about the user experience.
It's about providing the right visualizations and surfacing the right information so that developers don't have to go looking for that information; it's just right there. It's about providing great tools for profiling and understanding the performance characteristics of your application. And one thing we've developed, which is still an experimental feature but is really exciting, is that through the Ray dashboard, if you're running a live Ray application, you can essentially click a button to profile a running actor or a running task, and get a flame graph showing you where time is being spent in that actor and what the computational bottlenecks are.
Normally, to profile your code, even on your laptop, you have to look up the documentation for some profiling library, instrument your code, and rerun it. It's a bit involved. And so we're trying to make it so that even if you didn't think ahead of time that you would want to profile your code, after you've already started running it, you can just click a button and
[00:38:25] Unknown:
get the profiling information. And things like this have the potential to provide a really great debugging experience. Yeah. That's definitely a pretty impressive and important element of the overall user experience, so I'm glad that you're considering that and trying to integrate it into the core runtime. Yeah. We think it'll be pretty fantastic. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing or get involved with the project, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to follow up a recent pick on the topic of board games with a couple of specifics. I recently picked up the Dungeons and Dragons Castle Ravenloft board game, which has a simplified mechanism; it doesn't have the whole dungeon master and everything. It's something that you can play with just a couple of people, and it's pretty enjoyable, so I had fun playing that recently with my kids. And I also picked up another game called One Deck Dungeon, which you can play with one or two people. It's just a card game that you can quickly go through for a quick little dungeon crawl, and survive through to the boss. So there's an interesting game mechanic there, and I recommend picking up either of those if you need something to do with some spare time. And with that, I'll pass it to you, Robert. Do you have any picks this week? I recently read The Everything Store, which is
[00:39:43] Unknown:
a history of Amazon and some stories about Amazon
[00:39:48] Unknown:
during the early days, and really enjoyed reading that. Yeah. I'll definitely have to take a look at that as well. So thank you very much for taking the time today to join me and discuss the work that you're doing with Ray and Anyscale and the associated libraries that you're building on top of all of that. It's definitely a very interesting project that I've been hearing a lot about recently, so I'm happy to have had the chance to talk to you, and I appreciate all the work that you're doing on it. So thank you again for that, and I hope you enjoy the rest of your day. Thanks so much, and you too. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your
[00:40:46] Unknown:
friends and coworkers.
Introduction and Guest Introduction
Robert Nishihara's Background and Introduction to Python
The Origin and Motivation Behind Ray
Challenges and Evolution of Ray
Use Cases and Applications of Ray
Implementing and Scaling Applications with Ray
Technical Implementation and Architecture of Ray
Managing Dependencies and Operational Characteristics
Building and Maintaining Ray's Ecosystem
AnyScale and Commercializing Ray
Debugging and Profiling Distributed Applications
Closing Remarks and Picks