Summary
Machine learning has typically been performed on large volumes of data in one place. As more computing happens at the edge on mobile and low-power devices, the learning is being federated, which brings a new set of challenges. Daniel Beutel co-created the Flower framework to make federated learning more manageable. In this episode he shares his motivations for starting the project, how you can use it for your own work, and the unique challenges and benefits that this emerging model offers. This is a great exploration of the federated learning space and a framework that makes it more approachable.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to pythonpodcast.com/census today to get a free 14-day trial.
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at pythonpodcast.com/hightouch.
- Your host as usual is Tobias Macey and today I’m interviewing Daniel Beutel about Flower, a framework for building federated learning systems
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what federated learning is?
- What is Flower and what’s the story behind it?
- What are the trade-offs between federated and centralized models of machine learning?
- What are some of the types of use cases or workloads that federated learning is used for?
- Federated learning appears to be a growing area of interest. How would you characterize the current state of the ecosystem?
- What are the most complex or challenging aspects of federating model training?
- How does Flower simplify the process of distributing the model training process?
- Can you describe how Flower is implemented?
- How have the goals and/or design of Flower changed or evolved since you first began working on it?
- One of the design principles that you list is "understandability". What are some of the ways that that manifests in the project?
- It also mentions extensibility. What are the interfaces that Flower exposes for integration or extending its capabilities?
- For someone who has an existing project that runs in a centralized manner, what are some indicators that a federated approach would be beneficial?
- What is involved in translating the existing project to run in a federated fashion using Flower?
- What is involved in building a production ready system with Flower?
- How does your work at Adap inform the design and product direction for Flower?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Flower used?
- What are the most interesting, unexpected, or challenging lessons that you have learned from your work on and with Flower?
- When is Flower the wrong choice?
- What do you have planned for the future of the project?
Keep In Touch
- danieljanes on GitHub
- @daniel_janes on Twitter
Picks
- Tobias
- Daniel
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Flower
- Adap
- Hyperparameter Optimization
- Federated Learning
- University of Oxford
- University of Cambridge
- Nvidia Jetson
- PyTorch
- TensorFlow Lite
- TensorFlow Federated
- PySyft
- Flower Summit
- JAX
- CNN == Convolutional Neural Network
- Keras
- gRPC
- MQTT
- NumPy NDArray
- AWS Device Farm
- Ray Framework
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at pythonpodcast.com/hightouch.
Your host as usual is Tobias Macey. And today, I'm interviewing Daniel Beutel about Flower, a framework for building federated learning systems. So, Daniel, can you start by introducing yourself?
[00:01:39] Unknown:
Hi. My name is Daniel. I'm the founder of a Hamburg based startup called Adap, and I'm one of the creators of the Flower federated learning framework. We call it Flower, the friendly federated learning framework. And do you remember how you first got introduced to Python? Yes. Absolutely. So I started doing Python a bit more than 10 years ago, actually, during my undergrad studies. Then I didn't use it for quite a bit, but then it basically resurfaced and I rediscovered it through the whole data science space. Everything machine learning is pretty much based around Python.
So that's how I, again, got into it and started to learn it properly. And now we basically use it for most of the things we do, apart from some edge cases like the web and mobile applications where other languages are more suitable. We now use it for all of the machine learning, all the backend systems,
[00:02:29] Unknown:
and actually for some robotics cases too. And before we get too deep into Flower specifically, you know, I mentioned that it's a framework for building federated learning systems. So I'm just wondering if you can start by giving a bit of a description about what that means, what federated learning is in the context of data science and ML.
[00:02:48] Unknown:
So when we usually talk about machine learning, the underlying assumption usually is that we have a single dataset that sits in a single place. So we can access the entire dataset, we can analyze the entire dataset, and we can basically read any example that is placed in the dataset. With federated learning, that's not the case anymore. I mean, having a dataset in a single location is pretty restrictive. You have to collect all of the data first, which is not always possible. Sometimes it's not possible due to, for example, legal constraints. Sometimes it's not possible due to technical constraints, because the data is just too large, and the rate at which the data is being generated is just too large to transmit all of that over the network.
So in many cases, you simply can't centralize all of the data in a single place. Federated learning is an approach to overcome that limitation. In federated learning, the underlying assumption is that data is distributed across multiple partitions. This could be just a handful of partitions, but it could equally well be a very, very large number of partitions, for example, data stored on mobile phones across the entire globe. So there could be billions of partitions. Basically, what we do in centralized machine learning is we centralize the data, and then we train the model on top of the centralized data. Federated learning turns this inside out. So instead of sending all of the data from those individual partitions to a single server, we are sending the machine learning model to those partitions. We are not moving the data, we are moving the machine learning models.
So we move the machine learning models to those partitions, then we train the machine learning model a little bit on those partitions, and then we collect the refined machine learning models in a central place and aggregate all of those refined machine learning models. And usually, especially if we have a very large number of partitions, we only do this federated training across a subset of those partitions. It's just not beneficial to select all of these partitions every time. So we select a subset of those partitions, we send the model to these partitions, we train the model on these data partitions, and then we collect not the data, but the model from these partitions, and we aggregate the updated models. That gives us a new version, basically an improved version of the global model, which hopefully contains learnings from all of these partitions that we selected.
And that describes a single round of federated learning. Then we do multiple of these rounds in order to select more partitions, get more training done, and learn more about all of these local partitions. And eventually, we hopefully end up with a model that contains the learnings from all of these partitions.
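To make that concrete, here is a minimal sketch of a single round in plain Python. The `partitions` collection and the `client.train` method are hypothetical stand-ins for whatever holds your data and local training logic; this is an illustration of the round structure described above, not Flower's API.

```python
import random

def federated_round(global_weights, partitions, fraction=0.1):
    """One illustrative round of FedAvg-style federated learning."""
    # 1. Select a subset of the available partitions (clients).
    num_selected = max(1, int(len(partitions) * fraction))
    selected = random.sample(partitions, num_selected)

    # 2. The model travels to the data: each selected client trains
    #    locally on its own partition and returns refined weights.
    updates, counts = [], []
    for client in selected:
        local_weights, num_examples = client.train(global_weights)
        updates.append(local_weights)
        counts.append(num_examples)

    # 3. Aggregate: average the refined models, weighted by how many
    #    examples each client trained on, to get the new global model.
    total = sum(counts)
    return [
        sum(update[layer] * n for update, n in zip(updates, counts)) / total
        for layer in range(len(global_weights))
    ]
```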
[00:05:47] Unknown:
I'm definitely interested in digging a bit more into some of the specifics about some of the distributed capabilities and what actually gets sent back and forth. But before we go to that point, let's dig a bit more into Flower itself, what it is, and some of the story behind why you created it and how you ended up where you are today.
[00:06:04] Unknown:
So when we started out doing machine learning projects, we immediately discovered that it's not the model architecture, so how many layers you have and what kind of hyperparameters you use, but actually the underlying data that is the major limiting factor when it comes to doing machine learning projects and designing and creating good models that can give you good predictions. Oftentimes, you can play a lot with the hyperparameters, optimize the hyperparameters, improve your model architecture. But the gains you're getting in final prediction quality are pretty limited if the underlying data doesn't properly reflect the kinds of new examples that you're going to get over time.
But then on the other hand, when you are able to collect more data, and especially if you're able to collect better data, that can give the model quality a huge boost, way better than most kinds of hyperparameter optimizations and other things. So that was a realization that we had early on, but then the question, obviously, is how do you collect more data? How do you collect better data? And oftentimes, as I was saying before, this is simply not possible because of legal constraints or, for example, technical constraints that prevent you from collecting more data. That was around the time when federated learning was originally proposed. So that was in 2016, 2017, when the first federated learning paper came out, and it was originally discussed in the context of mobile devices.
But when we looked at it, we immediately thought, this is not just a solution for mobile devices, this is a solution for many other kinds of machine learning problems in general, where the data simply cannot be centralized. So we started to look at it out of basic practical necessity, because we had cases where the data was distributed, and we could not centralize the data, and we wanted to train a model, but it was difficult to do so. We started to look at existing tooling in order to build such workloads, and it turned out that none of the tooling that was available was suitable for the kinds of workloads we wanted to build. It was also around the time when I was just finishing my degree and wanted to investigate certain properties of federated learning, and it turned out that there was just no toolkit to build the kinds of workloads we had in mind.
We then decided to shift our focus and actually first design and then build a tool that allows us to build the kinds of workloads we had in mind. We took the learnings that we gained from that first prototyping experience, designed a proper framework, and started to implement it at the beginning of last year. Initially, the framework had no name, and then after a few weeks, the name Flower started to pop up. Starting from the first commit, we open sourced the framework, immediately started to work with a few researchers at the University of Oxford and later the University of Cambridge, and gathered a lot of feedback from these researchers to improve the framework, to see what kind of APIs are easy to work with, and to see what kind of functionality they would need in order to build new kinds of state of the art workloads.
And that's how we got started with the whole Flower framework and how we continued to evolve the APIs and make the framework better for these new kinds of workloads.
[00:09:39] Unknown:
And then as far as the federated versus centralized model, you were mentioning that one of the main motivating factors is whether it's feasible to locate all of the data in one place versus keeping it separated because of legal or technical reasons. I'm wondering if you can dig a bit more into some of the trade-offs in terms of the capabilities and complexity and use cases for federated versus centralized machine learning?
[00:10:08] Unknown:
Both centralized and federated machine learning come with their own types of complexity. In centralized machine learning, when you collect all of the data in a single place, you usually end up with a lot of data. So what you need are these so-called big data systems that can handle these vast amounts of data. In federated learning, you don't collect the data anymore. That's nice because you don't need this kind of big data system in the cloud. But then, on the other hand, you add complexity because you have to send the machine learning model back and forth between devices and this centralized instance.
This obviously has an impact on how models converge, for example, and how long these workloads take to converge. Convergence time is one of those things: it obviously takes longer to converge in a federated system, because you have to send the model back and forth all the time, and because you have to average the updates that you get from the individual clients. Convergence properties are also something to keep in mind. In a federated system, you do have additional hyperparameters. You, for example, need to make a decision about how many of these partitions you select for training each time.
You need to make a decision about how long you train on these partitions. And there are a few other hyperparameters, like, for example, the strategy for how you select these partitions. All of these partitions could be fairly equal, for example, but there could also be huge heterogeneity across the partitions, and that can have an effect on how it's best to select these partitions for training. One other thing that we find quite interesting and that we looked into from a research perspective is the CO2 consumption of these workloads.
Centralized machine learning is well known for the CO2 impact it has. And we started to look at FL, federated learning, to see what would be consumed by these kinds of workloads. Initially, we thought that FL was way worse because it takes a bit longer to converge, obviously, because there are so many devices involved and because there's more network communication involved. But it actually turned out that there are some cases where FL is actually better in terms of CO2 impact than centralized machine learning. And the major factor contributing to that was that in centralized machine learning, you do have all of your accelerators in a data center, and then you need to cool these accelerators actively in order to perform these workloads.
In federated learning, if you perform it across, for example, mobile devices, these devices are usually passively cooled. So there's no active cooling involved, and that factor alone led to the situation where, for some kinds of workloads, federated learning can be more CO2 efficient than centralized learning.
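Those extra hyperparameters (what fraction of partitions to select per round, minimum client counts) map directly onto strategy configuration in Flower. An illustrative sketch follows; these parameter names exist on Flower's built-in FedAvg strategy, though exact names can shift between releases, so treat it as a guide rather than gospel.

```python
import flwr as fl

# Hedged sketch: configure how clients (partitions) are sampled each round.
strategy = fl.server.strategy.FedAvg(
    fraction_fit=0.1,          # sample 10% of connected clients per round
    min_fit_clients=10,        # but never train on fewer than 10
    min_available_clients=80,  # and wait until at least 80 are connected
)
```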
[00:13:10] Unknown:
Digging a bit more into the aspects of running on mobile devices or running across, you know, disparate devices, there are, I'm sure, a number of different factors that play into how you then design the model and the training aspects to try and reduce the overall power consumption or to account for the fact that there are going to be variances in terms of the capabilities of the hardware that you're running on. And I'm just wondering if you can talk through some of the additional complexity that that brings in and, you know, maybe tie that into some of the types of use cases or workloads where federated learning is used and how that factors into the types of devices that it's executing on. It's definitely sometimes more challenging to design a federated learning workload, because you have to factor in all of these aspects.
[00:13:57] Unknown:
So if you, for example, run across mobile devices, you first need to determine what kind of mobile devices you will run on, because there's a huge variety in hardware capabilities on these devices. Some of these devices have very strong hardware acceleration that you can use to accelerate the training. There are very strong devices like, for example, NVIDIA Jetson devices, which come with perfect support for CUDA-based workloads. They even support Python and basically deliver a similar developer experience to what you have when developing on a server. But then there are other devices which are much more constrained: regular handsets, like the latest iPhones, for example, but also slower kinds of devices like IoT edge devices that are sometimes only powered by a microcontroller.
So this is something that you definitely have to account for. In some cases, you can pretty much run models that you could almost run on a server. And then in other cases, you obviously need to have much smaller models. And, oftentimes, that also means switching programming languages. So, oftentimes, you cannot implement these models in Python anymore. You have to move to other languages that are better suited for the particular platform. In some cases, like on iOS, you would move to Swift, for example. On Android, you would move to Java or Kotlin. And then on other platforms, you would move to, for example, C++ on very resource-constrained devices.
And that's something that we support with the Flower framework, because the Flower framework offers different kinds of integration possibilities on the client side. This is actually what the Flower framework was intentionally designed for: this heterogeneity on the client side. So the Flower framework allows you to either build workloads very easily with high-level SDKs, and our currently best maintained high-level SDK is the Python SDK. But then we offer another possibility to integrate with the Flower framework, which is directly implementing the underlying protocol.
So that would mean you could take basically any kind of programming language that is compatible with gRPC, for example, and then implement the protocol, basically handle the events that get sent back and forth between server and client, and do the model training. Obviously, that poses some challenges because not every programming language has very good support for machine learning. So in some cases, you would need to code a model by hand, but then there are also very interesting recent developments. For example, PyTorch is actively working on making C++ a first class citizen. So LibTorch is basically the underlying C++ implementation of PyTorch, and that allows you to build machine learning models in C++ similar to the way you build machine learning models using Python.
So that's a very promising direction. But for some platforms, obviously, there's still limited support for doing machine learning training on the edge. For example, on Android devices, you can already use something like TensorFlow Lite to do small kinds of model customization steps, but it's not as flexible as TensorFlow currently is on the server side, for example.
[00:17:17] Unknown:
In terms of the broader ecosystem around federated learning, as you mentioned, it's something that's fairly recent in terms of being actively employed. I think you said it was on the order of about three or four years since it was first popularized. And in the past few months, I've been seeing it pop up in a few different places. So I'm wondering if you can just give your characterization of the current state of the ecosystem and state of the art for federated learning and what you view as Flower's role within that space.
[00:17:48] Unknown:
The current state of the ecosystem for federated learning is still quite early, but we see the first signs of it beginning to mature. So there are frameworks like, for example, TensorFlow Federated, with some very interesting concepts on how you can build these workloads, and obviously our own framework, Flower, as well. They all take sort of different perspectives on the whole field, and there are a lot of smaller frameworks that all come with their own unique benefits. But I think it's still very early in the sense that there's no clear winner amongst these frameworks yet, and there's no established way of doing things. It's not as proven as, for example, other areas like web frameworks, not quite as mature, and I think there's still a lot of innovation happening in that space.
It's not been too long since we started out. It was the beginning of last year when we actually did the very first commit, and it's been growing very rapidly. Now, in just about exactly one week, we are having the first Flower Summit, the first conference around Flower. It's a pretty exciting development for us, since it's not been that long since we started out. And I think for the ecosystem, this is a very positive development, because there is a lot of competition. There are a lot of new ideas, and the frameworks that are being developed are still working on supporting all of these kinds of ideas that are out there. So there are new ideas being proposed in federated learning practically on a daily basis, and frameworks need to provide the flexibility to support all of these kinds of ideas.
And, obviously, that poses a challenge for framework authors. It's easier to build something that has a static target, but the federated learning ecosystem is pretty much a moving target. So you have to keep up with the development, and you have to think hard about how we can build a framework that allows us to build the kinds of workloads that are around today, but also the kinds of workloads that will be around tomorrow and maybe a year down the line. That's, for example, where the Flower framework takes a very flexible approach. What we do is try to be as agnostic as possible towards the kind of implementation that runs on the client side. As I mentioned before, we are, for example, agnostic towards the programming language that you use on the client side. Flower is also agnostic towards the kind of machine learning framework that you use on the client side. I think that's something that is a little bit unique about Flower. You can use it together with, for example, TensorFlow. You can use it in combination with PyTorch.
But, also, by design, it's forward compatible with new kinds of developments. For example, you can use it with JAX, which is gaining popularity. And then there are some other kinds of more niche frameworks. By design, Flower is compatible with these kinds of frameworks without the Flower authors needing to know about the existence of these frameworks at all, which I think is pretty good in terms of forward compatibility. We don't know which framework will dominate five years down the line, but the approach that Flower takes almost guarantees that we're compatible with these upcoming developments.
[00:21:14] Unknown:
And so in terms of the actual functionality of Flower, you mentioned some of the capabilities that it has for being able to act as the centralized server and interact with these different clients, and its composability and extensibility with different frameworks. But can you just talk a bit more about how it's implemented and some of the ways that it actually simplifies or smooths out some of the complexities of building these federated systems?
[00:21:43] Unknown:
Yes. So building these federated systems poses a couple of challenges. One is this inherent heterogeneity: you can have very different clients in the same workload, you can have very different data distributions on each client in these workloads, and you can have a very large number of clients in a workload, so scale is also an issue. What Flower provides is basically everything above the actual model implementation. So you can do the model implementation and the local training of this model in any kind of framework and language that you want. So if you have an existing project, for example, that's implemented in PyTorch and does some kind of CNN training, you can use that existing model, and you can use the existing pipeline and all of the existing libraries that you use to implement this workload.
And then you can add in a few lines of Flower code, basically implement a couple of callbacks, and that's enough to federate these existing pipelines and these existing machine learning workloads. If you want to do just a very simple setup, you can just start a Flower server and then start a couple of these clients, and you have a federated setup. But then, over time, obviously, you want to customize these workloads further. So what Flower does is follow this approach of so-called gradual disclosure of complexity, which is something that we adopted from Keras, and I think it's just brilliant.
This gradual disclosure of complexity means that you can start out in a very simple fashion, by building a very simple workload, and then over time, you can start to customize individual aspects of this workload, make them more complex, make them more sophisticated. On the server side, as I was already saying, you can basically start the server with almost a single line of code. But then over time, you can start to customize the server. You can start to customize the strategy, for example: add in ways to select the clients, add in ways to better aggregate the models that you get back from the individual clients, and basically build out fully fledged systems on the server side. But you still have this capability where, if you're just getting started with federated learning, you can start very gently, in a very simple fashion.
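As a concrete sketch, here is roughly what those few callbacks look like with Flower's Python SDK wrapped around an existing Keras model. The `NumPyClient` interface and the start calls are Flower APIs, but exact signatures vary between releases, and the model and data names are placeholders for your own pipeline.

```python
import flwr as fl

class KerasClient(fl.client.NumPyClient):
    """Federates an existing Keras pipeline via a few callbacks."""

    def __init__(self, model, x_train, y_train, x_val, y_val):
        self.model = model
        self.x_train, self.y_train = x_train, y_train
        self.x_val, self.y_val = x_val, y_val

    def get_parameters(self, config=None):
        # Hand the current local weights to the server as NumPy ndarrays.
        return self.model.get_weights()

    def fit(self, parameters, config):
        # Receive the global model, train a little locally, send it back.
        self.model.set_weights(parameters)
        self.model.fit(self.x_train, self.y_train, epochs=1, verbose=0)
        return self.model.get_weights(), len(self.x_train), {}

    def evaluate(self, parameters, config):
        self.model.set_weights(parameters)
        loss, accuracy = self.model.evaluate(self.x_val, self.y_val, verbose=0)
        return loss, len(self.x_val), {"accuracy": accuracy}

# A minimal setup is then one call on each side (addresses are examples):
# fl.server.start_server(server_address="0.0.0.0:8080")
# fl.client.start_numpy_client(server_address="127.0.0.1:8080",
#                              client=KerasClient(model, x_tr, y_tr, x_v, y_v))
```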
[00:24:07] Unknown:
And in terms of Flower itself, you mentioned this gradual disclosure of complexity. I'm just wondering if you can talk through some of the ways that the overall goals and design of Flower have changed or evolved since you first began working on it, as you've been iterating on it in your work at Adap and with people who are using it in the community, and some of the overall evolution of the project from when you first conceived of it to where it is now?
[00:24:33] Unknown:
The overall goals that we had for the project haven't changed very much, but it became more clear that these goals are actually important. As we worked on them over time, we understood them better, and we understood how important they are. We started out with the vision to build a system that allows us to really build these workloads, not only in an academic simulation environment, but also in the real world, on real edge devices, communicating over real networks that may be unreliable, and all these kinds of system factors that you need to consider.
And as we started to build this, we learned a lot about what it means to build these systems in the real world. So we learned a lot about what kinds of topics, for example, you need to consider when you communicate over the network as opposed to just simulating these systems on a single large server. Initially, for example, we just planned on supporting multiple frameworks on the client side, and that was already something that was not easily possible in the other frameworks. But then we realized we can provide this modularity not only on the client side, but on basically all layers of the framework.
So we can provide this modularity on the client side, supporting multiple machine learning frameworks. We can provide it when it comes to programming languages, supporting not just Python, but also C++ and Java and Swift. But we can also provide modularity when it comes to the communication layer itself. The initial version, and the version that is open source right now, is based around gRPC. So it uses gRPC to communicate between the client and the server. But, actually, there's nothing inherently tied to gRPC. We realized we can actually build this in a modular fashion, where the communication layer itself is something that you can configure, and you can add in additional communication implementations.
So we could, for example, build an additional communication layer that is based around REST principles, or we could build one that's built around MQTT or some other kind of communication technology, and then actually have workloads that are comprised of some clients communicating over gRPC with the server and some other clients using, for example, MQTT to communicate with the server. So that's modularity on the communication layer. And then we have other ways of providing modularity on the server side itself. On the server side, you make all of these decisions about how to select clients and how to aggregate updates coming from the clients.
And that's obviously something that is very specific to the actual workload that you're running. In some workloads, you have a very small number of clients, so you want to select all of the clients all of the time. But then in other workloads, you have a huge number of clients where that doesn't make sense, or would even lead to a collapse in the training if you selected all of the clients during a single round. So you want to make that as configurable as possible. On the server side, we built an abstraction that is called a strategy, where you can implement your own logic that defines how to select clients for training, how to aggregate the training results, how to select clients for validation, how to aggregate the validation results, and also other topics that we want to expand in the future, and that allows you to really customize your workload deeply without having to dig into, for example, the communication protocol or something like that. I think the very interesting aspect is that Flower is modular on basically each layer of the stack, but you don't have to touch this modularity. You can make use of it if you need to, but you don't have to. As I was saying earlier, it's still easy to get started. It's easy to build a simple server and a simple client and use the default communication technology.
But if you want to, you can dig into each of these pieces, and different users, based on what their background is and what their interests are, will want to customize different aspects of this. So, for example, machine learning researchers are probably interested in the kind of model that you have on the client side and the kind of strategy that you use to aggregate the updates coming from the client side. But then maybe a network or systems person would say, okay, I can use some out of the box model and some out of the box strategy, like federated averaging, but I'm really interested in the network layer, and I want to switch out the network layer and try different things there. So that's what this modularity brings to the table.
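A small sketch of what that can look like in practice: subclassing Flower's built-in FedAvg strategy to hook into the aggregation step, while delegating the actual weighted averaging to the parent class. The method and argument names follow Flower's Strategy interface, though they have shifted slightly across versions, so treat this as illustrative.

```python
import flwr as fl

class InspectingFedAvg(fl.server.strategy.FedAvg):
    """FedAvg plus a custom hook on the aggregation step (illustrative)."""

    def aggregate_fit(self, server_round, results, failures):
        # Inspect what came back from the selected clients this round...
        print(f"round {server_round}: {len(results)} updates, "
              f"{len(failures)} failures")
        # ...then let the built-in FedAvg do the weighted averaging.
        return super().aggregate_fit(server_round, results, failures)

# Swap the strategy in; client code is unaffected:
# fl.server.start_server(strategy=InspectingFedAvg(fraction_fit=0.5))
```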
[00:29:24] Unknown:
Digging more into the extensibility aspects of the project, I know that there are some core tenets that you've laid out as the guiding principles for Flower, one of those being extensibility. I'm wondering if you can just talk through a bit more of the interfaces that Flower exposes for being able to add integration points or extend its functionality.
[00:29:46] Unknown:
Fundamentally, there are two points of integration where you can plug in your own code and your own machine learning pipelines. One is on the client side, and the other one is on the server side. For the one on the client side, as I mentioned earlier, there are basically two ways to do it. There's a high-level SDK. Currently, we provide a high-level SDK in Python. There's another high-level SDK in the works, which is based on Java; there's currently just a draft pull request open for that. And there will be other high-level SDKs in the future. These high-level SDKs basically enable you to just implement very few callback functions, and that's all you need to do in order to plug your own machine learning pipeline into the Flower framework and federate these pipelines using Flower. But then there's this other way of integrating with Flower on the client side, which is to implement the Flower protocol directly, and this is something that you would probably want to do for more sophisticated use cases. So if you're on some exotic kind of hardware that is not well supported, for example, that doesn't run Python very easily, you would probably want to implement your own client that directly implements the Flower protocol, and use that to customize your workload.
One thing that's interesting is that you could think, okay, there is a Python SDK, so that's the only way to do it in Python, but that's technically not true. You could also go ahead and do an alternative implementation of the Flower protocol in Python if you have very specific requirements. So if you have specific requirements about when the client should attempt to connect to the server, for example, or how the client should handle the messages coming from the server, you could just ignore the fact that there is already a high-level Python SDK, implement the Flower protocol directly, and then make very informed decisions about how to handle these events coming from the server, how to handle the connectivity with the server, and other kinds of things. The other major integration point with the Flower SDK, in terms of extensibility, is on the server side, and that's the so-called strategy interface.
So the strategy interface is basically an abstraction that allows you to bring your own federated learning algorithms, and also allows you to use other algorithms, perhaps implemented by others, sometimes implemented by the core framework itself as a built-in strategy, or implemented by a third party, for example, in a GitHub repo. It allows you to reuse these strategies. And that's very interesting because it means that, oftentimes, depending on the specifics of the algorithm, of course, you can just go ahead, bring your own model, federate this model using Flower, and then plug one of the existing, well-established strategies into the framework, and have a federated learning system. It also allows, for example, researchers to easily conduct research, easily comparing different strategies with the same kind of model on the client side. This reusability of strategies is something that, at least as far as I'm aware, wasn't easily possible before.
[00:33:06] Unknown:
In terms of the actual communication pieces, you mentioned that you're currently using gRPC. But as far as the actual data that's going between the clients and the server, I'm wondering if you can dig a bit more into the specifics about the shape that information takes. So, for instance, is it just a matrix of vector information for the derived model? Is it some sort of numeric values or serialized objects going back and forth? I'm just curious what the actual communications between the server and the clients look like and how that factors into the overall life cycle of the model development.
[00:33:41] Unknown:
The actual communication that happens between the server and the client right now is based around gRPC, but as I was saying earlier, there's actually nothing stopping us from adding additional communication technologies into the mix there. If you look at the gRPC implementation, it's basically based around a protocol which defines a handful of messages: on the one hand, messages that are going from the server to the clients, and on the other hand, messages that are going from the clients to the server. And what these messages contain is obviously serialized model parameters or gradients.
And now the interesting question is, and that's something we thought about quite a bit, how do we serialize this properly? Of course, there are some existing serialization approaches out there, but the question is, which one is the best one? Obviously, there's no best solution for every case; it again depends quite a lot on what you actually want to achieve and what you actually want to implement. So we took the decision to make the communication as agnostic towards the serialization as possible, and we ended up with just communicating binary arrays. And that's really the heart of the serialization that is going on. You have your model that's implemented in PyTorch or in TensorFlow, and then you serialize it out into a binary array, basically into a byte array.
Then you communicate this byte array over the network, and on the other side, you deserialize it again. And the way this works is, again, we provide multiple ways of doing this. If you implement the protocol directly, then you have a ton of flexibility there. But we provide one high-level built-in, which is NumPy ndarrays. Most Python-based machine learning frameworks have very good interoperability with NumPy. So if you extract the model weights as NumPy ndarrays, you can just hand them over to the Python SDK. For example, in Keras, that's a simple call to model.get_weights, and then you get a list of NumPy ndarrays, and the Python SDK serializes this list of NumPy ndarrays into byte arrays, which we send over the network. And then on the other side, we do the reverse: we deserialize it, and we call model.set_weights to update the model weights.
What you can also do, obviously, is serialize into any other format. So you could plug in your own serializers and deserializers and then use other formats, other ways to serialize the model and send it over the network. The NumPy story is well documented and well supported, but the way to plug in other serializers and deserializers is something that we want to document better in the future and provide examples around how to do that. In the Python world, you might say, okay, NumPy ndarrays should be enough for what I want to do, and they usually are enough. But then, as soon as you start to look at other languages, you need other ways of serializing the model parameters.
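Conceptually, that built-in path is just "list of NumPy ndarrays to bytes and back". Here is a self-contained sketch of that round trip as an illustration, not Flower's internal code:

```python
import io
import numpy as np

def ndarrays_to_bytes(arrays):
    """Serialize a list of ndarrays (e.g. model.get_weights()) into bytes."""
    buffer = io.BytesIO()
    np.savez(buffer, *arrays)
    return buffer.getvalue()

def bytes_to_ndarrays(blob):
    """Deserialize the byte blob back into the original list of ndarrays."""
    loaded = np.load(io.BytesIO(blob))
    return [loaded[f"arr_{i}"] for i in range(len(loaded.files))]

# Round trip two toy weight tensors, as a server and client would.
weights = [np.random.rand(784, 128), np.random.rand(128)]
restored = bytes_to_ndarrays(ndarrays_to_bytes(weights))
assert all(np.array_equal(a, b) for a, b in zip(weights, restored))
```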
[00:36:58] Unknown:
We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV file via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud data warehouse to SaaS applications like Salesforce, Marketo, HubSpot, and many more. Go to pythonpodcast.com/census today to get a free 14 day trial and make your life a lot easier.
And then another one of the core principles that you have for the design of Flower is understandability. I'm wondering if you can just talk through some of the ways that that manifests in terms of the work that you're doing on the project or the way that you think about the interface and implementation.
[00:37:57] Unknown:
Understandability is actually a difficult one. Often, if you look at a project and the project provides very nice high-level APIs, APIs that you immediately understand, it feels like these APIs are obvious. And if you get the feeling from looking at an API, okay, that's an obvious API, and I immediately know what this API does, that's a very good thing. But it's actually not something that is easy to achieve. So that's something that we look at quite a bit and that we think about quite a lot before we mature any features, before we release new features. We think a lot about the API and how it speaks to the user.
So far, we've gotten very good feedback about the APIs we provide. They seem to be very intuitive. Not all of them, but most of them, hopefully, at least the ones that we documented well. And when we build a new feature, we don't just ship the first version of this feature. We actually build multiple versions of these features, and we try to iterate a lot on the API. We build version one and get a basic feeling for what the API could look like, but usually the API in the very first version is not perfect yet. Then we build another version, which is hopefully a bit better. We build yet another version, which might be slightly worse, but we learn something from it. And over time, we get a feeling for what kinds of APIs feel intuitive and easy to work with. The way we do this is, on the one hand, we build the API, but at the same time, we build code examples that use the API, so that we immediately see what this would look like for a user of the framework.
That takes some time, and it takes patience to iterate on it, and, well, it's not always easy to throw something away that you built before, but it really pays off in the long run. We've actually had users saying, well, I actually haven't looked at the documentation yet because, from the code example that I saw, the API was clear to me. So they didn't have to look at the documentation at all, which I take as quite a big compliment for the APIs that we have in Flower, because that's not something that is always the case. So that's something we focus on a lot, and we want to get even better at in the future.
[00:40:23] Unknown:
And so in terms of actually building a production ready system with Flower, I know that it gives a lot of capabilities out of the box for being able to build these models and distribute the actual training across these different clients, whether it's mobile devices or, you know, IoT sensors or a distributed fleet of machines running in Amazon. But what are some of the additional systems or services or development effort that's necessary for being able to go from "I have this idea that I've built out as a proof of concept with Flower" to actually running it in a production ready environment, and being able to monitor the performance and distribute the actual model code to clients, and all of the overarching considerations that are necessary for going from idea to production?
[00:41:12] Unknown:
When you start out with an idea and you want to build out this idea, as you're saying, you would usually want to start out with a POC. And oftentimes, you implement such a POC in Python, as most researchers do, for example. And then you run this first workload, either on a developer machine, or on a large scale server or a cluster of servers to simulate the workload. And before you move into production, you actually want to make sure that the hyperparameters that you set basically allow the model to converge over time. So what you want to do is create a simulation that hopefully is large enough to simulate the real world properties of your workload adequately.
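Flower also grew a simulation mode aimed at exactly this validate-before-production step. A sketch follows, under the assumption that `make_client` is a hypothetical factory wrapping your own model and data loading, and noting that the shape of the config argument has changed across Flower releases:

```python
import flwr as fl

def client_fn(cid: str):
    # Build a client for virtual partition `cid`; make_client is a
    # hypothetical factory wrapping your own model and data loading.
    return make_client(partition_id=int(cid))

# Simulate 100 virtual clients on one machine to sanity-check convergence
# before touching real devices. Older releases took config={"num_rounds": 20}.
fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=100,
    config=fl.server.ServerConfig(num_rounds=20),
    strategy=fl.server.strategy.FedAvg(fraction_fit=0.1),
)
```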
That's something you can do on a large server or on a large server cluster. And then the next step is to actually bring this workload into production, bring it onto real devices, for example. This next step of taking it into production is highly dependent on the kind of workload that you have. Sometimes, you have a workload that you want to deploy onto IoT devices. So you have a lot of IoT devices, and they may be low powered. They may not have proper support for Python, for example. So what you need to do there is basically translate the model that you implemented in Python before.
You need to translate that, sometimes even into another programming language, and implement the model in C++, for example, for resource-constrained devices. But then there are other kinds of workloads where you can pretty much take the existing simulation and add in the usual things that you need for production workloads, like monitoring of the system, proper logging, and things like tracing and so on, and then deploy this across multiple servers. That's the case for federated learning systems that, for example, run across data partitions of different organizations.
Usually, these workloads run on regular servers, and you actually don't need to change that much about the code before you can deploy it into production. But, of course, if you want to run on IoT devices, for example, then the challenges are a little bit bigger. The interesting aspect here is that these workloads vary quite a lot. It really depends a lot on where you want to deploy it, what kind of data you have, what kind of model you have, and what kind of communication infrastructure you have between these partitions.
And depending on that, you need to make good decisions about what components to combine in order to make the system run in the real world. And because this is such a heterogeneous field, because there are so many moving pieces, the vision behind Flower is really to provide a single framework that allows you to go from research workloads, all the way to POCs, all the way to production on real IoT edge devices. And that's where the modularity of Flower comes into play. Because there is so much variety in these systems, you need a modular framework that allows you to customize individual aspects of the systems.
For some cases, maybe most of the default implementations of Flower, most of the defaults, are okay, and you can just start to deploy the system and add in some monitoring and logging capabilities. But then for other systems, you have to start to customize this heavily. You might even need to implement your own communication stack, but Flower gives you these capabilities, gives you the modularity that allows you to implement a new kind of communication stack but still reuse what you did on the client side and what you did on the server side earlier. So the interesting aspect is you can reuse most of the things, almost all of the things, but then really customize the aspects that you need to customize.
And then for deploying these models, deploying these clients, for example, that, again, depends heavily on the environment that you're running in. So if you deploy on IoT edge devices, obviously, the way to deploy to these devices and the way to update the code that runs on them is going to be very different from deploying to servers, for example, where you might even have ways to do some kind of continuous deployment or something like that. And then on the server side of things, I'm curious
[00:45:53] Unknown:
what the possibilities or capabilities are for being able to build a high availability system, where you're able to either scale out horizontally with multiple server implementations and have them collaborate on, you know, sending the model parameters or centralizing the model parameters, and just some of the considerations along those lines, if that's even something that's feasible in this type of system.
[00:46:14] Unknown:
Yeah. Having a distributed server is something that we have designed into the system, but we haven't implemented it yet. So the current server is basically a centralized server. It's a single process that you start, but then in the background, the gRPC implementation obviously spawns a lot of threads for the connections that are coming in. But, fundamentally, it's still something that runs on a single machine. You could sort of use the strategy abstraction in a hacky way, but it doesn't have this understandability aspect that we discussed earlier. So there are hacky ways to distribute the server right now, but we have planned for a distributed server implementation that will do this in a very nice and very clean way. And, actually, we'll do it in a way where, if you use it, you'll basically just have to set a configuration parameter or start the server in a slightly different way, and then it does all of that in the background for you. So it will still be compatible, for example, with the strategies that are implemented.
The strategies will not know whether the clients that get selected by the strategy are connected to the same server or to different distributed servers across the globe. So that's something where we focus heavily on, again, getting the APIs right and making this feel very natural. And, again, we're trying to disclose all of this underlying complexity in a very gradual way. So if you're just starting out and you just want to build a simple system, you can start basically on your laptop. But then over time, you can basically expand this, use more advanced configurations, and eventually distribute the server across the globe onto different servers and get incredible scalability by doing so, but you don't have to deal with this complexity in the beginning.
[00:48:12] Unknown:
And in terms of the ways that you're using Flower, either at Adap or in some of your side projects, or the ways that you're seeing it used in the community, what are some of the most interesting or innovative or unexpected projects that you've seen built with Flower?
[00:48:26] Unknown:
There have been a few very surprising and interesting developments. So let me give you two examples. One example is actually fully distributed, or actually peer-to-peer, kinds of systems, where you would not have a single centralized server, but you would have multiple devices or multiple parties that are collaborating in training, each party acting both as a client and a server. So they can send weights, they can receive weights, they can train the weights, and then they can receive weights from multiple other parties, aggregate these weights, and then maybe redistribute them, and so on. That's something we hadn't initially intended Flower to do, but after thinking about it for a bit, it's something that wouldn't be too difficult. There would need to be another layer for these peer-to-peer protocols to work and for the parties to discover each other, but it's actually something that doesn't conflict with the framework.
And then another aspect that I found very encouraging is that people started to extend core parts of the framework very, very heavily. Initially, we thought, okay, this is something where we haven't properly documented how it works, so it's probably quite difficult for someone else to do. But, actually, people started to read the code, started to dig into the code base, and then started to build out very sophisticated systems. Two examples of that. One is someone who just took the framework and then built an Android client, so it was the first time Flower was used on Android.
One of the PhD students in Oxford basically took the protocol, implemented the protocol in Java, built a Flower client together with TensorFlow Lite, and then used TensorFlow Lite to do, basically, some light form of on-device training. And he actually took it a step further and then deployed this entire workload onto AWS. There's an AWS service called, I don't remember the name, but it's a service that basically gives you access to real Android devices. AWS Device Farm? Exactly, Device Farm is the name. So this workload was deployed to AWS Device Farm and then trained on real Android devices in the AWS Device Farm, and there was no documentation whatsoever about how to do this. But, apparently, it was easy enough to get going and do it.
Another example: I mentioned earlier that communication is based around gRPC, and someone at the University of Cambridge basically ripped out this gRPC part and plugged in something different, something based around the Ray framework, and used Ray to do large scale cluster simulations based on Flower. This is actually a very interesting development, and we are talking and thinking about how to incorporate it into the core framework.
[00:51:25] Unknown:
And in your own experience of building Flower and working with it and helping to grow the community and building projects on top of it, what are some of the most interesting or challenging lessons that you've learned in the process?
[00:51:37] Unknown:
Well, there have been quite a few lessons. There's a ton around coding, obviously. When we worked on Flower, it was the first time for us using bidirectional gRPC streaming, which is very powerful, but also challenging in some edge cases. So on the coding side, obviously, there are a ton of things. But then maybe more interesting is the entire open source journey. This is the first real proper open source project that we started. So we really started out building a community, incorporating user feedback, reviewing pull requests, and all these kinds of things that are related to building a community. I'm super happy about how it's going.
The code has been open source on GitHub from day one. We get issues almost on a daily basis, and we get users reaching out in the Flower Slack channel, telling us how they use the framework and, for example, what kinds of topics are still challenging. We use all of this feedback to improve the framework itself and to learn about what users are doing with it. It's not always easy. We get so many feature requests that we have to prioritize very heavily, and that's not always easy to do, obviously. So we learned a lot about engaging with the community.
We try to answer every GitHub issue and every message that we get in Slack. We always try to do that in a timely fashion, and the answers we provide are not always the ones users expect. So that's something where we really need to find a good balance between incorporating user feedback and building new features, while on the other hand not getting too distracted by a single particular interest. That's not always easy, but it's also fun, and it's always nice to hear from users, especially if it's positive feedback, of course. So far, it has been a very, very good experience for us. One of the things that we wanted to do for a long while is to bring users together, and that's why we started to organize the first Flower Summit, which will happen next week.
And I'm really looking forward to that, because it's a very nice milestone in the journey, from the first commit last year to a conference fully dedicated to Flower itself.
[00:53:55] Unknown:
And for people who are starting to investigate the space of federated learning, and they're debating what to build and how to build it, what are the cases where Flower is the wrong choice, and they might be better served by another framework, a centralized model, or just some form of distributed training?
[00:54:13] Unknown:
If you're looking at this purely from a performance perspective, and you have all of the data centralized and just want to scale out to multiple machines, you should just consider regular distributed training. Frameworks like TensorFlow and PyTorch provide very good built-in support for that, and the approach that Flower takes will never match up to them if you only look at distributed training. But there is a fundamental difference between distributed training and federated training in the way that data is distributed, and that's a question we actually get quite a lot:
What's the difference? Isn't it the same thing, because you have a couple of machines and you do machine learning on them? No, it's not the same thing. There's one fundamental difference: in distributed training, you have full control over which data example goes onto which machine. That's not the case in federated learning. You have a fixed data partition on each client, and there's nothing you can do about it; you have to deal with it. Sometimes there's a client with large amounts of data, while another client might have very, very few data points. That's a challenge, obviously, and the entire field of federated learning is based around solving these kinds of challenges.
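A tiny, purely illustrative sketch of that difference: in distributed training the trainer shuffles and shards the data evenly, while in federated learning the partitions arrive fixed and often badly skewed (the sizes below are made up).

```python
import numpy as np

rng = np.random.default_rng(seed=0)
data = np.arange(10_000)

# Distributed training: the framework controls placement, so it can
# shuffle and shard the data evenly across 10 workers.
distributed_shards = np.array_split(rng.permutation(data), 10)

# Federated learning: each client's partition is fixed and often wildly
# unbalanced -- here a made-up skewed split standing in for real devices.
sizes = np.array([4000, 2500, 1500, 800, 500, 300, 200, 100, 70, 30])
federated_shards = np.split(data, np.cumsum(sizes)[:-1])

print([len(s) for s in distributed_shards])  # [1000, 1000, ..., 1000]
print([len(s) for s in federated_shards])    # [4000, 2500, ..., 30]
```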
There are a couple of other cases where Flower would be the wrong choice. For example, if you want to do very sophisticated protocols with a lot of communication rounds, maybe even between the clients themselves, that's not something that is well supported out of the box in Flower. We are looking at ways to provide more flexibility there, but again, as I mentioned earlier, we want to do it in a way that feels intuitive and easy to use. It's not something that every user would want to do, so we are very careful about opening it up in ways that could potentially harm the user experience.
[00:56:21] Unknown:
In terms of the near to medium term future of the project, what are some of the things that you have planned and that you're looking forward to working on?
[00:56:36] Unknown:
Yeah, the one big thing, obviously, that we have planned is reaching Flower 1.0. Currently, we have version 0.16 released. It's very stable, and it's being used quite a lot. Eventually, we want to release a version that we call 1.0, with very good forward compatibility guarantees. The last few releases have been fairly forward compatible. There have been minor breaking changes here and there, and we provided, for example, migration guides on how to migrate from the old APIs to the new ones. There are a couple of upcoming changes on our road map that will still be breaking changes. But eventually, we want to get to the point where we call it 1.0 and say that everything you build on publicly documented APIs will continue to work with any 1.x release.
That's the big overarching goal, but again, there are a few breaking changes that we will still make to the APIs, removing some of the old ones that are still supported but already deprecated. There are also a few features that we want to implement in time for 1.0. There's a handshake kind of feature, where the client can provide information about itself to the server, and the server can use that to make more informed sampling decisions. And there are a couple of features around the server APIs, where we want to make the server even more flexible, but again, in a very usable way.
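As one rough guess at what that handshake could look like on the client side: newer Flower versions expose a `get_properties` hook on `NumPyClient` for reporting client metadata to the server. Whether that hook corresponds to the feature described here is an assumption, and every key in the sketch is invented.

```python
import flwr as fl

class ProfiledClient(fl.client.NumPyClient):
    def get_properties(self, config):
        # Facts about this device that a server-side strategy could use
        # when sampling clients (all of these keys are hypothetical).
        return {
            "num_examples": 1200,
            "battery_level": 0.83,
            "bandwidth_mbps": 12,
        }
```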
And then there's the topic of being agnostic to the way we serialize the weights. What we can do there is provide plugin functionality, where you can plug in different serializers and deserializers. So that's the core framework. Around the core framework, we want to provide additional client implementations in other programming languages, an ecosystem of baselines that makes it easy for researchers to just experiment, and then other bigger topics like the large-scale simulation engine and so on.
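Since the pluggable serialization was still on the roadmap at the time, the following is a purely hypothetical sketch of what such a plugin interface might look like; none of these names exist in Flower.

```python
from io import BytesIO
from typing import List, Protocol

import numpy as np

class WeightSerializer(Protocol):
    """Hypothetical plugin interface for weight (de)serialization."""

    def serialize(self, weights: List[np.ndarray]) -> bytes: ...
    def deserialize(self, data: bytes) -> List[np.ndarray]: ...

class NpzSerializer:
    """One possible plugin, using NumPy's .npz container format."""

    def serialize(self, weights: List[np.ndarray]) -> bytes:
        buffer = BytesIO()
        np.savez(buffer, *weights)
        return buffer.getvalue()

    def deserialize(self, data: bytes) -> List[np.ndarray]:
        archive = np.load(BytesIO(data))
        return [archive[name] for name in archive.files]
```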
[00:58:34] Unknown:
Are there any other aspects of the Flower project, or federated learning in general, or the ways that you're using it at Adap that we didn't discuss yet that you'd like to cover before we close out the show?
[00:58:53] Unknown:
Well, I would like to invite everyone to join the Flower community. Give it a try, and especially let us know how it feels, what was intuitive about it and what you found difficult, for example. Feedback is always very welcome, even if it's someone saying, hey, that was super difficult for me, I don't know how to use this. We look forward to receiving that kind of feedback because it helps us to improve the APIs. There are many ways to get in contact with us: you can just drop us an email, you can join the Slack channel and reach out to the core developers directly, or you can open an issue on GitHub.
Yeah. Give it a go, and let us know how it feels.
[00:59:28] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. This week, I'm going to choose the card game rummy. There are a number of different varieties of it, such as gin rummy and 500 rummy, and I'm sure others that I'm not aware of. But it's just a fun game that's easy to pick up and play quickly with a standard pack of playing cards, so definitely something to learn to help pass the time when you're looking for something to do. And with that, I'll pass it to you, Daniel. Do you have any picks this week? My special pick of the week is actually sports related. What I can highly recommend is stand-up paddling. It's something I do quite a lot to decompress from all the development work that we do. It's really a good activity for focusing your thoughts on something else: being in nature, being on the water, getting a little bit of exercise, but not too much, and basically clearing your head to get inspired to think about new things again. Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on Flower. It's definitely a very interesting project and an interesting problem space that I'm going to have to take a closer look at myself. So thank you for all the time and energy you've put into helping make that a more tractable problem, and I hope you enjoy the rest of your day.
[01:00:45] Unknown:
Thanks for having me.
[01:00:47] Unknown:
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Daniel Beutel: Introduction and Background
Understanding Federated Learning
The Flower Framework: Origin and Development
Federated vs Centralized Machine Learning
Challenges and Use Cases in Federated Learning
Current State and Future of Federated Learning Ecosystem
Implementation and Extensibility of Flower
Communication and Data Serialization in Flower
Understandability and API Design in Flower
Building Production-Ready Systems with Flower
High Availability and Distributed Server Implementation
Interesting Use Cases and Community Contributions
Lessons Learned from Building Flower
When Flower is the Wrong Choice
Future Plans for Flower
Invitation to Join the Flower Community
Closing Remarks and Picks