Summary
Deep learning has largely taken over research and applications of artificial intelligence, with some truly impressive results. The challenge it presents is that achieving reasonable speed and performance requires specialized hardware, generally a dedicated GPU (Graphics Processing Unit). This raises infrastructure cost, adds deployment complexity, and drastically increases the energy required to train and serve models. To address these challenges Nir Shavit combined his experience in multi-core computing and brain science to co-found Neural Magic, where he is leading the effort to build a set of tools that prune dense neural networks so they can execute on commodity CPU hardware. In this episode he explains how sparsification of deep learning models works, the potential it unlocks for making machine learning and specialized AI more accessible, and how you can start using it today.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Nir Shavit about Neural Magic and the benefits of using sparsification techniques for deep learning models
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what Neural Magic is and the story behind it?
- What are the attributes of deep learning architectures that influence the bias toward GPU hardware for training them?
- What are the mathematical aspects of neural networks that have biased the current generation of software tools toward that architectural style?
- How does sparsifying a network architecture allow for improved performance on commodity CPU architectures?
- What is involved in converting a dense neural network into a sparse network?
- Can you describe the components of the Neural Magic architecture and how they are used together to reduce the footprint of deep learning architectures and accelerate their performance on CPUs?
- What are some of the goals or design approaches that have changed or evolved since you first began working on the Neural Magic platform?
- For someone who has an existing model defined, what is the process to convert it to run with the DeepSparse engine?
- What are some of the options for applications of deep learning that are unlocked by enabling the models to train and run without GPU or other specialized hardware?
- The current set of components for Neural Magic is either open source or free to use. What is your long-term business model, and how are you approaching governance of the open source projects?
- What are the most interesting, innovative, or unexpected ways that you have seen Neural Magic and model sparsification used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Neural Magic?
- When is Neural Magic or sparse networks the wrong choice?
- What do you have planned for the future of Neural Magic?
Keep In Touch
Picks
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Neural Magic
- MIT
- Computational Neurobiology
- 6.006 MIT Course
- FLOPS == FLoating point OPerations per Second
- Perceptron
- Convolutional Neural Network
- Lisp
- Quantization of ML
- YOLO ML Model
- Federated Learning
- Reinforcement Learning
- GPT-3
- OpenAI
- Transfer Learning
- Tensor Columns
- Neural Magic DeepSparse Engine
- ONNX
- CUDA
- Sparse Zoo
- Tabnine
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about the Python language and how it's being used for data and science. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL.
Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at pythonpodcast.com/hightouch. Your host as usual is Tobias Macey. And today, I'm interviewing Nir Shavit about Neural Magic and the benefits of using sparsification techniques for deep learning models. So, Nir, can you start by introducing yourself?
[00:01:41] Unknown:
I'm one of the founders of Neural Magic. I'm also an MIT professor. My expertise is kinda multicore computing. I've done a lot of work in theory and in practice of multicore computing. And in recent years, I've been doing what's called computational neurobiology, where we actually are using tools from computer science to help understand and map brain connectivity. So that's kind of where my research is. Neural Magic is kind of the brainchild of this combined research in both multicore systems and neurobiology.
[00:02:18] Unknown:
And do you remember how you first got introduced to Python?
[00:02:20] Unknown:
I think teaching 6.006, which is an MIT course. Right? And it's kind of the first course where students get to actually code algorithms. And the language of choice in 6.006 is really, you know, Python, and I hadn't actually used Python before that. But that was a kind of a jump into the water kinda thing.
[00:02:43] Unknown:
Yeah. I know that the 6.00-series courses are definitely pretty popular both at MIT and also now that they're publicly available on the edX platform. That's right. Yep. And so in terms of what you're building at Neural Magic, I'm wondering if you can just start by giving a bit of an overview about what it is that you've made there and some of the story behind how it came about and why you decided to turn that into a business.
[00:03:06] Unknown:
Yeah. Maybe I'll start at the very beginning. You know, what I've actually learned from the work in neurobiology is kinda that our brains, right, as a computing device, are extremely sparse. Okay. There's not a lot of compute, but they're very big in terms of the representation. So if you will, you can think of your brain as a petabyte of data. Right? The graph. Okay? And, you know, somewhere on the order of a cell phone of compute on that graph, that's what you're doing. If it was more than that, then you probably would be epileptic, okay, in terms of the behavior. So like a cell phone of compute on a petabyte of graph. Okay. And what I saw was that where the hardware people were going was towards a petaflop of compute on a cell phone of, you know, kind of memory. And that's kind of the wrong approach. And so what we decided to do is try to mimic, you know, the things that brains do, right, to achieve performance and try to do that. That's how Neural Magic was born from kind of this research.
So, you know, we tried to run neural networks on CPUs, and we found that we could make them run as fast as GPUs. And from there, you know, the company was born, basically.
[00:04:26] Unknown:
And as you mentioned, there has been a lot of work in the deep learning community pushing towards this GPU oriented or specialized hardware approach of very dense graphs and using these dedicated pieces of hardware to be able to churn through the compute on them. And I'm wondering what it is about the overall sort of mathematical aspects of deep learning architectures and neural networks in terms of the state of the art that has biased towards that specialized hardware, at least to date?
[00:05:00] Unknown:
Yeah. So, you know, initially, when people started out, neural networks were kind of these, what we call, perceptrons. These are kind of fully connected layered neural networks. And with time, people discovered that if they actually use convolutions, okay, you could get better accuracy. And so convolutional neural networks appeared and they delivered the kind of accuracy that we didn't have before. But convolutions are extremely expensive. And so what happened was that for every piece of data that you brought in, you did an enormous amount of computation. Okay?
And CPUs just didn't have that amount of compute. So people said, oh, well, you know, graphics processors are actually designed to deliver these kinds of flops. Let's use them. And that's how this whole idea of, you know, acceleration in hardware was born. And it serves us well, especially in training, not so much in inference, but definitely in training. The fact that we have devices with an enormous amount of flops is really very helpful. One other thing I'll say is that this is not a new kind of thing. The history of computer science is littered with examples where a new area emerges and the first thing that we do to solve the algorithmic problems is to throw special hardware at it. Okay? If you go back to the time when Lisp was introduced as a language, people built Lisp machines because they didn't believe that you could get car and cdr to be implemented efficiently on a CPU. Okay? And, you know, once upon a time, data centers were filled with all kinds of special hardware to help you route packets.
All this has disappeared. People don't do that and, you know, it's all CPU based. And so, you know, what's happening and what's happening in machine learning is that we don't know the algorithms. Okay? And because we don't know the algorithms, our natural way of overcoming this lack of algorithmic knowledge is to make the hardware compensate for what we don't know how to do in algorithms. And that's where we are right now. We're at the kind of the birth of a field, and fields are born with hardware acceleration. And my bet is that like all other fields, you know, whatever is necessary to make machine learning run fast will be incorporated over time into the CPU architecture.
And that's what we'll have going forward. It's just my bet, but that's how I view the world.
[00:07:38] Unknown:
Yeah. It's definitely interesting how much effort people will push into just brute forcing a problem rather than trying to figure out the nuance to it just because figuring out the nuance requires time. And after you figure it out, you can maybe move faster, but it slows you down at the outset, and people just aren't interested in taking that time to figure it out. That's correct. And, also, it's not just that they don't
[00:08:00] Unknown:
wanna spend the time. They really wanna rush out and provide products to people. That's what's happening. I mean, the actual commercial expansion of machine learning is much faster than our ability to actually devise the algorithms. And even though an enormous body of research is being created every day, I've been in other fields, in multicore and all kinds of other programming, and I've never seen this kind of explosion of research. Okay. So people are doing this, but it's not easy stuff. Right? And the commercial demand is so big. Right? That you have to solve the problem. That's why hardware is serving this purpose right now.
But it's coming at a price. Right? The price is enormous because what happens when you introduce hardware acceleration is you throw away the thing that is kind of the, you know, the lifeblood of modern software, that essentially it's containerized and movable from place to place. I mean, people wanna just write Lambdas and run them. They don't wanna know what it's running on. But when you're doing hardware acceleration, you know, there are about 75 companies offering different accelerators out there. So imagine you wrote your code for one accelerator. Now it doesn't move to the next accelerator. What do you do? Right? So there's a price paid for this, but it's a birth price. Right? This is where we are because we're just moving really fast.
[00:09:27] Unknown:
And I'm wondering what your thoughts are on the impact of machine learning and people's willingness to investigate some of these sparsification techniques
[00:09:38] Unknown:
and be able to run on CPUs given the supply chain issues that we've been seeing with chips and GPUs in the past year because of COVID and all of the turmoil that's come about around it? I think the availability of hardware is one issue. I think the other is the complexity of the software that you have to write on top of it. And I think that it's a combination of the two things. And the reason that people are willing to move to sparsity, and not just sparsity, there's also quantization. People are willing to do things to just manage to run, you know, more efficiently. I mean, there are two problems here. Okay. You know, a lot of the compute we wanna do on the edge and not in the cloud. To do that, we've gotta make the model smaller.
Right? And we've gotta make the compute that's available on these devices better, or more effective. Right? And so by sparsifying and quantizing, we are actually shrinking the footprint of the model. And at the same time, if you know how to write the right kind of algorithms, you can also deliver, you know, more effective compute through sparsity. And that's really what Neural Magic does. What we've kind of discovered is how to actually run, you know, sparse neural network computation on CPUs, on commodity CPUs. So you don't need an accelerator to do that.
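To make the quantization side of that concrete, here is a minimal sketch using PyTorch's built-in dynamic quantization (stock PyTorch, not Neural Magic's tooling) to shrink the linear layers of a toy model from 32-bit floats to 8-bit integers:

```python
# Minimal sketch: dynamic int8 quantization with stock PyTorch.
# The model and layer sizes here are toy placeholders.
import os
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Convert Linear weights to int8; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def size_on_disk(m, path="tmp_model.pt"):
    torch.save(m.state_dict(), path)
    size = os.path.getsize(path)
    os.remove(path)
    return size

print("fp32 bytes:", size_on_disk(model))      # ~4 bytes per weight
print("int8 bytes:", size_on_disk(quantized))  # ~1 byte per weight
```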
And the other thing that's kind of evolved, and that's Neural Magic, but also other people: you know, a couple of years ago, Song Han from MIT and others wrote papers about sparsification of neural networks. But it was very much the beginning of the research. And so what you saw was, you know, 50%, 75% sparse neural networks. You know, at 50%, you're just reducing compute by 2x. At 75%, you're reducing it by 4x. Right? But if you sparsify, say, an NLP model by 95%, that's 20x less compute. That's more than any GPU will give you. So the potential, right, of understanding how to sparsify goes beyond just the hardware for it, even if we build special hardware. Right? It's the very idea that I can reduce the flops.
And when you look at the evolution of models also, you know, when we started out, most of the models were computer vision models. And in these models, right, like YOLO, what you have is a very small footprint. The number of parameters is in the 60,000,000 range. Right? But for those 60,000,000 parameters, you're getting, you know, maybe more than half a teraflop of compute. Okay. The new models, right, the new models that we're looking at, a lot of them based on transformers, the ratio of the compute to memory is significantly lower. And especially in these and in recommendation systems, we've got huge models, but they don't have that much compute. Right. And so it's natural for us to kind of figure out, you know, okay, what are we going to do about this? Right? Well, we can run it on CPUs if we just kind of reduce the compute a little bit more and make the model smaller, and boom, we're in business. So that's kind of where we're going, I think.
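The arithmetic behind those compute numbers is worth making explicit; a quick back-of-the-envelope sketch:

```python
# Pruning a fraction s of the weights removes roughly the same fraction of
# the multiply-adds, so the theoretical compute reduction is 1 / (1 - s).
for sparsity in (0.50, 0.75, 0.90, 0.95):
    print(f"{sparsity:.0%} sparse -> {1 / (1 - sparsity):.0f}x less compute")
# 50% sparse -> 2x less compute
# 75% sparse -> 4x less compute
# 90% sparse -> 10x less compute
# 95% sparse -> 20x less compute
```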
[00:12:53] Unknown:
And another trend that I see pushing towards the reduction in terms of the size of the models is the sort of federated learning approach. We wanna be able to, as you said, push out this compute to the edge, and we want to also be able to preserve some measure of privacy by not pushing all of the data to a, you know, set of central compute. We wanna be able to do this learning at the edge and then just send back the, you know, the metrics and the parameters that we've derived from the data that lives on the end users' devices.
[00:13:22] Unknown:
Absolutely. I think the next big step, which a lot of people are working on, is actually to do both the inference and the training together. So these neural networks don't really require you to collect, you know, a whole bunch of data and then readjust everything, but rather are kind of continuously adapting. And if we're going towards a continuously adapting kind of neural network, we better make it run on CPUs where the applications run. So that's kind of my view. And, you know, there are also areas, right, like reinforcement learning. I mean, it's all about memory. CPUs are good enough in that area. Right? That's what people do in reinforcement learning. They don't necessarily need a GPU for it. And I think those techniques, which are currently much less mature than convolutional neural networks or transformers, are the future of what we'll be doing. Right? There's a lot of that coming down the road, and that is not about compute. It's about algorithms. All of these things are about algorithms.
[00:14:22] Unknown:
Digging deeper into the actual sort of mathematical or architectural elements of sparsifying a model, I'm wondering if you can just talk through what's actually involved in taking a dense neural network, you know, a CNN that you may have trained with, you know, PyTorch or TensorFlow, and then taking the tools that you're building with neural magic and turning it into a sparse neural network that can effectively be run on a CPU instead.
[00:14:48] Unknown:
So there are really two phases to it. But first, let me just describe maybe a little bit for people what a pipeline for machine learning typically looks like. Okay? So your average company doesn't build a new network from scratch. This is the job of the Googles and the Facebooks and OpenAIs and so on. They spend enormous amounts of compute to research and build a new model category. GPT, you know, YOLO; a lot of money is spent there. And then once we have a category, now we want to use this for different kinds of applications. Right? And so what happens now is, when a researcher or a developer gets a model, he wants to fit that model, which is a general model, to his dataset.
Right? That involves a process that we call transfer learning. Typically, it's either retraining or transfer learning. So I take my data and I either retrain the network a little bit with my data or I, you know, adjust the parameters, basically. Right? And that process is what happens in many, many companies. Okay? And what that opens an opportunity for is, essentially, at that point, when I'm doing the transfer learning, I can also use that data to sparsify and quantize my network. And that's what people do. So Neural Magic actually provides you with a recipe. The recipe is just a set of instructions, you know, on your CPU or GPU, where you take the network that you have, the data that you have, and you run a process. Essentially what it does is it reduces the number of parameters in your network. And there are various techniques for this. The easiest one is magnitude pruning, where I just throw away the things that have a smaller magnitude, you know, and I repeat that process.
I get a script for that. It does it. At the end, what you have is a new neural network that essentially looks exactly like the old one, but has fewer parameters. So we're not changing the structure of the network. We're just removing a lot of things that are zero from it. Okay. And surprisingly, or maybe not surprisingly, given the sparsity in our brains, this process actually can be done without losing a lot of accuracy. Okay. In that part, we're kind of mimicking what nature does, by the way, in your brain. When you were a child, you know, until age two, you're adding synapses, and then there's a massive phase of pruning. And that happens again in adolescence, we think. So the same thing we do to our neural networks happens in brains.
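As a rough illustration of magnitude pruning (plain PyTorch, not the Neural Magic recipe; real recipes prune gradually during retraining rather than in one shot):

```python
# One-shot magnitude pruning sketch: zero out the 90% of weights with the
# smallest absolute value in every Linear/Conv2d layer of a toy model.
import torch
import torch.nn as nn

def magnitude_prune_(model: nn.Module, sparsity: float = 0.9) -> None:
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            w = module.weight.data
            k = int(sparsity * w.numel())          # number of weights to drop
            threshold = w.abs().flatten().kthvalue(k).values
            w.mul_((w.abs() > threshold).float())  # keep only the large ones

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
magnitude_prune_(model, sparsity=0.9)

total = sum(p.numel() for p in model.parameters())
zeros = sum((p == 0).sum().item() for p in model.parameters())
print(f"{zeros / total:.1%} of the parameters are now exactly zero")
```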
And then once you have your sparse network, now you actually have to run it. If you run it on a GPU, and the sparsity is, you know, not structured, which is what we really want in order to get to 95% sparsity, then the problem is that it's not really suited to the accelerator, because the GPU has 5,000 cores and it's gotta saturate them. Right? Well, if you write the right kind of kernels on the CPU, you can actually, you know, let's say I sparsified a network 90%. So now I have 10x less compute. Right? I can restructure my computation on the CPU so that now I have 10x less compute.
In the same way that the GPU would have 10x more flops to run this on, the CPU with sparsification has 10x less flops to do, but you run into a problem then. The problem you run into is that you've taken a computation that was what we call compute bound. Right? There was a lot of computation for every data element, but now there's a lot less. Right? So CPUs are not great at moving data. GPUs move data in and out of memory a lot faster. Okay? So the next thing you need to do, and this is kind of the magic in Neural Magic, is you need to run this on a CPU even though it's now memory bound. And what Neural Magic really does is we've found a way of running the neural network where, instead of layer after layer, we actually break it down depth wise and we run it in stripes, okay, which we call tensor columns, that go the full depth of the network. Okay. And we run each one of these in cache.
So what we're doing essentially is reducing the compute through sparsity and then eliminating the memory bottleneck by running it all in cache. And that's the key. That's how you get, you know, essentially GPU speeds on CPUs.
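A rough way to see why a sparsified network becomes memory bound is to compare its arithmetic intensity (floating point operations per byte of data moved) with the machine balance of the CPU; the numbers below are illustrative assumptions, not measurements of any particular chip or model:

```python
# Roofline-style back-of-the-envelope; all numbers are assumed, round figures.
dense_intensity = 80.0                            # FLOPs per byte, dense conv layer
sparse_intensity = dense_intensity * (1 - 0.9)    # 90% of the FLOPs pruned away,
                                                  # but activations still move

# Assumed commodity CPU: ~1 TFLOP/s of peak compute over ~100 GB/s of DRAM.
machine_balance = 1e12 / 100e9                    # = 10 FLOPs per byte

for name, intensity in (("dense", dense_intensity),
                        ("90% sparse", sparse_intensity)):
    bound = "compute" if intensity > machine_balance else "memory"
    print(f"{name}: {intensity:.0f} FLOPs/byte -> {bound} bound from DRAM")

# Keeping the working set in cache raises the effective bandwidth by an order
# of magnitude or more, which is the point of executing depth-wise stripes
# ("tensor columns") that stay cache resident.
```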
[00:19:21] Unknown:
And then one of the things that you mentioned is that by sparsifying, you can get nearly the same accuracy. And I'm wondering what are some of the ways that you have compared the accuracy of the resulting models and some of the techniques that you've been able to bring in to prioritize the accuracy as you're pruning these networks to make sure that you don't end up with a sort of significant drop in performance for the prediction outcome?
[00:19:46] Unknown:
Yeah. So first of all, there are standard measures of accuracy. Right? But any real practitioner of machine learning will tell you that even though you have a network that on a standard dataset has given you a certain accuracy, right, when you take it and run it with your own data, most likely it's gonna look completely different. And so in the machine learning area, people are now becoming more and more aware of the fact that there is a discrepancy between what the accuracy is that you're stating and what the accuracy is that you're going to actually get on your dataset. Okay. By that, I mean the actual prediction.
Right. And so, you know, we're not in that business. We really try to compare to the actual accuracy on the test dataset. And it's a kind of standard thing. You run it, you do the pruning, and you try to bring the accuracy back through your process, just like in regular training where you train and bring it to a certain accuracy; the same kind of thing you do, but now you prune. So you prune and try again, and prune and try again, until you basically have brought it back; a typical place to be is 99% of the original accuracy.
Even though, I'll tell you, it's funny, but sometimes you get improved accuracy. People have observed this. You get improved accuracy through pruning. Okay. So that also happens. But in general, you know, you try to kind of sparsify and set your criteria based on the application; in different areas, facial recognition, say, it really depends on what the application is. Also in natural language processing, accuracy is a thing that people are aware of, but, for example, they're willing to quantize, to go to 8 bits and lose accuracy, but get the speed and get the smaller model. That's okay. You know? The accuracy that we expect on a server running on Amazon is very different from the accuracy we expect from the same network being made smaller and running on your cell phone. We cannot possibly hope that it'll be the same, but both of them have value.
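The "prune, retrain, prune again" loop he describes can be sketched with stock PyTorch pruning utilities; `train_one_epoch` and `evaluate` below are assumed, hypothetical helpers standing in for your own training and evaluation code:

```python
# Sketch of iterative magnitude pruning with accuracy recovery between steps.
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_with_recovery(model, optimizer, train_one_epoch, evaluate,
                        baseline_acc, step_fraction=0.5, steps=4,
                        recover_epochs=2):
    layers = [(m, "weight") for m in model.modules()
              if isinstance(m, (nn.Linear, nn.Conv2d))]
    for step in range(1, steps + 1):
        # Each call removes a fraction of the weights that are still nonzero,
        # so sparsity ratchets up gradually: 50%, 75%, 87.5%, 93.75%, ...
        prune.global_unstructured(layers, pruning_method=prune.L1Unstructured,
                                  amount=step_fraction)
        # Fine-tune to win back the accuracy lost at this step.
        for _ in range(recover_epochs):
            train_one_epoch(model, optimizer)
        acc = evaluate(model)
        zeros = sum((getattr(m, n) == 0).sum().item() for m, n in layers)
        total = sum(getattr(m, n).numel() for m, n in layers)
        print(f"step {step}: {zeros / total:.1%} sparse, "
              f"accuracy at {acc / baseline_acc:.1%} of the dense baseline")
    return model
```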
[00:22:05] Unknown:
And in terms of the actual components of the platform and the architecture that you're building at Neural Magic, I'm wondering if you can just give a bit of an overview of the different projects that you have been building out and how they all work together for people to be able to take these dense neural networks and prune them and then be able to deploy them on commodity CPU hardware.
[00:22:26] Unknown:
In a big kind of stroke, Neural Magic's got two types of software. We have an open source ML offering, okay, and a free engine. The engine is essentially the same as, you know, if you were running on a GPU, you would have these CUDA kernels. So we have an engine that runs on a CPU. Okay? And if you give the engine an ONNX-described neural network, a sparse one, it will just run it at a very high speed. To get the sparse network, we have an offering of essentially three kinds of libraries.
The first one is a zoo, which is just a collection of pre-trained sparse models; you can just pick your sparse model, transfer learn on it, and use it. The next one is, let's say that I don't find a model in the zoo, but I wanna take my own model, or I wanna take another model and do it. So for, you know, a bunch of standard models, we just have recipes. And in the zoo also, right, every model comes with a transfer learning recipe. And we also have tooling for taking a network that we've never seen before and sparsifying it. Right?
And then finally, right, you know, we have basically what you would call a GUI that makes this all simple: it allows you to tune a whole bunch of parameters and it spits out the code that you just put on your GPU and run; that's Sparsify. That's kind of the idea. And hopefully, you know, more and more people kinda see the benefit of the fact that, with a very limited amount of work, you suddenly can bring yourself to a place where your network is so much sparser, and therefore, you know, you can run it on a CPU; or even if you wanna run it in another application, it's just a much sparser network.
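For the engine piece, the documented quickstart pattern looks roughly like the sketch below; `model.onnx` is a placeholder path, and the exact helper names and signatures may differ between DeepSparse versions:

```python
# Rough sketch of running a sparse ONNX model with the DeepSparse engine
# (pip install deepsparse); API details may vary by version.
import numpy as np
from deepsparse import compile_model

batch_size = 1
engine = compile_model("model.onnx", batch_size=batch_size)

# The engine takes a list of numpy arrays, one per model input.
inputs = [np.random.rand(batch_size, 3, 224, 224).astype(np.float32)]
outputs = engine.run(inputs)
print([out.shape for out in outputs])
```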
[00:24:14] Unknown:
And for people who have an existing model that they've maybe built and trained using PyTorch or TensorFlow, can you just talk through the workflow of actually taking the libraries that you've built and pruning the models, and some of the considerations that they should be aware of as far as any input parameters to how the pruning is done, or just some of the things that they should be watching for and how to measure the comparisons for the before and after?
[00:24:41] Unknown:
Yeah. I think, you know, the classical thing to do is, first of all, download our recipe and try, you know, with your data and the model that you have. Let's say that you have, I'm just gonna throw it out, a YOLO model. Right? Whatever version of YOLO. So you take the YOLO model and you take the recipe for YOLO. And for YOLO, typically, let's say the Ultralytics pipeline, people just retrain. For object detection, people just do a retraining. So you have your standard retraining process with your data. Okay? And you just add the recipe for sparsification to it. Now what will happen typically is that you'll have to spend, you know, one or two rounds of actually training, and maybe the first time you do it, you'll see a drop in accuracy.
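What "adding the recipe for sparsification" to an existing retraining loop can look like, sketched against SparseML's documented PyTorch integration (module paths and helper names may differ between versions; the recipe file, model, and data loader are placeholders for your own):

```python
# Sketch: wrap an existing PyTorch fine-tuning loop with a sparsification recipe.
import torch
from sparseml.pytorch.optim import ScheduledModifierManager

model = ...          # your existing model, e.g. a YOLO variant
train_loader = ...   # your existing data loader
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# The recipe says when and how much to prune; the manager hooks it into the
# optimizer so the training loop itself stays unchanged.
manager = ScheduledModifierManager.from_yaml("recipe.yaml")
optimizer = manager.modify(model, optimizer, steps_per_epoch=len(train_loader))

num_epochs = 10      # in practice, driven by the recipe's schedule
for epoch in range(num_epochs):
    for images, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()

manager.finalize(model)   # remove the pruning hooks, leaving a sparse model
```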
Then you kind of have to do a retrain with that. And then maybe you try it again. And, you know, again, you get a drop, maybe smaller. So typically, you'd have to repeat this several times. Hopefully, you know, I don't know what an individual's dataset looks like, but what we've seen is that most datasets are so much less complex than COCO, the base one that people use, that it's kind of easy to prune based on that. And so you do the pruning, and now you have the network that you want, and now you can choose where you wanna run it. That's really the process. All the libraries are just available. You just download them. All the tooling is right there. And we've talked about it a little bit, but once you have this
[00:26:08] Unknown:
sparser, smaller model, you're able to run it on a CPU. I'm curious what you see as some of the sort of downstream and long term effects as far as how machine learning is applied and some of the use cases that it can potentially unlock by not having these heavy models that need to be deployed with dedicated hardware for them?
[00:26:30] Unknown:
Well, so Neural Magic is about software delivered AI. Basically, what we think is that we can push machine learning back to where all software is: from the place where I have to know what hardware I'm running on to the place where I'm just delivering it as, you know, just a container. You download it, install it, or you run it on the web, whatever you want. Right now, you know, we have customers who, together with the software that they download, have to tell their clients what the hardware is that they need to run it on. Okay? We wanna be able to, you know, go back to a world where you don't need to do that. Right?
Basically, you know, let's say you have a software package, some application that you've written. You wanna just attach the engine that runs the machine learning to your application, and whatever platform that application is running on, which of course is going to be some form of CPU, your machine learning will run on the same platform. Why can't we have that? Right? I mean, instead, I have to know what kind of accelerator I need to make this run fast enough. I don't want to worry about that. So that's the future. The future is software delivered ML, or software delivered AI, and that's where Neural Magic is going. That's what we believe can happen. And the thing is that we believe this can be done especially for the really big models.
Models are growing. I mean, you probably have heard the names GPT-2, GPT-3, right, Turing. There are so many models now that are in the billions of parameters. So what happens with a model, let's say a GPT-3 model? You can't run it even in inference on a single GPU because GPUs are memory limited. I mean, you know, a typical GPU will have 16 or 24 gigs of memory. A model like GPT-3 doesn't fit. So you have to connect 10 of them or 20 of them together to make the model run at all. Okay? But remember, when we say, you know, a model with 175,000,000,000 parameters, and we go, oh my god; that thing just fits on your desktop.
Okay? It's just big because the accelerator memories are small, not because this is actually that big. And so that's where the future is. I think the big unlock, okay, that neural magic and by the way, there might be other players that'll go in this direction. Right? The big unlock here is to bring, you know, machine learning back to where it's really software. And once we do that, okay, we can actually go to brain size neural networks, you know, a petabyte network. Imagine that. Right? That's where we're headed, I think. I hope.
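Rough footprint arithmetic makes the point (round numbers, ignoring activations and any index overhead for the sparse case):

```python
# Back-of-the-envelope model sizes versus a 16-24 GB accelerator card.
params = 175e9                                   # GPT-3 scale parameter count
for fmt, bytes_per_param in (("fp32", 4), ("fp16", 2), ("int8", 1)):
    print(f"{fmt}: ~{params * bytes_per_param / 1e9:,.0f} GB dense")

# 90% sparse int8 weights: only the nonzero values counted here.
print(f"90% sparse int8: ~{params * 0.1 / 1e9:,.0f} GB of nonzero values")
```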
[00:29:21] Unknown:
Yeah. That'll definitely be interesting to see what we can deliver in that regard and, you know, whether that allows us to make the leap to general AI versus the sort of specialized AI that we have right now.
[00:29:32] Unknown:
Yeah. Look, I mean, we should not underestimate it. The specialized AI that we have right now is really an enormous economic driver. I think it's going to make our lives better, all of us. Right? And it will be the driver for other things. Of course, you know, we would all like to mimic the behavior of mammals. But on the way there, we're gonna have an enormous benefit from just having, you know, even conversational AI. I sit, you know, every evening in front of my TV, and I have to click, click, click through the movie titles to just be able to run one. Why this pain? Why can't I talk to my television? That's not hard, and that'll change your life. Right? So before your TV actually analyzes you and just tells you about your mood, I want it to first just find the right movie, and that's within our reach. Right? So that's where I'm putting my efforts right now. Let's make that as easy as possible.
[00:30:42] Unknown:
And 1 of the other big challenges that people run into as they're trying to build and deploy machine learning and incorporate that into their overall product offerings is the availability of data because of the large volume that's necessary for being able to train neural networks, particularly with unsupervised learning approaches. And I'm wondering if this sparsification of the network and the reduction of parameters has any impact on the volume of data that's necessary to be able to create a trained model.
[00:31:14] Unknown:
That is a billion-dollar question. I wish I knew the answer to that. I can tell you about our brains. Okay? So, you know, the connectivity in neural tissue is a combination of genetics, right, and learning. Okay? And the genetics part is essentially the encoding of a lot of data over billions of years. Okay? So the first thing that your body is doing when creating a brain is actually taking the encoding of all the data that's been seen by all the creatures that came before it. Okay? Now that I have that, now I have to do what we would call transfer learning.
Right? Which is the part where I make it specialize for Nir Shavit. Right? Okay. And that takes Nir Shavit a lifetime. Right? But that's okay; relative to the billions of years, Nir Shavit's lifetime is not a lot of time. And I think this is what we're going to see. What we're going to see is that enormous amounts of data are going to be used to train neural networks. And right now the models that people are producing, okay, are dense models. With massive amounts of data, we're getting dense models. There's cutting edge research showing that I can prune in that initial phase. I don't need to deliver you a dense model. The first model I give you can be a sparse model already, you know, trained on the big body of data that is available to Google or Facebook and so on. And then you will do some more pruning maybe and some more transfer learning, and that will be localized with a small amount of data.
Okay. I think that's where we're going. I mean, I hope that's where we're going. The research is only in its initial phases. Right. But I think it's very, very promising, and I think that's where we're gonna go. And to your point of the, you know, billions of years of evolutionary
[00:33:10] Unknown:
pruning and selection of, you know, building our brains, you know, extrapolating from that, it seems that as we build sparser models and we deliver those, it seems that we should be able to, you know, do transfer learning for localized capabilities, but also be able to combine these different sparse models to get the best out of both of those and then sort of you have evolutionary capabilities of these model generations. I love that. Yes. Absolutely. Yes. That'll definitely be interesting to see. And so in terms of the actual business side of neural magic, you know, we were talking about the you have a few different open source libraries. The engine that you have for being able to do these, you know, horizontal tensors of compute on your sparse models is free to use. I'm curious what your overall business model is and your thoughts and approach to governance of the open source components and kind of where you're thinking of taking the commercialization aspects of what you've built.
[00:34:08] Unknown:
Yeah. So a few months ago, Brian Stevens joined Neural Magic as its CEO. To those who don't know, Brian was the CTO of Red Hat and the CTO of Google Cloud. So a person who knows the technology of open source. Right? And he's done this many, many times, and this is our plan. Essentially, you know, what we're gonna do is what a lot of companies have done. First of all, we're going after the developers. We want to provide developers with open source tooling to just, you know, sparsify neural networks and create effective neural networks. And by the way, our offerings are also gonna be good for GPUs. I mean, if there's sparsity on a GPU, go ahead. We want as many people to develop sparse models as possible.
Okay? And then the typical thing that has happened with a lot of these open source projects, right, like MongoDB and so on, like Red Hat itself, right, is that there is a lot of money to be made by offering, you know, commercial versions of your open source software. Right? And that's the beauty of the difference between a company starting out that just wants to download your open source and use it for a while, and a company that wants a lot more features and support and so on. And that's this model. I didn't invent it. Brian didn't invent it, but he sure did play a big part in making it what it is. Right? You can build a big company just on open source, and that's our idea. Basically, we're going to make it all available out there for people to use
[00:35:44] Unknown:
and whoever needs it in a version that has more features, more support will pay for it. And then going back to 1 point earlier and something that I noticed as I was preparing for the show and looking through the documentation is you called out specifically convolutional neural networks and the density there. And I know that another major pattern in machine learning is recurrent neural networks, and I'm curious what the applicability is of specification between convolutional versus recurrent networks.
[00:36:14] Unknown:
You know, like anything in machine learning, things are moving so fast. I don't know how much we're gonna see recurrent neural networks. I think maybe in reinforcement learning there's gonna be recurrence. But in general, I think that for now, transformers are it. Right? So just the fully connected components, with attention, without attention, with some convolutions, without convolutions. But as a core building block, I think we're actually going to go back to where we were in the beginning, and we're gonna see larger and larger networks that kind of have this very simplified, feed forward compute without recurrence. That is what I think is going to take over the market. Now, because I've been in science for close to 40 years now, I can't believe it. Yeah.
You know, 1 thing you cannot do is predict. You know, every time you think that you've seen everything that's been invented, the next thing comes along. And I am sure just like convolutions surprised us with their ability to do prediction on images and BERT surprised us with its capability of doing predictions on NLP. The next big thing is just down the road, just wait a little bit and it'll show up. So what we need to be able to do, and again, I know I'm pitching the thing that I you know, what we need to do is have flexibility, and that's called software. Okay? So rather than build new hardware, okay, I would say let's focus on making our software more agile. It's always proven to be a good idea.
And, you know, my colleague Charles Leiserson teaches kind of the introductory performance engineering class at MIT, actually, with Saman Amarasinghe. And one of the beautiful things they show is how you take matrix multiply in Python, okay, and basically, just by performance optimizing it, you can get 9 orders of magnitude reduction in running time. Okay. So from the initial matrix multiply that you wrote in Python, if you write it in parallel with assembly and C, you can get 9 orders of magnitude reduction in running time. All of it is performance engineering.
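Even the first couple of those orders of magnitude are easy to see for yourself; a small sketch comparing naive Python loops against a tuned BLAS call (the course example goes much further, to parallel, vectorized C):

```python
# Naive triple-loop matrix multiply versus NumPy's BLAS-backed matmul.
import time
import numpy as np

n = 256
a, b = np.random.rand(n, n), np.random.rand(n, n)

def naive_matmul(a, b):
    rows, inner, cols = a.shape[0], a.shape[1], b.shape[1]
    c = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for k in range(inner):
            aik = a[i, k]
            for j in range(cols):
                c[i][j] += aik * b[k, j]
    return c

t0 = time.perf_counter(); naive_matmul(a, b); t_naive = time.perf_counter() - t0
t0 = time.perf_counter(); a @ b;              t_blas = time.perf_counter() - t0
print(f"naive Python: {t_naive:.2f}s, NumPy/BLAS: {t_blas:.5f}s, "
      f"roughly {t_naive / t_blas:,.0f}x apart")
```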
So that's where I think the big wins for machine learning are gonna come. There's a lot of performance engineering to be done. Okay? It's basically algorithms. And if you've got the right algorithms, everything falls into place. And I'm amazed at the amount of ingenuity that people are showing, whether it's Google or Microsoft or Facebook or OpenAI, so much is moving so quickly on the algorithmic front. It's exciting to be in this field today. I mean, it's just so exciting.
[00:39:19] Unknown:
And in terms of the applications of the tools that you're building at Neural Magic, I'm wondering what are some of the most interesting or innovative or unexpected ways that you've seen it used.
[00:39:30] Unknown:
Let me name 1 area that I think is really fascinating. Okay? You know, what is modern programming like? And modern programming, even in Python. Okay. What you do is, okay, you have an idea of what you wanna do, and then you go and you basically look around on the Internet to see pieces that fit what you want. Right? People are using, right, neural networks to do code completion. Right? So you you basically have a neural network model that completes the code for you. Right? But the question is, what is it training on? Right? How does it learn? Well, it learns from you. Yeah. Okay. Sure. But it also learns, right, from the whole body of code that is out there, all the Python code that is available in the world.
Right? And so essentially, instead of you having to look, a neural network has already looked for you, and it's doing a completion based on what it saw might fit you. So this step that is so common and painful right now, where we are actually looking through libraries to find something that does what we need, or we look for a description that might match what we're doing, is going to be perhaps replaced by a neural network just guessing the completion for you. You just tell me, is that what you wanted or not? Okay? I think that it's a fantastic area, and there are a whole bunch of companies that are addressing this problem. It's really great. I mean, you can check out Tabnine. They're a company that does this kind of stuff. And,
[00:40:58] Unknown:
you know, so I think code completion is a fantastic example of what neural networks can do to make our lives better. Absolutely. You know, that's an interesting avenue for being able to also explore things like federated learning, because I want my, you know, completion engine to get better because of the things that I do, but I also don't wanna necessarily send all of my code to some central company. So being able to have the learning applied on my local machine and then just send, you know, the obfuscated and de-identified information back to this engine and make it better for everyone is a huge opportunity. Absolutely.
[00:41:46] Unknown:
I had all these ideas that I thought were the best ideas. They were just the greatest ideas. And my investors said to me, Nir, ideas are important, but execution is the key; you have to be able to execute on your ideas. And now, you know, two and a half, almost three years in from there, I agree. I agree. This is it. So the thing that I've learned is that I've got to control all the ideas that I have and just focus on actually delivering on ones that are doable. And I think it's a lesson every entrepreneur learns in one way or another. And when you're a professor and you do this, then your susceptibility to this mistake is bigger than anybody else's, because in your world, you don't actually need to build a real thing. All you need to build is a proof and write a paper, and let everybody else worry about taking it and doing something real. So this has been the challenge for me: to actually go all the way to the last mile of this thing rather than just, okay, here's my proof of concept.
CPUs can be as fast as GPUs, thank you very much, I wrote the paper, and I move on. This is not what it is. It's about just hunkering down and making it happen. It's a fantastic thing to do, but it's hard.
[00:43:10] Unknown:
And so for people who are excited about this idea of sparsification and all of the performance improvements that they're going to be able to get by pruning their models, what are the cases where Neural Magic or sparse networks might be the wrong choice?
[00:43:24] Unknown:
So the smaller the network, the less possibility you have to prune it. What is pruning about? Pruning is about the fact that the network is storing a lot more information than what is actually needed for the problem. Right. So what people have done is actually found ways to distill networks so that they already are very small. Right? And the smaller they get, the less you can actually prune them. So EfficientNets are networks that, you know, already have a very high reduction in the number of parameters. It's very hard to prune an EfficientNet to 95%, whereas a big NLP model is very easy to prune to 95%.
So it's not just one case; there are many, many cases of small networks that people use where, you know, you just won't win a lot
[00:44:16] Unknown:
by basically by by by pruning it. And as you continue to build and execute on neural magic, what do you have planned for the near to medium term in terms of business direction or features or new products that you're planning on launching?
[00:44:30] Unknown:
So Neural Magic till now has launched computer vision support. We're going to introduce NLP in the coming months, and then recommendation systems. So those are the two big things that are coming down the road. And, you know, the other big thing is that we wanna move the whole process of using Neural Magic. Right now, the retraining is still fitting into a pipeline that is GPU based that customers use. Right? And I wanna eliminate that piece also. That's Neural Magic's next big step. So we're gonna actually do that. We're gonna provide a kind of product that trains and deploys on CPUs.
[00:45:19] Unknown:
Are there any other aspects of the work that you're doing at Neural Magic, or the overall space of machine learning, pruning, and sparsification techniques, and the potential that it unlocks, that we didn't discuss yet that you'd like to cover before we close out the show? I think the thing that I would like to remind, you know, the listeners is that, you know,
[00:45:37] Unknown:
our brains, as I said, are much sparser than any neural network that we have, and they use much less compute than any neural network that we have. And so if we are actually going to mimic that (and mimicking, by the way, doesn't have to mean that my silicon looks like neural tissue in nature, but that the algorithms look the same), then we better be building machines that have a lot of memory and little compute, and not a lot of compute and little memory, because that's not where we wanna be. And by the way, there are companies like that, that are actually trying to build, for example, memory controllers that do the machine learning piece, so the compute is really close to the memory. That, I think, is a beautiful direction that we want to go forward in: essentially bring the computation closer to the memory, to very big memory, to very large memory.
And whatever we do in that direction is gonna be very beneficial going forward.
[00:46:43] Unknown:
Alright. Well, for anybody who wants to follow along with the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose the TV show The Tick on Amazon. I started watching that recently, and it is hilarious. It is a great farce of all of the sort of superhero tropes and just very irreverent and sort of tongue in cheek. So I've been having a lot of fun watching that. Definitely recommend that if you're looking for something to keep you entertained for a bit. And so with that, I'll pass to you, Nir. Do you have any picks this week? I'm watching a TV show
[00:47:17] Unknown:
that was about the early days of Bauhaus. And it's kind of very romanticized,
[00:47:23] Unknown:
but I thought it was a lot of fun, though. Well, thank you very much for taking the time today to share the work that you're doing at Neural Magic and the potential for sparsification of neural networks. It's definitely a very interesting area of research and an important one, and I definitely look forward to some of the long term impacts that it will have on the overall machine learning ecosystem. So I appreciate all of the time and effort that you're putting into that, and I hope you enjoy the rest of your day. Thank
[00:47:49] Unknown:
you.
[00:47:51] Unknown:
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Nir Shavit
The Genesis of Neural Magic
Challenges in Machine Learning Hardware
Sparsifying Neural Networks
Neural Magic's Tools and Workflow
Future of Machine Learning and AI
Trends in Neural Network Architectures
Code Completion and Federated Learning
Final Thoughts and Future Directions