Summary
Deep learning is gaining immense popularity due to the incredible results that it is able to offer with comparatively little effort. Because of this, a number of engineers are trying their hand at building machine learning models with the wealth of frameworks that are available. Andrew Ferlitsch wrote a book to capture the useful patterns and best practices for building models with deep learning to make it more approachable for newcomers to the field. In this episode he shares his deep expertise and extensive experience in building and teaching machine learning across many companies and industries. This is an entertaining and educational conversation about how to build maintainable models across a variety of applications.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to pythonpodcast.com/census today to get a free 14-day trial.
- Scaling your data infrastructure is hard. Maintaining data quality standards as you scale is harder. Databand solves this. Their Unified Data Observability platform gives data engineers visibility over their stack without changing existing pipeline code. Get end-to-end visibility on your pipelines, and identify the root cause of issues before bad data is delivered. Seamlessly integrate with over 20 tools like Apache Airflow, Spark, Snowflake, and more. Use customizable dashboards to see where pipelines are broken and how that impacts delivery downstream. Get alerts on leading indicators of pipeline failure. Open up your pipeline and see exactly which code strings are broken – so you can fix the issue immediately. Create more reliable data products. Go to pythonpodcast.com/databand today to start your free trial!
- Your host as usual is Tobias Macey and today I’m interviewing Andrew Ferlitsch about the patterns and practices for deep learning applications
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing the major elements of a model architecture?
- What is the relationship between the specific learning task being addressed and the architecture of the learning network?
- In your experience, what is the level of awareness of a typical ML engineer or data scientist with respect to the most current design patterns in deep learning?
- You're currently working on a book about deep learning patterns and practices. What was your motivation for starting that project?
- What are your goals for the book?
- How have advancements in the operability of machine learning influenced the ways that the models are designed and trained?
- How do recent approaches such as transfer learning impact the needs of the supporting tools and infrastructure?
- Can you describe the different design patterns that you cover in your book and the selection process for when and how to apply them?
- What are the aspects of bringing deep learning to production that continue to be a challenge?
- What are some of the emerging practices that you are optimistic about?
- What are some of the industry trends or areas of current research that you are most excited about?
- What are the most interesting, innovative, or unexpected patterns that you have encountered?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on the book?
- What are some of the other resources that you recommend for listeners to learn more about how to build production ready models?
Keep In Touch
- @AndrewFerlitsch on Twitter
- andrewferlitsch on GitHub
Picks
- Tobias
- Designing Data Intensive Applications (affiliate link)
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Google Cloud AI
- Sharp Corporation
- Deep Learning Patterns and Practices (affiliate link) use the code podinit21 at checkout for 35% off all books at Manning!
- CID Bioscience
- Latent Space
- AI Winter
- Numerical Stability
- Surrogate Model
- GAN == Generative Adversarial Network
- Gradient Descent
- The Gang of Four – Design Patterns: Elements of Reusable Object-Oriented Software (affiliate link)
- The Lottery Ticket Hypothesis
- Manning Publications (affiliate link)
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Scaling your data infrastructure is hard. Maintaining data quality standards as you scale is harder. Databand solves this. Their unified data observability platform gives data engineers visibility over their stack without changing existing pipeline code. Get end-to-end visibility into your pipelines and identify the root cause of issues before bad data is delivered. Seamlessly integrate with over 20 tools like Apache Airflow, Spark, Snowflake, and more.
Use customizable dashboards to see where pipelines are broken and how that impacts delivery downstream. Get alerts on leading indicators of pipeline failure. Open up your pipeline and see exactly which code strings are broken so you can fix the issue immediately. Create more reliable data products. Go to pythonpodcast.com/databand today to start your free trial. Your host as usual is Tobias Macey. And today, I'm interviewing Andrew Ferlitsch about the patterns and practices for deep learning applications and the book that he's writing to help people learn about it. So, Andrew, can you start by introducing yourself?
[00:01:58] Unknown:
Thank you for inviting me to your podcast. I'm Andrew Ferlitsch. I work for Google Cloud AI. I'm in developer relations. Prior to joining Google, I was actually in Japanese IT. I worked for Sharp Corporation in Japan for over 20 years as a principal research scientist, and my specialty was in imaging. You could say I was an imaging scientist; I did a lot of computer vision. And, you know, what's really interesting I found about deep learning is everything we do today in deep learning with computer vision, we used to also do before deep learning and machine learning. But it took, like, a massive number of people with PhDs and millions and millions of dollars. And what I find just so fascinating about machine learning and deep learning is how that has really scaled way down the cost and the speed, and has really brought it into the realm where, you know, your software engineer can now build and deploy applications that used to take this massive amount of cost and resources.
[00:03:06] Unknown:
And do you remember how you first got introduced to Python?
[00:03:10] Unknown:
It was probably more accidental. You know, the first time I used it was 2016. And I know before that, I thought it was just some kind of toy programming language I didn't take seriously. You know? I was doing this consulting job with an agricultural technology company, CID Bioscience. They were developing an autonomous farm vehicle that would use computer vision to count crops and, you know, detect infestations in the crops. Okay? And so all the experimental code that they were putting together was in some Python. Mhmm. Of course, I had to jump in and participate, so I had to learn Python.
[00:03:52] Unknown:
And so now in terms of the work that you're doing with the deep learning patterns and practices book and your work at Google and your previous work with Sharp, you know, definitely gives you a good background in terms of the machine learning and deep learning space. And as far as the actual, you know, deep learning aspects before we get too far into, you know, some of the patterns and practices, I'm wondering if you can just give a bit of an overview about what are the major elements of a model architecture that people need to be aware of as they're starting down the path of building something with deep learning and thinking about how they want to approach the problem space and approach the actual building of the project?
[00:04:31] Unknown:
Well, what we find in models today, instead of being a sort of wild-west graph of nodes, is that there are now, you know, formal sections to models. Okay? Call them components, and sort of a formalization, or designs, on how to put those components together and how to connect them. And really, at the top level, whether you're doing computer vision, natural language understanding, or structured data, you're gonna have what's called a stem, then a learner. And then the last part is the task component, and that's the task you actually wanna learn. And the learner itself gets broken down into groups. Those groups get broken down into blocks.
And how they're assembled will really say whether you're doing, say, representational learning or transformational learning, etcetera. I hope that helped address that question.
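To make the stem/learner/task decomposition described above concrete, here is a minimal sketch using plain NumPy stand-ins rather than a real deep learning framework. The layer sizes, the ReLU nonlinearity, and the function names are illustrative assumptions, not anything specified in the conversation.

```python
import numpy as np

rng = np.random.default_rng(0)

def stem(x):
    # Stem: coarse feature extraction from the raw input (here, a linear map + ReLU).
    return np.maximum(x @ rng.standard_normal((8, 16)), 0)

def learner(h):
    # Learner: groups of blocks that refine the representation.
    return np.maximum(h @ rng.standard_normal((16, 4)), 0)

def task(z):
    # Task component: the head that produces the final prediction.
    return z @ rng.standard_normal((4, 1))

x = rng.standard_normal((2, 8))   # a batch of 2 inputs with 8 features each
y = task(learner(stem(x)))
print(y.shape)                    # (2, 1)
```

The point of the sketch is only the composition: data flows stem, then learner, then task, and each stage can be designed and swapped independently.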
[00:05:25] Unknown:
Yeah. That definitely is useful because I know that, you know, for people who are new to this field, you hear the terms, you know, neural networks or deep learning architectures, and it's just this, you know, black box, and you don't really understand what are the different pieces to it. But being able to break it down into these are the different subcomponents of the network helps to at least make it an approachable problem of, okay. Well, now I know I need these different components to be able to put it together so that you can then direct your research to figuring out how to approach that problem more specifically.
And in terms of the specifics of a given task, you mentioned that a lot of the detail for, you know, for given projects that you're trying to work on is in the learner or the task layer of the network. And I'm wondering if you could just dig a bit more into the relationship between the specifics of the task that you're trying to build for, whether it's natural language understanding or computer vision or image recognition and the specific architecture and the components that you're pulling off of the shelf for being able to build that full network?
[00:06:29] Unknown:
Well, one, we could probably spend a whole hour just talking on that, so I'll just kinda briefly summarize. You know, over the last few years in the research, we came up with this phrase you see used a lot, called essential features. Okay? So if you look at a model and you start at the top of the model and you sort of work deeper and deeper into the model, you get near the end, before you do the task learning. Okay? You have what's called a latent space. It's a representation of the input in a low dimensionality. And what we really want in there are what's called the essential features.
Okay? And if we can keep that constraint, we'll prevent things like memorization, memorizing examples, and allow the model to better generalize to examples it's never seen, particularly if those examples are outside the distribution that the model was trained on. So a lot of effort in the last few years has gone into all the different ways to represent that latent space and how to train the model so that the latent space has just the essential features. But once you've got that, the task learning is simple. And it's actually plug and play. I can take that same bottom part of the model and put a regressor on it, you know, to, say, predict the selling price of a house.
I could take that off and put on a classifier and predict who the target demographic is that would be most interested in buying the house. So I don't even have to have two separate models. It really just comes down to the latent space, how it's trained, and what I can accomplish on the other end.
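The "plug and play" idea above can be sketched as a shared bottom of the model producing a latent representation, with interchangeable heads (a regressor or a classifier) attached to it. The dimensions, weights, and data below are made-up assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
W_backbone = rng.standard_normal((10, 4))   # shared latent-space encoder
W_reg = rng.standard_normal((4, 1))         # regression head (e.g. house price)
W_cls = rng.standard_normal((4, 3))         # classification head (e.g. 3 demographics)

def latent(x):
    # One shared bottom: every task reuses this latent representation.
    return np.maximum(x @ W_backbone, 0)

def predict_price(x):
    return latent(x) @ W_reg                # swap in the regressor head

def predict_demographic(x):
    logits = latent(x) @ W_cls              # swap in the classifier head
    return logits.argmax(axis=1)

x = rng.standard_normal((5, 10))
print(predict_price(x).shape)        # (5, 1)
print(predict_demographic(x).shape)  # (5,)
```

Because `latent` is shared, only the small head changes between tasks, which is the sense in which one trained bottom serves multiple models.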
[00:08:13] Unknown:
That puts me in mind of the, you know, recent developments in transfer learning. Is that pretty much the exact same thing of what you're talking about where you're using the regressor versus the classifier?
[00:08:24] Unknown:
Before I even jump into it, there's a little humorous part here. I don't mind dating myself, but, you know, I got my degree in artificial intelligence, a graduate degree, in the late 1980s. And when I got out of college, you know, it was the AI winter. So, you know, nothing happened, and I went in a different direction. And one of my specialty things was a really obscure statistical area called distributions. And it turns out that deep learning is all about distributions, whether it's distributions in the data or distributions in the weights inside of the model. So suddenly, here, 30-some-odd years later, what I actually learned in college became important.
Okay. So transfer learning really goes in several different directions. But essentially, you're transferring one of two things. Either you're trying to transfer the essential features in the latent space, saying those essential features are similar enough for another domain. Okay? For example, let's say you have a model that's learned to tell the makes and models of cars. Okay? It's probable that the essential features in that latent space are pretty much almost identical to what will be needed for trucks. Okay? So I'm just gonna transfer all of that up to the latent space, take my truck dataset, and then just fine-tune it. In the other case, the transfer may not be related to the latent space, particularly if the datasets are sort of far apart in domain.
And the other important thing we find in a model is the distribution of the weights in the model. I'll try to keep it simple. You know, when you first train, you have all those nodes, and each one simply has a weight on it. Okay? You have to start with some initial value. But if you started with the same value, let's say everyone was a one or a zero, their updates would be identical and effectively symmetric. And this is the same as having a model with just one node. So you have to start off with this sort of random, you know, distribution of weights when you train it.
And the thing is, not all random distributions are good. Okay? Some are better than others. And the idea is, how do I find a good distribution that gives me what's called numerical stability in my model? So that when I train it, I get convergence on that global or best optimal outcome I can have for that model. And that can take a lot of, you know, training and experimenting. But if you have an existing architecture that has highly numerically stabilized weights from previous training, sometimes you just say, that's my initial weight distribution, instead of doing a random draw. So that's a different form of transfer
[00:11:28] Unknown:
learning. In terms of the numerical stability of a model and trying to optimize that, that sounds like the hyperparameter search and hyperparameter optimization problem that I've heard a lot about. Actually, that comes afterwards. Okay. One of the mistakes that people were making in hyperparameter
[00:11:44] Unknown:
search is you take a model, right, and you go straight to this generic tuning. The problem is, why would you do your hyperparameter tuning on weights that are not even stable yet? So first, you're gonna do all these pretext tasks to make sure you have stabilized weights, numerically stabilized weights, and then you do your hyperparameter tuning.
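The symmetry problem described above (identical initial weights making a layer behave like a single node) can be demonstrated in a few lines of NumPy. The toy dimensions, the constant 0.5, and the 0.1 scale on the random draw are assumptions chosen only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal((6, 3))             # a batch of 6 inputs, 3 features

W_same = np.ones((3, 4)) * 0.5              # every weight identical (symmetric)
W_rand = rng.standard_normal((3, 4)) * 0.1  # a random draw breaks the symmetry

h_same = np.maximum(x @ W_same, 0)          # ReLU activations
h_rand = np.maximum(x @ W_rand, 0)

# With identical weights, all 4 hidden units compute the same activation,
# so the layer carries no more information than a single node:
print(np.allclose(h_same[:, 0:1], h_same))  # True
# With random weights, the units differ:
print(np.allclose(h_rand[:, 0:1], h_rand))  # False
```

In real training the same symmetry persists through gradient updates, which is why initialization schemes always draw from some random distribution, and why a well-chosen (or pretrained) distribution matters for stability.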
[00:12:13] Unknown:
As far as your experience of working in the space and working with other people on building these deep learning models and applying them to different problem spaces. I'm wondering what you have seen as the sort of overall level of awareness and understanding of some of the different considerations and challenges in the overall life cycle of building and developing and deploying a model among sort of the general community of machine learning engineers or data scientists and sort of the emerging best practices in the deep learning space?
[00:12:48] Unknown:
Yeah. I guess I could cover another hour, but I'll keep it brief. You know, there's always challenges. You know, one area that never ceases to amaze me is on the data side. The more and more data we have, it continues to be a challenge. Particularly now that, you know, we're on such a large scale, we have to find solutions using, at least to some degree, unsupervised training, so that the data doesn't have to be labeled. What are the right ways of doing it? How will they fit into a model that is then later fully trained on labeled data? And then we get to the other end. You know, I gotta deploy these models, and I gotta deploy them, you know, at a large scale.
When you talk about the real world, you know, today, models go into production. It's a business. It's an enterprise business. You know, if it's a social media site, you may have, you know, tens of millions of people a day using the site. Right? So we're on a very, very large scale. And no matter how you train the model, or what data you've got, it's pretty much guaranteed that the distribution of your training data won't be exactly the same as the distribution that the model sees in the real world. And so from a statistical point of view, what it sees in the real world is a population. Okay? What you're training on is some subpopulation of that population. And on top of that, you don't have every example in that subpopulation.
So how can I train it in a way that, given this subpopulation I'm training on, it will still generalize across that differential between the subpopulation and the population?
[00:14:43] Unknown:
And that gets into the problem of trying to counteract any potential bias in that, you know, randomized subsample of the population?
[00:14:51] Unknown:
Yeah. You know, a few years ago, we saw all kinds of techniques tried that would fail at detecting it. First, I'll give you a classic example of how it happens. When they started to first make X-ray classification models, they inadvertently were biased on what's called the view. Okay? Here is the scenario. You know, most of this data was coming from big hospital groups. And big hospital groups are very cost- and billing-conscious. Okay? So at any one of these big hospitals, you're gonna have more than one X-ray department. You're gonna have one department that's low cost, with, you know, more basic X-ray equipment. And you have another department that has, say, very costly X-ray equipment; that's high cost. Now, you're a doctor, and a patient comes in. Right?
And as a doctor, you try to make a prediction of the likelihood that this person has pneumonia. If you think it's low likelihood, you would send them to the low-cost department. And if you thought that there was high likelihood, you'd send them to the high-cost one. Okay? The problem is, these departments use different, you know, models of the X-ray machines. So when you get your data, it turns out that almost every instance from the low-cost one is a negative, not pneumonia. And almost every instance from the high-cost one is pneumonia. So it's totally skewed. And of course, how they take the images is not identical.
That sort of framing, that's your view. And what happened in the early days is the models were learning the view, not the X-ray. Okay? That's known as an unseen covariate. Okay? You know, we have all kinds of ways of trying to deal with it. First, you have to detect the existence of it. So we might use a surrogate model. Okay? So say we have a suspicion that, let's say, feature X is a bias or an unseen covariate. Okay? So you have your regular model, you're training, and you're predicting. And what I might do is, it's not that I remove the feature, it's that, at the prediction level, I will intentionally invert that feature.
And what I wanna do is look at the outcome of the prediction. If it's not a bias, then the change in the outcome should be random. That is, it should not have any effect. But if, on the other hand, it's not random, then I know, oh, it's injecting some bias. So now I gotta look at a wide variety of techniques to remove the bias. A big one I see being used is GANs, for example, and other means of generating synthetic data. You know, in previous years, people tried to use boosting. The problem with boosting is you're adding so many identical examples that the model tends to be overparameterized, and you're really increasing the likelihood of memorization in the model.
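A toy version of the probing technique just described: invert a suspected feature at prediction time and see whether the outputs shift systematically. The biased model, the fair model, and the data below are all fabricated for illustration; column 0 is a stand-in for the "view"/machine-type covariate.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((200, 5))
X[:, 0] = np.where(X[:, 0] > 0, 1.0, 0.0)   # binary suspect feature ("view")

def biased_model(X):
    # A model whose output leans heavily on the suspect feature.
    return 3.0 * X[:, 0] + 0.1 * X[:, 1]

def fair_model(X):
    # A model that ignores the suspect feature entirely.
    return 0.1 * X[:, 1]

X_flipped = X.copy()
X_flipped[:, 0] = 1.0 - X_flipped[:, 0]     # invert the suspect feature

shift = np.abs(biased_model(X) - biased_model(X_flipped)).mean()
shift_fair = np.abs(fair_model(X) - fair_model(X_flipped)).max()

print(shift > 1.0)        # True: predictions move a lot, the feature injects bias
print(shift_fair == 0.0)  # True: flipping the feature has no effect on a fair model
```

In practice the "model" would be the trained network (or a surrogate trained to mimic it), and the shift would be tested statistically rather than eyeballed, but the detection logic is the same.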
[00:18:00] Unknown:
That brings us to the work that you're doing right now of the book that you're writing to try and encapsulate some of the lessons that you've learned and the knowledge that you're sharing right now of some of the useful patterns and practices for how to go about working with deep learning and building these models. And I'm wondering if you can just give a bit of an overview about what your goals are for the book and how it is that you decided to go down the path of writing the book and starting the project.
[00:18:27] Unknown:
Kind of an interesting road. I would say I started to teach what I'd call more data science, you know, so more classical machine learning, probably back in 2016. And most of the people I was teaching, you know, came from the math and statistical world. And then, I'd say about mid-to-late 2017, you know, deep learning was a big hype. Okay? And I really saw a demographic change in the students. A lot of them were software engineers. They weren't statisticians and math people: "We don't do stats. We program." And so, you know, you're trying to teach them this stuff, and it just goes over their heads. And so, somewhere in this process of stumbling through how to teach to that target audience, you know, I realized that I had to reframe everything in deep learning in the context of a software engineer. What are their terminologies?
What are their methodologies? What are their best practices? And how can I fit deep learning into that? So I did that for a number of years, and my courses got really popular. Today, I teach vast numbers of software engineers, not data scientists, who do machine learning. And this book, you could say, is sort of my compilation of all those experiences of how to map what was once a very statistical, PhD type of world
[00:19:59] Unknown:
into your rank and file software engineering world. And in your experience of making that translation from your experience with your mathematical and statistical background into the parlance of software engineers and the problems and, you know, what their priorities are, I'm curious what were some of the more interesting ways that you had to reframe things or some of the interesting sort of misconceptions or biases that the software engineers had coming into this space of, you know, a very mathematically oriented field.
[00:20:30] Unknown:
Well, they all thought you had to know what gradient descent is, and I never teach gradient descent. I'll start with that. You know, early in my career, I was a software engineer in the 1990s, and there was a sort of groundbreaking book, Design Patterns, in C++, by the Gang of Four. It's really what started the whole concept of design patterns in software engineering. And so that concept was already, you know, programmed into me. And I did early on come to the realization that, to teach this to software engineers,
you're gonna have to define the design patterns. Now, that doesn't mean they weren't there. The problem was, it was all sort of scattered across seminal research papers. Okay? And each researcher invented their own words. Many times, two different papers could be talking about practically the exact same thing, and their terminology is totally different. And their diagrams, how they draw them, are totally different. So a lot of it was first identifying where all those bits and pieces are in the research papers, finding what's the most common terminology, you know, then solidifying it so you're always talking the same way even if the paper says it differently.
And then really coming down to mapping all those architectures and DAGs into a more common set of diagrams that could be uniformly applied to any of these models in research papers. And you see that in the book. I use a sort of overall, you know, what you'd say DAG framework description that we call idiomatic. And so one of the nice things about the book is, no matter what model architecture you're looking at or what DAG deployment process, it's in a consistent style of diagram. You don't have to learn one style versus another.
[00:22:24] Unknown:
And as far as the sort of identifying those common patterns and best practices across the different research and throughout your own experience of working in the space, I'm curious how broadly those lessons have been learned in the industry or if it's something that you still see as being very nascent and something that everybody has to kind of
[00:22:47] Unknown:
discover and learn about on their own. I would say in the last couple of years, at least in the research field, there's been a lot of consolidation of terminology and representation. It's not the wild west anymore.
[00:23:01] Unknown:
And then as far as the actual tooling and practices and infrastructure around deep learning and machine learning and bringing it into production contexts. I'm curious what you see as some of the most useful advancements and some of the areas that are still under supported or underrepresented as far as the overall life cycle of going from idea to delivery.
[00:23:25] Unknown:
I think a big contribution is really the growth of how to augment the training using unlabeled data or noisy data that's crowdsourced. There are a variety of what we call pretext tasks. You're not learning a task like a classification or predicting the value of the house. You're learning a transformation. So I have an input x, and I define some transformation function, f of x. Okay? And, you know, that's a static function, so I can actually put the data into it, get what the transformation was, and use the transformation as the label. Okay? And so I'm trying to force the model to learn essential features in that latent space before I actually do the training on labeled data, which is expensive.
Okay? And so, you know, there is the opportunity. I've seen a lot of growth there, that we can start creating models to solve more and more problems where it would be cost-challenging to get the data. Another area where I see improvements on data is, again, data costs money. So, you know, let's say you go into fintech. Okay? Maybe you're trying to do a model for fraud detection. Okay? Well, you might actually have not 10 or 20 fields in your structured data. You might have hundreds of fields. Okay? You don't know which ones are the good ones. Right? But every one of them has a cost, a data acquisition cost.
Okay? And so before, you would try to figure out which ones actually contribute to the outcome, and how well, by doing PCA analysis. But, you know, on that scale, that's highly expensive. So, you know, with the growth of explainability, being able to instrument your model and directly attribute a prediction back to the feature list has really shown great promise to substantially reduce the cost of identifying how my data contributes to the outcome. And that lets, you know, a business manager make cost-effective decisions on data acquisition costs.
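The pretext-task idea above (apply a known transformation f(x) to unlabeled data and use the transformation itself as a free label) can be sketched in a few lines. Rotation by multiples of 90 degrees is one common choice in the self-supervised literature; the tiny 8x8 "images" and function names here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
images = rng.standard_normal((4, 8, 8))       # unlabeled 8x8 "images"

def make_pretext_pairs(images):
    # For each unlabeled image, pick a random rotation k in {0, 1, 2, 3}
    # (i.e. 0/90/180/270 degrees) and use k as the training label, for free.
    pairs = []
    for img in images:
        k = int(rng.integers(0, 4))
        pairs.append((np.rot90(img, k), k))
    return pairs

pairs = make_pretext_pairs(images)
print(len(pairs))                          # 4
print(all(0 <= k < 4 for _, k in pairs))   # True
```

A model trained to predict k from the rotated image is forced to learn structural (essential) features of the input, after which the cheap pretext head is discarded and the expensive labeled training starts from that latent space.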
[00:25:42] Unknown:
And in terms of the explainability advancements, I'm curious what your thoughts are on how that's going to influence the overall sort of state of the art of being able to build these models and identify useful data at the outset to be able to build more effective models downstream. And if there are any sort of potential optimizations or performance improvements in terms of the actual time to build and deliver these models because of the fact that there is more explainability that then contributes to a better understanding of how the model is actually coming to these different conclusions.
[00:26:16] Unknown:
Just the whole idea of explainability implies that there will be development and production improvements. I mean, let's say we have no explainability in the model. It's just a black box. Right? We throw it out there, and we have some outside observation. Okay, is that quite what we want? And so we make a guess of what's happening inside the black box, just one notch above random, right? It's an educated guess, you know. And then we throw another one out and we make an observation, we see how far apart those observations are, and we make another educated guess. Compare that to: it's not a black box anymore.
I can look inside and understand why it's doing that. And now I can do something better than an educated guess. So the number of iterations, just the nature of it, you know, would dramatically reduce.
[00:27:07] Unknown:
We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV file via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud data warehouse to SaaS applications like Salesforce, Marketo, HubSpot, and many more. Go to pythonpodcast.com/census today to get a free 14-day trial and make your life a lot easier.
And then another interesting challenge that I've come across myself is just understanding the cases where machine learning or deep learning is even applicable to a particular problem, or, you know, the types of solutions that you can build with machine learning and deep learning, and being able to identify those opportunities, particularly if you have some sort of data source and you don't necessarily know what you can build on top of it. And I'm wondering what your experience has shown as far as the level of imagination necessary, or the ability to identify useful opportunities for applying machine learning in a business or, you know, educational context.
[00:28:29] Unknown:
Well, I know that when I'm in meetings, or when our clients, you know, business people, are in there, they're throwing everything at the whiteboard. They wanna apply deep learning and AI to every aspect they can. They see cost savings and, you know, production increases and so forth. And, of course, that's the challenge. And, you know, to me, and this is my opinionated view, the new frontier is model amalgamation. Okay? To explain what that is, just go back a little bit in time: we deployed single models that did a single task.
Okay? But that by itself is not an application. And so you would still have that big application on a server, like written in Java. And here and there it's making a call out to a model, like the model is assisting this much bigger application. In model amalgamation, you take essentially every aspect of that application and convert it into a model paradigm, and then connect all those models together in their own graph, and the models become the whole application. And so we're definitely seeing attempts in large companies at how to rearchitect and redesign these applications so they are entirely composed of models in this amalgamation.
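To make the model amalgamation idea concrete, here is a minimal Python sketch of what an application might look like when it is nothing but a graph of model-like components. Every function name here is a hypothetical stand-in for a trained model or a graph node, not code from the book.

```python
# Minimal sketch of "model amalgamation": instead of one big application
# that occasionally calls out to a model, every stage is a model-like
# callable, and the application is just the graph connecting them.
# All component names are hypothetical illustrations.

def detect_objects(image):
    # stand-in for an object-detection model
    return [{"label": "person", "box": (0, 0, 10, 10)}]

def crop(image, box):
    # stand-in for a preprocessing step, itself treated as a graph node
    return image  # simplified: a real node would crop to the box

def classify(patch):
    # stand-in for a downstream classifier model
    return "customer"

def amalgamated_app(image):
    # the "application" is nothing but the wiring between models
    results = []
    for det in detect_objects(image):
        patch = crop(image, det["box"])
        results.append((det["label"], classify(patch)))
    return results

print(amalgamated_app("raw-image-bytes"))  # [('person', 'customer')]
```

The point of the sketch is the shape, not the stubs: each stage has the same "tensor in, tensor out" contract, so the whole pipeline can be re-architected, retrained, or swapped piecewise.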
[00:29:48] Unknown:
Digging more into some of the design patterns and best practices that you're covering in the book, I'm wondering if you can give a bit of an overview of what types of problem domains they're applicable to, some of the decision factors or signals that go into identifying when to use which patterns, or, you know, how you decided to break them down and which patterns to include for their broadest applicability for people who are trying to get into the space? Much of the book actually covers the seminal
[00:30:18] Unknown:
models. You know, the vast majority of those models were, you know, designed, you could say, before there was any kind of design pattern. So not only do I explain what's happening inside the model, I show how to reverse map that existing seminal model into a modern design pattern. Okay? And then, you know, show the pros and cons and how it improved the science over the last one. It's hard to really answer your question, because there's all kinds of model architectures out there. Okay? They're continuously improving.
Okay? My book mainly focuses on computer vision. Okay? And one could make the argument that if you're doing a regressor or a classification, just use an EfficientNet. Sometimes it comes down to that. But the reality is we're doing a lot of other things, like object tracking in video, pose detection, getting landmark points on the body, maybe doing instance segmentation or just semantic segmentation. And for these, different types of model architectures work better because of what ends up in the latent space. And it generally involves the concept of feature reuse, or if you're talking about the natural language world, what to pay attention to. So, you know, the transformer has this concept of an attention head that tells later parts of the model that this part of the text is more important than the rest. Same thing in imaging.
Okay. I always find it kind of humorous. You know, the seminal paper on the transformer, which some of you might know through models like BERT, was called "Attention Is All You Need." Okay? Which really started a whole new era. We didn't need recurrent neural networks anymore to solve the problem. So, yeah, an analogy I would make, again using natural language processing, for what I really mean by essential features: let's say I had a document several pages long, and then I had a summary of that document.
If my model really was correctly trained on essential features, I could take both the long form and the short form and still get the same prediction. Okay? And that just means that my model has to learn what's essential, what to pay attention to, and just as equally, what to throw away.
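The "what to pay attention to" idea can be sketched numerically. Below is a bare-bones scaled dot-product attention in plain Python, a framework-free simplification of the mechanism, not a production implementation:

```python
import math

def softmax(xs):
    # numerically stable softmax: turns raw scores into weights summing to 1
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(query, keys, values):
    # scores: how relevant each key (token) is to the query
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)  # "what to pay attention to"
    # output: attention-weighted mix of the values
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# one query attending over three tokens; tokens 1 and 3 match the query
out = scaled_dot_product_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    values=[[10.0], [20.0], [30.0]],
)
print(out)  # approximately [20.0]: the matching tokens dominate the mix
```

The weights are exactly the "this part is more important than the rest" signal the discussion describes; later parts of a real model consume the weighted mix instead of the raw inputs.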
[00:32:53] Unknown:
As you were figuring out what to include in the book, because deep learning and machine learning is a space that is seeing such rapid evolution and so much attention, I'm curious what your challenges were in figuring out what to include and what to optimize for, so that the book remains useful for a longer period of time rather than having to be rewritten every 6 months because of, you know, new research or changes in the space.
[00:33:20] Unknown:
You hit the nail on the head there, I have to admit. I had to make that decision: how far do I go in describing seminal papers before making the switch to how to apply this in production? And you will find that about two thirds of the way through the book is where that switch happens. And I said, okay, I'm not going to try to teach you any more seminal stuff. I'm just gonna show you now how it's actually applied in production.
[00:33:47] Unknown:
Yeah. It's definitely one of the perennial problems with any sort of technical resource. You know, how long is it going to remain valid? And then for people who read the book, what are your overall goals as far as what they're going to be able to take away from it, and, you know, what background knowledge they're going to want to have coming into it, or what capabilities they'll have after having read the full thing?
[00:34:14] Unknown:
You know, our reviews have been really good. A lot of data scientists have read it, even though they're not the target audience. You know, a lot of them have mentioned that they had a gap, and the way the book explained things was really helpful for filling in that gap. But the primary audience is software engineers. Our goal is to sort of demystify the whole process for the software engineer, teach them how to reframe it in their world as design patterns, and have them be able to take what they've learned here, practice it, and start applying it in the workplace.
[00:34:51] Unknown:
And then as far as your own work of being in the space, building deep learning models, and teaching other people how to move into this particular area of the industry, what are some of the aspects of deep learning and machine learning and data management that are, you know, requisite for building these models that you see as continuing challenges you would like to see addressed more directly?
[00:35:16] Unknown:
Yeah. Well, again, let's roll back the clock a little bit to 2016, 2017. You kinda really thought of this as a serial process. You know, get the data right, train the model, make a prediction, evaluate it, and if it looks good enough, you're done, kinda like a Kaggle competition. The real world doesn't look anything like that. It's very dynamic. We're continuously training. We're continuously evaluating. We're scaling these models. And so to me, all the challenges are substantially shifting into production. This is where you get the phrase ML operations.
If anything, that area, job-wise, is growing much faster than, say, research and development. Okay. So just for the audience, for those of you who are thinking about entering into the space and wondering where, there are vast employment opportunities in machine learning operations. But the thing is, it has a similarity to DevOps in that you have lots of moving parts. It's complex, and there isn't a perfect way for all these parts to come together. And so you have to use your ingenuity for how to put the parts together, how to monitor it, how to debug, and whatever issue comes up, how to address it.
And it's not a boring job. You're on the go every moment.
[00:36:43] Unknown:
And in your own experience and as you continue to work in the space and help educate others, I'm curious what are some of the areas of research or, you know, areas of focus either in the industry
[00:36:56] Unknown:
or in sort of upcoming architectures or capabilities, that you're particularly interested in or keeping an eye on? I see some things that have been going on for a while, and then a shift in another area. So let's talk about what's been going on for a while. This is really just understanding the distribution of weights in your model and how it affects not only how good the model is, but how big your model has to be. Okay? Historically, weight initialization was random, and some parts of the model learned better than other parts. You needed a lot of redundancy. And so you have what we call overcapacity in your model, in parameters. And, of course, the problem with overcapacity is it opens up the door to parts of the model just memorizing data.
And so we throw in all this regularization, all this noise, trying to stop that from happening. But as we find better and better ways to get the right initialization, and there's actually a well-known paper on this called the lottery ticket hypothesis, the idea being to get the winning ticket on the weight initialization, we can start training what we call compact models. These are models that are a lot smaller. They don't have the overcapacity for memorization. We don't need regularization techniques as advanced on them. Okay? And when deployed, they'll be more compute and time efficient. So that's one area. Okay?
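As a rough illustration of the capacity side of that discussion, here is a toy magnitude-pruning sketch: keep only the largest-magnitude weights and zero the rest. The lottery ticket work additionally rewinds the surviving weights to their original initialization and retrains, which this sketch deliberately omits.

```python
def magnitude_prune(weights, keep_fraction):
    """Zero out all but the largest-magnitude weights.

    Illustrative only: the lottery ticket procedure additionally
    rewinds surviving weights to their initial values and retrains,
    iterating the prune-rewind-retrain cycle. That is omitted here.
    """
    k = max(1, int(len(weights) * keep_fraction))
    # magnitude of the k-th largest weight is the survival threshold
    threshold = sorted((abs(w) for w in weights), reverse=True)[k - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

pruned = magnitude_prune([0.01, -0.9, 0.3, -0.02, 0.7, 0.05],
                         keep_fraction=0.5)
print(pruned)  # [0.0, -0.9, 0.3, 0.0, 0.7, 0.0]
```

The surviving weights are the "compact model" intuition in miniature: most of the parameter count was overcapacity, and the few large-magnitude weights carry the signal.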
In the area of automatic learning, you know, historically, whatever system we used would propose one model, train it, then propose another model and train it. You were comparing model to model and making a decision about what the next proposed model should be. And so it was one-on-one. About a year ago, there was some pretty good work on what's called distribution spaces for models. So, again, here's where distribution comes in. Instead of taking one model, you take a framework for a model architecture, make a whole bunch of variants of that model architecture, do low-epoch training, and then map out the results across that entire set, maybe 500 models, and that's a distribution.
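That variant-sampling step might look something like the following toy sketch. The scoring function is a hypothetical stand-in; real code would build each variant and actually train it for a few epochs.

```python
import random
import statistics

def low_epoch_score(variant):
    # Hypothetical stand-in for briefly training a variant and
    # measuring validation accuracy; real code would construct the
    # model and train it for a handful of epochs.
    width, depth = variant
    return 0.5 + 0.01 * depth + 0.001 * width + random.gauss(0, 0.02)

def family_distribution(base_width, base_depth, n=500, seed=0):
    """Sample n random variants of one architecture template and
    return the distribution of their low-epoch scores."""
    random.seed(seed)
    scores = []
    for _ in range(n):
        # random variant drawn from the family's architecture template
        variant = (base_width + random.randint(-8, 8),
                   base_depth + random.randint(-2, 2))
        scores.append(low_epoch_score(variant))
    return scores

scores = family_distribution(base_width=64, base_depth=10)
print(statistics.mean(scores), statistics.stdev(scores))
```

Repeating this for a second architecture family and comparing the two score distributions, rather than any two individual models, is the distribution-to-distribution comparison being described.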
Okay? Then do that with another totally different model architecture. You don't do model-to-model comparison. You do the same thing: make about 500 variants of it, random variants if you want to, map out their distribution, compare the distributions, and use that to eventually define what the best search space will be to find a good model. So it's an evolving area. I just found it interesting because, once again, distributions. Right? An obscure field I just happened to learn
[00:39:49] Unknown:
a long, long time ago. Yep. There's the perennial problem of, you know, I never used what I learned in school. Well, at least in your case, eventually, you did. Maybe not immediately, but it's come back around. And as far as working in industry, building models, in education, and in your current role, I'm wondering what you have found to be some of the most interesting or unexpected or challenging lessons that you've learned while working in the field and writing the book, and also if there are any patterns or practices that you have come across in your own research that you think should be more widely adopted but have so far been, you know, either overlooked or underutilized?
[00:40:31] Unknown:
Well, I think this goes back to my current role at Google. You know, I interface with what we call enterprise-class customers. So they're very large companies, corporations, multinationals. And a lot of the struggle really continues to be what the model sees versus what the model was trained for. Okay? And how to, not necessarily come up with new model architectures, but how to get those models to better generalize to a population that you really can't train them for.
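A tiny sketch of one way to catch that "what the model sees versus what it was trained for" gap in production: compare a feature's serving distribution against its training distribution. The statistic and data here are illustrative stand-ins for proper drift tests such as a KS test or population stability index.

```python
import statistics

def drift_score(train_values, serving_values):
    """Crude train/serve skew check on one numeric feature: how many
    training standard deviations the serving mean has moved.
    A stand-in for proper drift statistics (KS test, PSI, etc.)."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(serving_values) - mu) / sigma

# hypothetical feature values
train = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]
serving_ok = [10.1, 9.9, 10.4]
serving_skewed = [14.0, 15.2, 14.7]

print(drift_score(train, serving_ok))      # small: serving matches training
print(drift_score(train, serving_skewed))  # large: the model is seeing a
                                           # population it wasn't trained for
```

In practice this kind of check runs continuously against serving traffic and triggers retraining or an alert once the score crosses a tuned threshold.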
[00:41:10] Unknown:
As far as people who are looking to get into this space and wanna be able to learn more after they've finished your book, what are some of the other resources that you recommend they dig into to help them be more effective at building and running production-ready models, or to be able to dig deeper into the existing literature?
[00:41:30] Unknown:
This is one area where I may be a little bad. It's been years since I've really read books or blogs on deep learning. I just read research papers. I'm a scientist, you know. And, yeah, if you're a software engineer, unless you came up as a scientist, you're not gonna be able to parse these papers. But, you know, I do look at both O'Reilly and Manning's recent publications. There definitely has been a significant shift among other authors in framing deep learning away from a statistical description toward a software engineering description. And what I would do is just start looking at these books and their reviews and see how well, you know, software engineers really feel that a particular book is instrumental
[00:42:14] Unknown:
in educating them. Well, for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this is one that I believe I've picked before, but it's worth calling out again, particularly given the context of this conversation, and it's the book Designing Data-Intensive Applications by Martin Kleppmann. It's, you know, sort of adjacent to the machine learning space in that it talks a lot about the data management aspect and being able to build the systems that provide access to all the data necessary to train these models, but it's just a very well written and well structured book for learning more about some of the principles that go into those types of systems. And so with that, I'll pass it to you, Andrew. Do you have any picks this week?

The weather has just been great around here. Last weekend, I was up in the mountains, and I plan to go to the mountains again this weekend. So my pick is enjoy the great weather and the outdoors.

Definitely always a good recommendation and something that we all need to be reminded of occasionally. Thank you again for taking the time today to join me and for the work that you're doing on the book. It's definitely an interesting problem domain, and I appreciate you helping to condense some of your knowledge into a form that software engineers are able to take advantage of, and for taking the time today to help me learn more about the problem space. So thank you again for all of your time and effort, and I hope you enjoy the rest of your day.

Thank you for inviting me.

Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Episode Overview
Guest Introduction: Andrew Ferlitsch
Deep Learning and Model Architecture
Transfer Learning and Numerical Stability
Challenges in Data and Model Deployment
Writing the Book: Goals and Audience
Tooling and Practices in Deep Learning
Identifying Opportunities for Machine Learning
Design Patterns and Best Practices
Production Challenges and ML Operations
Future Research and Industry Trends
Lessons Learned and Recommendations