Summary
The overwhelming growth of smartphones, smart speakers, and spoken word content has corresponded with increasingly sophisticated machine learning models for recognizing speech content in audio data. Dylan Fox founded Assembly to provide access to the most advanced automated speech recognition models for developers to incorporate into their own products. In this episode he gives an overview of the current state of the art for automated speech recognition, the varying requirements for accuracy and speed of models depending on the context in which they are used, and what is required to build a special purpose model for your own ASR applications.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Dylan Fox about the challenges of training and deploying large models for automated speech recognition
Interview
- Introductions
- How did you get introduced to Python?
- What is involved in building an ASR model?
- How does the complexity/difficulty compare to models for other data formats? (e.g. computer vision, NLP, NER, etc.)
- How have ASR models changed over the last 5, 10, 15 years?
- What are some other categories of ML applications that work with audio data?
- How does the level of complexity compare to ASR applications?
- What is the typical size of an ASR model that you are deploying at Assembly?
- What are the factors that contribute to the overall size of a given model?
- How does accuracy compare with model size?
- How does the size of a model contribute to the overall challenge of deploying/monitoring/scaling it in a production environment?
- How can startups effectively manage the time/cost that comes with training large models?
- What are some techniques that you use/attributes that you focus on for feature definitions in the source audio data?
- Can you describe the lifecycle stages of an ASR model at Assembly?
- What are the aspects of ASR which are still intractable or impractical to productionize?
- What are the most interesting, innovative, or unexpected ways that you have seen ASR technology used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on ASR?
- What are the trends in research or industry that you are keeping an eye on?
Keep In Touch
- @YouveGotFox on Twitter
Picks
- Tobias
- Dylan
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Learn Python The Hard Way
- DeepSpeech
- Wav2Letter
- BERT
- GPT-3
- Convolutional Neural Network (CNN)
- Recurrent Neural Network (RNN)
- Mycroft
- CMU Sphinx
- Pocket Sphinx
- Gaussian Mixture Model (GMM)
- Hidden Markov Model (HMM)
- DeepSpeech Paper
- Transformer Architecture
- Audio Analytic Sound Recognition Podcast Episode
- Horovod distributed training library
- Knowledge Distillation
- LibriSpeech Data Set
- Lambda Labs
- Wav2Vec
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Dylan Fox about the challenges of training and deploying large models for automated speech recognition. So Dylan, can you start by introducing yourself? Yeah. I'd be happy to. So as you said, my name is Dylan. I'm the founder of a company called AssemblyAI,
[00:01:13] Unknown:
and we make an API for accurate and fast automatic speech recognition.
[00:01:18] Unknown:
For people who didn't listen to the previous interview that you did talking more about the Assembly product, can you remind us how you first got introduced to Python? Yeah. Definitely. So I
[00:01:28] Unknown:
started learning how to program back when I was in college, and I started attending some programming meetups. Those turned out to be Python meetups, but I did not know it at the time. And just got into Python, got into Django, and, really, that's where it all got started. So I just made great use of that. I think, you know, the website I learned Python on was learnpythonthehardway.com. Have you heard about that? Yeah. I have come across that one. I think it's from Zed Shaw, I believe. Yeah. You know, I forgot that that's how I learned because it's a long time ago, but that is how I got started with Python. I mean, I'd always played around with computers, and, I talked about this in the last podcast, was really into gaming as a kid. So I would be on IRC all the time and made, like, websites in HTML and CSS, but never did any, like, software development till I was a little bit older. And I had bought a book about learning PHP and MySQL.
It's like a huge textbook, and I made it like halfway through and was like, okay, I can't do this. And then, like, maybe a year later, got back into it, and that's when I focused on learning Python. And it was much easier than PHP and MySQL.
[00:02:42] Unknown:
It's definitely interesting because I've had a number of people who have come on the show who used PHP as the entry point to programming and then got sick of writing PHP, and that was the entry to Python. Yeah. Yeah. You're having to learn so many new abstractions when you learn how to program.
[00:02:56] Unknown:
And when you also have to learn like a clunky syntax and the environment's more complicated, it just makes it a lot harder. So, I think it must be why a lot of people are just starting with Python and
[00:03:07] Unknown:
Node as well because it's just a lot easier to get up and running and get started. In terms of the work that you're doing at Assembly that's, you know, entirely oriented around automated speech recognition. And so there's a lot of machine learning involved, a lot of model training and development. And so for people who are sort of unfamiliar with that overall space and some of the complexities that are involved, can you just give a bit of an overview about what's actually involved in building an automated speech recognition model? And for the sake of brevity for the rest of the podcast, I'll just say ASR.
[00:03:40] Unknown:
ASR. Yeah. A lot of acronyms. STT, ASR. We can use ASR for now. What is involved? A lot of pain is involved. No. So I think with most machine learning projects, it's fairly easy to get to something that works reasonably well, but it's very difficult to get a machine learning model working at a state of the art level and extremely well. The last 10% of accuracy seems to be where most of the work is done. Like you could probably easily go online and fork DeepSpeech or wav2letter and get something up and running that is, like, maybe 70 or 75% accurate for speech recognition.
But the real work is in how do you get it to be state of the art, which is, like, you know, 90% accurate on audio files where there are so many different speakers, so many different accents, there's background artifacts like babies crying or noise, the sink being on. So it is a very challenging machine learning problem because the data is high dimensional. There's a lot of noise and disturbances in the data. So it's a pretty hard machine learning problem. And especially in that last rung of accuracy, that's where a lot of the work ends up being, is what we found.
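As a concrete picture of that starting point, here is a minimal sketch of the "fork something open source and get a baseline" approach, assuming a public pretrained wav2vec2 checkpoint from Hugging Face and a local 16 kHz mono file named example.wav (both hypothetical choices, not Assembly's stack):

```python
# Rough baseline: run an open-source pretrained ASR checkpoint on one file.
# Assumes `torch`, `transformers`, and `soundfile` are installed, and that
# example.wav is a 16 kHz mono recording (hypothetical path).
import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech, sample_rate = sf.read("example.wav")  # must match the checkpoint's 16 kHz rate
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # (batch, frames, characters)

# Greedy CTC decoding: most probable character per frame, collapse repeats and blanks.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```

On clean, near-field audio this kind of baseline can look quite good; the gap Dylan is describing shows up on noisy, accented, far-field, real-world recordings.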
[00:05:07] Unknown:
In terms of just the overall machine learning space, there are a lot of other categories of data that people are going to work with where there might be other applications of machine learning running on audio data that isn't necessarily oriented around the speech aspect. There are things around computer vision and being able to do object detection and object recognition, natural language processing, named entity recognition. What are some of the sort of distinguishing factors of ASR as compared to some of those other avenues of machine learning and deep learning in terms of the complexity or difficulty that's involved in actually being able to gain high levels of accuracy?
[00:05:45] Unknown:
Yeah. It's a really good question. So let's just focus on NLP related tasks, NLU related tasks. Compared to speech recognition, compared to ASR, those NLP tasks tend to be a little bit on the easier side, and I think a big reason is because you can piggyback off of these huge models like BERT or, you know, you see this now with GPT-3. There's these huge models that you can piggyback off of to build a model that is more specific or tailored to the NLP task or NLU task that you wanna build. And you usually don't need as much data as compared to ASR, and the data is not as high dimensional as with ASR. So with ASR, you have high dimensional long sequences of data that you're sending into your neural network. So the models end up being really big. And this gets complicated because now, you know, there's huge NLP models like GPT-3. I was just looking into this, like if you were to try to train GPT-3 on a single NVIDIA GPU, it would take, like, hundreds of years.
And they actually trained that GPT-3 model on a huge cluster of NVIDIA V100s that were, I think, donated or set up by Microsoft. And still, it took about 34 days on, like, 1,000 or more V100 GPUs to train GPT-3. So that is definitely the exception. Like, in general, I think what you see is NLP models that are just working off of text tend to be smaller and tend to train faster and tend to just be easier to train and to get working reasonably well. I think it's also because there's just like a ton of text everywhere on the Internet online that you can quickly get as training data to train these models as well.
So for ASR models, you have highly sequential data, or high dimensional data rather, that is, you know, a long sequence. And so you end up having these really big models that take a long time to train and that are also very compute intense to train. And so if you're gonna try to train, you know, like a 500,000,000 parameter ASR model on tens of thousands of hours of data, you're gonna need at least a dozen or more GPUs to train that on. So you're gonna have to set up a way to train your models in a distributed fashion, and still training is gonna take a while, so iteration speed is slow and it's gonna be expensive for inference. So it's just a more compute intense problem than other NLP or NLU problems, and that makes it just a lot harder to build these models.
[00:08:34] Unknown:
So are you regretting your life choices that this is the particular area that you decided to build your business around? Well, you know, I mean, we have some
[00:08:41] Unknown:
NLP based features that we offer now. So one, for example, with our APIs, we can detect key phrases and keywords within the transcription text, and compared to the ASR models, like, that's a fairly easy thing for our team to build. And that's because we have all this GPU training infrastructure set up and we have these data pipelines set up and we have capable people on the team. So it's still on its own a challenging problem, but compared to the ASR problem, it's like a lot easier for our team to build. And I think that we didn't fully grasp just how challenging ASR models are to build when we first started the company. I mean, the first 2 years of the company was really just training models, setting up infrastructure, getting more data, and getting the models to the point where they were state of the art. And it goes back to what I was saying in the beginning, like, it's fairly easy to get something working reasonably well, but it's really difficult to get it working to a state of the art degree where you have an ASR model that is generalizable to many different types of speakers, to many different types of accents. It can handle background noise.
It can handle channel disturbances. It can handle echo and reverberation, like, that's very difficult. You need lots of data, you need a big model that can learn from all of that data. And, yeah, there's a lot that goes into it. So I think now I don't regret it because, you know, it's obviously like a barrier to entry because it's so difficult to get things working this well, but it definitely was challenging. And I think it's probably more similar to computer vision, because computer vision, you know, is also high dimensional data. But the thing about computer vision, which is nice, is that most of the computer vision models are like convolutional neural networks that are less compute intense, compared to transcription, where with ASR models, you're using either, like, recurrent neural networks or transformers, and those models are more compute intense.
And they take longer to train. They're more expensive to run. They're more expensive to train. So it definitely is one of the more challenging machine learning tasks. As far as the accuracy and capabilities of automated speech recognition,
[00:10:59] Unknown:
and I'm wondering if you can talk through some of the ways that it has evolved as far as the capabilities and the particular areas of focus and utility over the past 5, 10, 15 years because I know that, you know, there's the CMU Sphinx project that's been around for a long time that a lot of systems will use as sort of like a wake word detector that, you know, obviously is a small model and, you know, very limited in capacity, but, you know, there are a lot of speech recognition systems that are being used for much more sort of long running processes. Obviously, what you're building at Assembly for transcription capabilities, and then systems such as these voice activated assistants, such as the, you know, Amazon and Google devices.
[00:11:38] Unknown:
Yeah. Have you used the CMU Sphinx package before? A little bit. Yeah. So I've experimented with the Mycroft project, and I know that they use Pocket Sphinx for the wake word detector. Yeah. We actually have someone on our team who used to work on the Mycroft project. So CMU Sphinx was actually one of the first speech recognition toolkits that I played around with years ago when I was first getting into the field as well. CMU Sphinx and Kaldi are probably two of the more, like, popular traditional research frameworks, and there's a lot of opinions around Kaldi. So it's almost like if you say something about it, you could get in trouble with the community, but that's been around for a while. So I think one of the complicated things from the outside looking in with the field of ASR is that, like, all the use cases that you spoke about kinda all get bundled in together. You know? So, like, wake word detection versus, like, utterance detection for, like, an Alexa, or, like, large vocabulary speech recognition if you're trying to transcribe, like, a podcast like this. They all kinda get bundled in together to just, like, ASR. But there are different models behind each use case.
And the technology has gotten better, a lot better over the last couple of years. So, like, a long time ago, speech recognition models were built with, like, the GMM architecture setups, like Gaussian Mixture Models and Hidden Markov Models. And the accuracy was not very good. And those are the models that CMU Sphinx was based on top of. It was decent, but it was not very good. And so what people used to do is they used to just customize the crap out of those networks. They used to, like, adapt them basically to their specific use case and domain. So like, if you only need to recognize, like, these 20 words, you can think of speech recognition as a search problem. And if you limit the vocabulary to just those 20 words, it all of a sudden becomes much, much easier. So if you're trying to build just a simple, like, hands free device in your car, you could probably get decent accuracy with a very simple, like, GMM model that just has a very limited vocabulary of, like, 5 words.
And CMU Sphinx was, like, a pretty easily accessible toolkit for speech recognition that allowed you to build pretty simple GMM models. But what started happening is the GMM part of those models started to get replaced with deep neural networks. So that part of the modeling started to become a lot more powerful and more accurate. And then kind of just fast forwarding a bit, there was that DeepSpeech paper that came out. Did you read that paper? Are you aware of that paper, from, I think it was Baidu, that put it out? I've heard of it, but I haven't read it. So Yeah. Actually, I am not sure if it was Baidu or NVIDIA. I think it might have been Baidu. I'm getting mixed up now. But they popularized one of the first end to end deep learning speech recognition architectures.
Tell me if I'm getting too much in the weeds here. But like most traditional speech recognition architectures, they work by predicting phonemes. So you first, you take your training data, you convert it into phonemes, and then you're basically training a model to predict phonemes over time. And so you have to do this step called forced alignment. So you're basically saying, like, at this millisecond, they're saying this phoneme, and then you're trying to predict what phoneme is spoken over time. And then once you get your phonemes out, you take it through another process that tries to actually convert those phonemes into words.
And so it's like this very complicated setup. And even from a training perspective, it's complicated because you have to convert your words into phonemes, that's a lossy process; you have to align the phonemes with when they're being spoken, that's a lossy process. And then on the other end, once you get your phonetic predictions out, you have to try to map them back into words, and that's a lossy process. So there's, like, a lot of steps, and what the DeepSpeech paper showed, and really popularized, was a fully end to end approach. And when people say, like, end to end speech recognition, this is what they mean. You know, you're sending in audio and you're getting out graphemes, you're getting out characters. You could take the most probable character over time directly out of the neural network, and you could just get a transcription.
And so DeepSpeech really popularized the move in that direction. And since then, end to end speech recognition architectures have gotten a lot more powerful, especially with the advancements that have come out with transformer based neural network architectures. There's been improved loss functions that have come out since the original DeepSpeech paper. So DeepSpeech was trained with a loss function called CTC, and the CTC loss function has some flaws. It's really amazing what the inventor, you know, was able to come up with, but there are some flaws in that loss function, and there's been some new loss functions that have come out that have allowed these end to end models to be even more accurate. What's nice about these advances in these newer end to end deep learning based architectures is that the GMM based, more traditional old school ASR architectures, they had kind of plateaued. So even if you were to throw, like, 10x more data at them, you weren't really gonna see an accuracy improvement.
With the newer neural network architectures that have come out, the end to end based architectures, so a few of the popular ones, there's DeepSpeech, there's wav2letter, there's RNN-Ts, they're able to learn from a lot more data. So, you know, if you throw, like, a 100,000 hours of training data at those models, they're gonna learn a lot. And not only is the modeling power behind those models better, but they're also able to generalize better because they're just seeing tons of data. And so long story short, the accuracy for all these reasons has just gotten a lot better over the last 10, 15 years.
Even over the last 5 years, the accuracy has gotten a lot better. So like when we started the company compared to now, even our accuracy has improved like 30, 35 percent relative. So it's just gotten a lot better. And technology is just getting a lot better. And there's a lot of advances that are coming out and new research that's coming out all the time. And that's why you see applications popping up now that probably weren't really possible, like 5 years ago even around Zoom meeting transcription or phone call transcription, video transcription. There's a lot more possibilities now that the tech is more accurate. A great example of that is the Alexa and the smart speakers. Like, those would not be possible, like, without deep learning models powering them. Just wouldn't be accurate enough to be usable by consumers.
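To make the end to end, grapheme-level idea concrete, here is a minimal, hedged sketch of CTC training in PyTorch. The tiny stand-in network, vocabulary, and random tensors are placeholders for illustration, not anything from Assembly's pipeline:

```python
# Minimal illustration of CTC training on character targets (all shapes made up).
import torch
import torch.nn as nn

vocab = ["<blank>", " ", "a", "b", "c"]          # index 0 is the CTC blank symbol
batch, frames, feat_dim = 4, 200, 160            # e.g. 200 spectrogram frames per clip

# Stand-in "acoustic model": anything that maps frames to per-character scores.
acoustic_model = nn.Sequential(
    nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, len(vocab)))

features = torch.randn(batch, frames, feat_dim)            # spectrogram/MFCC frames
log_probs = acoustic_model(features).log_softmax(dim=-1)   # (batch, frames, vocab)
log_probs = log_probs.transpose(0, 1)                      # CTCLoss expects (frames, batch, vocab)

targets = torch.randint(1, len(vocab), (batch, 30))        # character indices, no blanks
input_lengths = torch.full((batch,), frames, dtype=torch.long)
target_lengths = torch.full((batch,), 30, dtype=torch.long)

# No phoneme dictionary, no forced alignment: CTC sums over all alignments internally.
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

The point of the sketch is the shape of the problem: long, high dimensional input sequences go in, per-frame character probabilities come out, and the loss handles alignment, which is exactly the part the older phoneme pipelines did by hand.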
[00:18:10] Unknown:
Another interesting aspect of this space is that speech recognition isn't the only use of audio data for machine learning contexts. One of the interesting conversations I had on the data engineering podcast was with a company that was working on collecting all kinds of different sounds and noises that you might encounter to be able to do noise recognition or sound recognition, where you wanna be able to, in a, you know, security context, identify the sound of breaking glass, or, in the context of sort of industrial automation, you wanna be able to understand what are the sounds that a machine is going to make that are predictive of a breakdown so that you could do preventive maintenance. And I'm wondering if you can just provide some context as to some of the differences that come up of needing to be able to understand the peculiarities of human speech in these waveforms versus some of the other ways that sound data might be used in a machine learning context.
[00:19:03] Unknown:
Yeah. I actually have a Google Nest Home, or whatever those little devices are, in my kitchen, and it went off the other day, and it was like, broken glass detected. So they must have one of these models locally running on their device, which is, yeah, maybe privacy invasive, but that's another discussion. Yeah, there's a lot of applications happening around, like, emotion detection, noise detection. Emotion detection is an interesting one. A lot of people will try to do sentiment analysis on top of transcription text, and that works reasonably well. But being able to detect emotion actually within the audio is way more accurate than trying to do sentiment analysis on top of the text that's transcribed.
Another field that's really interesting is source separation. So are you familiar with source separation at all? Sounds like being able to extract
[00:19:53] Unknown:
the different audio sources from, you know, maybe a mono recording
[00:19:57] Unknown:
file. Yeah. Yeah. Exactly. So if you have, like, 4 speakers in a file, you can basically create, like, 4 unique audio files, each with, you know, one speaker in there, because you can do source separation. And that's really cool technology, because you can basically remove a lot of crosstalk, which, if you have a single file with 4 people talking over each other and you have a lot of crosstalk, that makes it a lot harder for speech recognition models, that makes it a lot harder for any other model that might be modeling that data. But if you can separate out the speakers, and this is different than speaker diarization. Speaker diarization is basically, you know, a model that can tell you, like, hey.
You know, we identify there's, like, 4 speakers in this audio file, and speaker 1 spoke at this time, and speaker 2 spoke at this time. So you could use speaker diarization technology to do something similar, but source separation, it's just more advanced. It's a different application. So source separation is personally something that I'm more interested in and is something that we're looking into, but it's pretty complicated stuff. I think most deep learning research has been around, like, computer vision and NLP, and there hasn't been that much research around speech.
It's been not as popular as computer vision research or NLP research, and I think that's because the models are harder to train and you need more data and data is hard to come by.
[00:21:17] Unknown:
But I think that's starting to pick up, especially, you know, with companies like us. We're putting out our own research. We're really pushing the boundaries of what these models can do as well. Yeah. Digging into some of the technical aspects of ASR model development and management and deployment. You know, you mentioned how the size of the model is a factor when determining how you want to go about training the model and some of the complexities that arise around working with it as compared to maybe an NLP model. And I'm wondering if you can just give some sort of general orders of magnitude about the type of size that we're talking about in terms of the model and some of the contributing factors that will influence the total size of a model in terms of the different applications and how that plays out as far as being able to deploy and manage and sort of retrain them? Yeah. So in general, bigger is
[00:22:13] Unknown:
better, and the bigger the models you train, in general, the better the performance. We've actually done some experiments where, like, we'll have a pretty simple neural network architecture, and then we'll have a pretty advanced neural network architecture. And if you make both models, like, 50,000,000 parameters, for example, and you train on a dataset, the more advanced one will perform noticeably better. But if you then make them both, like, 500,000,000 parameters, they end up being a lot closer in performance, because even a simple model, if you just make it, like, big enough and you train it on enough data, you can actually get, like, pretty good performance. So it is kind of interesting how, like, the bigger you make your model and the more data that you throw at it, the less the actual, like, model architecture matters. But that is not always the case, as shown with things like transformers being, you know, much better than recurrent neural networks for sequential data.
But to your question, training large models is difficult for a number of reasons, and some that are unique to startups, which I don't know how much we're gonna get into, but, like, just for one reason, if you have a really large neural network, sometimes you can't even fit that entire neural network on a single GPU. When you're training, you have to train different layers on different GPUs, or sometimes even train different layers entirely on different machines, because there's only so much memory that each GPU card has. And, you know, you might not be able to, like, fit the entire thing onto a GPU.
And so if you get to a big enough size, then you have to start doing distributed training. And distributed training has a lot of complexity that goes into it because, one, there's a lot of just software development that has to go into doing distributed training well. And there are some nice libraries out there, like one is from Uber. It's called Horovod, and we've been using that for a while. And that works pretty well for efficient distributed training of large models. But, like, as you add more machines to your training cluster, you don't always get a linear performance improvement. And these machines are super expensive.
So you can't just go from, like, 1 to 10 and get a 10x speed up. You might go from 1 to 10 and get, like, a 6x speed up, and then, like, 10 to 20 and get, like, a 7x speed up. Because once you start communicating and having to, like, communicate large tensors, a lot of data, between machines, there's network overhead and it can just get complicated really quickly. So there's a lot of work that goes into setting up a training infrastructure that can train large models fast and without it costing a ton of money, because at a startup, you know, you're trying to always optimize for, like, speed and time.
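For reference, the usual Horovod plus PyTorch pattern Dylan is alluding to looks roughly like the sketch below. It would be launched with something like `horovodrun -np 8 python train.py`, and the tiny linear model and random tensors are placeholders standing in for a real ASR network and dataset:

```python
# Sketch of data-parallel training with Horovod: one process per GPU,
# gradients averaged across workers each step (placeholder model and data).
import horovod.torch as hvd
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

hvd.init()
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())   # pin this process to one GPU
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(160, 32).to(device)         # stand-in for a large ASR network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4 * hvd.size())  # scale LR by worker count

# Every worker starts from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Wrap the optimizer so gradients get allreduced across workers.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Each worker reads its own shard of the data.
dataset = TensorDataset(torch.randn(1024, 160), torch.randn(1024, 32))
sampler = DistributedSampler(dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = DataLoader(dataset, batch_size=16, sampler=sampler)

for features, targets in loader:
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(features.to(device)), targets.to(device))
    loss.backward()
    optimizer.step()
```

The sub-linear scaling Dylan describes comes from that allreduce step: the bigger the model, the more tensor traffic has to cross the network between machines every iteration.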
And if your models take, like, 6 weeks to train, like, for a startup, that's eternity. And you wanna try to iterate much faster than that, because a normal startup can, like, go collect feedback from customers and then, you know, overnight, like, ship a new feature if they're just building, like, a CRUD application or something. But when you have to train models and those models take weeks to train, you might find out 6 weeks later that you made a mistake or that your model is actually not better, and then you have to go back to the drawing board and train again. And before you know it, like, months can pass by, and you just don't have that kind of time as a startup. So the way to get around that is, alright, well, let's train on more compute. But more compute is expensive, and as a startup, you may not always have a huge budget to throw at compute. So there's just a lot of complexity that goes into it and a lot of factors that go into it. From a training perspective, definitely, like, my advice to companies training large models is always, like, don't do it on the big cloud platforms like AWS or Google Cloud. It's crazy expensive, and you can do it much cheaper if you buy dedicated hardware, or even, like, rent dedicated hardware from other providers, or even if you just try to buy your own training hardware, like, it'll just be much, much cheaper.
But I will say the landscape has changed since when we started the company. So when we started the company, we were training models on, like, NVIDIA K80s, which are so slow compared to the V100s or A100s that are out there now. It's, like, insane how much faster they are now, especially if you optimize your training and your network to perform really well on the new NVIDIA cards like the V100s and A100s. So it's a very long winded answer to your question, but, like, it's just complicated to train large models. And this goes back to one of your earlier questions of, like, how is it different from NLP? Like, you can train pretty advanced NLP models on a single GPU that work really well.
And so, you know, it may take, like, 3 days to train, which is not going to break the bank, and 3 days is, like, no time. But if you have a really large ASR model that takes, like, 6 weeks to train on, like, 42 GPUs, that all of a sudden is a lot of money and a lot of time. That's from the training perspective, so there's a lot there we can keep talking about. But, like, from an inference perspective, it does get easier, because there's a lot you can do to prune large models, you can quantize large models to make them faster for inference and take up less memory footprint. There's also techniques like knowledge distillation, which I don't know how familiar you are with that, but you can, like, squeeze a larger model, like, into a smaller model, basically. So you have a smaller model, like, learn the behavior of a larger model, and you're able to, like, distill knowledge into a smaller network that you can use at inference time.
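The knowledge distillation idea Dylan mentions boils down to training a small student network to match the softened output distribution of a frozen, larger teacher. A minimal sketch, using generic stand-in classifiers rather than anything from Assembly's models:

```python
# Minimal knowledge distillation step: the student mimics the teacher's soft outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Linear(160, 29)    # pretend: large, accurate model (kept frozen)
student = nn.Linear(160, 29)    # pretend: small model we actually want to deploy
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

temperature = 2.0               # softens the teacher's distribution
features = torch.randn(8, 160)  # a batch of (fake) acoustic frames

with torch.no_grad():
    teacher_probs = F.softmax(teacher(features) / temperature, dim=-1)

student_log_probs = F.log_softmax(student(features) / temperature, dim=-1)

# KL divergence pulls the student toward the teacher's per-frame distribution.
# In practice this term is usually combined with the normal supervised loss.
loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2
loss.backward()
optimizer.step()
```

Combined with pruning and quantization, this is how a model that needed a cluster to train can end up cheap enough to serve at inference time.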
So inference is actually, relatively, not that challenging. On its own, it is challenging. Like, any given day, we're maybe running, like, a 100, 150 GPUs in production for inference, because we transcribe millions of audio files a day with our API. But it's a beast. I mean, all of this is kind of going back to, it's just a beast. And, you know, I remember when we were going through Y Combinator, there were some people at OpenAI that were making themselves available to, like, advise the YC AI companies. You know, I was explaining to them what we were doing and, like, our models and our data and everything, and they were, like, wow, okay, this is, like, you know, pretty advanced stuff and, like, high compute stuff that you guys are doing. And I think, you know, most, like, AI startups are actually not doing that high compute or, like, large scale model training. Like, it's pretty rare that a company our size is pulling off training models this big for this challenging of a task. Yeah. It is kind of like, you can't just, like, Google the answer. You have to really figure this stuff out. How do you do this? It's been a beast, though. As far as the actual,
[00:28:39] Unknown:
like, size of the model specifically just in terms of, like, space on disk or space in memory? Like, what are some of the sort of comparative sizes that you're dealing with from some of the models that you're building versus the, you know, size of a simple NLP model that somebody might train from scratch? Well, let me do the math really quick.
[00:28:58] Unknown:
Like, some of our training machines, for example, they have 16 NVIDIA V100s. Each NVIDIA V100 has 32 gigabytes of memory, so it's 512 gigabytes of GPU memory per machine, and we're training our models on, like, multiple machines. So from a training perspective, you know, it's like, yeah, 1,024. We get, like, reasonable training turnaround time on, like, 32 to 42 V100s. So you're talking, you know, each GPU times 32 gigabytes of memory to get, like, the total GPU memory requirement from a training perspective. From an inference perspective, it's a lot smaller. Like, so we deploy some of our inference workloads onto AWS cloud resources, and I believe we use the NVIDIA T4, which is, like, the g4dn instance class, I think. That has, I think, like, a 15 gigabyte GPU memory footprint, and, you know, we can easily fit the model on there. It's just about, like, how big of a batch size can you process.
So, you know, you could fit the whole thing on there and maybe do a batch size of, like, 4. But as soon as you try to go to a batch size of, like, 18, it'll just explode, because you can't fit all of that into one of those small GPUs. So, yeah.
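As a rough, back-of-the-envelope way to see why the batch size is the limit rather than the weights, here is a toy calculation. Every constant in it (frame count, feature width, activation factor) is an illustrative placeholder, not a measured number from Assembly:

```python
# Toy GPU memory estimate: fixed weight cost plus an activation cost that
# scales with batch size and sequence length (all constants are made up).
def rough_memory_gb(params, batch, frames=3000, feat_dim=160,
                    bytes_per_value=4, activation_factor=400):
    weights = params * bytes_per_value                                   # e.g. fp32 weights
    activations = batch * frames * feat_dim * activation_factor * bytes_per_value
    return (weights + activations) / 1e9

# ~500M parameters is about 2 GB of fp32 weights before any activations.
print(round(rough_memory_gb(500_000_000, batch=4), 1))   # ~5 GB with these made-up constants
print(round(rough_memory_gb(500_000_000, batch=18), 1))  # ~16 GB, already at the edge of a T4-class card
```

The weight term is fixed, so a 15 to 16 gigabyte inference card holds the model fine; it is the activation term, growing linearly with batch size and audio length, that eventually blows past the card.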
[00:30:15] Unknown:
Does that answer your question? Yes. Yeah. It does. So back of the envelope, we're looking at probably on the order of about a terabyte of resident memory that you're dealing with when you're going through the training process.
[00:30:26] Unknown:
Yeah. Our speech recognition models, you know, they're like a few hundred million, let's just say, like, 500,000,000 parameters, and they are trained on, like, tens of millions of unique audio files. For each of those audio files, you know, the input to the speech recognition models is either, like, the raw audio, which is what you see with wav2vec, or it's something called MFCC values, or spectrograms. Spectrograms are, you know, some of the more common feature types that are used as the input to your speech recognition models. But with those spectrograms, you're talking, like, you know, yeah, each data point is, like, 160 dimensions. So it's high dimensional, like, sequential data that you're sending in. So the models are just so compute intense.
Like, our real time transcription model is way less compute intense, and you can run it on your MacBook Pro and get pretty good accuracy. But it's not as robust or accurate as our, like, asynchronous batch transcription model, which is what we transcribe, like, podcasts and video recordings and stuff with, because we're trying to maximize accuracy. So it's really just a trade off of, like, how accurate do you want to make your model. But with our company, we're trying to build just very generalizable, accurate models, and a great parallel really is, like, GPT-3. I mean, that thing is enormous, but it's so good at so many different tasks. It's just this giant language model that was trained on, like, a crazy amount of data, and you end up with this very performant generalized model that can handle a lot of different tasks.
[00:32:01] Unknown:
In terms of the sort of trade off of accuracy versus model size, you mentioned it a little bit that it's, you know, a nonlinear relationship. But what are some of the observations that you've made as you've iterated on the capabilities of what you're building at Assembly in terms of the utility of having a larger model for greater accuracy versus having a smaller model for ease of deployment and life cycle management?
[00:32:26] Unknown:
It really is about for us making sure we're solving our customers' needs. So because for us, since it's not just a pure research project, we're actually trying to build an API that customers are getting a lot of value from that's solving a business case for customers. It has to be accurate enough for customers to be able to go build what they're trying to build or do what they're trying to do. And so there's definitely trade offs out of a certain point. Like, if I were to say to a customer, hey, 91% accuracy costs a dollar, but like 93% accuracy costs like $10, they probably wouldn't care enough to pay that much more like 10x.
So for us, it's, you know, we could make our models, you know, much, much bigger, and they would just be much, much more expensive both to train and to serve. And you do get diminishing returns as you increase model size and as you increase the amount of data you're training on, because these models only have, like, so much modeling power. You can't just make, like, you know, a 10,000,000,000 parameter ASR model and, like, boom, it's human level. Like, there's gonna have to be better loss functions and neural network architectures invented before ASR can be human level. So at a certain point, there's diminishing returns, and you do have to make that decision. And a great example is right now, we're launching the next version of our ASR model in about, like, a week and a half. And as part of that pipeline, like, you can either use a simple, like, LSTM neural network, like an LSTM language model, or you can use a, like, more traditional, classical, like, n-gram language model. And the LSTM is way more compute intense and improves accuracy by, like, 0.1% compared to the n-gram language model, which is, like, much, much faster, much, much easier to train and serve. And it's like a 0.1% difference.
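For a sense of why the n-gram option is so cheap: a common way a language model slots into an ASR pipeline is simply rescoring candidate transcripts from the acoustic model. A minimal, hedged sketch using the KenLM Python bindings, with a hypothetical lm.arpa model file and made-up scores and weights (this is a generic pattern, not Assembly's actual pipeline):

```python
# Rescoring ASR hypotheses with an n-gram language model (generic pattern).
# Assumes the `kenlm` bindings are installed and lm.arpa is a hypothetical LM file.
import kenlm

lm = kenlm.Model("lm.arpa")

# (acoustic log-probability, candidate transcript) pairs from a beam search.
hypotheses = [
    (-12.1, "wreck a nice beach"),
    (-12.3, "recognize speech"),
]

alpha, beta = 0.8, 1.0   # LM weight and word-insertion bonus, tuned on a dev set

def combined_score(acoustic_logprob, text):
    # Acoustic score plus weighted LM score plus a length bonus.
    return acoustic_logprob + alpha * lm.score(text) + beta * len(text.split())

best = max(hypotheses, key=lambda h: combined_score(*h))
print(best[1])   # whichever candidate scores best once the LM weighs in
```

An LSTM language model plugs into the same spot, but it has to run a neural network over every candidate, which is where the extra compute Dylan mentions comes from.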
So if we're trying to write a paper to just maximize accuracy, we would probably just, you know, do everything we can to maximize accuracy, even if it was just, like, a 0.1% improvement. But because we're trying to deploy this in production at a cost that is reasonable for our customers, and at a cost that's also manageable for us as the company that's hosting it, you know, we just look at some of these trade offs between, okay, you know, this is more compute intense, but it's only, like, a small improvement in accuracy compared to some other model that might be much more performant but, you know, like, close enough in terms of accuracy performance.
[00:35:01] Unknown:
In terms of the sort of cost versus reward aspect of dealing with these large models, particularly as a start up as you mentioned, and some of the sort of both opportunity and financial cost that goes into building these models and doing the research and iterating on them before you're able to ship a new product and feature. What are some of the useful lessons that you've learned and lessons that you would share with other start up founders about some of the ways to be able to manage some of those outlays to be able to remain viable as a company and be able to leverage some of these more advanced machine learning capabilities as the sort of core element of your product?
[00:35:44] Unknown:
One of the first lessons, that is obvious in hindsight but I think a lot of startups miss because they're just focused on, like, shipping and moving fast, is you've gotta really know, like, what metric you are trying to optimize, which is different from, you know, the objective function of your neural network. Right? So I'm not talking, like, minimize loss. I'm talking, like, you know, for speech recognition, it's the word error rate. Right? So for any machine learning project that you're gonna start, you should first, before you do any modeling or anything, like, come up with some benchmark datasets, figure out what the current state of the art is on those benchmark datasets, and, like, that's what you should be working towards. Because if you just look at papers, a lot of the datasets that are used in papers are in, like, a laboratory setting. Right? So it's completely irrelevant for real world data.
A great example is speech recognition. Most of the datasets that are used are from a dataset called LibriSpeech, which is, like, very clean audio, people speaking, like, you know, with the microphone, like, in their mouth. So it's, like, very near field audio, very clean. They're probably in, like, a sound booth or something, compared to a phone call in the real world where there's, like, wind and noise and the person's, like, going back and forth from their phone. And there's, like, people talking in the background, and the sample rate sucks. And there's all this, like, downsampling happening in the network infrastructure to reduce, like, the telephony infrastructure's costs, which is just compressing the audio data further and further. So to any machine learning company, basically, it's like you really should first start by saying, let's create a really good set of benchmark datasets, and then, through either humans or from comparable products on the market, let's figure out what is the level of accuracy.
One, what is the metric we care about? And then two, like, how does the industry today perform on that metric against those datasets? And it's different from, like, a validation or test set. It's a completely different set of datasets that you should have. And that is really where companies should start, is with that. And then once you start modeling, luckily, a lot of the frameworks have gotten better over time. So, like, PyTorch has really good distributed training built into it now that actually is pretty good compared to, like, a couple of years ago, you know, when we were training our models with TensorFlow and hooking up Horovod to get good performance with distributed training, it was just harder. So you can do pretty well now, and there's access to faster GPUs like the V100, and on Google Cloud there's the A100 GPUs that you can rent, although it's, like, I think, $25 an hour or something.
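Coming back to the metric point: for speech recognition the number Dylan says to track is word error rate, which is just a word-level edit distance divided by the length of the reference transcript. A small self-contained sketch:

```python
# Word error rate: (substitutions + insertions + deletions) / reference word count,
# computed with a standard dynamic-programming edit distance over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion / six words ≈ 0.167
```

The benchmark-set advice then amounts to computing this same number on your own real-world recordings, not just on clean research corpora like LibriSpeech.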
So you're definitely in a better spot today as a startup, but another, like, major lesson is, yeah, you really just shouldn't train models on big cloud infrastructure like AWS or Google Cloud. It's just, like, stupid expensive. We have worked with this one cloud, I don't even wanna say cloud provider, like, this, like, small company that we lease, like, dedicated GPU clusters from. So they're all, like, 24/7. We pay monthly fees for them, and we have, like, hundreds of V100 GPUs that are just always available for us to train on, and it's, like, a sixth of what it would cost if we were trying to do that with AWS. And then there's a lot of programs that big clouds do offer for startups, like, you can get, like, $50,000 in, like, EC2 credits or $10,000 in EC2 credits.
So, like, maybe just start with, like, a big cloud and, like, use up your credits, and then kinda switch to a dedicated hosting provider where you can pay, you know, a couple thousand dollars a month and you can actually have, like, GPUs, you get better performance, and the costs are lower. And there's always the option of actually buying the hardware yourself. There's a really cool company, I think it's called Lambda Labs, where you can, like, buy training rigs. But then there's a question of, I mean, where do you plug that thing in, because the electricity draw is huge. And we can SSH into some of our instances, and you can, like, look at the GPU temperature, and it's, like, you know, like, 40 or 50 degrees Celsius, like, all the time. So if you get, like, a 6 or 8 GPU rig and you plug that into your house, I don't think it would work. I think it would probably trip your whatever it is, and your energy bill would be really high if it didn't. Your house would also probably be very hot. I was gonna say you wouldn't have to pay for heat in the winter. Yeah. I mean, it could be an option, you know,
[00:39:57] Unknown:
a way to heat up your house. Yeah. I think I might have actually seen a startup a while ago that was trying to sell, like, mini compute nodes that they would distribute to people's houses and, like, they would pay you to run it and then Yeah. Yeah. Provide heat to your house.
[00:40:11] Unknown:
Yeah. I think I saw that too. That would be amazing. Yeah. I don't know if it's a joke or Yeah. Yeah. I think it was real. But, yeah, it's just like, if you're gonna try to do it at any type of scale, it is really hard to pull off hosting yourself. And then you have to have someone that's going on, like, restarting these servers all the time if they go down or something. So it is easier to just rent machines, but you definitely can't do it with the big clouds. It's just way too expensive.
[00:40:40] Unknown:
Digging more into some of the specifics of the actual model development process, particularly as far as the sort of feature extraction and feature identification for understanding what to use as the inputs to your model. I'm wondering if you can talk through some of the sort of techniques that you use and attributes that you focus on as far as the feature definitions for
[00:41:02] Unknown:
inputs to your models from the source audio data that you're using. Yeah. So some of the newer model architectures, like wav2vec, you know, they're just modeling the raw audio data, which makes more sense, because the more traditional feature extraction methods are getting those what are called MFCC values or spectrograms off the audio data, but those are human crafted features, which then kind of starts to feel more like classical machine learning, where you have, like, someone designing features and building features. So in theory, intuitively, it's better to just, like, feed the raw data to the neural network and then have it figure out what the features are that are important.
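The three input options being compared here are easy to see side by side. A minimal sketch using torchaudio's transforms, with example.wav standing in as a hypothetical 16 kHz recording and the transform parameters chosen only for illustration:

```python
# Three common ASR input representations: raw audio, log-mel spectrograms, MFCCs.
# Assumes torchaudio is installed and example.wav is a hypothetical 16 kHz file.
import torchaudio
import torchaudio.transforms as T

waveform, sample_rate = torchaudio.load("example.wav")      # raw audio: (channels, samples)

mel = T.MelSpectrogram(sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=80)
log_mel = T.AmplitudeToDB()(mel(waveform))                   # (channels, n_mels, frames)

mfcc = T.MFCC(sample_rate=sample_rate, n_mfcc=40)(waveform)  # (channels, n_mfcc, frames)

# Older pipelines feed log_mel or mfcc into the network as hand-crafted features;
# wav2vec-style models take `waveform` directly and learn their own front end.
print(waveform.shape, log_mel.shape, mfcc.shape)
```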
And that really is, I think, kind of where the future is going. Yeah, those are the 3 main options: it's, like, spectrograms, MFCC values, or just raw audio data. The raw audio data is for sure the coolest option. Yeah. You're just letting the neural network, like, if you throw a couple convolutional layers at the top of your network, you know, you're just letting those basically turn into feature extraction layers off the raw audio data, and they're developing some abstraction that's turning into a feature set that is getting fed into, like, some transformer layers or RNN layers or something. But those are some of the main options. As far as the just life cycle management of the ASR models that you're building at Assembly, can you just talk through some of the sort of complexities that go into
[00:42:23] Unknown:
monitoring and identifying model drift and figuring out when you need to retrain and redeploy the models and just that overall
[00:42:31] Unknown:
operations aspect of it. Yeah. There's a lot that goes into that. Before we push anything into production, we do just a ton of QA, because we really want to make sure that the models are performing well on tons of different types of data, are going to perform well for customers, and are going to absolutely perform better than the current version that was in production if we're gonna make a switch. So the first step is just doing a ton of QA so that there's a lot of confidence that, okay, this new model we're pushing into production is gonna be better. And so we have a bunch of datasets that we run our models against. We look at the accuracy numbers. We look at a bunch of metrics to make sure that they're performing better across the board. And then once we push them into production, usually the things that come up are, like, DevOps related things. Like, you know, you realize, okay, for some reason, you actually can't have a batch size of, like, 16. You need to just use a batch size of 14. Or there's some setting that was misconfigured in terms of the sequence lengths that you're passing into the models, and the model explodes if the sequence length is too high. So there's, like, edge cases in the actual code that have to be smoothed out when the model profile changes.
But there's two different types of model updates. Like, one is the nice and easy type, which is just you're just updating the model file. Like, you just have new weights, basically. Like, it's the same architecture and everything. It's just the weights are better now. And that's what we call just, like, a drop in. And that's usually pretty safe if all the accuracy metrics are cleared, because from an infrastructure and DevOps perspective, like, you're just pointing, you know, to a different file, same architecture, just different weights. So it's usually pretty safe. Where it gets more gnarly is if you have, like, a new architecture. So we're actually going through this right now. Like, most of our infrastructure is built around our, like, v6 ASR model, which is like a really large CTC model.
And we're transitioning now to our v8 model, and v8 is a completely different architecture. So the whole pipeline has to change. There's, like, a lot of infrastructure that has to change. And there's a lot of actual, you know, programming and DevOps that goes into it. And that's a beast versus just dropping in a model file. So, yeah, on one side there's, like, a quality perspective, like, does the accuracy of the model kind of meet what we need it to meet? And then on the other side, it's okay, you know, are we just updating the weights of something that's already in production? Or are we shipping a new model that behaves completely differently and infrastructure needs to change completely, in which case the engineers are usually less happy and they need to roll up their sleeves and get to work. As far as just the overall space
[00:45:15] Unknown:
of ASR and some of the ongoing research that's going on there. I'm wondering what are some of the areas of activity and exploration and some of the items on your wish list that you would like to be able to push into production, but that are still too intractable or impractical to be able to productionize?
[00:45:33] Unknown:
I think the future of ASR is gonna be around, like, unsupervised training. It's very similar to, like, what you're seeing with GPT-3, where, like, you have these models that are very large, that are trained on lots of unsupervised data, and then you're just fine tuning them for specific tasks or for specific languages. I won't say too much there, but that's, like, a really interesting area of research that we're actively looking into, and the results are, like, really promising. To your point of being able to extend some of these NLP models to other languages, I'm wondering what are some of the capabilities
[00:46:06] Unknown:
that you're looking towards as far as being able to support other languages from an ASR perspective?
[00:46:12] Unknown:
Transfer learning and unsupervised training, it makes shipping other languages so easy, because you can, you know, train a model on, like, 50,000 hours or a 100,000 hours of unlabeled, like, Spanish data, and then just fine tune it on, like, a 100 hours of labeled Spanish data, and you're gonna get, like, pretty good results. And so that's a pretty clever way to quickly spin up new languages. It sounds simple, but, like, the part of training a model on, like, a 100,000 hours of unlabeled Spanish data, that's the challenging part, and you need the infrastructure and compute and know-how to do that. But once you have that, the fine tuning to pump out new languages is actually pretty simple and works pretty well. In terms of your experience of using ASR technologies and building the models and some of the ways that you're seeing it used
[00:47:06] Unknown:
either, you know, with the Assembly product or in other avenues of ASR capabilities? What are some of the most interesting or innovative or unexpected ways that you've seen those capabilities used? So in general,
[00:47:17] Unknown:
you are just seeing a ton of new use cases appear that are leveraging transcription, and they're either using transcripts for simple use cases like accessibility features, closed captions, or readable voicemails like visual voicemail. And then you're also seeing a lot of applications that are being built on top of transcription. So they're using the transcriptions as inputs to summarization models, or to figure out, like, topics that are spoken within an audio file or video file, and those are really exciting use cases that we're seeing a lot more of. The macro trends are that, one, there are now APIs that make this technology accessible. This is actually a very new thing. Like, 5 years ago, there were not APIs that you could just go get accurate ASR from. You had to go beg Nuance for a license and pay them tons of money and then deal with their software. But now you can come to APIs like Assembly, and as a developer, without having to even put in a credit card, you can just get access to state of the art speech recognition technology, and the accuracy is a lot better as well. And if you do end up deploying something into production, the cost has come way down. And so those 3 macro trends, like, accessibility, accuracy, and cost coming down, are driving a lot of innovation in this space, and we're just seeing so many applications launch every day that are making use of audio or video and the transcripts of those audio or video files. In your own experience of working in the space of ASR and building a business around that technology, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process? Yeah. I think it goes back to some of those lessons learned around training machine learning models as a startup that we just, you know, learned in the early days of the company, in terms of, like, you know, really making sure as a machine learning company, as a machine learning startup, that you have a good compass to use. You know, so that you have, like, good metrics that you're tracking on your model's performance, and you have good benchmark datasets that are comprised of real world data and not just, you know, laboratory or research datasets. You really have to be able to know and measure, you know, how your models are performing on real world data, and you have to have an acute understanding of that so that you know, you know, like, which direction to move in. You need a good compass.
And then the other is really just about that speed and cost trade off. I think ML startups need to recognize that you just can't move as quickly in the early days as a traditional startup because you can't just ship a feature overnight. You know, it might take you a couple of weeks to get more data together, it might take you a couple of weeks to train your model, it might take you like a week or 2 to deploy it. And that feels slow for startups because startups, you know, you wanna be able to, like, get customer feedback overnight, ship improvements, and then you know rinse and repeat rinse and repeat. But as an ML startup, it's kind of more similar to starting like a hardware company where
[00:50:17] Unknown:
you know, you just can't iterate as quick as a pure software company because you have this huge data science or machine learning component of your business that is at the core of what you do. As you continue to build out the assembly product and continue to invest in research for ASR, I'm wondering what are some of the overall trends in the research or in industry that you're keeping an eye on and thinking about as far as how to factor that into your product or interesting areas to explore.
[00:50:46] Unknown:
The unsupervised pretraining is a really interesting area, like, across the board right now. I think, you know, GPT-3 is, like, a great demonstration of that. There's limitless unlabeled audio and video data on the Internet and, you know, being able to tap into more of that data, train and build bigger models that can learn from all that data
[00:51:17] Unknown:
is a really exciting area right now. Are there any other aspects of the ASR space and the work that you're doing at Assembly to help sort of drive the industry forward and build research and just manage the overall model development
[00:51:31] Unknown:
and deployment and life cycle management process that we didn't discuss yet that you'd like to cover before we close out the show? No. You know, I think we talked about most things, to be honest. You know, we got a good breadth of topics that we covered. So it's challenging stuff, but it's fun, and it's
[00:51:45] Unknown:
fun engineering problems to work on, really. So, you know, while there's, like, cons to all this as a startup, it is a way to attract good engineering talent, because there are hard problems that good engineers wanna work on. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose a movie that I just watched recently called The Hitman's Wife's Bodyguard. So it's a follow-up to The Hitman's Bodyguard, both of which are just hilarious action movies. Ryan Reynolds is always entertaining to watch, and his interaction with Samuel L. Jackson. So just fun, mindless entertainment if you're looking for something to watch this weekend. So definitely recommend that. And so with that, I'll pass it to you, Dylan. Do you have any picks this week? I'll have to watch that. I, like, scrolled over that, and I didn't hit enter, but maybe I'll go back.
[00:52:32] Unknown:
Yeah. The Inspiration4 documentary on Netflix, actually, I think they're launching tonight in 4 hours. It's really cool. People should check that out. And you should also watch the launch tonight that's happening in 4 hours. I guess by the time this is published, they will have already gone up and come back down. So you and I should go watch the launch tonight.
[00:52:52] Unknown:
Yeah. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing in the ASR space and some of the complexities and challenges that are involved and some of the upcoming capabilities that people can look forward to. So definitely very interesting problem domain. So I appreciate all the time and effort that you and your business are putting into it, and I hope you enjoy the rest of your day. Yeah. Thanks for having me on here. Really appreciate it. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Dylan Fox's Journey with Python
Challenges in Automated Speech Recognition (ASR)
Comparing ASR with Other Machine Learning Fields
Evolution of ASR Technology
Other Applications of Audio Data in Machine Learning
Training Large ASR Models
Cost Management and Lessons for Startups
Feature Extraction and Model Inputs
Life Cycle Management of ASR Models
Future Trends and Research in ASR
Interesting Use Cases and Lessons Learned
Closing Thoughts and Picks