Delivering Deep Learning Powered Speech Recognition As A Service For Developers At AssemblyAI

Hello, and welcome to Podcast Dot in It, the podcast about Python and the people who make it great.

When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode.

With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform,

including simple pricing, node balancers,

40 gigabit networking,

dedicated CPU and GPU instances, and worldwide data centers.

Go to python podcast.com/linode,

that's l I n o d e, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.

Your host, as usual, is Tobias Macy. And today, I'm interviewing Dylan Fox about AssemblyAI,

a powerful and easy to use speech recognition API designed for developers. So, Dylan, can you start by introducing yourself? My name is Dylan.

I am the founder of AssemblyAI,

and what we're working on is making a really

easy to use API for automatic speech recognition

that developers are using to do

a wide variety of things from transcribing

podcasts like this to Zoom meetings to phone calls. We

started our company a few years ago now back in 2017

and have been,

yeah, chugging away ever since. And And do you remember how you first got introduced to Python?

Yeah. Yeah. So

I actually

was an econ major in college, and

I'd always been interested in computers is the kinda where it started. I played a lot of video games as a kid, so I played a lot of, like, Counter Strike and

Runescape and, like, World of Warcraft as a kid, and my brother and I would, you know, build computers because he was also really into computers.

And

at 1 point, I believe I was, like, hosting Counter Strike servers and selling them on IRC,

and it was, like, really a sketchy time. But but I that's that's where I got into,

like, computers and programming because I remember I built, like, a website

for the Counter Strike servers that I was selling just in, like, HTML.

You know? Then I kinda put that on the shelf and thought that I wanted to

go into, like, finance in college, so I majored in econ. And then my senior year of college, I started a company with a guy I went to college with,

and throughout that process,

ended up learning how to code, and

the language that I started with was Python.

So I attended some Python meetups at Washington, DC,

made a few friends that I'm still friends with now in the Python community, and

I think 1 of the first Python books I read was 2 scoops of Django,

which shot me into it. So that's where it all started.

And so now that's led you down the path of starting a business where I'm sure you're using a lot of Python being a machine learning shop. So can you give a bit of a description about what it is that you're building at AssemblyAI and some of the story behind how you got to where you are now?

Yeah. Yeah. Of course. So,

really, our goal is to build the best

API for speech recognition for developers. So we wanna have the most accurate, easiest to use,

most affordable,

best feature set

offering for developers so that they can build

whatever it is that they wanna build with speech recognition. So

kind of what we think about in the analogy that we use is, you know, the Stripe for speech recognition or the Twilio for speech recognition. That's what we're going after.

Fast forward a couple years from the last story, I ended up joining a machine learning team at Cisco back in, like, 2015,

and we were doing a lot of NLP work,

doing a lot of the NLP in house, and 1 of the projects that we were working on had a speech recognition component.

And this was at the time where Twilio was getting super popular.

Really, at that time,

I think

developer tools were not as popular as they are now. I mean, they're definitely getting more popular, but Twilio, I don't think had even, you know, IPO'd yet. They were still up and coming, and they've really set,

like, a great example for what you can build as a developer company over the last couple years and and how big you can get if you're just really focusing on developer experience. But

at Cisco, we had a speech recognition requirement,

and

at Cisco, as a large company, they, you know, decided, okay. You know, we're not gonna try to build this in house. It's just too complicated, and to host it and maintain it and keep it up to date is gonna be a huge nightmare. So

let's go out and let's look at what alternatives

or what vendors there are out there that we can buy this technology from. And so at the time, we've looked at Nuance, which I don't know. Do you know are you familiar with Nuance? Yeah. I'm definitely familiar. That's the service went into all of the different little flip phones that would do the speech transcription for you.

Yeah. Nuance has been really, like, the company, you know, in the past for speech recognition, and they have a lot of their

consumer

products like the Dragon Natural Speaking. I think, you know, my dad even uses that to, like, dictate notes on his computer

still. So we went to Nuance, and we tried to get access to some API.

And, yeah, I think, like, you know, weeks later, I had a CD ROM mailed to me with, like, trial software, and I was just like, what is this? You know? This is, like, 2017. I I don't even have a a CD ROM drive on my laptop, and it was crazy to me that that was still how they were selling software to developers.

And then through the research that we were doing at Cisco, I had

started to look into the latest speech recognition research, and there was a lot of advances in deep learning that were happening at the time that were improving

the results for speech recognition accuracy compared to more classical approaches.

And so you started to have these end to end deep learning approaches and these deep neural networks being applied to speech recognition

that were improving the results a lot. And,

historically,

I think automatic speech recognition results had kinda, like, plateaued,

and the real way that people were getting better accuracy is just, like, customizing

the crap out of these systems. Like, okay. You are deploying this into a car. Well, let's, like, customize it for the car. You're, you know, deploying this for, like, a recipe app. Let's, like, customize the vocabulary for the recipe app because

you can sort of think of automatic transcription as, like, a search problem. So if you limit the search domain or the search space, then it's easier for the system to figure out what you're saying correctly.

But

deep learning approaches were showing that you could reach a new level of accuracy, and we're kind of creating opportunities to break out of that plateau, basically. And so I had, you know, seen the experience

with Nuance. At the time, I think, like, Google had just launched a public speech API, and we checked that out too. But it was, like, you know, just the API and a web page. Like, even

being Cisco, we couldn't really get any support and talk to them about, like, the road map or figure out, like, hey. Is this thing gonna be deprecated in, like, 6 months? And then we're gonna have, like, a core piece of our stack that just, like, disappears overnight, and then we're left scrambling. And that's kinda the, I think, the challenge when you go with a big tech company is that you don't know. Is the product gonna be supported? Is it gonna go away? Like, is the pricing gonna completely change? It's a very sterile relationship that you have with some of these big tech companies if you're not using their, like, core

products. You know? I mean, if you're hosting your stuff on AWS,

you know, you get pretty good support. But if you're using just some, like, you know, like, ancillary thing that they offer, you know, that might be updated, like, once every 2 years or something. The experience with what the current vendors for speech recognition were offering coupled with, like I've been really into Twilio and Stripe and this whole, like, idea of a developer tool startup

and then seeing what was becoming possible with the latest deep learning research, and I was also really into machine learning at the time. So then kinda got the idea. What if you could use the latest deep learning research

to create really accurate automatic speech recognition

and then wrap that up in a really easy to use API

that developers were actually excited about using and excited about building

something with. That's where it got started.

Yeah. There's probably about 3 different podcasts worth of material that we could cover

with what you're building at Assembly.

So we'll try and sort of focus it down, and maybe we can have you back on to go deeper on some of the other elements. But, you know, to your point of speech recognition

being

something that's approaching commodity level where

there's been an explosion in the availability of services

for doing automated speech recognition and transcription

where, as you mentioned, Google has an API for it. Amazon has an API for it. There are apps like Otter for being able to take and transcribe voice notes. There are numerous ones available in the podcast ecosystem for being able to generate transcripts and extract, you know, key segments from it for being able to try and

popularize the your episodes on social networks. I'm just wondering

what you see as some of the motivating factors for the recent growth in the overall industry for speech recognition.

There's a couple of reasons. Like, 1, I think the main drivers of the accuracy has just gone up. I mean, the accuracy that you can get today from a speech recognition system compared to, like, you know, 10 years ago is dramatically

improved. The results are just a lot better now. And then

I think another big part of it is is accessibility.

You know? There is an API,

you know, from Google. You know, we have an API. There's a few other providers that expose this technology

through an API. So you don't need to beg Nuance for a CD ROM anymore that you're gonna host on some, you know, server to get access to speech recognition software.

And then the third is, I think, the affordability. So, you know, it's gotten cheaper to do this now because the tech has gotten better. So with our API, for example,

our pricing starts at 90¢ per hour of audio that's transcribed,

And then it can get a lot lower if you're doing really high volumes of transcription, but that's compared to, like, you know, I think it might cost, like, $50

per hour to get a human to transcribe something, and that's even probably on the low end. Then you'll have to, like, edit it and review it. To recap, I think it's those 3 things. I think it's the accuracy has gone way up. I think the accessibility

has gone up for developers

and start ups to get access to this tech, and then the pricing has become a lot more affordable. So you can use it for all these use cases that you probably couldn't do before. And it is still it's a really new thing. I mean,

Google's API

launched in 2017. You know? Our API launched late 2017, early 2018. So you're talking only a few years have you had the ability to

sign up, you know, without even a credit card, get access to an API that you can transcribe something with state of the art speech recognition technology with. It's only a few years, and so a lot of people don't even know that you can do that still. And that's why what we see from our end is, you know, the amount of use cases are still exploding. It's still the very early days because

people are just waking up to the fact that, like, hey. Actually, we can take all this audio and video content we have that we can't do anything with right now because it's it's kinda frozen in this state of being audio or video data. You can't search in it. It's not pliable.

But now you can transcribe it really accurately on the cheap pretty easily and then, all of a sudden, this data, you can do stuff with it. You can index it. You can look for profanity. You can get topics from it, summarize it, all of these types of things. So we still see it's pretty early days in the industry even though there's all these awesome use cases popping up more and more.

And in terms of the use cases and the core focus that you have for AssemblyAI,

obviously, as you mentioned, you're targeting developers. But what are sort of the main

sort of content areas or industry verticals that you're aiming for to try and optimize for the experience with Assembly?

So we see ton of applications in

the telephony space. So, like, a lot of phone calls being transcribed. There's a lot of applications now that are being built around

sales team coaching or support center. You know, if you've got, like, someone calling into a a customer support center, being able to automate the QA of that to make sure your support agents aren't, like, cursing or or saying the wrong thing on the calls. Visual voice mail.

We've got a customer that we've been working with for Ryle, you know, CallRail that does call tracking

attribution, and they're using it for phone call transcription.

So a lot of applications in the telephony space. I think that's been

kind of the most

low hanging fruit in terms of lot of customers in the telephony space

have been transcribing their data for a while and more and more starting to do so. The newer applications,

there's a lot of them around podcasts

and podcast transcription,

Zoom meeting transcriptions,

transcriptions of virtual meetings,

applications in the video space, so

video hosting platforms,

adding live captions to their videos or transcribing their video content to make it indexable or searchable.

So those are the some of the primary use cases we see.

Then there's a ton of long tail applications where

stuff that you couldn't even think about

or wouldn't even consider were options to use the technology for, but people are just super creative and have all these ideas.

And that's a big thing that we're trying to do is drum up more awareness that, hey. This this API

exists and this technology exists and see what people will just build with it because there's probably

applications that can be built with speech recognition that you or I would need to think about

that will be huge and just be really unique and innovative

use cases. And I think it's really similar to what you've seen with

keep going back to Twilio, but I remember seeing the Twilio API and was like, oh, wow. What can I do with this now that I can send SMS or make a voice call with API? What kind of startups could I build, or what kind of businesses could I build on top of this? So

I think there's still a lot more to come, but those are some of the main use cases that we see, basically, today.

The Twilio 1 is a definitely a good example because there are some pretty amazing things that I've seen people build with that. And, actually, what you're building at AssemblyAI is a natural complement too because you can automate sending voice calls, and then you could automatically transcribe them.

Yeah. Yeah. Exactly. Exactly. We have a lot of people that are doing that. A lot of people are sending their Twilio calls to us for transcription

and building a whole pipeline with that. And, yeah, there's so many cool things that people have built with Twilio

and other developer tools that

are just really

exciting and inspiring to see as a developer. Yeah. It's definitely great to see what people will build once you start to break down some of the barriers to just, you know, letting people run free and do what they feel like and having a low cost option definitely

increases the accessibility for people to, you know,

want to experiment because they don't have to deal with any sort of, like, financial output to be able to see what happens with some off the cuff idea.

Yeah. Exactly. It's kinda like what you know? I mean, we were talking before we started recording, and you were like, oh, I came across your API, and I started thinking, what could I do with this? And I wanna spend some time to hack on this, and, yeah, I think just giving

creative people access to technology in an easy

package,

they'll do things with it that you never really thought about in in the first case. And that, for me, is personally what I think what's always really cool when I see some

developer sign up and they build some application, and maybe they just stay on the free plan. Right? It it doesn't matter. But then you see that they're building this thing that is so cool. That's what I get really excited about when I see people building just really creative things with our API.

And so in terms of

the relationship that you have with AssemblyAI

and Python, obviously, Python is a natural client because it's just a web API for being able to interact with it. But in terms of the business itself, what are some of the different places that you're using it throughout the company?

We are very much a Python shop probably,

you know, due to the the fact that I started the company and Python's my my main language, but we use Python everywhere. So, you know, all the machine learning is done in Python. We use PyTorch

and TensorFlow.

You know, we were talking about FastAPI earlier. We actually just launched a microservice in FastAPI

using Flask, Tornado

for a lot of different services. A lot of the early scaling

that auto scaling that we had in place was, you know, like, Python, cron jobs, basically, that would run and hit the AWS

Boto API and

turn on servers or take them down.

And then there's a lot of just back end kind of tooling,

kinda ops tooling, internal tools that we build with Python. So we're using it all over the place, really.

Yeah. It's definitely

1 of the sort of great powers of Python is that you don't have to constrain it to just the machine learning area

or just the web server. And as I've called out a number of times on this podcast and elsewhere, you know, Python is not the best tool at anything, but it's always the second best at everything.

Yeah. Yeah. A lot of people will talk about I don't know. If you find this too, a lot of people will talk about, like, how, like, Python is slow and, you know, that that's kinda 1 of the drawbacks, and I honestly never really, like, faced that before.

But then with this company, I remember

1 of the, like, very early versions of the models that we built, there was, like, the certain part of the pipeline where you have to, like, write this algorithm, and the only way I could write it was in Python. So I wrote it in Python, and then ended up it was, like, really slow, so then I converted it to, like, Cython and then got a little bit faster.

And then, eventually, we had someone join the team who was, like, really good at c plus plus He rewrote it in c plus plus, and it was, like, 30 times faster. So it's so easy. Python's great because you can just do, like, whatever with it, and it's pretty easy. There's such an ecosystem where

there's so many libraries that you can just PIP install. There's so much example code online. But for some of the heavy duty performance

related stuff, we've had to drop down a bit and get out of Python. It's a great prototyping language. Most of the times, the prototype can live in perpetuity, but but it does have those easy sort of escape hatches where if you do need to drop down to, you know, do some extreme performance tuning, you can still

layer Python on top of it to just call into that subroutine.

Yeah. Yeah. Exactly. Exactly.

And then in terms of the AssemblyAI

platform, you mentioned a few of the different places that Python's being used. But in terms of the sort of the core elements of the platform, can you talk through some of the ways that it's architected and how you're managing

the

sort of building and serving of a deep learning model as a developer face facing API and some of the challenges that come along with it? Definitely. I'm thinking where to start because there's a lot of challenges.

The 1 of the hardest things, honestly, is, like, auto scaling.

Right? Because

these models, they take a lot of time to start up, and they're big.

Let's just think about being on AWS. Right? So if you have, you know, some

instance that you're launching on AWS, even if you have a lot of these, like, big model files baked into the AMI, you know, it's still, like, off of Fresh Boot, takes some time for all that disk to become, like,

quickly available and where you can, you know, have, like, fast IO off that disk to load that model into memory really quick.

And if you're not doing that, if it's, like, download files at runtime and then starts up, it'll take a while

to download those, everything that you need and then start up the service that you need. So so scaling is difficult because you have these services that, you know, don't come up instantly, and then you have very irregular usage patterns. I mean, most of our volume is, like you know, follows a bell curve throughout the day, but, you know, you get spikes of traffic where all of a sudden, 1 developer or 1 customer will throw in a bunch of volume and that will change the traffic pattern, and then you'll quickly need to get

more instances available, like and this is in comparison to we run some

just basic, like, CRUD APIs and CRUD services, and those, I mean, you can get a crazy amount of performance out of just, like, a few

servers even. And if you need to scale them up, I mean, like, a few minutes later, you've got healthy instances that are up and running. Whereas,

the machine learning models, they need to run on really expensive compute, so you can't just run a ton of them. Otherwise, you know, you'll have an enormous

bill. So you need to try to really only have running what you need running to serve the traffic so that you can keep your cost down. But at the same time, you can't

reactively scale that quickly. So you need to try to, like, predictably scale or

really tune how quickly your services can come up and become healthy. So there's a lot of problems around scaling

that we've done a pretty good job stabilizing

and getting to a point where we're okay with, but it's always something where, like, you know, we could go spend the next 6 months trying to make scaling better and make these services come up faster.

1 of the trends right now with deep learning is, you know, these big models that are performing really well, but they're, like, these huge models. And you can do a lot to

kinda shrink them when you go deploy them for inference, but still, you know, you have these big models that take a lot of compute resources and don't load super quick and have slow inference times. You know, we're serving, like, millions of API requests per day to our API.

So when you have

to serve, like, that kind of traffic

with these, you know, sometimes slow models, it can just create some problems. So that's always been 1 of the things that we've kinda faced.

Another thing too is, like, you know, accelerating the time we can take models from our research team and get them into production. You know? So we we've tried to build these, you know, almost these, like, internal libraries

where we can just, like, quickly take a model and drop it in to some, you know, like, worker

library that we built so that when the research team

regardless of if it's, like, an NLP model or a speech recognition model or whatever, we have this kind of framework that we can quickly deploy models with because,

usually, the models are different. The inputs are different. The outputs are different. The compute requirements are different.

So you have to figure out, alright. You know, this model runs on CPU. This model runs on GPU. This model needs 50 gigabytes of RAM,

but only, like, 2 cores. And that's been the most challenging thing from a deployment perspective. I think every model has different compute requirements, and we wanna try to get these things deployed really quickly.

In terms of

the actual hardware requirements, I'm wondering if you have

experimented at all with using something like neural magic or some other technique for pruning the models to be able to reduce some of the hardware requirements or be able to run them on commodity CPUs instead of having to launch these dedicated GPU instances that are costly and slower to launch and harder to manage? We've definitely done a lot of research there, and that has worked for some of our

simpler models because

probably a little out of my depth here because I haven't worked on this stuff directly in a couple months. But, like, you know, I I remember we were looking into 1 library at some point and had all these big claims about

how much you could speed up the model for inference, but it only supported, like, certain types of operations

in the neural network. So, like, if you have, like, a vanilla convolutional neural network, like, yeah, you could get great performance improvements. But if you were doing something more, you know I don't wanna use the word advanced, but just, you know, more advanced, You would get, like, no performance gain because maybe they weren't supporting, like, the kernels that you were using or they weren't supporting, like, the operations that you were doing. There's kinda always

always, I think,

ways to get these things to work faster. But with us, it's always the trade off of, like, okay.

Do we wanna spend time,

you know, really trying to prune this network and get it to be a lot smaller inference time

to improve inference performance, or do we just wanna

solve it from an engineering problem? Like, let's get, you know, really good scaling behind it and just kinda maybe, like, pay more than we need to for right now so we can use that time continuing to improve the actual accuracy that's coming out of the model. And so that's always, like, a trade off. And so

for the low hanging fruit that we can do,

we always try to do that, and a bunch of our models, you know, run on CPU. Like, we have a model for real time transcription,

and that's a pretty lightweight model that runs on a CPU. And you can even run it on, like, your MacBook if you need to compared to the model that we use for our asynchronous transcription that's, like, much bigger and that, like, has to be deployed on a GPU. And we've tried so many ways

to speed up inference time, and some of them, you know, it's just, like, negligible because that's a really

complicated

model architecture that we're using for that because we're not just using, like, you know, Kaldi or some open source

architecture. You know, we do our own research for all the models that we build, and these are architectures that we've designed and come up with, basically.

And in terms of the overall domain of speech recognition, what are some of the open areas of research and some of the areas that you're specifically investing in for being able to optimize some of the capabilities and performance characteristics of what you're building at Assembly?

I mean, to be completely honest, speech recognition accuracy has gotten a lot better over the last couple years, but there's still a long way to go before it's human level. I think there's a lot of, you know, marketing out there that is like, oh, you know, Microsoft has reached, like, human level accuracy

with automatic transcription and, certainly, in, like, a laboratory environment where you have, like, you know, a very specific dataset and you train a model on that dataset

and, you know, you evaluate that model on that dataset, you can get really good performance. But in the real world, I mean, you have so many accents, so many different

types of encoding for files. You have so many different types of artifacts that can be in an audio or video file.

So many characteristics

in production

that make real world transcription

still far from human level. Certainly, you know, if you take, like, a TED talk and you run that through our API, you're gonna get, like, 95, 96% accuracy. This podcast probably,

you know, like, the low 90%

of accuracy.

But if you take a Zoom meeting where you have a bunch of participants and they're all over each other or you take a phone call where it's, like, low quality,

you know, people aren't speaking clearly or they're, like, on speaker phone while they're, like, running through the park or something, the accuracy is still gonna suffer. And in a lot of those cases, yeah, it'll still be difficult for a human, but the machine still definitely lags behind the human. What we really are focusing on is getting to human level accuracy

as quick as possible because we think that that is the key to unlocking

a tremendous amount of new use cases that just require higher levels of accuracy. I think the last we checked, there's still, like, 1, 000, 000, 000 of dollars spent on human transcription because there's a lot of use cases in the medical domain or in, you know, court reporting or legal domains where you need, you know,

99% accuracy. 92 won't cut it. And so that's why we're really focused on just getting the accuracy up and doing the research that needs to be done to get there, and it's a very,

you know, unsolved problem. You know, there's not, like, clear cut where it's okay. You just triple the number of parameters or you add more layers or something. There still needs to be research

done to figure out how do we make these models work better. There's a lot of really interesting research that is being done. I think, certainly, in the last couple of years, it's gotten better, like I said. But, really, for us, the big thing is, like, how do we get these things just to work better? You know? We don't wanna get caught up in the race of, like, okay. Let's accept the status quo, and the only way to make it work better is to just, like, you know, really overfit it to your use case. Right? Because that's

just a Band Aid. The real

solution is how do we just get these things to work at human level accuracy? You know? So it's kind of like the effort that's being done for for self driving cars. Like, you know, if you have a self driving car working pretty well, you can, you know, go exit to exit on a highway or you can, you know, in, like, university campus, have, like, a shuttle go around. Right? But once you get that self driving car so good where it can drive through the city and it can do everything, then there's an explosion in the amount of applications that that self driving car can be used for. And so it's it's similar, I think, with, like, any machine learning system, but, you know, for speech recognition, that's how we look at it too. Like, we're right now really only at the early stages of this because

the technology is just where it's at. But as it gets better,

there'll be an enormous amount of new use cases that it can be applied to, and that's why we're really just focusing on doing the research to get there. 1 of

the perennial challenges of working with machine learning and doing data analysis is the availability

and quality of the data that you're working from. And I'm wondering

what the sort of synergies are

of being in the situation of performing the research on speech recognition, trying to help push the bar in terms of the capabilities and the quality

and being in a company where people are submitting their, you know, speech audio data to you for transcription and being

able to use some of that input to help with improving the qualities of your model and do research on, you know, speech transcription in a noisy environment where there are jackhammers in the background or speech transcription

across a variety of different accents where maybe you have somebody who's speaking British English with an American, with somebody who is using English as a second language and just some of the overall complexities that come with that. Well, uniquely, we actually don't store

any of the data that our

customers sent to our API. So, you know, it's 1 of the the privacy

protections we have in place that, you know, we don't store it, and then we never use it for training. We certainly have some customers that are like, no. I I want you to use this data, so, you know, just take it and use it if, you know, if it's just like YouTube videos or something where there's nothing sensitive. But by default,

nothing's stored, so we don't use it for training.

But yeah. I mean,

there's certainly a way to get better performance, which is, like, just increase data, but at a certain point,

you do hit a plateau because your model is only so powerful. Right? So even if you had a 1000000 hours of, like, a perfect blend of, like, every single possible accent distributed

perfectly.

You know, if you've got a model with, like, you know, a 1000000 parameters only, it's not gonna be able to learn from all that data. You know, you would need, like, a much, much bigger model. And then if you have, like, a much, much bigger model that can learn from that much data,

you would need, like, thousands of GPUs to train that and it might take, like, a year to train because it's, you know, sequential

audio data. It's really high dimensional.

There's certainly, like okay. You know, let's say, like, you're training a model on, like, 2 1, 000 hours of of labeled audio data.

Taking that from, like, 2, 000 to, like, 20, 000, you're gonna see big gains. But at a certain point, you stop seeing gains just by adding data, and it really is about,

you know, the modeling power that you have.

In terms of the sort of different

applications of machine learning of what you're building at assembly, the obvious point is you have a model for doing speech recognition and the audio processing. But what are some of the other applications of machine learning and some of the ways that you're layering them together to build the overall experience of somebody

submitting an audio file to you and getting a high quality transcript back? So at the very basic, you know, we have, like, automatic punctuation restoration,

automatic casing. There's something called inverse text normalization. So if you say, like, $30,

that's gonna come out, like, t h I r t y, right, spelled and taking that to, like, putting a money sign in front of it and then, you know, 3 and a 0. We have a lot of models that try to make the transcript more human readable.

But what's really exciting, I think, that we're doing is we're adding on more layers to enrich the transcription. So we have, topic detection feature now. So depending on what's spoken in the transcription, we can determine

the topics that were spoken. So, like, in this podcast, for example, you know, we'd be able to detect, like, this is about machine learning and Python and, like, all the topics that were detected. And then we have another feature, you know, that is our content safety detection feature, so we can tell you if there's, like, hate speech or profanity

or, like, violence or gambling or any of these things that are potentially sensitive, even, for example, like, sensitive social issues. So people are talking about, like, you know, like, riots or vaccines or something, politics, we can flag that. And then we can tell you exactly where. So we can say, like, hey. At the 2 minute mark, there was, like, hate speech detected in your audio file, or at the, you know, 6 minute mark, they were talking about football.

So we're adding a lot of features like that. We have key phrase extraction feature that's pretty awesome. So it can be used to pretty much, like, summarize

this audio or video data that you have. And so we're adding a lot of features

to make it easier for developers to do more with their transcripts,

and that's what we're really focused on right now because, you you know, it's 1 thing to just transcribe something and then, you know, it's in text format and you can read it or you can index it. But then you realize,

okay. You know, now that we have this data in this more pliable format, we can do all these things with it. And so we're trying to make it easier for customers to do more with their transcripts.

In terms of

the sort of enabling the developer, another interesting thing worth exploring is the design of the API

and how you're able to make it both simple to interact with, but powerful in terms of the capabilities and being able to

provide the full level of control to the developer without having to sort of get them deep in the guts of figuring out how to pick apart all the information?

It's a great question. It's something that we're, like, constantly thinking about, you know. Right now, we're on, like,

v 2 of our API,

you know, spec and v 3 is gonna come out in some time. And so we are constantly thinking about how do we make the API

easier to interact with and less of a thing to grok as a developer

and collecting feedback to think about

how we can improve that.

Right now,

we really have 2, like, major API endpoints, and then, you know, you can just use it and say, hey. Transcribe this thing, or you can add these additional

parameters to, like, turn on certain features and then get more data in the JSON response, basically. So

our company is primarily engineers. All of us have used APIs before, and we've used bad ones, and we've used good ones. So we try to just really talk about

and test out the endpoints ourselves to figure out,

alright, is this too complicated?

Is this easy? You know, when we launch a new feature, we'll build it, and then we'll have other people from the team that are also developers, but that weren't involved with building it. They'll test it out and they'll give feedback if it was confusing or the response should be different or parameter name should be changed.

So we try to just really

spend a lot of time thinking about that versus, I think, some companies just like, you know, the developer

in isolation will build it, and they won't really think about, like, the experience that customers, developers will have with it. They'll just, like, you know, publish the docs. You know, that's our product. Right? So, like, the whole team has a lot of emphasis on, like, alright. Is this parameter name confusing, or does this make sense? Or, you know, is this response

format

in, like, the easiest to consume way, or should this be different? So we just spend a lot of time thinking about it, basically.

And 1 of the interesting aspects of the API design is that

the developer is interacting with this sophisticated

deep learning system, and there's a lot of complex terminology that goes into working in the machine learning space

and being able to translate that into accessible terms for somebody who isn't steeped in that overall ecosystem

and being able to, you know, clarify that speaker diarization. Well, this means that this tells you who's speaking when, right, kind of thing. What are some of the other

sort of conceptual

aspects that you have to

explicitly map from the sort of deep learning,

machine learning domain to the, you know, everyday developer who just wants to get their work done? That's a challenge. You know? Even from okay. The deep learning team might be calling their model x, but then no 1 is gonna understand what x is, so we have to then come up with a different name, you know, like the topic detection feature.

The actual model is called something totally different. We just, you know, call it the topic detection

model because that's what's easier to understand for people.

So it is a challenge that we have to

face, especially with some of these newer features. Great example is our content safety detection feature. No 1 knows what that is because that doesn't exist yet and that's you know, it can flag, you know, like I said, the hate speech or violence or anything that's, like, sensitive in nature in in your transcripts.

So we have to figure out how we

communicate

these technical topics to everyday developers,

these machine learning related topics to everyday developers. And 1 of the ways that we're going about that is by just building a lot of sample projects,

you know. So if you go to our, you know, blog, you'll see it's like, you know, how to build this and how to build this and how you could do this. And we think that that is a good way to expose

the capabilities

to developers and because most people are just signing up and they just wanna transcribe something, and then they wanna get more advanced. And so I think it's showing them what you can do and more like the output versus what it is. Right? So, like, I could tell you, oh, speaker diarization is this, and this is the model and what it does. But if I just instead show you, like, hey.

This output, you know, gives you, like, who said what in the transcript. You know? Then it's like you get what it is because you can play with it and see it, and it doesn't really matter what it's called at that point. It could be called whatever.

So that's 1 of the things that we try to do. Like, with our docs, there's a lot of, like, code samples so you can see, like, the example response for all the different features. And we highlight out, like, okay. This is what you'll get, and, like, this is what this means. So we try to do things like that, but it's definitely still work in progress.

And then another interesting thing to call out is the challenge

of the CICD process for deep learning models where CICD for,

you know, a web application is a fairly well understood and well scoped problem. But what are some of the complexities that creep in when you're trying to build a business based on machine learning and you need to be able to

validate the performance and the feature set and address any sort of regression problems in the model and just the overall workflow for managing the machine learning life cycle and through that CICD process to get it into production?

Yeah. I think this is 1 of the least

standardized

things, from my opinion. I'm curious what you think in the machine learning space. Have you found that to be true too? Yeah. I mean, there's the current trend that's happening a lot in sort of the data engineering ecosystem of MLOps

operationalizing

the machine learning workflow, and there's definitely been a lot of progress made there, but it's still wild west. There are, you know, components that are well engineered and work great for this use case or, you know, this environment, but then stitching it all together and fitting the

specific needs of a given workflow or a given company or a given model is still a work in progress.

Yeah. Yeah. It's wild west is a great way to explain it.

I think, traditionally, you know, CICD to me as a developer is like, okay. You have automated tests. You have, like, end to end tests and, you know, you see, like, your test coverage and make sure that, like, this function doesn't blow up or there's any regressions with your tests.

But with models, it's really just about, like, the accuracy. Right? Like, what's coming out of it. So

1 of the things that we've done is as part of our CICD flows, you know, we run, like, benchmarks. So, like, we know, okay, if you if you take this dataset and you run it through this model,

this is CICD

so

that

as part of the the test that get run, we actually our CICD so that as part of the the test that get run, we actually, like, assert that the accuracy is what it should be within some small margin of error. So that if there was something that got changed in the pipeline

or, you know, if during the handoff from the research team to the engineering

team,

some piece was implemented incorrectly,

we'll see that the accuracy

is not what it should be or, you know, the performance of the actual model is not what it should be before that gets into production.

And that

sounds obvious. We actually didn't even, like, have that in the early days. We just did a lot of manual checks and, like, QA before we would deploy something. Because it's actually, like, a pain in the butt to get that set up because you have to, like,

build this, like, benchmark

into your CICD flow and, like, write, you know, this test that's gonna, like, load the model and run this dataset to the model. So that's 1 of the things that's helped, but, I mean, I'll be honest. It's it's definitely still a work in progress.

I would say all the, like, the ops and, like, production type work around the machine learning, there's still a lot of work that has to be done there. Like you said, there's just this whole, like, ML ops thing or data ops thing starting to be focused on more. And

in terms of the

experience that you've had running the business and some of the ways that you've seen the AssemblyAI

product used, what are some of the most interesting or innovative or unexpected ways that you've seen it employed?

That like I was saying earlier, that's what gets me most excited when I see

these things being built with the API that I just never even thought of before or none of us ever really thought of before. 1 of the things that I've seen

start to happen is there's a lot of applications of speech recognition starting to happen throughout the hiring process.

So, you know, traditionally, you'll do, like, a HR screen or, you know, you'll have, like, a screening call during a hiring process where, you know, someone from the company will do, like, a screening call and make sure that the candidate

meets certain qualifications. And some companies are automating that where you're called into a number

and

some synthetic voice will ask you questions

and the answers get recorded and transcribed, and

this conversation that a candidate is having with this bot

get displayed to a hiring manager with, you know, like, automatic scores applied based on the answers. So

that's, honestly, something that, like, really surprised me that that was happening, and it seems to be working pretty well. But But that's 1 of the things that has been

surprising for me to see that I was pretty excited about when I saw it because I didn't even think of that as an option.

And then in terms of your experience

of building this business and running it and growing it and trying to

make this service that is targeted to developers and accessible to developers, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?

I mean, the first 2 years, honestly,

was all

research and development. Like, getting our training infrastructure set up,

getting all the data that we needed, training tons of models because we had to get

the models to be, you know, state of the art,

And as a developer, you know, you just wanna, like, hack something together. It's like, okay. I need this feature and this feature, and

in a traditional

start up or product, the majority that where there's no, like, data science or machine learning involved,

the majority of the features are, like, CRUD features where you could just write out on a board. Okay. I need this, and it might take you, like, a month, but you can really

say, okay. I need this, this, this, and this, and this. Whereas with machine learning, it's a lot of trial and error and experimentation,

and models take, like, 4 weeks to train, and then you may not know that it didn't work or you had an issue until 4 weeks later. So then you all of a sudden, it's 3 months later and you've only iterated 3 times.

So,

you know, the 1st 2 years, we spent just doing a ton of research and development, and I think

that was challenging. That was a challenging time because you you have to get the tech to a certain part. Now

we've passed that and things are, you know, are fortunately going a lot better now, but, yeah, I think that was 1 of the more challenging parts, being a really small team in the beginning

and having to just really have the patience where, you know, a traditional start up, you can quickly ship something, you can quickly get users, you can quickly iterate, you can quickly keep going.

As a machine learning start up, like, you have to reach some certain level of performance before you can go to market, and it takes a long time. It takes a long time if you're doing something new or if you're trying to build something that's really, really high quality. So I think for any machine learning startup, that's probably challenging in the beginning, but that was something that was difficult in their early days.

Given the

requirement to go through all these iteration cycles before you can even start to launch, I'm wondering what you see as the

sort of downstream impact on the

growth and viability

of businesses

that are

driven by machine learning capabilities and the requirement

for investment to even be able to think about building a company around it?

It's hard to make, I think, generalized claims because I think it really depends on the type of model that you're building. If you're primarily working with text data,

I think the models are, in general, easier to to build and there's a lot more research

to build on top of.

Speech data is really difficult because it's high dimensional data. It's really challenging. The models are really big, so you need a lot of compute.

But, yeah, I mean, there is no getting around it. You need

more resources

to start as a machine learning company than as a

normal company because you need to have compute, you need to have data, you need to have people that can work on the models.

I would say that it probably takes longer for machine learning companies to get to market than a non machine learning company because,

you know, you yeah. You have to get your models working well enough.

So it is a challenge, and I think that it's 1 of the reasons why,

unfortunately, I think a lot of the developments

in deep learning are happening in these big tech companies

and not in start ups because there's a lot of barriers to entry.

And it's 1 of the reasons why we're excited about what we're doing because we really, you know, see this as an opportunity

to give

people

an option

for getting advanced

deep learning tech without having to go to a big tech company to get it. And that seems to be, you know, resonating with people, but it's why 1 of the reasons we're excited about what we're doing. But I I think, in general, there needs to be,

you know I don't know what the answer is, but I would love to see more, you know, research coming out of start ups and smaller companies and not just these big tech companies.

Because I think that when you have, like

you know, this is maybe getting more meta, but when you have, like, innovation

consolidated into just a few places, like in these, you know, big tech companies,

it it limits the creativity of of what's possible. It limits

the creativity. Whereas if you had, you know, 20, 30, a 100 different startups all working in different directions and having different research and different applications, it's it's more exciting. There's more ideas.

There's more creativity,

and

I don't think that's something that's unique to machine learning. I think, you know, the movie industry, right, back when it was, like, there were monopolies in the movie industry, That really restricted the type of content that was produced because you have, like, you know, the studios and the theaters. It was all, like, 1 entity, but when that

got broken up, you had more ideas being shared and more types of movies being created. It's maybe the worst analogy ever to to machine learning companies, but I think the point is it's difficult, and I think that scares a lot of people off. To be honest, I think it scares off investors

too a lot of the times because it's like, oh, well, there's, you know, big tech companies you're competing with. But I would love to see more of it because I think it's important that this research happens outside of big tech companies too. For people who are excited about the opportunity for speech recognition and they want to, you know, experiment with it or incorporate it into some product or tool that they're building, what are some of the cases where AssemblyAI

might be the wrong choice and they're better suited with

building their own models or going to something like Google or an Amazon?

Sure.

Never. No. No. I'm just kidding.

Right now,

medical transcription is not something that we support that strongly.

It's something that we're planning on, but I would

truly really say I mean, for most use cases, we're pretty good option. In a lot of cases, the best option. Medical transcription is 1 where there's, you know, there's a lot of unique vocabulary

and terms that need to be supported so that's creates its own challenges. But other than that, I mean,

pretty good option.

Alright. And in terms of the near to medium term future of what you're building at AssemblyAI,

what are some of the new features or new capabilities

or new internal engineering projects that you're focusing on?

Yeah. So we've got a big

accuracy update that we're launching in August, actually, which is using brand new neural network architecture

that we've been working on for the last couple of months. And we're really excited about that because that's gonna

get our transcription accuracy,

you know, 1 step closer

to

human level. So that's pretty exciting, and we're working on getting that live and updated.

We just had a big update to our real time transcription

API, so that's where

you can actually stream audio in real time and get transcriptions back in real time. It's been a big update

to that API

that a lot of developers and customers are using.

And then another thing that we're really excited about are these, you know, these add ons that I spoke about, like our key phrase extraction and sentiment analysis and, you know, we have really robust PII detection and redaction. So if you send something in

that has a credit card number spoken or birthday spoken or a name spoken, we can automatically detect that and

remove it from the transcript before we return it to you so that you can make sure that your data is staying compliant with whatever

regulations or whatever you need to. So

that's really exciting. We've got a lot of good things coming out on that front in terms of the additional things on top of the transcription

that are coming out pretty soon too. So for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose the author HP Lovecraft.

So he's

an author from late 1800, early 1900,

did a lot of sort of, you know, sci fi horror kind of short stories and novels. Definitely just a very interesting author in a very interesting time for literature. So definitely recommend picking up anything by him if you're looking for something to read. And so with that, I'll pass it to you, Dylan. Do you have any picks this week? Yeah. We'll also talk about a book. I just started reading this book called Project Hail Mary. Have you read that? I have not. It's by the same guy who wrote The Martian, you know, the movie they made with Matt Damon. It's really good. It's about

this, like, space algae that is consuming the sun. I've only gotten, like, halfway through, but really good sci fi book, Really exciting. I would definitely recommend people check that out if they haven't. It's called Hail Mary by Andy,

w e I r, Weir? Weir? I don't know how you pronounce his last name. Alright. I'll definitely have to take a look at that. So thank you very much for taking the time today to join me me and share the work that you're doing at AssemblyAI.

It's definitely a very great product, 1 that I've been having fun experimenting with and look forward to building some tooling around for doing transcription for the podcast.

So thank you for all of the time and energy you're putting into that, and I hope you enjoy the rest of your day. Yeah. Thanks for having me on. It was great to be here.

Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at data engineering podcastdot

com for the latest on modern data management.

And visit the site of pythonpodcast.com

to subscribe to the show, sign up for the mailing list, and read the show notes.

And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com

with your story.

To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.

The Python Podcast.init

Summary

Announcements

Interview

Keep In Touch

Picks

Closing Announcements

Links

The Python Podcast.__init__