Summary
Technologies for building data pipelines have been around for decades, with many mature options for a variety of workloads. However, most of those tools are focused on processing text-based data, both structured and unstructured. For projects that need to manage large numbers of binary and audio files, the list of options is much shorter. In this episode Lynn Root shares the work that she and her team at Spotify have done on the Klio project to make that list a bit longer. She discusses the problems that are specific to working with binary data, how the Klio project is architected to allow for scalable and efficient processing of massive numbers of audio files, why it was released as open source, and how you can start using it today for your own projects. If you are struggling with ad-hoc infrastructure and a medley of tools that have been cobbled together for analyzing large or numerous binary assets, then this is definitely a tool worth testing out.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Do you want to get better at Python? Now is an excellent time to take an online course. Whether you're just learning Python or you're looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top-notch course for you. If you're just getting started, be sure to check out the Python for Absolute Beginners course. It's like the first year of computer science that you never took compressed into 10 fun hours of Python coding and problem solving. Go to pythonpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That's pythonpodcast.com/talkpython, and don't forget to thank them for supporting the show.
- Python has become the default language for working with data, whether as a data scientist, data engineer, data analyst, or machine learning engineer. Springboard has launched their School of Data to help you get a career in the field through a comprehensive set of programs that are 100% online and tailored to fit your busy schedule. With a network of expert mentors who are available to coach you during weekly 1:1 video calls, a tuition-back guarantee that means you don’t pay until you get a job, resume preparation, and interview assistance there’s no reason to wait. Springboard is offering up to 20 scholarships of $500 towards the tuition cost, exclusively to listeners of this show. Go to pythonpodcast.com/springboard today to learn more and give your career a boost to the next level.
- Your host as usual is Tobias Macey and today I’m interviewing Lynn Root about Klio, an open source pipeline for processing audio and binary data
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what Klio is and how it got started?
- What are some of the challenges that are unique to processing audio data as compared to text?
- What use cases does Klio enable?
- What are some of the alternative options available for working with binary data?
- What capabilities were lacking in other solutions that made it worthwhile to build a new system from scratch?
- Can you describe the design and architecture of Klio?
- What was the motivation for implementing Klio as a Python framework, rather than building on top of the Scio project?
- How much of a challenge has it been to interface to the Beam framework from Python? (Java <-> Python impedance mismatch)
- One of the interesting optimizations in Klio is the option for bottom up execution of a job to avoid processing a given file unless absolutely necessary. What are some of the other useful or interesting capabilities that are built into Klio?
- What was the motivation and process for releasing Klio as open source?
- For someone who is building a pipeline with Klio, can you talk through the workflow?
- What are the extension and integration points that are exposed?
- How does Klio handle third party dependencies for a given job?
- What are some of the challenges, misunderstandings, or edge cases that users of Klio should be aware of?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while building and growing the Klio project?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Klio used?
- What do you have planned for the future of the project?
Keep In Touch
Picks
- Tobias
- Lynn
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Klio
- Spotify
- PyLadies SF
- Luigi
- RAML
- ramlfications
- Interrogate
- Apache Beam
- Librosa
- PyAudio
- Pillow
- FFmpeg
- ImageMagick
- Music Information Retrieval
- Machine Hearing
- Scio
- Microsoft Azure
- Google Cloud Platform
- Google Cloud Dataflow
- Protocol Buffers
- Apache Spark
- PySpark
- DAG == Directed Acyclic Graph
- ISMIR Conference
- Digital Signal Processing (DSP)
- Python Pickle
- Research paper on separating vocals from instrumentals of a song
- New York Times: Why songs of the summer sound the same
- Microsoft’s Rocket Platform for video analytics
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. And do you want to get better at Python? Now is an excellent time to take an online course. Whether you're just learning Python or you're looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top notch course for you. If you're just getting started, be sure to check out the Python for Absolute Beginners course. It's like the first year of computer science that you never took compressed into 10 fun hours of Python coding and problem solving. Go to pythonpodcast.com/talkpython today and get 10% off the course that will help you find your next level.
That's pythonpodcast.com/talkpython. And don't forget to thank them for supporting the show. Your host as usual is Tobias Macey. And today, I'm interviewing Lynn Root about Klio, an open source pipeline for processing audio and binary data. So, Lynn, can you start by introducing yourself?
[00:01:47] Unknown:
Yeah, sure. Thanks for having me. My name is Lynn Root. I am a staff engineer at Spotify. I've been at Spotify for a little over 7 years now, and for the past 2 years, I've been working on audio processing infrastructure.
[00:02:02] Unknown:
And do you remember how you first got introduced to Python?
[00:02:06] Unknown:
Like it was yesterday. Yes. So I myself have, like, a bachelor's degree in finance and economics, but I wanted to, you know, pursue a master's degree in financial engineering. And in order to apply, I had to, like, learn how to code. And so I got exposed to Python, among other languages, and I basically said, screw finance, programming is so much more fun. So then I started, you know, teaching myself Python, and in doing so, started the PyLadies chapter in San Francisco. This was just basically me finding friends to learn Python with me. And soon after, I joined Spotify and then became known as our internal Python person, maintaining a lot of infrastructure tooling for developers to use for the language.
Somehow I got lucky enough to get stuck with the Python 2 to 3 migration roadmap management. So I got quickly introduced to Python, and I've kind of stuck with it since.
[00:03:09] Unknown:
And so you mentioned that you've been responsible for a lot of the internal tooling. And I know that Spotify uses Python in a number of areas, most recently with this Klio project. But I'm wondering if you've also been involved with things like the Luigi project or some of the other interesting things that you've been tasked with aside from just the Python 2 to 3 migration.
[00:03:28] Unknown:
For a hot second, I was a maintainer for Luigi. But I was on a team that didn't own Luigi, so it didn't really make sense for me to maintain it beyond, like, a few months. In terms of other, like, open source Python-related things from Spotify, maybe in 2015, I released a tool to help parse what's called RAML, the REST API Modeling Language. It's basically like YAML, but specific to defining a REST API. And then more recently, not under the Spotify name, but as a part of my job, I wrote and released a tool called Interrogate, which helps you identify what code is missing its documentation, kind of like test coverage, but to make sure you have documentation coverage.
[00:04:12] Unknown:
So in terms of the Klio project, can you give a bit of an overview about what it is and some of the origin story behind it?
[00:04:19] Unknown:
So basically, Klio is a Python-based framework for data processing. It's actually built on top of and wraps Apache Beam. And Apache Beam is an open source project. It actually originated at Google and provides a way to, I guess, define batch and streaming pipelines while being quote unquote embarrassingly parallel. That's their quote. So Klio is built on top of Beam, and it could be used for general data processing, but Klio is a bit more opinionated, and it's actually particularly tuned to process media, or I guess any sort of binary file. And so the generic use case is processing large files, but it's also meant to handle particularly resource intensive processing.
So for instance, digital signal processing, where you look at the actual waveform of a media file. So you might be using packages like Librosa or PyAudio or pydub, or, like, image stuff like PIL or Pillow, or maybe some system level tools like FFmpeg or ImageMagick, or maybe you need to run a bunch of media files through an ML model that might be trained to do some sort of signal processing for you. And so the reason that Klio got started was researchers and ML engineers at Spotify were getting deeper and deeper into this audio processing space and wanted slash needed infrastructure to support that processing at scale.
You can imagine we have a lot of analytics and models that are based on the metadata of audio, like artist, genre, that kind of thing. But we also have the actual audio file itself to look at. And so I kind of want to step back a bit just to give context for those who are not familiar, because I wasn't until 2 years ago. Audio can be represented as, like, a wave. I guess if you were to go back to algebra, you could draw a little graph where the x axis is time and the y axis is amplitude. The waveform can be just drawn going up and down with y equals 0 at its center. And so the higher the amplitude of the wave, the louder it sounds.
And then these waves can be sinusoidal, like smooth up and down, or triangular or square or sawtooth. Then there's also frequency, which is how often that wave repeats in a given time. And so with audio, frequency translates to pitch, where a waveform at a certain frequency translates to, like, a specific note, like an A4 or something like that. And so looking at this information can reveal some really cool things, something as simple as estimating a song's tempo, since you can see all the beats of a song in its wave. You can track its pitch from a waveform, providing the ability to estimate what key the song is in. You can start looking into patterns as well, like an audio file that's just spoken word, like a podcast, will look different than a pop song.
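To make that concrete, here is a minimal sketch of the kind of waveform analysis being described, using the Librosa package mentioned in this episode. This is just an illustrative example, not Klio code, and the file name is a hypothetical placeholder.

```python
# A sketch of basic waveform analysis with Librosa (not Klio itself).
import librosa

# Load the audio: y is the waveform as a NumPy array of amplitude
# samples, sr is the sample rate (samples per second).
y, sr = librosa.load("song.ogg")  # hypothetical file path

# Estimate the song's tempo by detecting beats in the waveform.
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
print("Estimated tempo (BPM):", tempo)

# Chroma features map spectral energy onto the 12 pitch classes,
# a common starting point for estimating what key a song is in.
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
```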
We can detect, does an audio file have speech in it? And then maybe from there, detect its language. And this isn't necessarily anything new that Spotify is doing. It's called, like, music information retrieval. I mean, it's been an actual proper, like, academic subject matter for at least a couple of decades now. But the idea of machine hearing is kind of new still, maybe a decade old, where we try to teach machines to listen to audio as humans do. So, a couple of years ago, our researchers and engineers started to do some real novel things with this signal processing and machine learning, or machine hearing. For instance, it's always been difficult to take a single song and separate out what's called stems, like vocals and instrumentals.
But about 3 years ago, researchers essentially found a way to do it, and there's actually even a white paper about it. And from there, they wanted to essentially do that for our whole Spotify catalog, which, as you can imagine, is huge, with, like, tens of millions of song or music files. And that's not even including podcasts. So these folks, you know, they started to slap together their own duct tape system, but quickly realized that they're not infrastructure engineers. They had a bit of difficulty scaling, and the fact that they had a lot of required maintenance kind of took away a lot of time from their actual research and development.
And so that's essentially how Klio was born. It's, you know, built by infrastructure engineers who should know a thing or 2 about scaling, and hopefully about observability and reliability, and it just empowers researchers and other engineers, or anyone interested in, I guess, audio processing, to only care about the processing logic itself.
[00:09:24] Unknown:
And you mentioned a lot of interesting problems that you're working to solve by being able to process these audio files, particularly at the scale of Spotify, where you have so many different assets that you need to be able to process, between music and, recently, podcasts. I'm interested in understanding some of the specifics of the problems that are faced by engineers and researchers who are trying to work with this audio and binary data, as compared to some of the other types of large scale stream processing or data processing that is focused primarily on text or structured data.
[00:09:59] Unknown:
It's quite different, the problem space for binary versus, like, text-based data. It is, you know, inherently a lot quicker to crunch some analytics on audio metadata, like, you know, trying to figure out what songs to recommend to a user based on songs that are labeled in a similar genre as some other songs, and that computation, I guess, should be relatively quick. Dealing with, like, binary data brings a lot of resource constraints that you wouldn't, I guess, otherwise be concerned with. For instance, the average song, like 3 to 4 minutes in duration, might be about 8 megabytes in size, and that's the compressed size.
Downloading that takes up network bandwidth, and potentially disk space if you're sending it to disk. And decompressing that audio to actually load it into memory to be able to analyze it will then take up 40 megabytes of RAM. Processing 40 megabytes of data itself can take maybe 1 to 2 minutes of CPU time, depending on what you're doing, maybe even longer. Maybe you actually need GPUs to do all this processing. And then if you're using frameworks like TensorFlow, that's yet another heavy memory footprint. Some libraries might not be well engineered for concurrency, or not thread safe in terms of processing, so that might actually force you to write a pipeline that can only be 1 thread per process.
A lot of these audio frameworks in general are just actually wrappers around tools implemented in other languages, or have C bindings, or maybe even need to, like, shell out to call, like, FFmpeg or something like that. So you need to make sure that you have your runtime environment set up properly with all those system level dependencies. So I guess overall, it does get a bit hairy, but it's, like, a super interesting space to be in, because I don't remember the last time I really had to deeply care about lower level constraints like this, especially with the advent of cloud computing.
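As a back-of-envelope check on those numbers, here is the arithmetic for a 4 minute song decoded to the float32 samples that analysis libraries typically work with. This is a sketch; the exact figures depend on sample rate and channel count.

```python
# Rough memory math for a decoded 4-minute song (illustrative only).
duration_s = 4 * 60        # 4 minutes of audio
sample_rate = 44_100       # CD-quality samples per second
bytes_per_sample = 4       # float32, the usual analysis format

# Decoded mono waveform size in RAM:
mono_bytes = duration_s * sample_rate * bytes_per_sample
print(mono_bytes / 1e6, "MB")  # ~42 MB, close to the ~40 MB cited
```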
[00:12:00] Unknown:
Digging more into the Klio architecture itself, I know that you mentioned that it's built on top of Apache Beam, which is a framework for being able to handle both batch and stream processing over large volumes of data and on arbitrary compute substrates. And particularly because of these performance sensitive characteristics, I'm curious what the decision process looked like as far as choosing the underlying framework to work with, but also the decision to use Python versus something that might be closer to the metal as far as the performance optimizations.
[00:12:35] Unknown:
So that's a good question. We definitely looked elsewhere first, just kind of tried to understand the lay of the land. We didn't want to necessarily write something that already existed. There are certainly, like, a lot of audio processing tools out there, but that's not necessarily the problem. The problem here is scaling with resources, right, as you said. And so we actually kind of looked at data processing frameworks in hopes that they could help with our requirements. So beyond Beam, we looked at stuff like Argo or Airflow, even Luigi, some others. And, I mean, there are so many, I guess there's a new data processing framework popping up every week. But some frameworks, you know, don't support streaming or event based processing, only batch processing.
Some didn't support being able to configure the types of hardware that you needed, which is very job specific, since one job might need more memory, or the ability to provide, like, a Dockerized environment, or to specify the exact runtime requirements needed. And so some of the problems that we needed to solve: we did need that streaming slash event based support, so folks can process audio as soon as it's available in our catalog. We had users with very different, like, hardware requirements, so it had to be flexible in that regard. And we have, you know, so many different frameworks that folks can use, so we needed to ensure that, like, the runtime environment can essentially be hard pinned, like versions and stuff like that, because it is research.
It needs to be reproducible. So again, Dockerization of the environment was ideal. We do have Scio, which is a Scala implementation of Beam, and we could have looked deeper into that. But the actual lingua franca, I guess, of audio processing and ML is Python based. So researchers and ML engineers are far more comfortable, I guess, in Python than Scala or another JVM based language. And then we have actual tools developed in Python. There is some JVM based tooling for folks for, like, ML processing or audio processing, but it definitely does not have the reach or the development that a lot of these Python libraries do. And so that kind of forced a decision for us: we want to provide tools that are actually, you know, helpful to engineers, rather than forcing them to learn another language, adapt, you know, kind of lesser libraries to their needs, and, I guess, translate from, like, Python into JVM equivalents. So we went with Python.
We chose Beam specifically because Beam was already widely accepted in our infrastructure. Spotify's motto is to kind of keep things simple, I guess: not too many competing products to do the same thing. We want folks to have, like, less cognitive overload when writing their pipelines. So if they are familiar with Beam for a simple data pipeline, then they should be familiar with Beam when writing a media processing one. So, yeah, that's how we eventually arrived at building on top of the Beam framework with Python.
[00:16:00] Unknown:
You mentioned briefly some of the constraints that you were working with and the design goals of the project. And I'm wondering if you can dig a bit more into some of the alternative options that you looked at for being able to work with this volume of binary data and be able to fulfill the use cases that you were targeting and highlight some of the capabilities that were lacking in those options or some of the edge cases that they weren't able to cover that made it worthwhile to invest the time and effort into building this new system from scratch?
[00:16:31] Unknown:
It's kind of difficult to answer this question, because there weren't really competing full fledged libraries that handle this, like, binary data. About a year into our development, I think Microsoft released a framework that actually does, like, video processing, and I think it's event based. It supports, like, ML, but it's particularly targeted at, like, analytics generation, not necessarily heavy actual media processing itself. Also, it's closely tied to Azure, and we are a Google Cloud Platform based company. And as well, like, we were already a year into development, with folks starting to use our product. A lot of the existing processing frameworks had bits and pieces of what we need, but you can't exactly cobble together a processing framework from other processing frameworks. Right? So it was better for us to essentially build upon something that gives us the biggest head start, I guess, which ended up being Beam, because they do support the ability to configure your resource requirements, like, you know, more memory or CPUs, etcetera.
They do support running on Dataflow, which is in Google Cloud, so we don't have to actually manage the running of the pipelines ourselves. It already has a Python SDK, so you can actually write, you know, a vanilla Beam job that kind of looks similar to Klio, but, you know, you'd have to do a lot of the boilerplate that Klio does. So it's actually quite easy to go between the 2 if need be. So, yeah, Beam had a lot of the basic needs, and also the ability to customize our runtime environment with, you know, whatever dependencies we need. And so with the framework, there's a lot of architecture around it, and it tries to be a bit more opinionated.
[00:18:32] Unknown:
In terms of the actual design of the framework, you mentioned that Klio provides a decent amount of boilerplate for people who are looking to build these jobs on top of Beam. I'm wondering if you can talk a bit more about the design of Klio itself and some of the capabilities that you've baked into it to simplify this work of processing these large binary files and being able to integrate with the broader ecosystem.
[00:19:00] Unknown:
Klio, I guess, tries to do a lot of things and essentially make it stupid easy to write an audio processing pipeline. So when you're writing, like, a Klio job, you're essentially, you know, writing a Beam pipeline, which should look familiar, but all you need to do is just write what the processing logic is. From a user's perspective, you have, like, a CLI tool that maybe creates, like, some template files for you for a new job, and you just add some job specific boilerplate code that defines your processing logic in the pipeline. That pipeline itself might be a few steps. Like, first download this file from this, like, cloud storage bucket, pipe that into logic that actually handles processing, like stem separation, like separating vocals from instrumentals, and then take that output. And now you have, like, 2 outputs, and you upload them to their respective cloud storage buckets.
So, like, the CLI helps with the management, including building a Docker image for the job, deploying a job, stopping it. What you actually don't see is that when you run the job locally, everything is actually running in a Docker container, not on, like, the host itself, and this is to help with consistency of environments. But if you step back a bit and look at the job more generically, it typically has 2 types of input: an event input, which is, like, a trigger, and then another input, the actual binary file to process. Similarly, it has, you know, 2 outputs: the event output and the data output. And so a job receives an event, where the event contains a unique reference that points to a binary file.
This is basically a trigger saying some work needs to be done, and then that gets mapped to the actual binary data to look up in some, like, cloud storage bucket. The job then downloads the file, does the processing, uploads the result, as I said before, to another bucket, and emits an event to its event output signaling that the processing has been done for that particular file. And this is kind of akin to, like, a microservice back end, but instead of, like, the request response paradigm, it's actually events and queues. And so that's what, like, a single Klio job looks like from a user's perspective.
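As a rough illustration of that shape, here is a plain Apache Beam sketch of a download, process, upload job. This is not Klio's actual API, just the underlying Beam pattern that Klio wraps; download_audio, separate_stems, and upload_audio are hypothetical stand-ins for your storage client and processing logic.

```python
import apache_beam as beam

# Hypothetical stand-ins for storage access and the actual ML logic.
def download_audio(uri):
    return b"..."        # would fetch the binary file from cloud storage

def separate_stems(audio):
    return audio, audio  # would run the actual stem separation model

def upload_audio(uri, data):
    pass                 # would write the result back to cloud storage

class SeparateStems(beam.DoFn):
    def process(self, file_id):
        # The incoming event is just a unique reference to a binary file.
        audio = download_audio(f"gs://input-bucket/{file_id}.ogg")

        # User-supplied processing logic: split vocals from instrumentals.
        vocals, instrumentals = separate_stems(audio)

        # Upload each result to its respective output bucket.
        upload_audio(f"gs://vocals-bucket/{file_id}.ogg", vocals)
        upload_audio(f"gs://instrumentals-bucket/{file_id}.ogg", instrumentals)

        # Emit the reference downstream as the event output.
        yield file_id

with beam.Pipeline() as pipeline:
    (pipeline
     | beam.Create(["track-123"])   # stand-in for the event input queue
     | beam.ParDo(SeparateStems()))
```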
And if you step back a little bit further, you can also connect jobs together based on these event inputs and outputs. Say, for instance, I may have a colleague that wants to look at the vocals of a song, maybe to detect its language. So they can write a job that subscribes to, like, my job that separates vocals from instrumentals, and then you essentially have, like, a graph forming. This kind of provides a way to build upon each other's work, you know, a bit easier, rather than kind of being siloed or sharing code and doing duplicate work, essentially saving on compute time. And so say that you want to process songs as soon as a label delivers them, like, a music label delivers them.
We actually have this really cool feature backed by Klio that allows users in Japan to, like, sing along to songs, so that you can, you know, essentially turn off the vocals of a song and just hear the instrumentals, and then see its lyrics timed along with the song. So, you know, have fun singing along. But it would be a bit frustrating to not be able to sing along to a song that was just released, just added to our catalog. So perhaps you want to, you know, process on demand. So Klio allows you to hook into a queue, like Pub/Sub, to listen for an event for that new song. But that event producing system doesn't necessarily have to be a Klio job itself, just so long as it has data that you need to process, like a reference to an audio file.
And so this leads me to kind of, like, a bigger part of Klio's architecture, which is what's in that event data, and we refer to it as, like, a Klio message. And it's just basically a protobuf message that contains Klio related metadata as well as what the pipeline's transforms actually see to process, like a reference to a file. In reality, you can set up your Klio job to process any sort of data. Like, it can be just, like, JSON data or something like that. But Klio essentially uses this pointer, whatever that data is, probably an audio file, plus the job's configuration to find that audio file. And then it makes it easy for the user to say, okay, just go ahead and download this file, you already know where to find it. But then under the hood, Klio first actually checks to see if the output for that particular file exists.
So say that your output is, you know, 2 audio files, of vocals and instrumentals, and they live in separate buckets in some cloud storage, Google Cloud in our case. Klio takes the file name and looks up the file name in both those buckets. And if it discovers that both already exist, it doesn't actually process it, unless you, you know, explicitly want it to, so it doesn't spend unnecessary resources. But then maybe the output does not yet exist, so, you know, we should try and process it. But then we first need to check to see if our input data exists, if that original audio file is available. If it doesn't, the job can be set up to trigger its parent job to generate that input for it, its dependency input. And so these little checks allow us to do some interesting things with how the overall graph of jobs gets executed.
We essentially support 2 modes of execution, what we call top down and bottom up. So with top down, every job in a graph gets triggered. So this is useful when you want to process audio as soon as it's available in the catalog or the dataset: an event triggers the apex job to process a new file, and then all the child jobs get triggered once that apex job finishes, and then their children get triggered, and so on and so on. But it's not necessarily efficient or necessary for every job in a graph to process a file. Maybe you only want to trigger your job, but not any downstream jobs.
So you trigger your own job for it to do work for a file. Maybe the input data dependencies are missing, meaning that its parent job hasn't produced the output for that particular file. So the parent job then gets triggered, and it recurses all the way up as far as it needs to in the graph. And so basically all the data dependencies get generated, but not all child jobs are subsequently executed, only the jobs that are in the direct path of the original job. So, like, no sibling jobs and no downstream jobs are doing unnecessary processing. So this is essentially describing bottom up execution.
It's to help optimize for, you know, the costs of resources and processing time, because it can be expensive to unnecessarily process, you know, audio files.
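For a sense of how those checks fit together, here is a minimal sketch of the existence-check logic behind bottom up execution as just described. This is not Klio's internals; it uses the google-cloud-storage client, and the bucket names and trigger_parent_job helper are hypothetical.

```python
from google.cloud import storage

client = storage.Client()

def trigger_parent_job(file_id):
    # Hypothetical: publish an event to the parent job's input queue.
    pass

def should_process(file_id):
    # Skip the work entirely if every expected output already exists.
    outputs = [("vocals-bucket", file_id), ("instrumentals-bucket", file_id)]
    if all(client.bucket(b).blob(name).exists() for b, name in outputs):
        return False

    # Output is missing, so check the input dependency. If the input
    # doesn't exist either, recurse up the graph by triggering the
    # parent job to generate it first.
    if not client.bucket("input-bucket").blob(file_id).exists():
        trigger_parent_job(file_id)
        return False

    return True
```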
[00:26:31] Unknown:
1 of the pieces that I'm interested in digging into a bit more is the ability to compose these different pipelines together, because I know that with a number of different frameworks for being able to build out some of these DAGs of tasks, it can be difficult to actually wire them together, where, you know, 1 pipeline is largely self contained. But it sounds like, from what you're saying, that if somebody builds a particular workflow for processing some of this audio data and extracting a certain set of information from it, somebody else can then discover that pipeline and then be able to hook in their additional processing based on what was already completed upstream, or maybe even at some node in the midst of that task graph. Is that accurate? And then in terms of the actual integration with Beam, I know that particularly in things like Spark and PySpark, people complain about some of the impedance mismatch between the ways that the Spark framework thinks about objects and memory allocation and the way that Python is trying to do it. I'm curious if you've seen a similar impedance mismatch between the Python SDK and the actual Beam execution layer.
[00:27:42] Unknown:
That's an interesting question, because we have seen, I guess, a few, but Python, in my opinion, is flexible enough to help work around that. There are still some, like, headaches. It's fairly difficult to control concurrency from the Python SDK part down to, like, the Java part. Sometimes you do need to, like, limit processing to only 1 thread or 2 threads per process, or, like, maybe 1 thread per core, because, you know, you have a really memory intensive TensorFlow model or something that's not thread safe. So the ability to do that doesn't necessarily exist within at least the Apache Beam part of it, and the underlying Java framework doesn't expose any kind of knobs that you can turn to help limit it on that end. So a lot of times we needed to kind of force something like this feature of concurrency management into Klio itself. What Klio does is try to, like, not make that a concern for its users, so you don't get, you know, that kind of complaint from users. It kind of abstracts away all those kinds of difficulties.
But I do have to say that there hasn't been much of that issue between, like, the Java and Python SDKs, or the underlying Java implementation that the Python SDK talks to. It's honestly been very kind of easygoing, and no other sort of real big, insurmountable challenges have come up due to differences between the 2 languages.
[00:29:17] Unknown:
And then digging more into some of the optimizations that you've built into Klio, particularly for things like the bottom up processing that you mentioned, of being able to determine whether an upstream job needs to be re executed or not based on your downstream requirements. I'm wondering if there are any other useful or interesting capabilities that you've built into Klio for being able to handle some of these particular edge cases, or some of the optimizations that are necessary for working with these large files and the volume of IO that's necessary as a result.
[00:29:49] Unknown:
Yeah. There's a handful, and we have so many ideas in development or on our roadmap. But right now, there are tools that allow users to download files, but download them to memory, so you don't, you know, write to disk and take up disk space. And then, in doing that, you have, like, a choice of how it can be pickled. For instance, a lot of, like, the waveforms are represented in NumPy arrays, and NumPy's pickling optimizes for memory, whereas the standard lib pickle optimizes for speed. So you do have that choice. It's quite nice.
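As a sketch of what downloading to memory looks like (again, not Klio's actual helper API; the bucket and file names are hypothetical):

```python
import io

import librosa
from google.cloud import storage

client = storage.Client()
blob = client.bucket("input-bucket").blob("track-123.ogg")

# download_as_bytes keeps the file entirely in RAM, so there is no
# temp file on disk to clean up afterwards.
buf = io.BytesIO(blob.download_as_bytes())

# Librosa can load from the in-memory file-like object directly,
# yielding the waveform as a NumPy array, ready to be pickled.
y, sr = librosa.load(buf, sr=None)
```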
Sometimes you need to, like, debug a DAG of jobs, but you don't necessarily want to make it do work. You're able to kind of debug a DAG of jobs by sending, like, a message in ping mode. So you can kind of track what would happen if this audio file were to trigger, you know, a certain job, and you can see it go through the graph: whether it would, you know, find output or input, or whether it, you know, raises an error at a certain place that you can easily find. It's a rudimentary approach to tracing, like, a back end request, kind of thing. And then there are some kind of cool, I guess, usability features that Klio has, in my opinion. I mean, it's stuff like, you know, running unit tests, which is just handy to have, and verifying that your configuration is set properly and that your declared resources exist, like whatever storage bucket you have declared. But it also allows you to profile your job's memory and CPU footprint locally.
This is particularly helpful to understand the kind of resources your job will require once deployed. And then there's even, like, an underappreciated command. We have a lot of ideas for it, we just need to develop them. It's this idea of auditing your code, so you can audit your job. Right now you can audit for things like not cleaning up temporary files and using up disk space, or not being thread safe when you use certain libraries. It's quite handy. Those are all, like, the features and cool tools that we're trying to provide to sort of help with that resource intensive processing.
[00:32:07] Unknown:
Python has become the default language for working with data, whether as a data scientist, data engineer, data analyst, or machine learning engineer. Springboard has launched their school of data to help you get a career in the field through a comprehensive set of programs that are 100% online and tailored to fit your busy schedule. With a network of expert mentors who are available to coach you during weekly 1 on 1 video calls, a tuition back guarantee that means you don't pay until you get a job, resume preparation, and interview assistance, there's no reason to wait. Springboard is offering up to 20 scholarships of $500 towards the tuition cost exclusively to listeners of this show.
Go to pythonpodcast.com/springboard today to learn more and give your career a boost to the next level. In terms of the actual project itself, I'm curious what the motivation was for releasing it as open source, and if there was any particular process involved in cleaning up the code for it to be publicly consumable, or if it was the intent from the outset to be open source, and if that helped to direct the overall design and architecture of the project to avoid tight coupling with the internal Spotify systems.
[00:33:23] Unknown:
From the beginning, Klio was certainly developed with the intention to open source. I mean, I've been heavily involved in Spotify's open source strategy for, like, 5 years now. So it wasn't really necessarily a question, it was kind of, like, the default approach. And we did get some hints early on that open sourcing a framework like this would be well received by the research community, or even some of our competitors. But we first wanted to make a viable product and to have our internal folks kind of kick the tires and make sure it actually has some traction.
And then with that, we did go through a couple of API redesigns and iterations, and then added some more optimizations, adding, like, highly requested basic features, like batch and stuff like that, because we were streaming first and still in a lot of development. And then we were comfortable releasing it. You are right to point out that there was a lot of time spent kind of cleaning up code first. I am particularly a perfectionist, so I tried to have things documented and tested so that, like, I'm not embarrassed to have my code out there. And so I did do, like, maybe a couple of months of kind of focused effort on actually cleaning the code base, as well as adding really good basic features, like, yep, like batch, or handling different inputs and outputs, like, you know, cloud storage or, like, local files, etcetera.
So, yeah, some time was spent just focusing on that, but we conveniently timed the open sourcing of Klio with a conference that we presented Klio at. It's called ISMIR, the International Society for Music Information Retrieval. So we basically hosted a tutorial to show off Klio to see, like, what their reaction would be. And so that was kind of the fire under our butts to get it out there. But also, I think we finally decided that, yes, we were comfortable. You know, we might be a little bit lacking in full documentation coverage or full test coverage, but we are also just as eager to, you know, learn from a broader range of users, not just, like, internal Spotify users, and maybe use cases we haven't thought of, and hopefully invite, you know, contributions, and essentially invite others to help improve and build upon Klio.
[00:35:45] Unknown:
For people who are actually using Klio for doing their own work, I'm wondering if you can talk through some of the workflow that's involved with getting a Klio project set up, and being able to handle third party dependencies that you might want, and managing the deployment and life cycle of the overall project.
[00:36:03] Unknown:
Yeah. So building a pipeline, I guess, starts off with a very, like, simple command, like klio job create. It creates, like, I guess, a cookie cutter template of a job. You just declare your dependencies in 1 or 2 places: the standard, like, requirements.txt for your 3rd party Python dependencies, and then maybe you need, like, system level dependencies, like actual, like, C libraries that maybe your Python dependencies plug into, or you might actually want to shell out to something like FFmpeg. Those you would declare in, like, the Dockerfile.
And so you add your dependencies, and you kind of go about your regular Python development. You kind of declare the actual processing that you want handled. Behind the scenes, Klio actually takes care of reading those event trigger inputs and writing the outputs, and it can handle, like, the data input reading and output writing for you. But essentially, all you have to worry about is, okay, you've received a trigger that has this reference ID, what should the job do now? And so maybe you take that file and run it through your model and then just return the result, or maybe you tell Klio to upload the result to a bucket. And so you really only have to focus on the actual business logic of the pipeline.
And then once that's all defined, you probably want to test it out locally. The CLI tool allows you to easily kind of get it running locally and see if it works. You're able to kind of publish a message to its queue from the CLI as well, and from there, you should be able to see some really pretty logs. And maybe it's good enough to write some unit tests. As I've kind of hinted before, Klio will run those unit tests in the Docker image for you, so it should be the same environment as it would be when it's deployed. Then maybe you're totally comfortable and confident and wanna, you know, deploy it. The klio job deploy command will just, you know, put it out there, put it on your configured runner, which right now is either the direct runner, the local runner, or Dataflow.
So on Google Cloud Dataflow, it should, like, then take on all of the heavy lifting of orchestration. You won't have to worry about spinning up all the resources needed. If you want autoscaling, you can configure that, so that if all of a sudden you have a huge load of audio files to process, it'll take care of scaling for you. With that kind of comes, like, built in metrics and observability, which Klio can help you kind of discover, like, help you find the dashboards that those metrics show up on. Maybe you have, like, a new model, like a retrained model, and you want to redeploy your job with this new, like, ML model. You then kind of, like, go through that process again of just, like, updating your code and kind of rebuilding the image, and then klio job deploy will take care of taking down that old job without losing the in process work for you. Like, the actual pipeline will kind of get drained: work in progress will continue, but any work still in the queue will just kind of be paused until the new Dataflow job is running.
And now you have, like, this new model deployed, and you kind of need to reprocess the old data that had been processed by the old model. You can then re trigger, like, all those audio files that need reprocessing with just, like, a simple forced trigger. So Klio will say, oh, I do find this output already existing, but you told me to force it, so we're gonna go ahead and process it for you. So, yeah, that's sort of, like, an overview or walkthrough of the general workflow of a Klio job.
[00:40:10] Unknown:
Yeah. And for people who want to build on top of Klio or integrate with it more directly, what are some of the extension or integration points that you have exposed for being able to expand its utility beyond just the core feature set?
[00:40:21] Unknown:
There are actually quite a few points of, like, extension. For instance, like, you would just have to write, like, a small little class to, like, read from or write to, like, an AWS service or Azure service, like SNS or S3 or Queue Storage or whatever. Similarly, we have, like, support for metrics, right now only for, like, log based metrics and Google Stackdriver, but it's extremely easy to extend the metrics client beyond that, maybe to StatsD or Grafana. And it would take a bit more code, but it should be possible to extend Klio to actually work on other runners supported by Beam. So Beam supports Dataflow, but also Flink or Spark, and it's just a little bit more legwork, but defining support for that kind of runner is definitely possible and relatively simple.
And right now, Klio has an optional library for, like, specific audio handling based off of, like, the Librosa package, but folks could easily make tools for easier pipelining of, like, image and video handling. So there are many, many areas, many touch points, that should be relatively easy to kind of extend.
[00:41:41] Unknown:
And for people who are building on top of Klio, what are some of the challenges or misunderstandings or edge cases that they should be aware of, or that you've seen people run into?
[00:41:51] Unknown:
There's, in general, like, a common misunderstanding when you're first learning about Klio, particularly if you're not in, like, the research space of, like, audio processing. It's that, like, Klio does the audio or ML processing for you, or is a framework for that, when really Klio is meant to help scale that processing in a distributed way. Sometimes that can be a misunderstanding when going into the project, you know, fresh. But then in general, the space itself is very complex and has a natural, like, steep learning curve. So you might have to do a bit of self education in order to get beyond, like, a simple hello world example.
And it's difficult for the user because it's hard to tell where Klio ends and where Apache Beam begins, or where Dataflow begins. You're interacting with Klio mostly, but Beam is doing, like, all of the pipelining work, and Dataflow, or another runner, is doing all the orchestration work. But hopefully, that doesn't necessarily discourage folks. What Klio has done is essentially make the nearly impossible possible. It is still complex, and what we're working on next is to make that complexity a bit more simple. So on our roadmap, we have ideas for tools like better audio chunking, or just file chunking. Right now, processing extremely large files, like hours long podcasts, is, like, less reliable, a little bit more shaky than, you know, the 3 or 10 minute, like, song. We're hoping to build more useful utilities in the near future to address that sort of use case of extremely large files.
[00:43:37] Unknown:
Or for the odd music track that actually happens to run for a solid hour.
[00:43:41] Unknown:
Yeah.
[00:43:42] Unknown:
And so in terms of your own experience of working on the Klio project and being involved in this space of audio data processing, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:43:57] Unknown:
Of course, I had to learn a lot about audio processing. I still couldn't write, like, a PhD dissertation based off of it, but, yeah, I had to do a crash course in, I guess, digital signal processing, kind of thing, reading a lot of textbooks. But actually, the challenge that was more in my face was understanding pickling. I mean, I know pickling, but I hadn't actually been exposed to it before this, and specifically with how Beam pickles. And so when launching a pipeline, particularly for, like, a remote runner like Dataflow, user code gets pickled locally and then unpickled on worker machines.
So anytime pickling is used, it kind of can limit what you can do. And so this is actually 1 of the reasons why we Dockerize everything: to make sure that the environment in which you launch your pipeline, the environment that gets pickled, is the same environment that the pipeline itself gets unpickled in on the remote worker. And so that might be obvious to some, but I had not yet been exposed to that problem space. But with that deeper understanding of pickling can come ideas of how to abuse it, I guess. So during unpickling, there's, like, this dunder __setstate__ method defined in the class that gets called when unpickling an instance.
And this provides us with, I guess, ways to manipulate the worker environment as it starts up. Now, Klio doesn't do this yet at all, but we could, and we've seen it used within Klio jobs to maybe start daemons on workers, like profilers, like the Stackdriver profiler, which can be really useful. A generic kind of problem that I have regularly is kind of designing a framework for other developers. For instance, how best to design an API that makes sense and feels ergonomic to use. And then, like, how much magic should we provide the user? How much of that magic will actually limit the user, maybe too much?
How can we provide flexibility for the user while still being opinionated? And then there's the maintenance and development cost for us when it comes to, like, making those decisions. I'm, like, regularly paralyzed by, like, framework design, making sure that it's good and well thought out. And so I find, personally, that to be a big challenge.
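To illustrate the dunder __setstate__ hook mentioned above, here is a toy example of code that runs only when an instance is unpickled, which is how a daemon like a profiler could be started on a remote worker. This is a hypothetical sketch, not something Klio does; start_profiler_daemon is a made-up stand-in.

```python
import pickle

def start_profiler_daemon():
    # Hypothetical side effect: start a background profiler here.
    print("profiler daemon started")

class ProfiledTransform:
    def __init__(self, name):
        self.name = name

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Runs during unpickling, i.e. on the remote worker rather
        # than on the machine that launched the pipeline.
        start_profiler_daemon()

blob = pickle.dumps(ProfiledTransform("stems"))
restored = pickle.loads(blob)  # __setstate__ fires here
```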
[00:46:25] Unknown:
And as far as the ways that Klio is being used, what are some of the most interesting or unexpected or innovative projects that you've seen built with it?
[00:46:34] Unknown:
Yeah. I think I mentioned this before, but I still think it's super cool: the sing along feature available in Japan, and I think other markets as well. That's basically based off of Klio generated data, separating vocals from instrumentals and just taking the instrumentals and the lyrics. I mean, we've also seen, internally, Klio used for, like, audio based ad generation, where you can, like, dynamically piece together someone's voice, like, reading an ad, with various background music. You can then create, like, multiple versions of the same ad. So this would help with, like, maybe you're listening to a Chopin album, and then it wouldn't be a great experience if that ad's background music was heavy metal.
So the fact that you're able to generate, you know, different types of ads with different backgrounds and kind of target based off of that provides a better experience. Some things might be boring to those, like, in this research space, but that I find super cool: we fingerprint audio, which means generating a unique identification, or a signature, for each audio file, or even segments of audio. And then that can be used for, like, deduplication, cover song identification, many things. And there are some other research based things that we're doing, including, like, song and speech transcription, and instrument separation, like separating the actual guitar from bass from drums.
And then we can build on top of that. Personally ideating here, but, like, maybe we could build automatically generated, like, sheet music with that. So that'd be really cool. As for external users, I haven't seen any actual, like, product of it, but I know external users are, like, really interested in trying out Klio for an auto DJing feature: basically, being able to create a playlist where, like, the tracks all fade in and out perfectly. A while ago, the New York Times had a Sunday Review feature. It's actually, like, a little web app, titled "Why Songs of the Summer Sound the Same."
And it's a cool website that, like, highlights how summer songs have certain attributes or qualities, like danceability or energy and loudness. The review itself actually goes on to show that, like, over the past 40 or 50 years, songs released in the summer, or for the summer, have become less diverse and have been converging on the same levels of danceability and loudness and other attributes. I don't wanna spoil it any further, and I'll add the link to this cool app in the show notes. But this kind of research is built upon our public audio features API. And the data for that public API is backed by Klio jobs, where the audio processing that takes place estimates how danceable a song is, or the acousticness, or the tempo, or what's called speechiness.
So, like, it's making research a lot easier and allowing us to also build some really cool features.
[00:49:38] Unknown:
And you mentioned a few things that you've got planned for the future of the project. Are there any other things that you'd like to call out for plans that you have or contributions that would be helpful to help drive it forward?
[00:49:51] Unknown:
Yeah. So there's definitely a lot of development ahead. I'd say that while Klio is production ready, it can do a lot more to help users focus on what they care about and to make things simpler. And so, yeah, this includes, you know, more optimization tools, like file chunking and improved concurrency management. And this is coming with, you know, more and more interest in processing podcasts, and just longer audio in general. But we're also hoping to improve observability: more observability features to allow you to be able to diagnose any sort of, like, resource constraints that your job might be having, and kind of uncover any sort of really ugly, like, deep down in the system issues. Basically, help diagnose, essentially.
We also have plans for better testing. Like, right now, you can certainly easily unit test, but what about integration tests, particularly if your job is a part of a bigger graph of jobs? In general, like, I'm curious, I guess, to see what the open source community will bring. I know that folks do wish to use this on Amazon's cloud architecture, as well as, like, kind of homegrown architecture, like maybe a self maintained Kubernetes cluster. And so that's what I hope to anticipate in terms of contributions. But in general, we're, like, a very small team, and so we can only focus on what's in front of us.
So I would be wholeheartedly excited for any contributions that, like, you know, support different use cases or improve usability, anything of that sort.
[00:51:34] Unknown:
Well, for anybody who wants to follow along with the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose the PSF fundraiser that's going on right now. They're looking to close out Q4 with some, hopefully, additional funds to help support the overall Python community. So if you have some time and are able to, I definitely recommend contributing back to the PSF for all the work that they do for the people who use Python. And so with that, I'll pass it to you, Lynn. Do you have any picks this week?
[00:52:06] Unknown:
Yeah. Just 1 pick. I hope this doesn't make me sound like a fangirl, but about a month ago, I started using this note taking tool called Roam. It's super easy to take notes and create a graph of your notes, and catalog everything that you might think of. It's just better aligned with how I approach my own note taking. So I highly recommend taking a look at it.
[00:52:26] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing on Klio. It's definitely a very interesting project and interesting problem space. So I'm excited to see where it goes, and the new capabilities that you're able to add to Spotify as a result of that, and some of the ways that the overall developer and research community can take advantage of it. So thank you for all of your time and effort on that, and I hope you enjoy the rest of your day.
[00:52:53] Unknown:
Thank you very much. I appreciate you having me.
[00:52:55] Unknown:
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to the Episode
Interview with Lynn Root
Overview of Klio Project
Challenges in Audio Processing
Choosing Apache Beam and Python
Klio Architecture and Design
Optimizations and Features in Klio
Open Sourcing Klio
Workflow and Deployment
Lessons Learned and Challenges
Innovative Uses of Klio
Future Plans and Contributions
Closing Remarks