Summary
Technologies for building data pipelines have been around for decades, with many mature options for a variety of workloads. However, most of those tools are focused on processing text-based data, both structured and unstructured. For projects that need to manage large numbers of binary and audio files, the list of options is much shorter. In this episode Lynn Root shares the work that she and her team at Spotify have done on the Klio project to make that list a bit longer. She discusses the problems that are specific to working with binary data, how the Klio project is architected to allow for scalable and efficient processing of massive numbers of audio files, why it was released as open source, and how you can start using it today for your own projects. If you are struggling with ad-hoc infrastructure and a medley of tools that have been cobbled together for analyzing large or numerous binary assets, then this is definitely a tool worth testing out.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Do you want to get better at Python? Now is an excellent time to take an online course. Whether you're just learning Python or you're looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top-notch course for you. If you're just getting started, be sure to check out the Python for Absolute Beginners course. It's like the first year of computer science that you never took compressed into 10 fun hours of Python coding and problem solving. Go to pythonpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That's pythonpodcast.com/talkpython, and don't forget to thank them for supporting the show.
- Python has become the default language for working with data, whether as a data scientist, data engineer, data analyst, or machine learning engineer. Springboard has launched their School of Data to help you get a career in the field through a comprehensive set of programs that are 100% online and tailored to fit your busy schedule. With a network of expert mentors who are available to coach you during weekly 1:1 video calls, a tuition-back guarantee that means you don’t pay until you get a job, resume preparation, and interview assistance there’s no reason to wait. Springboard is offering up to 20 scholarships of $500 towards the tuition cost, exclusively to listeners of this show. Go to pythonpodcast.com/springboard today to learn more and give your career a boost to the next level.
- Your host as usual is Tobias Macey and today I’m interviewing Lynn Root about Klio, an open source pipeline for processing audio and binary data
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what Klio is and how it got started?
- What are some of the challenges that are unique to processing audio data as compared to text?
- What use cases does Klio enable?
- What are some of the alternative options available for working with binary data?
- What capabilities were lacking in other solutions that made it worthwhile to build a new system from scratch?
- Can you describe the design and architecture of Klio?
- What was the motivation for implementing Klio as a Python framework, rather than building on top of the Scio project?
- How much of a challenge has it been to interface to the Beam framework from Python? (Java <-> Python impedance mismatch)
- One of the interesting optimizations in Klio is the option for bottom up execution of a job to avoid processing a given file unless absolutely necessary. What are some of the other useful or interesting capabilities that are built into Klio?
- What was the motivation and process for releasing Klio as open source?
- For someone who is building a pipeline with Klio, can you talk through the workflow?
- What are the extension and integration points that are exposed?
- How does Klio handle third party dependencies for a given job?
- What are some of the challenges, misunderstandings, or edge cases that users of Klio should be aware of?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while building and growing the Klio project?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Klio used?
- What do you have planned for the future of the project?
Keep In Touch
Picks
- Tobias
- Lynn
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Klio
- Spotify
- PyLadies SF
- Luigi
- RAML
- ramlfications
- Interrogate
- Apache Beam
- Librosa
- PyAudio
- Pillow
- FFmpeg
- ImageMagick
- Music Information Retrieval
- Machine Hearing
- Scio
- Microsoft Azure
- Google Cloud Platform
- Google Cloud Dataflow
- Protocol Buffers
- Apache Spark
- PySpark
- DAG == Directed Acyclic Graph
- ISMIR Conference
- Digital Signal Processing (DSP)
- Python Pickle
- Research paper on separating vocals from instrumentals of a song
- New York Times: Why songs of the summer sound the same
- Microsoft’s Rocket Platform for video analytics
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. And do you want to get better at Python? Now is an excellent time to take an online course. Whether you're just learning Python or you're looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top notch course for you. If you're just getting started, be sure to check out the Python for Absolute Beginners course. It's like the first year of computer science that you never took compressed into 10 fun hours of Python coding and problem solving. Go to pythonpodcast.com/talkpython today and get 10% off the course that will help you find your next level.
That's pythonpodcast.com/talkpython. And don't forget to thank them for supporting the show. Your host as usual is Tobias Macey. And today, I'm interviewing Lynn Root about Klio, an open source pipeline for processing audio and binary data. So, Lynn, can you start by introducing yourself?
[00:01:47] Unknown:
Yeah, sure. Thanks for having me. My name is Lynn Root. I am a staff engineer at Spotify. I've been at Spotify for a little over 7 years now, and for the past 2 years, I've been working on audio processing infrastructure.
[00:02:02] Unknown:
And do you remember how you first got introduced to Python?
[00:02:06] Unknown:
Like it was yesterday. Yes. So I myself have, like, a bachelor's degree in finance and economics, but I wanted to, you know, pursue a master's degree in financial engineering. And in order to apply, I had to, like, learn how to code. And so I got exposed to Python, among other languages, and I basically said, screw finance, programming is so much more fun. So then I started, you know, teaching myself Python, and in doing so, started the PyLadies chapter in San Francisco. This was just basically me finding friends to learn Python with me. And soon after, I joined Spotify and then became known as our internal Python person, maintaining a lot of infrastructure tooling for developers to use for the language.
Somehow I got lucky enough to get stuck with the Python 2 to 3 migration roadmap management. So I got quickly introduced to Python, and I've kind of stuck with it since.
[00:03:09] Unknown:
And so you mentioned that you've been responsible for a lot of the internal tooling. And I know that Spotify uses Python in a number of areas, most recently with this Klio project. But I'm wondering if you've also been involved with things like the Luigi project or some of the other interesting things that you've been tasked with aside from just the Python 2 to 3 migration.
[00:03:28] Unknown:
For a hot second, I was a maintainer for Luigi. But I was on a team that didn't own Luigi, so it didn't really make sense for me to maintain it beyond, like, a few months. In terms of other, like, open source Python-related things from Spotify, maybe in 2015, I released a tool to help parse what's called RAML, the REST API Modeling Language. It's basically like YAML, but specific to defining a REST API. And then more recently, not under the Spotify name, but as a part of my job, I wrote and released a tool called Interrogate, which helps you identify what code is missing its documentation, kind of like test coverage, but to make sure you have documentation coverage.
[00:04:12] Unknown:
So in terms of the Klio project, can you give a bit of an overview about what it is and some of the origin story behind it?
[00:04:19] Unknown:
So basically, Klio is a Python-based framework for data processing. It's actually built on top of and wraps Apache Beam. And Apache Beam is an open source project. It actually originated at Google and provides a way to, I guess, define batch and streaming pipelines while being quote unquote embarrassingly parallel. That's their quote. So Klio is built on top of Beam, and it could be used for general data processing, but Klio is a bit more opinionated, and it's actually particularly tuned to process media, or I guess any sort of binary file. And so the generic use case is processing large files, but it's also meant to handle particularly resource intensive processing.
So for instance, digital signal processing, where you look at the actual waveform of a media file. So you might be using packages like Librosa or PyAudio or pydub, or, like, image stuff like PIL or Pillow, or maybe some system level tools like FFmpeg or ImageMagick, or maybe you need to run a bunch of media files through an ML model that might be trained to do some sort of signal processing for you. And so the reason that Klio got started was researchers and ML engineers at Spotify were getting deeper and deeper into this audio processing space and wanted slash needed infrastructure to support that processing at scale.
You can imagine we have a lot of analytics and models that are based on the metadata of audio, like artist, genre, that kind of thing. But we also have the actual audio file itself to look at. And so I kind of want to step back a bit just to give context for those who are not familiar, because I wasn't until 2 years ago. Audio can be represented as, like, a wave. I guess if you were to go back to algebra, you could draw a little graph where the x axis is time and the y axis is amplitude. The waveform can be just drawn going up and down with y equals 0 at its center. And so the higher the amplitude of the wave, the louder it sounds.
And then these waves can be sinusoidal, like smooth up and down, or triangular or square or sawtooth. Then there's also frequency, which is how often that wave repeats in a given time. And so with audio, frequency translates to pitch, where a waveform at a certain frequency translates to, like, a specific note, like an A4 or something like that. And so looking at this information can reveal some really cool things, something as simple as estimating a song's tempo, since you can see all the beats of a song in its wave. You can track its pitch from a waveform, providing the ability to estimate what key the song is in. You can start looking into patterns as well, like an audio file that's just spoken word, like a podcast, will look different than a pop song.
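To make that concrete, here is a minimal sketch of the kind of waveform analysis being described, using the Librosa package mentioned in this episode. This is just an illustrative example, not Klio code, and the file name is a hypothetical placeholder.

```python
# A sketch of basic waveform analysis with Librosa (not Klio itself).
import librosa

# Load the audio: y is the waveform as a NumPy array of amplitude
# samples, sr is the sample rate (samples per second).
y, sr = librosa.load("song.ogg")  # hypothetical file path

# Estimate the song's tempo by detecting beats in the waveform.
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
print("Estimated tempo (BPM):", tempo)

# Chroma features map spectral energy onto the 12 pitch classes,
# a common starting point for estimating what key a song is in.
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
```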
We can detect, does an audio file have speech in it? And then maybe from there, detect its language. And this isn't necessarily anything new that Spotify is doing. It's called, like, music information retrieval. I mean, it's been an actual proper, like, academic subject matter for at least a couple of decades now. But the idea of machine hearing is kind of new still, maybe a decade old, where we try to teach machines to listen to audio as humans do. So, a couple of years ago, our researchers and engineers started to do some real novel things with this signal processing and machine learning, or machine hearing. For instance, it's always been difficult to take a single song and separate out what's called stems, like vocals and instrumentals.
But about 3 years ago, researchers essentially found a way to do it, and there's actually even a white paper about it. And from there, they wanted to essentially do that for our whole Spotify catalog, which, as you can imagine, is huge, with, like, tens of millions of song or music files. And that's not even including podcasts. So these folks, you know, they started to slap together their own duct tape system, but quickly realized that they're not infrastructure engineers. They had a bit of difficulty scaling, and the fact that they had a lot of required maintenance kind of took away a lot of time from their actual research and development.
And so that's essentially how Klio was born. It's, you know, built by infrastructure engineers who should know a thing or 2 about scaling, and hopefully about observability and reliability, and it just empowers researchers and other engineers, or anyone interested in, I guess, audio processing, to only care about the processing logic itself.
[00:09:24] Unknown:
And you mentioned a lot of interesting problems that you're working to solve by being able to process these audio files, particularly at the scale of Spotify, where you have so many different assets that you need to be able to process, between music and, recently, podcasts. I'm interested in understanding some of the specifics of the problems that are faced by engineers and researchers who are trying to work with this audio and binary data, as compared to some of the other types of large scale stream processing or data processing that is focused primarily on text or structured data.
[00:09:59] Unknown:
It's quite different, the problem space for binary versus, like, text-based data. It is, you know, inherently a lot quicker to crunch some analytics on audio metadata, like, you know, trying to figure out what songs to recommend to a user based on songs that are labeled in a similar genre as some other songs, and that computation, I guess, should be relatively quick. Dealing with, like, binary data brings a lot of resource constraints that you wouldn't, I guess, otherwise be concerned with. For instance, the average song, like 3 to 4 minutes in duration, might be about 8 megabytes in size, and that's the compressed size.
Downloading that takes up network bandwidth, and potentially disk space if you're sending it to disk. And decompressing that audio to actually load it into memory to be able to analyze it will then take up 40 megabytes of RAM. Processing 40 megabytes of data itself can take maybe 1 to 2 minutes of CPU time, depending on what you're doing, maybe even longer. Maybe you actually need GPUs to do all this processing. And then if you're using frameworks like TensorFlow, that's yet another heavy memory footprint. Some libraries might not be well engineered for concurrency, or not thread safe in terms of processing, so that might actually force you to write a pipeline that can only be 1 thread per process.
A lot of these audio frameworks in general are just actually wrappers around tools implemented in other languages, or have C bindings, or maybe even need to, like, shell out to call, like, FFmpeg or something like that. So you need to make sure that you have your runtime environment set up properly with all those system level dependencies. So I guess overall, it does get a bit hairy, but it's, like, a super interesting space to be in, because I don't remember the last time I really had to deeply care about lower level constraints like this, especially with the advent of cloud computing.
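As a back-of-envelope check on those numbers, here is the arithmetic for a 4 minute song decoded to the float32 samples that analysis libraries typically work with. This is a sketch; the exact figures depend on sample rate and channel count.

```python
# Rough memory math for a decoded 4-minute song (illustrative only).
duration_s = 4 * 60        # 4 minutes of audio
sample_rate = 44_100       # CD-quality samples per second
bytes_per_sample = 4       # float32, the usual analysis format

# Decoded mono waveform size in RAM:
mono_bytes = duration_s * sample_rate * bytes_per_sample
print(mono_bytes / 1e6, "MB")  # ~42 MB, close to the ~40 MB cited
```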
[00:12:00] Unknown:
Digging more into the Klio architecture itself, I know that you mentioned that it's built on top of Apache Beam, which is a framework for being able to handle both batch and stream processing over large volumes of data and on arbitrary compute substrates. And particularly because of these performance sensitive characteristics, I'm curious what the decision process looked like as far as choosing the underlying framework to work with, but also the decision to use Python versus something that might be closer to the metal as far as the performance optimizations.
[00:12:35] Unknown:
So that's a good question. We definitely looked elsewhere first, just kind of tried to understand the lay of the land. We didn't want to necessarily write something that already existed. There are certainly, like, a lot of audio processing tools out there, but that's not necessarily the problem. The problem here is scaling with resources, right, as you said. And so we actually kind of looked at data processing frameworks in hopes that they could help with our requirements. So beyond Beam, we looked at stuff like Argo or Airflow, even Luigi, some others. And, I mean, there are so many, I guess there's a new data processing framework popping up every week. But some frameworks, you know, don't support streaming or event based processing, only batch processing.
Some didn't support being able to configure the types of hardware that you needed, which is very job specific, since one job might need more memory, or the ability to provide, like, a Dockerized environment, or to specify the exact runtime requirements needed. And so some of the problems that we needed to solve: we did need that streaming slash event based support, so folks can process audio as soon as it's available in our catalog. We had users with very different, like, hardware requirements, so it had to be flexible in that regard. And we have, you know, so many different frameworks that folks can use, so we needed to ensure that, like, the runtime environment can essentially be hard pinned, like versions and stuff like that, because it is research.
It needs to be reproducible. So again, Dockerization of the environment was ideal. We do have Scio, which is a Scala implementation of Beam, and we could have looked deeper into that. But the actual lingua franca, I guess, of audio processing and ML is Python based. So researchers and ML engineers are far more comfortable, I guess, in Python than Scala or another JVM based language. And then we have actual tools developed in Python. There is some JVM based tooling for folks for, like, ML processing or audio processing, but it definitely does not have the reach or the development that a lot of these Python libraries do. And so that kind of forced a decision for us: we want to provide tools that are actually, you know, helpful to engineers, rather than forcing them to learn another language, adapt, you know, kind of lesser libraries to their needs, and, I guess, translate from, like, Python into JVM equivalents. So we went with Python.
We chose Beam specifically because Beam was already widely accepted in our infrastructure. Spotify's motto is to kind of keep things simple, I guess: not too many competing products to do the same thing. We want folks to have, like, less cognitive overload when writing their pipelines. So if they are familiar with Beam for a simple data pipeline, then they should be familiar with Beam when writing a media processing one. So, yeah, that's how we eventually arrived at building on top of the Beam framework with Python.
[00:16:00] Unknown:
You mentioned briefly some of the constraints that you were working with and the design goals of the project. And I'm wondering if you can dig a bit more into some of the alternative options that you looked at for being able to work with this volume of binary data and be able to fulfill the use cases that you were targeting and highlight some of the capabilities that were lacking in those options or some of the edge cases that they weren't able to cover that made it worthwhile to invest the time and effort into building this new system from scratch?
[00:16:31] Unknown:
It's kind of difficult to answer this question, because there weren't really competing full fledged libraries that handle this, like, binary data. About a year into our development, I think Microsoft released a framework that actually does, like, video processing, and I think it's event based. It supports, like, ML, but it's particularly targeted at, like, analytics generation, not necessarily heavy actual media processing itself. Also, it's closely tied to Azure, and we are a Google Cloud Platform based company. And as well, like, we were already a year into development, with folks starting to use our product. A lot of the existing processing frameworks had bits and pieces of what we need, but you can't exactly cobble together a processing framework from other processing frameworks. Right? So it was better for us to essentially build upon something that gives us the biggest head start, I guess, which ended up being Beam, because they do support the ability to configure your resource requirements, like, you know, more memory or CPUs, etcetera.
They do support running on Dataflow, which is in Google Cloud, so we don't have to actually manage the running of the pipelines ourselves. It already has a Python SDK, so you can actually write, you know, a vanilla Beam job that kind of looks similar to Klio, but, you know, you'd have to do a lot of the boilerplate that Klio does. So it's actually quite easy to go between the 2 if need be. So, yeah, Beam had a lot of the basic needs, and also the ability to customize our runtime environment with, you know, whatever dependencies we need. And so with the framework, there's a lot of architecture around it, and it tries to be a bit more opinionated.
[00:18:32] Unknown:
In terms of the actual design of the framework, you mentioned that Klio provides a decent amount of boilerplate for people who are looking to build these jobs on top of Beam. I'm wondering if you can talk a bit more about the design of Klio itself and some of the capabilities that you've baked into it to simplify this work of processing these large binary files and being able to integrate with the broader ecosystem.
[00:19:00] Unknown:
Klio, I guess, tries to do a lot of things and essentially make it stupid easy to write an audio processing pipeline. So when you're writing, like, a Klio job, you're essentially, you know, writing a Beam pipeline, which should look familiar, but all you need to do is just write what the processing logic is. From a user's perspective, you have, like, a CLI tool that maybe creates, like, some template files for you for a new job, and you just add some job specific boilerplate code that defines your processing logic in the pipeline. That pipeline itself might be a few steps. Like, first download this file from this, like, cloud storage bucket, pipe that into logic that actually handles processing, like stem separation, like separating vocals from instrumentals, and then take that output. And now you have, like, 2 outputs, and you upload them to their respective cloud storage buckets.
So, like, the CLI helps with the management, including building a Docker image for the job, deploying a job, stopping it. What you actually don't see is that when you run the job locally, everything is actually running in a Docker container, not on, like, the host itself, and this is to help with consistency of environments. But if you step back a bit and look at the job more generically, it typically has 2 types of input: an event input, which is, like, a trigger, and then another input, the actual binary file to process. Similarly, it has, you know, 2 outputs: the event output and the data output. And so a job receives an event, where the event contains a unique reference that points to a binary file.
This is basically a trigger saying some work needs to be done, and then that gets mapped to the actual binary data to look up in some, like, cloud storage bucket. The job then downloads the file, does the processing, uploads the result, as I said before, to another bucket, and emits an event to its event output signaling that the processing has been done for that particular file. And this is kind of akin to, like, a microservice back end, but instead of, like, the request response paradigm, it's actually events and queues. And so that's what, like, a single Klio job looks like from a user's perspective.
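As a rough illustration of that shape, here is a plain Apache Beam sketch of a download, process, upload job. This is not Klio's actual API, just the underlying Beam pattern that Klio wraps; download_audio, separate_stems, and upload_audio are hypothetical stand-ins for your storage client and processing logic.

```python
import apache_beam as beam

# Hypothetical stand-ins for storage access and the actual ML logic.
def download_audio(uri):
    return b"..."        # would fetch the binary file from cloud storage

def separate_stems(audio):
    return audio, audio  # would run the actual stem separation model

def upload_audio(uri, data):
    pass                 # would write the result back to cloud storage

class SeparateStems(beam.DoFn):
    def process(self, file_id):
        # The incoming event is just a unique reference to a binary file.
        audio = download_audio(f"gs://input-bucket/{file_id}.ogg")

        # User-supplied processing logic: split vocals from instrumentals.
        vocals, instrumentals = separate_stems(audio)

        # Upload each result to its respective output bucket.
        upload_audio(f"gs://vocals-bucket/{file_id}.ogg", vocals)
        upload_audio(f"gs://instrumentals-bucket/{file_id}.ogg", instrumentals)

        # Emit the reference downstream as the event output.
        yield file_id

with beam.Pipeline() as pipeline:
    (pipeline
     | beam.Create(["track-123"])   # stand-in for the event input queue
     | beam.ParDo(SeparateStems()))
```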
And if you step back a little bit further, you can also connect jobs together based on these event inputs and outputs. Say, for instance, I may have a colleague that wants to look at the vocals of a song, maybe to detect its language. So they can write a job that subscribes to, like, my job that separates vocals from instrumentals, and then you essentially have, like, a graph forming. This kind of provides a way to build upon each other's work, you know, a bit easier, rather than kind of being siloed or sharing code and doing duplicate work, essentially saving on compute time. And so say that you want to process songs as soon as a label delivers them, like, a music label delivers them.
We actually have this really cool feature backed by Klio that allows users in Japan to, like, sing along to songs, so that you can, you know, essentially turn off the vocals of a song and just hear the instrumentals, and then see its lyrics timed along with the song. So, you know, have fun singing along. But it would be a bit frustrating to not be able to sing along to a song that was just released, just added to our catalog. So perhaps you want to, you know, process on demand. So Klio allows you to hook into a queue, like Pub/Sub, to listen for an event for that new song. But that event producing system doesn't necessarily have to be a Klio job itself, just so long as it has data that you need to process, like a reference to an audio file.
And so this leads me to kind of, like, a bigger part of Klio's architecture, which is what's in that event data, and we refer to it as, like, a Klio message. And it's just basically a protobuf message that contains Klio related metadata as well as what the pipeline's transforms actually see to process, like a reference to a file. In reality, you can set up your Klio job to process any sort of data. Like, it can be just, like, JSON data or something like that. But Klio essentially uses this pointer, whatever that data is, probably an audio file, plus the job's configuration to find that audio file. And then it makes it easy for the user to say, okay, just go ahead and download this file, you already know where to find it. But then under the hood, Klio first actually checks to see if the output for that particular file exists.
So say that your output is, you know, 2 audio files, of vocals and instrumentals, and they live in separate buckets in some cloud storage, Google Cloud in our case. Klio takes the file name and looks up the file name in both those buckets. And if it discovers that both already exist, it doesn't actually process it, unless you, you know, explicitly want it to, so it doesn't spend unnecessary resources. But then maybe the output does not yet exist, so, you know, we should try and process it. But then we first need to check to see if our input data exists, if that original audio file is available. If it doesn't, the job can be set up to trigger its parent job to generate that input for it, its dependency input. And so these little checks allow us to do some interesting things with how the overall graph of jobs gets executed.
We essentially support 2 modes of execution, what we call top down and bottom up. So with top down, every job in a graph gets triggered. So this is useful when you want to process audio as soon as it's available in the catalog or the dataset: an event triggers the apex job to process a new file, and then all the child jobs get triggered once that apex job finishes, and then their children get triggered, and so on and so on. But it's not necessarily efficient or necessary for every job in a graph to process a file. Maybe you only want to trigger your job, but not any downstream jobs.
So you trigger your own job for it to do work for a file. Maybe the input data dependencies are missing, meaning that its parent job hasn't produced the output for that particular file. So the parent job then gets triggered, and it recurses all the way up as far as it needs to in the graph. And so basically all the data dependencies get generated, but not all child jobs are subsequently executed, only the jobs that are in the direct path of the original job. So, like, no sibling jobs and no downstream jobs are doing unnecessary processing. So this is essentially describing bottom up execution.
It's to help optimize for, you know, the costs of resources and processing time, because it can be expensive to unnecessarily process, you know, audio files.
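For a sense of how those checks fit together, here is a minimal sketch of the existence-check logic behind bottom up execution as just described. This is not Klio's internals; it uses the google-cloud-storage client, and the bucket names and trigger_parent_job helper are hypothetical.

```python
from google.cloud import storage

client = storage.Client()

def trigger_parent_job(file_id):
    # Hypothetical: publish an event to the parent job's input queue.
    pass

def should_process(file_id):
    # Skip the work entirely if every expected output already exists.
    outputs = [("vocals-bucket", file_id), ("instrumentals-bucket", file_id)]
    if all(client.bucket(b).blob(name).exists() for b, name in outputs):
        return False

    # Output is missing, so check the input dependency. If the input
    # doesn't exist either, recurse up the graph by triggering the
    # parent job to generate it first.
    if not client.bucket("input-bucket").blob(file_id).exists():
        trigger_parent_job(file_id)
        return False

    return True
```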
[00:26:31] Unknown:
1 of the pieces that I'm interested in digging into a bit more is the ability to compose these different pipelines together, because I know that with a number of different frameworks for being able to build out some of these DAGs of tasks, it can be difficult to actually wire them together, where, you know, 1 pipeline is largely self contained. But it sounds like, from what you're saying, that if somebody builds a particular workflow for processing some of this audio data and extracting a certain set of information from it, somebody else can then discover that pipeline and then be able to hook in their additional processing based on what was already completed upstream, or maybe even at some node in the midst of that task graph. Is that accurate? And then in terms of the actual integration with Beam, I know that particularly in things like Spark and PySpark, people complain about some of the impedance mismatch between the ways that the Spark framework thinks about objects and memory allocation and the way that Python is trying to do it. I'm curious if you've seen a similar impedance mismatch between the Python SDK and the actual Beam execution layer.
[00:27:42] Unknown:
That's an interesting question, because we have seen, I guess, a few, but Python, in my opinion, is flexible enough to help work around that. There are still some, like, headaches. It's fairly difficult to control concurrency from the Python SDK part down to, like, the Java part. Sometimes you do need to, like, limit processing to only 1 thread or 2 threads per process, or, like, maybe 1 thread per core, because, you know, you have a really memory intensive TensorFlow model or something that's not thread safe. So the ability to do that doesn't necessarily exist within at least the Apache Beam part of it, and the underlying Java framework doesn't expose any kind of knobs that you can turn to help limit it on that end. So a lot of times we needed to kind of force something like this feature of concurrency management into Klio itself. What Klio does is try to, like, not make that a concern for its users, so you don't get, you know, that kind of complaint from users. It kind of abstracts away all those kinds of difficulties.
But I do have to say that there hasn't been much of that issue between, like, the Java and Python SDKs, or the underlying Java implementation that the Python SDK talks to. It's honestly been very kind of easygoing, and no other sort of real big, insurmountable challenges have come up due to differences between the 2 languages.
[00:29:17] Unknown:
And then digging more into some of the optimizations that you've built into Klio, particularly for things like the bottom up processing that you mentioned, of being able to determine whether an upstream job needs to be re executed or not based on your downstream requirements. I'm wondering if there are any other useful or interesting capabilities that you've built into Klio for being able to handle some of these particular edge cases, or some of the optimizations that are necessary for working with these large files and the volume of IO that's necessary as a result.
[00:29:49] Unknown:
Yeah. There's a handful, and we have so many ideas in development or on our roadmap. But right now, there are tools that allow users to download files, but download them to memory, so you don't, you know, write to disk and take up disk space. And then, in doing that, you have, like, a choice of how it can be pickled. For instance, a lot of, like, the waveforms are represented in NumPy arrays, and NumPy's pickling optimizes for memory, whereas the standard lib pickle optimizes for speed. So you do have that choice. It's quite nice.
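As a sketch of what downloading to memory looks like (again, not Klio's actual helper API; the bucket and file names are hypothetical):

```python
import io

import librosa
from google.cloud import storage

client = storage.Client()
blob = client.bucket("input-bucket").blob("track-123.ogg")

# download_as_bytes keeps the file entirely in RAM, so there is no
# temp file on disk to clean up afterwards.
buf = io.BytesIO(blob.download_as_bytes())

# Librosa can load from the in-memory file-like object directly,
# yielding the waveform as a NumPy array, ready to be pickled.
y, sr = librosa.load(buf, sr=None)
```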
Sometimes you need to, like, debug a DAG of jobs, but you don't necessarily want to make it do work. You're able to kind of debug a DAG of jobs by sending, like, a message in ping mode. So you can kind of track what would happen if this audio file were to trigger, you know, a certain job, and you can see it go through the graph: whether it would, you know, find output or input, or whether it, you know, raises an error at a certain place that you can easily find. It's a rudimentary approach to tracing, like, a back end request, kind of thing. And then there are some kind of cool, I guess, usability features that Klio has, in my opinion. I mean, it's stuff like, you know, running unit tests, which is just handy to have, and verifying that your configuration is set properly and that your declared resources exist, like whatever storage bucket you have declared. But it also allows you to profile your job's memory and CPU footprint locally.
This is particularly helpful to understand the kind of resources your job will require once deployed. And then there's even, like, an underappreciated command. We have a lot of ideas for it, we just need to develop them. It's this idea of auditing your code, so you can audit your job. Right now you can audit for things like not cleaning up temporary files and using up disk space, or not being thread safe when you use certain libraries. It's quite handy. Those are all, like, the features and cool tools that we're trying to provide to sort of help with that resource intensive processing.
[00:32:07] Unknown:
Python has become the default language for working with data, whether as a data scientist, data engineer, data analyst, or machine learning engineer. Springboard has launched their school of data to help you get a career in the field through a comprehensive set of programs that are 100% online and tailored to fit your busy schedule. With a network of expert mentors who are available to coach you during weekly 1 on 1 video calls, a tuition back guarantee that means you don't pay until you get a job, resume preparation, and interview assistance, there's no reason to wait. Springboard is offering up to 20 scholarships of $500 towards the tuition cost exclusively to listeners of this show.
Go to pythonpodcast.com/springboard today to learn more and give your career a boost to the next level. In terms of the actual project itself, I'm curious what the motivation was for releasing it as open source, and if there was any particular process involved in cleaning up the code for it to be publicly consumable, or if it was the intent from the outset to be open source, and if that helped to direct the overall design and architecture of the project to avoid tight coupling with the internal Spotify systems.
[00:33:23] Unknown:
From the beginning, Klio was certainly developed with the intention to open source. I mean, I've been heavily involved in Spotify's open source strategy for, like, 5 years now. So it wasn't really necessarily a question, it was kind of, like, the default approach. And we did get some hints early on that open sourcing a framework like this would be well received by the research community, or even some of our competitors. But we first wanted to make a viable product and to have our internal folks kind of kick the tires and make sure it actually has some traction.
And then with that, we did go through a couple of API redesigns and iterations, and then added some more optimizations, adding, like, highly requested basic features, like batch and stuff like that, because we were streaming first and still in a lot of development. And then we were comfortable releasing it. You are right to point out that there was a lot of time spent kind of cleaning up code first. I am particularly a perfectionist, so I tried to have things documented and tested so that, like, I'm not embarrassed to have my code out there. And so I did do, like, maybe a couple of months of kind of focused effort on actually cleaning the code base, as well as adding really good basic features, like, yep, like batch, or handling different inputs and outputs, like, you know, cloud storage or, like, local files, etcetera.
So, yeah, some time was spent just focusing on that, but we conveniently timed the open sourcing of Klio with a conference that we presented Klio at. It's called ISMIR, the International Society for Music Information Retrieval. So we basically hosted a tutorial to show off Klio to see, like, what their reaction would be. And so that was kind of the fire under our butts to get it out there. But also, I think we finally decided that, yes, we were comfortable. You know, we might be a little bit lacking in full documentation coverage or full test coverage, but we are also just as eager to, you know, learn from a broader range of users, not just, like, internal Spotify users, and maybe use cases we haven't thought of, and hopefully invite, you know, contributions, and essentially invite others to help improve and build upon Klio.
[00:35:45] Unknown:
For people who are actually using Klio for doing their own work, I'm wondering if you can talk through some of the workflow that's involved with getting a Klio project set up, and being able to handle third party dependencies that you might want, and managing the deployment and life cycle of the overall project.
[00:36:03] Unknown:
Yeah. So building a pipeline, I guess, starts off with a very, like, simple command, like klio job create. It creates, like, I guess, a cookie cutter template of a job. You just declare your dependencies in 1 or 2 places: the standard, like, requirements.txt for your 3rd party Python dependencies, and then maybe you need, like, system level dependencies, like actual, like, C libraries that maybe your Python dependencies plug into, or you might actually want to shell out to something like FFmpeg. Those you would declare in, like, the Dockerfile.
And so you add your dependencies, and you kind of go about your regular Python development. You kind of declare the actual processing that you want handled. Behind the scenes, Klio actually takes care of reading those event trigger inputs and writing the outputs, and it can handle, like, the data input reading and output writing for you. But essentially, all you have to worry about is, okay, you've received a trigger that has this reference ID, what should the job do now? And so maybe you take that file and run it through your model and then just return the result, or maybe you tell Klio to upload the result to a bucket. And so you really only have to focus on the actual business logic of the pipeline.
And then once that's all defined, you probably want to test it out locally. The CLI tool allows you to easily kind of get it running locally and see if it works. You're able to kind of publish a message to its queue from the CLI as well, and from there, you should be able to see some really pretty logs. And maybe it's good enough to write some unit tests. As I've kind of hinted before, Klio will run those unit tests in the Docker image for you, so it should be the same environment as it would be when it's deployed. Then maybe you're totally comfortable and confident and wanna, you know, deploy it. The klio job deploy command will just, you know, put it out there, put it on your configured runner, which right now is either the direct runner, the local runner, or Dataflow.
So on Google Cloud Dataflow, it should, like, then take on all of the heavy lifting of orchestration. You won't have to worry about spinning up all the resources needed. If you want autoscaling, you can configure that, so that if all of a sudden you have a huge load of audio files to process, it'll take care of scaling for you. With that kind of comes, like, built in metrics and observability, which Klio can help you kind of discover, like, help you find the dashboards that those metrics show up on. Maybe you have, like, a new model, like a retrained model, and you want to redeploy your job with this new, like, ML model. You then kind of, like, go through that process again of just, like, updating your code and kind of rebuilding the image, and then klio job deploy will take care of taking down that old job without losing the in process work for you. Like, the actual pipeline will kind of get drained: work in progress will continue, but any work still in the queue will just kind of be paused until the new Dataflow job is running.
And now you have, like, this new model deployed, and you kind of need to reprocess the old data that had been processed by the old model. You can then re trigger, like, all those audio files that need reprocessing with just, like, a simple forced trigger. So Klio will say, oh, I do find this output already existing, but you told me to force it, so we're gonna go ahead and process it for you. So, yeah, that's sort of, like, an overview or walkthrough of the general workflow of a Klio job.
[00:40:10] Unknown:
Yeah. And for people who want to build on top of Klio or integrate with it more directly, what are some of the extension or integration points that you have exposed for being able to expand its utility beyond just the core feature set?
[00:40:21] Unknown:
There are actually quite a few points of, like, extension. For instance, like, you would just have to write, like, a small little class to, like, read from or write to, like, an AWS service or Azure service, like SNS or S3 or Queue Storage or whatever. Similarly, we have, like, support for metrics, right now only for, like, log based metrics and Google Stackdriver, but it's extremely easy to extend the metrics client beyond that, maybe to StatsD or Grafana. And it would take a bit more code, but it should be possible to extend Klio to actually work on other runners supported by Beam. So Beam supports Dataflow, but also Flink or Spark, and it's just a little bit more legwork, but defining support for that kind of runner is definitely possible and relatively simple.
And right now, Klio has an optional library for, like, specific audio handling based off of, like, the Librosa package, but folks could easily make tools for easier pipelining of, like, image and video handling. So there are many, many areas, many touch points, that should be relatively easy to kind of extend.
[00:41:41] Unknown:
And for people who are building on top of Klio, what are some of the challenges or misunderstandings or edge cases that they should be aware of, or that you've seen people run into?
[00:41:51] Unknown:
There's, in general, like, a common misunderstanding when you're first learning about Klio, particularly if you're not in, like, the research space of, like, audio processing. It's that, like, Klio does the audio or ML processing for you, or is a framework for that, when really Klio is meant to help scale that processing in a distributed way. Sometimes that can be a misunderstanding when going into the project, you know, fresh. But then in general, the space itself is very complex and has a natural, like, steep learning curve. So you might have to do a bit of self education in order to get beyond, like, a simple hello world example.
And it's difficult for the user because it's hard to tell where Klio ends and where Apache Beam begins, or where Dataflow begins. You're interacting with Klio mostly, but Beam is doing, like, all of the pipelining work, and Dataflow, or another runner, is doing all the orchestration work. But hopefully, that doesn't necessarily discourage folks. What Klio has done is essentially make the nearly impossible possible. It is still complex, and what we're working on next is to make that complexity a bit more simple. So on our roadmap, we have ideas for tools like better audio chunking, or just file chunking. Right now, processing extremely large files, like hours long podcasts, is, like, less reliable, a little bit more shaky than, you know, the 3 or 10 minute, like, song. We're hoping to build more useful utilities in the near future to address that sort of use case of extremely large files.
[00:43:37] Unknown:
Or for the odd music track that actually happens to run for a solid hour.
[00:43:41] Unknown:
Yeah.
[00:43:42] Unknown:
And so in terms of your own experience of working on the Klio project and being involved in this space of audio data processing, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:43:57] Unknown:
Of course, I had to learn a lot about audio processing. I still couldn't write, like, a PhD dissertation based off of it, but, yeah, I had to do a crash course in, I guess, digital signal processing, kind of thing, reading a lot of textbooks. But actually, the challenge that was more in my face was understanding pickling. I mean, I know pickling, but I hadn't actually been exposed to it before this, and specifically with how Beam pickles. And so when launching a pipeline, particularly for, like, a remote runner like Dataflow, user code gets pickled locally and then unpickled on worker machines.
So anytime pickling is used, it kind of can limit what you can do. And so this is actually 1 of the reasons why we Dockerize everything: to make sure that the environment in which you launch your pipeline, the environment that gets pickled, is the same environment that the pipeline itself gets unpickled in on the remote worker. And so that might be obvious to some, but I had not yet been exposed to that problem space. But with that deeper understanding of pickling can come ideas of how to abuse it, I guess. So during unpickling, there's, like, this dunder __setstate__ method defined in the class that gets called when unpickling an instance.
And this provides us with, I guess, ways to manipulate the worker environment as it starts up. Now, Klio doesn't do this yet at all, but we could, and we've seen it used within Klio jobs to maybe start daemons on workers, like profilers, like the Stackdriver profiler, which can be really useful. A generic kind of problem that I have regularly is kind of designing a framework for other developers. For instance, how best to design an API that makes sense and feels ergonomic to use. And then, like, how much magic should we provide the user? How much of that magic will actually limit the user, maybe too much?
How can we provide flexibility for the user while still being opinionated? And then there's the maintenance and development cost for us when it comes to, like, making those decisions. I'm, like, regularly paralyzed by, like, framework design, making sure that it's good and well thought out. And so I find, personally, that to be a big challenge.
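To illustrate the dunder __setstate__ hook mentioned above, here is a toy example of code that runs only when an instance is unpickled, which is how a daemon like a profiler could be started on a remote worker. This is a hypothetical sketch, not something Klio does; start_profiler_daemon is a made-up stand-in.

```python
import pickle

def start_profiler_daemon():
    # Hypothetical side effect: start a background profiler here.
    print("profiler daemon started")

class ProfiledTransform:
    def __init__(self, name):
        self.name = name

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Runs during unpickling, i.e. on the remote worker rather
        # than on the machine that launched the pipeline.
        start_profiler_daemon()

blob = pickle.dumps(ProfiledTransform("stems"))
restored = pickle.loads(blob)  # __setstate__ fires here
```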
[00:46:25] Unknown:
And as far as the ways that Klio is being used, what are some of the most interesting or unexpected or innovative projects that you've seen built with it?
[00:46:34] Unknown:
Yeah. I think I mentioned this before, but I still think it's super cool: the sing along feature available in Japan, and I think other markets as well. That's basically based off of Klio generated data, separating vocals from instrumentals and just taking the instrumentals and the lyrics. I mean, we've also seen, internally, Klio used for, like, audio based ad generation, where you can, like, dynamically piece together someone's voice, like, reading an ad, with various background music. You can then create, like, multiple versions of the same ad. So this would help with, like, maybe you're listening to a Chopin album, and then it wouldn't be a great experience if that ad's background music was heavy metal.
So the fact that you're able to generate, you know, different types of ads with different backgrounds and kind of target based off of that provides a better experience. Some things might be boring to those, like, in this research space, but that I find super cool: we fingerprint audio, which means generating a unique identification, or a signature, for each audio file, or even segments of audio. And then that can be used for, like, deduplication, cover song identification, many things. And there are some other research based things that we're doing, including, like, song and speech transcription, and instrument separation, like separating the actual guitar from bass from drums.
And then we can build on top of that. Personally ideating here, but, like, maybe we could build automatically generated, like, sheet music with that. So that'd be really cool. As for external users, I haven't seen any actual, like, product of it, but I know external users are, like, really interested in trying out Klio for an auto DJing feature: basically, being able to create a playlist where, like, the tracks all fade in and out perfectly. A while ago, the New York Times had a Sunday Review feature. It's actually, like, a little web app, titled "Why Songs of the Summer Sound the Same."
And it's a cool website that, like, highlights how summer songs have certain attributes or qualities, like danceability or energy and loudness. The review itself actually goes on to show that, like, over the past 40 or 50 years, songs released in the summer, or for the summer, have become less diverse and have been converging on the same levels of danceability and loudness and other attributes. I don't wanna spoil it any further, and I'll add the link to this cool app in the show notes. But this kind of research is built upon our public audio features API. And the data for that public API is backed by Klio jobs, where the audio processing that takes place estimates how danceable a song is, or the acousticness, or the tempo, or what's called speechiness.
So, like, it's making research a lot easier and allowing us to also build some really cool features.
[00:49:38] Unknown:
And you mentioned a few things that you've got planned for the future of the project. Are there any other things that you'd like to call out for plans that you have or contributions that would be helpful to help drive it forward?
[00:49:51] Unknown:
Yeah. So there's definitely a lot of development ahead. I'd say that while Klio is production ready, it can do a lot more to help users focus on what they care about and to make things simpler. And so, yeah, this includes, you know, more optimization tools, like file chunking and improved concurrency management. And this is coming with, you know, more and more interest in processing podcasts, and just longer audio in general. But we're also hoping to improve observability: more observability features to allow you to be able to diagnose any sort of, like, resource constraints that your job might be having, and kind of uncover any sort of really ugly, like, deep down in the system issues. Basically, help diagnose, essentially.
We also have plans for better testing. Like, right now, you can certainly easily unit test, but what about integration tests, particularly if your job is a part of a bigger graph of jobs? In general, like, I'm curious, I guess, to see what the open source community will bring. I know that folks do wish to use this on Amazon's cloud architecture, as well as, like, kind of homegrown architecture, like maybe a self maintained Kubernetes cluster. And so that's what I hope to anticipate in terms of contributions. But in general, we're, like, a very small team, and so we can only focus on what's in front of us.
So I would be wholeheartedly excited for any contributions that, like, you know, support different use cases or improve usability, anything of that sort.
[00:51:34] Unknown:
Well, for anybody who wants to follow along with the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose the PSF fundraiser that's going on right now. They're looking to close out Q4 with some, hopefully, additional funds to help support the overall Python community. So if you have some time and are able to, I definitely recommend contributing back to the PSF for all the work that they do for the people who use Python. And so with that, I'll pass it to you, Lynn. Do you have any picks this week?
[00:52:06] Unknown:
Yeah. Just 1 pick. I hope this doesn't make me sound like a fangirl, but about a month ago, I started using this note taking tool called Roam. It's super easy to take notes and create a graph of your notes, and catalog everything that you might think of. It's just better aligned with how I approach my own note taking. So I highly recommend taking a look at it.
[00:52:26] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing on Klio. It's definitely a very interesting project and interesting problem space. So I'm excited to see where it goes, and the new capabilities that you're able to add to Spotify as a result of that, and some of the ways that the overall developer and research community can take advantage of it. So thank you for all of your time and effort on that, and I hope you enjoy the rest of your day.
[00:52:53] Unknown:
Thank you very much. I appreciate you having me.
[00:52:55] Unknown:
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to the Episode
Interview with Lynn Root
Overview of Klio Project
Challenges in Audio Processing
Choosing Apache Beam and Python
Klio Architecture and Design
Optimizations and Features in Klio
Open Sourcing Klio
Workflow and Deployment
Lessons Learned and Challenges
Innovative Uses of Klio
Future Plans and Contributions
Closing Remarks