Summary
With the rising availability of computation in everyday devices, there has been a corresponding increase in the appetite for voice as the primary interface. To accommodate this desire we need high quality libraries for processing and generating audio data that can make sense of human speech. To facilitate research and industry applications for speech data, Mirco Ravanelli and Peter Plantinga are building SpeechBrain. In this episode they explain how it works under the hood, the projects that they are using it for, and how you can get started with it today.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Mirco Ravanelli and Peter Plantinga about SpeechBrain, an open-source and all-in-one speech toolkit powered by PyTorch
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what SpeechBrain is and the story behind it?
- What are the goals and target use cases of the SpeechBrain project?
- What are some of the ways that processing audio with a focus on speech differs from more general audio processing?
- What are some of the other libraries/frameworks/services that are available to work with speech data and what are the unique capabilities that SpeechBrain offers?
- How is SpeechBrain implemented?
- What was your decision process for determining which framework to build on top of?
- What are some of the original ideas and assumptions that you had for SpeechBrain which have been changed or invalidated as you worked through implementing it?
- Can you talk through the workflow of using SpeechBrain?
- What would be involved in developing a system to automate transcription with speaker recognition and diarization?
- In the documentation it mentions that SpeechBrain is built to be used for research purposes. What are some of the kinds of research that it is being used for?
- What are some of the features or capabilities of SpeechBrain which might be non-obvious that you would like to highlight?
- What are the most interesting, innovative, or unexpected ways that you have seen SpeechBrain used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on SpeechBrain?
- When is SpeechBrain the wrong choice?
- What do you have planned for the future of SpeechBrain?
Keep In Touch
- Mirco
- mravanelli on GitHub
- @mirco_ravanelli on Twitter
- Peter
- pplantinga on GitHub
- @ComPeterScience on Twitter
- Website
Picks
- Tobias
- x.ai calendar scheduling assistant
- Mirco
- A vacation in Italy
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- SpeechBrain
- Mila
- Speech Processing
- Speech Enhancement
- NumPy
- SciPy
- Theano
- PyTorch
- Speech Recognition
- NeMo
- ESPNet
- Sequence to Sequence (Seq2Seq)
- HyperParameters
- TorchAudio
- PyTorch Lightning
- Keras
- HuggingFace
- Generative Adversarial Network
- Snorkel
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Your host as usual is Tobias Macey. And today, I'm interviewing Mirco Ravanelli and Peter Plantinga about SpeechBrain, an open source and all-in-one speech toolkit powered by PyTorch. So, Mirco, can you start by introducing yourself?
[00:01:09] Unknown:
Sure. First of all, thank you very much for inviting us to this podcast. I'm currently a postdoc researcher at Mila. Mila is the biggest academic lab doing research in machine learning and deep learning. I have been working for a while now in the domain of audio and speech processing, especially with deep learning. And I recently started this very exciting project called SpeechBrain, which is the topic of our discussion today.
[00:01:39] Unknown:
And Peter, how about yourself?
[00:01:40] Unknown:
My name is Peter. I recently graduated from the Ohio State University, where I got my degree in machine learning, studying speech enhancement, which I can get into more later. And during my time there, I did an internship with SpeechBrain and contributed a lot to the core architecture of the toolkit.
[00:02:05] Unknown:
And going back to you, Mirco, do you remember how you first got introduced to Python?
[00:02:09] Unknown:
Sure. Sure. I mean, that was a while ago, right? Because Python is the best choice if you would like to do research, especially in the domains of machine learning, deep learning, and data science. So I actually started working a while ago with libraries like NumPy, SciPy, etcetera. And then with the explosion of deep learning, I started working with Theano, which was, I think, the most successful general purpose deep learning library a few years ago. And more recently, I fell in love with PyTorch, which is, I think, currently the best deep learning library. And that's also why we built SpeechBrain on top of PyTorch.
[00:02:50] Unknown:
And, Peter, do you remember how you got introduced to Python? I was introduced to Python, ironically
[00:02:55] Unknown:
enough, as part of a programming languages course. So it wasn't even the focus of what I was learning, but I immediately fell in love with it. I think it's a very elegant and beautiful language.
[00:03:07] Unknown:
Moving on to SpeechBrain now, you mentioned that it's a toolkit for doing speech processing, and speech enhancement was one of the things you mentioned that it's used for. But can you just give a bit of an overview of what the project is and some of the story behind how it came about and why you created it? You know, as I mentioned before, I've been in the field of audio and speech processing and deep learning for a while now.
[00:03:29] Unknown:
And when I started conceiving this project, it was a couple of years ago or something like that. And at that time, I wasn't really happy with the speech processing libraries that we had. Most of them were difficult to use, not very flexible, not very modular, and very difficult to modify. And for me, as a researcher, that was so bad, right? Because in the research domain, we really want to explore a lot of ideas and implement a lot of things as fast as possible. I saw a clear space for a toolkit that was more flexible, easier to use, more modular, etcetera. So I started then writing a kind of project proposal.
I submitted it to some potential sponsors. The sponsors were very excited, like me, about this project, and I was able to create a team that also included Peter as one of the core members. And after a lot of work, really a lot of work (I never worked as hard in my entire life as in the last year with SpeechBrain), we were able to release something to the community. We were able to transform an idea into an open source toolkit, hopefully valuable for the entire community.
[00:04:55] Unknown:
And as far as the overall goals and some of the main use cases that you're focusing on with the SpeechBrain project, I'm wondering if you can give an overview of the kinds of use cases that it enables and, in particular, the ideas that you had for how and why somebody might use it, as it compares to when you started and where you're at today?
[00:05:16] Unknown:
So the most important use case that we have in mind for SpeechBrain is research, right? We basically designed it around that, because we wanted something that allows people to easily prototype speech and audio technologies. That's absolutely the most important use case. But this need for easy prototyping holds for research in academia, but also in companies, right? Because most of the time, companies would like to try a lot of models, a lot of technologies, evaluate them on standard datasets, and then pick the best one to put in production. So SpeechBrain is also suitable for this kind of use. The current version of the toolkit supports a lot of speech and audio related technologies.
It supports, of course, speech recognition; everyone knows about converting speech recordings into text. And as mentioned by Peter, it supports speech enhancement, which is about cleaning up speech recordings corrupted by noise and reverberation. It supports speaker recognition, which is about detecting who is speaking inside a recording, multi-microphone signal processing, spoken language understanding, and I'm surely forgetting something, because it's already a kind of framework that can be used for multiple tasks. So these are more or less the most important use cases that we have in mind for
[00:06:51] Unknown:
SpeechBrain. And in terms of the actual processing of the audio, I'm wondering what are some of the particular challenges related to dealing with audio that has speech embedded in it, and being able to pick out those particular wave patterns and extract that from any background noise that you might be dealing with, versus just doing straight processing of audio signals without having to differentiate between speech and non-speech elements?
[00:07:17] Unknown:
Well, I have to say that speech has some peculiarities, right? If you think a little bit about it, there is a kind of hidden hierarchical structure behind speech. For instance, you can combine low-level acoustic features to detect some basic speech units, you can combine those units to form phonemes, from phonemes you can create words, from words you can create sentences, and finally, on top of this, you have the meaning of what you want to say, right? This kind of hierarchical structure is pretty challenging for a machine learning system to capture. And that's probably what makes speech very challenging, but also very interesting. Another thing is that everything is even more complex because the signal is often corrupted by noise and reverberation, which are very difficult to model because they are very unpredictable.
The signal also changes a lot depending on factors like speaker identity, accent, emotion, and many other things. So it's very interesting working with speech. I have to say that SpeechBrain is actually a little bit agnostic with respect to the specific type of signal. We call it SpeechBrain because speech is the main focus of the toolkit, but you can definitely use it for music processing, sound classification, even for text processing to train language models. And I also have a kind of project going on to use SpeechBrain for decoding EEG brain signals. So it's actually a toolkit which is suitable for time series in general. And what makes this possible is that all these tasks share the same technology, which is deep learning.
So today it's much easier to have this kind of multitask toolkit be suitable for processing all these different time series. At least it's easier than in the
[00:09:24] Unknown:
past. In terms of the availability of libraries and frameworks and services, there are a number of them out there, particularly ones that focus on things like speech transcription. AWS has services, Google has services, there are third-party systems that provide services, and there are open source libraries. And I'm wondering what the overall state of the ecosystem was when you first began working on this, and what you found to be missing or lacking that necessitated the work of actually building out this entire toolkit to fulfill the use cases that you needed to achieve?
[00:10:00] Unknown:
Every company, right, has its own speech recognizer. But the little problem is that the speech recognizer is not open source most of the time. We often don't even have the data that they use for training. So there is a clear need here to democratize the field a little bit and not leave it in the hands of a few companies only. So that's very important. But even in the domain of open source, there is already a lot, right? For instance, for speech recognition, the dominant toolkit is Kaldi, which is written in C++. This is very suitable, especially for industrial applications where you care about speed, efficiency, etcetera.
But we have also seen growing interest towards Python based speech processing toolkits. Here, I can mention, for instance, fairseq from Facebook, NeMo from NVIDIA, and ESPnet from Johns Hopkins University. So why SpeechBrain? SpeechBrain has this peculiarity: it is designed from scratch to deal with multiple tasks. We call it SpeechBrain because, like our brain, we want a kind of toolkit, a kind of device, that is able at the same time to recognize speech, understand speech, recognize the speaker, recognize emotion, etcetera. So we designed it with this multitask idea in mind, and we made everything suitable for that. That's a peculiarity of SpeechBrain, I think. And it's very important these days because there is a clear trend towards complex speech pipelines. Think about virtual assistants. Virtual assistants are actually complex things, right? They put together many different modules.
And the good thing is that with SpeechBrain, you can do everything within the same toolkit. So that's, I think, one strength of the toolkit, which makes it suitable, especially because I think it is carefully designed to be modular, flexible,
[00:12:03] Unknown:
and easy to use. So one of the things that you mentioned is that you can potentially run into difficulties with accents or with supporting different languages. And I'm wondering how much of the multilingual support is related to the ways that you look for the different audio signals and structures within the waveforms, versus how much of it is just baseline processing of the audio data and then having a specific language model to support the speech-to-text algorithm?
[00:12:34] Unknown:
Well, in SpeechBrain, we actually already have some language support. And it is very easy to create a system for a new language. For instance, we created a system for an African language spoken in Rwanda called Kinyarwanda, and no one on the team was able to speak this language. With this technology called end-to-end speech recognition, you can really create a speech recognition system without necessarily knowing the language. You just approach the problem as a sequence-to-sequence problem where you have some audio as input and you would like to output a sequence of words.
So that's very natural to do in SpeechBrain. As I mentioned, we already have systems for languages that we don't know, but we also have languages like English, French, Italian, Mandarin, etcetera. Nowadays, creating a system for a new language is much, much easier than in the past, because everything is framed as a sequence-to-sequence learning problem.
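(To make "framing it as a sequence-to-sequence problem" concrete, here is a minimal, hypothetical PyTorch sketch. It is not SpeechBrain's actual model code; all names and sizes are invented for illustration.)

```python
import torch
import torch.nn as nn

class TinySeq2SeqASR(nn.Module):
    def __init__(self, n_mels=80, hidden=256, vocab_size=1000):
        super().__init__()
        # Encoder reads a sequence of acoustic feature frames (e.g. mel filterbanks).
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        # Decoder consumes previously emitted tokens, conditioned on the encoder state.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, tokens):
        # feats: (batch, time, n_mels); tokens: (batch, seq_len) of target ids
        _, enc_state = self.encoder(feats)            # summarize the audio
        dec_in = self.embed(tokens)                   # teacher forcing on targets
        dec_out, _ = self.decoder(dec_in, enc_state)  # condition on the audio summary
        return self.out(dec_out)                      # (batch, seq_len, vocab_size)

# One training step on random data, just to show the shapes involved.
model = TinySeq2SeqASR()
feats = torch.randn(4, 200, 80)           # 4 utterances, 200 feature frames each
tokens = torch.randint(0, 1000, (4, 12))  # 12 target tokens per utterance
logits = model(feats, tokens[:, :-1])     # predict the next token at each step
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 1000), tokens[:, 1:].reshape(-1))
loss.backward()
```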
[00:13:41] Unknown:
And in terms of the actual technology that's powering SpeechBrain, can you talk through how the project is implemented and some of the architectural and design decisions that you landed on to be able to support your research oriented focus?
[00:13:56] Unknown:
So we decided to build our toolkit on top of PyTorch, which is a library for machine learning of really all different types. Our toolkit is really sort of an interface for any task where you have audio as an input, and it is able to use this sort of streamlined functionality for machine learning. Because we know that we're going to get audio as an input, we can handle things like variable length inputs in a nice way. We decided to go with PyTorch because it's really become the go-to toolkit for machine learning, and it's very flexible. It was easy to use for building our toolkit.
I think the moment that I fell in love with PyTorch: one very common problem in machine learning is coming across a not-a-number or infinity value, and it's sort of the curse of programming in machine learning. And there's this one feature in PyTorch where, if you turn it on, it will automatically catch the function that produced it. When I found that, my jaw dropped, and it was like the heavens opened. I was very pleased to discover that.
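(For reference, the feature Peter seems to be describing is PyTorch's anomaly detection mode. The snippet below is a small illustrative sketch, not code from SpeechBrain itself.)

```python
import torch

# Turn on anomaly detection so the backward pass reports which forward
# operation produced a NaN gradient, instead of letting it propagate silently.
torch.autograd.set_detect_anomaly(True)

x = torch.tensor([-1.0], requires_grad=True)
y = torch.sqrt(x)  # sqrt of a negative number -> NaN
try:
    y.backward()
except RuntimeError as err:
    # The error message names the offending function (the sqrt here),
    # which makes tracking down the source of the NaN much easier.
    print(err)
```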
[00:15:19] Unknown:
So I think building on PyTorch was a good decision, and I think SpeechBrain offers value on top of it. And in terms of the design and structural elements of the project, I'm curious what were some of the kinds of decision points that you ran into, particularly given your focus on making it modular and extensible.
[00:15:39] Unknown:
Yeah, I can answer that. One thing that we focused a lot on: in machine learning, there are a lot of what are called hyperparameters. These are things that you can tune that make a big difference in how the model performs, and there can be hundreds of them. So having them inside of your code really clutters things up a lot and makes it hard to read. We focused a lot on separating hyperparameters from code, so that the code is easy to read, very straightforward, and doesn't contain any magic numbers, nothing like that. The hyperparameters are all stored in a YAML file, and in order to do that, we had to create some extensions to YAML so that you could combine different hyperparameters hierarchically in different ways.
And then we also provided sort of a backbone, a structure called the Brain class, which takes care of the nitty-gritty details of training machine learning models. Those two parts are present in all of the different tasks that we have. They make it easy to create a new recipe, because these structures are there and work no matter what the task is.
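(A rough sketch of the pattern Peter describes, based on SpeechBrain's documented hyperpyyaml extensions and Brain class; treat the exact names and arguments as assumptions and check the current SpeechBrain documentation before relying on them.)

```python
import torch
import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml

# Hyperparameters live in YAML; !new: instantiates an object, !name: builds a
# partially-applied class, and !ref points at another key in the same file.
hparams_yaml = """
lr: 0.001
model: !new:torch.nn.Linear
    in_features: 40
    out_features: 10
opt_class: !name:torch.optim.Adam
    lr: !ref <lr>
"""

class SimpleBrain(sb.Brain):
    def compute_forward(self, batch, stage):
        feats, _ = batch  # assuming each batch is a (features, labels) pair
        return self.modules.model(feats)

    def compute_objectives(self, predictions, batch, stage):
        _, labels = batch
        return torch.nn.functional.cross_entropy(predictions, labels)

hparams = load_hyperpyyaml(hparams_yaml)
brain = SimpleBrain(
    modules={"model": hparams["model"]},
    opt_class=hparams["opt_class"],
)
# brain.fit(range(10), train_set, valid_set) would then run the training loop.
```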
[00:17:11] Unknown:
And in terms of being able to use the PyTorch toolkit for processing this audio data and dealing with the different types of speech models or algorithms that you're trying to experiment with, what were some of the more challenging aspects of figuring out how to put that into a, you know, well designed API and sort of a uniform interface?
[00:17:34] Unknown:
The one challenge that we came across using PyTorch that we didn't necessarily expect is that there were times when we had to worry a lot about performance, and by performance I mean just how fast the program was executing. I think at least my expectation going into it was that PyTorch would take care of most of that, and we could be left with the more research-y ideas, like, now how do I make the speech recognition model work well? So we did have to dive into those details of speed and performance sometimes, just to make sure that the toolkit could work for a really wide variety of people, even though that's not necessarily what we expected going into it.
[00:18:23] Unknown:
And as far as the original ideas and focus of the project and some of the assumptions that you had going into it, I'm wondering what were some of the elements that were either changed as you began to work through the problem or some of the assumptions that were invalidated.
[00:18:41] Unknown:
Some of the assumptions that we had were that we could focus mostly on research questions and wouldn't have to deal too much with some of the nitty-gritty details of how PyTorch works, things like that. But we really did have to dig into the details of PyTorch in order to use it as efficiently as we could. So, I mean, it is what it is, and I think we made the best of what we had. And I think SpeechBrain is better for that.
[00:19:16] Unknown:
Because of the fact that PyTorch is such a widely used toolkit and has gained a lot in popularity, I'm wondering if there are any elements of the ecosystem around it that helped to simplify the work that you were doing with SpeechBrain that you might have otherwise had to build up from scratch if you maybe went with a different baseline framework?
[00:19:35] Unknown:
First of all, an important component of the PyTorch ecosystem is Torchaudio. It's still an ongoing project, but it already provides some basic functionality to read signals and things like that: transforms, resampling, etcetera. So that's one part of the PyTorch ecosystem that we are using. But we also have other dependencies that maybe Peter can comment on a little bit more. So there have been several
[00:20:01] Unknown:
popular toolkits that have been built on top of PyTorch. One is called PyTorch Lightning, and then, I don't think this one is actually built on PyTorch, but in the deep learning ecosystem, there's the Keras toolkit. We took inspiration from those toolkits for some parts of the SpeechBrain toolkit, in terms of how we can make something that's very easy to use, that makes really good intuitive sense, and that works for a wide variety of problems. So I think having that ecosystem was important for giving us good ideas about some of the things that we wanted to include.
[00:20:43] Unknown:
As far as the actual workflow of using SpeechBrain, I'm wondering if you can just talk through some of the process of collecting the source data, labeling the audio, doing the training process, and just building out a solution from start to finish?
[00:20:59] Unknown:
Of course, data collection and labeling is a very challenging problem and one that academics don't spend enough time thinking about. They tend to use datasets that already exist. So the focus of SpeechBrain is not on data collection, but more on the modeling side. If you want to build a model with SpeechBrain, the place to start is a set of templates that we've created. They each do a specific task, like speech recognition or speech enhancement, and they're meant to be as simple as possible and as well documented as possible. So you can basically copy the recipe and then make the modifications that you need for your specific data or for the particular model that you're trying to build, something like that. The templates are a great place to start. They have everything you need and should require modifications only in the specific area that you're trying to work in.
[00:22:02] Unknown:
And to bring it to a sort of concrete example, if I were to want to take my entire back catalog of the podcast and train a model so that I can automatically generate transcriptions and do speaker diarization and speaker recognition and maybe some background audio noise elimination. What are some of the steps that would be involved in actually building that kind of project and some of the additional pieces that I might need to bring in?
[00:22:28] Unknown:
Sure. So along with the more research oriented path, where you can take a template, edit it a little bit, and use it to generate a new experiment, we also have a set of pre-trained models which have been uploaded to Hugging Face, and they are already capable of doing a lot of the tasks that you just mentioned. It just takes 3 or 4 lines of code. You can find all of these SpeechBrain models on Hugging Face, and there's a description of how to use them there. The functions which we have built in to use the pre-trained models are capable of doing, in just 3 or 4 lines of code, the transcription or the enhancement, whatever you want. It's a separate model for each of those tasks.
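(As a hedged illustration of those "3 or 4 lines": something along these lines works with the pre-trained models published under the speechbrain organization on Hugging Face. The class name and model identifier below reflect the API around the time of this episode and may have changed since; the audio file path is just a placeholder.)

```python
from speechbrain.pretrained import EncoderDecoderASR

# Download (and cache) a pre-trained English ASR model from Hugging Face.
asr = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_asr",
)

# Transcribe a local audio file (placeholder path).
print(asr.transcribe_file("episode_clip.wav"))
```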
[00:23:21] Unknown:
And then as far as the use cases, you mentioned that it's primarily focused toward research. I'm wondering what are some of the ways that that manifests in terms of the interfaces that you provide, the way that the overall project is structured, and the way that you think about the community building around it? This is a kind of academically driven project, right? So
[00:23:46] Unknown:
we already have a lot of students that have shown interest in contributing. There is a lot of excitement around this project, not just among students, but around the world. The community is growing very fast. We released the project only three months ago, and we are already making it bigger and bigger with more collaborators. So I think the fact that the project is academically driven helps a bit, right? Because we can have more students, more volunteers that are very happy to help us, because this is just an effort for the community. There isn't a kind of business behind it.
So I think we have seen a very important, very positive reaction from the community, and we have the feeling that this will grow more and more in the next months.
[00:24:36] Unknown:
As far as being able to help grow the ecosystem around it, as I was going through and preparing for the show, I noticed that you have a bunch of tutorials available on Google Colab for actually stepping through and experimenting with the system, and it definitely eases the on-ramp. But I'm wondering how you thought about structuring the stages of the tutorials to gradually expose the different capabilities of the system.
[00:25:01] Unknown:
Yeah. In general, we care a lot about educational purposes as well. As I mentioned, we are academically driven, so we have a lot of students, and we would like to also serve educational purposes. That's the reason why we have a lot of tutorials, a lot of documentation, etcetera. And we also have plans to expand them in the future. Right now, we have some tutorials that go in the direction that you mentioned. For instance, there are tutorials like speech recognition from scratch, where we walk a user step by step through the creation of a speech recognizer.
So even in the current version, you can find tutorials with different levels of complexity, from the basic functionality of SpeechBrain to more complex
[00:25:50] Unknown:
ones. As far as the capabilities of the project, we've touched on a few of the different high level objectives, you know, speech recognition, speech enhancement. But what are some of the features or capabilities that are not obvious at first, or that only become really evident or powerful as you combine different elements of the system, that you think are worth calling out?
[00:26:13] Unknown:
Multitask learning wasn't trivial at all, right? We definitely had to think a lot about how to implement that, because every task has its own features, its own way to process data, etcetera. So I think the most surprising thing is that, at the end, when we tried to combine things together, it was pretty smooth. For instance, Peter worked a lot on the combination of speech enhancement and speech recognition, and we realized that combining these things is really, really natural. Maybe Peter can comment a little bit more on that, but I think that's the most positive thing. So in the end, we have the feeling that we reached our initial goal of making speech toolkits just a little bit more flexible, faster, and modular.
[00:27:04] Unknown:
Yeah, I would agree with that. I recently finished my PhD, and my dissertation focused a lot on exactly what Mirco was saying: how we can train machine learning models in a better way if we use a combination of different models together, you know, one model teaching another, things like that. And that sort of work is really made possible by SpeechBrain, because before, when each task was siloed in its own separate way of doing things, it was really hard to get these models to talk to each other. So having something like SpeechBrain, where the models are all constructed in the same way, makes it much, much easier to combine models, use different models to train each other, and find really innovative ways of
[00:28:03] Unknown:
using models in combination with each other. That brings up the question of being able to also build some generative adversarial networks, or GANs, and maybe sort of attack the model to improve its robustness. I'm wondering if that's something that you've explored at all with SpeechBrain, given that it has this multi-model support.
[00:28:24] Unknown:
Yeah, that's exactly right. That's one of the things that I was talking about. Specifically, for speech enhancement in particular, we have one recipe that uses GANs in sort of a new way, and it produces much better end results than using straightforward GANs.
[00:28:44] Unknown:
So I think SpeechBrain is a good place for that. You mentioned, Peter, that you just finished your PhD, and Mirco, you're still working with SpeechBrain in your own academic work. I'm wondering what are some of the specific research areas where you're actually employing SpeechBrain, now that you've got it to a point where it's production ready in an academic setting?
[00:29:04] Unknown:
Well, on my side, I'm very interested in the domain of self-supervised learning. Self-supervised learning happens when we don't have manual annotations of the data, but just the data themselves. For instance, for speech, we have tons of data, but only a relatively small amount of it is actually annotated and suitable for supervised learning. So I'm working a lot in this domain, which has kind of exploded in the last couple of years, thanks to the work done by Facebook on wav2vec, etcetera. But I also work on some innovative techniques, and I think this is a kind of breakthrough for the deep learning community and also for the community working on speech and audio. So I'm working a lot on self-supervised learning, and I'm still very interested in combining technologies together, because I think the technology for recognizing speech and the technology for producing speech are not independent, right? They are correlated with each other. So I plan to do some work in the direction of combining these technologies and putting them in a kind of loop.
And this is probably the kind of research where SpeechBrain excels, right? We designed it for that. So combinations of technologies and multimodal things are another topic that is very interesting for my research.
[00:30:30] Unknown:
Have you explored any of the work being done with things like Snorkel for automatically generating annotations based on functions that are defined by domain experts, and so being able to take a piece of audio data, have some function that will generate labels, and then feed that into your self-supervised systems?
[00:30:51] Unknown:
Yeah, that's actually very close to what is called self-training. In self-training, normally, we have one model which produces some output, which could be an output with errors or whatever, but we use this model to create a kind of annotation for the data. That's something already explored a lot in the community. And the combination of self-supervised learning and self-training is very powerful these days. Actually, the state-of-the-art systems in speech recognition, for instance, use both. They use self-supervised learning, self-training, and finally, supervised learning. So this is a big trend towards combining different learning modalities, which is super interesting in my opinion.
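(To make the self-training idea concrete, here is a small hypothetical sketch of one pseudo-labeling round; the seed_model and train callables are placeholders, not SpeechBrain APIs.)

```python
from typing import Callable, List, Tuple

ASRModel = Callable[[str], str]  # takes a wav path, returns a transcript

def self_training_round(
    seed_model: ASRModel,                 # e.g. a model trained on the labeled data
    labeled: List[Tuple[str, str]],       # (wav_path, transcript) pairs with human labels
    unlabeled: List[str],                 # wav paths without transcripts
    train: Callable[[List[Tuple[str, str]]], ASRModel],  # training routine (placeholder)
) -> ASRModel:
    # 1. Use the seed model to generate pseudo-labels for the unlabeled audio.
    pseudo_labeled = [(wav, seed_model(wav)) for wav in unlabeled]
    # 2. Retrain on the union of human labels and pseudo-labels. Real systems
    #    typically filter the pseudo-labels by a confidence score first.
    return train(labeled + pseudo_labeled)
```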
[00:31:38] Unknown:
And in terms of some of the other projects that are being built with SpeechBrain, and people in the community that are starting to adopt it, what are some of the most interesting or innovative or unexpected ways that you're seeing it used?
[00:31:50] Unknown:
First of all, we have to say that we released SpeechBrain only three months ago, right? So we don't really know yet what people are doing with it. We will discover this very soon. But on our side, the most interesting thing is this combination of technologies, which is pretty interesting. I don't know if Peter is aware of
[00:32:12] Unknown:
something more, but yeah. I've seen some things that I didn't expect come out of SpeechBrain, like the music generation model that was trained to produce piano music that sounded like Bach. And it didn't really sound that much like Bach, but it was really fun to listen to.
[00:32:33] Unknown:
It's very cool. And in terms of the lessons that you've learned in the process of building and using the project, what are some of the most interesting, unexpected, or challenging ones?
[00:32:49] Unknown:
Developing a single toolkit for multiple tasks. That was really, really difficult. And in fact, during the project, we iterated over and over until we found the current solution. That's the most challenging thing, because it's relatively easy these days to create a single system for a single task, like speech recognition. But if you would like to address the multitask aspect, you have to carefully think about what you do. With the amazing team that I work with, we communicated a lot during the project, we shared feedback along the way, and we revised things multiple times. And finally, we were able to reach a solution that, maybe, is not the final one, but at least a good starting point.
[00:33:39] Unknown:
In terms of people who are interested in using SpeechBrain, and they might want to do some of the speech enhancement or speech recognition or some of the background noise cleanup, what are the cases where it's the wrong choice and they might be better served using some managed API, or a different toolkit, or another open source project entirely?
[00:33:59] Unknown:
It's never a bad choice. I'm joking. Actually, at least in the current version, we are not addressing some tasks which might be important in some applications. I'm mainly thinking about small-footprint speech processing, where you may want to do speech processing not on a GPU, but on a smartphone, etcetera. And connected to that, we are not addressing real-time processing. So these are topics that we are not currently supporting, but we have plans for them.
[00:34:31] Unknown:
In that vein, what are some of the things that you do have planned for the near to medium term future of SpeechBrain and your uses of it? Sure, we have pretty big plans. Since we have pretty positive feedback from the community, we are very motivated to do a good job in the future.
[00:34:47] Unknown:
There are things that we are going to do in the short term, like support for text-to-speech, which is very important. But also, in the longer term, as I mentioned before, we want to dive into the domain of real-time processing and small-footprint processing. These are big topics that we plan to address in the future.
[00:35:08] Unknown:
Well, for anybody who does want to get in touch and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. This week, I'm going to choose a tool that I started using recently for managing the scheduling of the podcast interviews that I do. It's called x.ai, and it's a calendar scheduling tool that makes it very easy to set up different templates of time ranges that you're available, and how many of a particular type of meeting you want to allow within a given day or week, and then it handles a lot of the back and forth automatically. So that's been pretty nice to offload some of the scheduling challenges of dealing with the podcast. I definitely recommend it for anybody who's looking for a way to simplify the work of managing their calendar. And with that, I'll pass it to you, Mirco. Do you have any picks this week?
[00:35:55] Unknown:
Well, I recommend SpeechBrain for sure. For vacation, well, now things are getting better, so I will take my vacation as usual in Italy. It's been two years that I have not been able to go there. So that's something I recommend. Visiting my country is wonderful.
[00:36:13] Unknown:
Well, thank you both very much for taking the time today to join me and share the work that you've been doing with SpeechBrain. It's definitely a very interesting project and one that I plan to start experimenting with to see if I can build a transcription process for the podcasts, because that's something that I've worked on in fits and starts but haven't really got a smooth workflow for yet. So, hopefully, SpeechBrain will help in simplifying that process. Thank you both again for all the time and effort you've put into it, and I hope you enjoy the rest of your day. Thank you. Thank you. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Guests and SpeechBrain
Overview and Goals of SpeechBrain
Challenges in Audio Processing
State of the Ecosystem and Need for SpeechBrain
Technical Implementation and Design Decisions
Workflow and Use Cases
Research Focus and Community Building
Innovative Uses and Future Plans