Summary
The MusicBrainz project was an early entry in the movement to build an open data ecosystem. In recent years, the MetaBrainz Foundation has fostered a growing ecosystem of projects to support the contribution of, and access to, metadata, listening habits, and reviews of music. The majority of those projects are written in Python, and in this episode Param Singh explains how they are built, how they fit together, and how they support the goals of the MetaBrainz Foundation. This was an interesting exploration of the work involved in building an ecosystem of open data, the challenges of making it sustainable, and the benefits of building for the long term rather than trying to achieve a quick win.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Before you put your code into production you need to make sure that it passes all of the tests, that it has been packaged with all of the dependencies, and that you haven’t introduced any security issues. Instead of running all of that on your laptop, let Codefresh handle it automatically with their continuous integration and continuous delivery platform. Built for the modern era of cloud-native computing, they make publishing to Kubernetes, serverless platforms, and virtual machines fast and seamless. With a growing library of pre-made steps, a flexible pipeline definition, and unlimited scale Codefresh lets you ship faster and safer than ever. Go to pythonpodcast.com/codefresh today to get unlimited builds on your free account.
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host as usual is Tobias Macey and today I’m interviewing Param Singh about the ways that Python is being used across the various Metabrainz projects
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by giving an overview of what the Metabrainz organization is and the various projects that it encompasses?
- What are the motivations for creating those projects and some of the origin story for Metabrainz?
- The MusicBrainz server is the longest running project and is written in Perl. What was the reason for switching to Python for all of the other *brainz projects?
- How does the MetaBrainz Foundation sustain itself? Where do the funds come from?
- How do you determine where and how to allocate the funding that you receive?
- Which of the *brainz projects is the most complex or challenging to build, whether due to technical or sociological reasons?
- How do you source and manage the information that powers all of the Metabrainz projects?
- How is development of the various projects organized?
- How does that influence the amount of code sharing that is possible between them?
- Of the projects that you have been involved in, how are they architected?
- What are the main ways that the projects differ in how they are implemented?
- What are some of the ways that you are using Python in support of the various projects that you work on?
- What are some of the most interesting, innovative, or unexpected ways that you have seen the projects or data built by Metabrainz being used?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while working as a contributor and maintainer of the Metabrainz projects?
- What is in store for the future of the existing Metabrainz projects?
- What are the next domains that are being considered for building a Metabrainz platform for?
Keep In Touch
- paramsingh on GitHub
- Website
Picks
- Tobias
- Beets music library organizer
- Param
Links
- Metabrainz
- Stripe
- The Himalayas
- Dublin Ireland
- XKCD Import Antigravity
- Last.fm
- Google Summer of Code
- CDDB
- Perl
- Flask
- SQLAlchemy
- 3rd anniversary cake
- Redis
- PostgreSQL
- RabbitMQ
- Spark
- Music Technology Group
- Splunk
- Artist Origins Map on ListenBrainz
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to podcast dot init, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try out a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode today. That's l i n o d e, and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Before you put your code into production, you need to make sure that it passes all of the tests, that it has been packaged with all of the dependencies, and that you haven't introduced any security issues. Instead of running all of that on your laptop, let Codefresh handle it automatically with their continuous integration and continuous delivery platform. Built for the modern era of cloud native computing, they make publishing to Kubernetes, serverless platforms, and virtual machines fast and seamless. With a growing library of premade steps, a flexible pipeline definition, and unlimited scale, Codefresh lets you ship faster and safer than ever. Go to pythonpodcast.com/codefresh today to get unlimited builds on your free account.
Your host as usual is Tobias Macey. And today, I'm interviewing Param Singh about the ways that Python is being used across the various MetaBrainz projects. So, Param, can you start by introducing yourself?
[00:01:44] Unknown:
Sure. Hey. So my name is Param Singh. I'm from India. I studied computer science. Like, I graduated just last year. I studied computer science at a university very near the Himalayas. Right now, I'm in Dublin. I work for Stripe, and I have been working with, like, MetaBrainz since college, since the beginning of 2017, which is almost
[00:02:06] Unknown:
3 and a half years now. So yeah. And do you remember how you first got introduced to Python?
[00:02:11] Unknown:
Yeah. So I think, like, the first few lines of Python code that I wrote were in high school, I think, like, almost 9 or 10 years ago. I was a high school junior and had just installed, like, Linux on the family computer for the first time. So I decided to play around with Python because it just, like, seemed cool. You know, the xkcd comic about import antigravity, the come join us, programming is fun again thing. I only knew, like, what we did in actual high school classes before that. So going from that to Python just, like, opened up a lot of things, which was cool in a sense. And then I basically, like, built small projects using Python, a scraper for, like, school teacher profiles, etcetera, etcetera, and it's just been going on since then.
[00:02:56] Unknown:
And you mentioned that you've been working with the MetaBrainz projects for about 3 and a half years now. I'm wondering if you can just give a bit of background into how you first got involved with that and some of the ways that you are engaging with the different projects?
[00:03:11] Unknown:
Right. So I was actually looking for, basically, I used to be, like, a very avid Last.fm user. Like, Last.fm is basically a site which tracks your music history and then, like, you get statistics out of it, recommendations, all that kind of jazz. But, like, there were some, like, complications with Last.fm, where it got bought by another company, and they got rid of a few features, etcetera, etcetera, which led to me basically trying to find an alternative. And from there, I came across ListenBrainz, which was basically love at first sight because, like, it was in Python, it seemed cool, and it was all open source, which really aligned with my personal values in the sense that, like, it's my data, so it shouldn't be just closed down. Right? If anyone wants to use it and if I make it public, people should be able to use it. So from there on, I just, like, started contributing. I was initially a Google Summer of Code student for ListenBrainz, where I basically worked on ListenBrainz and Google paid me to work on it for over a couple of months, and 1 of the team mentored me, and that was when I was, like, a college junior. So, yeah, that's how I got introduced to the MetaBrainz Foundation. And from then on, I just, like, kept contributing after Summer of Code, part time, while I was a student.
And, yeah, that's mostly it. So the MetaBrainz Foundation
[00:04:29] Unknown:
is a parent to a number of different projects. I know that MusicBrainz was the foundational 1, but can you give a bit more of an overview about what the organization is and some of the various projects that it encompasses?
[00:04:40] Unknown:
Mhmm. Sure. So MetaBrainz is basically a nonprofit organization that is in California, but we have, like, core contributors spread across the world, Europe, US. Like, basically, almost all time zones, we have someone who contributes. We believe in, like, open access to data, and we specialize in datasets revolving around music. So, basically, any information about music that someone wants, we probably want to, like, have it somewhere. So the oldest project that we have is MusicBrainz. I like to explain MusicBrainz as kinda like Wikipedia, but for music metadata. Like, suppose you want to build a music app like Spotify, you'd want to know, like, which artist has released which song or, like, what kind of artist it is, etcetera etcetera. So we basically collect that information in MusicBrainz.
MusicBrainz has, like, a total of 2,000,000 editors in total since it started. It's basically the source of truth for music data on the Internet right now. Google, Bing, Amazon, BBC, they all use it for their music information. So if you ask for details about a song from, like, Google Home or, like, maybe Alexa even, I'm not sure about Alexa, but maybe, they'll probably query the MusicBrainz data and get that information from there, and anyone can just edit the information. So, yes, that's MusicBrainz. The project that I personally spend a lot of time on is ListenBrainz. So ListenBrainz started around 5 years ago. It's basically just Last.fm, but open source. It keeps track of your music history. It basically started as, like, a hack project between the MetaBrainz people and a few people that founded Last.fm, because they wanted to take a fresh stab at it. So, yeah, that's ListenBrainz.
AcousticBrainz is another project that we work on. AcousticBrainz, I like to define as, like, us trying to find out what music sounds like. We basically try to crowdsource extremely detailed acoustic information like the pitch of a song, the BPM, etcetera etcetera. And then we run, like, machine learning models on it to calculate more abstract stuff like danceability, whether the song has vocals, are there male vocals or are there female vocals, stuff like that. Other than that, we do have, like, a few other projects. Like, we have BookBrainz. BookBrainz was started as a community project by people who were basically inspired by MusicBrainz, and they wanted to start something like MusicBrainz, but for books. CritiqueBrainz is another project where we try to collect and make open music reviews, like, opinions about music. MusicBrainz is more fact based, but CritiqueBrainz is, like, a place where you can put your subjective opinions about the music. So, yeah, I think that's about it. We also maintain and develop a small app, which we call Picard, which people use to, like, actually take MusicBrainz data and use it to organize their music collections.
So, yeah, I think that's a decent overview of all the projects that we have. It's a lot of projects for, like, a pretty small team, but we try to make do.
[00:07:41] Unknown:
And as you mentioned, a big component of these projects is the data that's actually collected. I'm wondering if you can discuss what the overall goals are for making all that information publicly available and some of the foundational inspiration for the MetaBrainz Foundation, if you have any of that context.
[00:08:06] Unknown:
Yeah. Sure. So when MusicBrainz was started, I was probably only a baby. But the story is really interesting, and, like, I can tell what the story is. So this is back in 1996, when, like, the Internet was just getting popular and things like that were happening. There was this service called CDDB that people used to basically store information about their compact discs. Right? They'd submit that information to cddb.com, and then everyone could use it. But it wasn't open source. And what happened was that it basically got bought by a company, and it became private. So, basically, like, a huge number of contributors who had contributed a large amount of data to a service, basically with the assumption that it would remain public, that data became private. So that resulted in a large, like, public outcry, and people were basically angry.
And it led to the start of a bunch of open source projects that wanted to be, like, the competitor to cddb.com. So MusicBrainz was 1 of those competitors. Eventually, we just, like, took it and ran with it. And, right now, I don't think, like, we have any real competitors in terms of music metadata, in the sense that we're almost the source of truth, as I said earlier. So yeah. Eventually, the MetaBrainz Foundation was set up to actually, like, maintain the data, because having an organization versus a person is just, like, good practice so that stuff like this doesn't happen again. The rest of the projects are basically just, like, logical extensions to MusicBrainz. Right? So we have all this data. Why not build an app like Picard to help people organize their collections? We have all the facts about music. Why don't we actually start collecting opinions too? So that led to CritiqueBrainz.
Why not start crowdsourcing more detailed information, like AcousticBrainz does? So they're basically just, like, extensions of what MusicBrainz does. And they're all, like, relatively long games. So MusicBrainz really became useful to people about, like, maybe a decade or 2 after it got started. Right? AcousticBrainz and ListenBrainz and all the others, they're relatively younger. So they're not very useful to anyone right now. But we're hopeful that, like, they'll get to a critical mass, like, maybe years down the line, and then they'll be useful to people. So, yeah, that's the, I guess, the origin story for MetaBrainz.
Our motivations are basically, like, very simple. We just want to, like, make sure that all this data remains open; we're just trying to democratize all the data. Right? Right now, if you look at it, most of the data, the big companies, the big 4 or whatever, they have the data. And, like, if you're trying to build a competitor to those services, you probably won't be able to, because, like, that's just not realistic. But if we succeed, then that leads to a lot of this stuff going into the open, which leads to, basically, democratization, where people can actually build competitors to huge services while being small and agile, I guess.
[00:11:08] Unknown:
And digging more into the technology, as you said, MusicBrainz was the first iteration, the first project of the foundation. And because of the time at which it was created, it was implemented in Perl, which was 1 of the more popular languages. But all of the more recent projects have been implemented in Python. So I'm curious what the motivation was for switching to Python for all these other, more recent MetaBrainz projects and continuing to maintain MusicBrainz as a Perl app.
[00:11:40] Unknown:
Mhmm. So just 1 small correction there, I guess. BookBrainz started off as a community project, and it's actually written in JavaScript. You know, Node, Express, React, the whole shebang. We still maintain it, but it is in JavaScript. But other than that, the specific decision about going to Python from Perl was a bit before my time at MetaBrainz, but I can still take a stab at explaining the reasoning behind it. I think a major reason was that Perl just isn't very popular among developers these days, and this showed up for us in the number of new contributors that we get to our projects. BookBrainz, which is written in JavaScript, has, like, a huge number of new people, like, new developers wanting to contribute.
The Python projects also have, like, a reasonable amount of new contributors who are able to, like, if it's a small thing, just go ahead and fix it. And if they want to, like, make more core changes, they're still, like, volunteering. MusicBrainz is harder to contribute to these days because there are just not, like, many people who are willing to learn Perl to actually start contributing, or many people who want to program in Perl in their free time. Right? So I think that was 1 of the reasons why we went to Python. Another reason was that our Python environment, just Flask and SQLAlchemy and stuff like that, it's just, like, really nice to use. So it really helps us work at a fast pace and, like, it's just fun to develop in. Another reason would probably be Perl 6. Perl 6 is obviously gonna be very different from, like, Perl 5, and it had been looming for a long time. And that basically meant that we were gonna have to change languages at some point no matter what. So if we were gonna change languages, then we probably wanted to go with, like, something that's better for our use cases, and Flask and SQLAlchemy and that stuff just really, like, seemed perfect. So yeah. And I think those are, like, the major reasons for us going with Python. In terms of the sustainability
[00:13:43] Unknown:
of the foundation and the work that you're doing, how does the MetaBrainz project approach that, and where does the funding come from?
[00:13:51] Unknown:
Oh, yeah. So a few years ago, we basically relied just on donations, in the sense that people who use our site would donate to us, and that would go to the server costs or developer payroll and stuff like that. But that wasn't really working out, because, like, our major real users, they were businesses, and our individual users, the editors of MusicBrainz, they don't really have, like, much incentive to donate to us. Like, they're already contributing their data, right? Asking them to contribute money as well was, like, not ideal. So in the end, what happened was that the executive director came up with something that I like to describe as the drug dealer model. So the data that we have is free. Right? And it's easy to use. So what engineers do is that if they're building stuff, they just download it and, like, build their cool stuff. Mostly, this stuff gets released without us even knowing about it. Eventually, those products, they get at least medium sized, and we come to find out that this product is using our data. So we knock on their doors, and we ask them to support us. And these are not exactly, like, core music industry companies. These are mostly music technology companies. Right?
So mostly what happens is that, like, executives will try to find other sources of this data. Right? Maybe try to find an alternative. Most of the alternatives are, like, orders of magnitude more expensive and not nearly as accurate. So, eventually, we basically go to them and tell them that you need to support us, because we need funding to actually maintain the service that you're using, and, eventually, they just pay us. Right? So a really fun anecdote about this, about our, like, funding model, is the Amazon cake story. I really like telling this story to people because it's always, like, very fun. So what happened was that Amazon was basically, like, 3 years behind in paying us. We had sent them, like, an invoice, and it had been 3 years. And they'd accepted that they'd pay, and they hadn't paid it. Right? So our executive director had basically, like, been following up with them for months asking for the payment, etcetera etcetera. So, eventually, what he did was that he sent them an email saying, if you don't pay in 2 weeks, I'm going to send you a cake.
So a cake. And, yeah, a cake which basically said congratulations on the 3rd anniversary of invoice number 144. So they did panic a bit, but the payment didn't happen. The 3rd year anniversary came, and, like, the executive director sent the cake and posted about it. Right? It got some traction in, like, some blogs, etcetera etcetera. Amazon did get, like, a bit panicked. So our executive director eventually got calls from, like, the head of legal, the head of music, the head of everything, etcetera. And, eventually, what happened was that the cake was sent on a Tuesday, and, finally, the check came in on a Friday. Right? So that's the point where we realized that it's really just a matter of, like, getting these huge companies who actually use our data to pay us. And, like, that's when we set up an entirely different site called metabrainz.org, where we ask people to basically use our data for free if it's for noncommercial purposes. But if it is for commercial purposes, then please do support us. We have different tiers of support, where if it's a nonprofit or if it's a university or something, then that basically is free. But if they're a small startup, we ask them for a small amount of money, etcetera. If it's a huge startup, we have a tier called, like, Unicorn, which is for really, really, really big companies like Google or Microsoft or Amazon, etcetera. So that's basically our funding model right now, in the sense that people just start using our data, and, eventually, we find out about it. We ask them to support us. And at that point, they're just so hooked into the data that it's probably more expensive for them to, like, actually switch versus just supporting us. And in terms of the sustainability
[00:17:45] Unknown:
of the project, it's a lot of open source work. So part of the sustainability question is the development time that goes into building it, but then there's also the actual use and integration of the substantial amount of data that's been accrued, so you also have hosting costs and servers and ongoing maintenance and data management questions. So I'm wondering how you allocate the funds that you do receive from these companies who are supporting you, and some of the challenges that you face in determining what feature sets to develop, and determining the road map of these projects, and managing the overall data and its longevity? Mhmm. So
[00:18:33] Unknown:
in terms of finances, I think all of our finances are public and anyone can see what we're spending the money on. In terms of, like, actual prioritization, I think what really happens is that MusicBrainz is basically our star product, and we try to make sure that it has all the resources that it needs to actually keep running and be sustainable and, like, just run all the way through. So MusicBrainz right now has, like, I think, 2 developers who are full time on it and, like, 1 developer who's, like, part time on it. The rest of the projects, because they're not, like, bringing in much money right now and because they're long games, they don't have as many resources allocated to them as we'd like, because we are still, like, resource constrained; we don't have, like, huge piles of money. So, ideally, what happens is that we try to make do with what we have as much as possible and try to just keep them running as well as possible. But in the end, I don't think, like, there are any specific, very detailed plans around, like, okay, this percent of the money needs to go there. We mostly play it by ear and, like, just take reasonable precautions, in the sense that, right now, I'm pretty sure that we always try to keep, like, 1 year's worth of money in the bank account in liquid form so that, like, even if all our supporters went away tomorrow, we'd be able to run for at least a year. And it's mostly just, like, stuff like that, thinking about worst cases and making sure that those don't happen. And it's not very ideal, but I think that's mostly what we do.
[00:20:08] Unknown:
And with these multiple projects that the MetaBrainz Foundation is responsible for, how is the overall development of them organized? You mentioned that there are 2 full time engineers dedicated to MusicBrainz. But in terms of organizing the work being done on these different projects, is it just the typical open source approach of here are a bunch of issues, everybody works on what they feel like, and, eventually, it ends up being something that's useful and usable by the broader community? Or is there some more directed effort that goes into determining what gets worked on by whom and when?
[00:20:44] Unknown:
So, realistically, I'd say that, I think as in most things, it's a compromise between the 2. So we have, like, I think, 4 or 5 developers who we actually have contracts with, and the rest of the developers are basically volunteers. So in the end, what happens realistically is that there's always stuff that's not, like, the thing that people want to do in their free time. That's stuff that needs to be done, but getting volunteers for it would be, like, very hard. So in the end, what happens is that the contractors, the people who actually work full time on stuff and get paid by the MetaBrainz Foundation, they mostly spend their time on stuff like that and actually just keeping the product running, and making sure that, like, if something comes in from a huge customer or something, then that issue gets handled, etcetera etcetera. But other than that, we do all of our work in the open, just like all open source projects. Roadmaps, our planning documents, our design documents, they're all public, and anyone can see them. And other than that, our development chats all happen on IRC and are publicly logged.
So after that, it just becomes, like, a case of how each project wants to go ahead. Right? So I personally have a thread open in our Discourse forums, basically asking our users, like, our community, what they want from ListenBrainz. We use all of that stuff to, like, prioritize what needs to get built and then just try to go ahead and build some part of that and, like, then go back again, see the feedback, build something else, etcetera etcetera. The Google Summer of Code program has been, like, really impactful for MetaBrainz specifically. Every summer, Google sponsors a few students to, like, come and work with us on an internship of sorts. The students basically propose their own project, and we mentor them as they go. I was introduced to MetaBrainz through Google Summer of Code, and so were, like, I'd say, a huge number of other high impact contributors.
And because we're, like, a very small team, it's easy to, like, stay in sync on what someone is working on so that we don't, like, step on anyone's toes. But overall, in terms of prioritization, it's mostly just a mixture of us building things that we want from the project versus us, like, taking user feedback in and trying to build those as well, just to make sure that our users actually like the product, and, like, a bunch of hanging out and just hacking on stuff.
[00:23:07] Unknown:
So for the projects themselves, you mentioned that the Python projects are generally built around Flask and SQLAlchemy. I'm wondering how much of the overall architectural design is shared across the projects, and how much divergence there is among them, and just some of the complexities involved in moving between the different projects as a contributor? Mhmm.
[00:23:29] Unknown:
So, basically, all our Python projects, they have the same, like, basic stack: Python, Flask, Redis, etcetera etcetera. They're also, like, structured pretty similarly. So if you're, like, familiar with 1 of them, you can probably easily start on the other 2. That's mostly just because it's, like, the same cofounders who started all 3 projects. But the thing that happens is that all the projects, like, diverge on their specific use cases. For example, ListenBrainz diverges from the rest of the projects in terms of its, like, data pipeline. ListenBrainz can have, like, a lot of data coming in at a specific point of time, because, like, there's a lot of people listening to music at any given point of time. That's not the case with, like, MusicBrainz, because it's mostly just editors. Right? So for ListenBrainz, the short explanation would be: we receive data through the API, put it into RabbitMQ, a RabbitMQ consumer then writes it into our time series database. From there, we have periodic dumps, and from those, the data goes to Spark. And, like, from there, we calculate stuff like statistics, etcetera etcetera.
And so that's ListenBrainz. Compared to that, AcousticBrainz is different because, like, there's a lot of incredibly detailed low level data that people send. So we don't have, like, those pipeline problems that ListenBrainz has, but we have, like, a large amount of data that we need to run machine learning models on. So the pipeline for that is completely different from the pipeline for ListenBrainz, because they have different use cases, and the ways to scale them are just different. So that's basically where all the divergences between the different projects really come from. Different use cases then lead to different technologies being used. AcousticBrainz uses, like, a machine learning model that was built by MTG, the Music Technology Group at a university in Barcelona, because, like, AcousticBrainz was a partnership with them. So yeah. Now with ListenBrainz, we just started doing machine learning, but we have to do that in Spark, because it's basically like, we built out our entire Spark cluster for ListenBrainz. So, yeah, I guess, like, they really diverge when the use cases diverge.
But, overall, I think it'd be easy for 1 person to move from, like, ListenBrainz development to AcousticBrainz development pretty easily, because they basically all have the same structure and almost the same coding style and the same testing framework, etcetera etcetera. It's mostly the same, except for the places where it's different, I guess.
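To make the ListenBrainz-style ingestion path described above a bit more concrete, here is a minimal sketch of a RabbitMQ consumer that pulls submitted listens off a queue and persists them. It is only an illustration under assumed names: the incoming_listens queue, the message shape, and the store_listen helper are hypothetical, not the actual ListenBrainz implementation.

```python
import json

import pika  # RabbitMQ client library


def store_listen(listen: dict) -> None:
    """Hypothetical helper: write one listen to the time series database."""
    # A real deployment would insert into the listen store here.
    print(f"storing listen for user={listen['user']} track={listen['track']}")


def handle_message(channel, method, properties, body) -> None:
    """Decode a submitted listen, persist it, then acknowledge the message."""
    listen = json.loads(body)
    store_listen(listen)
    channel.basic_ack(delivery_tag=method.delivery_tag)


def main() -> None:
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    # "incoming_listens" is a made-up queue name for this sketch.
    channel.queue_declare(queue="incoming_listens", durable=True)
    channel.basic_consume(queue="incoming_listens", on_message_callback=handle_message)
    channel.start_consuming()


if __name__ == "__main__":
    main()
```

In the pipeline as described, the web API would publish each submitted listen to the queue, and the periodic dumps fed into Spark for statistics would happen downstream of the datastore this consumer writes to.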
[00:26:05] Unknown:
Yeah. And that's interesting too in terms of the supporting tooling that you're using to help maintain some of that consistency and some of the ways that you're using Python there. You mentioned testing and some of the coding styles. I'm wondering if you can just discuss a bit more about the conventions that you've built up to help facilitate that transition between projects and maintain the consistency and just the overall maintainability of multiple projects with a small team?
[00:26:31] Unknown:
Right. Yeah. So I guess the first thing here is that we're basically Python everywhere. And even if we're, like, building small tools that aren't supposed to actually go into the core repositories or the core code bases, they're almost always in Python, because we're just, like, good at Python now. We have been running Python for, like, years, and it's something that we know how to run well and how to just work with. So yeah. So what we do right now is we have different git repositories for each of the projects. Like, ListenBrainz has a different git repository, AcousticBrainz has a different git repository, CritiqueBrainz has a different git repository. And if we need to, like, share code between them, we have a separate utils kind of Python library that we also maintain, called brainzutils, which basically contains stuff like that. We have a Flask extension in it, which is configured for our environment. It has, like, Sentry integration, logging, rate limits, etcetera etcetera.
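As a rough illustration of the kind of shared library being described, the sketch below shows a hypothetical application factory that wires up logging and Sentry reporting in one place so each project does not repeat that setup. The create_app name and config keys are assumptions for this sketch, not the actual brainzutils API.

```python
import logging

import sentry_sdk
from flask import Flask
from sentry_sdk.integrations.flask import FlaskIntegration


def create_app(config_object: str) -> Flask:
    """Hypothetical shared factory: each web project gets the same logging and
    error reporting setup without duplicating the boilerplate."""
    app = Flask(__name__)
    app.config.from_object(config_object)

    # Shared logging configuration used by every project.
    logging.basicConfig(level=app.config.get("LOG_LEVEL", logging.INFO))

    # Shared error reporting, enabled only when a DSN is configured.
    if app.config.get("SENTRY_DSN"):
        sentry_sdk.init(dsn=app.config["SENTRY_DSN"], integrations=[FlaskIntegration()])

    return app
```

Each project would then call something like create_app("listenbrainz.config") with its own configuration module instead of re-implementing the setup; rate limiting and other cross-cutting concerns could be attached the same way.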
So if I was building all this, like, again, I would probably, like, tend towards putting all of this into a monorepo, having just a single repository for all of these projects, because that would definitely make sharing code across different projects much easier. But right now, what we basically do is that, because we're a small team, everyone basically has, like, at least some idea of what someone else is working on, and it's basically just like, if someone is working on something that might be of interest to some other project, then we'll probably try to think of a way to, like, put it in brainzutils, just to make sure that other projects don't have to, like, do the same work again. So, yeah, our Python environment right now is, like, very mature, and it's really hard to displace it for any kind of tooling that we really need to build, and I don't really see that changing anytime soon. And the interesting outlier too among all of these projects is the Picard application that you mentioned because it's a desktop
[00:28:25] Unknown:
program and built using Python and Qt. So I'm wondering if you can discuss a bit about some of the ways that that stretches the development tooling and just some of the work that goes on to keep that relatively consistent with the other programs.
[00:28:41] Unknown:
So we did have, like, this interesting use case where Picard basically needs a website that people can download it from. So what happened was that when we were, like, writing a new version of that website, we basically just, like, took code from the existing projects and then just remodeled it. So that was really interesting. But overall, Picard is, like, a completely different application. So there isn't, like, much code sharing between Picard and our, like, web service data projects. But, yeah, the website, and we have, like, an extension API for Picard which is based on Flask, and I think those things definitely, like, really took the learnings from our, like, API development in AcousticBrainz or ListenBrainz and ran with them. And that was a nice example of us taking old learnings and running with them again and not having to basically duplicate a lot of work.
So yeah.
[00:29:39] Unknown:
Another element of these projects beyond just the server side aspects is the data modeling and the fact that you have existing data. So any schema evolution needs to take into account the potential for breakages or mutations. I'm wondering how that gets factored into the planning and the overall review process and making sure that you don't accidentally change things too drastically or that you have a common set of base attributes that you use across the different projects for managing the metadata?
[00:30:12] Unknown:
So I think we haven't had many problems with stuff like schema changes. That's mostly because we've been running MusicBrainz for, like, over 20 years now, and MusicBrainz has definitely given us, like, opportunities to learn how to actually do this stuff very well. We have to, like, walk a thin line between people actually being able to understand what the schema represents, and, like, people being able to input data into the schema without having to read, like, books about it, and still having a detailed enough schema that's actually useful for people and that's actually easy for actual developers to write queries on. I think MusicBrainz has been doing that, like, really well, in the sense that we haven't had, like, people complaining a lot, at least in recent times. And given the lessons learned earlier, we're pretty good at, like, taking learnings from 1 project and then applying them to another. So we also haven't really had any specific problems in terms of, like, schema changes that would cause users to be angry at us. But overall, what happens is that if we're doing, like, anything around anything data critical, it goes through, like, a huge amount of testing and, like, extensive code review just to make sure that it's all actually working and it'll all go well. And then we have specific release dates where we basically have, like, an entire document written on the things that we need to do before the actual release happens. And at the end of it, it's mostly just, like, going through the document and taking the steps 1 by 1 and just running through them. We haven't really had, like, many problems with them, knock on wood. But, yeah, I think MusicBrainz has just been 1 of those, like, surprisingly great things where it walks the line between having a detailed schema and having users be able to understand what they want to insert into the schema really well.
So yeah. But I don't know. Just in terms of, like, actual schema changes, it's mostly just us being, like, very thorough about what data we're changing, because that can lead to huge problems, and we're just very protective of the data that we have and of the standards of our data. So anything that basically touches that just goes through a lot of review before it actually reaches production.
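As a small, hedged example of the kind of cautious, backwards-compatible schema change this process is meant to protect, here is a minimal Alembic-style migration that adds a new column as nullable first so existing rows and running code keep working. The table and column names are invented for the sketch and are not MetaBrainz's actual schema.

```python
"""Hypothetical migration: add an optional column without breaking existing readers."""
import sqlalchemy as sa
from alembic import op

# Revision identifiers used by Alembic (placeholder values for this sketch).
revision = "000000000001"
down_revision = None


def upgrade() -> None:
    # Adding the column as nullable keeps old application code and existing
    # rows valid; a NOT NULL constraint could follow in a later migration
    # once the data has been backfilled.
    op.add_column("listen", sa.Column("listened_at_tz", sa.String(64), nullable=True))


def downgrade() -> None:
    op.drop_column("listen", "listened_at_tz")
```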
[00:32:39] Unknown:
And closely related to the underlying representation of the data is the APIs and the attributes that are available in the responses and the interfaces that are available to end users of the project. So I'm curious how you approach versioning and evolution of those interfaces for people who are consumers of the different projects that are managed by the Metabrains Foundation.
[00:33:02] Unknown:
So for ListenBrainz, our basic philosophy right now is that we try to just not break anything unless we really, really need to. And we haven't made any specific API changes that might have made clients incompatible or anything. We haven't made any breaking changes in the time that I've been here. But ListenBrainz is, like, relatively young. It's only been, like, 5 years. So yeah. But, overall, I think what we've realistically done for MusicBrainz specifically is that we've had a version 1 of our API and a version 2 of the API running simultaneously.
And we'd say that we want to disable version 1 in the next x days or something, and then it eventually just keeps running for years. And at some point, what will just happen is that we just don't have enough resources to maintain it, and at that point, we just, like, switch it off. But when we do switch it off, we do have, like, people coming in and complaining because we switched it off. But at that point, it's mostly just us, like, pointing them to the announcement that we had made years ago saying we'd deprecate this part of the API on this day, and then we deprecated it, like, years later. So we haven't had, like, a huge number of people complaining about that. But our general philosophy is we try to make as few breaking changes as possible and only make breaking changes when they are absolutely necessary, because breaking clients can be a pain.
And I personally maintain a Python client for the API as a personal project, so that basically gives me the point of view of both a developer using the API and, like, a maintainer of the API. So that really helps keep all of this in context.
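To illustrate the "run version 1 and version 2 side by side" approach described above, here is a minimal Flask sketch. The endpoint paths and response shapes are hypothetical, not the real MetaBrainz API routes.

```python
from flask import Flask, jsonify

app = Flask(__name__)


# Old endpoint kept alive under /1/ until it is formally retired.
@app.route("/1/listens/<user>")
def listens_v1(user: str):
    return jsonify({"user": user, "listens": []})


# New endpoint lives under /2/ with a changed response shape. Both versions
# are served by the same app, so existing clients keep working during the
# deprecation window announced for version 1.
@app.route("/2/users/<user>/listens")
def listens_v2(user: str):
    return jsonify({"payload": {"user": user, "count": 0, "listens": []}})


if __name__ == "__main__":
    app.run()
```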
[00:34:48] Unknown:
And as you said, the primary focus of the different projects in the MetaBrainz Foundation is oriented around data having to do with music and its associated information. And there's the recent introduction of the BookBrainz project. I'm wondering if there are any other domains that are being considered for building a new platform for, and what the process is for introducing that idea or kick starting a project for it.
[00:35:15] Unknown:
So just for context, BookBrainz basically started as a community project where someone from our community wanted something like MusicBrainz, but for books. And they just started hacking on it, and it eventually started being used by a lot of people in our community. And at that point, it just made logical sense for us to actually take care of maintaining it as well. But right now, I'd say that we already have, like, too many projects and a bit too little capacity, and we specifically need to focus because of all the uncertainty due to the pandemic and COVID 19 and all that stuff. So I don't really see any new projects that MetaBrainz will start anytime soon. We already have, like, a bunch of fledgling projects that need a lot of love, but people have definitely asked us for stuff like movie brainz or food brainz or, like, recipe brainz.
I'd personally love to see those, but I don't think that it's very realistic for us to build any of those right now. The thing that we really do want to move into is, like, music recommendations. So if you think about it, right now, we have, like, just enough data to actually get started on building an initial version of an open music recommendation engine. We have, like, people's music listening history. We have data about what music actually sounds like. We have music metadata. So what we're actually thinking of is basically building a developer friendly tool or something that basically helps people get started on building their recommendation engines without having to actually go through the entire process of data collection that most companies need to do before they can actually do anything that's actually valuable.
So music recommendation is actually, like, completely dominated by non open players right now. We personally feel that we can democratize that; we can, like, change that. But as always, it's, like, a long game. It's not gonna happen tomorrow. The first 1 that we'll probably build will be, like, it'll probably be really bad, but at least it'll be open. But, in general, like, our experience with these things is that when we build something and it's the first 1, it's obviously very bad, but someone will come to us and tell us that, okay, this is just, like, bad, and most probably they'll help us fix it. So it becomes a bit better, and then someone else might come and help us. And, eventually, it just, like, it takes a long time. It definitely doesn't happen in the short term. But in the end, it becomes, like, a good product or a good project. So that's what we're, like, hoping for in the future of MetaBrainz.
But in terms of, like, new platforms or new datasets, I don't think that anything else is, like, really realistic in the future.
[00:38:06] Unknown:
And as far as the ways that the information, and the projects that are curating that information, are being used, what are some of the most interesting, innovative, or unexpected ways that you have seen them put to use?
[00:38:23] Unknown:
So a very interesting thing that I saw, like, a few weeks ago on Twitter was a user basically took their listening data and put it into Splunk. Splunk is a logging utility. Right? MetaBrainz doesn't use it, but it's generally used for, like, logs, and you can make graphs out of the logs. And so that was really interesting. The user, they actually, like, created real time graphs of their listening history, in the sense of, these are the top artists that I listened to this week, stuff like that, out of Splunk. And because they were using our API, they could just, like, hit our API every 5 minutes or so and get, like, very real time data. Obviously, that doesn't, like, scale to the number of users we have, and I was, like, personally working on solving those problems for all the users that we have. But, yeah, that was a very interesting use that I saw. Another interesting use that I was, like, pretty impressed with was, like, 1 of our contributors basically built an artist recommendation engine based on just MusicBrainz data.
So what they basically did was, in terms of, like, if you have an artist and you want a bunch of artists that are similar to them, to do that, they used the number of compilations that the artist was in, and just, like, put weight on that, and basically just recommended the artists that were in the same compilations as the artist that you had a number of times. And that is, like, a really neat way to actually calculate similar artists in my opinion, and, yeah, that is fun.
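As a rough sketch of that compilation co-occurrence idea, with made-up data rather than the contributor's actual code, counting how often two artists appear on the same compilation gives a simple similarity score:

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: each compilation is just the list of artists credited on it.
compilations = [
    ["Artist A", "Artist B", "Artist C"],
    ["Artist A", "Artist B"],
    ["Artist B", "Artist D"],
]

# Count how many compilations each pair of artists shares.
co_occurrence = Counter()
for artists in compilations:
    for a, b in combinations(sorted(set(artists)), 2):
        co_occurrence[(a, b)] += 1


def similar_artists(artist, top_n=5):
    """Rank other artists by how many compilations they share with `artist`."""
    scores = Counter()
    for (a, b), count in co_occurrence.items():
        if a == artist:
            scores[b] += count
        elif b == artist:
            scores[a] += count
    return scores.most_common(top_n)


print(similar_artists("Artist A"))  # [('Artist B', 2), ('Artist C', 1)]
```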
[00:40:00] Unknown:
And your point of the user hitting the API every few minutes also brings up the question of scalability of the services that you're running. Because of the fact that these are globally accessible, how do you handle things like potential DDoS attacks or rate limiting of users to ensure that the service is accessible to everybody who wants to use it?
[00:40:22] Unknown:
So if we're talking about MusicBrainz, MusicBrainz is basically used by a lot of people. So we have specific rate limits on MusicBrainz, and we ask people for support. And, basically, what that means is that if they support us and, like, sign a contract or that sort of thing, then we might actually relax those rate limits a bit. In terms of ListenBrainz and AcousticBrainz, we haven't actually had many problems with actual scaling yet. Like, we do have, like, a rate limit on the number of requests that you can send in a particular time period, but it's not really mature, in the sense that if someone starts supporting us or something, then we'd, like, relax the rate limit a bit. But, yeah, I think, overall, the infrastructure is there, but it's not very mature yet.
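For readers curious what a simple per-client rate limit like the one described might look like, here is a minimal fixed-window sketch backed by Redis. It is a hypothetical illustration, not the MetaBrainz implementation, and the endpoint path and limits are invented.

```python
import time

import redis
from flask import Flask, abort, jsonify, request

app = Flask(__name__)
store = redis.Redis(host="localhost", port=6379)

REQUESTS_PER_WINDOW = 60  # allowed requests per client per window
WINDOW_SECONDS = 60


def check_rate_limit(client_id: str) -> None:
    """Fixed-window counter: one Redis key per client per time window."""
    window = int(time.time() // WINDOW_SECONDS)
    key = f"ratelimit:{client_id}:{window}"
    count = store.incr(key)
    if count == 1:
        # First request in this window: make sure the counter expires.
        store.expire(key, WINDOW_SECONDS)
    if count > REQUESTS_PER_WINDOW:
        abort(429)  # Too Many Requests


@app.route("/1/listens")
def listens():
    check_rate_limit(request.remote_addr)
    return jsonify({"listens": []})
```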
[00:41:17] Unknown:
And from your experiences of being a contributor and maintainer of some of these different projects in the MetaBrainz Foundation, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:41:30] Unknown:
So this is a relatively embarrassing story that I have. So this is almost 2 years ago, when I had just started with, like, MetaBrainz contributions and stuff. I was basically working on, like, calculations using Google's BigQuery service. Right? And this was basically the first time that I was writing actual production code that was actually gonna be used by people. And what basically happened was that Google changed their API just a bit, and that led to, like, a bunch of breakages, which eventually led to a monthly bill of almost, like, $33. Now that was, like, a huge learning experience for me in general engineering terms, because, like, after that, I really, like, started thinking about things in terms of, like, how will this fail? Just, like, putting an adversarial mindset on it. Like, how could I break this? And that is, like, definitely 1 of the biggest learning experiences that I've had with MetaBrainz. Other than that, like, more recently, I think it's been interesting to realize, like, how you can actually amplify your contributions by helping others create more stuff. Like, initially, when I just started with MetaBrainz, I used to just, like, write code. And that basically meant that my output was equal to just my output. Right? But right now, I haven't been getting much time to actually write MetaBrainz code in the last, like, month, but I have been, like, mentoring Google Summer of Code students who work full time on it. And it basically means that some part of my effort gets amplified, and it basically creates, like, multiple times the value that I would have created just myself. So that was, like, a really nice learning that I had in recent times, and I'm really interested in, like, exploring that further and seeing how people actually do that at scale.
[00:43:21] Unknown:
So in terms of the future direction of the different MetaBrainz projects, what do you have in store for the future? And are there any new capabilities or new directions that you're excited about or that you are looking for any particular contributions for?
[00:43:37] Unknown:
Yes. So in terms of, like, the very near future, we have been working very hard on ListenBrainz statistics. We've been implementing graphs left and right, almost every week. And we have this cool graph called artist origins now. It basically is a world map of the artists that you have listened to. And stuff like that is really nice because it, like, gets you thinking about geographic diversity in your listening and stuff like that. And those graphs are something that I've been really excited about for a while now, and it's nice to see them in actual production and users actually using them. In the medium term, I'm really excited about adding more social features to ListenBrainz, stuff like following people, stuff like having friends, stuff like user compatibility.
I'm also very excited about what we do with the recommendations. And just to caveat that we're not, like, experts on recommendations. We're just starting out. And if anyone, like, actually wants to help us, we'd be really, really happy to get those contributions in. Like, that would be very valuable to us.
[00:44:36] Unknown:
Are there any other aspects of any of the MetaBrainz projects, or the foundation itself, or the overall space of music related data that we didn't discuss that you'd like to cover before we close out the show?
[00:44:49] Unknown:
Yeah. The 1 thing that I'd just like to point out, that I personally really, really like about MetaBrainz in terms of, like, working there, is that we're all, like, really passionate about it. So it makes it very easy to, like, actually just keep working on stuff and get burnt out, and we really, like, try to make sure that that doesn't happen to our contributors, who are very passionate about this stuff. I really like the fact that even though the work that we do might be slow, we're, like, still consistent. And, like, we still have, like, consistent and good improvements released almost all the time throughout the year. And, yeah, all our projects are long games, so consistency is really the thing that we're looking for, and that's what we really like to see, that people understand that.
Yeah.
[00:45:38] Unknown:
Well, for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And with that, I'm gonna move us into the picks. And this week, I'm going to choose the Beets project. I had them on the podcast a while ago, but it's a Python application for being able to easily organize your music, and it actually takes advantage of the MusicBrainz information to be able to update the tagging of your files and ensure that things are organized properly into the albums and artists. And it's just a really great way to easily get a whole mess of MP3s in various random directories into a cleanly organized and well maintained set of music and being able to access that easily. So I definitely recommend that for anybody who's struggling with a hard drive full of music files. And so with that, I'll pass it to you, Param. Do you have any picks this week?
[00:46:30] Unknown:
Okay. I was, like, completely prepared for this, but let me think. My pick, personally, would be an artist. Like, given the music theme, I'd like to take this time to maybe pick an artist that I really like and that most people might not have heard of. The artist's name is Prateek Kuhad. He's,
[00:46:47] Unknown:
Indian. He's an Indian indie pop, like, singer who has this song called cold/mess, which I really like. And I've been listening to that for a long time now. So that's my pick for today, I guess. Well, thank you very much for taking the time today to join me and share your experiences as a contributor and maintainer of projects at the MetaBrainz Foundation. It's definitely a very interesting set of projects and something that I've benefited from a number of times. So I appreciate all the time and effort you've put into that, and I hope you enjoy the rest of your day. Yeah. You too. Thank you for having me. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your
[00:47:53] Unknown:
friends and coworkers.
Introduction and Guest Introduction
Param Singh's Background and Introduction to Python
Involvement with MetaBrainz Projects
Overview of MetaBrainz Foundation and Projects
Origins and Motivations of MetaBrainz
Technology Choices: From Perl to Python
Sustainability and Funding Model
Development and Project Management
Architectural Design and Tooling
Data Modeling and API Versioning
Future Directions and New Projects
Scalability and Rate Limiting
Lessons Learned and Mentorship
Exciting Future Capabilities
Closing Remarks and Picks