Podcasts are one of the few media in the internet era that are still distributed through an open ecosystem. This has a number of benefits, but it also makes it difficult to find the content that you are looking for. Frustrated by the inability to pick and choose single episodes across various shows for his listening, Wenbin Fang started the Listen Notes project to fulfill his own needs. He ended up turning that project into his full-time business, which has grown into the most full-featured podcast search engine on the market. In this episode he explains how he built the Listen Notes application using Python and Django, his work to turn it into a sustainable business, and the various ways that you can build other applications and experiences on top of his API.
Your host as usual is Tobias Macey. And today, I'm interviewing Wenbin Fang about the technology powering the Listen Notes podcast discovery platform. So, Wenbin, can you start by introducing yourself?
Sure. First, thanks for having me on. Hello, everyone. I'm Wenbin Fang, founder of Listen Notes, a podcast search engine and database. I grew up in China and came to the United States in 2010 for graduate study in computer science. Now I live in San Francisco, running Listen Notes. Listen Notes is not a conventional company. It has only one full-time employee, just me. I have a few freelancers helping me on and off with content, design, and other stuff.
Yeah. And do you remember how you first got introduced to Python?
It was in the summer of 2012, when I was in graduate school. I was working on a research project. I needed to parse a bunch of log files and do data visualization using Graphviz. I wanted to use a scripting language, but I don't remember why I picked Python. Maybe I had the impression that Python is good at processing strings, textual data, something like that. Then in 2013, I joined a startup called Nextdoor, a local social network. That's when I learned Django and seriously used Python to write production code.
And so can you describe a bit more about what the Listen Notes project is and some of the story behind how it got started, why you decided that this was what you wanted to spend your time doing, and how you turned it into a full-time business?
So Listen Notes is a podcast search engine and database. Specifically, it's one database, three UIs. One podcast database, three user interfaces. What do I mean? One podcast database is easy to understand, but we need UIs to access the database. Right? The first UI is the website. You type in keywords, and then you find podcast episodes and podcasts. The second UI is the podcast API. API stands for application programming interface. It's a UI. So developers can use the API to build apps and services that access the database and search engine.
For the third UI, there's no good name, so I call it "bring your own UI." People can come to our website, export podcast metadata into CSV files, and then open them in Excel, Google Sheets, or whatever UI. You can write a Python script to parse the CSV: bring your own UI. As for the story, like many software engineers, I've been working on side projects on and off. Listen Notes was one of my side projects. It was started in January 2017, so four years ago, close to five years ago. Initially, I worked on this project part time for one week, and then launched it. It was basically a single page. Then I started to work on it full time in September 2017, so nine months after I started it as a side project.
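As a rough illustration of the "bring your own UI" idea, here is a minimal Python sketch that parses an exported CSV with the standard library. The column names and sample rows are hypothetical placeholders, not the real Listen Notes export schema.

```python
import csv
import io

# Hypothetical export: these column names and rows are illustrative
# assumptions, not the real Listen Notes CSV schema.
SAMPLE_CSV = """title,publisher,language,episode_count
Example Show A,Publisher One,English,120
Example Show B,Publisher Two,Spanish,45
Example Show C,Publisher Three,English,300
"""


def load_podcasts(csv_file):
    """Parse an exported CSV into a list of dicts, one per podcast."""
    return list(csv.DictReader(csv_file))


def filter_by_language(rows, language):
    """Keep only the rows whose language column matches."""
    return [row for row in rows if row["language"] == language]


rows = load_podcasts(io.StringIO(SAMPLE_CSV))
english = filter_by_language(rows, "English")

# Print English-language shows, largest catalog first.
for row in sorted(english, key=lambda r: int(r["episode_count"]), reverse=True):
    print(f'{row["title"]} ({row["episode_count"]} episodes)')
```

With a real export, `io.StringIO(SAMPLE_CSV)` would be replaced by `open("export.csv", newline="")`, and the column names adjusted to whatever the file actually contains.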
Why did I start a side project in the first place? Personally, I really wanted this project to exist. I listen to a lot of podcasts in my spare time, and when I was doing some boring engineering work, like writing unit tests. Back in 2016 and 2017, almost all the podcast apps required you to subscribe to a podcast first, and then listen to individual episodes. This is pretty bad. There are more and more podcasts. I cannot subscribe to too many podcasts. Actually, I'm an inbox-zero person. What I really wanted was to be able to find individual episodes to listen to without subscribing to podcasts.
Right now, we have Listen Notes, and I use it a lot. I use Listen Notes to find individual episodes and add them to my Listen Later playlist on listennotes.com, which provides an RSS feed. I subscribe to this RSS feed in Overcast. I don't subscribe to any podcast other than this RSS feed. Since the end of 2017, I've listened to more than 5,000 podcast episodes in this single
[00:05:31] Unknown:
[00:06:48] Unknown:
playlist on Listen Notes. I can tell you some use cases for how people use Listen Notes and the playlist feature. One use case is in schools. Teachers curate playlists by topic and give the playlists to students, using our Listen Later feature. We know that in schools, teachers curate reading lists of books and articles, and with the playlist feature, they can aggregate episodes by topic. Also, some email newsletter writers use Listen Notes to find and curate podcast episodes, because nowadays many email newsletters are content-curation newsletters. They curate articles and podcast episodes.
Also, some marketing people need to do content research across different types of media. They may want to write a blog post, so they need to do research. We provide a tool for them to search podcast episodes by topic and curate them. This is very good for content research. It could also be useful for podcasters like you. Before interviewing a person, you might want to find past podcast interviews with that person, and then binge-listen a little bit to avoid asking the same questions.
[00:08:14] Unknown:
[00:09:13] Unknown:
When I started Listen Notes, most podcast apps provided some kind of search, but they mostly only allowed you to search podcasts, not episodes. What I really wanted was to search individual episodes. Today, I see many podcast apps already provide episode search. The question is whether the search quality is good. Anyone can index a bunch of web pages to provide a search engine, but the key is a good search ranking algorithm that surfaces the most relevant content. This is what we are working on right now: continuing to improve search relevance on our search engine. There are some other discovery features on our website that I don't see in other podcast apps. For example, there's a feature on the website called Listen Real-Time, which is like Google Analytics real-time. You can see which podcast episodes are being listened to right now on our website.
So this is one way for people to discover niche podcast episodes they didn't even know existed. Also, a bunch of apps provide some kind of curated list by human editors.
[00:10:37] Unknown:
[00:11:14] Unknown:
Primarily, we index metadata from RSS feeds, like title, description, things like that. If an episode has a transcript, we will also index the transcript. For the ranking algorithm, we aggregate multiple signals to determine which episodes or podcasts to surface at the top of the results. As for the signals we use, I can only disclose a little bit. I don't want to fully disclose everything. So, for example, if a podcast is mentioned in a New York Times article on the top 10 true crime podcasts, then probably this is a good podcast. Right? And if an episode is mentioned in a good email newsletter, then probably this is a good episode.
So something like that. We also look at first-party data. By first-party data, I mean activity from our users on our website. For example, if an episode is added to a bunch of playlists, then it's probably very popular. It's good. As far as
[00:12:21] Unknown:
[00:12:40] Unknown:
Yes. So at the very beginning in January 2017, as I said, it was a single-page web app: a search bar where you type in a keyword and see some search results on a single page. Then, after I started working on it full time, I added the playlist feature, so you can add individual episodes to playlists. At the end of 2017, some people reached out to me asking if we provided an API, because they wanted to add a search feature to their own podcast app, so we provided an API. The initial version of the API was also simple. There were only three endpoints: search, fetch podcast metadata, and fetch episode metadata.
Later, we kept adding more podcast discovery features, like the one I mentioned earlier, the real-time feature. There are a bunch of discovery features, and I don't think I want to enumerate them one by one. Pretty much the past three or four years have been about iterating on existing features, mainly incremental improvements. Right now, if you look at our API documentation, you can see a bunch of endpoints. Pretty much these endpoints mirror the existing product features on the website.
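To make the three initial endpoints concrete, here is a hedged Python sketch that only builds the HTTP requests and does not send them. The base URL, paths, query parameters, and header name are assumptions written from memory of the public Listen Notes API documentation and should be checked against it; the API key is a placeholder.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumptions: base URL, paths, and header name are recalled from the public
# API docs, not taken from this interview. Verify them before real use.
BASE_URL = "https://listen-api.listennotes.com/api/v2"
API_KEY = "YOUR_API_KEY"  # placeholder


def build_request(path, params=None):
    """Build (but do not send) an authenticated GET request."""
    url = f"{BASE_URL}{path}"
    if params:
        url = f"{url}?{urlencode(params)}"
    return Request(url, headers={"X-ListenAPI-Key": API_KEY})


def search(query, kind="episode"):
    """Endpoint 1: full-text search over episodes or podcasts."""
    return build_request("/search", {"q": query, "type": kind})


def fetch_podcast(podcast_id):
    """Endpoint 2: metadata for a single podcast."""
    return build_request(f"/podcasts/{podcast_id}")


def fetch_episode(episode_id):
    """Endpoint 3: metadata for a single episode."""
    return build_request(f"/episodes/{episode_id}")


req = search("python django")
print(req.full_url)
```

Passing one of these request objects to `urllib.request.urlopen` would perform the actual call; the SDKs mentioned later in the conversation wrap this kind of request in language-specific functions.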
[00:14:10] Unknown:
[00:14:30] Unknown:
So I can describe what the architecture looks like right now and then compare it with the initial architecture. As of today, I'll primarily describe the Listen Notes website. We also have two experimental iOS apps, but they're irrelevant to the core product. For the Listen Notes website, the front end is React.js plus Tailwind CSS. The JS and CSS files are bundled through webpack. We upload the bundles to S3 and serve them through CloudFront. The back end is primarily Django and Python for web servers and API servers. We use Postgres as the main database. We are running three database instances: db1, db2, db3. Db1 is the master DB, and the other two are slave DBs.
We use Elasticsearch for the search engine. There are three Elasticsearch instances, and we use Redis for caching and some statistics. We use RabbitMQ for the message queue and Celery for async task workers. We run all the infrastructure on AWS. As we are speaking today, December 7, 2021, AWS is having a huge outage. At this moment, the us-east-1 region is down. So this is the rough architecture for Listen Notes today. When I started in January 2017, the software I used was roughly the same: React.js, Django, Python, Postgres, Elasticsearch. These are all the same, except that at the beginning I was using Bootstrap CSS, and I ran the whole infrastructure on three small instances on DigitalOcean.
Each instance was about $10 per month, something like that. Later on, we migrated from DigitalOcean to AWS. In terms of scaling up, I think this architecture is very common, and it's very easy to scale. Right now, we have several API servers and several web servers, and we can horizontally scale them easily.
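With one primary Postgres instance and two replicas, a common Django pattern is a database router that sends writes to the primary and spreads reads across the replicas. The interview does not say how Listen Notes routes queries, so the sketch below is an assumption; the aliases `db1`, `db2`, and `db3` are hypothetical keys that would have to exist in Django's `DATABASES` setting.

```python
import random

# Assumption: these aliases would be keys in Django's DATABASES setting.
PRIMARY = "db1"
REPLICAS = ["db2", "db3"]


class PrimaryReplicaRouter:
    """Django-style database router: reads go to a replica, writes to the primary."""

    def db_for_read(self, model, **hints):
        # Spread read queries across the replicas at random.
        return random.choice(REPLICAS)

    def db_for_write(self, model, **hints):
        # All writes must hit the primary so replication stays one-way.
        return PRIMARY

    def allow_relation(self, obj1, obj2, **hints):
        # Every instance holds the same data, so relations are always allowed.
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Run schema migrations only on the primary; replicas receive the
        # changes through replication.
        return db == PRIMARY


router = PrimaryReplicaRouter()
print(router.db_for_write(None), router.db_for_read(None))
```

In a Django project, a class like this would be enabled by listing its dotted path in the `DATABASE_ROUTERS` setting. One trade-off of this pattern is replication lag: a read issued immediately after a write may land on a replica that has not caught up yet.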
[00:16:48] Unknown:
[00:17:15] Unknown:
and is powerful enough to support the scale we have right now. So there's no worry about it.
[00:17:23] Unknown:
[00:18:01] Unknown:
Basically, I used all the technologies I already knew at the beginning. I didn't want to spend time learning new things. Those technologies were used at my former employer, Nextdoor. I know this stack is battle-tested. I know it can support a multibillion-dollar company, so it works fine for Listen Notes. The whole tech stack is pretty similar to Nextdoor's tech stack. If you want me to name anything new compared with Nextdoor, I use Tailwind CSS, but that's just the front end. I don't know if Nextdoor is using Tailwind, because it's quite new. The back-end infrastructure is really the same. So there's no innovation here, to be frank.
Yeah. Well, there's definitely a lot to be said for choosing boring, battle-tested technologies
[00:18:50] Unknown:
[00:19:53] Unknown:
I haven't seen many changes in the RSS format. There are some specification changes, but it's really hard to get all the major platforms to adopt the same standard, especially Apple Podcasts. How do you talk to Apple Podcasts? How do you find a human being at Apple Podcasts to talk to and ask them to support a specific new specification? Right? If Apple Podcasts doesn't adopt the new specification, then I think it's very challenging.
[00:20:27] Unknown:
[00:20:58] Unknown:
to extract people's names, locations, events, things like that. If you go to our web page for an individual podcast episode, you'll find this kind of information. When you click a person's name, you bring up a list of search results for that specific person or that specific location,
[00:21:16] Unknown:
[00:21:41] Unknown:
It's actually a very natural evolution. As I said, the API was initially launched at the end of 2017. It had three API endpoints initially, and then it evolved naturally. First, I mapped major website features to API endpoints. I also talk to API users frequently, and their feedback is very important in deciding what the API should look like: what kind of response data, what kind of data fields they need to build an app. One principle I stick to is to always be backward compatible. Don't break the existing API endpoints.
[00:22:20] Unknown:
[00:22:39] Unknown:
But to be honest, all this technical stuff consists of solved issues. They are not that difficult. Some of them are quite time-consuming. I'll give you an example: upgrading Postgres from 9.17 to version 11 without significant downtime. It is easy to say, but it's not easy to do. You should always keep infrastructure software on the latest version. If not the latest version, at least close to the latest version, because of bug fixes, security fixes, and new features. Right? It's always a drama at tech companies to upgrade across major versions, especially for databases.
Last time, I upgraded Postgres from 9.17 to 11. I managed to achieve between 30 and 60 seconds of downtime for write access and zero downtime for read access. It involved a lot of prep work and a bunch of rehearsals in a staging environment, just to emulate all kinds of failure scenarios and practice how to recover the service, how to roll back, things like that. This is challenging. We need to do similar things for Elasticsearch and other pieces of infrastructure software. Right now, we are providing an API. We need to make sure the API doesn't go down and doesn't have downtime.
So, yeah, it's challenging to manage this kind of infrastructure.
[00:24:13] Unknown:
[00:24:43] Unknown:
So as for Spotify's entrance into podcasts, I don't think it affects Listen Notes too much, because it's a walled garden that doesn't participate in the open ecosystem. Arguably, people can say that what Spotify provides is not podcasts, because it's not based on RSS, if you go by the narrow definition. Right? But a good thing is that Spotify expands the audience for spoken audio. The podcast listeners on Spotify might not be the traditional or existing podcast listeners. Existing podcast listeners pretty much use the same open, RSS-based podcast players.
And Spotify may only target its existing users, music listeners, and help them discover that, oh, there's spoken audio there.
[00:25:41] Unknown:
[00:26:30] Unknown:
to podcast producers, but in the future, we will certainly do more. Right now, we have the search history for how people discover certain podcasts. We have the listen history and listen stats for individual episodes on our own platform. Basically, each podcast player has its own listen stats, but it's fragmented. We developed a new metric called Listen Score, which estimates the popularity of a podcast and then provides a global ranking. This is to give people a rough sense
[00:27:07] Unknown:
[00:27:29] Unknown:
Yeah. So Listen Score is used in our search ranking algorithm. Just now, I mentioned a few signals we use to determine the search ranking. Actually, we use the same signals to calculate Listen Score. Basically, we use first-party data and third-party data. First-party data is the user activity in our own ecosystem, and third-party data is what's on the open web. I mentioned the New York Times saying, oh, this is one of the top 10 true crime podcasts, something like that. There's also some social media activity. We combine all these signals together, and there's a formula.
We assign different weights to each signal and come up with this score. We still need to continue tweaking the score. And the global ranking is based on these scores.
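The actual formula is not disclosed in the conversation. As a purely illustrative sketch of "assign a weight to each signal and combine them," here is a weighted sum in Python. Every signal name, weight, and value below is hypothetical, and the signals are assumed to be already normalized to the range 0 to 1.

```python
# All signal names and weights here are hypothetical; the real formula
# is not public.
WEIGHTS = {
    "playlist_adds": 0.5,    # first-party: activity on the platform itself
    "press_mentions": 0.3,   # third-party: mentions on the open web
    "social_activity": 0.2,  # third-party: social media activity
}


def popularity_score(signals, weights=WEIGHTS):
    """Weighted average of normalized signals, scaled to a 0-100 score."""
    total = sum(weights[name] * signals.get(name, 0.0) for name in weights)
    return round(100 * total / sum(weights.values()), 1)


def global_ranking(podcasts):
    """Return podcast names ordered from highest to lowest score."""
    return sorted(
        podcasts,
        key=lambda name: popularity_score(podcasts[name]),
        reverse=True,
    )


podcasts = {
    "show_a": {"playlist_adds": 0.9, "press_mentions": 0.2, "social_activity": 0.4},
    "show_b": {"playlist_adds": 0.3, "press_mentions": 1.0, "social_activity": 0.1},
}

for name in global_ranking(podcasts):
    print(name, popularity_score(podcasts[name]))
```

Tweaking the score, as described above, would amount to adjusting the weights or adding new signals and then checking whether the resulting ranking matches intuition about which shows are popular.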
[00:28:20] Unknown:
[00:29:13] Unknown:
So we use ClickHouse. I forgot to mention ClickHouse in the infrastructure part. We use ClickHouse to store some aggregated data and logs. It's a column-based data warehouse. We use Celery async tasks to do the heavy-lifting data crunching. And Redis is also used in the pipeline.
[00:29:34] Unknown:
[00:30:15] Unknown:
Yeah. I'll talk about the use cases for the API first. Basically, if you want to access a podcast database or search podcasts, you have two choices: you build your own search engine and your own database, or you use some kind of API. There are some examples. For example, people want to build their own podcast app. I didn't know that there are so many podcast apps out there. If you search for "podcast app" in the App Store, you can see a bunch of niche podcast apps. Also, there are many podcast clipping apps. People can create short clips from podcast episodes and share them on social media. They use our API to search for a specific episode.
Basically, it's a better way to onboard users. Also, a bunch of audio apps that want to get into podcasts will use our API, like music apps and audiobook apps. Spotify is not the only music app in the world. There are tens of thousands of small music apps. And Audible is not the only audio app. There are tons of audio apps. Podcasts are a very good adjacent market for them to expand into. Also, there are some social apps that allow people to share and discuss things like movies, games, restaurants, and podcasts.
We provide a search feature so their users can easily find and discuss podcasts. Also, some content curation or content discovery sites use our API to help people discover podcasts. One specific example is a website that provides information about specific stocks. They list all the podcast interviews with the CEO or CFO of a public company, so people can listen to the podcast interviews and decide whether or not to buy the stock. Well, this is not good investment advice, but somehow you get a signal. These are the rough use cases of our API.
In terms of the CSV buyers, some financial institutions want to find as much data as possible for stock picking: alternative data, such as satellite data or podcast data. So they will search for a person's name or a topic, something like that. I don't really know how they use our data. They just export the data and use it. Also, some PR companies want to pitch podcasts for guest opportunities, so they need a bunch of podcast information to research which podcasts to pitch.
Yeah. I definitely get lots of those emails.
[00:33:11] Unknown:
[00:33:31] Unknown:
important use case for our API. Our API is used in many coding boot camps. I don't think you can expect new programmers to use a REST API right away. They might need language-specific SDKs to make function calls. So we provide a bunch of SDKs in different languages. We have SDKs for Python, Ruby, Node.js, Rust, Swift, Java, Go. Basically, all the major languages. Yeah.
[00:34:04] Unknown:
[00:34:18] Unknown:
And I guess as far as people who are using Listen Notes, either as end users or people who are building applications on top of it, what are some of the most interesting or innovative or unexpected ways that you've seen the API used?
[00:34:34] Unknown:
[00:34:57] Unknown:
As the podcast ecosystem continues to evolve and mature and more companies and individuals become involved in it, what are some of the missing pieces that you think can and should be filled in to make it more viable as a broad ecosystem so that more people can interact and collaborate and grow the entire community?
[00:35:17] Unknown:
The GUID is not permanent. It changes a lot, so there's no unique ID.
[00:36:23] Unknown:
Yeah. I've had to regenerate the GUID on a couple of episodes because of publishing errors. So I'm definitely guilty of being somebody that changes that.
[00:36:33] Unknown:
[00:36:38] Unknown:
And then as far as being able to have that kind of universal linking, I know that there are a few different platforms and services that are offering that as a feature where, you know, you publish your RSS feed through us, and then we'll give you a single endpoint that you can share with people so that it will automatically open the appropriate podcast player on their platform of choice. That
[00:36:59] Unknown:
[00:37:08] Unknown:
Fair point. And as you have spent a lot of time building the Listen Notes project and digging into the podcast ecosystem, what are some of the most interesting or unexpected things that you've learned about podcasts, the ecosystem around it, and the ways that people are consuming and using podcasts in their daily lives?
[00:37:26] Unknown:
[00:37:55] Unknown:
As a consumer of podcasts yourself, what are some of the best practices that you've seen from some publishers that you think can and should be more broadly adopted by everybody who's producing podcasts, whether it's in terms of the information that they're producing in the RSS feed or things that they're including in the show notes or the availability of transcripts or just anything having to do with making it more accessible for end users to be able to understand more easily what it is that they're going to get out of it as a listener.
[00:38:29] Unknown:
[00:38:49] Unknown:
Alright. I will take that as feedback on the way I'm producing my podcast. Thank you. Can't promise anything because it is a bit of a labor-intensive process, but I will take note. And in terms of your own experience of building the Listen Notes platform and growing it as a business, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:39:12] Unknown:
Some SEO players from India and Pakistan tried to submit fake podcasts containing synthetic audio and a bunch of links. They were trying to do link building. So on our end, we need to build a bunch of tools to fight such fake podcasts.
[00:40:09] Unknown:
For people who are looking for a way to discover new podcasts or specific episodes, what are some of the cases where Listen Notes might be the wrong choice and they're better suited just going with the Spotify app that they're already using or a general Google search?
[00:40:24] Unknown:
[00:40:47] Unknown:
as you continue to manage the platform and build the project and add new features and capabilities, what are some of the things you have planned for the near to medium term?
[00:40:57] Unknown:
[00:41:33] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to recommend the Wheel of Time series that Amazon recently launched. I will admit that I have not read the books. It definitely seems to be quite the undertaking to have done that, but I've started watching the show, and they seem to have done a really good job, at least for somebody who isn't familiar with the books. I thoroughly enjoy it, so I definitely recommend that to people who are looking for something to watch. And with that, I'll pass it to you, Wenbin. What do you have for a pick this week?
I will recommend Superhuman,
[00:42:09] Unknown:
[00:42:26] Unknown:
