Summary
News media is an important source of information for understanding the context of the world. To make it easier to access and process the contents of news sites, Lucas Ou-Yang built the Newspaper library, which automates the retrieval of articles and prepares them for analysis. In this episode he shares how the project got started, how it is implemented, and how you can get started with it today. He also discusses how recent improvements in the utility and ease of use of deep learning libraries open new possibilities for future iterations of the project.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- This portion of Python Podcast is brought to you by Datadog. Do you have an app in production that is slower than you like? Is its performance all over the place (sometimes fast, sometimes slow)? Do you know why? With Datadog, you will. You can troubleshoot your app’s performance with Datadog’s end-to-end tracing and in one click correlate those Python traces with related logs and metrics. Use their detailed flame graphs to identify bottlenecks and latency in that app of yours. Start tracking the performance of your apps with a free trial at pythonpodcast.com/datadog. If you sign up for a trial and install the agent, Datadog will send you a free t-shirt.
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host as usual is Tobias Macey and today I’m interviewing Lucas Ou-Yang about Newspaper, a framework for easily extracting and processing online articles.
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what the Newspaper project is and your motivations for creating it?
- What are the main use cases that Newspaper is built for?
- What are some libraries or tools that Newspaper might replace?
- What are the common structures in news sites that allow you to abstract across them for content extraction?
- What are some ways of determining whether a site will be a good candidate for using with Newspaper?
- Can you talk through the developer workflow of someone using Newspaper?
- What are some of the other libraries or tools that are commonly used alongside Newspaper?
- How is Newspaper implemented?
- How has the design of the project evolved since you first began working on it?
- What are some of the most complex or challenging aspects of building an automated article extraction tool?
- What are some of the most interesting, unexpected, or innovative projects that you have seen built with Newspaper?
- What keeps you interested in the ongoing support and maintenance of the project?
- What do you have planned for the future of Newspaper?
Keep In Touch
- @LucasOuYang on Twitter
- Website
- codelucas on GitHub
Picks
- Tobias
- Million Bazillion Podcast
- Lucas
- Hackers and Painters: Big Ideas from the Computer Age by Paul Graham
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Newspaper
- Los Angeles
- Django
- NLP == Natural Language Processing
- Web Scraping
- Requests
- Wintria
- Python Goose
- Diffbot
- Heuristics
- Stop Words
- RSS
- SpaCy
- Gensim
- PyTorch
- NLTK
- LXML
- Beautiful Soup
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try out a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode today, that's L-I-N-O-D-E, and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your own home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today. Your host as usual is Tobias Macey. And today, I'm interviewing Lucas Ou-Yang about Newspaper, a framework for easily extracting and processing online articles. So Lucas, can you start by introducing yourself?
[00:01:27] Unknown:
My name is Lucas. I live in Los Angeles over in California. Love Python. Love open source. Newspaper3k was a project I built in college, and it kinda blew up from there. Currently, I work on VR software, and in the past I worked on a lot of engagement surfaces, like feeds and stories.
[00:01:48] Unknown:
And do you remember how you first got introduced to Python?
[00:01:51] Unknown:
Yeah. Definitely. It was a pretty crazy story. I was just building random projects in college, and I had something pretty cool going. And this random, older, wealthy person approached me. This was probably back in 2013, 2014, and there was a lot of excitement around starting tech companies. I think back then, Google and Facebook were really talked about a lot, and people wanted to have the next big thing. So there's usually, on college campuses, a lot of, like, these aren't really VCs. They're just more, like, people with business backgrounds that want people with technical expertise to help them build stuff. So, basically, some old rich dude wanted me to help them build something, and he wanted to fund our news extraction. Like, I was playing around with news technology at the time. I really liked Reddit. I wanted to build something with a similar concept, and this person basically wanted to fund us. Given the nature of the problem, you know, it's news, it really involves scraping data and having web services.
Python was naturally the way to go. So I picked up Django and a bunch of Python NLP packages, and eventually dove into the world of web scraping. Yeah. It was a lot of fun. And I guess I won't go too far into it, but the actual product we built was a real time news website that shows you news articles throughout the world as they were published. So it's like you see the articles flashing onto your screen. It had initial success, but it kinda tanked due to lack of product market fit.
[00:03:36] Unknown:
You mentioned that one of the things you were working on when you were in college, experimenting with this real time news aggregator, was this newspaper data extraction, or journalistic article extraction. So can you describe a bit more about the project itself and the Newspaper library that you maintain, and some of the origin of how that came about?
[00:03:59] Unknown:
Yeah. Totally. This is a really unbelievable story, but it was a lot of fun going through it. Basically, the name of the startup was Wintria, and the whole value proposition was a consumer website that delivers news in real time. And this may seem unrelated to Newspaper, but I'll go into why it's super related later. And we built this really well refined product that had a couple of thousand people using it every day. It was all powered by this proprietary news aggregation technology, totally built in Python.
The service itself, the website, was also in Python. It was in Django. And after the startup failed commercially, I realized a lot of the work we did on the news aggregation technology was more advanced than what existed in open source. So I had this crazy idea. This was after the startup failed. I wasn't really talking with the VC partners or the cofounders anymore. It was just me by myself. And, you know, I have just a pure engineer mindset. I just wanted to make the code as easy to use as possible. I wasn't really thinking about the business anymore. So I just generalized the code and built this really clean API. I wanted to make it accessible for anyone to use, because I think during 2013 and 2014, a lot of the APIs were super hard to use, and you needed to have a strong understanding of computer science and coding, and even DevOps, to even use them.
And I was really inspired at the time by other really clean, human centric APIs, such as a lot of famous Python repositories like Requests and a couple more. So, basically, I took the underlying technology of Wintria, which was the startup, and packaged it into Newspaper3k, which is the repository we're talking about today. And, basically, overnight after open sourcing it, it became this huge hit on GitHub. Even Kenneth Reitz, the guy who made Requests, gave me a shout-out, and I was super happy about it. Over the next few months, a lot of other open source people, or people on Twitter who work in tech, reached out to me about it. It was a really fun time, really invigorating. And I think it was after that, I knew I wanted to work more on this library. And that was basically the creation of Newspaper.
[00:06:34] Unknown:
And as you mentioned, the current version of the library is called Newspaper3k because of its support for Python 3, and there was a Python 2 version that the readme says is still available but was fairly buggy. So I'm wondering if you can maybe talk a bit about some of the work that went into making that migration from Python 2 to Python 3, and whether there was any work that you did to reinvent the API or rearchitect the library to make it more maintainable.
[00:07:03] Unknown:
Yeah. It's a great question. I remember the migration not involving any crazy foundational changes to the architecture or the actual algorithms. A lot of it was more of, oh, okay, now all the requirements are finally in Python 3. So it's really much more of just waiting for all the dependencies to become Python 3 supported and then changing everything to Python 3. I don't remember anything crazy about it.
[00:07:33] Unknown:
Digging more into the newspaper library itself, what are some of the main use cases that it was designed for and some of the ways that people are using it now that it's been released as open source?
[00:07:44] Unknown:
So one of the key value props of Newspaper is that it simplifies information retrieval for journalists. And the last bit is super important because, obviously, if you have a huge budget, you can afford to hire big engineering teams or data scientists or DevOps. You could just build an in-house solution that can do whatever you want. But this isn't the case for most people, especially not for independent journalists or researchers in colleges. Really, the reason why this project became a big hit was that people, especially people without the computer science or DevOps or software experience, were able to achieve their goals of extracting structured information from the web.
[00:08:31] Unknown:
For people who are using Newspaper, what are some of the tools or other libraries or frameworks that it might replace or augment, and what are some of the ways that it simplifies the overall work of actually retrieving information from these journalistic sources?
[00:08:47] Unknown:
So Newspaper has several components. It handles sending requests and parallelizing requests. There's a lot of infrastructure optimization, like throttling requests and sending a bunch of requests in parallel. Otherwise, it takes forever, especially because you're extracting from potentially tens of thousands of news websites. Some of them are super slow. Like, you might be surprised. Some of them literally take 5 seconds to return. So the library has to do a lot of intelligent throttling and parallelizing of requests. So there's that scraping bit, but there's also pieces around understanding and parsing the actual returned information, like document understanding, identifying patterns in the HTML to be able to extract the data. I'll go more into the details of this later. And another really novel innovation for Newspaper was understanding news websites as a whole, not just a piece of HTML, but an actual news website, like CNN or BuzzFeed.
Like, what are patterns for entire news websites, and how can this library parse everything in the news organization? And there's also pieces around natural language processing, information retrieval, and also smartly caching the data. So with just a few computers, you can crawl tens of thousands of websites. You don't need a huge data center. I think to answer your other question, it's meant to replace lower level software components like Requests, the famous Python requests library, a bunch of NLP libraries, and low level extraction libraries like Goose. Yeah. So mostly those. And even replacing paid services like Diffbot.
[00:10:24] Unknown:
Because of the fact that Newspaper is focused primarily on articles from news sources, what are the common structures in news sites that allow you to abstract across them for being able to extract the content, and what are some of the ways that somebody might be able to determine whether a given website is going to be a good candidate for using Newspaper on?
[00:10:47] Unknown:
Yeah. This is a great question. Newspaper is by no means the best strategy out there. I think among the open source tools, Newspaper is probably one of the better ones, and it definitely has a really simple to use API, so it's very accessible. But when talking about actually how these tools extract data, it's all about identifying patterns in news sites and patterns in pieces of HTML. Like, there's a distinction between scraping from an entire news website like cnn.com versus scraping from an individual article, like some Reddit page. I'll go into both.
But at a high level, Newspaper is right now heuristic based, which, for those who don't know that term, means it uses rules, like a bunch of if statements. It's not machine learning based. So when I say heuristic, it's in contrast to machine learning. I'd do things differently now if this was being built from scratch, because machine learning is just a lot more scalable and stronger at identifying patterns. But for articles and news websites, we found a couple of key patterns that this library is basically able to use to extract structured information. Like, there's several different problems that this tool, and other tools like Newspaper, solve, such as identifying what's the full article text, the body text of a piece of HTML.
What's the title? What are the authors? What are the publishing dates? What are the keywords? And when we talk about extracting structured information from an entire news website, it becomes, which are the actual article links, instead of random advertisement pages or about pages. I think one of the hardest problems by far is extracting the actual full text, the full article body, from a piece of HTML. So I'll go into that. We identify really high hyperlink densities in a cluster of text. That's usually indicative that this is not an article.
Like, if you see a bunch of hyperlinks very close together, it's usually an advertisement or a header. A lot of you can probably think of examples yourself because you browse websites, but it's probably not a piece of the actual body text that you want. We look at things like stop word density, if there's a lot of stop words, and also density of advertisements, things like these. These signals can reveal that that part of the HTML is probably not the article body. There's also other patterns, like distance and clustering of HTML tags. The actual HTML tags themselves are really revealing. Like, whether it's a paragraph tag or a div tag, or how much text is in each individual tag. You're able to identify things like whether there is a big list of comments versus heavy bodies of text. There's a bunch of stuff like that, and you can kinda see a pattern here. It's just a bunch of different rules that apply across many different news websites and articles that this library uses to extract the structured information. Other things, like titles, are much easier because there are a lot more obvious patterns, like h1 tags. And, usually, the news websites give you the titles themselves. Another bit where Newspaper has a really interesting advantage over other tools is being able to identify all the news articles inside a news website like cnn.com.
The developer just needs to type cnn.com, and this library can figure out all the URLs that are actually articles, not just random pages, and it does so by identifying patterns inside the URL. This is a very novel technique that was invented inside Newspaper. It didn't happen anywhere else. Like, if a URL contains a date and something like a title, there's a very, very high chance it's an article. And, yeah, these are a lot of the common strategies we use. But I guess a lot of people are wondering if it uses machine learning, and the answer is no. It doesn't use machine learning, but I would use machine learning if it was being rebuilt.
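As a rough sketch of the kind of rules being described here, link density and date-in-URL checks might look something like the following. This is illustrative code, not Newspaper's actual implementation, and the regexes and thresholds are assumptions:

```python
# Illustrative sketch of two of the heuristics described above; this is not
# Newspaper's actual implementation, and the regexes are assumptions.
import re
from lxml import html

def link_density(node):
    """Fraction of a node's text that sits inside <a> tags.
    High density usually means navigation or ads, not article body."""
    text_len = len(node.text_content()) or 1
    link_len = sum(len(a.text_content()) for a in node.findall('.//a'))
    return link_len / text_len

def looks_like_article_url(url):
    """URLs containing a date plus a slug-like title are very likely articles,
    e.g. https://example.com/2020/08/31/some-headline-here"""
    has_date = re.search(r'/20\d{2}/\d{1,2}(/\d{1,2})?/', url) is not None
    has_slug = re.search(r'/[a-z0-9]+(-[a-z0-9]+){2,}', url) is not None
    return has_date and has_slug

fragment = html.fromstring(
    "<div><p>A plain paragraph of body text with plenty of ordinary words.</p>"
    "<p><a href='#'>Ad</a> <a href='#'>More ads</a> <a href='#'>Nav</a></p></div>")
for p in fragment.findall('.//p'):
    print(round(link_density(p), 2))   # low for body text, near 1.0 for link clusters
print(looks_like_article_url('https://example.com/2020/08/31/some-headline-here'))
```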
[00:15:12] Unknown:
For some of these news websites, they might also have an RSS feed where you can pull the most recent articles, and possibly go back to a certain point in the history of their publications. Are there cases where Newspaper is actually a better choice for retrieving information from the site than just going straight to the RSS feed, given that the feed is already fairly well structured for extracting information?
[00:15:37] Unknown:
Yeah. This is actually a really good question. And this is a lot more of a philosophical, foundational question of, should these tools even exist, or should we just have better web standards? This entire problem wouldn't even be a problem if we had improved standards within the web and HTML that let developers and news organizations specify what they want people to query. Like, oh, here's my title. Here's my article body. RSS is a lot more explicit. They're actually giving you a lot more structured information. I feel that's one avenue, but we're certainly not going down that path, from my knowledge. Maybe things have changed recently. I haven't been keeping up, but I definitely feel the other avenue, which is from the side of people who want the information, is we have to do the scraping ourselves.
Like, in your RSS example, many news websites don't give you RSS, especially the smaller ones. And even for the big news websites that do have an RSS feed, it's usually quite limited. I don't have statistics for this, and it probably would be better if I had the numbers, but from my understanding, it's limited by which articles they wanna give you. There's a bunch of limitations. Something like Newspaper is definitely better if you want the flexibility of having the entire view, if you want everything on the website itself. Like, this library is not just able to query all 5,000 articles on CNN today. It's even able to do things like throttle requests so the news organization doesn't suspect that you're trying to DDoS them or send way too many requests. It's doing it in a respectful way. There are ways of doing this which honor the robots.txt, which means you're, you know, playing by the rules. And at a high level, I feel like RSS, and I think Google also implemented something to specifically give tags that indicate which information someone can have, that approach is good, but not everyone's gonna adopt it. And I think even right now in the current ecosystem, many news websites don't adopt it, especially smaller and medium news websites.
So something like Newspaper is gonna be super important if you want all the information instead of just a small subset, and also if you care about small news vendors, not just CNN.
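The throttling and politeness knobs mentioned above are exposed through newspaper3k's Config object. A minimal sketch, assuming the documented Config attributes; the user agent string and values are arbitrary examples, not recommended defaults:

```python
# A small sketch of polite-crawling settings, assuming newspaper3k's
# documented Config object (pip install newspaper3k).
import newspaper
from newspaper import Config

config = Config()
config.browser_user_agent = 'my-research-bot/0.1'  # hypothetical UA: identify yourself
config.request_timeout = 10       # give slow news sites time to respond
config.number_threads = 5         # keep parallelism modest so you don't hammer a site
config.memoize_articles = False   # re-crawl everything instead of skipping seen URLs

paper = newspaper.build('https://www.cnn.com', config=config)
print(paper.size())  # number of article URLs Newspaper identified on the site
```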
[00:17:57] Unknown:
Yeah. Particularly for the people who are maintaining the sites, having an RSS feed is an extra maintenance burden unless it's something that's built into the publishing platform that they're using. And particularly for newspapers or sites that are trying to build a subscription based model, RSS is another challenge because you would have to have an authenticated feed for people who have a subscription, and then it's hard to ensure that they're actually using it just for themselves, which is where they'll likely push people more to using the website. So I can see that there's definitely a lot of challenges and conflicting priorities in terms of making the information easily accessible and easily used in contexts for programmatic usage versus the original intent of just publishing it for the sake of an individual person to be able to read it.
[00:18:48] Unknown:
Yeah. Totally. I especially agree with your point on, like, if it were a small company, having an RSS feed would definitely be a pretty big obstacle.
[00:18:57] Unknown:
For somebody who is actually using newspaper for being able to retrieve some information and do some processing on a set of articles, can you talk through the overall workflow of somebody getting started and working through being able to extract the information and then process it and use it for some other purpose other than just for individual consumption?
[00:19:18] Unknown:
Yeah. Definitely. It's designed to have a simple workflow, and there are also several advanced functionalities that people with more experience can utilize. There are two approaches. If you want to extract everything in a news organization like cnn.com, you could use the higher level newspaper API, as opposed to Newspaper's Article API. The high level newspaper API just lets you type in any organization name, like cnn.com, and this library will send out requests, throttle them, and parallelize them to CNN, identify all the articles, and then objectify them inside Python. So you have this huge list of actual articles, and a list of categories and genres, all as Python lists, all done by Newspaper.
And the developer can then, it's in the code now, so you can iterate over the articles. The articles themselves are now wrapped in Newspaper's Article API, which lets you download the HTML, parse it, and figure out the title, body text, authors, etcetera. And it's designed to maximize flexibility. So if you're the developer, you now have a list of articles from CNN, literally Article objects that you can play with. There's a lot of really cool use cases from this. Like, I've heard of researchers and journalists both using this library to do things like studying text sentiment from political websites, doing machine learning training, basically using the library to supply training data for machine learning to predict publication dates or whatever, or doing analysis on emojis, and training text AIs.
Yeah. A bunch of cool stuff like that.
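To make that two-level workflow concrete, here is a minimal sketch against newspaper3k's documented Source and Article APIs; cnn.com is just an example target:

```python
# A minimal sketch of the two-level workflow, using newspaper3k's documented
# API (pip install newspaper3k).
import newspaper

# High-level Source API: point it at a news organization and let Newspaper
# discover the article URLs and category pages itself.
paper = newspaper.build('https://www.cnn.com', memoize_articles=False)
print(len(paper.articles))         # Article objects discovered on the site
print(paper.category_urls()[:5])   # a few of the section/category URLs it found

# Lower-level Article API: download and parse a single article.
article = paper.articles[0]
article.download()
article.parse()
print(article.title, article.authors, article.publish_date)
print(article.text[:200])          # start of the extracted body text

# Optional NLP pass (requires NLTK's 'punkt' tokenizer to be downloaded).
article.nlp()
print(article.keywords, article.summary)
```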
[00:21:06] Unknown:
This episode of Podcast.__init__ is sponsored by Datadog, the premier monitoring solution for modern environments. Datadog's latest features help teams visualize granular application data for more effective troubleshooting and optimization. Datadog Continuous Profiler analyzes your production level code and collects different profile types, such as CPU, memory allocation, IO, and more, enabling you to search, analyze, and debug code level performance in real time. Correlate and pivot between profiles and distributed traces to find slow or resource intensive requests. In addition, Datadog's application performance monitoring live search lets you search across a real time stream of all ingested traces from your services.
For even more detail, filter individual traces by infrastructure, application, and custom tags. Datadog has a special offer for Podcast.__init__ listeners. Sign up for a free 14 day trial at pythonpodcast.com/datadog. Install the Datadog agent and receive one of Datadog's famously cozy t-shirts for free. And for people who are using Newspaper, particularly for the natural language elements of it, are there particular libraries that you initially intended to integrate with, and do you structure the content in a way that makes it easy to feed into them for building training models? Or is it just a sort of text object that you can process however you want, for use with NLTK or spaCy or Gensim or PyTorch or whatever it is that you're working with?
[00:22:42] Unknown:
This was designed to be as flexible as possible. When building this, we didn't make any assumptions on how people would use it. It just seems that after open sourcing Newspaper, the natural customer seems to be mostly journalists or researchers. So I guess mostly people with some coding background, but not necessarily very heavy, because this library gives you flexibility. It gives you a lot of data in a structured format that you can play with. And you can also use it with, you know, NLTK and these other tools that you mentioned, but it does it in a higher level way. So if someone wants a lot more control, then Newspaper might not be the right library for you. You probably want something lower level.
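Since Article.text is just a plain string, wiring Newspaper into a downstream NLP library is straightforward. A small sketch using spaCy as an example; the article URL is a hypothetical placeholder, and the spaCy model must be downloaded separately:

```python
# One way to feed Newspaper's output into a downstream NLP tool
# (assumes: pip install spacy && python -m spacy download en_core_web_sm).
import spacy
from newspaper import Article

article = Article('https://www.example-news.com/2020/08/31/some-story')  # hypothetical URL
article.download()
article.parse()

nlp = spacy.load('en_core_web_sm')
doc = nlp(article.text)              # Article.text is a plain string
people = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
print(people)                        # named people mentioned in the article body
```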
[00:23:26] Unknown:
Digging more into Newspaper itself, can you talk a bit more about how it's implemented, some of the other libraries that it relies on for being able to function, and some of the ways that the overall design has evolved since you first began working on it?
[00:23:40] Unknown:
Yeah. Definitely. I know I've said it before, but it's important to stress again that it's heuristic based, based on rules and patterns, not machine learning. So it's not pretrained on some set of news articles. It was designed by me, and inspired by other libraries also, by us identifying patterns ourselves and then designing the rules. I think now machine learning is really popular. Everyone's talking about machine learning. I have to add, machine learning is not always the only solution. It's not even always the best one. It's just a tool. I think if I were to do things differently, I would make Newspaper machine learning based because of how things have changed. But Newspaper being heuristic based, it still gets great utility. It's quite efficient. It's implemented purely in Python.
Some of the high level dependencies would be, well, I love the Python Requests API. It's probably my favorite Python library. It's like the epitome of a well designed API, in my opinion, and Newspaper itself was inspired by the Requests API. It relies on Requests for all the IO work, the sending of requests to news websites. It relies on lxml, I believe is what it's called, a very efficient HTML parsing library. You've probably used Beautiful Soup, but lxml is much more efficient, from my understanding, from some benchmarks.
Beautiful Soup and lxml are basically tools to parse HTML. There's some NLP functionality, which is totally not my work. I just bundled up various NLP libraries, I think it might be NLTK, to handle some of the keyword extraction. But Newspaper's fundamental value prop, the core of the algorithms and how it's implemented, would be the technology that extracts the article body text, which is heuristic based, and also the technology that identifies where the news articles are in a news website like cnn.com.
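For a sense of the roles these dependencies play, a generic snippet where Requests does the IO and lxml parses the returned HTML; this is ordinary library usage, not Newspaper's internal code:

```python
# Generic illustration: Requests fetches, lxml parses.
import requests
from lxml import html

resp = requests.get('https://example.com')
tree = html.fromstring(resp.content)              # efficient HTML parsing
title = tree.findtext('.//title')                 # first <title> text
paragraphs = [p.text_content() for p in tree.findall('.//p')]
print(title, len(paragraphs))
```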
[00:25:43] Unknown:
And in terms of your experience working on it and building and maintaining this project, what have you found to be some of the most complex or challenging aspects of building an article extraction tool and trying to automate it and build it in a way that is easy to approach and easy to understand for people who are, as you said, not necessarily developers for their day job? They're just trying to use the tool to get something done.
[00:26:11] Unknown:
The hardest, most challenging aspect of this was evaluating progress. And this is, kind of, a failure on my part that I should have done better from the beginning. This library isn't an infrastructure library. It's more about quality, being able to extract structured information at high quality. How do you measure quality? The ideal situation is, for every pull request that gets sent to Newspaper, there is some evaluation metric that reports, oh, because of this diff, Newspaper's article extraction quality improves 2%, or something like that. And this is important because when I was building this at the very start, there was always this question of, how do you know it actually got better? Because, you know, there are millions of news articles out there. This improvement to the algorithm maybe improves things for one website, but regresses things for another website. So this is a huge problem. A really principled solution would be to have an evaluation framework, a world class one, built into the library itself. So for every pull request someone makes, I would be able to say, hey, your pull request has improved quality on average by 5%, therefore I'm gonna merge it. Because of the lack of this framework, it was so challenging. Like, there are tons of pull requests of people making random improvements.
From the improvements themselves, I wouldn't be able to tell if they actually improved quality on average. I'm not able to have a holistic view of the improvements people are making. So for a couple of years, I was just accepting most pull requests, and I kinda view that as a mistake now. The library could use more focus instead of feature creep, and the challenging part is evaluating quality. That could have been solved with just having a really strong evaluation framework, a world class one.
Yeah. That would probably be the most challenging aspect. Oh, sorry, and there's one last thing I forgot to mention. This is also a pretty tough one. The whole world is moving towards kind of a mobile environment. Desktop websites are still popular, but most people are on their mobile phones now, and most websites are loaded by mobile browsers. And it doesn't help that many websites are now very dynamic. They use JavaScript to change their page dynamically, which kinda undermines a lot of my assumptions when designing Newspaper.
So navigating this new world is gonna be really challenging.
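The evaluation framework described here doesn't exist in the library. Purely as a hypothetical sketch, a regression harness over a hand-labeled golden set might look like this; the fixture paths and similarity metric are invented for illustration:

```python
# Hypothetical sketch of the evaluation harness described above; nothing like
# this ships with Newspaper. The idea: keep a golden set of saved HTML pages
# with hand-labeled body text, and score every change against it.
from difflib import SequenceMatcher
from newspaper import Article

GOLDEN_SET = [
    # (saved HTML fixture, hand-labeled expected body text) -- hypothetical files
    ('fixtures/cnn_story.html', 'fixtures/cnn_story_expected.txt'),
    ('fixtures/bbc_story.html', 'fixtures/bbc_story_expected.txt'),
]

def score_extraction(html_path, expected_path):
    """Similarity between Newspaper's extracted body text and the labeled truth."""
    with open(html_path) as f, open(expected_path) as g:
        raw_html, expected = f.read(), g.read()
    article = Article(url='http://example.com/fixture')
    article.set_html(raw_html)   # parse the saved HTML instead of fetching it
    article.parse()
    return SequenceMatcher(None, article.text, expected).ratio()

scores = [score_extraction(h, e) for h, e in GOLDEN_SET]
print(f'mean extraction quality: {sum(scores) / len(scores):.3f}')
```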
[00:28:35] Unknown:
Yeah. I was gonna ask you about what you have seen as far as the staying power of your heuristics for being able to pull information from these sites as the different websites change their layouts and probably change some of their stylistic and semantic aspects of the structure of their sites and maybe decide to change the URL patterns that you were using for determining what the articles were and just being able to maintain the library in the face of the constant change of the Internet and the sites that you're trying to gather this information from?
[00:29:09] Unknown:
Yeah. That's a great question. I think a lot about this one. It's not easy. I used to have more people helping maintain the PRs, but it's not as simple as just approving PRs as time goes on. I definitely think the library is regressing, and without an evaluation framework, it's hard to tell how badly it's regressing. I noticed on GitHub that the library still has pretty high traffic. It's still one of the most popular libraries, and I think there are, like, 10,000 people who've starred it. And there are some big companies that still use this library.
So I think it still has something going for it, and the patterns have still held up mostly. But I'm kinda nervous about the future. If websites undergo foundational changes and there's a big shift in how HTML is structured, or if everyone starts using JavaScript to dynamically load a page, this library will be less and less effective. So I actually have thoughts on how to address this, but it's gonna require really systematic changes. Like, something that has to be done is we need an evaluation framework for Newspaper that can output, as time goes on, how well this library does for the top 10,000 news websites. That would be super important.
Moving the library from heuristics to machine learning, so we can dynamically learn, you know, the patterns for news websites around the world, that'd be pretty huge also. And lastly, I think it would be nice to build an open source team to maintain the library itself. It's hard with one person, but with maybe 5 or 6 people that can help, a lot of this will be better.
[00:30:46] Unknown:
Another complex capability that it includes is being able to support multiple different languages and text formats, and right to left versus left to right, for being able to pull out this information. And I know that in some cases there's potentially a disconnect between being able to gather the information and the natural language processing frameworks available for those languages. So I'm wondering what the complexities are, or some of the incidental challenges, for being able to support that multilingual aspect of the newspapers and the journalistic environment?
[00:31:30] Unknown:
Yeah. Our approach with non-English languages is we keep all the assumptions we make in English for non-English. I know it's not great, but it's the only scalable way to do things. And our support for non-English languages is based on things like a list of stop words and foundational tokenization rules. So, using these two things, keeping everything else the same, including all the algorithms and patterns I mentioned above, we just change the stop words. So, basically, to extract structured information from an English piece of HTML, we look at patterns.
A lot of these patterns rely on English stop words. I'm just gonna assume the audience knows what that means. It's like a foundational piece of information for NLP. We rely on English stop words. We also look at, foundationally, how do you tokenize English, which is mostly using spaces and some grammatical rules, maybe. And for non-English languages, we keep everything else the same but change the stop words to support the other language. And for some unique languages that are tokenized differently, we have different tokenization algorithms for those, such as left to right versus right to left, etcetera.
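A short example of the language support described here, using newspaper3k's documented language parameter; the URL is a hypothetical placeholder:

```python
# Opting into a non-English language: pass a two-letter language code and the
# matching stop word list and tokenization rules are used instead.
import newspaper
from newspaper import Article

newspaper.languages()  # prints the table of supported language codes

article = Article('https://www.example-news.cn/news/story.html', language='zh')
article.download()
article.parse()
print(article.title)
```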
[00:32:47] Unknown:
And in terms of the ways that you see newspaper being used, what are some of the most interesting or unexpected or innovative projects that you've seen built with it?
[00:32:56] Unknown:
What's pretty fun is in the first few months of open sourcing it, I noticed that Newspaper got cloned and starred by a bunch of employees of a couple of news organizations. A lot of them are, you know, big famous news companies that you've definitely heard of in America, and even overseas companies. It was really exciting. I don't really know what they used Newspaper for. I just know that I've heard through the grapevine that they've used Newspaper in their organizations. I have a lot of PhD students and researchers reach out to me about this library, so I know there's a lot of academic work being done with Newspaper, and I also get hit up by journalists a lot about this. So in terms of what interesting things have gotten built, I feel like probably a lot of news companies and news websites may be powered by Newspaper. I think a lot of people use this to gather and structure data for research.
But I've heard journalists use this library to get a grip on what's happening around the world to aid their work, which is pretty exciting. I've seen some demos of people training models on data that Newspaper generated, for things like text reply AI and comment generation. And I've also seen people build services that analyze polarization from websites, like whether there's political bias. I actually see on Hacker News, occasionally, people build stuff, and a lot of it is supported by Newspaper, which is pretty cool. People have built a lot of different consumer websites, like news aggregators, etcetera.
And, yeah, it's great to see that they've used Newspaper.
[00:34:40] Unknown:
What is it that keeps you interested in the ongoing support and maintenance of the project and keeps you involved with continuing to work on it?
[00:34:48] Unknown:
It's got an amazing open source user base. There are always people willing to send in bug fixes and PRs and give suggestions. I really love when I see that, especially if it's a new GitHub user, because I know the feeling. When I sent my first pull request, it was also really exciting. Sometimes people even send donations, which is great. When I decline PRs, you know, it's really tough. I feel like when people wanna contribute, it's a really great group experience, because I feel like I'm getting someone into open source. Yeah. I think it's mostly the community that keeps me wanting to work on the project, but there are also a lot of bigger, philosophical reasons why I think something like Newspaper needs to exist. And I think we can probably go into this more in your next question.
[00:35:36] Unknown:
As you continue to work on Newspaper and maintain it and try to keep it up to date, maintaining relevancy as the different sites that you're working with change their structures and evolve in kind, what are some of the new capabilities or ongoing maintenance or features that you're planning on adding for Newspaper, and just your overall goals as you continue to work with it?
[00:36:03] Unknown:
Yeah. Totally. This is my favorite question, and I had a lot of ideas here. A lot of these are pretty tough ideas. This is also kinda answering the previous question of why I'm gonna still invest time on this. Like, I have a lot of other things to do, but I view this as so important. I think news is becoming more and more important, and news integrity has become one of the most important problems at the base of society today. Like, knowing if something is fake news, understanding if a piece of information is real or fake. I think now we have such a flood of information on the Internet that the integrity piece hasn't kept up. So I think one of the most exciting things planned for Newspaper is, this tool is able to extract information from articles and news organizations.
I view the next logical step, and this is just one of the goals, kind of one of the more long term, lofty ones, maybe a little unrealistic, would be to go one step further and have an entire ecosystem view of all news organizations, and be able to piece out if something is fake news, or the origins of a piece of information. I think that would be super important. I actually talked with a CNN executive once in New York, in a swimming pool, about this, and he was also saying, from his view at CNN, fake news is such a big problem. Obviously, you know, with everything happening in the political landscape and what's happened in America, I think it's just such an important reason to work in the news domain, in this whole Python data science news sphere. It's only gonna become more important, and I really want to tackle fake news with Newspaper.
Before that, I want to build a world class testing and evaluation system where, for every change into this repository, we know for sure if there's an improvement or a regression, and eventually move from the heuristic based approaches I mentioned above to machine learning based approaches, and also develop a stronger focus on pull requests. So I wanna start rejecting more pull requests, despite it seeming mean, just being more focused on what gets implemented.
[00:38:29] Unknown:
Yeah. Accepting every offer of help that comes your way, and every change set to the library, can definitely end up leading you down the path of complete unmaintainability and a slew of conflicting features and capabilities that make it impossible for you to actually make forward progress, because one of my favorite aphorisms is that open source is free as in puppy.
[00:38:55] Unknown:
Yeah. Basically.
[00:38:56] Unknown:
Are there any other aspects of the newspaper project or the ways that you were using it or the ways you've seen it being used or the overall challenges of working with news sources that we didn't discuss that you'd like to cover before we close out the show?
[00:39:10] Unknown:
Yeah. I actually view it as kind of a mistake I made, trying to be nice too often. I think maybe it's because I was still kind of new to open source. I viewed rejecting a PR as being mean. And after 5 or 6 years, it's like everything you said just now, about, you know, if you just took everyone's help, you're just gonna create this huge spaghetti mess of conflicting features. I think I was accepting way too many PRs for 2 or 3 years, and it's already gotten to the point where I've seen the problematic result of it, and I think going forward, I'm definitely gonna start rejecting more PRs. And maybe, I think it might help other open source people also, to have a charter and a long term vision for your repository, especially if you're spending a lot of time on it. Some PRs are very controversial, and you just have to make a hard decision sometimes.
[00:40:07] Unknown:
But as you said, just having a good idea of where you want to take something and what your ultimate purpose is for the project can definitely help to act as a guiding light for determining what contributions to accept and have a way to easily point to your specifications of this is what this library is meant to do for the cases where you do need to reject a pull request to say, I appreciate the work. I appreciate the effort, but this is not what I'm trying to do. And then maybe offer some ways to have extension points or ways to hook into the library or build an ecosystem where they can have a place to put those capabilities, but it doesn't have to live within your library specifically.
[00:40:49] Unknown:
Yeah. Exactly. I've actually seen some of the more experienced open source people basically reject PRs in the most polite way. It's almost comedic. Like, all the smiley faces and, you know, this is great, but probably not for this PR.
[00:41:06] Unknown:
Well, for anybody who does want to get in touch and follow along with the work that you're doing or contribute to your work on Newspaper, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose a podcast I started listening to recently from the folks at Marketplace from NPR called Million Bazillion. I started listening to that with my kids, and it's just a really entertaining podcast about money and the economy, and ways to teach kids how to think about money and deal with money. So it's a lot of fun, and very well structured for younger kids and adults as well. So I definitely recommend taking a look at that. And with that, I'll pass it to you, Lucas. Do you have any picks this week?
[00:41:48] Unknown:
Alright. So my pick would be a really good book by Paul Graham, from Y Combinator and Hacker News, called Hackers and Painters. That book was one of the things that got me into computer programming and open source. And especially for some of the newer listeners who aren't very deep in the weeds of open source and software engineering yet, I think this book gives a lot of clarity into, you know, the joy of being a hacker and a programmer.
[00:42:18] Unknown:
Yeah. I'll definitely take a look at that 1. So thank you very much for taking the time today to join me and discuss the work that you've been doing with newspaper. It's a project that I've been hearing a lot about. So I appreciate all of the time and effort you've put into that and helping to simplify the work of people who are trying to help keep the journalistic ecosystem on track and healthy. So I appreciate all of your time and effort on that, and I hope you enjoy the rest of your day. Thanks so much, Tobias. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Message
Interview with Lucas Ou-Yang: Introduction and Background
The Newspaper Project: Origin and Development
Migration from Python 2 to Python 3
Main Use Cases and Features of Newspaper
Technical Details and Challenges in News Extraction
Workflow and Practical Applications of Newspaper
Integration with NLP Libraries and Flexibility
Implementation and Dependencies of Newspaper
Challenges in Building and Maintaining Newspaper
Adapting to Changes in Web Technologies
Multilingual Support and Complexities
Interesting Uses and Community Engagement
Future Plans and Goals for Newspaper
Maintaining Focus and Quality in Open Source
Closing Remarks and Picks