Summary
News media is an important source of information for understanding the context of the world. To make it easier to access and process the contents of news sites, Lucas Ou-Yang built the Newspaper library, which automates the retrieval of articles and prepares them for analysis. In this episode he shares how the project got started, how it is implemented, and how you can get started with it today. He also discusses how recent improvements in the utility and ease of use of deep learning libraries open new possibilities for future iterations of the project.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- This portion of Python Podcast is brought to you by Datadog. Do you have an app in production that is slower than you like? Is its performance all over the place (sometimes fast, sometimes slow)? Do you know why? With Datadog, you will. You can troubleshoot your app’s performance with Datadog’s end-to-end tracing and in one click correlate those Python traces with related logs and metrics. Use their detailed flame graphs to identify bottlenecks and latency in that app of yours. Start tracking the performance of your apps with a free trial at pythonpodcast.com/datadog. If you sign up for a trial and install the agent, Datadog will send you a free t-shirt.
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host as usual is Tobias Macey and today I’m interviewing Lucas Ou-Yang about Newspaper, a framework for easily extracting and processing online articles.
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what the Newspaper project is and your motivations for creating it?
- What are the main use cases that Newspaper is built for?
- What are some libraries or tools that Newspaper might replace?
- What are the common structures in news sites that allow you to abstract across them for content extraction?
- What are some ways of determining whether a site will be a good candidate for using with Newspaper?
- Can you talk through the developer workflow of someone using Newspaper?
- What are some of the other libraries or tools that are commonly used alongside Newspaper?
- How is Newspaper implemented?
- How has the design of the project evolved since you first began working on it?
- What are some of the most complex or challenging aspects of building an automated article extraction tool?
- What are some of the most interesting, unexpected, or innovative projects that you have seen built with Newspaper?
- What keeps you interested in the ongoing support and maintenance of the project?
- What do you have planned for the future of Newspaper?
Keep In Touch
- @LucasOuYang on Twitter
- Website
- codelucas on GitHub
Picks
- Tobias
- Million Bazillion Podcast
- Lucas
- Hackers and Painters: Big Ideas from the Computer Age by Paul Graham
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Newspaper
- Los Angeles
- Django
- NLP == Natural Language Processing
- Web Scraping
- Requests
- Wintria
- Python Goose
- Diffbot
- Heuristics
- Stop Words
- RSS
- SpaCy
- Gensim
- PyTorch
- NLTK
- LXML
- Beautiful Soup
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try out a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode today, that's L-I-N-O-D-E, and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your own home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today. Your host as usual is Tobias Macey. And today, I'm interviewing Lucas Ou-Yang about Newspaper, a framework for easily extracting and processing online articles. So Lucas, can you start by introducing yourself?
[00:01:27] Unknown:
My name is Lucas. I live in Los Angeles over in California. Love Python. Love open source. Newspaper3k was a project I built in college, and it kinda blew up from there. Currently, I work on VR software, and in the past I worked on a lot of engagement surfaces, like feeds and stories.
[00:01:48] Unknown:
And do you remember how you first got introduced to Python?
[00:01:51] Unknown:
Yeah. Definitely. It was a pretty crazy story. I was just building random projects in college, and I had something pretty cool going. And this random, older, wealthy person approached me. This was probably back in 2013, 2014, and there was a lot of excitement around starting tech companies. I think back then, Google and Facebook were really talked about a lot, and people wanted to have the next big thing. So there's usually, on college campuses, a lot of, like, these aren't really VCs. They're just more, like, people with business backgrounds that want people with technical expertise to help them build stuff. So, basically, some old rich dude wanted me to help them build something, and he wanted to fund our news extraction. Like, I was playing around with news technology at the time. I really liked Reddit. I wanted to build something with a similar concept, and this person basically wanted to fund us. Given the nature of the problem, you know, it's news, it really involves scraping data and having web services.
Python was naturally the way to go. So I picked up Django and a bunch of Python NLP packages, and eventually dove into the world of web scraping. Yeah. It was a lot of fun. And I guess I won't go too far into it, but the actual product we built was a real time news website that shows you news articles throughout the world as they were published. So it's like you see the articles flashing onto your screen. It had initial success, but it kinda tanked due to lack of product market fit.
[00:03:36] Unknown:
You mentioned that one of the things you were working on when you were in college, experimenting with this real time news aggregator, was this newspaper data extraction, or journalistic article extraction. So can you describe a bit more about the project itself and the Newspaper library that you maintain, and some of the origin of how that came about?
[00:03:59] Unknown:
Yeah. Totally. This is a really unbelievable story, but it was a lot of fun going through it. Basically, the name of the startup was Wintria, and the whole value proposition was a consumer website that delivers news in real time. And this may seem unrelated to Newspaper, but I'll go into why it's super related later. And we built this really well refined product that had a couple of thousand people using it every day. It was all powered by this proprietary news aggregation technology, totally built in Python.
The service itself, the website, was also in Python. It was in Django. And after the startup failed commercially, I realized a lot of the work we did on the news aggregation technology was more advanced than what existed in open source. So I had this crazy idea. This was after the startup failed. I wasn't really talking with the VC partners or the cofounders anymore. It was just me by myself. And, you know, I have just a pure engineer mindset. I just wanted to make the code as easy to use as possible. I wasn't really thinking about the business anymore. So I just generalized the code and built this really clean API. I wanted to make it accessible for anyone to use, because I think during 2013 and 2014, a lot of the APIs were super hard to use, and you needed to have a strong understanding of computer science and coding, and even DevOps, to even use them.
And I was really inspired at the time by other really clean, human centric APIs, such as a lot of famous Python repositories like Requests and a couple more. So, basically, I took the underlying technology of Wintria, which was the startup, and packaged it into Newspaper3k, which is the repository we're talking about today. And, basically, overnight after open sourcing it, it became this huge hit on GitHub. Even Kenneth Reitz, the guy who made Requests, gave me a shout-out, and I was super happy about it. Over the next few months, a lot of other open source people, or people on Twitter who work in tech, reached out to me about it. It was a really fun time, really invigorating. And I think it was after that, I knew I wanted to work more on this library. And that was basically the creation of Newspaper.
[00:06:34] Unknown:
And as you mentioned, the current version of the library is called Newspaper3k because of its support for Python 3, and there was a Python 2 version that the readme says is still available but was fairly buggy. So I'm wondering if you can maybe talk a bit about some of the work that went into making that migration from Python 2 to Python 3, and whether there was any work that you did to reinvent the API or rearchitect the library to make it more maintainable.
[00:07:03] Unknown:
Yeah. It's a great question. I remember the migration not involving any crazy foundational changes to the architecture or the actual algorithms. A lot of it was more of, oh, okay, now all the requirements are finally in Python 3. So it's really much more of just waiting for all the dependencies to become Python 3 supported and then changing everything to Python 3. I don't remember anything crazy about it.
[00:07:33] Unknown:
Digging more into the newspaper library itself, what are some of the main use cases that it was designed for and some of the ways that people are using it now that it's been released as open source?
[00:07:44] Unknown:
So one of the key value props of Newspaper is that it simplifies information retrieval for journalists. And the last bit is super important because, obviously, if you have a huge budget, you can afford to hire big engineering teams or data scientists or DevOps. You could just build an in-house solution that can do whatever you want. But this isn't the case for most people, especially not for independent journalists or researchers in colleges. Really, the reason why this project became a big hit was that people, especially people without the computer science or DevOps or software experience, were able to achieve their goals of extracting structured information from the web.
[00:08:31] Unknown:
For people who are using Newspaper, what are some of the tools or other libraries or frameworks that it might replace or augment, and what are some of the ways that it simplifies the overall work of actually retrieving information from these journalistic sources?
[00:08:47] Unknown:
So Newspaper has several components. It handles sending requests and parallelizing requests. There's a lot of infrastructure optimization, like throttling requests and sending a bunch of requests in parallel. Otherwise, it takes forever, especially because you're extracting from potentially tens of thousands of news websites. Some of them are super slow. Like, you might be surprised. Some of them literally take 5 seconds to return. So the library has to do a lot of intelligent throttling and parallelizing of requests. So there's that scraping bit, but there's also pieces around understanding and parsing the actual returned information, like document understanding, identifying patterns in the HTML to be able to extract the data. I'll go more into the details of this later. And another really novel innovation for Newspaper was understanding news websites as a whole, not just a piece of HTML, but an actual news website, like CNN or BuzzFeed.
Like, what are patterns for entire news websites, and how can this library parse everything in the news organization? And there's also pieces around natural language processing, information retrieval, and also smartly caching the data. So with just a few computers, you can crawl tens of thousands of websites. You don't need a huge data center. I think to answer your other question, it's meant to replace lower level software components like Requests, the famous Python requests library, a bunch of NLP libraries, and low level extraction libraries like Goose. Yeah. So mostly those. And even replacing paid services like Diffbot.
[00:10:24] Unknown:
Because of the fact that Newspaper is focused primarily on articles from news sources, what are the common structures in news sites that allow you to abstract across them for being able to extract the content, and what are some of the ways that somebody might be able to determine whether a given website is going to be a good candidate for using Newspaper on?
[00:10:47] Unknown:
Yeah. This is a great question. Newspaper is by no means the best strategy out there. I think among the open source tools, Newspaper is probably one of the better ones, and it definitely has a really simple to use API, so it's very accessible. But when talking about actually how these tools extract data, it's all about identifying patterns in news sites and patterns in pieces of HTML. Like, there's a distinction between scraping from an entire news website like cnn.com versus scraping from an individual article, like some Reddit page. I'll go into both.
But at a high level, Newspaper is right now heuristic based, which, for those who don't know that term, means it uses rules, like a bunch of if statements. It's not machine learning based. So when I say heuristic, it's in contrast to machine learning. I'd do things differently now if this was being built from scratch, because machine learning is just a lot more scalable and stronger at identifying patterns. But for articles and news websites, we found a couple of key patterns that this library is basically able to use to extract structured information. Like, there's several different problems that this tool, and other tools like Newspaper, solve, such as identifying what's the full article text, the body text of a piece of HTML.
What's the title? What are the authors? What are the publishing dates? What are the keywords? And when we talk about extracting structured information from an entire news website, it becomes, which are the actual article links, instead of random advertisement pages or about pages. I think one of the hardest problems by far is extracting the actual full text, the full article body, from a piece of HTML. So I'll go into that. We identify really high hyperlink densities in a cluster of text. That's usually indicative that this is not an article.
Like, if you see a bunch of hyperlinks very close together, it's usually an advertisement or a header. A lot of you can probably think of examples yourself because you browse websites, but it's probably not a piece of the actual body text that you want. We look at things like stop word density, if there's a lot of stop words, and also density of advertisements, things like these. These signals can reveal that that part of the HTML is probably not the article body. There's also other patterns, like distance and clustering of HTML tags. The actual HTML tags themselves are really revealing. Like, whether it's a paragraph tag or a div tag, or how much text is in each individual tag. You're able to identify things like whether there is a big list of comments versus heavy bodies of text. There's a bunch of stuff like that, and you can kinda see a pattern here. It's just a bunch of different rules that apply across many different news websites and articles that this library uses to extract the structured information. Other things, like titles, are much easier because there are a lot more obvious patterns, like h1 tags. And, usually, the news websites give you the titles themselves. Another bit where Newspaper has a really interesting advantage over other tools is being able to identify all the news articles inside a news website like cnn.com.
The developer just needs to type cnn.com, and this library can figure out all the URLs that are actually articles, not just random pages, and it does so by identifying patterns inside the URL. This is a very novel technique that was invented inside Newspaper. It didn't happen anywhere else. Like, if a URL contains a date and something like a title, there's a very, very high chance it's an article. And, yeah, these are a lot of the common strategies we use. But I guess a lot of people are wondering if it uses machine learning, and the answer is no. It doesn't use machine learning, but I would use machine learning if it was being rebuilt.
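As a rough sketch of the kind of rules being described here, link density and date-in-URL checks might look something like the following. This is illustrative code, not Newspaper's actual implementation, and the regexes and thresholds are assumptions:

```python
# Illustrative sketch of two of the heuristics described above; this is not
# Newspaper's actual implementation, and the regexes are assumptions.
import re
from lxml import html

def link_density(node):
    """Fraction of a node's text that sits inside <a> tags.
    High density usually means navigation or ads, not article body."""
    text_len = len(node.text_content()) or 1
    link_len = sum(len(a.text_content()) for a in node.findall('.//a'))
    return link_len / text_len

def looks_like_article_url(url):
    """URLs containing a date plus a slug-like title are very likely articles,
    e.g. https://example.com/2020/08/31/some-headline-here"""
    has_date = re.search(r'/20\d{2}/\d{1,2}(/\d{1,2})?/', url) is not None
    has_slug = re.search(r'/[a-z0-9]+(-[a-z0-9]+){2,}', url) is not None
    return has_date and has_slug

fragment = html.fromstring(
    "<div><p>A plain paragraph of body text with plenty of ordinary words.</p>"
    "<p><a href='#'>Ad</a> <a href='#'>More ads</a> <a href='#'>Nav</a></p></div>")
for p in fragment.findall('.//p'):
    print(round(link_density(p), 2))   # low for body text, near 1.0 for link clusters
print(looks_like_article_url('https://example.com/2020/08/31/some-headline-here'))
```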
[00:15:12] Unknown:
For some of these news websites, they might also have an RSS feed where you can pull the most recent articles, and possibly go back to a certain point in the history of their publications. Are there cases where Newspaper is actually a better choice for retrieving information from the site than just going straight to the RSS feed, given that the feed is already fairly well structured for extracting information?
[00:15:37] Unknown:
Yeah. This is actually a really good question. And this is a lot more of a philosophical, foundational question of, should these tools even exist, or should we just have better web standards? This entire problem wouldn't even be a problem if we had improved standards within the web and HTML that let developers and news organizations specify what they want people to query. Like, oh, here's my title. Here's my article body. RSS is a lot more explicit. They're actually giving you a lot more structured information. I feel that's one avenue, but we're certainly not going down that path, from my knowledge. Maybe things have changed recently. I haven't been keeping up, but I definitely feel the other avenue, which is from the side of people who want the information, is we have to do the scraping ourselves.
Like, in your RSS example, many news websites don't give you RSS, especially the smaller ones. And even for the big news websites that do have an RSS feed, it's usually quite limited. I don't have statistics for this, and it probably would be better if I had the numbers, but from my understanding, it's limited by which articles they wanna give you. There's a bunch of limitations. Something like Newspaper is definitely better if you want the flexibility of having the entire view, if you want everything on the website itself. Like, this library is not just able to query all 5,000 articles on CNN today. It's even able to do things like throttle requests so the news organization doesn't suspect that you're trying to DDoS them or send way too many requests. It's doing it in a respectful way. There are ways of doing this which honor the robots.txt, which means you're, you know, playing by the rules. And at a high level, I feel like RSS, and I think Google also implemented something to specifically give tags that indicate which information someone can have, that approach is good, but not everyone's gonna adopt it. And I think even right now in the current ecosystem, many news websites don't adopt it, especially smaller and medium news websites.
So something like Newspaper is gonna be super important if you want all the information instead of just a small subset, and also if you care about small news vendors, not just CNN.
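The throttling and politeness knobs mentioned above are exposed through newspaper3k's Config object. A minimal sketch, assuming the documented Config attributes; the user agent string and values are arbitrary examples, not recommended defaults:

```python
# A small sketch of polite-crawling settings, assuming newspaper3k's
# documented Config object (pip install newspaper3k).
import newspaper
from newspaper import Config

config = Config()
config.browser_user_agent = 'my-research-bot/0.1'  # hypothetical UA: identify yourself
config.request_timeout = 10       # give slow news sites time to respond
config.number_threads = 5         # keep parallelism modest so you don't hammer a site
config.memoize_articles = False   # re-crawl everything instead of skipping seen URLs

paper = newspaper.build('https://www.cnn.com', config=config)
print(paper.size())  # number of article URLs Newspaper identified on the site
```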
[00:17:57] Unknown:
Yeah. Particularly for the people who are maintaining the sites, having an RSS feed is an extra maintenance burden unless it's something that's built into the publishing platform that they're using. And particularly for newspapers or sites that are trying to build a subscription based model, RSS is another challenge because you would have to have an authenticated feed for people who have a subscription, and then it's hard to ensure that they're actually using it just for themselves, which is where they'll likely push people more to using the website. So I can see that there's definitely a lot of challenges and conflicting priorities in terms of making the information easily accessible and easily used in contexts for programmatic usage versus the original intent of just publishing it for the sake of an individual person to be able to read it.
[00:18:48] Unknown:
Yeah. Totally. I especially agree with your point on, like, if it were a small company, having an RSS feed would definitely be a pretty big obstacle.
[00:18:57] Unknown:
For somebody who is actually using newspaper for being able to retrieve some information and do some processing on a set of articles, can you talk through the overall workflow of somebody getting started and working through being able to extract the information and then process it and use it for some other purpose other than just for individual consumption?
[00:19:18] Unknown:
Yeah. Definitely. It's designed to have a simple workflow, and there are also several advanced functionalities that people with more experience can utilize. There are two approaches. If you want to extract everything in a news organization like cnn.com, you could use the higher level newspaper API, as opposed to Newspaper's Article API. The high level newspaper API just lets you type in any organization name, like cnn.com, and this library will send out requests, throttle them, and parallelize them to CNN, identify all the articles, and then objectify them inside Python. So you have this huge list of actual articles, and a list of categories and genres, all as Python lists, all done by Newspaper.
And the developer can then, it's in the code now, so you can iterate over the articles. The articles themselves are now wrapped in Newspaper's Article API, which lets you download the HTML, parse it, and figure out the title, body text, authors, etcetera. And it's designed to maximize flexibility. So if you're the developer, you now have a list of articles from CNN, literally Article objects that you can play with. There's a lot of really cool use cases from this. Like, I've heard of researchers and journalists both using this library to do things like studying text sentiment from political websites, doing machine learning training, basically using the library to supply training data for machine learning to predict publication dates or whatever, or doing analysis on emojis, and training text AIs.
Yeah. A bunch of cool stuff like that.
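To make that two-level workflow concrete, here is a minimal sketch against newspaper3k's documented Source and Article APIs; cnn.com is just an example target:

```python
# A minimal sketch of the two-level workflow, using newspaper3k's documented
# API (pip install newspaper3k).
import newspaper

# High-level Source API: point it at a news organization and let Newspaper
# discover the article URLs and category pages itself.
paper = newspaper.build('https://www.cnn.com', memoize_articles=False)
print(len(paper.articles))         # Article objects discovered on the site
print(paper.category_urls()[:5])   # a few of the section/category URLs it found

# Lower-level Article API: download and parse a single article.
article = paper.articles[0]
article.download()
article.parse()
print(article.title, article.authors, article.publish_date)
print(article.text[:200])          # start of the extracted body text

# Optional NLP pass (requires NLTK's 'punkt' tokenizer to be downloaded).
article.nlp()
print(article.keywords, article.summary)
```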
[00:21:06] Unknown:
This episode of Podcast.__init__ is sponsored by Datadog, the premier monitoring solution for modern environments. Datadog's latest features help teams visualize granular application data for more effective troubleshooting and optimization. Datadog Continuous Profiler analyzes your production level code and collects different profile types, such as CPU, memory allocation, IO, and more, enabling you to search, analyze, and debug code level performance in real time. Correlate and pivot between profiles and distributed traces to find slow or resource intensive requests. In addition, Datadog's application performance monitoring live search lets you search across a real time stream of all ingested traces from your services.
For even more detail, filter individual traces by infrastructure, application, and custom tags. Datadog has a special offer for Podcast.__init__ listeners. Sign up for a free 14 day trial at pythonpodcast.com/datadog. Install the Datadog agent and receive one of Datadog's famously cozy t-shirts for free. And for people who are using Newspaper, particularly for the natural language elements of it, are there particular libraries that you initially intended to integrate with, and do you structure the content in a way that makes it easy to feed into them for building training models? Or is it just a sort of text object that you can process however you want, for use with NLTK or spaCy or Gensim or PyTorch or whatever it is that you're working with?
[00:22:42] Unknown:
This was designed to be as flexible as possible. When building this, we didn't make any assumptions on how people would use it. It just seems that after open sourcing Newspaper, the natural customer seems to be mostly journalists or researchers. So I guess mostly people with some coding background, but not necessarily very heavy, because this library gives you flexibility. It gives you a lot of data in a structured format that you can play with. And you can also use it with, you know, NLTK and these other tools that you mentioned, but it does it in a higher level way. So if someone wants a lot more control, then Newspaper might not be the right library for you. You probably want something lower level.
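Since Article.text is just a plain string, wiring Newspaper into a downstream NLP library is straightforward. A small sketch using spaCy as an example; the article URL is a hypothetical placeholder, and the spaCy model must be downloaded separately:

```python
# One way to feed Newspaper's output into a downstream NLP tool
# (assumes: pip install spacy && python -m spacy download en_core_web_sm).
import spacy
from newspaper import Article

article = Article('https://www.example-news.com/2020/08/31/some-story')  # hypothetical URL
article.download()
article.parse()

nlp = spacy.load('en_core_web_sm')
doc = nlp(article.text)              # Article.text is a plain string
people = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
print(people)                        # named people mentioned in the article body
```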
[00:23:26] Unknown:
Digging more into Newspaper itself, can you talk a bit more about how it's implemented, some of the other libraries that it relies on for being able to function, and some of the ways that the overall design has evolved since you first began working on it?
[00:23:40] Unknown:
Yeah. Definitely. I know I've said it before, but it's important to stress again that it's heuristic based, based on rules and patterns, not machine learning. So it's not pretrained on some set of news articles. It was designed by me, and inspired by other libraries also, by us identifying patterns ourselves and then designing the rules. I think now machine learning is really popular. Everyone's talking about machine learning. I have to add, machine learning is not always the only solution. It's not even always the best one. It's just a tool. I think if I were to do things differently, I would make Newspaper machine learning based because of how things have changed. But Newspaper being heuristic based, it still gets great utility. It's quite efficient. It's implemented purely in Python.
Some of the high level dependencies would be, well, I love the Python Requests API. It's probably my favorite Python library. It's like the epitome of a well designed API, in my opinion, and Newspaper itself was inspired by the Requests API. It relies on Requests for all the IO work, the sending of requests to news websites. It relies on lxml, I believe is what it's called, a very efficient HTML parsing library. You've probably used Beautiful Soup, but lxml is much more efficient, from my understanding, from some benchmarks.
Beautiful Soup and lxml are basically tools to parse HTML. There's some NLP functionality, which is totally not my work. I just bundled up various NLP libraries, I think it might be NLTK, to handle some of the keyword extraction. But Newspaper's fundamental value prop, the core of the algorithms and how it's implemented, would be the technology that extracts the article body text, which is heuristic based, and also the technology that identifies where the news articles are in a news website like cnn.com.
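For a sense of the roles these dependencies play, a generic snippet where Requests does the IO and lxml parses the returned HTML; this is ordinary library usage, not Newspaper's internal code:

```python
# Generic illustration: Requests fetches, lxml parses.
import requests
from lxml import html

resp = requests.get('https://example.com')
tree = html.fromstring(resp.content)              # efficient HTML parsing
title = tree.findtext('.//title')                 # first <title> text
paragraphs = [p.text_content() for p in tree.findall('.//p')]
print(title, len(paragraphs))
```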
[00:25:43] Unknown:
And in terms of your experience working on it and building and maintaining this project, what have you found to be some of the most complex or challenging aspects of building an article extraction tool and trying to automate it and build it in a way that is easy to approach and easy to understand for people who are, as you said, not necessarily developers for their day job? They're just trying to use the tool to get something done.
[00:26:11] Unknown:
The hardest, most challenging aspect of this was evaluating progress. And this is, kind of, a failure on my part that I should have done better from the beginning. This library isn't an infrastructure library. It's more about quality, being able to extract structured information at high quality. How do you measure quality? The ideal situation is, for every pull request that gets sent to Newspaper, there is some evaluation metric that reports, oh, because of this diff, Newspaper's article extraction quality improves 2%, or something like that. And this is important because when I was building this at the very start, there was always this question of, how do you know it actually got better? Because, you know, there are millions of news articles out there. This improvement to the algorithm maybe improves things for one website, but regresses things for another website. So this is a huge problem. A really principled solution would be to have an evaluation framework, a world class one, built into the library itself. So for every pull request someone makes, I would be able to say, hey, your pull request has improved quality on average by 5%, therefore I'm gonna merge it. Because of the lack of this framework, it was so challenging. Like, there are tons of pull requests of people making random improvements.
From the improvements themselves, I wouldn't be able to tell if they actually improved quality on average. I'm not able to have a holistic view of the improvements people are making. So for a couple of years, I was just accepting most pull requests, and I kinda view that as a mistake now. The library could use more focus instead of feature creep, and the challenging part is evaluating quality. That could have been solved with just having a really strong evaluation framework, a world class one.
Yeah. That would probably be the most challenging aspect. Oh, sorry, and there's one last thing I forgot to mention. This is also a pretty tough one. The whole world is moving towards kind of a mobile environment. Desktop websites are still popular, but most people are on their mobile phones now, and most websites are loaded by mobile browsers. And it doesn't help that many websites are now very dynamic. They use JavaScript to change their page dynamically, which kinda undermines a lot of my assumptions when designing Newspaper.
So navigating this new world is gonna be really challenging.
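The evaluation framework described here doesn't exist in the library. Purely as a hypothetical sketch, a regression harness over a hand-labeled golden set might look like this; the fixture paths and similarity metric are invented for illustration:

```python
# Hypothetical sketch of the evaluation harness described above; nothing like
# this ships with Newspaper. The idea: keep a golden set of saved HTML pages
# with hand-labeled body text, and score every change against it.
from difflib import SequenceMatcher
from newspaper import Article

GOLDEN_SET = [
    # (saved HTML fixture, hand-labeled expected body text) -- hypothetical files
    ('fixtures/cnn_story.html', 'fixtures/cnn_story_expected.txt'),
    ('fixtures/bbc_story.html', 'fixtures/bbc_story_expected.txt'),
]

def score_extraction(html_path, expected_path):
    """Similarity between Newspaper's extracted body text and the labeled truth."""
    with open(html_path) as f, open(expected_path) as g:
        raw_html, expected = f.read(), g.read()
    article = Article(url='http://example.com/fixture')
    article.set_html(raw_html)   # parse the saved HTML instead of fetching it
    article.parse()
    return SequenceMatcher(None, article.text, expected).ratio()

scores = [score_extraction(h, e) for h, e in GOLDEN_SET]
print(f'mean extraction quality: {sum(scores) / len(scores):.3f}')
```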
[00:28:35] Unknown:
Yeah. I was gonna ask you about what you have seen as far as the staying power of your heuristics for being able to pull information from these sites as the different websites change their layouts and probably change some of their stylistic and semantic aspects of the structure of their sites and maybe decide to change the URL patterns that you were using for determining what the articles were and just being able to maintain the library in the face of the constant change of the Internet and the sites that you're trying to gather this information from?
[00:29:09] Unknown:
Yeah. That's a great question. I think a lot about this one. It's not easy. I used to have more people helping maintain the PRs, but it's not as simple as just approving PRs as time goes on. I definitely think the library is regressing, and without an evaluation framework, it's hard to tell how badly it's regressing. I noticed on GitHub that the library still has pretty high traffic. It's still one of the most popular libraries, and I think there are, like, 10,000 people who've starred it. And there are some big companies that still use this library.
So I think it still has something going for it, and the patterns have still held up mostly. But I'm kinda nervous about the future. If websites undergo foundational changes and there's a big shift in how HTML is structured, or if everyone starts using JavaScript to dynamically load a page, this library will be less and less effective. So I actually have thoughts on how to address this, but it's gonna require really systematic changes. Like, something that has to be done is we need an evaluation framework for Newspaper that can output, as time goes on, how well this library does for the top 10,000 news websites. That would be super important.
Moving the library from heuristics to machine learning, so we can dynamically learn, you know, the patterns for news websites around the world, that'd be pretty huge also. And lastly, I think it would be nice to build an open source team to maintain the library itself. It's hard with one person, but with maybe 5 or 6 people that can help, a lot of this will be better.
[00:30:46] Unknown:
Another complex capability that it includes is being able to support multiple different languages and text formats, and right to left versus left to right, for being able to pull out this information. And I know that in some cases there's potentially a disconnect between being able to gather the information and the natural language processing frameworks available for those languages. So I'm wondering what the complexities are, or some of the incidental challenges, for being able to support that multilingual aspect of the newspapers and the journalistic environment?
[00:31:30] Unknown:
Yeah. Our approach with non-English languages is we keep all the assumptions we make in English for non-English. I know it's not great, but it's the only scalable way to do things. And our support for non-English languages is based on things like a list of stop words and foundational tokenization rules. So, using these two things, keeping everything else the same, including all the algorithms and patterns I mentioned above, we just change the stop words. So, basically, to extract structured information from an English piece of HTML, we look at patterns.
A lot of these patterns rely on English stop words. I'm just gonna assume the audience knows what that means. It's like a foundational piece of information for NLP. We rely on English stop words. We also look at, foundationally, how do you tokenize English, which is mostly using spaces and some grammatical rules, maybe. And for non-English languages, we keep everything else the same but change the stop words to support the other language. And for some unique languages that are tokenized differently, we have different tokenization algorithms for those, such as left to right versus right to left, etcetera.
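A short example of the language support described here, using newspaper3k's documented language parameter; the URL is a hypothetical placeholder:

```python
# Opting into a non-English language: pass a two-letter language code and the
# matching stop word list and tokenization rules are used instead.
import newspaper
from newspaper import Article

newspaper.languages()  # prints the table of supported language codes

article = Article('https://www.example-news.cn/news/story.html', language='zh')
article.download()
article.parse()
print(article.title)
```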
[00:32:47] Unknown:
And in terms of the ways that you see newspaper being used, what are some of the most interesting or unexpected or innovative projects that you've seen built with it?
[00:32:56] Unknown:
What's pretty fun is in the first few months of open sourcing it, I noticed that Newspaper got cloned and starred by a bunch of employees of a couple of news organizations. A lot of them are, you know, big famous news companies that you've definitely heard of in America, and even overseas companies. It was really exciting. I don't really know what they used Newspaper for. I just know that I've heard through the grapevine that they've used Newspaper in their organizations. I have a lot of PhD students and researchers reach out to me about this library, so I know there's a lot of academic work being done with Newspaper, and I also get hit up by journalists a lot about this. So in terms of what interesting things have gotten built, I feel like probably a lot of news companies and news websites may be powered by Newspaper. I think a lot of people use this to gather and structure data for research.
But I've heard journalists use this library to get a grip on what's happening around the world to aid their work, which is pretty exciting. I've seen some demos of people training models on data that Newspaper generated, for things like text reply AI and comment generation. And I've also seen people build services that analyze polarization from websites, like whether there's political bias. I actually see on Hacker News, occasionally, people build stuff, and a lot of it is supported by Newspaper, which is pretty cool. People have built a lot of different consumer websites, like news aggregators, etcetera.
And, yeah, it's great to see that they've used Newspaper.
[00:34:40] Unknown:
What is it that keeps you interested in the ongoing support and maintenance of the project and keeps you involved with continuing to work on it?
[00:34:48] Unknown:
It's got an amazing open source user base. There are always people willing to send in bug fixes and PRs and give suggestions. I really love when I see that, especially if it's a new GitHub user, because I know the feeling. When I sent my first pull request, it was also really exciting. Sometimes people even send donations, which is great. When I decline PRs, you know, it's really tough. I feel like when people wanna contribute, it's a really great group experience, because I feel like I'm getting someone into open source. Yeah. I think it's mostly the community that keeps me wanting to work on the project, but there are also a lot of bigger, philosophical reasons why I think something like Newspaper needs to exist. And I think we can probably go into this more in your next question.
[00:35:36] Unknown:
As you continue to work on Newspaper and maintain it and try to keep it up to date, maintaining relevancy as the different sites that you're working with change their structures and evolve in kind, what are some of the new capabilities or ongoing maintenance or features that you're planning on adding for Newspaper, and just your overall goals as you continue to work with it?
[00:36:03] Unknown:
Yeah. Totally. This is my favorite question, and I had a lot of ideas here. A lot of these are pretty tough ideas. This is also kinda answering the previous question of why I'm gonna still invest time on this. Like, I have a lot of other things to do, but I view this as so important. I think news is becoming more and more important, and news integrity has become one of the most important problems at the base of society today. Like, knowing if something is fake news, understanding if a piece of information is real or fake. I think now we have such a flood of information on the Internet that the integrity piece hasn't kept up. So I think one of the most exciting things planned for Newspaper is, this tool is able to extract information from articles and news organizations.
I view the next logical step, and this is just one of the goals, kind of one of the more long term, lofty ones, maybe a little unrealistic, would be to go one step further and have an entire ecosystem view of all news organizations, and be able to piece out if something is fake news, or the origins of a piece of information. I think that would be super important. I actually talked with a CNN executive once in New York, in a swimming pool, about this, and he was also saying, from his view at CNN, fake news is such a big problem. Obviously, you know, with everything happening in the political landscape and what's happened in America, I think it's just such an important reason to work in the news domain, in this whole Python data science news sphere. It's only gonna become more important, and I really want to tackle fake news with Newspaper.
Before that, I want to build a world class testing and evaluation system where, for every change into this repository, we know for sure if there's an improvement or a regression, and eventually move from the heuristic based approaches I mentioned above to machine learning based approaches, and also develop a stronger focus on pull requests. So I wanna start rejecting more pull requests, despite it seeming mean, just being more focused on what gets implemented.
[00:38:29] Unknown:
Yeah. Accepting every offer of help that comes your way, and every change set to the library, can definitely end up leading you down the path of complete unmaintainability and a slew of conflicting features and capabilities that make it impossible for you to actually make forward progress, because one of my favorite aphorisms is that open source is free as in puppy.
[00:38:55] Unknown:
Yeah. Basically.
[00:38:56] Unknown:
Are there any other aspects of the newspaper project or the ways that you were using it or the ways you've seen it being used or the overall challenges of working with news sources that we didn't discuss that you'd like to cover before we close out the show?
[00:39:10] Unknown:
Yeah. I actually view it as kind of a mistake I made, trying to be nice too often. I think maybe it's because I was still kind of new to open source. I viewed rejecting a PR as being mean. And after 5 or 6 years, it's like everything you said just now, about, you know, if you just took everyone's help, you're just gonna create this huge spaghetti mess of conflicting features. I think I was accepting way too many PRs for 2 or 3 years, and it's already gotten to the point where I've seen the problematic result of it, and I think going forward, I'm definitely gonna start rejecting more PRs. And maybe, I think it might help other open source people also, to have a charter and a long term vision for your repository, especially if you're spending a lot of time on it. Some PRs are very controversial, and you just have to make a hard decision sometimes.
[00:40:07] Unknown:
But as you said, just having a good idea of where you want to take something and what your ultimate purpose is for the project can definitely help to act as a guiding light for determining what contributions to accept and have a way to easily point to your specifications of this is what this library is meant to do for the cases where you do need to reject a pull request to say, I appreciate the work. I appreciate the effort, but this is not what I'm trying to do. And then maybe offer some ways to have extension points or ways to hook into the library or build an ecosystem where they can have a place to put those capabilities, but it doesn't have to live within your library specifically.
[00:40:49] Unknown:
Yeah. Exactly. I've actually seen some of the more experienced open source people basically reject PRs in the most polite way. It's almost comedic. Like, all the smiley faces and, you know, this is great, but probably not for this PR.
[00:41:06] Unknown:
Well, for anybody who does want to get in touch and follow along with the work that you're doing or contribute to your work on Newspaper, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose a podcast I started listening to recently from the folks at Marketplace from NPR called Million Bazillion. I started listening to that with my kids, and it's just a really entertaining podcast about money and the economy, and ways to teach kids how to think about money and deal with money. So it's a lot of fun, and very well structured for younger kids and adults as well. So I definitely recommend taking a look at that. And with that, I'll pass it to you, Lucas. Do you have any picks this week?
[00:41:48] Unknown:
Alright. So my pick would be a really good book by Paul Graham, from Y Combinator and Hacker News, called Hackers and Painters. That book was one of the things that got me into computer programming and open source. And especially for some of the newer listeners who aren't very deep in the weeds of open source and software engineering yet, I think this book gives a lot of clarity into, you know, the joy of being a hacker and a programmer.
[00:42:18] Unknown:
Yeah. I'll definitely take a look at that 1. So thank you very much for taking the time today to join me and discuss the work that you've been doing with newspaper. It's a project that I've been hearing a lot about. So I appreciate all of the time and effort you've put into that and helping to simplify the work of people who are trying to help keep the journalistic ecosystem on track and healthy. So I appreciate all of your time and effort on that, and I hope you enjoy the rest of your day. Thanks so much, Tobias. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Message
Interview with Lucas Ou-Yang: Introduction and Background
The Newspaper Project: Origin and Development
Migration from Python 2 to Python 3
Main Use Cases and Features of Newspaper
Technical Details and Challenges in News Extraction
Workflow and Practical Applications of Newspaper
Integration with NLP Libraries and Flexibility
Implementation and Dependencies of Newspaper
Challenges in Building and Maintaining Newspaper
Adapting to Changes in Web Technologies
Multilingual Support and Complexities
Interesting Uses and Community Engagement
Future Plans and Goals for Newspaper
Maintaining Focus and Quality in Open Source
Closing Remarks and Picks