Summary
The internet is a rich source of information, but the majority of it isn’t accessible programmatically through APIs or databases. To address that shortcoming, there are a variety of web scraping frameworks that aid in extracting structured data from web pages. In this episode Attila Tóth shares the challenges of web data extraction, the ways that you can use it, and how Scrapy and ScrapingHub can help you with your projects.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- This portion of Python Podcast is brought to you by Datadog. Do you have an app in production that is slower than you like? Is its performance all over the place (sometimes fast, sometimes slow)? Do you know why? With Datadog, you will. You can troubleshoot your app’s performance with Datadog’s end-to-end tracing and in one click correlate those Python traces with related logs and metrics. Use their detailed flame graphs to identify bottlenecks and latency in that app of yours. Start tracking the performance of your apps with a free trial at datadog.com/pythonpodcast. If you sign up for a trial and install the agent, Datadog will send you a free t-shirt.
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host as usual is Tobias Macey and today I’m interviewing Attila Tóth about doing data extraction with web scraping.
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by explaining what web scraping is and when you might want to use it?
- How did you first get started with web scraping?
- There are a number of options for web scraping tools in Python, as well as other languages. What are the characteristics of the Scrapy project and community that have made it stand out and retain such widespread popularity?
- One of the perpetual questions with web scraping is that of copyright and content ownership. What should we all be aware of when scraping a given website?
- What are some of the most challenging aspects of crawling and scraping the web?
- What are some of the features of Scrapy that aid in those challenges?
- Once you have retrieved the content from a site, what are some of the considerations for storing and processing the data that we should be thinking about?
- How can we guard against a scraper breaking due to changes in the layout of a site, or simple updates that weren’t accounted for in the initial implementation?
- What are some of the most complicated aspects of scaling web scrapers?
- For someone who is interested in using Scrapy, what are some of the common pitfalls that they should be aware of?
- What are some of the most interesting, innovative, or unexpected projects that are built with Scrapy and ScrapingHub?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working with web scrapers and ScrapingHub?
- What resources would you recommend to anyone who is looking to learn more about web scraping?
Keep In Touch
Picks
- Tobias
- Attila
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Web Scraping
- ScrapingHub
- Java
- Android
- Scrapy
- JSoup
- HTMLUnit
- Selenium
- Pandas
- robots.txt
- Puppeteer
- Splash
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try out a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode today. That's L-I-N-O-D-E, and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis.
For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your own home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today. Your host as usual is Tobias Macey. Today, I'm interviewing Attila Tóth about doing data extraction with web scraping. So, Attila, can you start by introducing yourself? Hey, Tobias. Yeah. Thanks for having me. First of all, I work as a technology evangelist at ScrapingHub,
[00:01:30] Unknown:
which is a company that does web data extraction. My job is really to just educate people about web scraping, both technical people and business people: how to do web scraping, what web scraping is, what the challenges associated with web scraping are, and how it can be useful for people and useful for businesses.
[00:01:49] Unknown:
That's what I mainly do. And in terms of speaking to those different audiences, what have you found to be some of the most common points of confusion and some of the changes in the way that you need to discuss the subject matter to help them out and make sure that it's the right solution for them? In my experience with web scraping, the 1 thing that always comes up, which we will probably talk about later, is the legal issues. Like, is web scraping legal?
[00:02:19] Unknown:
That always comes up. And another problem which always comes up is that doing web scraping on a small scale is really easy to do, sort of. You just have a website, you want to grab some data, you write some code in Python or in some other language using web scraping libraries, and it's really easy to do and it can be quick to do. But when you are trying to do things at scale and really get, like, huge amounts of data from the web consistently, that can be really hard to do. So there is a whole other set of challenges when you're trying to scale. In my experience, that's what a lot of people don't know when they start web scraping, especially when, you know, many times people have just, like, a hobby project that involves web scraping.
Sometimes it's, you know, to help them with, like, sports betting or just monitoring a website, any kind of hobby project. And then they try to scale that, not just as a hobby, but for business purposes. And then they realize that, oh, it's not that easy actually to scale web scraping. They will experience a lot of difficulties doing that. So, yeah, I guess those are the 2 things. The 1 is the legal stuff with web scraping and the other is how hard it really is to scale up. Do you remember how you first got introduced to Python? Yeah. Of course. Before Python, actually, I was programming in Java.
I was developing Android applications in Java. Also, I had a lot of hobby projects, you know, like minor things that helped me with, like, automations. For example, I had 1 of these projects that involved web scraping. And at the time, it was like 2015. I didn't know much about web scraping. So I did some research and I found this library called Scrapy, which isn't written in Java. It's a Python library, or a Python framework. So, basically, Scrapy was the entry point for me, because that's the reason I started to learn Python, to learn Scrapy. And then I found that Python is much better suited to do, like, analysis, data analysis kind of work, which I was really interested in.
So I just switched to Python from Java.
[00:04:41] Unknown:
And in terms of web scraping itself, can you give a bit more of a description about what it is and some of the reasons that you might want to use web scraping as opposed to using an API or pulling data from some sort of data dump or database?
[00:04:57] Unknown:
So there are many use cases for web scraping. The reason I really love web scraping is because it can be really useful for individuals, for programmers, or really anyone who can code and wants to, like, automate a job, automate anything on the web, or get some kind of data from the web if the website doesn't have an API. Because, really, if the website doesn't have an API, you know, normally you cannot really do anything except, you know, copy and paste the data, but that's not feasible in many cases, so you have web scraping. But also on the business side, on large scale projects, it can be useful as well. For example, web scraping is huge in the e-commerce world, where e-commerce companies monitor prices of their competitors, and they do it daily.
They extract the prices. They compare the prices. They run price analysis on the data. And because of web scraping, they are able to better price their products. Or another use case, which is a similar use case, is in the real estate market. For example, I have a friend right now who is looking for a house. He's looking to buy a house, but currently the, you know, the prices on the market, at least where he lives, are really high, and so he's having a hard time finding a good house to buy. So he set up a scraper which scrapes, you know, some of the major real estate sites daily.
It sends an email to him if there is a house which he should look at. So that's another interesting use case. On the business side of things, it's also useful for lead generation, like when there's a website with a bunch of company contact information, you can do a 1 time scrape to get contact information, and it can be useful to prospect people. So these are the general use cases. There are many others, like scraping news, for example, like articles, news articles. Generally speaking, I would say that 1 way to use web scraping is to just grab the data and then you get value immediately. Like, you can maybe export it as a spreadsheet or a CSV file, and you can look at the data, and that's useful at that stage.
But on another level, you can run some kind of analysis, or some companies do different NLP stuff on the text data, and then they are building on top of web scraped data. And so that's another way you can get value from web scraping.
[00:07:42] Unknown:
And in your experience, what was it that first got you involved in web scraping and that keeps you interested in the space to the point where you're acting as a technical evangelist for a company that is built entirely around that subject area?
[00:07:57] Unknown:
So it all started for me with a hobby project. In the past, I was really into sports betting, but I really didn't do it for the, like, for the money, because I just wasn't that good. But I enjoyed the, you know, I enjoyed the research. I enjoyed looking into stats, and I enjoyed coming up with different strategies to bet on football. I wanted to create a software that does the research for me. Software that would have, first of all, data, football data, historical stats, and then it would analyze the data and give me suggestions, you know, what game or what match I should bet on. Here my very first problem was that I didn't have any data. I didn't have sports data, and at the time, there weren't many open source or open datasets.
I was looking for a dataset for, I think, for multiple months, but I couldn't really find anything except really high priced datasets that are for companies, not individuals. But I realized that there are websites where you can actually see the data, like it's there in the HTML, you can see it in tables, so you can physically see it, but there is no API to the website. So I figured that there must be a way to get this data from the HTML into a SQL database or something. And basically, this is how I came across web scraping first, and I developed my first scraper in Java using the JSoup library, which is really popular for people who do web scraping in Java. Later on, for more difficult websites, I used, I think, HtmlUnit, which is used for testing. I think it's similar to Selenium in Java, but I used it to execute JavaScript, because some websites need JavaScript to work properly. And if you don't render JavaScript, they won't show you the data. But anyway, I then switched from Java to Python to learn Scrapy and to do web scraping with Scrapy.
I built this sports betting application with the help of Scrapy. Or at least Scrapy helped me to get the data from websites. And, yeah, that's when I really started to like web scraping, because I realized that, you know, it was just a silly hobby project, although I really liked it, but without web scraping, I wouldn't have been able to get the data. That was the big thing for me, that without web scraping, I couldn't have started this project at all, because I wouldn't have been able to get the data. And you mentioned Scrapy being the tool that you have settled on as the most useful for your use cases. And I know that it's also 1 of the more popular frameworks for web scraping in general, despite the fact that, as you said, there are options in other languages as well as other Python tools for it. So I'm wondering
[00:11:02] Unknown:
what it is about the Scrapy project itself and its community that has made it stand out and retain such widespread popularity.
[00:11:09] Unknown:
I think Scrapy has a really good community because, like, there are many other libraries in Python or in other languages, but most of the web scraping libraries are really only useful for parsing HTML. Because when you scrape the web, what you're doing essentially is downloading the HTML file, finding the data, like finding the data fields or data points in the HTML, which you can do using CSS selectors or XPath or maybe regular expressions. Once you've found the data, you extract it using the library, which can be Beautiful Soup or lxml or Scrapy or JSoup, or there are many others. But the real value in Scrapy comes if you need to do something more with the data, you need to process it. And in real world projects, usually, you need to post process the data. Meaning that you need to clean it, you need to standardize it, normalize it, you need to store it in some kind of database.
So it's really not just grabbing the data from the HTML. And with Scrapy, there are a lot of features that make it easy for you to post process the data and really do end to end web scraping inside Scrapy. Because if you didn't use Scrapy, you would need to sort of reinvent the wheel. You would need to come up with a way to facilitate your whole web data extraction project. And in Scrapy, you have built in modules. For example, if you want to get the data in JSON, in Scrapy there is a module called the feed export, which exports your scraped data in a format you like, like JSON or maybe CSV.
And it's just 1 line of code. Also, you have the pipelines that make it easy to store the data in a database. So, yeah, I think the real value of Scrapy is when you need to post process the data. And the situation is that you almost always need to post process the data, at least you need to clean it. Because when you grab data from the HTML, it's probably messy, it's probably not standardized, especially if you are extracting data from multiple websites. You need to standardize it. And so Scrapy makes it easy to do these tasks.
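To make that concrete, here is a minimal sketch of a Scrapy spider together with the feed export and pipeline features described above. The target URL, CSS selectors, field names, and file layout are all hypothetical, chosen only for illustration.

```python
# spider.py - a minimal sketch; the site URL, selectors, and field names are made up.
import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]  # hypothetical target site

    # Feed export: one line of configuration to dump scraped items to a file.
    custom_settings = {
        "FEEDS": {"products.jl": {"format": "jsonlines"}},
    }

    def parse(self, response):
        for row in response.css("div.product"):
            yield {
                "name": row.css("h2::text").get(),
                "price": row.css("span.price::text").get(),
                "url": response.urljoin(row.css("a::attr(href)").get()),
            }


# pipelines.py - enabled through the ITEM_PIPELINES setting, pipelines post-process
# every item (cleaning, normalizing, storing) before it reaches the feed or database.
class PriceCleanerPipeline:
    def process_item(self, item, spider):
        raw = (item.get("price") or "").replace("$", "").replace(",", "").strip()
        item["price"] = float(raw) if raw else None
        return item
```

Running such a spider with `scrapy runspider spider.py` would write the scraped items to products.jl without any extra plumbing, which is the point being made about getting end to end extraction inside the framework.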
[00:13:39] Unknown:
And with Python in particular, there's the added benefit that you have easy access to all of the different data engineering tool chains and machine learning tool chains that people have built up around the numerical aspects of Python over the past couple of decades.
[00:13:53] Unknown:
Oh, yeah. Absolutely. Especially, like I mentioned, many web scraping use cases are built around running some kind of analysis on the web scraped data. And as you said, once you get the data from Scrapy and it's in Python, you can just pass the data on to some other Python library, like pandas, for example, to do other data engineering stuff with it.
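As a small illustration of that handoff, a JSON lines feed produced by a Scrapy feed export can be loaded straight into pandas; the file name and column names below are the hypothetical ones from the spider sketch earlier.

```python
import pandas as pd

# Load the JSON lines feed written by the spider's feed export.
df = pd.read_json("products.jl", lines=True)

# Typical downstream analysis once the data is in a DataFrame.
print(df["price"].describe())
print(df.sort_values("price").head(10)[["name", "price", "url"]])
```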
[00:14:17] Unknown:
And in terms of web scraping itself, you already mentioned earlier the idea of copyright and content ownership and the legality of web scraping. So I'm wondering, what are the main things that anybody who's interested in doing web scraping should be aware of in terms of the legality and issues around copyright ownership or attribution of the data?
[00:14:40] Unknown:
So it's interesting, because at ScrapingHub, we often do webinars for people, and this question always comes up in some shape or form. Is web scraping legal? How can you do it so it's legal? Like, there are some general guidelines about web scraping, which I'm gonna share now, but at ScrapingHub, we have a legal team that makes sure that everything we are doing as far as web scraping is legal, and obviously, I'm not a lawyer. So what I'm gonna say is not any kind of legal advice, but there are some general guidelines. So for example, it's important that web scraping itself is legal, but it really matters how you get the data, how polite you are while getting the data. And also, you mentioned copyright. It's important what you do with the data after you've scraped it. So with the first part, how polite you are when scraping a website. It's very easy with web scraping to ignore the rules that the website has and that you should follow. So for example, let's say you want to scrape a website. What should you do to make sure that you are being polite? Because it's really important from a legal standpoint, but also from an ethical standpoint. So first of all, there is a file that is almost always available on websites, which is the robots.txt file.
And when you're scraping the website, you should always follow the rules that are defined in the robots.txt file. So for example, if you wanna be polite and the robots.txt file says that between 2 requests you should have a delay of 3 seconds, for example, then you should put that delay between requests in your scraper. So the very first thing you could do to respect the website is to follow the rules defined in the robots.txt file. But, generally, you just don't want to hit the website too hard. Because if you hit the website too hard, there is a chance that it will go down, it will shut down, which is not good. You don't want to do that. You want to be polite and you want to respect the website.
So it can operate properly for everybody. And the other part is what you do with the data once you've scraped it. And here it's really up to the specific use case, you know, because, for example, if you are scraping personal information, there is a high chance it has some kind of legal issues, or it might have, because of GDPR. Or if you are scraping content like images or articles or some kind of content and you are republishing it, that can be a problem because of, you know, different legal issues. But it's not an issue because of the web scraping itself. Web scraping itself is legal. But once you've got the data, obviously, it does matter what you do with the data and how you use the data. And so that can be a problem.
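In Scrapy terms, the politeness practices described here map onto a handful of project settings. A minimal sketch, with the delay, concurrency, and contact URL chosen purely for illustration:

```python
# settings.py - a minimal "be polite" configuration sketch for a Scrapy project.

BOT_NAME = "polite_scraper"

# Honor the rules published in each site's robots.txt before crawling it.
ROBOTSTXT_OBEY = True

# Identify yourself honestly so site operators can reach you (hypothetical URL).
USER_AGENT = "polite_scraper (+https://example.com/contact)"

# Wait between requests to the same site instead of hammering it.
DOWNLOAD_DELAY = 3  # seconds; pick a value appropriate for the target site

# Limit concurrency per domain and let AutoThrottle back off if the site slows down.
CONCURRENT_REQUESTS_PER_DOMAIN = 2
AUTOTHROTTLE_ENABLED = True
```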
[00:17:51] Unknown:
And to that point, as you said, after you access the data, you need to do some processing on it, and you're likely to perform some analysis. Or if you're trying to do this at scale, you might be loading it into a database or data warehouse and performing additional analysis or building some sort of product on top of it. So what are some of the considerations for storing and processing that data that we should be thinking of both from the technical and the legal aspects?
[00:18:17] Unknown:
So from the legal side of things, I cannot talk much about it because I'm not a lawyer. But from the technical standpoint, there is a difference when you're doing things small scale or large scale. So when you're scraping the web at a large scale, 2 of the problems that will take probably most of your time are data quality, like keeping data quality high, and maintaining the spiders. So there are many other things that we might mention later on, like unstructured data. Yeah, actually, let's talk about unstructured data, because with web scraping, you don't really have, like, a universal schema that every web scraper follows. Like, there is not 1 schema that you know you need to follow if you are scraping a specific type of website.
So it can be a challenge when you are figuring out how to store data, like what kind of data fields you need, how you name those data fields. So first you need to figure out the schema, like what data points you really need and how it makes sense to store the data long term. But once you have that, it's really just a challenge to keep the data quality high. Because if the data quality is too low, like we mentioned, products or use cases where people build on top of web scraped data, even though it's really useful to grab the data, run analysis on it, and based on the analysis, you know, companies usually make better decisions and that sort of thing. But if the data quality is low, it's not a good situation when you have low quality data and you are trying to run analysis on the low quality data. And with web scraping, it's really easy to get bad quality data, because let's say you have a spider, you have selectors that select the data on the website, but the website changes its layout.
And so from 1 day to another, you will not be able to get the data, because the website has a different layout and your selectors don't work anymore, and data quality drops. And so you need to maintain the spiders. And if you're building on top of web scraped data, for example, in the e-commerce world, we have different use cases like, for example, price intelligence, which basically means that you extract prices and you run analysis on the prices and, you know, you monitor competitors or you monitor your own product prices or you monitor the whole market. And if the price data that comes out of your web scraping project is low quality, you will not be able to drive decisions based on the data. So that's 1 consideration to work on: how to keep the data quality high. So that's 1 reason why data quality could drop: because the website layout changes and you need some time to maintain the spider. But also, when you're scraping websites at scale, you need to use proxies. You need to use a lot of proxies in many large scale web scraping projects.
That's the majority of the work: managing proxies, getting a healthy proxy pool, and making sure that you can make successful requests. Meaning that you can get the HTML file in the first place and then extract the data with your spiders. If you cannot make successful requests,
[00:21:55] Unknown:
you don't get the HTML files. And so this can cause the data quality to drop. And in terms of the data quality aspects that you mentioned, there is the issue of sites changing their layouts or actively trying to prevent web scraping, which can lead to breakages once you've got something that's functioning and then needing to maintain it. And then also, as you mentioned, there are the issues around trying to scale a scraper to be able to maintain the data on a periodic basis or collect data from multiple sites. So I'm wondering, what are some of the ways that people who are writing a scraper can guard against some of those changes and updates, or ways to allow the scraper to adapt to those changes?
[00:22:40] Unknown:
I would say that you should have a way to learn quickly if there's a layout change. With Scrapy, for example, you have a lot of open source modules that you can use alongside Scrapy that help you keep the data quality high, that alert you that, hey, there is something wrong, maybe you should have a look at this website because maybe it changed layout. But there is not really, like, a shortcut. Like, if the website layout changes, you need to change the selectors, and, you know, for 1 website it's really easy to do, but for 100 or 1000 websites you will have a lot of maintenance work. But the good news is that, just like in many other industries, in web scraping as well there are machine learning advancements, like there are different AI based tools that extract the data for you. So for example, at ScrapingHub, we have a product called AutoExtract, which is an ML based tool which extracts the data fields from the page, and you only need to submit the type of the page. So for example, we have this API called the AutoExtract news API, and what it basically does is that you submit the page URL and you say that, hey, there's a news item on this page or there's an article on this page. And so the software will know, you know, what fields, what data points to look for, and it automatically extracts all the data fields that you care about from the HTML.
And this is all ML based, so you don't need to manually find or locate the data points in the HTML. And this is huge, because this way you don't need to maintain the spider if the website layout changes. Because if it changes, it doesn't matter, the algorithm will locate the data fields anyway. So that's really the ultimate solution long term. You know, we are doing this for many different data types. So for example, we have the news API, which is for news and articles. We have a product API, which is for e-commerce websites, meaning product pages. So you get all the data fields related to products, like product price, stock information, product description, product name, and so on. So in the future, you will not need to deal with maintaining spiders.
You will just need to specify that, hey, I want to scrape data from this page, this is a product page, or a real estate listing, or an article page. And the API will give you all the data points, and you won't need to struggle with, you know, the selectors or XPaths or whatever.
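A low-tech version of the "alert you when something is wrong" idea mentioned above can be built without any extra tooling: an item pipeline that drops and counts items with missing fields, so a sudden spike in drops points at a probable layout change. The required field names here are hypothetical; for a fuller solution, an open source monitoring add-on such as Spidermon is the kind of module alluded to in the answer.

```python
from scrapy.exceptions import DropItem


class RequiredFieldsPipeline:
    """Drop items missing required fields and keep a count, so a sudden spike
    in dropped items hints that a site layout change broke the selectors."""

    REQUIRED_FIELDS = ("name", "price", "url")  # hypothetical schema for a product item

    def process_item(self, item, spider):
        missing = [f for f in self.REQUIRED_FIELDS if not item.get(f)]
        if missing:
            # The stat shows up in the crawl summary; alerting can be hung off it.
            spider.crawler.stats.inc_value("items/dropped_missing_fields")
            raise DropItem(f"Missing fields {missing} in {item.get('url')!r}")
        return item
```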
[00:25:32] Unknown:
1 of the other ways of scaling the collection of web data is by using proxies to help with things like access restrictions, where there might be some sort of geo IP blocking set up, or issues with the lack of a matching user agent that a site might require. So I'm wondering if you can talk through some of the considerations for using proxies, when you might want to use them, and some of the features you should look at if you're trying to either select or implement a proxy solution yourself?
[00:26:04] Unknown:
Yeah. So there are some reasons that you would want to or need to use proxies. 1 of them, just like you mentioned, is if there is some kind of geo blocking that the website implemented, or if you want to get geo specific data from the website, meaning that the website shows you a different page or different data depending on where you are. So, for example, in e-commerce, you might see a different price or different delivery or shipping information based on your location. And so in order to get this kind of data, you need to use geo targeted proxies. The second reason why you would use proxies is when you're trying to scrape a lot of data regularly, consistently, from a website. You will just not be able to get it from, you know, from your computer or data center, because the website will realize that you are a bot and it will block you, even though what you're doing is, you know, ethical and you are making sure that you are polite. The website can still block you. With proxies, the interesting thing is that when people experience issues with their web scraping project, meaning that they start to experience blocks, most of the time they know that they need to use proxies. But many times it's not just the proxies themselves. Like, a proxy itself is a commodity. Like, you can buy proxies from many countries.
You can buy different kinds of proxies, data center proxies, residential proxies. There are all kinds of proxies out there that you can buy, but with web scraping, if you want to be efficient, it's really about how you manage those proxies that you have. So let's say you are trying to scale up your project, you experience issues with data quality, you don't get successful requests, and you want to start using proxies. 1 solution would be to just buy a bunch of proxies and rotate them randomly. And that might work for a while, but that's definitely not an efficient way to do things. What I would say is the best way to do it is to get proxies, but put a lot of consideration into how you manage those proxies.
So for example, at ScrapingHub, we have a platform which helps you with web scraping on all levels. You can run your spiders in the cloud, we provide these machine learning based APIs, we have solutions, etcetera, but we also provide proxies. But the way we think about this is that what you really want is not really proxies. That's just a means to an end. What you really want is successful requests. So what we give you is not just proxies, but a full proxy management solution. Yes, we have proxies, but the main thing is how we manage the proxies for you. Like, there are many tactics that you can use to manage the proxies in a smart way. So for example, a website might, you know, block a proxy.
You know, 1 time it might block a proxy, but it doesn't mean that it will be blocked forever. So you can retry the proxy a day later, a week later, a month later, and it might work again. With smart proxy management, it's really about making the most of your current proxies. Like, usually the customers that come to us just don't want to deal with this whole proxy management thing. They just want a solution that gives them successful requests. Or we have customers who come to us because they tried to do it themselves. They tried to manage the proxies themselves, but it just really became a headache. The thing with proxies is that, you know, as time goes on, it's not gonna get easier. Like, it's just gonna get harder. You're just gonna need to spend more time on the way you manage the proxies.
And so for many companies, it just makes sense to outsource this thing, because you can get really bogged down managing proxies, and for many companies and for many people, it's just not what they're trying to do. It's not the core of their business to do web scraping. It's just a way to get data. And so many people don't like to spend way too much time on proxies.
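For contrast with a managed solution, here is roughly what the naive "buy a bunch of proxies and rotate them randomly" baseline looks like as a Scrapy downloader middleware. The proxy URLs are placeholders, and everything smarter, such as health checks, retiring banned proxies, and retrying them later, is exactly the management work described above.

```python
import random


class RandomProxyMiddleware:
    """Naive proxy rotation: pick a random proxy for every outgoing request.
    Real proxy management also tracks bans, retries blocked proxies later,
    and removes dead ones, which is most of the actual work."""

    # Placeholder proxy URLs for illustration only.
    PROXIES = [
        "http://user:pass@proxy1.example.com:8000",
        "http://user:pass@proxy2.example.com:8000",
        "http://user:pass@proxy3.example.com:8000",
    ]

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honors request.meta["proxy"].
        request.meta["proxy"] = random.choice(self.PROXIES)
```

It would be enabled through the DOWNLOADER_MIDDLEWARES setting, after which each request is routed through whatever request.meta["proxy"] points at.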
[00:30:19] Unknown:
For people who are building web scrapers, particularly with Scrapy, but also just in the general case, what are some of the common pitfalls or sharp edges that they're likely to run into for building the scraper, collecting the data, and also for trying to scale that data collection? I have a colleague who said last year at our summit that the web is a jungle. So when you scrape the web, you really shouldn't expect
[00:30:47] Unknown:
to find, you know, standardized websites. You shouldn't expect that websites follow the same practices. So 1 thing is that front end developers or website developers have different ways to accomplish things, and when you're writing your spider, it can be really frustrating or it can be hard to create a custom web scraper for each website. Because this website uses this way to display data, the other website uses some other way to display the data. So it can be really hard. The other thing is, in my experience, I don't have data about this, but in my experience, people usually underestimate the effort they will need to get data from websites.
And, actually, I made this mistake. Before I joined ScrapingHub, I was building my own company, or I was a co founder in a web scraping based startup. And I made the mistake that it was easy to get data from, let's say, 10 websites. At the time, we were talking about maybe thousands of records per day, or maybe tens of thousands of records extracted per day, and that was fairly easy to do. I didn't need to use proxies, or I didn't need to use many proxies. There wasn't much maintenance work needed for, like, you know, 10 websites. But when we started to scale, it became really hard, because at the same time we needed to maintain the spiders, because the layouts were changing.
There was always some kind of work to do with the spiders. And on the proxy side, we needed to find some kind of solution to not get bogged down with the proxies. So those 2 things are really, I think, easy to underestimate if people start a new project. Like, it really comes down to a question of data quality when you're extracting data from the web, and there needs to be a process in place to make sure that the data quality is high, especially when you are building something on top of the web scraped data. Like, for example, I mentioned price intelligence.
In price intelligence software, the way it works is that there is a module that extracts data from websites, and, you know, it can be thousands of websites or tens of thousands of websites. It stores the data in a database, and then there is an application that reads the data from the database. But if the data that comes out of the scraping module or the extraction code is low quality, or it's not standardized, or there is some issue with it, you will not be able to trust the results, the analysis, that the price intelligence software gives you. And so, in the long term, I think data quality is the issue to focus on and to make sure that you have a process in place.
Also, like I mentioned, the web is really not standardized. Like, when it comes to web data, there is not 1 straightforward schema that everyone is using. That's why I think solutions that sort of try to standardize this make sense. Like, when we are talking about a product page, we have a clear and agreed upon schema that everyone will use to store data or to display data, and the same for other data types. Another consideration that we actually didn't talk about is JavaScript. Like, when you're scraping the web, you sometimes need to render JavaScript, and it can be a challenge as far as resources, because if you need to render JavaScript, that's gonna take more of your resources, and it's gonna be probably more expensive, and you probably need to use some kind of tool that renders JavaScript.
Because, for example, the scraping libraries we mentioned don't render JavaScript. Also, Scrapy doesn't render JavaScript by default. So for this, you need to use, you know, a tool like a headless browser, like Selenium, or in JavaScript many people use Puppeteer, or, what I really like to use for JavaScript rendering is Splash, which is an open source JavaScript rendering solution, and it has a really great integration with Scrapy. So you can just, you know, plug Splash into your Scrapy project and render JavaScript with Scrapy. And, yeah, another pitfall, maybe related to JavaScript rendering, is that some websites have a sort of hidden API, like JSONs that are getting passed around in the background that have the same kind of data that you want to scrape. And many times it makes sense to grab that JSON from the background. Like, maybe the website uses AJAX to request these data files. And instead of relying on the HTML and relying on your spider to grab the data from the HTML, you can just grab the JSON and get structured data that way. This way, you don't even need to render JavaScript.
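A rough sketch of the Splash integration described here, assuming a Splash instance is running locally and the scrapy-splash plugin is installed and enabled in the project settings per its documentation; the URL and selector are placeholders.

```python
import scrapy
from scrapy_splash import SplashRequest  # requires the scrapy-splash package


class JsHeavySpider(scrapy.Spider):
    """Render JavaScript through Splash before parsing; URL and selector are hypothetical."""
    name = "js_heavy"

    # Assumes settings.py also sets SPLASH_URL (e.g. "http://localhost:8050")
    # and enables the scrapy-splash middlewares as its docs describe.

    def start_requests(self):
        yield SplashRequest(
            "https://example.com/js-rendered-page",
            callback=self.parse,
            args={"wait": 1.0},  # give the page time to finish rendering
        )

    def parse(self, response):
        # The response now contains the JavaScript-rendered HTML.
        yield {"title": response.css("h1::text").get()}
```

The hidden JSON alternative mentioned at the end of the answer is often simpler still: find the XHR endpoint in the browser's network tab and request it directly, skipping rendering entirely.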
[00:35:58] Unknown:
And in your experience of working on web scrapers yourself and working with people who are building them at ScrapingHub, what are some of the most interesting or innovative or unexpected ways that you've seen Scrapy and ScrapingHub used, and some of the particularly interesting or challenging lessons that you've learned in the process?
[00:36:16] Unknown:
There are 2 categories of use cases. At least that's how I look at it. Like, first, we have the individual use cases, meaning use cases that are implemented by individuals for hobby projects or, you know, simple automations. Like, for example, I recently heard about this person who created a scraper to find grocery delivery slots in an online store. Because, you know, during COVID, everybody started to order groceries online and everything. The situation here was that you couldn't find open delivery slots, so you couldn't order online. But this person built a web scraper that monitored this website and, whenever there was an open delivery slot, it sent an email. So that's, I think, pretty useful.
Or I mentioned my friend who is looking to buy a house and is monitoring the real estate prices. So that's 1 category of use cases. And the other category of use cases, which is also interesting in my opinion, is when you're building on top of scraped data. Like, for example, I talked to this company not so long ago, they are, like, an NLP company, and they provide basically a search engine for news. So you can search news that is relevant for you. They monitor specific topics. So for example, you might want to learn about, I don't know, COVID vaccine development, and you monitor all the news related to that. And in the background, the software has a feature that lets you submit, you know, news, saying, hey, this article is about the topic that I want to monitor, here's another article that's also about the topic I want to monitor. And you basically train the system to monitor news that is relevant for you.
So whenever you use the application, it will only show you news that is really relevant for you. So I found it really interesting. Another use case, which is, I guess, not spectacular, but it's just really useful and it just really makes sense to do, is to monitor prices on the web. Right? Like, really any prices. You know, we have e-commerce, we have real estate as far as prices. There are huge web scraping projects in the automotive industry. When it comes to brand monitoring, there's actually something called minimum advertised price monitoring that brands use.
Which means that, you know, a brand has a specific pricing policy, and distributors or the resellers need to align their prices with the brand's policy. And with web scraping, these brands can monitor the resellers' websites and get alerted if the price is, you know, higher or lower than it's supposed to be. So, yeah, but for me, the reason I love web scraping is that it's easy to do web scraping on a small scale. Like, I remember when I first learned about web scraping, I think it took me about, I don't know, maybe 2 days or 3 days, probably 3 days, to write my first scraper and to actually see data coming out of my web scraper. And there are many use cases for individuals to just automate small tasks. But with, you know, for big companies, for large enterprises,
[00:39:42] Unknown:
it also makes sense to do web scraping at a large scale. So that's why I love it. And are there any other aspects of web scraping and your experience at ScrapingHub that we didn't discuss that you'd like to cover before we close out the show? I think we covered the main things. Yeah. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose the band Gov't Mule. I may have picked them in the past, but it's just a really great rock band that I've enjoyed listening to for a number of years now. So if you're looking for some new music to listen to, or some old music that you may be familiar with, it's definitely worth taking a look at them. And with that, I'll pass it to you, Attila. Do you have any picks this week? Yeah. Actually, yeah. So, I mean, it's not a new thing, but there's this open source thing, it's not like a library, but it's on GitHub.
[00:40:37] Unknown:
awesome-web-scraping and awesome-scrapy, which have a lot of tools about web scraping and about Scrapy
[00:40:44] Unknown:
that, you know, just make it easy to get data from the web. Alright. I'll definitely have to take a look at those. So thank you very much for taking the time today to join me and discuss your experience building web scrapers and working with ScrapingHub to help other people in their efforts on that front. So I appreciate all the time and effort you've put in, and I hope you enjoy the rest of your day. Yeah. Thank you, Tobias. Thanks for having me. It was fun. Have a good day. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Episode Overview
Guest Introduction: Attila Tóth
Common Challenges in Web Scraping
Attila's Journey into Python and Web Scraping
Web Scraping vs. APIs and Data Dumps
Use Cases for Web Scraping
Attila's First Web Scraping Project
Scrapy: Features and Community
Legal and Ethical Considerations in Web Scraping
Data Quality and Maintenance Challenges
Using Proxies in Web Scraping
Common Pitfalls in Web Scraping
Innovative Use Cases and Lessons Learned
Closing Remarks and Picks