Summary
The internet is a rich source of information, but the majority of it isn’t accessible programmatically through APIs or databases. To address that shortcoming, there are a variety of web scraping frameworks that aid in extracting structured data from web pages. In this episode Attila Tóth shares the challenges of web data extraction, the ways that you can use it, and how Scrapy and ScrapingHub can help you with your projects.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- This portion of Python Podcast is brought to you by Datadog. Do you have an app in production that is slower than you like? Is its performance all over the place (sometimes fast, sometimes slow)? Do you know why? With Datadog, you will. You can troubleshoot your app’s performance with Datadog’s end-to-end tracing and in one click correlate those Python traces with related logs and metrics. Use their detailed flame graphs to identify bottlenecks and latency in that app of yours. Start tracking the performance of your apps with a free trial at datadog.com/pythonpodcast. If you sign up for a trial and install the agent, Datadog will send you a free t-shirt.
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host as usual is Tobias Macey and today I’m interviewing Attila Tóth about doing data extraction with web scraping.
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by explaining what web scraping is and when you might want to use it?
- How did you first get started with web scraping?
- There are a number of options for web scraping tools in Python, as well as other languages. What are the characteristics of the Scrapy project and community that have made it stand out and retain such widespread popularity?
- One of the perpetual questions with web scraping is that of copyright and content ownership. What should we all be aware of when scraping a given website?
- What are some of the most challenging aspects of crawling and scraping the web?
- What are some of the features of Scrapy that aid in those challenges?
- Once you have retrieved the content from a site, what are some of the considerations for storing and processing the data that we should be thinking about?
- How can we guard against a scraper breaking due to changes in the layout of a site, or simple updates that weren’t accounted for in the initial implementation?
- What are some of the most complicated aspects of scaling web scrapers?
- For someone who is interested in using Scrapy, what are some of the common pitfalls that they should be aware of?
- What are some of the most interesting, innovative, or unexpected projects that are built with Scrapy and ScrapingHub?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working with web scrapers and ScrapingHub?
- What resources would you recommend to anyone who is looking to learn more about web scraping?
Keep In Touch
Picks
- Tobias
- Attila
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Web Scraping
- ScrapingHub
- Java
- Android
- Scrapy
- JSoup
- HTMLUnit
- Selenium
- Pandas
- robots.txt
- Puppeteer
- Splash
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try out a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode today. That's L-I-N-O-D-E, and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis.
For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your own home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today. Your host as usual is Tobias Macey. Today, I'm interviewing Attila Tóth about doing data extraction with web scraping. So, Attila, can you start by introducing yourself? Hey, Tobias. Yeah. Thanks for having me. First of all, I work as a technology evangelist at ScrapingHub,
[00:01:30] Unknown:
which is a company that does web data extraction. My job is really to just educate people about web scraping, both technical people and business people: how to do web scraping, what web scraping is, what the challenges associated with web scraping are, and how it can be useful for people and useful for businesses.
[00:01:49] Unknown:
That's what I mainly do. And in terms of speaking to those different audiences, what have you found to be some of the most common points of confusion and some of the changes in the way that you need to discuss the subject matter to help them out and make sure that it's the right solution for them? In my experience with web scraping, the 1 thing that always comes up, which we will probably talk about later, is the legal issues. Like, is web scraping legal?
[00:02:19] Unknown:
That always comes up. And another problem which always comes up is that doing web scraping on a small scale is really easy to do, sort of. You just have a website, you want to grab some data, you write some code in Python or in some other language using web scraping libraries, and it's really easy to do and it can be quick to do. But when you are trying to do things at scale and really get, like, huge amounts of data from the web consistently, that can be really hard to do. So there is a whole other set of challenges when you're trying to scale. In my experience, that's what a lot of people don't know when they start web scraping, especially when, you know, many times people have just, like, a hobby project that involves web scraping.
Sometimes it's, you know, to help them with, like, sports betting or just monitoring a website, any kind of hobby project. And then they try to scale that, not just as a hobby, but for business purposes. And then they realize that, oh, it's not that easy actually to scale web scraping. They will experience a lot of difficulties doing that. So, yeah, I guess those are the 2 things. The 1 is the legal stuff with web scraping and the other is how hard it really is to scale up. Do you remember how you first got introduced to Python? Yeah. Of course. Before Python, actually, I was programming in Java.
I was developing Android applications in Java. Also, I had a lot of hobby projects, you know, like minor things that helped me with, like, automations. For example, I had 1 of these projects that involved web scraping. And at the time, it was like 2015. I didn't know much about web scraping. So I did some research and I found this library called Scrapy, which isn't written in Java. It's a Python library, or a Python framework. So, basically, Scrapy was the entry point for me, because that's the reason I started to learn Python, to learn Scrapy. And then I found that Python is much better suited to do, like, analysis, data analysis kind of work, which I was really interested in.
So I just switched to Python from Java.
[00:04:41] Unknown:
And in terms of web scraping itself, can you give a bit more of a description about what it is and some of the reasons that you might want to use web scraping as opposed to using an API or pulling data from some sort of data dump or database?
[00:04:57] Unknown:
So there are many use cases for web scraping. The reason I really love web scraping is because it can be really useful for individuals, for programmers, or really anyone who can code and wants to, like, automate a job, automate anything on the web, or get some kind of data from the web if the website doesn't have an API. Because, really, if the website doesn't have an API, you know, normally you cannot really do anything except, you know, copy and paste the data, but that's not feasible in many cases, so you have web scraping. But also on the business side, on large scale projects, it can be useful as well. For example, web scraping is huge in the e-commerce world, where e-commerce companies monitor prices of their competitors, and they do it daily.
They extract the prices. They compare the prices. They run price analysis on the data. And because of web scraping, they are able to better price their products. Or another use case, which is a similar use case, is in the real estate market. For example, I have a friend right now who is looking for a house. He's looking to buy a house, but currently the, you know, the prices on the market, at least where he lives, are really high, and so he's having a hard time finding a good house to buy. So he set up a scraper which scrapes, you know, some of the major real estate sites daily.
It sends an email to him if there is a house which he should look at. So that's another interesting use case. On the business side of things, it's also useful for lead generation, like when there's a website with a bunch of company contact information, you can do a 1 time scrape to get contact information, and it can be useful to prospect people. So these are the general use cases. There are many others, like scraping news, for example, like articles, news articles. Generally speaking, I would say that 1 way to use web scraping is to just grab the data and then you get value immediately. Like, you can maybe export it as a spreadsheet or a CSV file, and you can look at the data, and that's useful at that stage.
But on another level, you can run some kind of analysis, or some companies do different NLP stuff on the text data, and then they are building on top of web scraped data. And so that's another way you can get value from web scraping.
[00:07:42] Unknown:
And in your experience, what was it that first got you involved in web scraping and that keeps you interested in the space to the point where you're acting as a technical evangelist for a company that is built entirely around that subject area?
[00:07:57] Unknown:
So it all started for me with a hobby project. In the past, I was really into sports betting, but I really didn't do it for the, like, for the money, because I just wasn't that good. But I enjoyed the, you know, I enjoyed the research. I enjoyed looking into stats, and I enjoyed coming up with different strategies to bet on football. I wanted to create a software that does the research for me. Software that would have, first of all, data, football data, historical stats, and then it would analyze the data and give me suggestions, you know, what game or what match I should bet on. Here my very first problem was that I didn't have any data. I didn't have sports data, and at the time, there weren't many open source or open datasets.
I was looking for a dataset for, I think, for multiple months, but I couldn't really find anything except really high priced datasets that are for companies, not individuals. But I realized that there are websites where you can actually see the data, like it's there in the HTML, you can see it in tables, so you can physically see it, but there is no API to the website. So I figured that there must be a way to get this data from the HTML into a SQL database or something. And basically, this is how I came across web scraping first, and I developed my first scraper in Java using the JSoup library, which is really popular for people who do web scraping in Java. Later on, for more difficult websites, I used, I think, HtmlUnit, which is used for testing. I think it's similar to Selenium in Java, but I used it to execute JavaScript, because some websites need JavaScript to work properly. And if you don't render JavaScript, they won't show you the data. But anyway, I then switched from Java to Python to learn Scrapy and to do web scraping with Scrapy.
I built this sports betting application with the help of Scrapy. Or at least Scrapy helped me to get the data from websites. And, yeah, that's when I really started to like web scraping, because I realized that, you know, it was just a silly hobby project, although I really liked it, but without web scraping, I wouldn't have been able to get the data. That was the big thing for me, that without web scraping, I couldn't have started this project at all, because I wouldn't have been able to get the data. And you mentioned Scrapy being the tool that you have settled on as the most useful for your use cases. And I know that it's also 1 of the more popular frameworks for web scraping in general, despite the fact that, as you said, there are options in other languages as well as other Python tools for it. So I'm wondering
[00:11:02] Unknown:
what it is about the Scrapy project itself and its community that has made it stand out and retain such widespread popularity.
[00:11:09] Unknown:
I think Scrapy has a really good community because, like, there are many other libraries in Python or in other languages, but most of the web scraping libraries are really only useful for parsing HTML. Because when you scrape the web, what you're doing essentially is downloading the HTML file, finding the data, like finding the data fields or data points in the HTML, which you can do using CSS selectors or XPath or maybe regular expressions. Once you've found the data, you extract it using the library, which can be Beautiful Soup or lxml or Scrapy or JSoup, or there are many others. But the real value in Scrapy comes if you need to do something more with the data, you need to process it. And in real world projects, usually, you need to post process the data. Meaning that you need to clean it, you need to standardize it, normalize it, you need to store it in some kind of database.
So it's really not just grabbing the data from the HTML. And with Scrapy, there are a lot of features that make it easy for you to post process the data and really do end to end web scraping inside Scrapy. Because if you didn't use Scrapy, you would need to sort of reinvent the wheel. You would need to come up with a way to facilitate your whole web data extraction project. And in Scrapy, you have built in modules. For example, if you want to get the data in JSON, in Scrapy there is a module called the feed export, which exports your scraped data in a format you like, like JSON or maybe CSV.
And it's just 1 line of code. Also, you have the pipelines that make it easy to store the data in a database. So, yeah, I think the real value of Scrapy is when you need to post process the data. And the situation is that you almost always need to post process the data, at least you need to clean it. Because when you grab data from the HTML, it's probably messy, it's probably not standardized, especially if you are extracting data from multiple websites. You need to standardize it. And so Scrapy makes it easy to do these tasks.
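To make that concrete, here is a minimal sketch of a Scrapy spider together with the feed export and pipeline features described above. The target URL, CSS selectors, field names, and file layout are all hypothetical, chosen only for illustration.

```python
# spider.py - a minimal sketch; the site URL, selectors, and field names are made up.
import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]  # hypothetical target site

    # Feed export: one line of configuration to dump scraped items to a file.
    custom_settings = {
        "FEEDS": {"products.jl": {"format": "jsonlines"}},
    }

    def parse(self, response):
        for row in response.css("div.product"):
            yield {
                "name": row.css("h2::text").get(),
                "price": row.css("span.price::text").get(),
                "url": response.urljoin(row.css("a::attr(href)").get()),
            }


# pipelines.py - enabled through the ITEM_PIPELINES setting, pipelines post-process
# every item (cleaning, normalizing, storing) before it reaches the feed or database.
class PriceCleanerPipeline:
    def process_item(self, item, spider):
        raw = (item.get("price") or "").replace("$", "").replace(",", "").strip()
        item["price"] = float(raw) if raw else None
        return item
```

Running such a spider with `scrapy runspider spider.py` would write the scraped items to products.jl without any extra plumbing, which is the point being made about getting end to end extraction inside the framework.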
[00:13:39] Unknown:
And with Python in particular, there's the added benefit that you have easy access to all of the different data engineering tool chains and machine learning tool chains that people have built up around the numerical aspects of Python over the past couple of decades.
[00:13:53] Unknown:
Oh, yeah. Absolutely. Especially, like I mentioned, many web scraping use cases are built around running some kind of analysis on the web scraped data. And as you said, once you get the data from Scrapy and it's in Python, you can just pass the data on to some other Python library, like pandas, for example, to do other data engineering stuff with it.
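As a small illustration of that handoff, a JSON lines feed produced by a Scrapy feed export can be loaded straight into pandas; the file name and column names below are the hypothetical ones from the spider sketch earlier.

```python
import pandas as pd

# Load the JSON lines feed written by the spider's feed export.
df = pd.read_json("products.jl", lines=True)

# Typical downstream analysis once the data is in a DataFrame.
print(df["price"].describe())
print(df.sort_values("price").head(10)[["name", "price", "url"]])
```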
[00:14:17] Unknown:
And in terms of web scraping itself, you already mentioned earlier the idea of copyright and content ownership and the legality of web scraping. So I'm wondering, what are the main things that anybody who's interested in doing web scraping should be aware of in terms of the legality and issues around copyright ownership or attribution of the data?
[00:14:40] Unknown:
So it's interesting, because at ScrapingHub, we often do webinars for people, and this question always comes up in some shape or form. Is web scraping legal? How can you do it so it's legal? Like, there are some general guidelines about web scraping, which I'm gonna share now, but at ScrapingHub, we have a legal team that makes sure that everything we are doing as far as web scraping is legal, and obviously, I'm not a lawyer. So what I'm gonna say is not any kind of legal advice, but there are some general guidelines. So for example, it's important that web scraping itself is legal, but it really matters how you get the data, how polite you are while getting the data. And also, you mentioned copyright. It's important what you do with the data after you've scraped it. So with the first part, how polite you are when scraping a website. It's very easy with web scraping to ignore the rules that the website has and that you should follow. So for example, let's say you want to scrape a website. What should you do to make sure that you are being polite? Because it's really important from a legal standpoint, but also from an ethical standpoint. So first of all, there is a file that is almost always available on websites, which is the robots.txt file.
And when you're scraping the website, you should always follow the rules that are defined in the robots.txt file. So for example, if you wanna be polite and the robots.txt file says that between 2 requests you should have a delay of 3 seconds, for example, then you should put that delay between requests in your scraper. So the very first thing you could do to respect the website is to follow the rules defined in the robots.txt file. But, generally, you just don't want to hit the website too hard. Because if you hit the website too hard, there is a chance that it will go down, it will shut down, which is not good. You don't want to do that. You want to be polite and you want to respect the website.
So it can operate properly for everybody. And the other part is what you do with the data once you've scraped it. And here it's really up to the specific use case, you know, because, for example, if you are scraping personal information, there is a high chance it has some kind of legal issues, or it might have, because of GDPR. Or if you are scraping content like images or articles or some kind of content and you are republishing it, that can be a problem because of, you know, different legal issues. But it's not an issue because of the web scraping itself. Web scraping itself is legal. But once you've got the data, obviously, it does matter what you do with the data and how you use the data. And so that can be a problem.
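In Scrapy terms, the politeness practices described here map onto a handful of project settings. A minimal sketch, with the delay, concurrency, and contact URL chosen purely for illustration:

```python
# settings.py - a minimal "be polite" configuration sketch for a Scrapy project.

BOT_NAME = "polite_scraper"

# Honor the rules published in each site's robots.txt before crawling it.
ROBOTSTXT_OBEY = True

# Identify yourself honestly so site operators can reach you (hypothetical URL).
USER_AGENT = "polite_scraper (+https://example.com/contact)"

# Wait between requests to the same site instead of hammering it.
DOWNLOAD_DELAY = 3  # seconds; pick a value appropriate for the target site

# Limit concurrency per domain and let AutoThrottle back off if the site slows down.
CONCURRENT_REQUESTS_PER_DOMAIN = 2
AUTOTHROTTLE_ENABLED = True
```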
[00:17:51] Unknown:
And to that point, as you said, after you access the data, you need to do some processing on it, and you're likely to perform some analysis. Or if you're trying to do this at scale, you might be loading it into a database or data warehouse and performing additional analysis or building some sort of product on top of it. So what are some of the considerations for storing and processing that data that we should be thinking of both from the technical and the legal aspects?
[00:18:17] Unknown:
So from the legal side of things, I cannot talk much about it because I'm not a lawyer. But from the technical standpoint, there is a difference when you're doing things small scale or large scale. So when you're scraping the web at a large scale, 2 of the problems that will take probably most of your time are data quality, like keeping data quality high, and maintaining the spiders. So there are many other things that we might mention later on, like unstructured data. Yeah, actually, let's talk about unstructured data, because with web scraping, you don't really have, like, a universal schema that every web scraper follows. Like, there is not 1 schema that you know you need to follow if you are scraping a specific type of website.
So it can be a challenge when you are figuring out how to store data, like what kind of data fields you need, how you name those data fields. So first you need to figure out the schema, like what data points you really need and how it makes sense to store the data long term. But once you have that, it's really just a challenge to keep the data quality high. Because if the data quality is too low, like we mentioned, products or use cases where people build on top of web scraped data, even though it's really useful to grab the data, run analysis on it, and based on the analysis, you know, companies usually make better decisions and that sort of thing. But if the data quality is low, it's not a good situation when you have low quality data and you are trying to run analysis on the low quality data. And with web scraping, it's really easy to get bad quality data, because let's say you have a spider, you have selectors that select the data on the website, but the website changes its layout.
And so from 1 day to another, you will not be able to get the data, because the website has a different layout and your selectors don't work anymore, and data quality drops. And so you need to maintain the spiders. And if you're building on top of web scraped data, for example, in the e-commerce world, we have different use cases like, for example, price intelligence, which basically means that you extract prices and you run analysis on the prices and, you know, you monitor competitors or you monitor your own product prices or you monitor the whole market. And if the price data that comes out of your web scraping project is low quality, you will not be able to drive decisions based on the data. So that's 1 consideration to work on: how to keep the data quality high. So that's 1 reason why data quality could drop: because the website layout changes and you need some time to maintain the spider. But also, when you're scraping websites at scale, you need to use proxies. You need to use a lot of proxies in many large scale web scraping projects.
That's the majority of the work: managing proxies, getting a healthy proxy pool, and making sure that you can make successful requests. Meaning that you can get the HTML file in the first place and then extract the data with your spiders. If you cannot make successful requests,
[00:21:55] Unknown:
you don't get the HTML files. And so this can cause the data quality to drop. And in terms of the data quality aspects that you mentioned, there is the issue of sites changing their layouts or actively trying to prevent web scraping, which can lead to breakages once you've got something that's functioning and then needing to maintain it. And then also, as you mentioned, there are the issues around trying to scale a scraper to be able to maintain the data on a periodic basis or collect data from multiple sites. So I'm wondering, what are some of the ways that people who are writing a scraper can guard against some of those changes and updates, or ways to allow the scraper to adapt to those changes?
[00:22:40] Unknown:
I would say that you should have a way to learn quickly if there's a layout change. With Scrapy, for example, you have a lot of open source modules that you can use alongside Scrapy that help you keep the data quality high, that alert you that, hey, there is something wrong, maybe you should have a look at this website because maybe it changed layout. But there is not really, like, a shortcut. Like, if the website layout changes, you need to change the selectors, and, you know, for 1 website it's really easy to do, but for 100 or 1000 websites you will have a lot of maintenance work. But the good news is that, just like in many other industries, in web scraping as well there are machine learning advancements, like there are different AI based tools that extract the data for you. So for example, at ScrapingHub, we have a product called AutoExtract, which is an ML based tool which extracts the data fields from the page, and you only need to submit the type of the page. So for example, we have this API called the AutoExtract news API, and what it basically does is that you submit the page URL and you say that, hey, there's a news item on this page or there's an article on this page. And so the software will know, you know, what fields, what data points to look for, and it automatically extracts all the data fields that you care about from the HTML.
And this is all ML based, so you don't need to manually find or locate the data points in the HTML. And this is huge, because this way you don't need to maintain the spider if the website layout changes. Because if it changes, it doesn't matter, the algorithm will locate the data fields anyway. So that's really the ultimate solution long term. You know, we are doing this for many different data types. So for example, we have the news API, which is for news and articles. We have a product API, which is for e-commerce websites, meaning product pages. So you get all the data fields related to products, like product price, stock information, product description, product name, and so on. So in the future, you will not need to deal with maintaining spiders.
You will just need to specify that, hey, I want to scrape data from this page, this is a product page, or a real estate listing, or an article page. And the API will give you all the data points, and you won't need to struggle with, you know, the selectors or XPaths or whatever.
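A low-tech version of the "alert you when something is wrong" idea mentioned above can be built without any extra tooling: an item pipeline that drops and counts items with missing fields, so a sudden spike in drops points at a probable layout change. The required field names here are hypothetical; for a fuller solution, an open source monitoring add-on such as Spidermon is the kind of module alluded to in the answer.

```python
from scrapy.exceptions import DropItem


class RequiredFieldsPipeline:
    """Drop items missing required fields and keep a count, so a sudden spike
    in dropped items hints that a site layout change broke the selectors."""

    REQUIRED_FIELDS = ("name", "price", "url")  # hypothetical schema for a product item

    def process_item(self, item, spider):
        missing = [f for f in self.REQUIRED_FIELDS if not item.get(f)]
        if missing:
            # The stat shows up in the crawl summary; alerting can be hung off it.
            spider.crawler.stats.inc_value("items/dropped_missing_fields")
            raise DropItem(f"Missing fields {missing} in {item.get('url')!r}")
        return item
```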
[00:25:32] Unknown:
1 of the other ways of scaling the collection of web data is by using proxies to help with things like access restrictions, where there might be some sort of geo IP blocking set up, or issues with the lack of a matching user agent that a site might require. So I'm wondering if you can talk through some of the considerations for using proxies, when you might want to use them, and some of the features you should look at if you're trying to either select or implement a proxy solution yourself?
[00:26:04] Unknown:
Yeah. So there are some reasons that you would want to or need to use proxies. 1 of them, just like you mentioned, is if there is some kind of geo blocking that the website implemented, or if you want to get geo specific data from the website, meaning that the website shows you a different page or different data depending on where you are. So, for example, in e-commerce, you might see a different price or different delivery or shipping information based on your location. And so in order to get this kind of data, you need to use geo targeted proxies. The second reason why you would use proxies is when you're trying to scrape a lot of data regularly, consistently, from a website. You will just not be able to get it from, you know, from your computer or data center, because the website will realize that you are a bot and it will block you, even though what you're doing is, you know, ethical and you are making sure that you are polite. The website can still block you. With proxies, the interesting thing is that when people experience issues with their web scraping project, meaning that they start to experience blocks, most of the time they know that they need to use proxies. But many times it's not just the proxies themselves. Like, a proxy itself is a commodity. Like, you can buy proxies from many countries.
You can buy different kinds of proxies, data center proxies, residential proxies. There are all kinds of proxies out there that you can buy, but with web scraping, if you want to be efficient, it's really about how you manage those proxies that you have. So let's say you are trying to scale up your project, you experience issues with data quality, you don't get successful requests, and you want to start using proxies. 1 solution would be to just buy a bunch of proxies and rotate them randomly. And that might work for a while, but that's definitely not an efficient way to do things. What I would say is the best way to do it is to get proxies, but put a lot of consideration into how you manage those proxies.
So for example, at ScrapingHub, we have a platform which helps you with web scraping on all levels. You can run your spiders in the cloud, we provide these machine learning based APIs, we have solutions, etcetera, but we also provide proxies. But the way we think about this is that what you really want is not really proxies. That's just a means to an end. What you really want is successful requests. So what we give you is not just proxies, but a full proxy management solution. Yes, we have proxies, but the main thing is how we manage the proxies for you. Like, there are many tactics that you can use to manage the proxies in a smart way. So for example, a website might, you know, block a proxy.
You know, 1 time it might block a proxy, but it doesn't mean that it will be blocked forever. So you can retry the proxy a day later, a week later, a month later, and it might work again. With smart proxy management, it's really about making the most of your current proxies. Like, usually the customers that come to us just don't want to deal with this whole proxy management thing. They just want a solution that gives them successful requests. Or we have customers who come to us because they tried to do it themselves. They tried to manage the proxies themselves, but it just really became a headache. The thing with proxies is that, you know, as time goes on, it's not gonna get easier. Like, it's just gonna get harder. You're just gonna need to spend more time on the way you manage the proxies.
And so for many companies, it just makes sense to outsource this thing, because you can get really bogged down managing proxies, and for many companies and for many people, it's just not what they're trying to do. It's not the core of their business to do web scraping. It's just a way to get data. And so many people don't like to spend way too much time on proxies.
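For contrast with a managed solution, here is roughly what the naive "buy a bunch of proxies and rotate them randomly" baseline looks like as a Scrapy downloader middleware. The proxy URLs are placeholders, and everything smarter, such as health checks, retiring banned proxies, and retrying them later, is exactly the management work described above.

```python
import random


class RandomProxyMiddleware:
    """Naive proxy rotation: pick a random proxy for every outgoing request.
    Real proxy management also tracks bans, retries blocked proxies later,
    and removes dead ones, which is most of the actual work."""

    # Placeholder proxy URLs for illustration only.
    PROXIES = [
        "http://user:pass@proxy1.example.com:8000",
        "http://user:pass@proxy2.example.com:8000",
        "http://user:pass@proxy3.example.com:8000",
    ]

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honors request.meta["proxy"].
        request.meta["proxy"] = random.choice(self.PROXIES)
```

It would be enabled through the DOWNLOADER_MIDDLEWARES setting, after which each request is routed through whatever request.meta["proxy"] points at.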
[00:30:19] Unknown:
For people who are building web scrapers, particularly with Scrapy, but also just in the general case, what are some of the common pitfalls or sharp edges that they're likely to run into for building the scraper, collecting the data, and also for trying to scale that data collection? I have a colleague who said last year at our summit that the web is a jungle. So when you scrape the web, you really shouldn't expect
[00:30:47] Unknown:
to find, you know, standardized websites. You shouldn't expect that websites follow the same practices. So 1 thing is that front end developers or website developers have different ways to accomplish things, and when you're writing your spider, it can be really frustrating or it can be hard to create a custom web scraper for each website. Because this website uses this way to display data, the other website uses some other way to display the data. So it can be really hard. The other thing is, in my experience, I don't have data about this, but in my experience, people usually underestimate the effort they will need to get data from websites.
And, actually, I made this mistake. Before I joined ScrapingHub, I was building my own company, or I was a co founder in a web scraping based startup. And I made the mistake that it was easy to get data from, let's say, 10 websites. At the time, we were talking about maybe thousands of records per day, or maybe tens of thousands of records extracted per day, and that was fairly easy to do. I didn't need to use proxies, or I didn't need to use many proxies. There wasn't much maintenance work needed for, like, you know, 10 websites. But when we started to scale, it became really hard, because at the same time we needed to maintain the spiders, because the layouts were changing.
There was always some kind of work to do with the spiders. And on the proxy side, we needed to find some kind of solution to not get bogged down with the proxies. So those 2 things are really, I think, easy to underestimate if people start a new project. Like, it really comes down to a question of data quality when you're extracting data from the web, and there needs to be a process in place to make sure that the data quality is high, especially when you are building something on top of the web scraped data. Like, for example, I mentioned price intelligence.
In price intelligence software, the way it works is that there is a module that extracts data from websites, and, you know, it can be thousands of websites or tens of thousands of websites. It stores the data in a database, and then there is an application that reads the data from the database. But if the data that comes out of the scraping module or the extraction code is low quality, or it's not standardized, or there is some issue with it, you will not be able to trust the results, the analysis, that the price intelligence software gives you. And so, in the long term, I think data quality is the issue to focus on and to make sure that you have a process in place.
Also, like I mentioned, the web is really not standardized. Like, when it comes to web data, there is not 1 straightforward schema that everyone is using. That's why I think solutions that sort of try to standardize this make sense. Like, when we are talking about a product page, we have a clear and agreed upon schema that everyone will use to store data or to display data, and the same for other data types. Another consideration that we actually didn't talk about is JavaScript. Like, when you're scraping the web, you sometimes need to render JavaScript, and it can be a challenge as far as resources, because if you need to render JavaScript, that's gonna take more of your resources, and it's gonna be probably more expensive, and you probably need to use some kind of tool that renders JavaScript.
Because, for example, the scraping libraries we mentioned don't render JavaScript. Also, Scrapy doesn't render JavaScript by default. So for this, you need to use, you know, a tool like a headless browser, like Selenium, or in JavaScript many people use Puppeteer, or, what I really like to use for JavaScript rendering is Splash, which is an open source JavaScript rendering solution, and it has a really great integration with Scrapy. So you can just, you know, plug Splash into your Scrapy project and render JavaScript with Scrapy. And, yeah, another pitfall, maybe related to JavaScript rendering, is that some websites have a sort of hidden API, like JSONs that are getting passed around in the background that have the same kind of data that you want to scrape. And many times it makes sense to grab that JSON from the background. Like, maybe the website uses AJAX to request these data files. And instead of relying on the HTML and relying on your spider to grab the data from the HTML, you can just grab the JSON and get structured data that way. This way, you don't even need to render JavaScript.
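A rough sketch of the Splash integration described here, assuming a Splash instance is running locally and the scrapy-splash plugin is installed and enabled in the project settings per its documentation; the URL and selector are placeholders.

```python
import scrapy
from scrapy_splash import SplashRequest  # requires the scrapy-splash package


class JsHeavySpider(scrapy.Spider):
    """Render JavaScript through Splash before parsing; URL and selector are hypothetical."""
    name = "js_heavy"

    # Assumes settings.py also sets SPLASH_URL (e.g. "http://localhost:8050")
    # and enables the scrapy-splash middlewares as its docs describe.

    def start_requests(self):
        yield SplashRequest(
            "https://example.com/js-rendered-page",
            callback=self.parse,
            args={"wait": 1.0},  # give the page time to finish rendering
        )

    def parse(self, response):
        # The response now contains the JavaScript-rendered HTML.
        yield {"title": response.css("h1::text").get()}
```

The hidden JSON alternative mentioned at the end of the answer is often simpler still: find the XHR endpoint in the browser's network tab and request it directly, skipping rendering entirely.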
[00:35:58] Unknown:
And in your experience of working on web scrapers yourself and working with people who are building them at ScrapingHub, what are some of the most interesting or innovative or unexpected ways that you've seen Scrapy and ScrapingHub used, and some of the particularly interesting or challenging lessons that you've learned in the process?
[00:36:16] Unknown:
There are 2 categories of use cases. At least that's how I look at it. Like, first, we have the individual use cases, meaning use cases that are implemented by individuals for hobby projects or, you know, simple automations. Like, for example, I recently heard about this person who created a scraper to find grocery delivery slots in an online store. Because, you know, during COVID, everybody started to order groceries online and everything. The situation here was that you couldn't find open delivery slots, so you couldn't order online. But this person built a web scraper that monitored this website and, whenever there was an open delivery slot, it sent an email. So that's, I think, pretty useful.
Or I mentioned my friend who is looking to buy a house and is monitoring the real estate prices. So that's 1 category of use cases. And the other category of use cases, which is also interesting in my opinion, is when you're building on top of scraped data. Like, for example, I talked to this company not so long ago, they are, like, an NLP company, and they provide basically a search engine for news. So you can search news that is relevant for you. They monitor specific topics. So for example, you might want to learn about, I don't know, COVID vaccine development, and you monitor all the news related to that. And in the background, the software has a feature that lets you submit, you know, news, saying, hey, this article is about the topic that I want to monitor, here's another article that's also about the topic I want to monitor. And you basically train the system to monitor news that is relevant for you.
So whenever you use the application, it will only show you news that is really relevant for you. So I found it really interesting. Another use case, which is, I guess, not spectacular, but it's just really useful and it just really makes sense to do, is to monitor prices on the web. Right? Like, really any prices. You know, we have e-commerce, we have real estate as far as prices. There are huge web scraping projects in the automotive industry. When it comes to brand monitoring, there's actually something called minimum advertised price monitoring that brands use.
Which means that, you know, a brand has a specific pricing policy, and distributors or the resellers need to align their prices with the brand's policy. And with web scraping, these brands can monitor the resellers' websites and get alerted if the price is, you know, higher or lower than it's supposed to be. So, yeah, but for me, the reason I love web scraping is that it's easy to do web scraping on a small scale. Like, I remember when I first learned about web scraping, I think it took me about, I don't know, maybe 2 days or 3 days, probably 3 days, to write my first scraper and to actually see data coming out of my web scraper. And there are many use cases for individuals to just automate small tasks. But with, you know, for big companies, for large enterprises,
[00:39:42] Unknown:
it also makes sense to do web scraping at a large scale. So that's why I love it. And are there any other aspects of web scraping and your experience at ScrapingHub that we didn't discuss that you'd like to cover before we close out the show? I think we covered the main things. Yeah. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose the band Gov't Mule. I may have picked them in the past, but it's just a really great rock band that I've enjoyed listening to for a number of years now. So if you're looking for some new music to listen to, or some old music that you may be familiar with, it's definitely worth taking a look at them. And with that, I'll pass it to you, Attila. Do you have any picks this week? Yeah. Actually, yeah. So, I mean, it's not a new thing, but there's this open source thing, it's not like a library, but it's on GitHub.
[00:40:37] Unknown:
awesome-web-scraping and awesome-scrapy, which have a lot of tools about web scraping and about Scrapy
[00:40:44] Unknown:
that, you know, just make it easy to get data from the web. Alright. I'll definitely have to take a look at those. So thank you very much for taking the time today to join me and discuss your experience building web scrapers and working with ScrapingHub to help other people in their efforts on that front. So I appreciate all the time and effort you've put in, and I hope you enjoy the rest of your day. Yeah. Thank you, Tobias. Thanks for having me. It was fun. Have a good day. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Episode Overview
Guest Introduction: Attila Tóth
Common Challenges in Web Scraping
Attila's Journey into Python and Web Scraping
Web Scraping vs. APIs and Data Dumps
Use Cases for Web Scraping
Attila's First Web Scraping Project
Scrapy: Features and Community
Legal and Ethical Considerations in Web Scraping
Data Quality and Maintenance Challenges
Using Proxies in Web Scraping
Common Pitfalls in Web Scraping
Innovative Use Cases and Lessons Learned
Closing Remarks and Picks