What they don’t tell us about Web Scraping

Pierre-Louis Danieau
Feb 5, 2023

No, web scraping is not easy.

Background photo by Jeremy Thomas on Unsplash

There are many gurus on the web who make you believe that it is possible to scrape anything and everything with Python.

I can’t count the number of times I’ve seen articles showing the same piece of code, suggesting that it’s possible to scrape anything and everything in just a few seconds.

Sorry to disappoint you, but it’s not that effortless.

This piece of code that I see everywhere and that almost never works
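If you haven’t come across it, it’s some variation on the following sketch (the URL and the h2 selector here are placeholders; the details vary from post to post):

```python
import csv

import requests
from bs4 import BeautifulSoup

# Fetch the page (example.com is a placeholder URL)
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Extract every title on the page ("h2" is a placeholder selector)
titles = [tag.get_text(strip=True) for tag in soup.find_all("h2")]

# Save the scraped data in CSV format
with open("data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([title] for title in titles)
```

It fetches a page with requests, parses it with BeautifulSoup, and dumps the results to a CSV file. On the “easy” sites discussed below it works fine; everywhere else it comes back with a 403, a Captcha page, or an empty JavaScript shell.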

So no, mastering this piece of code and running it on your local computer to save scraped data in CSV format won’t make you the next web scraping master.

I myself fell into this trap a few years ago.

I was quickly disappointed when I realized that I wouldn’t be able to access this huge database that is the web as easily as I’d hoped.

And here I am today, writing this article in the hope of helping some people avoid the disillusionment I experienced when I discovered web scraping.

This is what I learned after trying to scrape dozens and dozens of sites.

The 60 / 30 / 10 rule

Before starting to scrape a website, it is worth keeping these orders of magnitude in mind.

I would like to point out that this rule has never been proven and that it only reflects my personal experience.

1) 60% of websites can be scraped easily

By “easy”, I mean that it is possible to scrape them with the kind of code I presented above. That is, using the requests library to fetch the HTML and the BeautifulSoup library to parse it and find the interesting elements.

These sites are easy to scrape and are very numerous.

The reason?

Their owners have put up very few (if any) firewalls and Captchas to prevent bots from scraping their content.

The problem?

These sites are often of little interest to scrapers because they expose no interesting data to exploit. As a result, hardly anyone wants to scrape this kind of site.

I’m thinking in particular of personal blogs, discussion forums, business showcase sites, and so on.

However, these sites are worth trying to scrape when you are just discovering the field and want to learn.

2) 30% of websites are difficult to scrape, but some optimizations can make it possible.

Among these 30%, I’m thinking of websites that already have more traffic and that are rendered with JavaScript (making the requests library ineffective for scraping, since the raw HTML it returns doesn’t contain the rendered content).

These sites have set up some firewalls and use various techniques to prevent too many bots from scraping them.

However, some optimizations, easy to set up, can keep your IP address from being banned or your scraper from being caught by a Captcha.

Here is a non-exhaustive list of things you can do (a sketch combining a few of them follows this list):

  • Use the selenium library with optimized web drivers like undetected-chromedriver.
  • Automatically and regularly rotate the user-agents of your HTTP requests.
  • Include time-outs and random mouse movements to simulate human behavior.
  • Use a rotating proxy service to regularly change the IP address of the scraper.
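As an illustration, here is a minimal sketch combining the first and third points. It assumes the selenium and undetected-chromedriver packages are installed; the URL is a placeholder, and the exact combination that works will depend on the site:

```python
import random
import time

import undetected_chromedriver as uc
from selenium.webdriver.common.action_chains import ActionChains

# undetected-chromedriver patches ChromeDriver to evade common bot checks
driver = uc.Chrome()

try:
    driver.get("https://example.com")  # placeholder URL

    # Random pause to avoid a suspiciously fast, machine-like rhythm
    time.sleep(random.uniform(2.0, 5.0))

    # Small random mouse movements to simulate human behavior
    actions = ActionChains(driver)
    for _ in range(3):
        actions.move_by_offset(random.randint(5, 40), random.randint(5, 40))
        actions.pause(random.uniform(0.2, 0.8))
    actions.perform()

    html = driver.page_source  # the fully rendered page, JavaScript included
finally:
    driver.quit()
```

User-agent rotation and rotating proxies work the same way on the requests side, via the headers and proxies arguments of requests.get.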

The right combination of these parameters will depend a lot on the website you are scraping, but for the 30% of websites I am talking about, you can end up finding THE right combination that opens the doors to the site.

3) 10% of websites are almost impossible to scrape, and only very few people succeed.

And among these 10%, we of course find the most interesting sites to scrape, like social networks and e-commerce sites (Amazon, Airbnb, eBay, Zillow, Instagram…).

Indeed, these web giants are aware that their data is immensely valuable. So they set up a whole arsenal of very powerful bot-detection systems based on machine learning.

And although you may manage to slip through the cracks of the bot-detection algorithms, when you want to industrialize your scraper to retrieve data on a regular basis, they will eventually detect you…

But let’s assume that you have found the right combination to scrape one of these websites…

… But what to do next?

Because yes, it’s nice to manage, with some optimizations (and a little luck), to scrape a website from your own computer.

But generally, a website’s public data is dynamic, and you probably want to structure and store it somewhere in order to reuse it.

Otherwise what’s the point of web scraping!

It is then essential to set up a scalable architecture that allows you to:

  • Schedule periodic runs of the algorithm in order to obtain up-to-date data.
  • Build a database (preferably hosted in the cloud) to store the scraped data.
  • Set up monitoring routines to make sure that the periodic scraping goes as planned (a minimal sketch of such a routine follows this list). Because, yes, what happens if the site changes its domain name? If the div with the hard-coded _2k43C identifier in your Python program changes name? If the image you want to scrape is no longer on the same page? Etc.
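To make the monitoring point concrete, here is a minimal sketch of such a routine, assuming a simple requests/BeautifulSoup scraper and a local SQLite database (the URL, the h2 selector, and the print-based alert are placeholders; a real setup would rather use cron and a cloud database):

```python
import sqlite3
import time

import requests
from bs4 import BeautifulSoup

DB = sqlite3.connect("scraped.db")
DB.execute("CREATE TABLE IF NOT EXISTS items (scraped_at TEXT, title TEXT)")

def scrape_once():
    response = requests.get("https://example.com")  # placeholder URL
    response.raise_for_status()  # fails loudly on bans, 403s, etc.

    soup = BeautifulSoup(response.text, "html.parser")
    titles = [t.get_text(strip=True) for t in soup.find_all("h2")]  # placeholder selector

    # Monitoring: if the selector suddenly matches nothing, the page
    # layout (or that hard-coded div identifier) has probably changed.
    if not titles:
        raise RuntimeError("selector matched nothing: did the page change?")

    now = time.strftime("%Y-%m-%d %H:%M:%S")
    DB.executemany("INSERT INTO items VALUES (?, ?)", [(now, t) for t in titles])
    DB.commit()

# Run periodically to keep the data up to date
while True:
    try:
        scrape_once()
    except Exception as exc:
        print(f"scraping failed, needs attention: {exc}")  # placeholder alert
    time.sleep(24 * 60 * 60)  # once a day
```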

So many difficult challenges to solve…

But once you have completed each of these steps, a whole new field of possibilities opens up to you!

All you need to do is reuse this data intelligently in order to extract value from it. But this part is not the hardest, because as we often say…

…Data is gold!

Although this article may seem pessimistic, I’m not saying that it is impossible to scrape this or that site; I’m just saying that it’s harder than you might think.

Good luck!

Pierre-Louis D.
