How to avoid getting blocked while web scraping


By Admin @ September 22, 2020


Introduction

You have built your own web crawler and it works perfectly, at least the few times you tested it. But when the bot starts scraping at a large scale, it is likely to get blocked unless you have accounted for that.

 

Behind the Scenes

Websites are hosted on computers referred to as servers because they serve you their resources. Every device that opens a website consumes a certain amount of server resources: RAM, CPU, I/O capacity, bandwidth, and others. The CPU is the brain of the computer, responsible for executing programs by performing the necessary calculations; a CPU with many cores can handle many executions and calculations at a time, while its clock speed indicates how quickly those calculations get done. RAM, also known as memory, determines how much information can be held for processing by the CPU at a particular time. I/O speed determines how fast data can be written to and read from disk, while bandwidth is how much data can be transferred to or from the server at once.

Any time a user opens a website in the browser, the above is what happens on the server before the web page appears. When many users open a website at once, these processes run for all of them at the same time, and depending on the server's resources, it might take a while before the web pages load.

When a web bot accesses a website to fetch and download data, these same processes happen in the background, and if a crawler hits the website every second, the server might stop responding depending on its resources. Server owners do not find this funny, so they started blocking web crawlers: detecting them and denying them access by showing a static page saying they are no longer allowed, or presenting captchas they must solve to prove they are human before gaining entry.

The good news is that only crawlers get blocked; human users should never be blocked, as the Web as we know it today was made for humans and human-readable content.

| The basic rule is: always browse like a human user would, and you will never be blocked.

 

MAKE YOUR SPIDER IDENTIFY AS A HUMAN-USER

The loophole is that real users never get banned, which means that if you can make your crawler successfully disguise itself as a real human, it will be given full authority to access and do all it wants. After studying human behaviour on the web, below are the ways real users visit websites and carry out their activities, unlike web scrapers:

  1. They do not break the rules

  2. They do not surf too fast

  3. They do not have only one pattern of crawling

  4. They do not interact with the invisible parts of the web pages

  5. They use browsers

  6. They are able to solve captchas

  7. They save and use cookies

 

Luckily, this guide teaches you how to

        - crawl responsibly by completely emulating a real human being, and also

        - crawl irresponsibly and still get away with it.

 

CRAWLING RESPONSIBLY

 

1. Try to play by the rules

This means respecting websites and their rules. You can do this by viewing the target website’s robots.txt file and then teaching your web scraper to understand it and crawl only the allowed content. The robots.txt file is the webmaster’s instruction to web bots; it specifies the content they want you to request and the content you should not request, according to the Robots Exclusion Protocol. It is like checking which buildings or rooms you are allowed to enter before entering them, or before even entering the compound. The thing about not playing by their rules is that you risk being blocked really fast. So, the first step is knowing about robots.txt and the full robots.txt syntax, and then teaching your scraper to interpret that syntax well.
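For example, Python's standard library ships a robots.txt parser. Below is a minimal sketch, assuming a placeholder site, crawler name, and URL, of checking whether a page is allowed before requesting it:

```python
# Minimal sketch: consult robots.txt before crawling a page.
# The site, URL, and crawler name below are illustrative placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the robots.txt file

url = "https://example.com/products/page-1"
if rp.can_fetch("MyCrawler/1.0", url):
    print("Allowed to crawl:", url)
else:
    print("Disallowed by robots.txt:", url)

# Honour a Crawl-delay directive if the site specifies one
delay = rp.crawl_delay("MyCrawler/1.0")
if delay:
    print("Site asks for a crawl delay of", delay, "seconds")
```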

 

2. Do not surf too fast

Have you ever seen a real human spend about 3 hours on a website and visit several of its pages every second throughout those 3 hours? No! Only web crawlers crawl for such a long period of time, visiting up to 10 pages in a second. This is a fast way to get blocked: the server's resource usage shoots up in no time, all the requests come from a single visitor, so the site blocks that visitor and the server is back to normal.

It is good practice for web bot builders to make their crawlers sleep or rest for a bit, maybe 5 seconds or so; a human could still browse a page every 5 seconds for a few hours. You could also set aside a few hours, maybe 2 to 4 hours every day, during which the crawler is totally inactive and not scraping the website at all; this is its "sleeping time". You can achieve this by scheduling a script to terminate the crawler at a random minute within specified hours and another script to start it again after its sleeping time. This looks much more human, and you stand a lesser chance of ending up on a blacklist. Beyond getting banned, hammering a website can make it unavailable to web bots and other users for a long time, which is what a Distributed Denial of Service (DDoS) attack does. Many websites are interested in blocking only that kind of attack, which means that just by slowing down, you are able to crawl a lot of websites.

| Scraping a website too fast is the easiest way to identify yourself as a web bot and get blocked.
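Here is a minimal sketch of that idea in Python; the URLs, the 5 to 12 second pauses, and the 2 to 4 a.m. sleeping window are assumptions you would tune per site:

```python
# Minimal sketch: random pauses between requests plus a daily "sleeping time".
import random
import time
from datetime import datetime

import requests  # pip install requests

urls = ["https://example.com/page-%d" % i for i in range(1, 6)]  # placeholder URLs

for url in urls:
    # Daily sleeping time: stay idle between 2 a.m. and 4 a.m. (an assumed window)
    while 2 <= datetime.now().hour < 4:
        time.sleep(10 * 60)              # check again in ten minutes

    requests.get(url, timeout=30)
    time.sleep(random.uniform(5, 12))    # pause like a human reader would
```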

 

3. Have more than one pattern of crawling

On most websites, there are scripts that track user behaviour, and a user who scrapes in exactly the same way on every visit is more likely to be blocked. Imagine you search for something on a website, visit and record the links on the search results, and then start visiting the recorded links in order: you have already started making the server suspect you are a web bot. You will be given a few more chances, but if you keep scraping like that, you appear on their radar and finally get blocked after failing a few tests to prove you are not a robot.
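One simple way to vary the pattern is to shuffle the order of the recorded links and occasionally visit unrelated pages. A rough sketch, with placeholder URLs and arbitrary probabilities:

```python
# Minimal sketch: vary the crawl pattern instead of visiting search results
# in a fixed order. All URLs and the 20% detour chance are illustrative.
import random
import time

import requests

search_results = ["https://example.com/item/%d" % i for i in range(1, 21)]
extra_pages = ["https://example.com/", "https://example.com/about",
               "https://example.com/contact"]

random.shuffle(search_results)             # never visit results in the same order
for url in search_results:
    requests.get(url, timeout=30)
    if random.random() < 0.2:              # occasionally wander off to other pages
        requests.get(random.choice(extra_pages), timeout=30)
    time.sleep(random.uniform(3, 10))
```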

 

4. Avoid Interacting with the invisible parts of the web pages

Servers are good at setting traps that web scrapers easily fall into without even knowing it. These traps are referred to as honeypot traps. Examples include following links that are hidden with CSS, typing into hidden input fields, and submitting forms that are not visible to human users. To avoid this, train your web bot to look out for the display: none and visibility: hidden CSS properties and ignore elements that have them.
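If you drive a real browser, visibility checks come almost for free. The sketch below, assuming Selenium 4 with Chrome available, collects only links that are actually displayed and so skips hidden honeypot links:

```python
# Minimal sketch: only follow links that are visibly rendered, so links hidden
# with display: none or visibility: hidden are ignored.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")          # placeholder URL

visible_links = [
    a.get_attribute("href")
    for a in driver.find_elements(By.TAG_NAME, "a")
    if a.is_displayed() and a.get_attribute("href")  # skip hidden honeypot links
]

driver.quit()
print(visible_links)
```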

 

5. Use Headless Browsers

This is one of the best ways to avoid being blocked, as you look a lot like a real user (real users always use browsers). Just by making your crawler drive a headless browser, you gain a lot of privileges and slim your chances of being banned.

With a headless browser you have JavaScript enabled, without which many modern websites cannot load properly. Most websites ship the least HTML content possible and then rely on AJAX to fill in the rest of the page. The content you intend to scrape might not be available unless JavaScript runs.

The most popular APIs for driving a headless browser include Puppeteer, Selenium, and PhantomJS.

It is also easy to surf slowly with a headless browser, because a good number of milliseconds pass before a page fully loads with all its assets and scripts. Also, most headless browser driving APIs have nice ways to avoid honeypot traps: you can specify that you are looking for visible content, and errors are thrown if elements are present but not visible.

If using a browser is too much, try sending the request headers that modern browsers send and use a browser's User-Agent to disguise your scraper as a human user.
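A minimal sketch of both approaches, assuming Selenium with Chrome for the headless case and the requests library with example header values for the lightweight case:

```python
# A minimal sketch, not a definitive setup: headless Chrome via Selenium for
# JavaScript-heavy pages, and plain requests with browser-like headers as a
# lighter fallback. URLs and header values are illustrative assumptions.
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Option A: headless browser, so AJAX-loaded content is rendered before scraping
options = Options()
options.add_argument("--headless=new")   # use "--headless" on older Chrome versions
driver = webdriver.Chrome(options=options)
driver.get("https://example.com/products")
html = driver.page_source                # fully rendered HTML, scripts executed
driver.quit()

# Option B: no browser, but send headers a real browser would send
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}
response = requests.get("https://example.com/products", headers=headers, timeout=30)
```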


6. Teach your Scraper to Solve Captchas

Even with all of these ways to avoid getting blocked, you might eventually get blocked, because you will never know the algorithms behind some anti-scraping or anti-bot mechanisms. Some servers build a really good algorithm of their own; some employ services like Distil Networks that require web bots, and even some real users, to solve captchas at least once before being trusted. In most cases, solving captchas is a way to bypass almost all anti-scraping techniques.

Fortunately, there are many third parties that offer captcha solving through simple APIs at a specified cost. All you need to do is create an account with them, fund it, and follow their documentation to start solving captchas.

A trusted captcha solving service is 2captcha, which can solve many types of captcha. You only have to create an account, fund it, and use your API key to make requests as shown in their documentation. The starting price is $0.50 per 1000 solved captchas, and it is really easy to integrate.
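The rough sketch below shows the general flow against 2captcha's classic HTTP endpoints (submit to in.php, then poll res.php); the API key, site key, and page URL are placeholders, and their current documentation remains the authority on parameters and error handling:

```python
# Rough sketch of solving a reCAPTCHA through 2captcha's classic HTTP API.
# All credentials and URLs below are placeholders; consult the provider's docs.
import time

import requests

API_KEY = "YOUR_2CAPTCHA_API_KEY"          # placeholder
site_key = "TARGET_SITE_RECAPTCHA_KEY"     # placeholder
page_url = "https://example.com/login"     # placeholder

# 1. Submit the captcha job
submit = requests.get("http://2captcha.com/in.php", params={
    "key": API_KEY, "method": "userrecaptcha",
    "googlekey": site_key, "pageurl": page_url, "json": 1,
}).json()
job_id = submit["request"]

# 2. Poll until a worker has solved it
while True:
    time.sleep(10)
    result = requests.get("http://2captcha.com/res.php", params={
        "key": API_KEY, "action": "get", "id": job_id, "json": 1,
    }).json()
    if result["request"] != "CAPCHA_NOT_READY":   # status string used by the service
        token = result["request"]                 # submit this token with the form
        break
```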

 

7. Keep Cookies and use them

Keeping cookies and reusing them is a good way to skip a lot of anti-scraping screening. Many captcha providers store cookies after you have successfully solved a captcha, and once you make requests with those cookies, they skip checking whether you are a real user. Most websites also set cookies after you have passed their tests to mark you as a real user, and they will not retest you until the cookie expires.
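With Python's requests library, a Session object keeps cookies between requests, and the cookie jar can be saved and reloaded across runs. A minimal sketch, with placeholder URLs and file path:

```python
# Minimal sketch: persist session cookies to disk and reuse them on the next run.
import pickle

import requests

session = requests.Session()
session.get("https://example.com/")           # server sets its cookies here

# Save the cookie jar so the next run is already "trusted"
with open("cookies.pkl", "wb") as f:
    pickle.dump(session.cookies, f)

# Later, or on the next run: load the cookies back into a fresh session
session = requests.Session()
with open("cookies.pkl", "rb") as f:
    session.cookies.update(pickle.load(f))
session.get("https://example.com/members")    # request sent with the saved cookies
```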


 

| Webscrape without getting blocked with WebAutomation.io 

 


CRAWLING IRRESPONSIBLY

As a web scraping bot developer, there are times you will want to scrape content disallowed in robots.txt files, scrape really fast, or do other things that identify you as a web bot. In such cases, identifying as a human is not even an option. You can still scrape by following the methods below:

 

8. Use Rotating IP Addresses

You should sign up for proxies and make your crawler tunnel through them to avoid getting blocked, especially when you are not identifying yourself as a human user. This way, if the server blocks the crawler, it blocks the IP of the proxy you are using, not the crawler's public IP.

 

To avoid your proxies getting blocked, you can keep many proxies and make the scraper select one at random after shuffling them with a good algorithm such as the Fisher-Yates shuffle. And if you are using a provider like InstantProxies, you can refresh your proxies in case some get blocked.

A smarter option is to sign up for backconnect proxies. With a backconnect proxy, you are assigned a proxy to tunnel through, and after a few requests or minutes, depending on the backconnect type, the proxy changes automatically.

This way you can scrape really fast, and even scrape content disallowed in the robots.txt file, while the server does not know which IP address to block.

With some proxies, your public IP can still be detected. When signing up for a proxy service, make sure it is a highly anonymous one, even if that means asking questions of the proxy provider's customer service.
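A minimal sketch of proxy rotation with the requests library; the proxy addresses and URLs are placeholders, and random.shuffle is Python's built-in Fisher-Yates implementation:

```python
# Minimal sketch: shuffle a pool of proxies and rotate through them per request.
import random

import requests

proxy_pool = [                                   # placeholder proxy addresses
    "http://user:pass@proxy1.example.net:8000",
    "http://user:pass@proxy2.example.net:8000",
    "http://user:pass@proxy3.example.net:8000",
]
random.shuffle(proxy_pool)                       # Fisher-Yates shuffle under the hood

urls = ["https://example.com/page-%d" % n for n in range(1, 10)]
for i, url in enumerate(urls):
    proxy = proxy_pool[i % len(proxy_pool)]      # rotate through the pool
    requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
```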

 

9. Use Trusted Rotating User Agents

Another way servers identify web bots is by their user agents. Many web scraping bot developers neglect to set a trusted user agent, and when they do that, very basic and easily blocked user agents are sent, for example curl/7.71, python-requests, or node.

You could instead switch to a more trusted user agent that servers would not want to deny access to, such as the user agents of Google Chrome, Firefox, Safari, Opera, and other modern browsers.

A better way to avoid being blocked is to keep an array of many trusted user agents and randomly select one after shuffling them well.
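A minimal sketch of that idea, with a few example user agent strings that you would keep up to date in practice:

```python
# Minimal sketch: pick a random, browser-like user agent for each request.
import random

import requests

USER_AGENTS = [   # example strings; real deployments should refresh these regularly
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

random.shuffle(USER_AGENTS)
headers = {"User-Agent": random.choice(USER_AGENTS)}
response = requests.get("https://example.com", headers=headers, timeout=30)
```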

 

GIVE WEBAUTOMATION A TRY

Yes, with the above tips you can get away with scraping over 99.99% of websites, but why not let us take the hassle away from you so you can focus on extracting quality data without the infrastructure headache? Our platform abstracts the backend operations to let you scrape anonymously and safely.

 

Save Costs, Time and Get to market faster

Build your first online custom web data extractor.
