How to avoid getting blocked while web scraping


By Admin @June, 10 2021


Introduction

You built your own web crawler and it works perfectly.

At least, it does when you test it.

But then you run it on a large scale, and your bot gets blocked.

What’s going on? And how can you stop this from happening?

 

Behind the Curtain

Websites are hosted on servers, and every time a user opens a website, a certain amount of server resources like RAM, CPU, I/O speed, and bandwidth get used:

  • The CPU is the “brain” of the computer, responsible for executing programs and performing the necessary calculations
  • RAM determines how much information can be held ready for the CPU to process at any one time
  • I/O speed determines how fast data can be written to and read from disks
  • Bandwidth is how much data can be transferred to or from the server at a time

When many users open a website at one time, it can start to use up so many server resources that the website starts to slow down.

When a bot accesses a website to collect data, it also uses these server resources. If the bot accesses the website repeatedly in a short period of time, it may overload the server’s resources and the server might stop responding.

Obviously, those responsible for server up-time don’t want this to happen, so they have created ways to identify and block web crawlers while letting real human users through to their sites. One of these ways is the Captcha. You have definitely seen a website with this protection: it is when you have to click the little box that says you are a human, or pick out the bicycles or boats in a bunch of blurry photos.

So, the key is to convince the server your bot is a human so it will let you through to the website.

 

The basic rule is: always browse like a human user would, and you are far less likely to be blocked.

 

How to Make Your Scraper Look Human

Don’t worry, you don’t need to use wigs and makeup on your bot.

Instead, you need to mimic human browsing behaviour.

Here are some ways humans typically use the web:

  1. They do not break the rules
  2. They do not surf too fast
  3. They do not have only one pattern of browsing
  4. They do not interact with the invisible parts of a web page
  5. They use browsers
  6. They are able to solve captchas
  7. They save and use cookies

 

In the rest of this guide, we show you how to crawl responsibly by having your scraper emulate human behaviour (which we recommend).

We also give some tips about how you can crawl irresponsibly and still get away with it (but don’t tell anyone we told you this or we’ll deny it).

 

Crawl Responsibly

 

1. Try to play by the rules

Respect websites and their rules. When you don’t play by a site’s rules, you risk being blocked really fast.

You can do this by viewing the target website’s robots.txt file and teaching your scraper to understand and crawl only the allowed content.

The robots.txt file is the webmaster’s set of instructions to web bots. Following the Robots Exclusion Protocol, it specifies which content bots may request and which content they should not. The first step is to learn the full robots.txt syntax, and the second is to teach your scraper to interpret that syntax correctly.

It is like checking which buildings or rooms you are allowed to enter before entering them, or even before entering the compound.
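
As a minimal sketch of the robots.txt check described above, Python's standard `urllib.robotparser` module can interpret the rules for you. The helper names (`build_parser`, `allowed_urls`), the user agent string, and the example.com URLs are illustrative assumptions, not part of any particular scraper.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, user_agent, urls):
    """Keep only the URLs that robots.txt lets this user agent request."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical rules and URLs, used only to show the behaviour.
rules = "User-agent: *\nDisallow: /private/\n"
parser = build_parser(rules)
candidates = ["https://example.com/", "https://example.com/private/data"]
print(allowed_urls(parser, "MyScraper", candidates))
```

In a real crawler you would probably let the parser download the file itself with `set_url(...)` followed by `read()`, and run the check before every request.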

 

2. Do not surf too fast

Have you ever seen a real human spend 3 hours on a website and visit dozens or hundreds of pages every second?

NO!

Only web crawlers can do that.

Again, this is a quick way to go straight to Internet Jail and get your scraper blocked.

What you should do instead is program your scraper to “rest” between the pages it visits, perhaps for 5–10 seconds per page. Another good practice is to have your scraper go inactive for a few hours mid-scrape. This is called “sleeping time” and can be achieved by scheduling a script that terminates the crawler at a random time after a certain number of minutes or hours of scraping, and then restarts it after a specified time.

This will mimic human behaviour and lessen the chances of your scraper getting blocked or, even worse, of it overloading the website so badly that your crawler looks like a Distributed Denial of Service (DDoS) attack.
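
The “rest” between pages could be sketched as follows. The 5–10 second range comes from the text above; the function names and the injectable `fetch` and `sleep` parameters are assumptions made so the sketch can be run without touching a real website.

```python
import random
import time


def polite_delay(min_seconds=5.0, max_seconds=10.0):
    """Pick a random pause length so the gaps between requests are not uniform."""
    return random.uniform(min_seconds, max_seconds)


def crawl(urls, fetch, sleep=time.sleep):
    """Fetch each URL in turn, pausing for a random interval between pages."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No pause is needed before the very first page.
            sleep(polite_delay())
        results.append(fetch(url))
    return results
```

Here `fetch` is whatever function downloads a page in your scraper. Passing a custom `sleep` makes it easy to test the crawl loop without actually waiting.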

 

Scraping a website too fast is the easiest way to identify yourself as a bot and get blocked

 

3. Have more than one pattern of crawling

On most websites, there are scripts that track users’ behaviour.

When a scraper has only one way of crawling a site each time it visits, it is easy to identify as a bot.

Instead, vary your scraper’s behaviour so that it does not interact with the website in exactly the same way on every visit.

 

4. Avoid Interacting with the invisible parts of web pages

Servers are good at setting traps that web scrapers fall into without even knowing they have fallen into them.

These are referred to as honeypot traps.

Examples include following links that are hidden with CSS, typing into hidden input fields, and submitting forms that are not visible to human users.

To avoid this, you have to train your scraper to look out for the “display: none” and “visibility: hidden” CSS properties and to ignore elements that have them.
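
A rough sketch of that check, using only Python's standard `html.parser`, is shown below. It makes a strong simplifying assumption: it only inspects each link's own inline `style` attribute, so links hidden through a stylesheet, a CSS class, or a hidden parent element would not be caught. The names `is_hidden`, `VisibleLinkParser`, and `visible_links` are illustrative.

```python
from html.parser import HTMLParser

# CSS declarations that hide an element from human visitors.
HIDDEN_RULES = {("display", "none"), ("visibility", "hidden")}


def is_hidden(style):
    """Return True if an inline style string hides the element."""
    for declaration in style.split(";"):
        if ":" not in declaration:
            continue
        prop, value = declaration.split(":", 1)
        value = value.replace("!important", "").strip().lower()
        if (prop.strip().lower(), value) in HIDDEN_RULES:
            return True
    return False


class VisibleLinkParser(HTMLParser):
    """Collect href values from <a> tags that are not hidden inline."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        style = attributes.get("style") or ""
        if href and not is_hidden(style):
            self.links.append(href)


def visible_links(html):
    """Return the links in an HTML string that pass the inline-style check."""
    parser = VisibleLinkParser()
    parser.feed(html)
    return parser.links
```

Because of the limitation noted above, this is best treated as a first filter rather than a complete defence against honeypots.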

 

5. Use Headless Browsers

This is one of the best ways to avoid being blocked as you look a lot like a real user (real users always use browsers).

With a headless browser, you can enable JavaScript, without which many modern websites cannot load properly. Most websites load the fewest HTML contents possible and then rely on AJAX to update other parts of the web page. This means that the content you intend to scrape might not be available unless JavaScript is present.

The most popular APIs for driving a headless browser include Puppeteer, Selenium, and PhantomJS.

It’s easy to surf slowly with a headless browser because a lot of milliseconds are spent by the time a page fully loads with all its assets and scripts.

Most headless browser APIs also offer convenient ways to avoid honeypot traps by letting you specify that you are only looking for visible content.

 

6. Teach your scraper to solve Captchas

Even with these ways to avoid getting blocked, you might eventually get blocked because you will never know the algorithms of some anti-scraping or anti-bot defenses.

Many sites employ Distil Networks to require crawlers and even real users to solve captchas at least once before being trusted. In most cases, solving captchas is the best way to bypass almost all anti-scraping techniques.

Fortunately, there are third parties that offer the ability to solve captchas by API at a specified cost. All you need is to create an account with them, fund your account, and follow their documentation to solve captchas.

One trusted captcha solving service is 2captcha. The starting price is $0.50 per 1000 solved captchas and it is easy to integrate.

 

7. Store cookies

Storing cookies and using them is a good way to bypass a lot of anti-scraping screening. Many captcha providers store cookies after you have successfully solved a captcha, and once you make requests with the cookies, they skip checking whether you are a real user or not.

Most websites also save cookies after you have passed their tests to show that you are a real user, and they will not retest you until the cookie expires.
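
One way to keep cookies between runs, sketched here with Python's standard `http.cookiejar` and `urllib.request` modules, is to load a cookie file at start-up and save it again when the scraper finishes. The function names and the idea of a single cookie file are assumptions for illustration.

```python
import os
from http.cookiejar import MozillaCookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def load_cookie_jar(path):
    """Create a cookie jar backed by a file, loading saved cookies if present."""
    jar = MozillaCookieJar(path)
    if os.path.exists(path):
        # ignore_discard=True also restores session cookies saved earlier.
        jar.load(ignore_discard=True)
    return jar


def opener_with_cookies(jar):
    """Build a URL opener that sends and stores cookies through the jar."""
    return build_opener(HTTPCookieProcessor(jar))


def save_cookie_jar(jar):
    """Write the jar back to its file so the next run can reuse the cookies."""
    jar.save(ignore_discard=True)
```

A typical run would call `load_cookie_jar`, make its requests through `opener_with_cookies(jar).open(url)`, and finish with `save_cookie_jar(jar)`.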

 

Crawl Irresponsibly

 

As a web scraping developer, there are times you may want to scrape content disallowed in robots.txt files, scrape quickly, and perform other actions that identify you as a bot.

In such cases, identifying as a human is not even an option.

 

But you can still scrape by using the methods below:

1. Use rotating IP addresses

You should sign up for proxies and make your crawler tunnel through these proxies to avoid getting blocked.

This way, if the server blocks the crawler, it blocks the IP of the proxy you are using, not the crawler’s public IP.

To avoid your proxy being blocked, you can use many proxies and make the scraper randomly select proxies using an algorithm like the Fisher-Yates shuffle.
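
A small sketch of that idea follows: a hand-written Fisher-Yates shuffle plus a generator that walks through the shuffled pool and reshuffles after every full pass. The proxy addresses in the usage note are placeholders, and the function names are assumptions.

```python
import random


def fisher_yates_shuffle(items, rng=random):
    """Return a shuffled copy of items using the Fisher-Yates algorithm."""
    shuffled = list(items)
    for i in range(len(shuffled) - 1, 0, -1):
        # Pick a position from the not-yet-fixed part (0..i, inclusive).
        j = rng.randint(0, i)
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled


def rotating_proxies(proxies):
    """Yield proxies forever, reshuffling the pool after each full pass."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    while True:
        for proxy in fisher_yates_shuffle(proxies):
            yield proxy
```

For example, `pool = rotating_proxies(["http://proxy1.example.com:8080", "http://proxy2.example.com:8080"])` followed by `next(pool)` before each request gives you the proxy to use. With the standard library, that proxy can be passed to `urllib.request.ProxyHandler({"http": proxy, "https": proxy})`.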

A smarter way is to sign up for backconnect proxies. With a backconnect proxy, you will be assigned a proxy to tunnel through, and after a few requests or minutes depending on the backconnect type, proxies change automatically.

This way you can scrape quickly, and also scrape content disallowed in the robots.txt file, while the server will not know which IP address to block.

You should note that with some proxies your public IP can still be detected. When signing up for a proxy service, make sure you sign up anonymously.

 

2. Use trusted rotating user agents

Another way to identify bots is by their user agents. Most web scraping bot developers neglect to set trusted user agents, and when they do, very basic and easily blocked default user agents are used. For example: curl/7.71, python-requests, node.

You can switch to a more trusted user agent that servers would not want to deny access to, such as the user agents of Google Chrome, Firefox, Safari, Opera, and other modern browsers.

A better way to avoid being blocked is to have an array of many trusted user agents and then randomly select one after shuffling the user agents well.
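
A minimal sketch of user agent rotation with Python's standard `urllib.request` is shown below. The user agent strings are illustrative examples in the format real browsers send and will go out of date, so treat the list as a placeholder. The sketch uses `random.choice`, which gives a random pick without a separate shuffle step.

```python
import random
from urllib.request import Request

# Example browser user agents (assumed values; refresh them periodically).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) "
    "Gecko/20100101 Firefox/89.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/14.1.1 Safari/605.1.15",
]


def build_request(url, rng=random):
    """Create a request whose User-Agent header is picked at random."""
    return Request(url, headers={"User-Agent": rng.choice(USER_AGENTS)})
```

The returned object can be passed to `urllib.request.urlopen` or to an opener, so each request goes out with a different, browser-like user agent.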

 

Conclusion

With the above tips, you can build a scraper that can get through Captchas and crawl most websites without getting blocked.

But why bother?

At WebAutomation, we remove the hassle of dealing with all this and take care of building scrapers and handling all the infrastructure and maintenance so all you need is a starter link on your target website to get the data you need as an exportable CSV, XLSX, XML, or JSON file.

This lets you focus on using the data rather than getting the data.

 

Get started with WebAutomation for Free today!

At WebAutomation, we make the process of extracting web data quick and efficient so you can focus your resources on what’s truly important: using the data to achieve your business goals. In our marketplace, you can choose from hundreds of pre-built extractors for the world’s biggest websites.

These pre-built data extractors turn almost any website into a spreadsheet or API with just a few clicks. The best part? We build and maintain them for you!

Using WebAutomation lowers costs, requires no programming, enables you to get started quickly since we build the extractors for you, and means you’ll always have the help you need to keep your extractor running since we handle all the backend security and maintenance.

Web Scraping has never been this easy.

Save Costs, Time and Get to market faster

Build your first online custom web data extractor.
