By Victor Bolu, June 13, 2023
You built your own web crawler and it works perfectly.
At least, it does when you test it.
But then you run it on a large scale, and your bot gets blocked.
What’s going on? And how can you stop this from happening?
Websites are hosted on servers, and every time a user opens a website, a certain amount of server resources, such as RAM, CPU, I/O, and bandwidth, is consumed.
When many users open a website at the same time, they can consume enough server resources that the website starts to slow down.
When a bot accesses a website to collect data, it also uses these server resources. If the bot accesses the website repeatedly in a short period of time, it may overload the server’s resources and the server might stop responding.
Obviously, those responsible for server up-time don’t want this to happen, so they have created ways to identify and block web crawlers while letting real human users through to their sites. One of these is the Captcha. You have definitely seen this protection: the little box you click to confirm you are a human, or the grid of blurry photos where you pick out bicycles or boats.
So, the key is to convince the server your bot is a human so it will let you through to the website.
The basic rule is: Always browse like a human user would and you should never be blocked.
Don’t worry, you don’t need to use wigs and makeup on your bot.
Instead, you need to mimic human browsing behaviour.
Think about how humans typically use the web: they respect a site’s rules, browse relatively slowly, pause between pages, and vary what they click.
In the rest of this guide, we show you how to crawl responsibly by having your scraper emulate human behaviour (which we recommend).
We also give some tips about how you can crawl irresponsibly and still get away with it (but don’t tell anyone we told you this or we’ll deny it).
Respect websites and their rules. When you don’t play by a site’s rules, you risk being blocked really fast.
You can do this by viewing the target website’s robots.txt file and teaching your scraper to understand and crawl only the allowed content.
The robots.txt file is the webmaster’s instruction to web bots: following the Robots Exclusion Protocol, it specifies which content you may request and which you should not. The first step is knowing about robots.txt and its full syntax, and then teaching your scraper to interpret that syntax correctly.
It is like checking which buildings or rooms you are allowed to enter before you step inside, or even before you enter the compound.
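As a minimal sketch of this idea, Python’s standard library ships a robots.txt parser. The target URL and user agent string below are placeholders, not part of any real site:

```python
# Minimal sketch: consult robots.txt before fetching a page.
# "https://example.com" and "MyCrawler/1.0" are placeholders.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()  # download and parse the robots.txt file

url = "https://example.com/products/page-1"
if robots.can_fetch("MyCrawler/1.0", url):
    print("Allowed to crawl:", url)
else:
    print("Disallowed by robots.txt, skipping:", url)
```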
Have you ever seen a real human spend 3 hours on a website and visit dozens or hundreds of pages every second?
NO!
Only web crawlers can do that.
Again, this is a quick way to go straight to Internet Jail and get your scraper blocked.
What you should do instead is program your scraper to “rest” between the pages it visits, perhaps for 5–10 seconds each. Another good practice is to have your scraper go inactive for a few hours mid-scrape. This is called “sleeping time”: schedule a script to terminate the crawler at a random point after a certain number of minutes or hours of scraping, then restart it after a specified interval.
This will mimic human behaviour and lessen the chances of your scraper getting blocked, or even worse, causing a full blackout of the website by making it think your crawler is a Distributed Denial of Service (DDoS) attack.
Scraping a website too fast is the easiest way to identify yourself as a bot and get blocked.
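Here is one way the pause-and-sleep approach described above could look in Python. The URLs are placeholders, and the 200-request threshold for a longer break is an arbitrary illustration:

```python
# Sketch of "resting" between requests and taking a longer sleep mid-scrape.
import random
import time

import requests

urls = [f"https://example.com/page/{i}" for i in range(1, 501)]  # placeholder targets

for count, url in enumerate(urls, start=1):
    response = requests.get(url, timeout=30)
    # ... parse response.text here ...

    time.sleep(random.uniform(5, 10))           # rest 5-10 seconds between pages

    if count % 200 == 0:                        # occasional "sleeping time"
        time.sleep(random.uniform(1800, 3600))  # go quiet for 30-60 minutes
```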
On most websites, there are scripts that track users’ behaviour.
When a scraper crawls a site the same way on every visit, it is easy to identify as a bot.
Instead, vary your scraper’s behaviour so it does not interact with the website in exactly the same way each time.
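One illustrative way to vary behaviour is to shuffle the crawl order and mix in occasional detours, rather than hitting the same pages in the same sequence every run. The URLs and probabilities below are made up for the example:

```python
# Illustrative only: randomize crawl order and pacing between runs.
import random
import time

import requests

urls = [f"https://example.com/category/{i}" for i in range(1, 21)]
random.shuffle(urls)  # don't visit pages in the same order every run

session = requests.Session()
for url in urls:
    if random.random() < 0.2:
        # occasionally revisit the home page, the way a browsing human might
        session.get("https://example.com/", timeout=30)
    session.get(url, timeout=30)
    time.sleep(random.uniform(3, 12))  # irregular pacing instead of a fixed delay
```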
Servers are good at setting traps that web scrapers fall into without even knowing they have fallen into them.
These are referred to as honeypot traps.
Examples include following links hidden with CSS, typing into hidden input fields, and submitting forms that are invisible to human users.
To avoid this, you have to train your scraper to look out for the “display: none” and “visibility: hidden” CSS properties and to ignore elements that have them.
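A simple sketch of that filter using BeautifulSoup is below. It only catches inline styles; for hiding applied through external stylesheets, a headless-browser check such as Selenium’s element.is_displayed() is more reliable:

```python
# Sketch: skip links hidden with inline "display: none" or "visibility: hidden".
import re

from bs4 import BeautifulSoup

HIDDEN = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)

def visible_links(html):
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        style = a.get("style", "")
        if HIDDEN.search(style):
            continue  # likely a honeypot link, ignore it
        links.append(a["href"])
    return links
```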
Using a headless browser is one of the best ways to avoid being blocked, because your scraper looks a lot like a real user (real users always use browsers).
With a headless browser, you can enable JavaScript, without which many modern websites cannot load properly. Most websites serve as little HTML as possible and then rely on AJAX to fill in the rest of the page. This means the content you intend to scrape might not be available unless JavaScript runs.
The most popular APIs to drive a headless browser include Puppeteer, Selenium, and PhantomJS.
It is also easy to browse slowly with a headless browser, because each page takes time to load fully with all its assets and scripts.
Most headless browser APIs also offer convenient ways to avoid honeypot traps by letting you target only visible content.
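A minimal Selenium sketch of this approach is shown below. It assumes the selenium package and a local Chrome installation; the URL and the fixed wait are placeholders for illustration:

```python
# Minimal sketch: drive headless Chrome with Selenium so JavaScript-rendered
# content loads before scraping.
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")   # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/products")
    time.sleep(5)                        # crude wait for AJAX content; WebDriverWait is more robust
    html = driver.page_source            # the fully rendered HTML, scripts included
finally:
    driver.quit()
```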
Even with these precautions, you might eventually get blocked, because you can never know the exact algorithms behind some anti-scraping and anti-bot defences.
Many sites employ Distil Networks to require crawlers and even real users to solve captchas at least once before being trusted. In most cases, solving captchas is the best way to bypass almost all anti-scraping techniques.
Fortunately, there are third parties that offer the ability to solve captchas by API at a specified cost. All you need is to create an account with them, fund your account, and follow their documentation to solve captchas.
One trusted captcha solving service is 2captcha. The starting price is $0.50 per 1000 solved captchas and it is easy to integrate.
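As a hedged sketch, captcha-solving services such as 2captcha document a submit-then-poll flow (in.php to submit the task, res.php to fetch the answer). Confirm parameter names against the provider’s current documentation; the API key, site key, and page URL below are placeholders:

```python
# Hedged sketch of a submit-then-poll captcha-solving flow (reCAPTCHA example).
import time

import requests

API_KEY = "YOUR_2CAPTCHA_KEY"

submit = requests.post("https://2captcha.com/in.php", data={
    "key": API_KEY,
    "method": "userrecaptcha",              # solving a reCAPTCHA in this example
    "googlekey": "TARGET_SITE_KEY",         # the site key found on the target page
    "pageurl": "https://example.com/login",
    "json": 1,
}).json()
captcha_id = submit["request"]

while True:
    time.sleep(10)                           # the solver needs time; poll periodically
    result = requests.get("https://2captcha.com/res.php", params={
        "key": API_KEY, "action": "get", "id": captcha_id, "json": 1,
    }).json()
    if result["status"] == 1:
        token = result["request"]            # submit this token with your form request
        break
```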
Storing cookies and using them is a good way to bypass a lot of anti-scraping screening. Many captcha providers store cookies after you have successfully solved a captcha, and once you make requests with the cookies, they skip checking whether you are a real user or not.
Most websites also save cookies after you have passed their tests to show that you are a real user, and they will not retest you until the cookie expires.
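One simple way to persist cookies between runs is to save the session’s cookie jar to disk. The file name and URL below are placeholders for this sketch:

```python
# Sketch: reuse cookies across runs so a site that already "trusts" you
# (e.g. after a solved captcha) does not re-test every request.
import pickle

import requests

COOKIE_FILE = "cookies.pkl"
session = requests.Session()

try:
    with open(COOKIE_FILE, "rb") as f:
        session.cookies.update(pickle.load(f))  # reuse cookies from a previous run
except FileNotFoundError:
    pass  # first run: no saved cookies yet

session.get("https://example.com/data", timeout=30)

with open(COOKIE_FILE, "wb") as f:
    pickle.dump(session.cookies, f)  # save cookies for the next run
```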
As a web scraping developer, there are times you may want to scrape content disallowed in robots.txt files, scrape quickly, and perform other actions that identify you as a bot.
In such cases, identifying as a human is not even an option.
But you can still scrape by using the methods below:
You should sign up for proxies and make your crawler tunnel through these proxies to avoid getting blocked.
This way, if the server blocks the crawler, it blocks the IP address of the proxy you are using, not your crawler’s public IP.
To avoid your proxy being blocked, you can use many proxies and make the scraper randomly select proxies using an algorithm like the Fisher-Yates shuffle.
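A small sketch of proxy rotation follows; Python’s random.shuffle implements a Fisher-Yates shuffle. The proxy addresses and target URLs are placeholders:

```python
# Illustrative proxy rotation: shuffle a pool of proxies and cycle per request.
import itertools
import random

import requests

proxy_pool = [
    "http://user:pass@proxy1.example.net:8000",
    "http://user:pass@proxy2.example.net:8000",
    "http://user:pass@proxy3.example.net:8000",
]
random.shuffle(proxy_pool)              # Fisher-Yates shuffle under the hood
rotation = itertools.cycle(proxy_pool)

for url in ["https://example.com/page/1", "https://example.com/page/2"]:
    proxy = next(rotation)
    response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
```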
A smarter way is to sign up for backconnect proxies. With a backconnect proxy, you will be assigned a proxy to tunnel through, and after a few requests or minutes depending on the backconnect type, proxies change automatically.
This way you can scrape quickly, and even scrape content disallowed in the robots.txt file, while the server will not know which IP address to block.
You should note that with some proxies your public IP can still be detected. When signing up for a proxy service, make sure you sign up anonymously.
Another way sites identify bots is by their user agents. Most web scraping developers neglect to set a trusted user agent, and when they don’t, very basic and easily blocked user agents are used, for example curl/7.71, python-requests, or node.
You can switch to a more trusted user agent that servers are unlikely to deny, such as those of Google Chrome, Firefox, Safari, Opera, and other modern browsers.
A better way to avoid being blocked is to have an array of many trusted user agents and then randomly select one after shuffling the user agents well.
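A minimal sketch of that rotation is below. The user agent strings are only examples; keep the pool up to date with current browser releases:

```python
# Sketch: rotate among a small pool of real browser user agents.
import random

import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:122.0) Gecko/20100101 Firefox/122.0",
]

random.shuffle(USER_AGENTS)                              # shuffle the pool well
headers = {"User-Agent": random.choice(USER_AGENTS)}     # pick one per request
response = requests.get("https://example.com", headers=headers, timeout=30)
```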
With the above tips, you can build a scraper that can get through Captchas and crawl most websites without getting blocked.
But why bother?
At WebAutomation, we remove the hassle of dealing with all this. We build the scrapers and handle all the infrastructure and maintenance, so all you need is a starter link for your target website to get the data you need as an exportable CSV, XLSX, XML, or JSON file.
This lets you focus on using the data rather than getting the data.
We make the process of extracting web data quick and efficient so you can focus your resources on what’s truly important: using the data to achieve your business goals. In our marketplace, you can choose from hundreds of pre-built extractors for the world’s biggest websites.
These pre-built data extractors turn almost any website into a spreadsheet or API with just a few clicks. The best part? We build and maintain them for you!
Using WebAutomation lowers costs and requires no programming. You can get started quickly because we build the extractors for you, and you will always have the help you need to keep your extractor running, since we handle all the backend security and maintenance.
Web Scraping has never been this easy.