By Admin, September 22, 2020
You have built your own web crawler, and it works perfectly, at least the few times you tested it. But once the bot starts scraping at scale, it is likely to get blocked unless you have accounted for that.
Websites are hosted on computers referred to as servers because they serve you their resources. For every device that opens a website, a certain amount of server resources is used: RAM, CPU, I/O speed, bandwidth, and more. The CPU is the brain of the computer, responsible for executing programs by performing the necessary calculations; a CPU with many cores can handle many executions at a time, while its clock speed indicates how fast it gets through those calculations. RAM, also known as memory, determines how much information can be held for processing by the CPU at a given time. I/O speed determines how fast data can be written to and read from disk, while bandwidth is how much data can be transferred to or from the server at once.
Anytime a user opens a website in the browser, the above is what happens on the server before the web page loads. When many users open a website at once, these processes happen for all of them at the same time, and depending on the server's resources, it might take a while for the pages to load.
When a web bot accesses a website to download data, these same processes happen in the background, and if a crawler hits a website every second, the server might stop responding, depending on its resources. Server operators do not find this funny, so they started blocking web crawlers: detecting them and denying them access, serving them a static page telling them they are no longer allowed, or presenting captchas they must solve to prove they are human before they can gain entry.
The good news is that only crawlers get blocked; human users should never be blocked, since the Web as we know it today was made for humans and human-readable content.
The loophole is that real users never get banned, which means that if your crawler can successfully disguise itself as a real human, it will be given free rein to access and do all it wants. After studying human behaviour on the web, below are the ways humans visit websites and carry out their activities, unlike web scrapers:
- They do not break the rules
- They do not surf too fast
- They do not follow only one crawling pattern
- They do not interact with the invisible parts of web pages
- They use browsers
- They are able to solve captchas
Luckily, this guide teaches you how to
- crawl responsibly by completely emulating a real human being, and also
- crawl irresponsibly and still get away with it.
This means respecting websites and their rules. You can do this by viewing the target website's robots.txt file and then teaching your web scraper to understand it and crawl only the allowed content. The robots.txt file is the webmasters' instruction to web bots: it specifies the content they want you to request and the content you should not request, according to the Robots Exclusion Protocol. It is like checking which buildings or rooms you are allowed to enter before entering them, or before even entering the compound. If you do not play by their rules, you risk being blocked really fast. So the first step is knowing about robots.txt and the full robots.txt syntax, and then teaching your web scraper to interpret that syntax well.
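As a sketch of that first step, Python's standard library can already interpret the basic robots.txt syntax. The user agent name and URLs below are placeholders, not real targets:

```python
# Minimal sketch: check a robots.txt body before requesting a URL.
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse a robots.txt body and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

RULES = """User-agent: *
Disallow: /private/
"""

print(is_allowed(RULES, "my-crawler", "https://example.com/public/page"))   # True
print(is_allowed(RULES, "my-crawler", "https://example.com/private/page"))  # False
```

In a real crawler you would fetch the file from `https://<host>/robots.txt` once and consult it before every request.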
Have you ever seen a real human spend about 3 hours on a website, visiting several of its pages every second throughout those 3 hours? No. Only web crawlers crawl for that long, hitting as many as 10 pages in a second. This is a fast way to get blocked: the server's resource usage shoots up in no time, and since all the requests come from one client, blocking that client brings the server back to normal.
It is good practice for bot builders to make their crawlers sleep or rest for a bit between requests, maybe 5 seconds or so; humans can still click through pages every 5 seconds for a few hours. You could also set aside a few hours, maybe 2 to 4 hours every day, when the crawler is totally inactive and not scraping at all; call this its "sleeping time". You can achieve this by scheduling a script that terminates the crawler at a random minute within the specified hours and restarts it once its sleeping time is over. This looks far more human, and you stand a much lower chance of ending up on a blacklist. Beyond getting you banned, aggressive crawling can make the website unavailable to other users for a long time, which is effectively a Denial of Service (DoS) attack. Many websites are interested in blocking only that kind of behaviour, which means that just by slowing down, you can crawl a lot of websites.
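The pacing described above can be sketched in a few lines. The delay bounds and the 2-to-4 a.m. sleeping window here are arbitrary illustrative choices:

```python
# Sketch of human-like pacing: a random delay between requests and a
# daily "sleeping time" window during which the crawler stays inactive.
import random
import time
from datetime import datetime

def polite_delay(min_s: float = 5.0, max_s: float = 10.0) -> float:
    """Sleep a random number of seconds and return how long we slept."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

def is_sleeping_time(now: datetime, start_hour: int = 2, end_hour: int = 4) -> bool:
    """True while the crawler should be fully inactive (here, 02:00-04:00)."""
    return start_hour <= now.hour < end_hour

print(is_sleeping_time(datetime(2020, 9, 22, 3, 0)))   # True
print(is_sleeping_time(datetime(2020, 9, 22, 12, 0)))  # False
```

A crawler's main loop would call `polite_delay()` after each request and pause entirely while `is_sleeping_time(datetime.now())` holds.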
Most websites run scripts that track user behaviour, and a user who scrapes in exactly the same pattern on every visit is more likely to be blocked. Imagine you search for something on a website, record the links on the results page, and then start visiting the recorded links one after another; you have already signalled to the server that you might be a bot. You will be given a few more chances, but if you keep scraping in the same pattern, you end up on their radar and finally get blocked after failing a few more tests to prove you are not a robot.
Websites are good at setting traps that web scrapers easily fall into without even knowing it. These traps are referred to as honeypots. Examples include following links that are hidden with CSS, typing into hidden input fields, and submitting forms that are not visible to human users. To avoid this, train your web bot to look out for the display: none and visibility: hidden CSS properties and to ignore elements that carry them.
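Here is a minimal sketch covering the inline-style case. Real pages can also hide elements through external stylesheets, so a headless browser's visibility check is more reliable, but checking inline styles already catches the simplest honeypots (the link data below is made up):

```python
# Treat an element as a potential honeypot when its inline style hides it.
import re

HIDDEN = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)

def looks_hidden(style: str) -> bool:
    """True when an inline style string hides the element."""
    return bool(HIDDEN.search(style or ""))

links = [
    {"href": "/real-page", "style": ""},
    {"href": "/trap", "style": "display:none"},
    {"href": "/trap2", "style": "visibility: hidden;"},
]
visible = [a["href"] for a in links if not looks_hidden(a["style"])]
print(visible)  # ['/real-page']
```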
This is one of the best ways to avoid being blocked, because you look a lot like a real user (real users always use browsers). Just by making your crawler drive a headless browser, you gain a lot of privileges and stand a slim chance of being banned.
It is also easy to surf slowly with a headless browser, because a page takes noticeable time to fully load with all its assets and scripts. Most headless browser driving APIs also offer nice ways to avoid honeypot traps: you can ask specifically for visible content, and errors are thrown if an element is present but not visible.
If using a browser is too much, try sending the request headers that modern browsers send, and use a browser's User-Agent string to disguise yourself as a human user.
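A sketch of that idea with the standard library follows. The User-Agent string is a sample Chrome UA from around this article's publication date, and the URL is a placeholder; in practice you would copy current headers from your own browser's developer tools:

```python
# Sketch: attach browser-like request headers to an HTTP request.
import urllib.request

BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/85.0.4183.102 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def build_request(url: str) -> urllib.request.Request:
    """Create a request carrying browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)

req = build_request("https://example.com/")
print(req.get_full_url())  # https://example.com/
```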
Even with all these ways to avoid getting blocked, you might eventually get blocked anyway, because you will never know the exact algorithms behind some anti-scraping or anti-bot mechanisms. Some servers build really good algorithms of their own; others employ services like Distil Networks that require web bots, and even some real users, to solve captchas at least once before being trusted. And in most cases, solving captchas is the way to bypass almost all anti-scraping techniques.
Fortunately, there are many third parties that offer captcha solving through APIs at a specified cost. All you need to do is create an account with them, fund it, and follow their documentation to solve captchas.
A trusted captcha-solving service is 2captcha, which can solve many kinds of captchas. You only have to create an account, fund it, and use your API key to make requests as shown in their documentation. The starting price is $0.50 per 1000 solved captchas, and it is really easy to integrate.
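As a hedged sketch: 2captcha's documented flow at the time of writing is to submit a task to their `in.php` endpoint and poll `res.php` for the answer. The code below only builds the request URLs (the API key and site key are placeholders); verify the exact parameter names against their current documentation before use:

```python
# Sketch of 2captcha's submit/poll flow for a reCAPTCHA task.
from urllib.parse import urlencode

API_KEY = "YOUR_2CAPTCHA_KEY"  # placeholder, from your 2captcha account

def submit_url(site_key: str, page_url: str) -> str:
    """Build the URL that submits a reCAPTCHA solving task."""
    params = urlencode({
        "key": API_KEY,
        "method": "userrecaptcha",
        "googlekey": site_key,
        "pageurl": page_url,
    })
    return f"https://2captcha.com/in.php?{params}"

def poll_url(task_id: str) -> str:
    """Build the URL polled until the solved token is ready."""
    params = urlencode({"key": API_KEY, "action": "get", "id": task_id})
    return f"https://2captcha.com/res.php?{params}"

print(submit_url("SITE_KEY_FROM_PAGE", "https://example.com/login"))
```

You would fetch the submit URL, read the returned task id, then poll the poll URL every few seconds until the token arrives.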
Keeping cookies and reusing them is a good way to skip a lot of anti-scraping screening. Many captcha providers store cookies after you have successfully solved a captcha, and once you make requests with those cookies, they skip checking whether you are a real user. Most websites likewise set cookies after you have passed their tests, and they will not retest you until the cookies expire.
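With the standard library, cookie persistence is a matter of attaching a cookie jar to your opener so that cookies set on one request are replayed on the next. A minimal sketch (the URLs in the comments are placeholders):

```python
# Sketch: reuse cookies across requests so a server's "already verified"
# cookie is sent back automatically on later visits.
import http.cookiejar
import urllib.request

jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Every request made through `opener` now stores and replays cookies:
# opener.open("https://example.com/")      # first visit, server sets cookies
# opener.open("https://example.com/data")  # cookies sent back automatically
```

To survive crawler restarts, `http.cookiejar.LWPCookieJar` can additionally save the jar to disk and load it again.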
As a web scraping bot developer, there are times you will want to scrape content disallowed in robots.txt files, scrape really fast, and do other things that identify you as a web bot. In such cases, posing as a human being is not even an option. You can still scrape by following the methods below:
You should sign up for proxies and make your crawler tunnel through them to avoid getting blocked, especially when you are not identifying yourself as a human user. That way, if the server blocks the crawler, it blocks the IP of the proxy you are using, not your crawler's public IP.
To keep your proxies from being blocked, you can maintain many of them and make the scraper pick one at random after shuffling the list with a good algorithm like the Fisher-Yates shuffle. And if you are using InstantProxies, you can refresh your proxies in case some get blocked.
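The rotation idea can be sketched as below. Python's `random.shuffle` is itself an implementation of the Fisher-Yates shuffle, and the proxy addresses here are fake documentation addresses:

```python
# Sketch: rotate through a shuffled proxy list, one proxy per request.
import random
from itertools import cycle

proxies = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]
random.shuffle(proxies)    # in-place Fisher-Yates shuffle
rotation = cycle(proxies)  # endless round-robin over the shuffled list

def next_proxy() -> str:
    """Return the proxy to use for the next request."""
    return next(rotation)

print(next_proxy())
```

Each outgoing request then tunnels through `next_proxy()`, so consecutive requests arrive at the server from different IP addresses.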
A smarter way is signing up for backconnect proxies. With a backconnect proxy, you are assigned a proxy to tunnel through, and after a few requests or minutes, depending on the backconnect type, the proxy changes automatically.
This way you can scrape really fast, and even scrape content disallowed in the robots.txt file, while the server does not know which IP address to block.
With some proxies, your public IP can still be detected. When signing up for a proxy service, make sure it is a highly anonymous one, even if that means asking questions of the provider's customer service.
Another way servers identify web bots is by their user agents. Most web scraping bot developers neglect to set a trusted user agent, and when they do that, very basic and easily blocked user agents are used, for example curl/7.71, python-requests, or node.
You could also switch to a more trusted user agent that servers would not want to deny access to like the user agents of Google Chrome, Firefox, Safari, Opera and the other modern browsers.
A better way to avoid being blocked is to keep an array of many trusted user agents and randomly select one after shuffling them well.
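That can be sketched as follows. The strings below are examples of real browser user agents from around this article's publication date; in practice you would keep the list up to date:

```python
# Sketch: pick a random trusted User-Agent for each outgoing request.
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:80.0) "
    "Gecko/20100101 Firefox/80.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/13.1 Safari/605.1.15",
]

def random_user_agent() -> str:
    """Return a randomly chosen browser User-Agent string."""
    return random.choice(USER_AGENTS)

print(random_user_agent())
```

Set the chosen string as the `User-Agent` header on each request so consecutive requests do not all share one fingerprint.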
Yes, with the above tips you can get away with scraping over 99.99% of websites. But why not let us take the hassle away so you can focus on extracting quality data without the infrastructure headache? Our platform abstracts the backend operations so you can scrape anonymously and safely.