At WebAutomation.io, it is very important to us to be transparent and open with our customers and stakeholders about what we do, and what we don't do, when scraping the web. Our vision has always been one of an open and easily accessible web, allowing our customers to access data from the web without barriers and to use that data to grow their businesses.
We also acknowledge that we must respect the websites we scrape, so we design our products to make a deliberate effort to ensure our activities do not harm those websites. It's not always easy, but we are committed.
This article gives you some background on web scraping and the best practices we follow.
Disclaimer: We are not your lawyer, and the recommendations in this article do not constitute legal advice. They are based on our experience working in the industry for years with hundreds of companies. If you want assistance with your specific situation, you should consult a lawyer.
Web ‘scraping’ is simply automated access to websites, and it is lawful. Legal departments know this, which is why some of the largest companies in the world use it to convert the web into structured data for their businesses. As scraping has become an accepted part of business, it is important that every company has a web scraping strategy, and we can help with this.
Like any technology, it can be used for good or bad. If we fail to build controls into our tool, our customers could misuse it. Here is what we do:
1. Only scrape PUBLICLY available data
We only collect data that is published on a website for everyone to access, similar to what a search engine can see.
2. No authentication barriers should be broken to retrieve content
We will not exceed any authorised access, i.e. we will not hack into any system in order to retrieve data.
3. Our scraping activities should not overload or burden a website's services (excessive volumes)
This is probably the most important rule, and we enforce it in the following ways (see the sketch after this list):
- Limiting the number of concurrent requests to a website from a single IP
- Enforcing delays between requests made by our crawlers
- Scheduling crawls during a website's off-peak hours where possible
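As an illustration only, here is a minimal Python sketch of this kind of request throttling, using asyncio and the aiohttp library. The concurrency cap, delay value, user agent, and URLs are hypothetical examples, not our production configuration.

```python
# Illustrative throttling sketch (hypothetical values, not production settings).
import asyncio
import aiohttp

MAX_CONCURRENT = 2    # cap on simultaneous requests to one site from a single IP
DELAY_SECONDS = 1.5   # pause between requests made by the crawler

async def fetch(session, semaphore, url):
    async with semaphore:                      # enforce the concurrency limit
        async with session.get(url) as resp:
            body = await resp.text()
        await asyncio.sleep(DELAY_SECONDS)     # enforce the delay between requests
        return body

async def crawl(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    headers = {"User-Agent": "ExampleBot/1.0 (+https://example.com/bot-info)"}
    async with aiohttp.ClientSession(headers=headers) as session:
        return await asyncio.gather(*(fetch(session, semaphore, u) for u in urls))

if __name__ == "__main__":
    pages = asyncio.run(crawl(["https://example.com/a", "https://example.com/b"]))
    print(f"Fetched {len(pages)} pages politely")
```

In practice the right limits depend on the size and capacity of the target site; the point is simply that concurrency and pacing are explicit, configurable controls rather than an afterthought.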
4. Follow crawler paths similar to those of search engines
Search engines like Google and Bing crawl every public website they can, collecting only publicly available data. We follow similar paths to avoid interfering with any website's operation, since websites already accommodate search engine crawlers.
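One practical way to stay on the same paths search engines use is to check a site's robots.txt before crawling. Below is a minimal sketch using Python's standard urllib.robotparser module; the bot name and URLs are hypothetical examples.

```python
# Check robots.txt before crawling: the same file search engine bots consult.
from urllib.robotparser import RobotFileParser

BOT_NAME = "ExampleBot"  # hypothetical user agent name

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

url = "https://example.com/products/widget-123"
if parser.can_fetch(BOT_NAME, url):
    print(f"{url} is allowed for {BOT_NAME}")
else:
    print(f"{url} is disallowed by robots.txt; skip it")

# Some sites also declare a preferred crawl delay in robots.txt.
delay = parser.crawl_delay(BOT_NAME)
if delay:
    print(f"Site requests a delay of {delay} seconds between requests")
```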
Guidelines for our users processing web scraped data
1. Obey GDPR Guidelines
In web scraping, GDPR applies to the scraping of personal data of EU citizens. Any of our users planning to scrape personal data of EU citizens should have a lawful basis for doing so.
Examples of personal data include: name, email, phone number, address, username, IP address, bank or credit card information, medical data, and biometric data.
2. Obey Copyright laws
Any user scraping data through our platform should understand that if the data is copyrighted, they must respect that copyright when re-using it. Typical examples of copyrighted material on the web include pictures, videos, and music.
Key terms used in this article:
Robots.txt: a text file webmasters create to instruct web robots how to crawl pages on their website.
Terms of Service: a document published by a website covering a range of rules governing the behaviour of the website's or service's users.
GDPR: the General Data Protection Regulation, in force in the European Union since 2018, gives residents control over their own data. The regulation prevents businesses from doing whatever they want with personally identifiable data such as names, addresses, phone numbers, and emails.
LinkedIn vs hiQ: the Ninth Circuit Court of Appeals ruled that automated scraping of publicly accessible data likely does not violate the Computer Fraud and Abuse Act (CFAA).