Is web scraping legal or not? (2024)

General things to consider before doing web data extraction

Disclaimer: This blog post is not legal advice in any respect. The legality cannot be generalised, as the laws are different in each country.

Google built its business on continuously scraping and indexing others’ content. Do you think they are doing something illegal or unethical? No, they are not. They add significant value to the data they extract.

Web scraping is not an illegal activity, but that doesn’t mean you can scrape any site you want. Some sites explicitly prohibit any sort of automated data extraction, either in their robots.txt file or on their Terms of Service page.

There are some general considerations for the legality of web scraping:

Robots.txt

It is probably the first thing to check before scraping a website. Robots.txt is used to communicate with web crawlers and other web robots. The file tells a robot which areas of the website should not be processed or scanned. It is located at the root of the website hierarchy (e.g. https://www.example.com/robots.txt).
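Because the file always sits at the root, its address can be derived from any page URL on the same site. The following is a minimal sketch using Python’s standard library; the helper name `robots_txt_url` and the sample page URL are illustrative assumptions, not part of any official tool.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(page_url)
    # Keep only the scheme and host; drop the path, query and fragment,
    # because robots.txt lives at the root of the site.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/products/item?id=42"))
# https://www.example.com/robots.txt
```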

User-agent: *
Disallow:

If you find this in the robots.txt file of a website you’re trying to crawl, you’re in luck. This means all pages on the site are open to be crawled by bots.

User-agent: *
Disallow: /

This is a pretty clear signal to avoid scraping the site.
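The two cases above can also be checked programmatically. Here is a small sketch using `urllib.robotparser` from the Python standard library. The user agent name `my-bot`, the helper `is_allowed`, and the sample URL are assumptions made for illustration. The rules are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    # parse() accepts the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The two robots.txt examples discussed above.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

page = "https://www.example.com/some/page"
print(is_allowed(ALLOW_ALL, "my-bot", page))  # True
print(is_allowed(BLOCK_ALL, "my-bot", page))  # False
```

For a live site, the same parser can load the file itself via `set_url()` followed by `read()`, instead of parsing a string.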

Terms of Service.

If you are considering web scraping, you should also check the website’s “Terms of Use” or “Terms of Service”.

I think that a website’s robots.txt and “Terms of Use” should be coordinated and complement one another, because robots that crawl multiple sites probably don’t analyze “Terms of Use”. A polite crawler, however, reads and obeys the robots.txt rules before fetching a webpage.

Many websites have clauses in their “Terms of Service” that limit how you can use the data found on the site. If you violate the “Terms of Use”, legal action can be initiated against you for breach of contract.

However, if a Terms of Use provision does not say that it limits access by bots, spiders, etc., crawling is fine.

If a website clearly states that web scraping is not allowed, you must respect that.

Nowadays Amazon, like many other websites, provides an easy way to access its data through an official API, the Product Advertising API. However, if the API does not provide enough data, you can try to scrape the website. Amazon allows crawling of its pages.

Copyright infringement.

What do you want to do with the extracted data? If it is intended for your own personal use, then it is legal, as it falls under the fair use doctrine.

Fair use permits limited use of copyrighted material without having to first acquire permission from the copyright holder.

Technically, there is absolutely no difference between an automated script accessing a website and a human viewing it in a browser.

The complications start with reproducing copyrighted content.

Facts themselves are not protected by copyright. A narrative work that includes or explains facts can be protected by copyright (e.g. an encyclopedia is copyrightable).

But rephrasing or reorganizing the data gets you around that.

Denial of Service.

Big, popular websites are built to handle high traffic. Smaller ones may not be so robust and may not be ready to handle too many requests per second, which degrades the site’s performance and can shut out other users. Malicious hackers use this tactic in what’s known as a “Denial of Service” attack.

Why does this happen? Automated scrapers “read” a website’s pages much more quickly than a human could. Since not every site makes it clear how robust its server is, it is tricky to know how many requests will overload it.

No matter whether you are a hacker or just a researcher, causing a Denial of Service on a site can result in legal action being taken against you.

Here are some considerations to make sure your crawler doesn’t hit a website too hard:

Respect the delay that crawlers should wait between requests by following the robots.txt Crawl-Delay directive.
Increase scraping intervals to avoid server overload.
Set your scraper to operate during the site’s off-peak hours.
Smaller companies use smaller servers, so don’t scrape them as aggressively as, say, a giant corporation’s web site.
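The first two points can be sketched in a few lines of Python. This is one possible approach under stated assumptions, not a definitive implementation: the 5-second fallback is an arbitrary conservative choice rather than a standard value, and `polite_delay`, `crawl`, the `my-bot` user agent and the `fetch` callable are hypothetical names. `fetch` stands in for whatever function actually downloads a page.

```python
import time
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY_SECONDS = 5.0  # assumed fallback, not a standard value


def polite_delay(robots_txt: str, user_agent: str,
                 default: float = DEFAULT_DELAY_SECONDS) -> float:
    """Return the Crawl-delay declared for user_agent, or a fallback."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    declared = parser.crawl_delay(user_agent)  # None if no Crawl-delay applies
    return float(declared) if declared is not None else default


def crawl(urls, delay, fetch, sleep=time.sleep):
    """Fetch each URL in turn, pausing `delay` seconds between requests.

    `fetch` is a placeholder for the real download function; `sleep` is
    injectable so the pause can be replaced in tests.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        results.append(fetch(url))
    return results


robots = "User-agent: *\nCrawl-delay: 10\nDisallow: /private/\n"
print(polite_delay(robots, "my-bot"))  # 10.0
```

Passing the result of `polite_delay` as the `delay` argument of `crawl` keeps the request rate at or below what the site asks for. For a small site with no declared delay, choosing a larger fallback is the safer option.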

When in doubt, ask

And finally, if it’s not clear from a website what is permitted, contact the webmaster and ask whether, and what, you’re allowed to harvest.

Photo by Nadine Shaabana on Unsplash

Wrap Up

So web scraping is absolutely legal if done right. Furthermore, scraping can provide many benefits to all involved.

There are plenty of great use cases for web scraping:

Retailers use web scraping to monitor their competitors’ prices and collect product reviews for analysis.
Lawyers look up past judgment reports for case references.
Recruiters collect people’s profiles.
Media companies follow trending topics and look for fresh content for publications.