The web scraping industry, like the digital marketing industry, is one that would be seriously affected by the actions of Data Protection Authorities.
According to a report on web scraping from Stellar, the market for web scraping software and services may grow at a CAGR of 133% from around USD 800 million at present.
However, the emergence of data protection laws across the globe is likely to be a serious threat to the development of the industry.
DPDPA 2023 provides that if personal information is “Publicly made available by a data principal”, the Act may not apply to such data. A question therefore arises whether personal data available on the web, either on websites or on platforms like LinkedIn, Twitter or Facebook, can be freely scraped and used by businesses.
Most platforms like LinkedIn have themselves made “Scraping” a licensable service, and therefore any company which scrapes data from these platforms will be liable to the platform if it violates the terms of the contract. But whether the platform itself has the power to license scraping is debatable. This permission has to be part of the consent sought from the data principal. If the data principal has provided the data for a specific purpose, its use for any other purpose, including monetization by further licensing, should be considered a secondary purpose.
If the platforms are clear in their notice and seek explicit consent, “Consent to allow Scraping of data by any web crawler” can be considered as not part of the basic consent. It is likely that many data principals who use the platform would agree to their profile being visible to any visitor to the profile page, but would not permit it to be scraped for use by a third party for its own monetization.
If this provision is strictly applied, the business of “Web scraping” may suffer adversely.
These platforms also need to determine whether to incorporate a default condition that permission from the data principals is required before scraping.
DGPSI recommends that platforms conduct their own DGPSI audits and set appropriate compliance conditions applicable for different jurisdictions.
In this context, we may note that many of the GDPR supervisory authorities are issuing guidelines for web scraping.
For example, the April 30, 2020 guideline of CNIL states:
When individuals share their personal data with one data controller, it is not reasonably expected that they will receive direct marketing from another company – another company may re-use their data for such purposes only with the individuals’ consent.
Similarly, when a company re-uses publicly available online data of individuals in order to send direct marketing communications about its products and services by e-mail or through automated calling systems, the company must obtain the individuals’ consent before sending.
The guidelines therefore expect that Data Controllers, before using web scraping tools, should:
- Verify the nature and origin of the data that will be scraped
- Minimize data collection
- Provide notice to individuals
- Manage the contractual relationship with the web scraping service provider
- Carry out a Data Protection Impact Assessment (“DPIA”) if necessary
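As a rough illustration of how a data controller might turn the checklist above into something operational, the following is a minimal Python sketch. It is not taken from the CNIL guideline: the field names, the `ALLOWED_FIELDS` allowlist, the `minimize` helper and the `ScrapeChecklist` class are all hypothetical names introduced here, and the sketch covers only the record-keeping side of compliance, not the legal assessment itself.

```python
from dataclasses import dataclass

# Hypothetical allowlist: only the fields the stated purpose actually needs.
ALLOWED_FIELDS = {"name", "job_title", "company"}


def minimize(record, allowed=ALLOWED_FIELDS):
    """Keep only allowlisted fields from a scraped record (data minimization)."""
    return {key: value for key, value in record.items() if key in allowed}


@dataclass
class ScrapeChecklist:
    """Tracks the pre-scraping steps listed above (names are illustrative)."""
    source_verified: bool = False
    notice_provided: bool = False
    vendor_contract_in_place: bool = False
    dpia_done_or_not_required: bool = False

    def outstanding(self):
        """Return the names of the steps that are not yet completed."""
        return [name for name, done in vars(self).items() if not done]


# Example usage with made-up data.
raw = {"name": "A. Person", "email": "a@example.com", "company": "Example Ltd"}
clean = minimize(raw)  # the e-mail address is dropped before storage

checklist = ScrapeChecklist(source_verified=True)
pending = checklist.outstanding()  # steps still to be completed before scraping
```

An allowlist is used rather than a blocklist so that any field not explicitly justified by the purpose is discarded by default.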
Recently, the Netherlands authority also issued guidelines on scraping. The key takeaways from the guidelines are as follows.
1. Provides a clear definition of scraping and distinguishes it from web crawling.
2. Discusses the stringent conditions under which scraping can meet the ‘legitimate interest’ basis, emphasizing that mere commercial interest is not sufficient.
3. Highlights the significant privacy risks associated with scraping, including the inadvertent collection of sensitive and criminal personal data, which often makes lawful processing challenging.
4. Advises on conducting a DPIA to assess risks and ensure compliance with GDPR before initiating any scraping projects.
5. Points out the complexities of using scraped data to train algorithms, stressing the need for ethical considerations to prevent biases and inaccuracies.
An English version of the guideline is available here
Naavi