Key takeaways
Data scraping is a booming multi-billion-dollar industry that involves extracting data from online sources, often using AI tools, with significant implications for privacy and intellectual property.
There are serious privacy issues and legal challenges related to scraping, as illustrated by cases like Clearview AI's use of facial recognition technology.
Governments and regulatory bodies are increasingly focusing on addressing the ethical and legal challenges of data scraping, with new laws and regulations being introduced to protect personal privacy and intellectual property rights.
Data scraping. It’s a multi-billion-dollar industry, and you might not even know about it.
What is scraping?
Scraping is the practice of automatically extracting data from online sources using purpose-built data collection software, and it’s a booming business. The global market for web scraping alone is currently valued at US$4.9 billion and is projected to grow by 28% over the next eight years [1].
In 2019, the personal details of 533 million Facebook users in most countries of the world were scraped [2]. The data included users’ locations, full names, email addresses and even phone numbers. All of this user information was placed in a publicly available database, and it went undetected for 18 months. In this instance, Facebook described the scrapers as “malicious actors”; however, there are myriad reasons why people scrape: to on-sell data, to gain competitive intelligence or to train artificial intelligence (AI) bots. Famously, a whistleblower at Nvidia claimed the firm is scraping a ‘human lifetime’ worth of YouTube content each day to do just this [3].
When ChatGPT is asked, “do you scrape data?”, it replies with a complete rejection of the suggestion and promises that it only generates responses based on “a mixture of licensed data, data created by human trainers, and publicly available information.” This response is undoubtedly considered dubious by the plaintiffs of a new class action filed in August 2024, who allege that OpenAI used millions of YouTube video transcripts to train its chatbot.
With more data available online than ever, scrapers have developed AI programs to take over the task of crawling and gathering data. Top-ranked AI tools such as ‘Bright Data’, ‘Octoparse’ and ‘ScrapingBee’ publicly boast capabilities such as user-friendly ‘no coding needed’ interfaces, CAPTCHA avoidance and proxy management.
So as AI robots are working overtime to collect data, what are the relevant privacy issues and intellectual property (IP) implications?
Privacy issues
AI data scraping has been on the agenda of governments internationally for some time, culminating in a joint statement issued in 2023 by the Office of the Australian Information Commissioner and 11 other international privacy regulators, including those of the UK, New Zealand and Canada. In it, they stressed that companies which host their users’ personal information online (such as Meta) have obligations to protect that data, and that personal information which is discoverable online is still subject to privacy laws.
The Privacy Act 1988 (Cth) doesn’t mention scraping specifically; however, it sets out explicit rules on the collection of personal information. According to Australian Privacy Principle (APP) 3, an entity must only collect personal information that is reasonably necessary for (or, in the case of government agencies, directly related to) one or more of its functions or activities.
Additionally, sensitive information (defined in section 6(1) to include personal details such as health information, biometric information, religion, ethnicity and trade union membership) can only be collected, used and stored with consent. Given that AI scraping is automated, compliance in this context becomes murky, as illustrated recently in Clearview AI Inc and Australian Information Commissioner [2023] AATA 1069.
Clearview AI is an international facial recognition software company whose product is supposedly sold exclusively to law enforcement agencies across the world, including in Australia. In the company’s early stages, however, a billionaire investor, John Catsimatidis, is on the record as having used the software to identify a man he saw on a date with his daughter [4].
The algorithm works by matching images of crime suspects against a database of over 30 billion images, according to Clearview AI itself [5]. The issue is that these reference images are obtained from online sources, including Facebook, Instagram and LinkedIn. The Administrative Appeals Tribunal (AAT) considered several issues, including whether Clearview AI had the necessary “Australian link” to be bound by the Australian Privacy Act. The tribunal found that it did.
Notably, Clearview AI was found to have breached APPs 1.2 and 3.3. Respectively, these require entities holding personal data to take ‘reasonable steps’ to comply with the APPs, and permit sensitive information to be collected only where consent is given.
The AAT held that facial images, when used for biometric purposes, are classed as 'sensitive information'. Because Clearview AI was collecting such sensitive information using scraping-like technologies without consent, it was found to have failed to comply with APP 3.3, and consequently to have breached APP 1.2 as well.
However, this ruling, like similar litigation internationally, raises a larger issue of enforceability. Although the AAT upheld the Information Commissioner’s demands that Clearview AI cease collecting, and destroy, all scraped images of individuals in Australia, Clearview AI says that it’s not that simple [6]. The company says it never collected data about the country of residence of the persons depicted in the images, and so cannot determine their location or nationality. Supposedly, nothing short of also scraping citizenship data, in addition to the images and their attached URLs, would enable it to make such deletions.
This highlights perhaps the most pressing issue with AI data scraping: laws are subject to clear jurisdictional limits, whereas much of the internet’s infrastructure has no clearly defined geographical boundaries. A regulator that seeks to enforce laws beyond its jurisdictional boundaries may exceed the legitimate exercise of its power. This was the case recently in eSafety Commissioner v X Corp [2024] FCA 499, where the judge held that:
In so far as the notice prevented content being available to users in other parts of the world, at least in the circumstances of the present case, it would be a clear case of a national law purporting to apply to ‘persons or matters over which, according to the comity of nations, the jurisdiction properly belongs to some other sovereign or State'...
Intellectual property up for grabs
Under the Copyright Act 1968 (Cth), original works such as images, music and literary works are protected. With strong penalties in place for copyright infringement, this poses additional challenges for data scrapers, even when they access publicly available information.
Associated Press v. Meltwater US Holdings (2013) is an early data scraping case from the United States, which found that reproducing copyrighted news articles was not protected by the 'fair use' defence. In this case, Meltwater used a web crawler to extract articles written by the Associated Press and made excerpts available on its paid, subscriber-based news search database. The Australian Copyright Act equivalent, the 'fair dealing' exception, is even narrower: rather than involving a judgment of factors such as nature and substantiality, it rigidly outlines specific instances where copyrighted material may be used without permission, such as for satire and legal advice.
In Facebook v Power Ventures (2009), an action in copyright was only upheld against the scraping firm because it had copied the whole profile page, including Facebook-owned graphics and non-user content. After all, users who post to Facebook retain the IP, and simply grant Facebook a licence to show the content.
However, modern AI scraping presents greater challenges, particularly concerning traceability. While traditional scraping often involved the republication of copyrighted material, the advent of AI has shifted the practice toward using scraped data to train large language models, which then produce new material derived from that training data.
The New York Times has recently alleged that OpenAI used its copyrighted articles to train ChatGPT. This allegation is similar to those made in a class action involving The Authors Guild and seventeen other writers, who claim that their books were downloaded by OpenAI from pirate e-book websites and used to train ChatGPT. Copyright actions regarding the training of large language models are still novel and largely untested. Although these actions are within a US context, a finding that OpenAI did infringe would likely lead to significant changes in how licensing for generative AI might work in the future.
Regulatory development
Internationally, policy efforts to combat the prevalence of scraping have so far focused largely on personal privacy rather than copyright protection.
In Europe, the General Data Protection Regulation (GDPR) levies substantial fines for non-compliance with its strict data protection rules, not dissimilar to Australia’s Privacy Act 1988 (Cth). However, notable distinctions include the requirement of explicit and informed consent for data processing, the right to erasure of information and the mandate of ‘privacy by design’ in development.
Also, undoubtedly in response to the Clearview AI scandal, the EU’s new Artificial Intelligence Act, which entered into force in August 2024, prohibits certain AI practices, including the untargeted scraping of facial images to create recognition databases.
Having already raised penalties for data breaches in 2022, Australia is set to introduce new legislation before the Federal Parliament in 2024 to amend the Privacy Act once again. Some of the proposed changes, which have been "agreed to in principle," include adding obligations to complete privacy impact assessments for high-risk activities related to biometrics and facial recognition, implementing an overarching ‘fair and reasonable’ test for the handling of personal information, and establishing a right to erasure.
Data scraping best practices
Scraping undoubtedly raises significant ethical questions and serious legal challenges. Nevertheless, it has become a prevalent practice and a reality of the AI revolution. It offers unparalleled, cost-effective access to real-time data, which helps tailor consumer experiences and automate business processes. However, as regulatory reforms work to address privacy and intellectual property concerns, companies engaging in data scraping should follow some best practices to navigate these issues responsibly.
An example of integrity in web scraping can be observed at the Australian Bureau of Statistics (ABS), which uses web scraping software to extract publicly available information. The ABS clearly posts a disclaimer on its website outlining the parameters of its data collection practices. It reassures readers that personal information is not scraped, highlights the purpose of reducing the burden on data providers, and demonstrates its awareness of the relevant Australian laws with which it must comply.
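For teams that build their own scrapers, two of these practices can be sketched in code: checking a site's published crawling rules before fetching a page, and discarding personal information before anything is stored. The Python sketch below is illustrative only. It relies on the robots.txt convention, which this article does not otherwise discuss, and the rules, the field names, the bot name and the example.com URLs are all hypothetical. It is not a description of the ABS's software, and it is not a substitute for legal advice on the Privacy Act.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; a real scraper would download these
# from the target website before crawling it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
"""

# Hypothetical field names treated as personal information in this sketch.
PERSONAL_FIELDS = {"full_name", "email", "phone", "location"}


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the site's robots.txt rules allow this agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def strip_personal_fields(record: dict) -> dict:
    """Keep only the fields that are not listed as personal information."""
    return {key: value for key, value in record.items() if key not in PERSONAL_FIELDS}


if __name__ == "__main__":
    for url in ("https://example.com/prices", "https://example.com/members/jane"):
        print(url, may_fetch(ROBOTS_TXT, "example-bot", url))

    scraped = {"product": "milk", "price": 2.5, "email": "jane@example.com"}
    print(strip_personal_fields(scraped))
```

Filtering at the point of collection, rather than after storage, reflects the idea in APP 3 that only information reasonably necessary for an organisation's functions should be collected in the first place.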
References
1. Global Market Insights. (2024). Alternative data market size, share, & growth report, 2024-2032. Global Market Insights.
2. Holmes, A. (2021). Stolen data of 533 million Facebook users leaked online. Business Insider.
3. Cole, S. (2024). Nvidia's AI scraping foundational model and the Cosmos project. 404 Media.
4. Hill, K. (2020). Before Clearview became a police tool, it was a secret plaything of the rich. The New York Times.
5. Liu, T. (2023). How we store and search 30 billion faces. Clearview AI.
6. Wilson, C. (2024). Clearview AI is still collecting photos of Australians for its facial recognition database.