Is Web Scraping Legal for AI Training? NYT v. OpenAI, the EU AI Act, and Opt-Outs
Scraping public web data to train AI models is legally unsettled in the US — the major copyright lawsuits, led by New York Times v. OpenAI, are still being litigated. In the EU, the rule is clearer: commercial text and data mining is permitted under the DSM Directive unless the rightsholder has opted out in a machine-readable way, and the AI Act requires model providers to honor those opt-outs.
This is the newest and fastest-moving corner of web scraping law. The cases that defined scraping over the last decade — hiQ, Bright Data — were about access: who may collect public data. The AI-training fight is about use: what you may do with copyrighted content after collecting it. Different statutes, different defendants, different outcomes so far.
The US: copyright is the battleground
In December 2023, The New York Times sued OpenAI and Microsoft, alleging that millions of Times articles were used to train GPT models without permission. The core copyright claims survived OpenAI's motion to dismiss in 2025, and the case remains in active litigation — legal scholars describe it as the first big test of fair use for AI training. Dozens of parallel suits from authors, artists, and music publishers are working through the courts on the same theory.
Until appellate courts rule on fair use for training, US law gives no settled answer. What is settled: the access cases still apply. Under hiQ v. LinkedIn (Ninth Circuit), scraping public pages while logged out is not a CFAA violation, and in Meta v. Bright Data (2024) a federal district court held that Meta's terms did not bind a scraper collecting public data while logged out. The open question is what copyright law says about the training itself.
The EU: permitted, with a machine-readable off switch
The EU answered by statute instead of litigation. Article 4 of the DSM Copyright Directive permits commercial text and data mining (TDM), which European courts and regulators treat as covering web scraping for AI training, unless the rightsholder has expressly reserved their rights in a machine-readable format. The Hamburg Regional Court applied the TDM exceptions in the first AI-dataset case, Kneschke v. LAION (2024), and subsequent European decisions have confirmed that scraping is a form of TDM.
The AI Act builds on this: Article 53(1)(c) requires providers of general-purpose AI models to maintain a copyright policy that identifies and honors TDM opt-outs. In practice, robots.txt directives and similar machine-readable signals have become the de facto opt-out mechanism — which is why AI labs publish dedicated crawler user-agents and why publishers increasingly block them.
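As a rough illustration of how a crawler might read such a signal, here is a minimal Python sketch using the standard library's urllib.robotparser. The sample robots.txt, the example.com URL, and the is_opted_out helper are hypothetical and exist only for this example. GPTBot is used as one example of a published AI crawler user-agent. The sketch checks robots.txt only. It does not cover other machine-readable reservations, and it is not a statement of what any regulator accepts as sufficient.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve: it blocks one
# AI-training crawler by user-agent and allows everything else.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_opted_out(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt disallows user_agent from fetching url.

    Illustrative helper (not from the article): it treats a robots.txt
    disallow rule as the opt-out signal and ignores any other mechanism.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"  # placeholder URL
    for agent in ("GPTBot", "SomeOtherBot"):
        print(agent, "opted out:", is_opted_out(SAMPLE_ROBOTS_TXT, agent, url))
```

In a real crawler the robots.txt body would be fetched from the target site before each crawl, and the result would be logged so the copyright policy can show which opt-outs were identified and when.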
Practical rules in 2026
- Training a model on scraped copyrighted content: high-risk in the US until the fair-use cases resolve; permitted in the EU only where no machine-readable opt-out exists.
- Honoring robots.txt is no longer just etiquette — in the EU it feeds directly into AI Act compliance obligations.
- Factual and public business data remains the safest category: facts aren't copyrightable, and the access precedents protect collecting them from public pages.
- B2B contact data for outreach is not AI-training data — it's governed by the privacy rules in our GDPR/CCPA guide, not the copyright fight.
For sales teams, the takeaway is reassuring: compliantly collected business contact data, of the kind platforms like Sales.co provide, sits in the well-settled category of factual, public-source business data, far from the unresolved copyright questions around model training.