Major News Outlets Push Back Against Web Archive Used for AI Training

Publishers want their content out, and they want it out now.

Twenty major news organizations are demanding removal from Common Crawl, the web archive used to train AI chatbots.

Across the long arc of media history, those who gather and distribute knowledge have always wrestled with those who profit from it. Today, twenty major news publishers — among them CNN, NBC, and USA Today — have formally demanded that Common Crawl, a nonprofit web archive foundational to AI training, remove their content and cease enabling its unauthorized use. The dispute is not merely legal; it is existential, touching on who owns the labor of journalism and whether the machines learning from that labor owe anything in return. The outcome may quietly redraw the economics of both the press and artificial intelligence.

Twenty publishers have drawn a hard line, formally demanding their reporting be stripped from one of the most widely used AI training archives in existence.
The conflict cuts deep: newsrooms that have already weathered the erosion of print revenue now face AI systems that can mimic their work without sending a single dollar back to the journalists who produced it.
Common Crawl, long regarded as neutral public infrastructure, suddenly finds itself at the center of a copyright reckoning it was never designed to adjudicate.
The News/Media Alliance's letter signals that publishers are done waiting for courts to settle the fair-use debate — they want removal now, not vindication later.
If the demands succeed, a cascade of similar claims from authors, photographers, and musicians could fundamentally raise the cost of training AI models and reshape who can afford to build them.

A coalition of twenty major news organizations, including CNN, NBC, and USA Today, has launched a formal campaign to remove their content from Common Crawl — the vast nonprofit web archive that AI companies have relied upon to train their language models. The News/Media Alliance delivered the demand in a letter sent Wednesday, marking a significant hardening of the media industry's posture toward AI data practices.

Common Crawl has long served as neutral infrastructure, systematically preserving web content for researchers and developers. But as generative AI scaled up, the archive became something more consequential: an enormous, readily available corpus of human-written text fed into commercial systems. Publishers argue that this specific use — training AI products that can replicate the value of journalism without compensating its creators — crosses a line that web archiving alone does not.

The economic stakes are not abstract. News organizations have spent years navigating declining print revenue and platform competition. Generative AI introduces a new pressure: if readers can extract answers from a chatbot trained on news content, the incentive to visit a newsroom's website — let alone pay for it — erodes further. Publishers are not objecting to archiving as a concept; they are objecting to their work becoming raw material for competitors who owe them nothing.

How Common Crawl responds will matter beyond this single dispute. Honoring removal requests at scale would require new systems and set precedents that could invite similar demands from across the creative economy. More broadly, if content creators can successfully reclaim their work from training datasets, the cost of building large AI models could rise substantially — concentrating that power among only the wealthiest developers. For now, the publishers have drawn their line. Whether the AI industry will respect it is the question the rest of the story turns on.

A coalition of major news organizations has begun a formal campaign to scrub their content from Common Crawl, the vast web archive that artificial intelligence companies have relied on to train their chatbots and language models. CNN, NBC, and USA Today are among twenty publishers who have now demanded that the nonprofit repository remove their articles and prevent future unauthorized use of their work for AI purposes.

The News/Media Alliance, a trade group representing newspapers and magazines, sent the letter to Common Crawl on Wednesday. The move signals a hardening line from publishers who have watched their reporting become raw material for AI systems without compensation or explicit permission. Common Crawl operates as a nonprofit that systematically archives web content—a practice that has long been foundational to how search engines and researchers access information. But as AI companies have scaled up their training datasets, the archive has become something else: a vast, readily available corpus of human-written text that can be fed into machine learning models.

The tension underlying this dispute is straightforward. Publishers invest in reporting, editing, and distribution. They own that work. AI companies, meanwhile, have argued that training on publicly available text falls within fair use or is otherwise permissible. Common Crawl has positioned itself as a neutral infrastructure project, not a party to the underlying copyright questions. But publishers are no longer willing to wait for those questions to be resolved in court. They want their content out, and they want it out now.

The letter from the News/Media Alliance frames the issue in terms of unauthorized use and intellectual property protection. Publishers are not arguing that web archiving itself is wrong; libraries and researchers have legitimate reasons to preserve digital content. What they object to is the specific use of their archived content to train commercial AI systems that compete with or diminish the value of the original reporting. A chatbot trained on news articles can generate summaries, answer questions, and produce text that resembles journalism without any revenue flowing back to the newsrooms that produced it.

This coordinated pushback reflects a broader reckoning in the media industry. For years, publishers have struggled with the economics of digital news—declining print revenue, competition from platforms like Google and Facebook for advertising dollars, and the difficulty of building sustainable subscription businesses. The rise of generative AI has added a new threat to that precarious equation. If readers can get answers from a chatbot trained on news content, why would they visit a news website? And if they do visit, why would they pay for a subscription?

Common Crawl's response to these demands remains to be seen. The organization has historically positioned itself as a public good, providing researchers and developers with access to web data. Honoring opt-out requests from major publishers would set a precedent and potentially require the organization to develop new systems for respecting content removal requests at scale. It could also invite similar demands from other copyright holders—authors, musicians, photographers—whose work has been incorporated into AI training datasets.

The broader implication is that the current model of AI training data acquisition may be unsustainable. If major content creators can successfully demand removal from archives, or if they can negotiate licensing agreements that require payment, the cost of training large language models could rise significantly. That could reshape the economics of AI development itself, potentially slowing the pace of innovation or concentrating it among companies wealthy enough to license content at scale. For now, the publishers are drawing a line. Whether Common Crawl and the AI industry will respect it remains an open question.

Notable Quotes

The News/Media Alliance urged Common Crawl to honor publishers' requests to remove content and prohibit unauthorized use, including for AI purposes.

— News/Media Alliance letter

The Hearth Conversation: Another perspective on the story

Inventor

Why are these publishers acting now, specifically? Web archives have existed for decades.

Model

Because the stakes have changed. A web archive used by researchers is one thing. A web archive feeding commercial AI systems that can generate text that looks like news—that's a different threat entirely. Publishers see their content being used to build products that compete with them.

Inventor

But isn't this content already public? If it's on the internet, doesn't that mean it's fair game?

Model

That's the legal question everyone's wrestling with. Publishers would say there's a difference between being indexed by a search engine and being fed wholesale into a training dataset. One drives traffic; the other potentially replaces the need to visit the original source.

Inventor

What happens if Common Crawl refuses to remove the content?

Model

Then we're likely headed toward litigation. Publishers have already shown they're willing to sue—there are several AI copyright cases in motion right now. This letter is partly a legal move, establishing that they made a formal demand.

Inventor

Could this actually change how AI companies train their models?

Model

It could. If major publishers successfully opt out, and if other content creators follow, the available training data shrinks. That makes training more expensive and potentially less diverse. It might force AI companies to negotiate licenses instead of just scraping.

Inventor

Is there a middle ground here?

Model

Possibly. Some publishers might accept payment for their content being used in training. Others might allow it for non-commercial research but not for commercial products. The question is whether AI companies are willing to pay, and whether publishers can collectively hold out long enough to force the issue.

Want the full story? Read the original at Bloomberg.
