News Publishers Block Internet Archive's Wayback Machine to Stop AI Training

Local journalism is further eroded as hundreds of local publications are removed from the historical record, eliminating documentation that may not exist elsewhere.
The Archive is collateral damage in a war that is not really about them.
Mark Graham, the Wayback Machine's director, describes why publishers blocking the Archive misses the actual problem.

Since late 2025, more than 240 news organizations across nine countries have instructed the Internet Archive to stop preserving their content — not because the Archive itself has wronged them, but because AI companies have used archived journalism as training data without permission or payment. The Archive, which has safeguarded over a trillion web pages since 1996 and serves courts, historians, and journalists as a primary tool of accountability, finds itself caught between a copyright war it did not start and a public mission it cannot abandon. In attempting to close a door that leads to AI companies, publishers may be sealing shut a window that has long let light into the historical record.

  • AI companies have trained language models on decades of archived journalism (structured, dated, high-quality) without paying or asking, pushing publishers into open litigation and defensive action.
  • The Internet Archive, a library that courts cite as evidence and Wikipedia links to over 2.6 million times, is being systematically blocked by the very institutions whose work it was built to protect.
  • USA Today's parent company alone accounts for hundreds of blocked local publications, erasing community journalism from the historical record at the precise moment local news is already disappearing.
  • Journalists, historians, and the Electronic Frontier Foundation are sounding alarms: without the Wayback Machine, publishers can quietly edit or retract stories with no accountability trail remaining.
  • The EFF and press freedom advocates argue the correct target is the AI companies in court — not the public infrastructure caught in the crossfire — but the blocking trend is escalating, not reversing.

Over the past several months, more than 240 news organizations — including the New York Times, CNN, USA Today, and The Guardian — have instructed the Internet Archive to stop crawling their websites. The reason is understandable: AI companies have been training language models on archived news content, decades of structured, high-quality journalism, without permission or payment. Publishers already in court with OpenAI and others see the Archive as an unguarded door. They have moved to close it.

But the Internet Archive is not a door. It is a library. Since 1996, it has preserved over a trillion web pages. Courts use it as evidence. Journalists use it to prove that articles were quietly edited after publication. Historians treat it as a primary source. Wikipedia links to more than 2.6 million archived articles across 249 languages. The Archive's director, Mark Graham, has been direct: his institution is collateral damage in a war that is not about them. The Archive already rate-limits access and restricts bulk downloads. The risk, he argues, comes from AI companies accessing content through channels the Archive monitors — not from the preservation itself.

The publishers' concern does not entirely dissolve, however. The New York Times pointed out that its archived content is being used by AI companies in direct competition with its own journalism. The Guardian, more cautiously, limited rather than fully blocked the Archive after identifying its APIs as a particular vulnerability — structured databases that AI businesses could readily extract from at scale.

The consequences reach well beyond the AI dispute. When articles are no longer archived, they become editable without accountability. The Wayback Machine has long been the primary instrument journalists use to document changes publishers make after publication — corrections, softened claims, removed quotes. The Electronic Frontier Foundation's Joe Mullin noted that the Archive often becomes the only place where those changes can be seen at all.

The human cost is sharpest at the local level. USA Today's parent company accounts for a large share of the blocked sites, effectively removing hundreds of local publications from the historical record — at a moment when local journalism is already in crisis and many of those articles exist nowhere else. A petition signed by over 100 working journalists has pushed back, describing the Wayback Machine as an irreplaceable tool for preserving the public record.

What is unfolding here is a compressed version of a structural problem running through the entire AI copyright debate. As direct scraping faces litigation, pressure shifts onto public infrastructure that publishers cannot fully control. Computer scientist Michael Nelson observed that the Internet Archive and Common Crawl are widely considered the good actors — yet they are absorbing the consequences of what the bad actors have done. The Electronic Frontier Foundation's conclusion is straightforward: sue the AI companies. Blocking the Archive solves nothing and destroys much. Whether publishers will recognize that distinction before the collateral damage becomes permanent remains the open question.

Over the past several months, more than 240 news organizations have taken the same defensive action: they have told the Internet Archive to stop crawling their websites. The New York Times implemented what amounts to a hard block in late 2025. CNN, USA Today, The Guardian, and at least 238 other publishers across nine countries have followed suit, each one instructing the Archive's bots to stay away. The reason is straightforward and, by most accounts, legitimate. AI companies are training their language models on archived news content—structured, dated, high-quality writing accumulated over decades—without permission or payment. The publishers are already in court with OpenAI and others over copyright violations. The Internet Archive, they argue, has become an unguarded door.

But here is the problem: the Internet Archive is not actually a door for AI companies. It is a library. Since 1996, it has preserved more than one trillion web pages. Courts cite it as evidence. Journalists use it to prove that articles were edited after publication. Historians treat it as a primary source. Wikipedia links to over 2.6 million archived news articles across 249 languages. It is, by almost any measure, one of the most significant public infrastructure projects the internet has ever produced. And now the very institutions whose work it has preserved are systematically dismantling its ability to do that work.

The Archive's own director, Mark Graham, has described the situation with precision: the Archive is collateral damage in a war that is not really about them. The Archive does limit bulk downloads. It rate-limits access. It maintains controls to prevent large-scale automated extraction. A 2023 Washington Post analysis found that AI training datasets had included material from the Archive, but that material came through the Archive's interfaces—which the Archive itself controls and restricts. Graham argues that the publishers' rationale for blocking the crawlers is unfounded. The risk is not from the Archive preserving content. The risk is from AI companies accessing that content through channels the Archive already monitors.

Yet the publishers' concern does not fully dissolve under scrutiny. Third parties can access the Archive's data regardless of the Archive's own intentions or safeguards. New York Times spokesperson Graham James stated plainly: Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with the Times. The Times invests enormous resources in original journalism, and that work should not be used without permission. The Guardian, more cautiously, limited rather than fully blocked the Archive's access after discovering it was a frequent crawler. Robert Hahn, the Guardian's head of business affairs, expressed particular concern about the Archive's APIs: readily available, structured databases of content that AI businesses could easily plug into and extract from.

The consequence of this blocking, however, extends far beyond AI companies. When a news article is no longer archived, it becomes editable without accountability. Publishers routinely amend stories after publication: correcting errors, softening claims, removing quotes. The Wayback Machine has been the primary tool journalists use to document those changes. The Electronic Frontier Foundation's Joe Mullin put the stakes plainly: the Internet Archive often becomes the only source for seeing those changes. There are real disputes over AI training that must be resolved in courts, but sacrificing the public record to fight those battles would be a profound and possibly irreversible mistake.

USA Today Co., the largest newspaper publisher in the United States, accounts for a large share of the blocked sites, effectively removing hundreds of local publications from the historical record. This happens at a moment when local journalism is already in crisis, and every preserved article represents documentation that may not exist anywhere else. A petition organized by Fight for the Future, signed by over 100 working journalists, has pushed back against the blocking trend, describing the Wayback Machine as a tool that preserves the public record at a time when many major media outlets are questioning whether to allow it to do so. The dispute is now escalating rather than resolving.

What is happening with the Internet Archive is a compressed version of a structural problem that runs through the entire AI copyright debate. The institutions designed to serve the public interest—a digital library, open web standards, publicly accessible archives—are becoming the path of least resistance for AI companies seeking training data. As direct scraping is increasingly blocked, litigated, and metered, pressure accumulates on the public infrastructure that publishers cannot control. Michael Nelson, a computer scientist at Old Dominion University, observed that Common Crawl and the Internet Archive are widely considered to be the good guys, yet they are being used by the bad guys like OpenAI. In everyone's aversion to being controlled by large language models, the good guys are collateral damage.

The Electronic Frontier Foundation concludes that the right response is not to block the Archive but to sue the AI companies directly. There are real disputes over AI training that must be resolved in courts. But the current path—blocking one of the internet's most vital public institutions—solves nothing and destroys much. The question now is whether publishers will recognize the distinction between the Archive and the companies actually training on its content, or whether the collateral damage will continue to accumulate.

Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us. The Times invests an enormous amount of resources in producing original journalism, and that work should not be used without our permission.
— Graham James, New York Times spokesperson
The Internet Archive often becomes the only source for seeing those changes. There are real disputes over AI training that must be resolved in courts. But sacrificing the public record to fight those battles would be a profound, and possibly irreversible, mistake.
— Joe Mullin, Electronic Frontier Foundation
The Hearth Conversation
Another angle on the story
Inventor

Why are publishers blocking the Archive if the Archive itself isn't the problem?

Model

Because the Archive makes the content easy to access at scale. Publishers see the API, see the structured data, and see a shortcut that AI companies could use. They're not wrong about the risk—they're just using a tool that destroys something they actually depend on.

Inventor

But the Archive says it already limits bulk downloads. Doesn't that address the concern?

Model

It does, technically. But publishers don't trust that those limits will hold forever, or that they're sufficient now. Once content is in the Archive, it's there. They want control they don't actually have.

Inventor

What happens to journalists when the Archive is blocked?

Model

They lose their primary tool for proving that articles were changed after publication. A publisher can quietly soften a claim or remove a quote, and there's no longer evidence it was ever there. The accountability disappears.

Inventor

Is this really about AI, or is it about something else?

Model

It's about AI, but it's also about publishers feeling powerless. They're tied up in copyright lawsuits. They can't control what OpenAI does. The Archive is something they can actually block, so they do. It's the wrong target, but it feels like action.

Inventor

What's the real solution here?

Model

Sue the AI companies directly. Make them pay for the content or stop using it. Don't destroy public infrastructure because you're angry at private companies. The Archive isn't the enemy.

Want the full story? Read the original at TNW ↗