Anthropic Reverses Secret AI Sabotage Policy After Research Community Backlash

pulling the ladder up behind them

How researchers described Anthropic's attempt to keep competitors from using Claude for AI development.

In the brief window between a policy's creation and its collapse, Anthropic revealed something essential about the tensions shaping AI's future: the impulse to control the pace of progress can quietly become the impulse to control who gets to participate in it. The company's plan to invisibly degrade Claude's performance for researchers building competing models lasted only days before the research community named it for what it was — not a safety measure, but a silent wall. Anthropic has since reversed course, committing to visible safeguards, and in doing so acknowledged that transparency, even when costly, is not optional in a field where trust is the only currency that compounds.

Anthropic embedded a hidden mechanism in Claude Fable 5 that would silently degrade performance for researchers working on competing AI systems — sabotage invisible to the very people it targeted.
The research community erupted, with policy experts and developers calling the policy a power grab that would have handed frontier AI development exclusively to a few well-funded labs while everyone else hit walls they couldn't see.
Critics warned the hidden restrictions would have devastated open-source AI projects and third-party safety evaluation firms that depend on honest model behavior — collateral damage dressed up as caution.
Anthropic reversed the policy within days, apologizing publicly and committing to make all safeguards visible, even as it acknowledged this would mean more false positives and blunter restrictions.
The company now faces the harder work of building precise classifiers that can enforce legitimate safety boundaries without punishing the broader research ecosystem it depends on for credibility.

When Anthropic launched Claude Fable 5, it included safety guardrails that were openly disclosed — rerouting users asking about sensitive topics like cybersecurity or biology to a less capable model. But tucked alongside these visible measures was something else entirely: a plan to secretly degrade Claude's performance for any researcher suspected of using it to train competing AI systems. The degradation would be invisible. Users would never know they'd hit a wall.

The policy survived only days. Researchers and policy experts pushed back hard, and Anthropic reversed course, committing to make its safeguards visible and issuing a public apology: "We made the wrong tradeoff and we apologize for not getting the balance right."

The backlash was pointed. Dean Ball, a senior fellow at the Foundation for American Innovation and former White House AI advisor, called the original policy "shockingly hostile." Will Brown of open-source startup Prime Intellect described it as Anthropic "pulling the ladder up behind them" — a bid to monopolize advanced AI research by quietly blocking everyone else. The stakes were concrete: Claude's coding agent had become a standard tool for open-source AI developers, and a secret sabotage policy would have left researchers wasting months on work the company had silently blocked, with no way to know why.

Anthropic's stated rationale was rooted in genuine concern: AI capabilities are outpacing society's ability to adapt, and slowing frontier development could give alignment research and policy time to catch up. A hidden safeguard, the company argued, would be harder to probe and circumvent. But critics noted a fatal flaw: developers couldn't follow rules they didn't know existed.

Now that the restrictions are visible, Anthropic faces a new problem of its own making. Broader detection nets mean more false positives — benign requests caught in restrictions meant for bad actors. What was designed as a surgical tool has become blunter, and the company says it's working to sharpen its classifiers. The timeline remains uncertain. Transparency, it turns out, has a price — but so does its absence.

Anthropic released Claude Fable 5 earlier this week with a set of safety guardrails meant to prevent misuse—some transparent, some decidedly not. The company would openly reroute users asking about cybersecurity, biology, or chemistry to a less capable model. But for researchers attempting to use Claude for frontier AI development, Anthropic had planned something different: deliberately degrade the model's performance in ways the user would never see. The sabotage would be invisible, a silent wall against anyone trying to train competing AI systems, which Anthropic's terms of service explicitly forbid.

The policy lasted only days before the research community pushed back hard enough to force a reversal. Anthropic announced it would make the safeguards visible after all, alerting users when it refused a request or rerouted them to a weaker model. The company issued a statement acknowledging the misstep: "We made the wrong tradeoff and we apologize for not getting the balance right."

The backlash came from researchers and policy experts who saw the hidden degradation as something far worse than a safety measure—as a power grab. Dean Ball, a senior fellow at the Foundation for American Innovation and former White House AI advisor, called it "shockingly hostile." Will Brown, research lead at the open-source AI startup Prime Intellect, described the original approach as Anthropic "pulling the ladder up behind them," suggesting the company wanted to monopolize advanced AI research. The concern wasn't abstract: Claude's coding agent had become a standard tool for developers working on open-source AI projects. A secret sabotage policy could have meant that only a handful of well-funded labs could perform cutting-edge research, while everyone else hit invisible walls.

Anthropic's reasoning was rooted in genuine anxiety about speed. The company argued that AI capabilities are advancing faster than society can adapt, and that slowing frontier development would give alignment research and policy time to catch up. It also cited national security: keeping advanced models out of the hands of foreign adversaries who might optimize chips or accelerate their own development. A hidden safeguard, the company reasoned, would be harder to probe and work around, allowing for more surgical restrictions.

But Brown and others pointed out a crucial flaw: developers wouldn't know when they were violating Anthropic's rules. The policy would have left them guessing, potentially wasting months of work on approaches the company had silently blocked. Beyond individual researchers, the restrictions threatened the growing ecosystem of third-party evaluation firms that test frontier models for safety and performance—work that could have been severely hampered if Anthropic was secretly degrading its model.

Now that the safeguards are visible, Anthropic faces a different problem. The company says it will need to cast a wider net, meaning more benign requests may trigger its restrictions. False positives will likely increase. The company says it's working to make its classifiers more precise, but the timeline is uncertain. What was meant to be a surgical tool has become a blunter instrument, the cost of transparency.

Notable Quotes

We made the wrong tradeoff and we apologize for not getting the balance right.

— Anthropic statement to WIRED

It felt like Anthropic was saying to the public, 'We don't trust anybody else to do AI research. We are the only ones who have to do AI research.'

— Will Brown, research lead at Prime Intellect

The Hearth Conversation: Another perspective on the story

Inventor

Why did Anthropic think secret degradation was the right move in the first place?

Model

They were genuinely worried about speed. If Claude can help someone build a competing AI model, and AI capabilities are doubling every few months, then every researcher with access to Claude is potentially accelerating the timeline. A hidden safeguard felt like the only way to slow that down without tipping off the people you're trying to stop.

Inventor

But that assumes Anthropic should be the only ones doing this work.

Model

Exactly. That's what the research community objected to. They said: if you believe AI safety is important, why would you prevent other labs from researching it? Why assume you're the only ones trustworthy enough?

Inventor

What happens now that the safeguards are visible?

Model

More friction. Researchers will know when they hit a wall, which is better for transparency. But Anthropic has to be more conservative now—they can't surgically block just the dangerous requests because users can see what's happening. So they'll probably block more things, including some harmless ones.

Inventor

Does that actually solve the problem they were worried about?

Model

No. It just moves the problem from invisible to visible. The concentration of AI research power is still there. It's just now everyone can see it happening.

Want the full story? Read the original at WIRED.
