Nature: AI Cannot Be Trusted to Write Scientific Reviews

Rigorous peer review produces better science than gentle review does

Research shows papers that faced tougher criticism during review went on to have greater influence on their fields.

As artificial intelligence grows more capable of mimicking the surface features of expertise, the scientific community is confronting a deeper question: whether the appearance of rigor can substitute for its substance. Nature's examination of AI in peer review arrives at a sobering conclusion — that the machinery of judgment, accountability, and hard-won disciplinary wisdom cannot yet be delegated to algorithms. The integrity of the scientific record, it turns out, depends not just on what gets checked, but on who is doing the checking and why it matters to them.

Academic journals, overwhelmed by submission volumes, are increasingly tempted to let AI systems conduct peer review — trading depth of judgment for speed and cost savings.
New research reveals a troubling gap: AI can flag statistical errors but cannot assess whether a question was worth asking, whether conclusions are overclaimed, or whether a methodology is truly adequate for its ambitions.
Data from Nature Communications shows that harsher human peer criticism correlates with higher citation impact — meaning the discomfort AI would spare authors may be precisely what makes science better.
If flawed research enters the literature at scale because AI missed what a seasoned expert would have caught, the damage compounds: other work builds on it, resources are misdirected, and public trust erodes further.
Publishers are being urged to draw a clear line — AI may assist with screening and administration, but the core judgment of whether research is sound and significant must remain a human responsibility.

In May of this year, Nature posed a question that cuts to the heart of how scientific knowledge is made and protected: can artificial intelligence be trusted to evaluate research? The answer its editors arrived at was a careful but firm no — not because AI lacks speed or processing power, but because peer review was never really about those things.

The pressure to automate is understandable. Journals receive thousands of submissions annually, and AI systems can read papers, flag issues, and generate summaries at a fraction of the cost of human expert time. But Nature's reporting argues that this efficiency misunderstands what peer review is actually for. A good reviewer asks hard questions, pushes back on weak reasoning, and demands clarity where they find ambiguity — and they do so partly because their own reputation is tied to what gets published. That accountability is not incidental; it is structural.

Research published in Nature Communications sharpened the point: papers that received tougher criticism during review went on to be cited more frequently and carry greater influence. Rigorous, demanding review — the kind that makes authors uncomfortable — produces better science. It is a form of gatekeeping that depends on something AI currently lacks: the wisdom to know why a finding matters, whether a question was worth asking, and when an author is overselling what their data can actually support.

The stakes of getting this wrong are not abstract. Flawed research that enters the literature becomes the foundation for future work. Errors propagate. Resources are wasted. Trust erodes. And correction, once something is published, is far harder than prevention.

Nature's reporting points toward a workable middle path: AI may have a legitimate role in initial screening and administrative tasks, but the core judgment of whether research is sound and significant enough to publish must remain in human hands. The harder question is whether the economic pressures on academic publishing will allow that standard to hold.

In May of this year, Nature published a podcast episode that posed a straightforward question: can artificial intelligence be trusted to evaluate scientific research? The answer, according to the publication's editors, is no—at least not yet, and perhaps not in the way the technology is currently being deployed.

The concern is not abstract. As academic journals face mounting pressure to process thousands of submissions each year, the temptation to automate peer review, the process by which expert scientists evaluate each other's work before publication, has grown stronger. AI systems can read papers quickly, flag methodological issues, and generate summaries. They can work around the clock. They cost far less than a human expert's time. But speed and efficiency, Nature's reporting suggests, are not what scientific peer review actually requires.

Peer review exists to protect the integrity of the scientific record. A good reviewer does more than check for obvious errors. They ask hard questions. They demand clarity where they find ambiguity. They push back on weak reasoning, questionable assumptions, and overclaimed conclusions. They do this partly because they understand the field deeply—they know what matters, what's been tried before, what the real stakes are. But they also do it because they have skin in the game. A reviewer's reputation is tied to the quality of work that gets published under their watch. That accountability matters.

Recent research published in Nature Communications examined the relationship between the harshness of peer review and the eventual impact of published papers. The data revealed something striking: papers that received tougher criticism during review went on to be cited more frequently and to have greater influence on their fields. This correlation suggests that rigorous, demanding peer review—the kind that makes authors uncomfortable, that forces them to strengthen their arguments and shore up their methods—actually produces better science. It's a form of quality gatekeeping that cannot easily be replicated by a machine.

AI systems, by contrast, lack something essential: they cannot truly understand why a finding matters, or why it doesn't. They can identify when a statistical test was applied incorrectly, but they cannot judge whether the question being asked was worth asking in the first place. They cannot weigh the significance of a result against the broader landscape of what is known. They cannot sense when an author is overselling their conclusions or when a methodology, while technically sound, is inadequate for the claims being made. These judgments require not just knowledge but wisdom—the kind that comes from years of working in a field, from having made mistakes and learned from them, from understanding the human and institutional contexts in which science actually happens.

The stakes of getting this wrong are real. If AI systems begin conducting peer review at scale, and if they lack the nuance to catch subtle but significant problems, then flawed research will enter the scientific literature more readily. Other researchers will build on it. Resources will be wasted. Public trust in science will erode further. The problem compounds because once something is published, it's harder to correct than it was to prevent in the first place.

Nature's reporting points toward a practical path forward: publishers need to establish clearer, more rigorous guidelines about where and how AI can be used in the editorial process. Perhaps AI can assist with initial screening, flagging papers that are obviously out of scope or riddled with red flags. Perhaps it can help with administrative tasks, organizing submissions or formatting references. But the core work of peer review—the judgment call about whether a piece of research is sound and significant enough to publish—needs to remain in human hands. The question now is whether the publishing industry will resist the economic pressure to automate and maintain that standard.

Notable Quotes

Papers that received tougher criticism during review went on to be cited more frequently and have greater influence on their fields

— Nature Communications research findings

The Hearth Conversation

Another angle on the story

Inventor

Why does it matter so much that peer review be harsh? Couldn't a gentler process still catch errors?

Model

Harshness isn't the point—rigor is. The data shows that papers which faced tough questioning ended up being more influential. That's because the questioning forced authors to think harder, to test their own assumptions, to make their case stronger. A gentle review might miss the places where thinking is sloppy.

Inventor

But AI can check methodology, right? It can spot statistical errors.

Model

It can spot some of them. But peer review isn't mainly about catching arithmetic mistakes. It's about asking whether the question matters, whether the approach is adequate for answering it, whether the conclusions follow from the evidence. Those are judgment calls that require understanding a field the way a human expert does.

Inventor

What happens if we let AI do peer review anyway?

Model

Weak research gets published. Other scientists build on it. The literature becomes less reliable. And the whole system loses credibility because people stop trusting that published work has actually been vetted.

Inventor

So AI has no role in publishing?

Model

It might have a role in the margins—helping with screening, with administration, with organizing the work. But the actual judgment about whether something is ready to publish? That still needs a human who understands the field and has something to lose if they get it wrong.

Inventor

What would it take for AI to be ready for this?

Model

Honestly, I'm not sure. You'd need a system that understands not just the content of a paper but the context—what's been done before, what the real open questions are, what matters in the field. That's not just information retrieval. That's wisdom.
