New benchmark reveals medical AI falls short in realistic diagnostic simulations

Passing an exam is not enough; clinical AI must gather information, handle uncertainty, and navigate bias.
AgentClinic reveals that strong performance on static medical questions doesn't translate to real-world diagnostic conversations.

A new benchmark called AgentClinic confronts medicine's oldest truth: knowing the right answer in the abstract is not the same as finding it in the room with a patient. Researchers have built a simulation that forces AI models to diagnose through dialogue, uncertainty, and incomplete information — the actual texture of clinical work — and found that the models best at passing exams are not necessarily best at practicing medicine. The study, published in npj Digital Medicine, does not condemn clinical AI so much as clarify what it still lacks: the capacity to reason well under the full weight of human complexity.

  • The leading AI model, Claude 3.5 Sonnet, outscored human physicians on curated exam cases at 62.1% versus 54% — a result that sounds like a breakthrough until the same model drops to 42.9% on real patient records.
  • Performance proved unexpectedly fragile: cutting diagnostic interactions from twenty to ten collapsed accuracy to 25%, and extending them to thirty also lowered accuracy, so more time to reason did not help.
  • Cognitive and implicit biases injected into prompts measurably degraded accuracy, confirming that models trained on human data inherit human blind spots — a quiet but serious liability in clinical settings.
  • Multilingual gaps, specialty-by-specialty inconsistency, and near-failure on medical image interpretation reveal that current AI diagnostic capability is narrow and uneven rather than broadly reliable.
  • Researchers are urging the field to treat benchmark scores not as proof of readiness but as a map of remaining distance — calling for more realistic evaluation frameworks before any autonomous deployment.

A language model can pass a medical licensing exam and still fail to diagnose a patient sitting across from it. That paradox drives a new study in npj Digital Medicine introducing AgentClinic, a benchmark that tests AI the way medicine actually works: through conversation, uncertainty, and incomplete information.

For years, AI performance in medicine has been measured on static question-answer tasks — a vignette appears, all the data is present, the model selects an answer. By that standard, some models have already surpassed human physicians. But clinical work unfolds differently: a doctor listens, orders tests, interprets results, and decides under time pressure with imperfect information. AgentClinic simulates this as a dialogue among four agents — a doctor model, a patient, a measurement agent, and a moderator who knows the correct diagnosis — giving the doctor twenty exchanges to reach a conclusion.

The results were sobering. Claude 3.5 Sonnet led all models at 62.1% accuracy on curated MedQA cases, edging past three human physicians tested at 54%. But on real patient records from the MIMIC-IV database, its accuracy fell to 42.9%. GPT-4 dropped to 34%. The distance between a polished exam question and a messy real-world case is not small. Reducing interactions from twenty to ten collapsed accuracy to 25%; increasing them to thirty also hurt performance, suggesting that more time to reason is not always an advantage.

Tools, biases, and language all compounded the picture. A reasoning scratchpad helped some models but not others. Injecting cognitive biases into prompts lowered GPT-4's accuracy to roughly 48 to 50%. Implicit biases, the unconscious associations absorbed from human-generated training data, had similar effects. Performance varied sharply by medical specialty, and image interpretation was a particular weakness: Claude 3.5 Sonnet reached only 37.2% accuracy on multimodal cases; GPT-4o-mini managed just 8%. Most models performed best in English and degraded meaningfully across other languages.

The researchers are candid about AgentClinic's own limits — it uses AI agents rather than real patients, and the human comparison group was small. But the central finding holds: strong exam performance is only weakly predictive of interactive, real-world diagnostic ability. Benchmark scores, they argue, are not evidence of readiness. They are a measure of how far clinical AI still has to travel.

A language model can ace a medical licensing exam and still fail to diagnose a patient sitting across from it. That paradox sits at the heart of a new study published in npj Digital Medicine, which introduces AgentClinic—a benchmark designed to test clinical AI agents the way medicine actually works: through conversation, uncertainty, incomplete information, and the need to ask the right questions.

For years, researchers have measured AI performance on static medical question-answer tasks. A case vignette appears on screen. All the relevant data is there. The model picks the right answer. By this metric, some large language models have already surpassed human physicians. But clinical work doesn't happen that way. A doctor sits with a patient, hears symptoms, orders tests, interprets results, and makes decisions with incomplete information and finite time. The gap between passing an exam and navigating that reality is vast.

AgentClinic targets that gap by simulating a clinical encounter as a dialogue among four agents: a doctor (the model being tested), a patient, a measurement agent that provides physical exam results, and a moderator that knows the correct diagnosis. The doctor agent gets twenty interactions to gather information before making a diagnosis. Researchers tested eleven language models on this setup using cases drawn from medical licensing exams, New England Journal of Medicine case challenges, and real electronic health records from the MIMIC-IV database.
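For readers who think in code, the encounter described above can be sketched roughly as a turn-limited loop. This is only an illustration under stated assumptions, not the study's implementation: the function `run_encounter`, the `DIAGNOSIS:` and `REQUEST TEST:` prefixes, and the scripted toy agents are all hypothetical stand-ins, and the real benchmark drives each role with a language model.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical types: a dialogue is a list of (role, utterance) pairs,
# and an agent maps the dialogue so far to its next utterance.
Dialogue = List[Tuple[str, str]]
Agent = Callable[[Dialogue], str]

# Assumed message conventions, invented for this sketch.
DIAGNOSIS_PREFIX = "DIAGNOSIS:"
TEST_PREFIX = "REQUEST TEST:"


def run_encounter(
    doctor: Agent,
    patient: Agent,
    measurement: Agent,
    moderator: Callable[[str], bool],
    max_interactions: int = 20,
) -> Optional[bool]:
    """Run one simulated encounter.

    Returns True or False if the doctor commits to a diagnosis within the
    interaction budget, or None if the budget runs out first.
    """
    history: Dialogue = []
    for _ in range(max_interactions):
        doctor_turn = doctor(history)
        history.append(("doctor", doctor_turn))
        if doctor_turn.startswith(DIAGNOSIS_PREFIX):
            # Only the moderator knows the correct answer.
            return moderator(doctor_turn[len(DIAGNOSIS_PREFIX):].strip())
        if doctor_turn.startswith(TEST_PREFIX):
            history.append(("measurement", measurement(history)))
        else:
            history.append(("patient", patient(history)))
    return None


# Toy scripted agents so the sketch runs without any model behind it.
def toy_doctor(history: Dialogue) -> str:
    script = [
        "What brings you in today?",
        "REQUEST TEST: temperature",
        "DIAGNOSIS: influenza",
    ]
    turns_taken = sum(1 for role, _ in history if role == "doctor")
    return script[min(turns_taken, len(script) - 1)]


def toy_patient(history: Dialogue) -> str:
    return "I have had a fever and body aches for two days."


def toy_measurement(history: Dialogue) -> str:
    return "Temperature: 39.1 C"


def toy_moderator(diagnosis: str) -> bool:
    return diagnosis.lower() == "influenza"


full_budget = run_encounter(toy_doctor, toy_patient, toy_measurement, toy_moderator)
tight_budget = run_encounter(
    toy_doctor, toy_patient, toy_measurement, toy_moderator, max_interactions=2
)
```

With the default budget the scripted doctor reaches a diagnosis on its third turn; with `max_interactions=2` it runs out of turns before committing. That is the mechanism a tighter interaction limit exercises, though nothing here reproduces the study's numbers.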

The results were sobering. Claude 3.5 Sonnet, the top performer, achieved 62.1% accuracy on the MedQA cases—higher than the 54% accuracy of three human physicians tested under the same constraints. But when the same model tackled real patient records from MIMIC-IV, its accuracy dropped to 42.9%. GPT-4 fell from strong performance to 34%. The gap between a curated exam question and a messy real-world case is not small. Performance also proved fragile. Reducing the number of interactions from twenty to ten dropped accuracy to 25%. Increasing them to thirty also decreased accuracy, suggesting that more time to think doesn't always help.

The researchers then tested how different tools affected performance. A "Notebook" tool—essentially a scratchpad for reasoning—helped Claude 3.5 Sonnet reach 56.1% peak accuracy. But tools didn't help all models equally. Some benefited from retrieval-augmented generation using medical textbooks; others didn't. The lesson was clear: a tool that works for one model may not work for another, and adding capabilities doesn't guarantee improvement.

Bias proved consequential too. When researchers injected cognitive biases into prompts—simulating the kind of systematic errors humans make—GPT-4's accuracy dropped to around 48 to 50%. Implicit biases, the unconscious associations shaped by culture and society, had similar effects. A model trained on human data inherits human blindness. Performance also varied sharply by medical specialty. Claude 3.5 Sonnet excelled in internal medicine and gynecology but struggled elsewhere, suggesting that dialogue-based diagnosis may require different skills than multiple-choice exams.

Image interpretation revealed another weakness. When researchers tested four multimodal models on cases that included medical images, Claude 3.5 Sonnet achieved only 37.2% accuracy. GPT-4o-mini managed just 8%. Whether the image was shown upfront or the doctor had to request it made little difference—the models simply struggled to integrate visual information into diagnostic reasoning.

Language mattered too. Most models performed best in English and showed substantial variability across Chinese, French, Spanish, Hindi, Persian, and Korean. Claude 3.5 Sonnet maintained the strongest multilingual performance, but the gap between languages was real and wide.

The researchers are careful to note what AgentClinic is not. It remains a simplified simulation, using AI agents to play the roles of patient and moderator rather than real humans. The human comparison data came from only three physicians. There are risks of data leakage for proprietary models. But the core finding stands: strong performance on a static exam is only weakly predictive of performance in an interactive, realistic setting. Before any AI system is deployed to diagnose patients, it needs to be tested the way AgentClinic tests it—in conversation, under constraint, with incomplete information, and against real-world cases. The benchmark performance itself is not evidence of readiness. It is, instead, a measure of how far clinical AI still has to go.

Strong performance on static medical question-answering tasks was only weakly predictive of performance in the interactive AgentClinic setting.
— Study authors, npj Digital Medicine
Benchmark performance should not be interpreted as evidence that any model is ready for autonomous clinical diagnosis.
— Study authors, npj Digital Medicine
The Hearth Conversation: Another perspective on the story
Inventor

Why does it matter that Claude beats human doctors on MedQA but struggles on real patient records?

Model

Because MedQA is a curated exam. All the information is already there, organized, waiting to be analyzed. Real patients don't come with a clean list of symptoms. A doctor has to ask what hurts, when it started, what makes it worse. That's a different skill entirely.

Inventor

So the benchmark is saying that passing an exam doesn't mean you can practice medicine.

Model

Exactly. It's like saying someone who scored well on a driving test can navigate a city in a snowstorm. The test measures one thing; the world demands another.

Inventor

What surprised you most in the results?

Model

That performance was fragile in both directions. Reducing interactions from twenty to ten dropped accuracy to 25%, and increasing them to thirty also hurt. You'd think more time to think would help. Instead, it seems like the models get lost in their own reasoning when given extra turns, and with fewer turns they miss something they'd catch with more back-and-forth.

Inventor

The bias findings are interesting. When you inject bias into the prompts, accuracy drops. Does that mean the models are actually detecting bias and rejecting it?

Model

Not quite. It means bias corrupts their reasoning. When a prompt includes a cognitive bias—like recency bias, where you overweight recent information—the model's diagnostic accuracy falls. It's not that the model is noble. It's that bias makes reasoning worse, even for machines.

Inventor

What about the image problem? Eight percent accuracy for one model on images?

Model

Images are hard. The models have to see a chest X-ray and understand what they're looking at, then integrate that visual information with the patient's history and symptoms. Most of them can't do it well. That's a real gap if you want AI to work in radiology or any specialty that relies on imaging.

Inventor

If I'm a hospital administrator, what should I take from this?

Model

Don't deploy these models for autonomous diagnosis yet. Use them as tools—to help a human doctor think through a case, to flag possibilities, to reduce cognitive load. But the final decision has to stay with a person who can see the patient, ask follow-up questions, and take responsibility for the outcome.

Want the full story? Read the original at News-Medical.