AI agent MIRA outperforms physicians in simulated emergency cases, but human oversight remains critical

Advice is easy. But hospitals don't work on advice.
The challenge MIRA solved was translating clinical knowledge into the structured actions that hospital systems actually require.

In a carefully bounded simulation, an autonomous AI system called MIRA worked through hundreds of real emergency cases and outdiagnosed experienced physicians — a result that arrives not as a triumph of machines over medicine, but as a quiet reckoning with what clinical intelligence actually requires. The gap in accuracy was measurable and significant, yet the researchers who built MIRA were the first to insist it is not ready for real patients, real stakes, or real consequences. What the study illuminates is less a finish line than a threshold: the moment when a machine can finally speak the language of a hospital, and the long, careful work that must follow before anyone decides what it should be allowed to say.

  • An AI system reached 88.9% diagnostic accuracy across 574 emergency cases; on the 311 cases where it was compared head to head, it scored 87.8% against 78.1% for board-certified physicians, a gap too large to dismiss as noise.
  • The result unsettles a long-held assumption: that navigating a real hospital's infrastructure — ordering tests, reading results, writing prescriptions in valid formats — was a barrier machines could not yet cross.
  • Safety metrics were striking in their own right, with zero high-severity drug interactions and perfect recall for admission decisions in pneumonia and pulmonary embolism cases, though the system sometimes erred toward over-admission rather than precision.
  • Researchers are pushing back against the momentum of their own findings, insisting that prospective real-world validation, governance frameworks, and physician oversight must precede any clinical deployment.
  • The study positions MIRA not as a replacement for human judgment but as a potential support tool — a distinction the authors treat as non-negotiable, not rhetorical.

In a sandboxed emergency department simulation, an autonomous AI system called MIRA worked through 574 real patient cases — ordering tests, interpreting results, deciding who needed admission — and diagnosed correctly in nearly 89 out of every 100 encounters. The physicians it was matched against, including board-certified doctors and residents, scored lower by margins that were statistically significant. Yet the researchers who built MIRA were careful to say what the numbers did not mean.

The problem MIRA was designed to address is older than AI itself. Medical schools teach clinical reasoning, but hospitals run on structured systems — electronic health records, standardized codes, precise data entry. Large language models had already shown they could pass medical exams. What remained unproven was whether a machine could actually do the work inside a real hospital's infrastructure, moving from question to test to result to prescription without a human translating each step. MIRA was built to attempt exactly that, operating within a virtual EHR with access to 11 specialized tools and more than 85,000 possible actions.

The results were stark. MIRA achieved 88.9% diagnostic accuracy overall, compared to 78.1% for board-certified physicians and 71.1% for the resident and attending group. It was especially strong with appendicitis and pancreatitis, and it avoided the trap of over-ordering — its test selection stayed well below historical baselines despite drawing on a broader range of individual parameters. An independent safety review of 468 prescriptions found zero high-severity drug interactions, zero renal dosing errors, and zero allergy mismatches. When deciding whether pneumonia or pulmonary embolism patients needed admission, it never missed someone who required inpatient care — though it sometimes admitted patients who might have been safely managed at home.

The researchers were explicit about what comes next. MIRA has not been tested on real patients in real time. Antibiotic selection and other nuanced decisions still require human expertise. Before any clinical deployment, the system would need prospective validation, governance frameworks, and oversight protocols that only real-world practice can produce. The study demonstrates that a machine can now operate fluently within the constraints of medical infrastructure. Whether that fluency can be trusted with actual lives remains, deliberately and carefully, an open question.

In a sandboxed simulation of an emergency department, an autonomous AI system called MIRA worked through 574 real patient cases, ordering tests, reading results, and deciding who needed admission. When the results came back, MIRA had diagnosed correctly in nearly 89 out of every 100 cases. The physicians it was matched against—experienced doctors with board certification, residents in training—scored lower. The gap was significant enough to matter. Yet the researchers who built MIRA were careful to say what it was not: a replacement for human judgment, a system ready for real patients, a tool that could work without oversight.

The problem MIRA was designed to solve is older than AI itself. Medical schools teach students to think like doctors—to listen, to ask questions, to order tests that narrow down possibilities, to synthesize conflicting information, to change their minds when new data arrives. But hospitals don't run on thinking. They run on electronic health records, on standardized codes, on structured data entry, on systems that demand actions be recorded in precise formats before they can take effect. Large language models had already shown they could pass medical exams and answer complex clinical questions. What remained unproven was whether a machine could actually do the work of a doctor inside a real hospital's infrastructure—ordering a blood test, reading the result, updating a diagnosis, writing a prescription in the right format, all without a human translating each step.

MIRA operates inside a virtual EHR that follows healthcare standards. It has access to 11 specialized tools and can make more than 85,000 different choices about what to do next. It can request a physical exam, order specific lab values, review a patient's history, write medication orders. In the study, it worked through cases drawn from the MIMIC-IV database—real emergency department presentations across eight diagnoses: appendicitis, pneumonia, pancreatic cancer, urinary tract infections, and others. It was tested against two groups of human physicians working the same cases under identical conditions. One group was four board-certified doctors. The other was a mix of four residents and two attending physicians.

The numbers were stark. MIRA achieved 88.9% diagnostic accuracy across all 574 cases, and 87.8% in the 311 cases where it was directly compared to physicians. Board-certified doctors averaged 78.1% accuracy. The residents and attending physicians together averaged 71.1%. The differences were statistically significant. MIRA was particularly strong with appendicitis and pancreatitis, achieving perfect recall on appendectomy cases. It handled pancreatic cancer at the level of experienced physicians. Pneumonia and urinary tract infections proved harder. But the system did not achieve these results by ordering every test available. It requested a broader range of individual blood parameters than human doctors typically would, yet its overall test selection stayed well below historical baselines. It avoided the trap of expensive, unnecessary imaging.

Safety is where the caution enters. An independent medical review of 56 patient cases and 468 prescriptions written by MIRA found zero high-severity drug interactions, zero renal dosing errors, and zero medication-allergy mismatches. Prescription route specification (how a drug should be administered) was the weakest area, at 97% correctness. When MIRA had to decide whether pneumonia or pulmonary embolism patients needed hospital admission, it achieved perfect recall: it never missed someone who needed inpatient care. But that same caution meant it sometimes admitted patients who might have been managed as outpatients. The system erred toward safety, which is not the same as deciding correctly.
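One way to put a zero-event finding in perspective is to ask how high the true error rate could still be. The sketch below is an illustration, not part of the study: it assumes each of the 468 reviewed prescriptions can be treated as an independent trial with the same underlying error rate, and it computes the exact one-sided 95% upper bound for zero observed events alongside the common "rule of three" approximation (3/n). The function name and the choice of a 95% level are assumptions made here for the example.

```python
def zero_event_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Upper bound on the event rate when 0 events are seen in n trials.

    Solves (1 - p)^n = 1 - confidence for p, i.e. the largest rate that
    would still produce zero events with probability (1 - confidence).
    Assumes independent trials with a constant rate (an assumption of
    this sketch, not a claim made by the study).
    """
    if n <= 0:
        raise ValueError("n must be a positive number of trials")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n)


n_prescriptions = 468  # prescriptions covered by the safety review

upper = zero_event_upper_bound(n_prescriptions)
rule_of_three = 3 / n_prescriptions  # quick approximation to the same bound

print(f"exact 95% upper bound: {upper:.4%}")
print(f"rule-of-three estimate: {rule_of_three:.4%}")
```

Under these assumptions, zero serious errors in 468 prescriptions is still compatible with a true rate of roughly 0.6%, which is why a clean result at this sample size is encouraging rather than conclusive.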

What MIRA is not meant to do is work alone. The researchers were explicit about this. The system is not ready for clinical deployment. It has not been tested on real patients in real time. It has not been validated prospectively. It did not achieve perfect accuracy on every decision: antibiotic selection, for instance, remained an area where human expertise would still be needed. The authors describe MIRA as a potential support tool for physicians, not a replacement. Before it could ever be used in an actual hospital, it would need governance frameworks, oversight protocols, and the kind of real-world testing that only comes from actual patient care. The study shows what is possible when a machine can operate fluently within the constraints and structures of medical practice. It does not yet show whether that possibility is safe to realize.

MIRA and similar AI agents are not replacements for expert human staff, and the model did not achieve 100% perfection in all treatment choices, highlighting the ongoing need for strict human supervision.
— Study authors
The Hearth Conversation: Another perspective on the story
Inventor

Why does it matter that MIRA can order tests and write prescriptions inside an EHR? Couldn't it just give advice like other medical AI tools?

Model

Because advice is easy. A language model can tell you what to do. But a hospital doesn't work on advice—it works on actions recorded in specific formats that trigger other systems. MIRA had to learn to think like the infrastructure itself, not just like a doctor.

Inventor

And it outperformed experienced physicians. That's the headline everyone will read. Does that mean we should start using it?

Model

No. The study was a simulation. The patient was an AI agent responding from documented histories. Real emergency departments are messier—patients contradict themselves, information arrives out of order, you're interrupted. MIRA has never seen that.

Inventor

What about the safety findings? Zero drug interactions sounds perfect.

Model

It's encouraging, but it's also a narrow slice. They reviewed 56 cases and 468 prescriptions. That's not enough to know how rare a serious error would be in practice. And the system did miss things—antibiotic selection, for instance. It's not infallible.

Inventor

The perfect recall on admission decisions—never missing someone who needed hospital care—that seems like the most important finding.

Model

It is important. But it came with a cost: the system admitted some patients who probably didn't need to be admitted. In a real hospital, that has consequences. Beds fill up. Costs rise. You're trading one kind of error for another.

Inventor

So what would need to happen before this could actually be used?

Model

Real-world testing with actual patients. Governance frameworks that define when a doctor can trust the system and when they have to override it. And honestly, a culture shift—physicians would need to learn to work alongside it, not just check its work.

Want the full story? Read the original at News-Medical.