High agreement between machines and humans does not mean objectivity
In a field where surgical practice and pharmaceutical innovation are outpacing the evidence meant to guide them, researchers turned to eleven artificial intelligence systems to ask whether machines and medical experts read the same literature and reach the same conclusions. Across thirty-one clinical statements on combining obesity medications with bariatric surgery, the answer was largely yes — a ninety-three percent concordance that speaks less to the wisdom of machines than to the shared foundation of published human knowledge. The two statements that shifted in the process reveal something more nuanced: that AI may be most valuable not when it confirms what experts already believe, but when it quietly surfaces what they may have overlooked.
- A fast-moving field — GLP-1 drugs reshaping obesity treatment alongside surgery — is generating clinical questions faster than traditional expert panels can formally answer them.
- Eleven AI models, tested against thirty-one expert consensus statements in a single structured session, agreed with human specialists 93% of the time, raising urgent questions about what role machines should play in medical guideline development.
- Two statements were quietly but meaningfully revised: one confidence grade was lowered because two models cited emerging evidence for pre-surgical use in specific patient populations, and another was raised because every model detected promising signals in early-stage studies the panel had rated cautiously.
- Researchers are careful to warn that machine agreement is not machine wisdom — LLMs mirror the biases and gaps of the literature they were trained on, and cannot replicate the deliberative judgment of experienced clinicians.
- The study points toward a future where AI runs alongside expert panels in real time, flagging disagreements and surfacing overlooked evidence — but only under strict governance, locked model versions, and unambiguous human authority over final decisions.
On a single day in early June, researchers fed thirty-one clinical statements into eleven AI models and asked a deceptively simple question: do you agree or disagree? The statements came from the International Federation for the Surgery and Other Therapies of Obesity, published in 2024, and addressed one of modern medicine's moving targets — how to combine powerful new weight-loss drugs like GLP-1 receptor agonists with metabolic bariatric surgery in a field where practice is outrunning evidence.
The results were striking. The models agreed with expert consensus 93% of the time, and their consistency with one another registered a Fleiss' kappa of κ = 0.81, which the researchers describe as substantial agreement. Twenty-nine of thirty-one statements needed no adjustment at all. Experts and machines, drawing from the same body of published literature, were arriving at the same place.
The two exceptions told the more interesting story. One statement, originally rated at the highest confidence level, held that there was insufficient evidence to routinely recommend these medications before surgery. Two models pushed back, citing emerging data in specific patient populations — enough to nudge the rating down one grade. The other shift ran in the opposite direction: a statement on combining these drugs with endoscopic procedures, initially rated as having limited evidence, earned unanimous AI support and was upgraded to moderate confidence. The models had detected patterns in pilot studies and case reports that the panel had treated more cautiously.
The researchers were deliberate about what this does and does not mean. Agreement between machines and humans is not objectivity — it is shared exposure to the same literature, with all its gaps and embedded biases. AI cannot deliberate, weigh competing values, or account for the lived experience of patients. What it can do is serve as a systematic check: surfacing overlooked studies, flagging statements that deserve scrutiny, and detecting early signals when evidence is beginning to shift.
The proposal that follows is carefully bounded. Future guidelines could integrate AI review during development — running models in parallel with expert panels, round by round, with disagreements flagged for human deliberation rather than deferred to machine judgment. This would require locked model versions, standardized prompts, and controlled access dates to ensure machines and experts are working from the same evidence base. The researchers acknowledge their own limitation: they evaluated a panel they themselves sat on. Whether this approach holds when the evidence is more contested and the evaluators truly independent remains the open question.
On a single day in early June, researchers fed thirty-one clinical statements into eleven different artificial intelligence models and asked them a simple question: do you agree or disagree? The statements came from the International Federation for the Surgery and Other Therapies of Obesity, released in 2024, and they addressed a gap in modern medicine: how to combine new obesity medications with metabolic bariatric surgery, a field where practice is moving faster than the evidence can keep up.
The medications in question are powerful. GLP-1 receptor agonists and dual incretin mimetics have shown remarkable ability to help people lose weight and improve their metabolism. Surgeons wanted to know: should patients take these drugs before surgery, during recovery, or after? Should they be combined with other procedures? The expert panel had published answers, but the researchers wondered whether artificial intelligence systems trained on vast medical literature would reach the same conclusions.
What they found was striking alignment. Across all thirty-one statements, the language models agreed with the expert consensus ninety-three percent of the time. The statistical measure of consistency among the models themselves—a metric called Fleiss' kappa—came in at 0.81, which researchers classify as "substantial" agreement. Twenty-nine of the thirty-one statements required no adjustment whatsoever. The experts and the machines were reading the same evidence and drawing the same conclusions.
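For readers unfamiliar with the metric, Fleiss' kappa compares the agreement actually observed among several raters with the agreement expected by chance: κ = (P̄ − Pₑ) / (1 − Pₑ). The sketch below is a minimal illustration of that standard formula, not the researchers' code. The function name `fleiss_kappa` and the toy vote counts are assumptions made for this example, and none of the numbers come from the study.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Every item must be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")

    n_categories = len(counts[0])
    total = n_items * n_raters

    # Share of all ratings that fell into each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(n_categories)]

    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]

    p_bar = sum(p_i) / n_items        # mean observed agreement
    p_e = sum(p * p for p in p_j)     # agreement expected by chance

    if 1.0 - p_e < 1e-12:
        # All ratings landed in one category; kappa is undefined.
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy data: 4 statements, 3 raters, columns = [agree, disagree].
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2], [2, 1], [1, 2]]

kappa_perfect = fleiss_kappa(perfect)
kappa_split = fleiss_kappa(split)
```

In the toy data, the first table has every rater agreeing on every statement, so kappa is 1. The second has a two-to-one split on each statement, which is less agreement than chance would predict, so kappa is negative.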
But the two statements that did shift tell a more interesting story. One statement, originally marked with the highest confidence level by experts, said there was insufficient evidence to routinely recommend these medications before surgery. Two of the eleven models disagreed, pointing to emerging evidence in specific patient populations. As a result, the statement was downgraded slightly, from the strongest confidence category to merely strong. The other shift went in the opposite direction. A statement about combining these medications with endoscopic procedures for obesity had been rated as having limited evidence. Every single model agreed with it, and the confidence level was upgraded from limited to moderate. The models had detected patterns in recent pilot studies and case reports that suggested real promise, even if large randomized trials didn't yet exist.
The researchers were careful to name what they were not claiming. High agreement between machines and humans does not mean the machines are objective. It does not mean they have achieved clinical reasoning. The models are pattern-matching systems trained on published literature, and when they agree with experts, it likely means both are drawing from the same body of evidence. The models can reflect biases embedded in that literature. They cannot deliberate the way a panel of experienced surgeons can. They do not weigh competing values or consider the lived experience of patients.
What the study does suggest is that artificial intelligence can serve as a useful check on expert consensus—a way to surface overlooked literature, flag statements that deserve closer scrutiny, and detect early signals when evidence is shifting. The researchers propose that future guidelines could integrate AI review not after publication, as this study did, but during the development process itself, running models in parallel with expert panels through each round of deliberation. The machines would flag disagreements and point to evidence the humans might have missed. The humans would make the final judgment.
But this would require governance. Model versions would need to be locked. Prompts would need to be standardized and transparent. Access dates would need to be controlled so machines don't see newer evidence than the experts do. The goal would not be to replace human judgment but to augment it—to make the process faster, more responsive to emerging data, and less vulnerable to groupthink. The researchers acknowledge their own potential bias: they were members of the expert panel they were evaluating. Future work will need to test whether this collaborative approach holds up when the stakes are higher, the evidence more contested, and the people doing the evaluating truly independent.
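One way to picture these governance requirements is as a frozen review protocol plus a rule that routes every machine disagreement to the human panel. The sketch below is purely illustrative and makes several assumptions: the names `ReviewProtocol` and `flag_for_panel`, the model identifiers, the statement IDs, the prompt wording, and the cutoff date are all hypothetical and do not come from the study.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: settings cannot be changed mid-review
class ReviewProtocol:
    model_versions: tuple[str, ...]  # locked model identifiers (hypothetical)
    prompt_template: str             # one standardized prompt for all models
    evidence_cutoff: date            # shared evidence date for models and experts


def flag_for_panel(model_votes: dict[str, dict[str, bool]]) -> list[str]:
    """Return IDs of statements where at least one model disagreed.

    model_votes[statement_id][model_name] is True when that model agreed
    with the expert statement. Flagged statements go back to the human
    panel for deliberation; nothing is changed automatically.
    """
    return sorted(
        statement_id
        for statement_id, votes in model_votes.items()
        if not all(votes.values())
    )


# Hypothetical example values, not taken from the study.
protocol = ReviewProtocol(
    model_versions=("model-a@v1", "model-b@v3"),
    prompt_template="Do you agree or disagree with this statement: {statement}",
    evidence_cutoff=date(2025, 6, 1),
)

votes = {
    "S1": {"model-a@v1": True, "model-b@v3": True},
    "S2": {"model-a@v1": True, "model-b@v3": False},
}

flagged = flag_for_panel(votes)
```

Here only the hypothetical statement `S2` is flagged, because one model disagreed with it. The function only reports statements for review, which keeps the final judgment with the human panel, as the researchers propose.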
Notable Quotes
"LLM outputs are influenced by the data and curation choices used during model training and by retrieval sources; they can therefore reflect, amplify, or re-weight biases present in published literature." (Study authors, acknowledging algorithmic bias as a limitation)
"Collaborative intelligence can be a useful adjunct to highlight convergent interpretations, surface recent or under-appreciated literature, and identify statements that deserve closer evidentiary scrutiny." (Study conclusion on appropriate use of AI in guideline development)
The Hearth Conversation: Another perspective on the story
Why does it matter that the machines agreed with the experts? Doesn't that just mean they're both reading the same textbooks?
Exactly—and that's both the promise and the limitation. The agreement tells us the expert panel wasn't operating in a vacuum or driven by opinion alone. But you're right that it doesn't prove the machines are thinking. They're pattern-matching. What matters is whether they can catch things humans miss—newer studies, alternative interpretations, evidence that hasn't yet made it into consensus.
The two statements that changed—did the machines actually improve the guidelines, or did they just introduce noise?
That's the question the field is wrestling with. One downgrade happened because two models disputed the panel's claim that the evidence for pre-surgery medication was insufficient, pointing to emerging data in specific patient populations. That's a useful challenge. The upgrade for the endoscopic therapy statement happened because the models detected recent small studies suggesting benefit. But the researchers were honest: those studies are low-quality and heterogeneous. The machines spotted a signal, but it's not yet a clear answer.
So the machines are good at finding needles in haystacks, but not at deciding whether the needle matters?
That's a fair way to put it. They can synthesize large literatures quickly and flag patterns. But they can't weigh clinical judgment, patient values, or the difference between a promising pilot study and real-world evidence. They're a tool for the experts to use, not a replacement for expertise.
What happens if a machine disagrees with the expert consensus? Do you just ignore it?
The researchers propose you don't ignore it, but you don't automatically defer to it either. You document the disagreement, pull the underlying references the machine cited, and have the expert panel evaluate them using formal quality appraisal tools. The machine becomes a way to surface potentially overlooked literature while humans stay in charge of deciding what it means.
But couldn't the machines just be echoing what they were trained on? How do you know they're not just memorizing the consensus statement itself?
That's a real concern. The researchers acknowledge it directly. The IFSO consensus was published before this study, so it probably made its way into the training data of at least some of these models. That's why they're proposing future studies run in parallel with expert panels, before publication, so you can be sure the machines aren't just regurgitating the answer. Independence matters.