Inconsistency in an AI tool reflects patterns in training data, not clinical learning.
For the millions of older adults navigating a daily regimen of five, seven, or even ten medications, the risk of harm quietly compounds with each added prescription. Researchers at Mass General Brigham have asked whether artificial intelligence might help untangle this pharmaceutical complexity, testing ChatGPT's capacity to assist in deprescribing decisions for elderly patients. The findings, published in the Journal of Medical Systems, suggest the technology holds genuine promise but carries inconsistencies and blind spots that remind us how much distance remains between a capable tool and a trustworthy one.
- More than four in ten older Americans take five or more medications simultaneously, creating dangerous interaction risks that fragmented specialist care makes harder, not easier, to manage.
- ChatGPT demonstrated reasonable clinical judgment in many deprescribing scenarios, but gave contradictory answers when identical cases were presented across separate sessions, a sign that the model may be mirroring inconsistent training data rather than applying stable medical logic.
- The AI showed a troubling tendency to dismiss pain as a clinical priority, recommending discontinuation of pain medications far more readily than other drug classes, while largely ignoring how severely a patient's daily functioning was impaired.
- Researchers and clinicians are calling for specialized training, rigorous validation, and sustained skepticism before any AI deprescribing tool is allowed to influence real patient care decisions.
An elderly patient arrives at a primary care office with a medication list that fills the page — five drugs, seven, sometimes more. For over four in ten older Americans, this is ordinary life. The condition, known as polypharmacy, compounds the risk of dangerous drug interactions, each prescription individually justified but collectively forming a chemical puzzle that grows harder to solve as specialists multiply and primary care physicians inherit the burden of making sense of it all.
Deprescribing — the deliberate removal of unnecessary medications — is the logical remedy, but it demands careful clinical reasoning that takes time and expertise increasingly in short supply. Researchers at Mass General Brigham's MESH Incubator decided to test whether ChatGPT could help. They built a series of clinical scenarios featuring an older patient on multiple medications, varying details like cardiovascular history and severity of functional impairment, then asked the AI whether specific drugs should be reduced or stopped.
The results were illuminating in two directions. ChatGPT showed sound judgment in many cases — appropriately recommending deprescribing for patients without heart disease, and growing more conservative when cardiac history entered the picture. But when identical scenarios were presented in new sessions, the model gave different answers, suggesting it was reflecting inconsistencies in its training data rather than applying reliable clinical logic.
More troubling still, the AI consistently underweighted pain as a clinical concern, recommending discontinuation of pain medications far more readily than other drug classes, while the severity of a patient's functional impairment barely registered in its reasoning at all. Senior author Dr. Marc Succi and lead author Arya Rao both emphasized the same conclusion: ChatGPT could meaningfully ease the burden on overstretched primary care providers, but only after rigorous refinement and validation against real clinical outcomes. What this research offers is not a solution, but a direction — and a clear-eyed account of how far the path still runs.
An elderly patient walks into a primary care office carrying a list of medications that stretches across the page. Five drugs. Seven. Sometimes ten or more. For more than four in ten older Americans, this is ordinary life—a condition called polypharmacy that compounds the risk of dangerous interactions between medications, each one prescribed for a legitimate reason but together creating a chemical puzzle that grows harder to solve.
The problem has become acute precisely because medicine has fragmented. As seniors accumulate specialists—a cardiologist here, an endocrinologist there—each one adds medications to the regimen without always knowing what the others have prescribed. Primary care doctors, already stretched thin, inherit the job of making sense of it all. Deprescribing—the deliberate removal of unnecessary drugs—is the logical answer, but deciding which medications to cut requires the kind of careful clinical reasoning that takes time and expertise in short supply.
Researchers at Mass General Brigham's MESH Incubator decided to test whether ChatGPT, the large language model that has captured public imagination, could help. They constructed a series of clinical scenarios, each one featuring the same older patient on multiple medications, then varied the details: Did the patient have a history of cardiovascular disease? How severe was their functional impairment? They asked the AI yes-or-no questions about whether specific drugs should be reduced or stopped. The results, published in April in the Journal of Medical Systems, revealed something genuinely useful—and something troubling.
ChatGPT showed reasonable judgment in many cases. When presented with a patient without cardiovascular disease, the model consistently recommended deprescribing. But introduce heart disease into the scenario, and the AI became more conservative, more likely to leave the medication list alone. This kind of clinical caution made sense. What did not make sense was what happened next: when the researchers presented identical scenarios in new chat sessions, ChatGPT gave different answers. The inconsistency suggested the model was reflecting patterns in its training data—which itself may contain inconsistent deprescribing practices—rather than applying stable clinical logic.
The researchers also noticed a peculiar bias. ChatGPT showed a tendency to disregard pain as a clinical concern and was far more willing to recommend stopping pain medications than other drug classes like statins or blood pressure medications. Severity of functional impairment—how much difficulty a patient had with basic daily activities—seemed to barely register in the model's decision-making at all. These were not minor quirks. They were gaps that could lead to real harm if a clinician relied on the tool without skepticism.
Dr. Marc Succi, the senior corresponding author, framed the findings carefully. ChatGPT could help ease the burden on general practitioners drowning in medication management. But caution was essential. The model needed refinement, better training, validation against real clinical outcomes. Arya Rao, the lead author and a Harvard medical student, echoed the point: AI tools could play an important role in keeping older adults safe, but only if they were continuously refined to capture the true complexity of medical decision-making.
What emerges from this work is not a solution but a direction. The problem is real and urgent—polypharmacy affects millions of seniors and increases their risk of serious adverse events. The tool shows promise. But the path from promise to practice requires humility about what current AI can and cannot do, and a commitment to testing it rigorously before it touches actual patient care.
Notable Quotes
"AI-assisted polypharmacy management could help alleviate the increasing burden on general practitioners, but caution should be taken to increase accuracy of such models." (Dr. Marc Succi, Mass General Brigham)
"AI-based tools can play an important role in ensuring safe medication practices for older adults, but we must continue to refine these tools to account for the complexities of medical decision-making." (Arya Rao, Harvard Medical School researcher)
The Hearth Conversation: Another angle on the story
Why does polypharmacy happen in the first place? It seems like something a good system would prevent.
Because medicine isn't organized around individual patients—it's organized around diseases and specialties. A cardiologist treats the heart. An endocrinologist treats diabetes. Each one is right within their domain. But the patient is one person, and their body doesn't care about specialty boundaries.
So the primary care doctor is supposed to be the referee?
In theory, yes. In practice, they're overwhelmed. They're the ones who have to know everything about the patient, but they're also the ones with the least time per patient. That's where something like ChatGPT could actually help—not replace judgment, but handle some of the cognitive load.
But the study found it was inconsistent. How can you trust something that gives different answers to the same question?
That's the real problem. Inconsistency in an AI tool is different from inconsistency in a human. A human might be tired one day and sharp the next, but they're learning from experience. ChatGPT is just reflecting patterns in its training data. If the training data itself contains conflicting deprescribing practices, the model will too.
The bias against pain medications—that seems dangerous.
It is. Pain matters. It affects quality of life, function, everything. If an AI tool systematically undervalues pain, it could push doctors toward decisions that look good on paper but make patients' lives worse. That's why this needs to be tested carefully before it gets near real patients.
So what's the actual use case here?
Right now, it's a proof of concept. The tool could help a busy primary care doctor think through deprescribing decisions more systematically. But it needs to be trained better, tested against real outcomes, and used as a support tool, not a decision-maker. The human judgment has to stay in the room.