ChatGPT offers clear medical info but fails to identify emergencies, study warns

Delayed recognition of emergencies and inappropriate treatment suggestions could lead to serious health complications or death.
Confident incompetence is harder to spot than a lie.
The danger of ChatGPT for health advice lies not in false information but in plausible-sounding guidance that misses critical warning signs.

In an age when millions instinctively reach for a chatbot before calling a physician, a new study forces a sobering reckoning: the same tool that can illuminate a diagnosis with remarkable clarity may, in the very next breath, steer someone toward serious harm. Researchers found that ChatGPT's failures are not born of fabricated facts, but of something more fundamental — the absence of clinical judgment, the inability to ask the question behind the question. The technology mirrors the form of medicine without possessing its conscience.

  • A study of over 250 real health conversations reveals ChatGPT can swing from near-perfect medical explanations to dangerously incomplete guidance within the same session.
  • The most alarming failures weren't hallucinations — the AI simply never asked follow-up questions, missing emergencies like a patient who fainted with abdominal pain.
  • In one case, the chatbot suggested veterinary antiparasitic drugs for a tumor without noting that standard treatment for testicular cancer carries an exceptionally high cure rate — a response one leading physician called 'terrible.'
  • The trap is invisible to most users: authoritative tone and coherent explanations make it nearly impossible for someone without medical training to distinguish a safe response from a harmful one.
  • OpenAI says newer models probe more, but experts argue real safety demands AI that thinks like an experienced physician — initiating inquiry, recognizing urgency, and refusing to validate dangerous ideas.

Millions of people now consult ChatGPT about their health, and a new study explains precisely why that habit carries hidden danger. Journalist Geoffrey Fowler analyzed more than 250 real health conversations, then had physician Robert Wachter, who leads the medicine department at UC San Francisco, evaluate a dozen of them. What they found was a tool of striking contradictions.

When users supply detailed, well-contextualized symptoms, ChatGPT can perform impressively. Wachter awarded perfect scores to several responses, praising the bot's ability to organize medical information, decode jargon, and explain probabilities around conditions like a severe persistent cough. For everyday questions, he noted, it can outperform most non-medical friends or family.

The danger, however, lives in the failures, and those failures share a common root: the complete absence of clinical judgment. When a user described a friend who had fainted alongside abdominal pain, ChatGPT listed possible causes without ever asking how severe the pain was or whether the person was conscious. In a more alarming case, it suggested veterinary antiparasitic drugs as alternatives for treating a tumor, never mentioning that testicular cancer is highly curable through standard treatment. These are not errors of false information; they are errors of missing instinct.

The deepest problem is that patients cannot see the difference. A dangerous response and a sound one arrive in the same measured, authoritative tone. As Wachter put it, an ordinary person without medical training likely cannot distinguish a perfect answer from a harmful one. OpenAI acknowledges the gap and points to improvements in newer models, but Wachter argues the bar must be higher: AI systems must learn to think like seasoned physicians — probing, prioritizing urgency, and resisting the pull of a user's own dangerous assumptions. Until then, ChatGPT can inform, but it cannot judge.

Millions of people now turn to ChatGPT when they have health questions, despite warnings that an AI chatbot is no substitute for a doctor. A new study, conducted by The Washington Post and reviewed by a prominent physician, reveals why this habit is dangerous: the bot can deliver clear medical explanations one moment and potentially harmful guidance the next, often without the user knowing the difference.

Geoffrey A. Fowler analyzed more than 250 real conversations with ChatGPT about health concerns, then selected a dozen of them for evaluation by Robert Wachter, who heads the medicine department at the University of California, San Francisco. The goal was straightforward—to map where the chatbot succeeds and where it puts people at risk. What emerged was a portrait of a tool that excels at one thing while failing catastrophically at another.

When users provide detailed information—a clear timeline of symptoms, their intensity, relevant context—ChatGPT can perform remarkably well. Wachter gave perfect scores to four responses, including one that elegantly explained the probabilities and warning signs associated with a severe, persistent cough. The bot can organize medical information clearly, translate technical jargon, and help someone understand what their doctor told them. For common ailments and general questions, Wachter noted, it can be smarter than most people's friends or spouses.

But the failures are where the danger lives. The most serious errors didn't stem from the AI hallucinating false facts. Instead, they revealed a complete absence of clinical judgment. In one case, a user described a friend who had fainted while experiencing abdominal pain. ChatGPT listed possible causes but never asked the essential follow-up questions: How severe was the pain? What is the patient's current condition? Is he conscious? In another, far more alarming example, the bot suggested antiparasitic medications used in veterinary medicine as alternatives for treating a tumor, without mentioning that testicular cancer has an exceptionally high cure rate with standard treatment. Wachter called that response "terrible." The pattern was clear: ChatGPT fails at something fundamental to medicine itself—responding to a question with another question, probing deeper, recognizing when a situation demands immediate professional intervention.

This creates a trap that's difficult to escape. A patient without medical training cannot reliably tell the difference between a response that deserves a perfect score and one that could cause serious harm. The information sounds authoritative. The tone is measured. The explanations are coherent. Yet beneath that surface clarity, the bot may be steering someone toward a dangerous delay or a harmful choice. Wachter posed the problem directly: "If I were just an ordinary person without medical training, could I tell the difference between a 10 and a 2? I don't think so."

OpenAI, the company behind ChatGPT, acknowledges these limitations and says its newer models ask more follow-up questions. But Wachter argues that real progress requires something deeper—systems that actually think like experienced physicians. They need to initiate conversations thoughtfully, investigate root causes, assess whether a situation is urgent, and resist validating dangerous ideas simply because a user suggested them. Until then, ChatGPT remains a tool that can inform but cannot diagnose, that can clarify but cannot judge, and that can sound like a doctor without ever being one.

If I were just an ordinary person without medical training, could I tell the difference between a 10 and a 2? I don't think so.
— Robert Wachter, UC San Francisco medicine department head
ChatGPT fails in one of a doctor's essential functions: responding to a question with another question.
— Robert Wachter, UC San Francisco medicine department head
The Hearth Conversation: Another perspective on the story
Inventor

So the study found ChatGPT gets some things right and some things dangerously wrong. What's the actual dividing line?

Model

It's not about complexity. A simple question asked clearly gets a good answer. A serious question asked vaguely gets a dangerous one. The real problem is that the bot doesn't know the difference between the two.

Inventor

What do you mean it doesn't know the difference?

Model

A real doctor, when someone says "my friend fainted with stomach pain," immediately starts asking: How long was he unconscious? Is he breathing normally? Is the pain still there? ChatGPT just lists possibilities and moves on. It has no instinct for urgency.

Inventor

Can't it be trained to ask those questions?

Model

Maybe. But asking the right follow-up question requires understanding what you don't know yet. That's clinical judgment. It's not just pattern-matching in text. It's knowing that some silences are dangerous.

Inventor

So the real risk isn't that it lies—it's that it sounds confident while missing the point entirely.

Model

Exactly. A lie you might catch. Confident incompetence is much harder to spot. And most people using it don't have the training to tell the difference.

Inventor

What should someone actually use it for, then?

Model

Understanding what your doctor already told you. Organizing your thoughts before an appointment. Learning what questions to ask. But never as a substitute for someone who can say, "This needs to be checked today."

Want the full story? Read the original at Olhar Digital ↗