- Tension: ChatGPT’s health feature failed to recognize textbook medical emergencies in testing — including heart attacks, strokes, and pulmonary embolisms — while millions of people already rely on it as their first stop for symptom evaluation.
- Noise: The AI’s high accuracy on routine health questions creates a halo of credibility that extends to life-threatening cases, and its calm, authoritative tone triggers automation bias that can override a patient’s own alarm signals.
- Direct Message: The tool’s greatest strength — relentless, confident reassurance — is precisely what makes it lethal in emergencies. Reassurance without clinical judgment doesn’t inform; it sedates.
To learn more about our editorial approach, explore The Direct Message methodology.
Denise Koh, a 34-year-old graphic designer in Portland, was home alone on a Tuesday night when a sharp, radiating pain crept from her jaw down into her left shoulder. She’d been dealing with what she assumed was a pinched nerve for days. Instead of calling her doctor’s after-hours line, she opened ChatGPT. She typed in her symptoms carefully: jaw pain, left arm tingling, shortness of breath, nausea. The AI responded with a calm, detailed explanation of musculoskeletal strain and suggested stretching exercises, ice packs, and ibuprofen. It did not tell her to call 911. It did not mention her heart.
Denise’s husband came home an hour later and drove her to the ER, where doctors confirmed she was in the early stages of a myocardial infarction. She spent four days in the cardiac unit. “I almost went to bed,” she told a local news outlet afterward. “The AI sounded so confident. It made everything feel manageable.”
Stories like Denise’s are becoming disturbingly common, and they’re no longer hypothetical edge cases. As recent testing of ChatGPT’s health features has shown, the tool repeatedly fails to flag textbook medical emergencies when presented with classic symptom clusters. The results have prompted emergency medicine physicians and AI safety researchers alike to use words like “unbelievably dangerous” — not as hyperbole, but as clinical assessment.
And yet, millions of people are already using it as their first stop for health questions. Many of them will never see the test results. Many of them are typing in symptoms right now.
The appeal is obvious, and worth naming honestly. Healthcare in the United States is expensive, fragmented, and slow. The average wait for a new-patient physician appointment in major metro areas is now 26 days, according to a 2022 Merritt Hawkins survey. For people without insurance, even an urgent care visit can mean a $300 bill and an afternoon lost. ChatGPT is free, immediate, and endlessly patient. It never sighs. It never makes you feel stupid for asking.
Rafael Gutierrez, a 51-year-old warehouse supervisor in Houston, started using ChatGPT for health questions last year after his company switched insurance plans and his longtime doctor fell out of network. “I just needed someone to talk to about what was going on with my body,” he said. “And the AI actually listened.” Rafael described a months-long pattern of using the chatbot to interpret blood pressure readings, evaluate persistent headaches, and decide whether certain symptoms were “worth” a clinic visit. He trusted it the way you trust a knowledgeable friend.

That trust is exactly what makes the failures so consequential. When researchers at institutions including the University of California and Stanford fed ChatGPT symptom sets corresponding to stroke, heart attack, pulmonary embolism, and appendicitis, the AI frequently responded with reassuring, low-urgency guidance. In several trials, it suggested home remedies or watchful waiting for conditions that required emergency intervention within hours. The tool wasn’t always wrong. That’s part of the problem: its accuracy on routine questions creates a halo of credibility that extends, dangerously, to the cases where it gets things catastrophically wrong.
Psychologists call this automation bias, the tendency for humans to defer to algorithmic or automated outputs even when their own instincts (or the evidence in front of them) suggest something different. A 2023 study published in the journal Nature Medicine found that patients who consulted AI tools before speaking with a physician were more likely to delay seeking emergency care, even when their symptoms met clinical criteria for urgent evaluation. The AI’s tone, calm and measured and encyclopedic, functioned as a kind of emotional anesthetic. It didn’t just provide information. It provided reassurance. And reassurance, when a blood clot is traveling toward your lungs, can be lethal.
There’s a concept in human-computer interaction research called misplaced authority transfer: the moment when a user unconsciously assigns an AI the same decision-making weight as a trained professional. It happens faster than you’d think. OpenAI itself has acknowledged that ChatGPT is not a substitute for professional medical advice, burying the disclaimer in terms of service that almost no one reads. The interface itself carries no such humility. It responds to symptom queries with the same confident, structured format it uses to explain photosynthesis or summarize a Supreme Court ruling. There’s no tonal difference between “here’s how mitosis works” and “your chest pain is likely muscular.”
Naomi Ashworth, a 28-year-old graduate student in Chicago, experienced this firsthand when she described symptoms of a severe allergic reaction (throat tightening, hives spreading across her torso, difficulty swallowing) and received a response that focused primarily on over-the-counter antihistamines. “It was written so well,” she said. “Like a really thorough WebMD article. I almost didn’t use my EpiPen.” She did, ultimately, because a roommate walked in and saw her face swelling. The roommate had no medical training. She just had eyes.
As a recent study of ChatGPT’s symptom-triage abilities suggested, the core problem may be architectural. Large language models are optimized for coherent, helpful-sounding responses. They’re pattern-matching engines trained on vast text corpora, and medical text (patient-education websites, health blogs, consumer-facing content) skews heavily toward common conditions. The rare but deadly presentation gets drowned out by the statistically likely one. When you type in “chest pain + nausea + arm tingling,” the model is drawing from a pool of text where those symptoms are far more often attributed to anxiety or GERD than to a heart attack. It gives you the probable answer, not the dangerous one. And emergency medicine is fundamentally about catching the dangerous one.
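To make that base-rate logic concrete, here is a minimal, purely illustrative sketch. The prevalence and cost figures are hypothetical (they are not drawn from any of the studies discussed here); the point is only that the most probable explanation and the explanation a triage clinician must rule out first are not the same answer.

```python
# Purely illustrative: hypothetical prevalence and cost figures,
# not data from the studies discussed above.

# Hypothetical share of each condition among people typing
# "chest pain + nausea + arm tingling" into a consumer tool.
prevalence = {
    "anxiety": 0.55,
    "GERD": 0.30,
    "musculoskeletal strain": 0.10,
    "heart attack": 0.05,  # rare in the text corpus, lethal in real life
}

# Hypothetical cost of missing each condition. Missing an emergency
# is catastrophically worse than missing a benign explanation.
cost_of_miss = {
    "anxiety": 1,
    "GERD": 1,
    "musculoskeletal strain": 2,
    "heart attack": 1000,
}

# A likelihood-style responder: return the statistically probable answer.
most_likely = max(prevalence, key=prevalence.get)

# A triage-style responder: weigh how likely a condition is against
# how badly it ends if you miss it.
worst_to_miss = max(prevalence, key=lambda c: prevalence[c] * cost_of_miss[c])

print(f"Most likely explanation:        {most_likely}")    # anxiety
print(f"Explanation triage chases first: {worst_to_miss}")  # heart attack
```

A real language model isn’t doing this arithmetic explicitly, of course. The sketch only names the asymmetry that triage is built around and next-token prediction is not.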

Dr. Sarah Chen, an emergency physician in San Francisco who has been vocal about AI safety in healthcare, put it bluntly in a recent interview: “Triage is about identifying the five percent of cases that will kill you if you miss them. That’s the entire job. An AI that’s right 95 percent of the time is an AI that misses every lethal case.”
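Her arithmetic is worth spelling out. In a hypothetical caseload where 95 of every 100 queries are benign, a responder that always gives the statistically likely answer scores an impressive-looking 95 percent accuracy while catching zero emergencies. A quick sketch, with invented numbers:

```python
# Illustrative arithmetic for the quote above, using a hypothetical
# caseload: 100 symptom queries, of which 5 are lethal emergencies.
cases = ["benign"] * 95 + ["emergency"] * 5

# A responder tuned to the statistically likely answer calls everything benign.
predictions = ["benign"] * len(cases)

accuracy = sum(p == c for p, c in zip(predictions, cases)) / len(cases)
caught = sum(c == "emergency" and p == "emergency"
             for p, c in zip(predictions, cases))

print(f"Overall accuracy: {accuracy:.0%}")   # 95% -- looks impressive
print(f"Emergencies caught: {caught} of 5")  # 0 -- every lethal case missed
```

The gap between those two numbers is the gap Dr. Chen is describing: overall accuracy measures how often you’re right, while triage only cares whether you caught the cases that kill.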
The cultural context makes this worse. We’re living through a moment of profound institutional distrust, where faith in traditional healthcare systems is eroding alongside trust in media, government, and expertise broadly. As explored in a recent piece on reclaiming clarity in an age of algorithmic confusion, people aren’t turning to AI because they’re foolish. They’re turning to AI because the systems that should be catching them have already dropped them. The 26-day wait. The $300 bill. The eight-minute appointment where the doctor never looks up from the screen. ChatGPT fills a vacuum that the healthcare industry created.
And that’s the tension at the center of all this. The technology is responding to a real, legitimate need. People are desperate for health guidance that is accessible, affordable, and patient. The demand is enormous and understandable. But the supply, a language model with no clinical judgment and no ability to assess urgency, is fundamentally inadequate for the moments that matter most, which are precisely the moments when the tool is least reliable.
Rafael, the warehouse supervisor in Houston, told me something that stayed with me. He said he knew, on some level, that ChatGPT wasn’t a doctor. “But it felt like one,” he said. “And sometimes feeling like you’re being taken care of is enough to stop you from actually getting taken care of.”
That sentence contains something worth sitting with. The comfort of feeling heard, feeling guided, feeling like someone (or something) has evaluated your situation and found it manageable: that comfort has a physiological effect. It lowers cortisol. It reduces the urgency signal your own body is trying to send. Testing has repeatedly confirmed that the AI’s calm authority can override a patient’s own alarm, and there may be no more dangerous interaction in modern healthcare than an algorithm convincing someone that the emergency they’re experiencing is routine.
Denise Koh keeps a screenshot of the ChatGPT conversation from that Tuesday night. She shows it to people sometimes, not with anger, but with a kind of bewildered amazement. “It told me to stretch,” she says. “I was having a heart attack, and it told me to stretch.”
The AI didn’t fail because it was malicious. It failed because it was designed to be helpful in a way that has no room for the most human thing a doctor does: recognizing that sometimes the calm, reasonable answer is the wrong one. That sometimes the person in front of you needs to be alarmed. That the best thing you can say, in certain moments, is nothing reassuring at all.
The millions of people typing symptoms into ChatGPT tonight deserve to know that the tool’s greatest strength, its relentless, confident, beautifully articulated calm, is also, in the moments that count most, its most dangerous quality. Reassurance without judgment is just a story. And the body doesn’t care how well the story is written.
Feature image by Tima Miroshnichenko on Pexels