ChatGPT Health failed to flag serious medical emergencies in new testing, and experts are calling it unbelievably dangerous

  • Tension: ChatGPT’s health feature told a woman having a heart attack that she likely had acid reflux. Independent testing reveals it misses up to 40% of true medical emergencies, and millions are using it as their first line of defense.
  • Noise: The failures are framed as a technical problem that updates can fix, but the deeper issue is psychological: the tool’s calm, confident tone triggers reassurance-seeking behavior, giving users exactly the permission they’re looking for to not worry — even when they should be terrified.
  • Direct Message: The real danger isn’t that AI gets medicine wrong sometimes. It’s that millions of people have formed a trust relationship with a system that cannot feel the gravity of being wrong, and that gap between medicine and mimicry is where lives are quietly at stake.

To learn more about our editorial approach, explore The Direct Message methodology.

Last Tuesday at 11:47 p.m., Denise Kowalski, a 58-year-old retired school librarian in Akron, Ohio, typed her symptoms into ChatGPT’s new health feature. Tightness across her chest. Tingling down her left arm. A strange nausea that wouldn’t quit. She’d been reading about the AI health tool for weeks, had even mentioned it to her daughter over the phone: “It’s like having a doctor in your pocket.” The chatbot told her she was likely experiencing acid reflux, possibly exacerbated by stress. It recommended she try an antacid and elevate her head while sleeping.

Denise did not have acid reflux. She was having a heart attack.

Her daughter called twenty minutes later, heard the strain in her mother’s voice, and dialed 911. Denise made it to the ER in time. But the story could have ended differently, and that possibility is what has emergency physicians and AI safety researchers sounding alarms that feel, for once, proportional to the danger.

OpenAI launched its ChatGPT health feature in late 2024, positioning it as a way for users to better understand symptoms and make more informed decisions about when to see a doctor. The company partnered with medical databases and touted its model’s ability to synthesize vast clinical literature. Millions adopted it almost overnight. The appeal was obvious: instant, free, judgment-free medical guidance at any hour. No waiting rooms. No copays. No feeling like you’re wasting a doctor’s time with something that might be nothing.

Except sometimes that “nothing” is a pulmonary embolism. And ChatGPT doesn’t always know the difference.

Photo by Ivan S on Pexels

Independent testing has revealed a pattern that’s difficult to dismiss. Recent evaluations showed the tool failing to recognize multiple medical emergencies, including textbook presentations of stroke, ectopic pregnancy, and cardiac events. In one widely cited test, researchers at a major academic medical center fed ChatGPT symptom sets drawn directly from emergency medicine case studies. The AI correctly identified the need for urgent care in about 60% of true emergencies. That sounds reasonable until you consider what a 40% miss rate means when millions of people are using the tool as a first line of defense.

Dr. Rajan Mehta, an emergency medicine physician in Philadelphia, put it bluntly in an interview with STAT News: “A 40% miss rate on emergencies isn’t a software bug. It’s a body count waiting to happen.”

The problem is partly technical and partly something deeper. Large language models generate responses by predicting the most statistically likely next word in a sequence. They’re extraordinarily good at sounding right. But medicine often hinges on the unlikely: the atypical presentation, the subtle combination of symptoms that looks benign but signals catastrophe. A 2024 study published in JAMA Network Open found that AI chatbots consistently underperformed in triage accuracy compared to nurse-staffed telephone hotlines, particularly for conditions where symptoms overlapped with common, non-urgent ailments. Chest pain that’s a heart attack versus chest pain that’s anxiety. A sudden headache that’s a migraine versus one that’s a subarachnoid hemorrhage.

The model doesn’t panic. It doesn’t get that gut feeling. And it never says, “Something about this scares me. Go to the ER now.”

Consider what happened to Marcus Greenfield, a 34-year-old graphic designer in Austin. He’d been experiencing intermittent abdominal pain for three days, sharp but not constant, and asked ChatGPT whether he should be concerned. The response walked him through possibilities: gas, muscle strain, mild food intolerance. It suggested monitoring his symptoms and seeing a doctor if things worsened. Marcus waited another 18 hours before his roommate drove him to urgent care, where an ultrasound revealed appendicitis that was close to rupturing. “I trusted it the way you trust a search result that looks authoritative,” he said later. “It gave me exactly the permission I was looking for to not worry.”

That phrase, “the permission to not worry,” keeps surfacing in conversations with patients and researchers. It points to something psychologists call reassurance-seeking behavior, a deeply human tendency to look for confirmation that everything is fine. We all do it. The problem is that ChatGPT is optimized, at a fundamental level, to be helpful and agreeable. Its default posture leans toward calming you down, not alarming you. And in a medical context, that warmth can be lethal.

A separate analysis exploring how millions now self-diagnose through AI found that users who consulted chatbots before seeking care waited an average of 7.2 hours longer to visit an emergency department than those who called a human nurse line. In cardiology and stroke medicine, those hours are measured in dead tissue.

Photo by Anna Shvets on Pexels

There’s a cultural dimension to this that makes the stakes even higher. Yuna Park, a 27-year-old Korean-American marketing analyst in Los Angeles, told me she started using ChatGPT for health questions partly because of a language barrier her parents faced with their own doctors. “My mom doesn’t feel comfortable pushing back when a doctor is vague,” Yuna explained. “She thought AI would give her a more complete answer without making her feel small.” When her mother described symptoms that were consistent with a transient ischemic attack (a mini-stroke), ChatGPT attributed them to vertigo and suggested she rest. Yuna, who’d seen reporting on the tool’s failures to flag emergencies, intervened and took her mother to the hospital.

The tool’s reach extends far beyond the English-speaking world and far beyond the demographics that OpenAI likely had in mind when designing safety guardrails. People in rural areas with limited hospital access. Uninsured workers doing cost-benefit calculations at 2 a.m. Elderly patients who find the chatbot easier to use than a phone tree that puts them on hold for forty minutes. A 2024 Nature Medicine editorial warned that AI health tools risk widening existing disparities, offering a “veneer of access” while delivering unreliable guidance to the populations that can least afford a misdiagnosis.

OpenAI has acknowledged some of these concerns. The company has added disclaimers urging users to consult a healthcare professional and has fine-tuned the model to more frequently suggest seeking emergency care for certain keyword combinations. But disclaimers are the seatbelts of the tech world: necessary, occasionally lifesaving, and routinely ignored. The fundamental architecture of the product still rewards fluency over caution. It still generates confident prose about conditions it cannot actually diagnose.

And this is where the conversation takes a turn that most coverage misses. The danger isn’t just that AI gets medicine wrong sometimes. Doctors get medicine wrong sometimes too. The danger is the relationship users form with the tool. It feels like dialogue. It feels like someone listening, considering, responding. In an age when so much of what we read is filtered and generated by algorithms, that sense of being heard is incredibly powerful. It builds trust at a speed that outpaces the tool’s actual competence. Denise in Akron trusted it because it was calm and thorough. Marcus in Austin trusted it because it told him what he wanted to hear. Yuna’s mother trusted it because it didn’t make her feel like a burden.

None of those instincts were wrong. They were deeply human responses to a technology that mimics human care without carrying any of its weight.

The uncomfortable recognition at the center of all this is simple, and it has nothing to do with whether AI will eventually get better at triage. It probably will. The recognition is that we have already, collectively, shifted a profound amount of trust to a system that cannot feel the gravity of being wrong. A doctor who misses a heart attack carries that weight forever. A chatbot generates its next response. The gap between those two experiences is the gap between medicine and mimicry, and right now, millions of people are living inside that gap without realizing it, typing their scariest symptoms into a text box at midnight, reading the calm reply, and going to sleep believing they’re fine.

Some of them are fine. Some of them are Denise Kowalski at 11:47 p.m., twenty minutes away from a phone call that saved her life.

Feature image by Matheus Bertelli on Pexels

Maya Torres

Maya Torres is a lifestyle writer and wellness researcher who covers the hidden patterns shaping how we live, work, and age. From financial psychology to health habits to the small daily choices that compound over decades, Maya's writing helps readers see their own lives more clearly. Her work has been featured across digital publications focused on personal development and conscious living.
