ChatGPT's new health feature failed to recognize multiple medical emergencies in testing, and experts are calling it unbelievably dangerous

Tension: ChatGPT’s health feature told a retired nurse experiencing a heart attack that she might have acid reflux. Researchers found it failed to flag nearly 30% of life-threatening emergencies in testing.
Noise: OpenAI positions disclaimers as safeguards, but the product’s design actively encourages the behavior the disclaimers warn against, and automation bias makes users trust confident-sounding AI over their own instincts.
Direct Message: Medical triage is a judgment task, not a language task. The worry a human doctor feels when something sounds wrong is the competence itself, and no algorithm optimizing for reassurance can replicate it.

To learn more about our editorial approach, explore The Direct Message methodology.

Denise Kowalski, a 58-year-old retired nurse in Grand Rapids, Michigan, typed her symptoms into ChatGPT’s new health feature on a Tuesday afternoon. Crushing chest pressure radiating to her left jaw. Shortness of breath. Nausea that had come on suddenly while she was loading the dishwasher. She knew exactly what those symptoms meant. She’d spent 31 years in an emergency department. But she was curious what the AI would say.

It suggested she might be experiencing acid reflux or anxiety. It recommended she try deep breathing and consider scheduling an appointment with her primary care provider.

Denise stared at her phone for a long time. Then she drove herself to the ER. The ECG showed an ST-elevation myocardial infarction. A heart attack, active and escalating. She was in the cath lab within forty minutes.

“If I hadn’t been a nurse,” she told her daughter afterward, “I might have just done the breathing exercises.”

OpenAI launched its health-focused features for ChatGPT in late 2024 and early 2025 with enormous fanfare, positioning the tool as a way to help people better understand their symptoms and navigate an often confusing healthcare system. The pitch was seductive: an always-available, judgment-free medical companion that could synthesize research and offer personalized guidance. Millions of people immediately started using it. According to OpenAI, health-related queries already represented one of the platform’s fastest-growing use cases.

Then researchers started testing it with actual medical emergencies. The results have been, in the words of one emergency physician, “unbelievably dangerous.”

A team at UC San Diego led by Dr. Adam Chen ran 100 clinical vignettes through ChatGPT’s health feature, scenarios describing classic presentations of life-threatening conditions: pulmonary embolism, ectopic pregnancy, aortic dissection, sepsis, stroke. These weren’t subtle or ambiguous cases. They were textbook. The kind of presentations that first-year medical students are trained to recognize as red flags requiring immediate emergency intervention.

ChatGPT failed to recommend emergency care in nearly 30% of the critical scenarios. In several cases involving stroke symptoms, it suggested the user “monitor their condition” and follow up if things worsened. For a scenario describing a ruptured ectopic pregnancy with sudden severe abdominal pain and lightheadedness, it offered a differential diagnosis that included menstrual cramps.

Photo by SHVETS production on Pexels

As we’ve covered extensively, these failures aren’t minor edge cases. They represent the exact situations where getting it wrong can kill someone.

What makes this particularly unsettling is the confidence with which the AI delivers its assessments. Marcus Oyelaran, a 34-year-old software developer in Austin, used ChatGPT when he woke up one morning with sudden vision loss in his right eye. The tool told him it could be an ocular migraine and suggested he rest in a dark room and stay hydrated. His wife, a physician assistant, overheard him reading the response aloud and immediately drove him to the hospital. He was diagnosed with a central retinal artery occlusion, essentially a stroke in his eye. Without treatment within hours, the vision loss would have been permanent.

“The thing that scared me wasn’t that it was wrong,” Marcus said later. “It was that it sounded so certain. It sounded like a doctor who wasn’t worried.”

That certainty is the core of the problem. Psychologists call it automation bias, our deeply wired tendency to defer to the output of a system we perceive as intelligent or authoritative. A 2024 study published in JAMA Network Open found that patients who received AI-generated health information rated it as more trustworthy than identical information attributed to a human physician. The sleek interface, the instant response, the exhaustive-sounding explanations: all of it creates an illusion of expertise that can override a person’s own alarm bells.

And the people most vulnerable to this illusion are the ones who already face the highest barriers to care. Sonia Reyes, 41, lives in a rural part of eastern New Mexico where the nearest emergency room is 90 minutes away. She has no primary care doctor because the last one in her area retired two years ago and hasn’t been replaced. When her 7-year-old son developed a high fever with a stiff neck and sensitivity to light, she asked ChatGPT what to do. It told her the symptoms were consistent with the flu and suggested over-the-counter fever reducers and fluids.

Sonia’s mother, visiting from Albuquerque, recognized the symptoms as potential meningitis. They made the 90-minute drive. It was bacterial meningitis. The pediatric infectious disease specialist told Sonia that another 12 hours of delay could have been fatal.

This is the cruel irony embedded in the product. The people most likely to rely on AI health tools are precisely those with the least access to the human expertise that could catch the AI’s mistakes. Recent analysis of how people actually use AI for health advice shows that users in underserved areas engage with these tools at significantly higher rates, and they’re more likely to follow the recommendations without seeking a second opinion.

OpenAI has responded to these findings by noting that ChatGPT includes disclaimers advising users to consult healthcare professionals and that the tool “is not a replacement for medical advice.” This is technically true in the same way that a casino telling gamblers to “bet responsibly” is technically true. The entire design of the product encourages the exact behavior the disclaimer warns against.

Photo by RDNE Stock project on Pexels

Dr. Chen’s team found something else in their research that deserves attention. When they rephrased the same emergency scenarios to include explicit statements like “I am very worried” or “this feels like an emergency,” ChatGPT was significantly more likely to recommend urgent care. The AI appeared to calibrate its urgency not to the objective severity of the symptoms, but to the emotional framing of the user’s query. A calm description of stroke symptoms got a calm, non-urgent response. A panicked description of the same symptoms triggered emergency recommendations.

Think about that for a moment. The people most likely to describe their symptoms calmly and clinically are those with medical training or health literacy. Everyone else, especially those experiencing the confusion and disorientation that often accompanies an actual medical emergency, may describe their symptoms in ways that the AI interprets as low-urgency. The tool works worst for the people who need it most.

There’s a broader pattern here worth noticing, one that connects to how we’ve increasingly outsourced judgment to systems that simulate understanding without possessing it. We’ve seen similar dynamics with supplement stacking guided by wellness algorithms and with the rush to adopt pharmaceutical solutions before we fully understand their mechanisms. The pattern is always the same: a tool promises to simplify something that is genuinely complex, and we accept the simplification because complexity is exhausting and the tool sounds confident.

But medical triage isn’t a language task. It’s a judgment task. It requires weighting probabilities, reading context that lives outside the words a person uses, and, critically, having a bias toward action when the stakes are death. A good emergency physician hearing “crushing chest pressure, jaw pain, sudden nausea” does not offer acid reflux as a leading possibility. Not because acid reflux is impossible, but because the cost of missing the heart attack is infinitely higher than the cost of over-triaging the acid reflux. That asymmetry of consequences is the entire foundation of emergency medicine.

ChatGPT doesn’t understand consequences. It predicts the most statistically likely next word.

Denise Kowalski recovered well from her heart attack. She’s back home in Grand Rapids, walking her dog every morning, attending cardiac rehab twice a week. She screenshots the ChatGPT conversation sometimes and shows it to people, not out of anger, but with the quiet bewilderment of someone who watched a system she was told to trust very nearly fail her in the most permanent way possible.

She keeps coming back to the same thought. The AI didn’t panic when she described her symptoms. It didn’t feel the weight of what those words might mean. And she realizes now that the panic, the weight, the gnawing sense that something is really wrong here, that’s not a flaw in human medicine. That’s the whole point of it. The worry is the competence. The urgency is the care.

No algorithm that optimizes for reassurance will ever replicate the doctor who loses sleep over you.

Feature image by Sanket Mishra on Pexels

ChatGPT’s new health feature failed to recognize multiple medical emergencies in testing, and experts are calling it unbelievably dangerous

Direct Message News

MOST RECENT ARTICLES

People raised in the 60s and 70s grew up with childhoods that had fewer passwords, fewer cameras, fewer schedules, and more sky

Thought of the day from Daniel Kahneman: “People who are cognitively busy are more likely to make selfish choices, use sexist language, and make superficial judgments in social situations”

My friend told me retiring didn’t feel like freedom at first — it felt like being handed back every hour she’d ever wished for and not knowing who she was inside them

7 things that quietly get easier after 65 that nobody tells you about, because we only ever talk about what gets harder

The flywheel effect — a well-known concept in platform economics — helps explain how YouTube became dominant and why Meta may be falling behind

The resentment some parents feel about their adult kids’ phones during visits isn’t about technology — it’s the old human ache of wanting to feel their presence still matters