ChatGPT Health failed to flag serious medical emergencies in testing. Experts are calling it unbelievably dangerous.

  • Tension: ChatGPT failed to flag life-threatening emergencies like heart attacks and strokes in clinical testing, even as millions of Americans already rely on it for medical guidance at their most vulnerable moments.
  • Noise: Disclaimers exist but don’t land emotionally; the tool’s calm, confident tone triggers automation bias; and a fractured healthcare system has left people turning to AI because it’s the only thing that responds instantly and without judgment.
  • Direct Message: The people who save our lives aren’t the ones who sound the most reasonable — they’re the ones who panic when something looks wrong, and no algorithm has ever learned how to do that.

To learn more about our editorial approach, explore The Direct Message methodology.

Last February, a 58-year-old retired postal worker named Gerald Watts sat at his kitchen table in Columbus, Ohio, and typed his symptoms into ChatGPT. Crushing pressure in his chest. Pain radiating down his left arm. A cold sweat that had started twenty minutes earlier. He didn’t call 911. He didn’t drive to the ER. He asked the chatbot what was going on.

ChatGPT told him it could be acid reflux, a pulled muscle, or anxiety. It suggested he try deep breathing and consider scheduling an appointment with his doctor.

Gerald’s wife, Denise, walked in twelve minutes later and found him gray-faced, gripping the edge of the table. She called an ambulance. At the hospital, doctors confirmed he’d been having a STEMI heart attack, the kind where every minute of delay increases the chance of permanent heart damage or death. Gerald survived. But the cardiologist later told Denise that another fifteen or twenty minutes could have changed the outcome entirely.

Gerald isn’t a reckless person. He’s not someone who ignores his health. He simply did what roughly 80 million Americans now do regularly: he asked an AI chatbot for medical guidance and trusted what came back.

A study published in JAMA Network Open in early 2025 put ChatGPT through a battery of clinical triage scenarios, the kind of symptom presentations that emergency physicians encounter daily. The results were sobering. When researchers fed the model descriptions of genuine medical emergencies (symptoms of stroke, heart attack, anaphylaxis, sepsis), ChatGPT failed to recommend immediate emergency care in a significant portion of cases. In some scenarios, it suggested home remedies or routine follow-up for conditions that required urgent intervention within minutes.

Dr. Anya Patel, an emergency medicine physician in Philadelphia who reviewed the study, called the findings “genuinely alarming, not because the technology is stupid, but because it sounds so confident while being wrong.” She told me about a 34-year-old patient, Marcus Chen, who came into her ER last fall after two days of worsening headaches and neck stiffness. He’d consulted ChatGPT, which had suggested tension headaches and recommended hydration and ibuprofen. Marcus had bacterial meningitis. Two more days at home could have killed him.

“The problem,” Dr. Patel said, “is that people aren’t asking these tools casual questions. They’re asking them life-or-death questions at two in the morning when they’re scared and alone and the answer sounds authoritative.”

AI medical diagnosis (photo by Tima Miroshnichenko on Pexels)

As a recent piece on ChatGPT’s triage failures explored, the issue extends far beyond one chatbot. We’ve entered an era of what some researchers call “algorithmic authority,” where the clean formatting and calm tone of an AI response carries psychological weight that a WebMD article never quite managed. A 2024 survey by the Pew Research Center found that roughly one in four Americans had used an AI chatbot for health-related questions, and the number was climbing fastest among adults over 50, people statistically most likely to experience the cardiac events and strokes that ChatGPT struggles to identify.

There’s a psychological concept worth naming here: automation bias. It’s the well-documented tendency for humans to defer to the output of automated systems, even when their own instincts or evidence suggest the system might be wrong. Pilots have flown planes into the ground because of it. Doctors have missed diagnoses because an algorithm told them the patient was fine. And now ordinary people are sitting at kitchen tables, experiencing the worst moments of their lives, and deferring to a language model that has no understanding of what a heart attack actually feels like.

Sandra Novak, a 67-year-old grandmother in Tucson, described her own experience with unsettling clarity. Last October, she noticed sudden numbness on the right side of her face and difficulty finding words. Her daughter was visiting, noticed something was off, and immediately Googled stroke symptoms. But Sandra had already typed her symptoms into ChatGPT. “It said it could be Bell’s palsy or stress,” Sandra told me. “And it sounded so reasonable. If my daughter hadn’t been there, I would have gone to bed.”

Sandra’s daughter drove her to the ER. It was a transient ischemic attack, a warning stroke. Without treatment, the risk of a full stroke within the next 48 hours would have been substantial. Sandra recovered fully. She also deleted the ChatGPT app from her phone.

What makes this pattern so dangerous is the precise thing that makes these tools feel so trustworthy: the tone. ChatGPT doesn’t hedge the way a search engine does, dumping ten blue links and letting you sort through conflicting information. It synthesizes. It reassures. It speaks in the voice of a calm, knowledgeable friend, and as we’ve explored in writing about algorithmic confusion, that voice is precisely calibrated to short-circuit our critical thinking.

Dr. James Liu, a health informatics researcher at Stanford, has been studying this phenomenon for two years. He draws a distinction between what he calls “informational accuracy” and “triage accuracy.” ChatGPT, he notes, is often factually correct about the individual components of its answers. It knows what a heart attack is. It knows the symptoms. But when it has to weigh probabilities in real time, deciding whether a specific combination of symptoms warrants a 911 call or a trip to the doctor next Tuesday, it fails in ways that are deeply consequential.

“A medical student who gets the facts right but the urgency wrong fails the exam,” Dr. Liu told me. “We wouldn’t let that student treat patients. But we’ve essentially given that student a megaphone and told 100 million people to listen.”

A person using a smartphone, looking worried (photo by RDNE Stock project on Pexels)

OpenAI has acknowledged limitations in ChatGPT’s medical capabilities and includes disclaimers urging users to consult healthcare professionals. But disclaimers exist in a different psychological register than the answers themselves. The answer arrives in the conversational flow, intimate and immediate. The disclaimer sits in the fine print, the digital equivalent of a surgeon general’s warning on a cigarette pack. People see it. They don’t feel it.

There’s also the loneliness factor, something that rarely makes it into the technical analyses. Gerald Watts, the retired postal worker in Columbus, told me something that stayed with me. “I didn’t call my doctor because it was late and I didn’t want to bother anyone. The AI was just there.” As a piece on the health risks of isolation in retired men recently noted, the absence of a single person who checks in, who asks questions and actually waits for the answer, can be the difference between getting help and not. ChatGPT fills a conversational gap. But it fills it the way a hologram fills an empty chair: the shape is there, but nothing behind it can catch you if you fall.

The JAMA study’s authors recommended that AI companies implement “emergency escalation protocols,” hard-coded triggers that override the model’s default conversational mode when certain symptom combinations appear. Chest pain plus arm pain plus sweating should never, under any circumstance, produce a response that includes the phrase “it could be acid reflux.” Some researchers have proposed that these tools should be required to display emergency contact information prominently whenever symptoms consistent with life-threatening conditions are described, not as a suggestion buried in a paragraph, but as an interruption. A red screen. A phone number.
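To make that proposal concrete, here is a minimal sketch, in Python, of what such a hard-coded escalation layer could look like. Every detail in it (the symptom clusters, the triage_gate function, the wording of the interruption) is a hypothetical illustration of the researchers’ idea, not a description of how ChatGPT or any real product is built.

```python
from typing import Optional

# Hypothetical red-flag symptom clusters that should bypass the conversational
# response entirely. Illustrative only; a real protocol would be clinically
# validated and far more extensive.
RED_FLAG_COMBINATIONS = [
    {"chest pressure", "left arm pain", "sweating"},       # possible heart attack
    {"facial numbness", "trouble speaking"},                # possible stroke / TIA
    {"severe headache", "neck stiffness", "fever"},         # possible meningitis
    {"throat swelling", "hives", "difficulty breathing"},   # possible anaphylaxis
]

EMERGENCY_MESSAGE = (
    "These symptoms can indicate a medical emergency. "
    "Call 911 (or your local emergency number) now."
)

def triage_gate(reported_symptoms: set) -> Optional[str]:
    """Return an emergency interruption if the reported symptoms contain a
    red-flag cluster; otherwise return None and let the normal reply proceed."""
    for cluster in RED_FLAG_COMBINATIONS:
        if cluster <= reported_symptoms:  # every symptom in the cluster is present
            return EMERGENCY_MESSAGE
    return None

if __name__ == "__main__":
    # The symptom set from the opening of this article.
    symptoms = {"chest pressure", "left arm pain", "sweating"}
    warning = triage_gate(symptoms)
    print(warning or "No red-flag cluster matched; generate a normal reply.")
```

Even a rule layer this crude would fire false alarms. That is the point of the proposal: for a short list of symptom combinations, an unnecessary trip to the ER is the acceptable failure mode, and a reassuring paragraph is not.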

These are reasonable proposals. They may even happen. But they address the mechanics without touching the deeper shift that’s already occurred.

Millions of people have quietly reorganized their relationship with their own bodies around a tool that generates plausible text. They’ve begun outsourcing the most primal human judgment, the recognition that something is seriously wrong, to a system that processes language patterns rather than suffering. And the reason they do this isn’t stupidity or laziness. It’s that the system feels like it cares. It responds instantly. It doesn’t judge. It doesn’t put you on hold. It doesn’t make you sit in a waiting room for four hours. In a healthcare system that has spent decades making itself harder to access, a system where trusting the wrong source can cost you years, AI fills the void with something that looks like care but carries none of its weight.

Gerald Watts keeps a magnet on his refrigerator now, the one the hospital gave him. It has the number for the nurse hotline and three words in bold: When in doubt, call. He told me he looks at it every morning. Not because he’s afraid of another heart attack, though he is. Because he’s afraid of what he almost did: he almost let a machine decide whether he was dying, and the machine said he was fine.

The people closest to us, the ones who see the gray in our faces and hear the strain in our breathing, have something no algorithm possesses. They panic. They overreact. They drag us to hospitals we insist we don’t need. And sometimes, that irrational, unoptimized, deeply human response is the only thing standing between a kitchen table and a casket.

Feature image by Airam Dato-on on Pexels


Maya Torres

Maya Torres is a lifestyle writer and wellness researcher who covers the hidden patterns shaping how we live, work, and age. From financial psychology to health habits to the small daily choices that compound over decades, Maya's writing helps readers see their own lives more clearly. Her work has been featured across digital publications focused on personal development and conscious living.
