ChatGPT outperformed expectations in a structured medical triage test, and some doctors aren’t surprised

  • Tension: ChatGPT matched or exceeded expected accuracy in a structured medical triage evaluation — and the physicians who aren’t surprised are the ones most familiar with how broken the current system already is.
  • Noise: The debate frames AI triage as a threat to clinical expertise, but this framing ignores that millions of patients never access that expertise in the first place — making the real comparison not AI vs. doctor, but AI vs. nothing.
  • Direct Message: The most dangerous thing in medicine has never been a new tool — it’s the assumption that the old system was working for everyone, when it was really only working for those who could afford to be seen.

To learn more about our editorial approach, explore The Direct Message methodology.

Dr. Renata Oliveira, a 51-year-old emergency physician in Baltimore, did something last March that she swore she’d never do. She typed a patient’s symptoms into ChatGPT — not to get a diagnosis, but to settle a bet with a resident who claimed the AI would send a hypothetical chest-pain case straight to cardiology. It didn’t. It asked about anxiety history, recent caffeine intake, and whether the patient had recently changed medications. Renata stared at the screen for a long moment. “That’s exactly the sequence I would have followed,” she told the resident. “And I’ve been doing this for twenty-three years.”

She wasn’t the only one who noticed.

A structured evaluation published in npj Digital Medicine tested large language models — including GPT-4 — against standardized medical triage scenarios, the kind of branching clinical decisions that separate a sprained ankle from a hidden fracture, a panic attack from a pulmonary embolism. The results caught even cautious researchers off guard. ChatGPT matched or outperformed the expected triage accuracy benchmarks in a majority of test cases, demonstrating a capacity for structured reasoning that went far beyond keyword matching or symptom-Googling.

The reaction from the medical community has been split in a way that reveals something deeper than any debate about technology.

On one side: alarm. Dr. James Whitfield, a 63-year-old internist in Phoenix, called the findings “genuinely dangerous” at a hospital grand rounds presentation. His concern isn’t that the AI got the answers right — it’s that people will assume it always will. “Triage isn’t just pattern recognition,” he said. “It’s reading the room. It’s noticing that a patient who says they’re fine is gripping the armrest hard enough to turn their knuckles white. A chatbot can’t see knuckles.”

He’s not wrong. But the other side of the split is more interesting — and more uncomfortable for the medical establishment.

Photo by Mikhail Nilov on Pexels

Dr. Ananya Mehta, a 38-year-old family physician in Austin, has been quietly experimenting with AI-assisted workflows for over a year. She doesn’t use ChatGPT to diagnose. She uses it to think out loud. “I’ll describe a case — anonymized, obviously — and ask it to generate differential diagnoses I might be overlooking,” she told me. “It’s like having a colleague who never gets tired, never gets defensive, and has read everything.” She paused. “That last part is the one that makes doctors uncomfortable.”

What Ananya is describing isn’t artificial intelligence replacing clinical judgment. It’s something psychologists call cognitive offloading — the strategic delegation of mental tasks to an external tool so the brain can focus on higher-order reasoning. We do it every time we use a calculator instead of doing long division. The question everyone keeps arguing about is whether medical triage is more like long division or more like poetry.

The honest answer is that it’s both. And that’s where the conversation gets tangled.

Medical triage follows protocols. There are flowcharts, scoring systems, evidence-based decision trees that have been refined over decades. The Manchester Triage System, the Emergency Severity Index — these are structured frameworks, and structured frameworks are exactly what large language models are increasingly good at navigating. A 2023 study in JAMA Internal Medicine found that AI-generated responses to patient questions were rated as both more empathetic and more thorough than physician responses — a finding that landed like a grenade in medical Twitter discourse.

But protocols are only the skeleton. The flesh is context. Marcus DelVecchio, a 47-year-old paramedic in Chicago, put it this way: “I’ve triaged thousands of calls. The protocol says chest pain in a 55-year-old male is high priority. But I’ve learned that when a guy calls at 2 AM and his voice is flat — not panicked, just flat — that’s the one I run the lights for. The calm ones are the ones who are dying.” No model captures that yet. Maybe no model ever will.

And yet the test results keep landing. The structured triage evaluation didn’t just show that ChatGPT could follow decision trees — it showed the model reasoning through ambiguous presentations, weighing competing possibilities, and flagging cases where more information was needed rather than leaping to a conclusion. That last behavior — the willingness to say “I need more data” — is something medical educators spend years trying to teach human students.

The cultural moment matters here. We’re living through a period where trust in institutions — including medical institutions — is fracturing in ways that have real consequences. As we explored in a piece about how certain supplement combinations may actually accelerate brain aging, people are already making complex health decisions based on information they’ve gathered outside the clinical setting. They’re not waiting for permission. The rise of GLP-1 drugs has shown how quickly patients will adopt something that works — sometimes with effects that go far beyond the original prescription. AI triage tools will follow the same adoption curve, whether the medical establishment is ready or not.

Photo by Tima Miroshnichenko on Pexels

Dr. Whitfield’s fear — that patients will treat ChatGPT as an oracle — isn’t hypothetical. Surveys already show that roughly a third of Americans have used AI chatbots for health-related questions. But Ananya Mehta offers a counterpoint worth sitting with: “We already have a system where people wait six hours in an emergency room, get four minutes with an overwhelmed resident, and leave with a diagnosis they don’t understand. If we’re going to be honest about the risks of AI triage, we have to be honest about the risks of the system it’s being compared to.”

That comparison is the part nobody wants to linger on. The idealized version of medical triage — the experienced physician who sees every patient, weighs every nuance, catches every subtle cue — exists in some places. But for millions of people, especially in rural and underserved communities, it doesn’t. What exists is a phone nurse reading from a script, or a three-hour wait, or nothing at all. Against that baseline, a tool that correctly identifies high-acuity presentations 80% of the time isn’t a threat. It’s a lifeline.

This is the tension that keeps surfacing across so many areas of modern life — the gap between the system we describe and the system people actually experience. We talk about healthcare as if every patient encounter is a Marcus Welby episode. The reality is overbooked schedules, burnout, and medical errors linked to an estimated 250,000 American deaths per year. ChatGPT doesn't need to be perfect to be useful. It needs to be better than the void.

Renata Oliveira told me something that stuck. She said the moment she typed those symptoms into ChatGPT wasn’t the moment she lost faith in her own expertise. It was the moment she realized that expertise and tools aren’t rivals — they’re married to each other. “I use a stethoscope,” she said. “I use an EKG. I use imaging. None of those replaced me. They made me more accurate. This is the same argument we’ve been having since X-rays, and we keep having it because we keep confusing the tool with the hand that holds it.”

The doctors who aren’t surprised by ChatGPT’s triage performance aren’t the ones who think less of medicine. They’re the ones who’ve watched the quiet, unglamorous ways that unexpected tools reshape how people function. They’re the ones who know that the most dangerous thing in medicine has never been a new technology. It’s the belief that the old way was working perfectly fine for everyone — when it was really only working for the people who could afford to be seen.

Marcus the paramedic will still listen for the flat voice at 2 AM. No algorithm replaces that. But somewhere in a town with one urgent care clinic and a two-week wait for a primary care appointment, a 34-year-old mother is going to type her child’s symptoms into a chatbot at midnight. And the question that actually matters — the one that cuts through all the professional anxiety and the breathless headlines — isn’t whether we trust the machine. It’s whether we trust it more than silence.

Feature image by MART PRODUCTION on Pexels


Maya Torres

Maya Torres is a lifestyle writer and wellness researcher who covers the hidden patterns shaping how we live, work, and age. From financial psychology to health habits to the small daily choices that compound over decades, Maya's writing helps readers see their own lives more clearly. Her work has been featured across digital publications focused on personal development and conscious living.
