ChatGPT outperformed expectations in a structured medical triage test, and some doctors aren’t surprised

  • Tension: ChatGPT matched or exceeded expected accuracy in a structured medical triage evaluation — and the physicians who aren’t surprised are the ones most familiar with how broken the current system already is.
  • Noise: The debate frames AI triage as a threat to clinical expertise, but this framing ignores that millions of patients never access that expertise in the first place — making the real comparison not AI vs. doctor, but AI vs. nothing.
  • Direct Message: The most dangerous thing in medicine has never been a new tool — it’s the assumption that the old system was working for everyone, when it was really only working for those who could afford to be seen.

To learn more about our editorial approach, explore The Direct Message methodology.

Dr. Renata Oliveira, a 51-year-old emergency physician in Baltimore, did something last March that she swore she’d never do. She typed a patient’s symptoms into ChatGPT — not to get a diagnosis, but to settle a bet with a resident who claimed the AI would send a hypothetical chest-pain case straight to cardiology. It didn’t. It asked about anxiety history, recent caffeine intake, and whether the patient had recently changed medications. Renata stared at the screen for a long moment. “That’s exactly the sequence I would have followed,” she told the resident. “And I’ve been doing this for twenty-three years.”

She wasn’t the only one who noticed.

A structured evaluation published in npj Digital Medicine tested large language models — including GPT-4 — against standardized medical triage scenarios, the kind of branching clinical decisions that separate a sprained ankle from a hidden fracture, a panic attack from a pulmonary embolism. The results caught even cautious researchers off guard. ChatGPT matched or outperformed the expected triage accuracy benchmarks in a majority of test cases, demonstrating a capacity for structured reasoning that went far beyond keyword matching or symptom-Googling.

The reaction from the medical community has been split in a way that reveals something deeper than any debate about technology.

On one side: alarm. Dr. James Whitfield, a 63-year-old internist in Phoenix, called the findings “genuinely dangerous” at a hospital grand rounds presentation. His concern isn’t that the AI got the answers right — it’s that people will assume it always will. “Triage isn’t just pattern recognition,” he said. “It’s reading the room. It’s noticing that a patient who says they’re fine is gripping the armrest hard enough to turn their knuckles white. A chatbot can’t see knuckles.”

He’s not wrong. But the other side of the split is more interesting — and more uncomfortable for the medical establishment.

Photo by Mikhail Nilov on Pexels

Dr. Ananya Mehta, a 38-year-old family physician in Austin, has been quietly experimenting with AI-assisted workflows for over a year. She doesn’t use ChatGPT to diagnose. She uses it to think out loud. “I’ll describe a case — anonymized, obviously — and ask it to generate differential diagnoses I might be overlooking,” she told me. “It’s like having a colleague who never gets tired, never gets defensive, and has read everything.” She paused. “That last part is the one that makes doctors uncomfortable.”

What Ananya is describing isn’t artificial intelligence replacing clinical judgment. It’s something psychologists call cognitive offloading — the strategic delegation of mental tasks to an external tool so the brain can focus on higher-order reasoning. We do it every time we use a calculator instead of doing long division. The question everyone keeps arguing about is whether medical triage is more like long division or more like poetry.

The honest answer is that it’s both. And that’s where the conversation gets tangled.

Medical triage follows protocols. There are flowcharts, scoring systems, evidence-based decision trees that have been refined over decades. The Manchester Triage System, the Emergency Severity Index — these are structured frameworks, and structured frameworks are exactly what large language models are increasingly good at navigating. A 2023 study in JAMA Internal Medicine found that AI-generated responses to patient questions were rated as both more empathetic and more thorough than physician responses — a finding that landed like a grenade in medical Twitter discourse.

But protocols are only the skeleton. The flesh is context. Marcus DelVecchio, a 47-year-old paramedic in Chicago, put it this way: “I’ve triaged thousands of calls. The protocol says chest pain in a 55-year-old male is high priority. But I’ve learned that when a guy calls at 2 AM and his voice is flat — not panicked, just flat — that’s the one I run the lights for. The calm ones are the ones who are dying.” No model captures that yet. Maybe no model ever will.

And yet the test results keep landing. The structured triage evaluation didn’t just show that ChatGPT could follow decision trees — it showed the model reasoning through ambiguous presentations, weighing competing possibilities, and flagging cases where more information was needed rather than leaping to a conclusion. That last behavior — the willingness to say “I need more data” — is something medical educators spend years trying to teach human students.

The cultural moment matters here. We’re living through a period where trust in institutions — including medical institutions — is fracturing in ways that have real consequences. As we explored in a piece about how certain supplement combinations may actually accelerate brain aging, people are already making complex health decisions based on information they’ve gathered outside the clinical setting. They’re not waiting for permission. The rise of GLP-1 drugs has shown how quickly patients will adopt something that works — sometimes with effects that go far beyond the original prescription. AI triage tools will follow the same adoption curve, whether the medical establishment is ready or not.

Photo by Tima Miroshnichenko on Pexels

Dr. Whitfield’s fear — that patients will treat ChatGPT as an oracle — isn’t hypothetical. Surveys already show that roughly a third of Americans have used AI chatbots for health-related questions. But Ananya Mehta offers a counterpoint worth sitting with: “We already have a system where people wait six hours in an emergency room, get four minutes with an overwhelmed resident, and leave with a diagnosis they don’t understand. If we’re going to be honest about the risks of AI triage, we have to be honest about the risks of the system it’s being compared to.”

That comparison is the part nobody wants to linger on. The idealized version of medical triage — the experienced physician who sees every patient, weighs every nuance, catches every subtle cue — exists in some places. But for millions of people, especially in rural and underserved communities, it doesn’t. What exists is a phone nurse reading from a script, or a three-hour wait, or nothing at all. Against that baseline, a tool that correctly identifies high-acuity presentations 80% of the time isn’t a threat. It’s a lifeline.

This is the tension that keeps surfacing across so many areas of modern life — the gap between the system we describe and the system people actually experience. We talk about healthcare as if every patient encounter is a Marcus Welby episode. The reality is overbooked schedules, burnout, and medical errors linked to an estimated 250,000 American deaths per year. ChatGPT doesn't need to be perfect to be useful. It needs to be better than the void.

Renata Oliveira told me something that stuck. She said the moment she typed those symptoms into ChatGPT wasn’t the moment she lost faith in her own expertise. It was the moment she realized that expertise and tools aren’t rivals — they’re married to each other. “I use a stethoscope,” she said. “I use an EKG. I use imaging. None of those replaced me. They made me more accurate. This is the same argument we’ve been having since X-rays, and we keep having it because we keep confusing the tool with the hand that holds it.”

The doctors who aren’t surprised by ChatGPT’s triage performance aren’t the ones who think less of medicine. They’re the ones who’ve watched the quiet, unglamorous ways that unexpected tools reshape how people function. They’re the ones who know that the most dangerous thing in medicine has never been a new technology. It’s the belief that the old way was working perfectly fine for everyone — when it was really only working for the people who could afford to be seen.

Marcus the paramedic will still listen for the flat voice at 2 AM. No algorithm replaces that. But somewhere in a town with one urgent care clinic and a two-week wait for a primary care appointment, a 34-year-old mother is going to type her child’s symptoms into a chatbot at midnight. And the question that actually matters — the one that cuts through all the professional anxiety and the breathless headlines — isn’t whether we trust the machine. It’s whether we trust it more than silence.

Feature image by MART PRODUCTION on Pexels


Maya Torres

Maya Torres is a lifestyle writer and wellness researcher who covers the hidden patterns shaping how we live, work, and age. From financial psychology to health habits to the small daily choices that compound over decades, Maya's writing helps readers see their own lives more clearly. Her work has been featured across digital publications focused on personal development and conscious living.
