Diagnosing the Diagnosers
When AI claims expert-level accuracy, what do PTs need to ask?
Volume 30, November 3, 2025

From the Editor
We know, we know…this issue is a bit of a numbers nerd’s dream. If you came here hoping for a cool new wearable or a GPT that can do burpees, we owe you one.
But with so many AI tools claiming “doctor-level accuracy,” we thought it was time to dig into what those numbers actually mean and what PTs should watch for when vendors show up with charts and buzzwords.
So yes, it’s a bit of a stats-fest. But we promise it’s the good kind: practical, relevant, and the kind of data that might actually shape your documentation or referral decisions.
Hang in there. Fewer decimals next time. Probably.
📰 This Week's Highlights
AI Diagnoses with Reasoning Transparency
Summary: A Harvard Medical School team introduced “Dr. CaBot,” an AI system capable of not only delivering a diagnosis but also laying out its reasoning step-by-step. In a historic first, the model’s analysis was published alongside a human expert’s differential in the NEJM Case Records series. While not benchmarked at scale yet, the system represents a push toward transparent, interpretable diagnostic AI.
Article (estimated read time 5 minutes)
Why this matters for PTs:
If future AI tools help you decide between systemic, neurogenic, or musculoskeletal causes of dysfunction, interpretability will be essential. You’ll need to see why an AI reached its conclusion before trusting it to guide care, documentation, or referrals.
AI-Curated Imaging Reduces Diagnostic Errors
Summary: A study in internal medicine found that using AI to enhance and interpret imaging (e.g., reconstructing low-quality scans or sharpening detail) reduced diagnostic errors by ~45%: from 22% down to 12%. The cohort included 60 real-world cases and showed that preprocessing with AI can meaningfully support clinical decisions.
Article (estimated read time 4 minutes)
Why this matters for PTs:
MSK imaging—especially ultrasound and MRI—is increasingly part of PT workflows. If AI is used to interpret or enhance these images, understanding its impact on accuracy (and potential for error) is crucial for planning care and communicating with referring providers.
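A quick way to sanity-check a headline like "45% fewer errors" is to separate the absolute change from the relative one. The sketch below uses only the two error rates quoted in the summary (22% and 12%); the function name `error_reduction` is ours, not something from the study.

```python
def error_reduction(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    if before <= 0:
        raise ValueError("the 'before' error rate must be positive")
    absolute_points = (before - after) * 100
    relative = (before - after) / before
    return absolute_points, relative

# Error rates as quoted in the summary: 22% without AI, 12% with AI
abs_pts, rel = error_reduction(0.22, 0.12)
print(f"Absolute drop: {abs_pts:.0f} percentage points")
print(f"Relative drop: {rel:.0%}")
```

Both numbers describe the same result: about 10 fewer errors per 100 cases, which is about 45% of the original error rate. When a vendor quotes only one of them, ask for the other.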
Frameworks to Reduce AI Misdiagnosis
Summary: A new review in Frontiers in Medicine outlines common failure points in diagnostic AI systems—such as biased training data, overconfident outputs, and unclear human oversight—and offers a multidimensional framework for safer design and deployment. While no single error rate is cited, the paper notes real-world AI performance can drop 15–30% from lab benchmarks, with documented subgroup disparities (e.g., 23% higher false-negative rates in rural patients).
Article (estimated read time 6 minutes)
Why this matters for PTs:
If your clinic is trialing AI for triage or screening (e.g., red flags, systemic causes), you’ll want to ensure the tools are validated on populations like yours, have clinician oversight built in, and include bias monitoring plans.
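If a vendor or pilot report gives you subgroup counts, the false-negative gap can be checked by hand. Here is a minimal sketch of that check. The counts below are hypothetical, invented only to show what a "23% higher" false-negative rate looks like in practice, and the subgroup labels and function name are our own.

```python
def false_negative_rate(missed: int, caught: int) -> float:
    """False-negative rate = missed true cases / all true cases."""
    total = missed + caught
    if total == 0:
        raise ValueError("no true cases in this subgroup")
    return missed / total

# Hypothetical counts of patients who truly had the condition (not study data)
subgroups = {
    "urban": (100, 900),  # (missed by the tool, caught by the tool)
    "rural": (123, 877),
}

rates = {name: false_negative_rate(m, c) for name, (m, c) in subgroups.items()}
relative_gap = rates["rural"] / rates["urban"] - 1

for name, rate in rates.items():
    print(f"{name}: {rate:.1%} of true cases missed")
print(f"Rural false-negative rate is {relative_gap:.0%} higher than urban")
```

With these made-up counts, a 23% relative gap means moving from 10.0% to 12.3% of true cases missed. It is a relative difference, not 23 extra missed cases per 100.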
GPT-4o vs Gemini vs Claude: LLMs in Clinical Comparison
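Summary: A recent study compared leading LLMs (GPT-4o, Gemini 2.0, Claude 3.5 Sonnet) on 600 complex clinical questions. GPT-4o led with 84–88% accuracy; Gemini followed at 68–77%; GPT-4 trailed at ~54%. As the Overall Key Numbers section shows, these figures appear to come from different benchmarks, so read them as separate data points rather than a strict head-to-head ranking. The study assessed not just correctness, but also citation quality and clinician satisfaction.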
Article (estimated read time 5 minutes)
Why this matters for PTs:
Not all LLMs are created equal. Whether you're using AI to draft HEPs, referral letters, or SOAP notes, know which model is behind your tool—and whether its strengths match your task. GPT-4o may be best for guideline interpretation; others may shine in summarization.
Overall Key Numbers for Your PT Lens
GPT‑4o: ~84.2% (2020) and ~88.2% (2021) accuracy on the Chinese NMLE dataset. (Nature)
GPT‑4: ~74.7% (2020) and ~73.2% (2021) on the same dataset. (Nature)
GPT‑3.5: ~50.5% (2020) and ~50.8% (2021) on the same dataset. (Nature)
Diagnostic‑AI error reduction: in one internal medicine cohort (n=60), AI tools cut the error rate from 22% to 12%, a drop of 10 percentage points (≈45% relative reduction).
Radiology/LLM tasks: one comparative study reported 77.2% “correctness” and 68.0% “accuracy” for Gemini, versus 54.0% for GPT‑4.
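One more check worth running on small studies like the n=60 cohort above: how wide is the uncertainty around a single reported rate? The sketch below computes a standard 95% Wilson score interval. It assumes the 12% error rate was measured on all 60 cases, which the summary does not confirm, and the function name is our own.

```python
import math

def wilson_interval(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for an observed proportion p out of n cases."""
    if n <= 0 or not 0.0 <= p <= 1.0:
        raise ValueError("need n > 0 and 0 <= p <= 1")
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Assumption: the 12% with-AI error rate was observed across all 60 cases
low, high = wilson_interval(0.12, 60)
print(f"Plausible range for a 12% error rate at n=60: {low:.0%} to {high:.0%}")
```

Under that assumption the range runs from roughly 6% to 23%, which is wide enough to overlap the 22% "before" figure. That does not mean the result is wrong, only that a 60-case study is a starting point rather than proof.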
Bonus Reads
Heterogeneity and predictors of the effects of AI assistance on radiologists (Nature Medicine, 2024): Large‑scale study showing how AI assistance affects 140 radiologists across 15 chest X‑ray tasks, highlighting when AI helps and when it may hurt. (Nature)
A systematic review and meta‑analysis of diagnostic accuracy of AI models (2025): Review reporting that AI models in diagnostic tasks averaged ~52.1% accuracy, comparable to non‑expert physicians but below experts. (pmc.ncbi.nlm.nih.gov)