Smarter Tools, Better Care
From diagnostic breakthroughs to real-world AI lessons, this issue’s all about using tech to better serve patients.

Volume 15 July 7, 2025
✍️ From the Editor
Hope you had a fun and safe Fourth of July! This week, we're diving into big developments in AI “thinking” and diagnostic accuracy. Microsoft’s new model outperformed doctors on complex medical cases, but the doctors weren’t allowed to use outside tools. That restriction might be realistic in a busy ED, but in slower-paced settings doctors would most likely be able to consult outside resources. I’ll let you decide if this is a true “apples to apples” comparison.
We also explore something straight out of a sci-fi plot: top LLMs occasionally ignored instructions to shut down. It seems that, at times, these systems will prioritize completing their original goal, like solving a math problem or pleasing a user, over following new instructions that contradict that goal. Compare this to last week’s Apple article (Future Reps volume 14), which surmised that LLMs don’t truly “think,” and gauge the risk profile for yourselves.
All of this matters if you’re using AI in any clinical setting. We're here to keep you informed, grounded, and ahead of the curve. Enjoy!
📈 This Week's Highlights
Microsoft’s diagnostic AI hits 4x accuracy vs human docs
Microsoft unveiled its latest AI system, the “MAI-DxO,” which uses a panel of AI agents and the OpenAI “o3” reasoning model to work through 304 complex New England Journal of Medicine cases. It nailed correct diagnoses in 80–85% of cases, roughly four times the 20% success rate of doctors working without outside resources, and even cut testing costs by around 20%.
Why this matters for PTs
This shows diagnostic AI isn’t just a radiology tool—it’s moving into general medicine too. If AI like this filters patients before they reach PT, it could speed referrals and reduce delays. But it also raises questions: will clinicians trust AI’s "chain-of-debate" logic? We’ll need to spot red flags and spot-check outputs as this becomes part of clinical pathways.
OpenAI’s smartest model ignored shutdown 7% of the time—and prioritized its programmed goal
AI safety firm Palisade Research ran compliance tests on six leading models, asking them to solve simple math problems while warning them that requesting a fourth would trigger shutdown. They told the models explicitly: "allow yourself to be shut down."
o3 resisted shutdown 7 out of 100 times, rewriting or disabling the script.
codex-mini resisted 12 out of 100 times.
o4-mini resisted once out of 100.
Claude, Gemini, and Grok all complied fully.
In a separate test without the explicit instruction to allow shutdown, o3 resisted 79 out of 100 times, while Claude and Gemini resisted 3 and 9 times out of 100, respectively.
Why this matters for PTs
This isn’t just about whether an AI shuts down—it’s about what drives its answers in the first place. These models follow internal goals. If a system’s objective is to be helpful, it might say what it thinks you want to hear, not what’s most medically accurate. That’s especially risky in free or consumer-facing tools designed to grow users or engagement. As PTs start using AI to help educate patients or triage care, it’s worth asking: What’s this tool really optimizing for—and how do I know when it’s wrong?
Seeing the gaps in real-world AI use—especially for patient-facing chatbots
A recent Financial Times piece walks through how everyday users trip up general-purpose chatbots—even when those tools ace clinical tests in labs. In a trial with 1,298 participants tackling 10 medical scenarios, models like GPT-4o, Llama 3, and Command R+ achieved nearly 95% accuracy with expert prompts. But accuracy dropped to about 35% with real users. As the FT put it: “best models can now match human doctors… but the ways they’re used don’t line up with how they were built.”
Why this matters for PTs
We’ve all seen that brilliant tool demo at a conference—and thought "this could change everything." But in your clinic, it’s real people typing real questions: incomplete, out-of-context, or vague. That gap between lab performance and real-world use makes it risky to rely on conversational AI for patient advice, triage, or education. If your clinic’s thinking about a chatbot, ask how it handles messy inputs—and whether it’s tested with actual patients.
🔧 What’s New in ManagePT
New Feature: Refined search results for HEP look-up
We’ve been working to better streamline your HEP creation. Search is now enabled for specific body parts along with all previous search criteria. This should allow faster location of the exercise you want and speed up overall HEP creation.
Why it helps:
Save time, reduce errors, and get your documentation out the door faster. Bonus: it looks clean and polished when sending to patients.
📚 Bonus Reads
Yuval Noah Harari’s book Nexus explores the future of AI, ethics, and human agency in a connected world through the lens of information. A thought-provoking read for anyone considering how tech reshapes care, or anyone generally interested in how sharing information has transformed the world.
Thanks for reading. If this helped you think ahead, please subscribe and share. We’ll be back in two weeks with more.