Can AI outdiagnose doctors? Microsoft's tool is 4 times better for complex cases

Research on AI for medicine looks increasingly promising: the tech already speeds up drug development, Google is using AI to improve its medical advice, and wearable companies are leveraging the technology for predictive health features. Now, Microsoft is the latest to raise the bar.

On Monday, the company announced in a blog post that Microsoft AI Diagnostic Orchestrator (MAI-DxO), its medical AI system, correctly diagnosed 85% of a set of case studies from the New England Journal of Medicine (NEJM). That rate is more than four times that of the human physicians in Microsoft's comparison. NEJM cases are particularly complex and often require several specialists.

Given how inaccessible, complex, and confusing healthcare systems continue to be, it’s no surprise people are seeking help from technology wherever possible.

“Across Microsoft’s AI consumer products like Bing and Copilot, we see over 50 million health-related sessions every day,” Microsoft said in the announcement. “From a first-time knee-pain query to a late-night search for an urgent-care clinic, search engines and AI companions are quickly becoming the new front line in healthcare.”

How it works

Human physicians must pass the US Medical Licensing Examination (USMLE) to practice medicine, a test that’s also used to evaluate how AI systems perform in medical contexts, both model-to-model and when compared with humans.

Currently, AI scores well on the USMLE — a side effect, Microsoft said, of the models memorizing (rather than understanding) answers to multiple-choice questions, which won’t produce the most sound medical analysis. Most industry-standard AI benchmarks have been saturated for a while, meaning AI models are evolving too quickly for the tests to be usefully challenging.

To combat this issue, Microsoft created the Sequential Diagnosis Benchmark (SD Bench). Sequential diagnosis is a process real clinicians use to diagnose patients by beginning with how their symptoms present and proceeding with questions and tests from there. The benchmark turns 304 NEJM cases into diagnostic challenges in which humans and AI models start from the initial presentation, then ask questions and request tests to work toward a diagnosis.
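
To make the idea of sequential diagnosis concrete, here is a minimal Python sketch of the loop described above. It is an illustration only, not Microsoft's implementation: the `Case`, `Gatekeeper`, and `run_sequential_diagnosis` names, the toy case, and the hard-coded `toy_policy` are all hypothetical and invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    """A toy case: only the presentation is visible at the start (hypothetical)."""
    presentation: str
    hidden_findings: dict  # request -> result, revealed only on demand
    diagnosis: str


@dataclass
class Gatekeeper:
    """Reveals a finding only when it is explicitly requested."""
    case: Case
    log: list = field(default_factory=list)

    def ask(self, request: str) -> str:
        self.log.append(request)
        return self.case.hidden_findings.get(request, "not available")


def run_sequential_diagnosis(case, choose_next, max_steps=5):
    """Loop: pick a question or test, see the answer, repeat until a diagnosis."""
    gate = Gatekeeper(case)
    known = {"presentation": case.presentation}
    for _ in range(max_steps):
        action = choose_next(known)
        if action.startswith("diagnose:"):
            guess = action.split(":", 1)[1]
            return guess == case.diagnosis, gate.log
        known[action] = gate.ask(action)
    return False, gate.log  # ran out of steps without committing


def toy_policy(known):
    """Hard-coded stand-in for a clinician or model choosing the next step."""
    if "travel history" not in known:
        return "travel history"
    if "blood smear" not in known:
        return "blood smear"
    if "parasites" in known["blood smear"]:
        return "diagnose:malaria"
    return "diagnose:unknown"


# Invented example data, not taken from any NEJM case.
toy_case = Case(
    presentation="fever and chills",
    hidden_findings={
        "travel history": "recent trip to a malaria-endemic region",
        "blood smear": "parasites seen",
    },
    diagnosis="malaria",
)

correct, requests = run_sequential_diagnosis(toy_case, toy_policy)
print(correct, requests)
```

The point of the structure is that the diagnostician never sees the full case up front; every piece of information has to be asked for, which is what separates this setup from a multiple-choice exam.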

Microsoft then paired the diagnostic agent, MAI-DxO, with several frontier models, including GPT, Llama, Claude, Gemini, Grok, and DeepSeek, and put the agent to the SD Bench test. MAI-DxO turns whatever LLM it is using into a “virtual panel of physicians with diverse diagnostic approaches collaborating to solve diagnostic cases,” Microsoft explained.

In a video demo, MAI-DxO shows its reasoning as it queries the benchmark, develops possible diagnoses, and tracks the cost of each requested test. As the agent receives the information it requests about the case, it revises its diagnoses and asks for different scans, displaying a diagnostic process much more familiar to human physicians.

Correct diagnoses that cost less

“MAI-DxO boosted the diagnostic performance of every model we tested,” said Microsoft’s blog post, noting that the system performed best when paired with OpenAI’s o3 model. The company compared the results to those of 21 physicians from the UK and the US with experience ranging from five to 20 years, who reached a mean accuracy of just 20%.

Microsoft noted that MAI-DxO is also configurable, meaning it can run within cost limitations set by a user or organization. That lets the agent run a cost-benefit analysis of certain tests, which is highly relevant given the astronomical pricing of US medical care and is something human doctors and patients have to consider as well.
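
As a rough illustration of what running within a cost limit could look like, the Python sketch below greedily picks the tests with the best usefulness per dollar until a budget is used up. This is an assumption for illustration, not a description of how MAI-DxO weighs costs: the `pick_tests` function, the usefulness scores, and the prices are all hypothetical.

```python
def pick_tests(candidates, budget):
    """Greedy sketch: take tests with the highest usefulness per dollar
    for as long as they still fit within the budget."""
    chosen, spent = [], 0.0
    ranked = sorted(candidates, key=lambda t: t[1] / t[2], reverse=True)
    for name, usefulness, cost in ranked:
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen, spent


# (name, hypothetical usefulness score, hypothetical cost in dollars)
candidates = [
    ("blood panel", 0.30, 50.0),
    ("chest x-ray", 0.25, 100.0),
    ("full-body MRI", 0.40, 2000.0),
]

print(pick_tests(candidates, budget=200.0))
print(pick_tests(candidates, budget=5000.0))
```

With the tighter budget the expensive scan is skipped even though it has the highest raw usefulness score; with the looser budget all three tests are ordered.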

This feature is also a guardrail, of sorts — without it, the AI might “default to ordering every possible test — regardless of cost, patient discomfort, or delays in care,” the blog post explained. MAI-DxO also returned higher accuracy and lower costs than individual models or human physicians.

Will AI replace your doctor?

Probably not anytime soon, though Microsoft’s blog post noted that because of its breadth of knowledge, AI can demonstrate “clinical reasoning capabilities that, across many aspects of clinical reasoning, exceed those of any individual physician.”

The company believes systems like this one can “reshape healthcare” by giving patients the option to check themselves reliably and by helping doctors with complex cases. The cost savings would be another plus for an industry constantly plagued by inexplicably high costs and opaque pricing structures.

Microsoft conceded that MAI-DxO has only been tested on these special cases, so it’s unclear how it would handle everyday tasks. However, this issue may not be relevant anyway if the agent isn’t intended to replace human doctors, which Microsoft also maintained in the blog post.

MAI-DxO is part of a “dedicated consumer health effort” Microsoft AI initiated last year, the company said in the release. Other AI products within that initiative include RAD-DINO, a radiology workflow tool, and Microsoft Dragon Copilot, a voice AI assistant designed for medical professionals.
