
    Every AI model is flunking medicine – and LMArena proposes a fix

By Techurz | August 19, 2025 | 4 Mins Read


    johan63/iStock/Getty Images Plus via Getty Images

    ZDNET’s key takeaways

    • AI frontier models fail to provide safe and accurate output on medical topics.
    • LMArena and DataTecnica aim to ‘rigorously’ test LLMs’ medical knowledge.
    • It’s not clear how agents and medicine-specific LLMs will be measured.


    Despite the numerous AI advances in medicine cited throughout scholarly literature, all generative AI programs fail to produce output that is both safe and accurate when dealing with medical topics, according to a new report by benchmark firm LMArena. 

    The finding is especially concerning given that people are going to bots such as ChatGPT for medical answers, and research shows that people trust AI’s medical advice over the advice of doctors, even when it’s wrong.

    Also: Patients trust AI’s medical advice over doctors – even when it’s wrong, study finds

    The new study, comparing OpenAI’s GPT-5 with numerous models from Google, Anthropic, and Meta, finds that “performance in real-world biomedical research remains far from adequate.” 

    (Disclosure: Ziff Davis, ZDNET’s parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

    A knowledge gap in medicine

    “No current model reliably meets the reasoning and domain-specific knowledge demands of biomedical scientists,” according to the LMArena team.

    The report concludes that current models are simply too lax and too fuzzy to meet the standards of medicine:

    “This fundamental gap highlights the growing mismatch between general AI capabilities and the needs of specialized scientific communities. Biomedical researchers work at the intersection of complex, evolving knowledge and real-world impact. They don’t need models that ‘sound’ correct; they need tools that help uncover insights, reduce error, and accelerate the pace of discovery.”

    LMArena + DataTecnica

The study echoes findings from other medicine-related benchmark tests. For example, in May, OpenAI unveiled HealthBench, a suite of text prompts covering medical situations and conditions that a person seeking medical advice might reasonably submit to a chatbot. That study found that the best accuracy score, 0.598, achieved by OpenAI's o3 large language model, left ample room for improvement on the benchmark.

    Also: OpenAI’s HealthBench shows AI’s medical advice is improving – but who will listen?

    Expanding the benchmark

To address the gap between AI models and medicine, LMArena has teamed with startup DataTecnica, which earlier this year unveiled CARDBiomedBench, a question-and-answer benchmark for evaluating LLMs in biomedical research.

    Together, LMArena and DataTecnica plan to expand what’s called BiomedArena, a leaderboard that lets people compare AI models side by side and vote on which ones perform the best.
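Arena-style leaderboards of this kind typically aggregate head-to-head votes into a numeric rating for each model. As an illustration only — this is a generic Elo-style sketch, not LMArena's actual scoring code — one vote between two models can update ratings like so:

```python
# Illustrative Elo-style rating update for a pairwise-vote leaderboard.
# This sketches how arena-style sites generally aggregate votes; it is
# NOT LMArena's actual implementation, and the constants are arbitrary.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one head-to-head vote: the winner gains rating, the loser loses it."""
    ra = ratings.setdefault(winner, 1000.0)  # hypothetical starting rating
    rb = ratings.setdefault(loser, 1000.0)
    ea = expected_score(ra, rb)
    ratings[winner] = ra + k * (1 - ea)
    ratings[loser] = rb - k * (1 - ea)

ratings = {}
update(ratings, "model_a", "model_b")  # one user vote for model_a
```

With equal starting ratings, a single vote moves the winner to 1016 and the loser to 984; over many votes, the ratings converge toward each model's observed win rate.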

    Also: Meta’s Llama 4 ‘herd’ controversy and AI contamination, explained

Unlike general-purpose leaderboards, BiomedArena is meant to focus specifically on medical research rather than broad, general questions.

    The BiomedArena work is already used by scientists at the Intramural Research Program of the US National Institutes of Health, they note, “where scientists pursue high-risk, high-reward projects that are often beyond the scope of traditional academic research due to their scale, complexity, or resource demands.”

    The BiomedArena work, according to the LMArena team, will “focus on tasks and evaluation strategies grounded in the day-to-day realities of biomedical discovery — from interpreting experimental data and literature to assisting in hypothesis generation and clinical translation.”

    Also: You can track the top AI image generators via this new leaderboard – and vote for your favorite too

    As ZDNET’s Webb Wright reported in June, LMArena.ai ranks AI models. The website was originally founded as a research initiative through UC Berkeley under the name Chatbot Arena and has since become a full-fledged platform, with financial support from UC Berkeley, a16z, Sequoia Capital, and others.

    Where could they go wrong?

    Two big questions loom for this new benchmark effort.

First, studies with doctors have shown that gen AI's usefulness expands dramatically when AI models are hooked up to databases of "gold standard" medical information, with dedicated large language models (LLMs) able to outperform the top frontier models simply by tapping into that information.

    Also: Hooking up generative AI to medical data improved usefulness for doctors

    From today’s announcement, it’s not clear how LMArena and DataTecnica plan to address that aspect of AI models, which really is a kind of agentic capability — the ability to tap into resources. Without measuring how AI models use external resources, the benchmark could have limited utility.

    Second, numerous medicine-specific LLMs are being developed all the time, including Google’s “MedPaLM” program developed two years ago. It’s not clear if the BiomedArena work will take into account these dedicated medicine LLMs. The work so far has tested only general frontier models. 

    Also: Google’s MedPaLM emphasizes human clinicians in medical AI

    That’s a perfectly valid choice on the part of LMArena and DataTecnica, but it does leave out a whole lot of important effort.
