
    Every AI model is flunking medicine – and LMArena proposes a fix

By Techurz | August 19, 2025 | 4 Mins Read


    johan63/iStock/Getty Images Plus via Getty Images

    ZDNET’s key takeaways

    • AI frontier models fail to provide safe and accurate output on medical topics.
    • LMArena and DataTecnica aim to ‘rigorously’ test LLMs’ medical knowledge.
    • It’s not clear how agents and medicine-specific LLMs will be measured.


    Despite the numerous AI advances in medicine cited throughout scholarly literature, all generative AI programs fail to produce output that is both safe and accurate when dealing with medical topics, according to a new report by benchmark firm LMArena. 

    The finding is especially concerning given that people are going to bots such as ChatGPT for medical answers, and research shows that people trust AI’s medical advice over the advice of doctors, even when it’s wrong.

    Also: Patients trust AI’s medical advice over doctors – even when it’s wrong, study finds

    The new study, comparing OpenAI’s GPT-5 with numerous models from Google, Anthropic, and Meta, finds that “performance in real-world biomedical research remains far from adequate.” 

    (Disclosure: Ziff Davis, ZDNET’s parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

    A knowledge gap in medicine

    “No current model reliably meets the reasoning and domain-specific knowledge demands of biomedical scientists,” according to the LMArena team.

    The report concludes that current models are simply too lax and too fuzzy to meet the standards of medicine:

    “This fundamental gap highlights the growing mismatch between general AI capabilities and the needs of specialized scientific communities. Biomedical researchers work at the intersection of complex, evolving knowledge and real-world impact. They don’t need models that ‘sound’ correct; they need tools that help uncover insights, reduce error, and accelerate the pace of discovery.”

    LMArena + DataTecnica

The study echoes findings from other medicine-related benchmark tests. For example, in May, OpenAI unveiled HealthBench, a suite of text prompts covering medical situations and conditions that a person seeking medical advice might reasonably submit to a chatbot. That study found that the best accuracy score, 0.598, achieved by OpenAI's o3 large language model, left ample room for improvement on the benchmark.

    Also: OpenAI’s HealthBench shows AI’s medical advice is improving – but who will listen?

    Expanding the benchmark

To address the gap between AI models and medicine, LMArena has teamed with startup DataTecnica, which earlier this year unveiled CARDBiomedBench, a question-and-answer benchmark for evaluating LLMs in biomedical research.

    Together, LMArena and DataTecnica plan to expand what’s called BiomedArena, a leaderboard that lets people compare AI models side by side and vote on which ones perform the best.
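Arena-style leaderboards of this kind typically aggregate head-to-head votes into a numeric rating for each model. As an illustration only — this is a generic Elo-style sketch, not LMArena's actual scoring code — one vote between two models can update ratings like so:

```python
# Illustrative Elo-style rating update for a pairwise-vote leaderboard.
# This sketches how arena-style sites generally aggregate votes; it is
# NOT LMArena's actual implementation, and the constants are arbitrary.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one head-to-head vote: the winner gains rating, the loser loses it."""
    ra = ratings.setdefault(winner, 1000.0)  # hypothetical starting rating
    rb = ratings.setdefault(loser, 1000.0)
    ea = expected_score(ra, rb)
    ratings[winner] = ra + k * (1 - ea)
    ratings[loser] = rb - k * (1 - ea)

ratings = {}
update(ratings, "model_a", "model_b")  # one user vote for model_a
```

With equal starting ratings, a single vote moves the winner to 1016 and the loser to 984; over many votes, the ratings converge toward each model's observed win rate.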

    Also: Meta’s Llama 4 ‘herd’ controversy and AI contamination, explained

Unlike general-purpose leaderboards, BiomedArena is meant to focus specifically on medical research rather than broad, general questions.

    The BiomedArena work is already used by scientists at the Intramural Research Program of the US National Institutes of Health, they note, “where scientists pursue high-risk, high-reward projects that are often beyond the scope of traditional academic research due to their scale, complexity, or resource demands.”

    The BiomedArena work, according to the LMArena team, will “focus on tasks and evaluation strategies grounded in the day-to-day realities of biomedical discovery — from interpreting experimental data and literature to assisting in hypothesis generation and clinical translation.”

    Also: You can track the top AI image generators via this new leaderboard – and vote for your favorite too

    As ZDNET’s Webb Wright reported in June, LMArena.ai ranks AI models. The website was originally founded as a research initiative through UC Berkeley under the name Chatbot Arena and has since become a full-fledged platform, with financial support from UC Berkeley, a16z, Sequoia Capital, and others.

    Where could they go wrong?

    Two big questions loom for this new benchmark effort.

First, studies with doctors have shown that gen AI's usefulness expands dramatically when AI models are hooked up to databases of "gold standard" medical information, with dedicated large language models (LLMs) able to outperform the top frontier models simply by tapping into that information.

    Also: Hooking up generative AI to medical data improved usefulness for doctors

    From today’s announcement, it’s not clear how LMArena and DataTecnica plan to address that aspect of AI models, which really is a kind of agentic capability — the ability to tap into resources. Without measuring how AI models use external resources, the benchmark could have limited utility.

    Second, numerous medicine-specific LLMs are being developed all the time, including Google’s “MedPaLM” program developed two years ago. It’s not clear if the BiomedArena work will take into account these dedicated medicine LLMs. The work so far has tested only general frontier models. 

    Also: Google’s MedPaLM emphasizes human clinicians in medical AI

    That’s a perfectly valid choice on the part of LMArena and DataTecnica, but it does leave out a whole lot of important effort.
