    Every AI model is flunking medicine – and LMArena proposes a fix

    By Techurz | August 19, 2025 | 4 Mins Read


    johan63/iStock/Getty Images Plus via Getty Images

    ZDNET’s key takeaways

    • AI frontier models fail to provide safe and accurate output on medical topics.
    • LMArena and DataTecnica aim to ‘rigorously’ test LLMs’ medical knowledge.
    • It’s not clear how agents and medicine-specific LLMs will be measured.

    Despite the numerous AI advances in medicine cited throughout scholarly literature, all generative AI programs fail to produce output that is both safe and accurate when dealing with medical topics, according to a new report by benchmark firm LMArena. 

    The finding is especially concerning given that people are going to bots such as ChatGPT for medical answers, and research shows that people trust AI’s medical advice over the advice of doctors, even when it’s wrong.

    The new study, comparing OpenAI’s GPT-5 with numerous models from Google, Anthropic, and Meta, finds that “performance in real-world biomedical research remains far from adequate.” 

    (Disclosure: Ziff Davis, ZDNET’s parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

    A knowledge gap in medicine

    “No current model reliably meets the reasoning and domain-specific knowledge demands of biomedical scientists,” according to the LMArena team.

    The report concludes that current models are simply too lax and too fuzzy to meet the standards of medicine:

    “This fundamental gap highlights the growing mismatch between general AI capabilities and the needs of specialized scientific communities. Biomedical researchers work at the intersection of complex, evolving knowledge and real-world impact. They don’t need models that ‘sound’ correct; they need tools that help uncover insights, reduce error, and accelerate the pace of discovery.”

    The study echoes findings from other benchmark tests related to medicine. For example, in May, OpenAI unveiled HealthBench, a suite of text prompts concerning medical situations and conditions that could reasonably be submitted to a chatbot by a person seeking medical advice. That study found that the best score, 0.598, achieved by OpenAI’s o3 large language model, left ample room for improvement on the benchmark.

    Expanding the benchmark

    To address the gap between AI models and medicine, LMArena has teamed with startup DataTecnica, which earlier this year unveiled CARDBiomedBench, a question-and-answer benchmark suite for evaluating LLMs in biomedical research.

    Together, LMArena and DataTecnica plan to expand what’s called BiomedArena, a leaderboard that lets people compare AI models side by side and vote on which ones perform the best.
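
    The announcement does not say how BiomedArena turns side-by-side votes into rankings. As a rough illustration only, the sketch below tallies pairwise votes into a per-model win rate and sorts by it. The model names, the vote data, and the function names `win_rates` and `leaderboard` are all hypothetical, and a production leaderboard would likely use a more robust rating method.

```python
from collections import defaultdict


def win_rates(votes):
    """Compute each model's share of comparisons won.

    votes: iterable of (winner, loser) model-name pairs, one per vote.
    This simple tally is an assumption for illustration, not the
    method BiomedArena is known to use.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {model: wins[model] / games[model] for model in games}


def leaderboard(votes):
    """Return (model, win_rate) pairs, best first; ties break by name."""
    rates = win_rates(votes)
    return sorted(rates.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical votes: each tuple is (preferred model, other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_c", "model_a"),
]

for model, rate in leaderboard(votes):
    print(f"{model}: {rate:.2f}")
```

    A raw win rate ignores how strong each opponent was, which is one reason a real leaderboard might prefer a rating model over a simple tally.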

    Unlike general-purpose leaderboards, BiomedArena is meant to focus on medical research rather than on very general questions.

    The BiomedArena work is already used by scientists at the Intramural Research Program of the US National Institutes of Health, the LMArena team notes, “where scientists pursue high-risk, high-reward projects that are often beyond the scope of traditional academic research due to their scale, complexity, or resource demands.”

    The BiomedArena work, according to the LMArena team, will “focus on tasks and evaluation strategies grounded in the day-to-day realities of biomedical discovery — from interpreting experimental data and literature to assisting in hypothesis generation and clinical translation.”

    As ZDNET’s Webb Wright reported in June, LMArena.ai ranks AI models. The website was originally founded as a research initiative through UC Berkeley under the name Chatbot Arena and has since become a full-fledged platform, with financial support from UC Berkeley, a16z, Sequoia Capital, and others.

    Where could they go wrong?

    Two big questions loom for this new benchmark effort.

    First, studies with doctors have shown that gen AI’s usefulness expands dramatically when AI models are hooked up to databases of “gold standard” medical information, with dedicated large language models (LLMs) able to outperform the top frontier models just by tapping into that information.

    From today’s announcement, it’s not clear how LMArena and DataTecnica plan to address that aspect of AI models, which really is a kind of agentic capability — the ability to tap into resources. Without measuring how AI models use external resources, the benchmark could have limited utility.

    Second, numerous medicine-specific LLMs are being developed all the time, including Google’s “MedPaLM” program developed two years ago. It’s not clear if the BiomedArena work will take into account these dedicated medicine LLMs. The work so far has tested only general frontier models. 

    That’s a perfectly valid choice on the part of LMArena and DataTecnica, but it does leave out a whole lot of important effort.
