    LLM Benchmarking: Surprising Task Complexity Gains

    By Techurz, July 2, 2025 (updated May 10, 2026)


    The main purpose of many large language models (LLMs) is to produce compelling text that is as close as possible to indistinguishable from human writing. And therein lies a major reason why it’s so hard to gauge the relative performance of LLMs using traditional benchmarks: quality of writing doesn’t necessarily correlate with the metrics traditionally used to measure processor performance, such as instruction execution rate.


    But researchers at the Berkeley, Calif., think tank METR (for Model Evaluation & Threat Research) have come up with an ingenious idea. First, identify a series of tasks of varying complexity and record the average time it takes a group of humans to complete each task. Then have various versions of LLMs attempt the same tasks, noting the cases in which a given version completes a task with some level of reliability, say 50 percent of the time. Plots of the resulting data confirm that as time goes on, successive generations of an LLM can reliably complete longer and longer (more and more complex) tasks.
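To make the measurement concrete, here is a minimal Python sketch of how a 50 percent horizon could be read off such data. It is an illustration under stated assumptions, not METR's actual procedure: the function names (`success_rate`, `horizon_at`), the toy numbers, and the choice of linear interpolation in log time between neighboring tasks are all hypothetical.

```python
import math

def success_rate(outcomes):
    """Fraction of repeated runs (1 = success, 0 = failure) that succeeded."""
    return sum(outcomes) / len(outcomes)

def horizon_at(tasks, reliability=0.5):
    """Estimate the task length (in human minutes) where success drops to `reliability`.

    `tasks` is a list of (human_minutes, outcomes) pairs. The estimate
    interpolates linearly in log2(human_minutes) between the first pair of
    neighboring tasks whose success rates straddle the threshold.
    Returns None if the success rate never crosses the threshold.
    """
    points = sorted((minutes, success_rate(runs)) for minutes, runs in tasks)
    for (t_lo, p_lo), (t_hi, p_hi) in zip(points, points[1:]):
        if p_lo >= reliability > p_hi:
            frac = (p_lo - reliability) / (p_lo - p_hi)
            log_t = math.log2(t_lo) + frac * (math.log2(t_hi) - math.log2(t_lo))
            return 2 ** log_t
    return None

# Made-up data for illustration only: (average human minutes, model run results).
toy_tasks = [
    (2,   [1, 1, 1, 1]),
    (8,   [1, 1, 1, 1]),
    (32,  [1, 1, 1, 0]),
    (128, [1, 0, 0, 0]),
    (512, [0, 0, 0, 0]),
]

print(horizon_at(toy_tasks))  # 64.0
```

In this toy data the success rate falls from 75 percent on the 32-minute task to 25 percent on the 128-minute task, so the interpolated 50 percent horizon lands halfway between them on a log scale, at 64 minutes. Real data would be noisier, and a fitted curve would likely be a better choice than interpolation.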

    No surprise there. But the shock was that this improvement in the ability of LLMs to reliably complete harder tasks has been exponential, with a doubling period of about seven months.
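As a rough illustration of what a seven-month doubling period implies, the following Python sketch projects a task horizon forward in time. The function names and the one-hour starting value in the usage line are hypothetical; only the seven-month figure comes from the article.

```python
def projected_horizon(current_hours, months_ahead, doubling_months=7.0):
    """Task horizon after `months_ahead` months, assuming steady exponential growth."""
    return current_hours * 2 ** (months_ahead / doubling_months)

def yearly_growth_factor(doubling_months=7.0):
    """How many times longer the horizon gets over 12 months."""
    return 2 ** (12 / doubling_months)

# Hypothetical starting point of 1 hour, used only to show the shape of the curve.
print(projected_horizon(1.0, 14))          # 4.0 (two doublings)
print(round(yearly_growth_factor(), 2))    # 3.28
```

Under this assumption the horizon more than triples every year, which is why even a short extrapolation produces large numbers.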

    IEEE Spectrum reached out to Megan Kinniment, one of the authors of a METR research paper describing this work and its surprising implications.

    Table of contents
    1 Evaluating LLM Performance Metrics
    2 What Exponential Growth in AI Means for Humanity
    3 Catastrophic Risks from Advanced AI

    Evaluating LLM Performance Metrics

    Did you suspect that you’d get these results?

    Megan Kinniment: I, at least personally, didn’t expect us to have quite as clear an exponential as we did. Models have definitely been getting better quickly, though. So some fast rate of progress wasn’t entirely unexpected.

    As you point out in the paper, it’s always dangerous to look into the future and extrapolate. However, you suggest that there is a likelihood of this continuing, which means that by 2030 we’ll be looking at monthlong tasks being within the capability of the most advanced large language models.

    Kinniment: Let’s have a look at that. By one month, we mean around 167 working hours, so the number of [human] working hours in a month. And that’s at 50 percent reliability. But longer tasks typically seem to require higher reliability to actually be useful. So that’s something that could make the in-practice, real-world, economic impacts not be as intense as what is predicted.
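The arithmetic behind such a forecast can be sketched in a few lines of Python. This is a back-of-the-envelope calculation, not METR's forecast: the one-hour starting horizon is a hypothetical placeholder, and the function name `months_to_reach` is invented for illustration. The 167-hour target, the 50 percent reliability level, and the seven-month doubling period are the figures quoted in the article.

```python
import math

def months_to_reach(target_hours, current_hours, doubling_months=7.0):
    """Months until the horizon grows from current_hours to target_hours,
    assuming it keeps doubling every `doubling_months` months."""
    doublings_needed = math.log2(target_hours / current_hours)
    return doublings_needed * doubling_months

# Hypothetical: a model whose 50 percent horizon is 1 hour today.
months = months_to_reach(target_hours=167, current_hours=1.0)
print(round(months))  # roughly 52 months, a bit over four years
```

The result is sensitive to the assumed starting horizon and says nothing about reliability above 50 percent, which is the caveat Kinniment raises.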

    There are a number of things that would have to continue for this prediction to come true. Hardware would have to continue improving at roughly the rate it’s improving; software would have to keep improving. You would have to have sufficient training data and availability of that training data to continue training at the breathtaking clip that’s been occurring in recent years.

    Kinniment: The forecasts and the dates that we’ve found are just extrapolating the trend that we see on our task suite. [The trends are] not taking into account real-world factors or compute-scaling changes.

    If a large language model could somehow achieve the ability to complete 167-hour type tasks with 50 percent reliability, what are the kinds of things that that now puts in the realm of capability for a large language model?

    Kinniment: Well, the big one that we often think about is accelerating AI R&D research itself. To the extent that you can make models that accelerate your company’s ability to make better models, you could end up in a situation where AI capabilities develop really quite rapidly.

    What Exponential Growth in AI Means for Humanity

    What you are describing is reminiscent of the idea of the singularity, where you have AIs creating other AIs on their own, not assisted by human beings.

    Kinniment: I think that you could get acceleration that is quite intense and does make things meaningfully more difficult to control without it necessarily resulting in this massively explosive growth. There are reasons to think that you might have various bottlenecks that slow things down in practice. Even if it were the case that we had very, very clever AIs, this pace of progress could still end up bottlenecked on things like hardware and robotics. But yeah, the singularity is for sure an idea that is relevant to this whole sector of things.

    Things could go quite quickly, but it’s not like it’s the singularity or nothing. [AI-development rates] that were mild compared to a singularity could still be quite intense for how the world needs to adapt.

    You indicated in the paper that some large language models seem to be improving in their ability to adapt and improve from mistakes.

    Kinniment: I think it’s actually been a relatively gradual thing since ChatGPT, and potentially before that. They’re less likely to get stuck. They’re a bit better at changing strategies when things aren’t working, but that’s a bit hit or miss. And they’re definitely a lot better at doing things than they used to be and better at using tools. But it does seem like there’s some fundamental aspects that haven’t changed a great deal. One thing that I like to look at when I get a new model is, on each task, we give the model a number of tokens, a number of words that it can say. And if you could imagine giving them more and more time or more and more tokens to do a task, how does that affect how likely they are to succeed? And basically, what we see is they plateau quite strongly. There’s a point at which you give them more tokens and it doesn’t really help. And for each new model, that plateau gets a bit higher.

    Megan Kinniment was on the team at METR that published the results of a study of LLM performance. (Image credit: Megan Kinniment)

    Humans, I imagine, also have diminishing returns. But if you give a human lots and lots of time to do something, they’ll probably do a better job, especially if you have multiple humans. And I think I’d be pretty impressed with a large language model that, even if its absolute score was lower, seemed like it could just keep doing things and improving. That could be a big deal.

    You found that models performed worse on tasks that had higher “messiness” scores. Was there any signal that you got out of the data that this state of affairs might be changing? In other words, that models might be gaining greater ability to handle tasks that had higher messiness?

    Kinniment: Messiness was a measure that I made to try and get a somewhat quantitative measure of how unrealistic our tasks were compared to the real world. And most of our tasks aren’t that messy. It’s a 16-point scale. The mean is about 3, and the most messy tasks are about 8 out of 16.

    So what would a 16 task be in terms of messiness?

    Kinniment: Something like espionage, where you have a lot of resource limitations. It’s very punishing. You have agents that are optimizing against you actively. It’s easy to mess up. It’s novel.

    Are you all planning to follow up this study?

    Kinniment: OpenAI published o3, and o3 was a little bit more capable than anticipated given the trend. So we are doing some amount of follow-up in terms of measuring other models. We do want to keep focused on informing the world about AI development and catastrophic risks from AI systems.

    Catastrophic Risks from Advanced AI

    What are the most likely catastrophic risks from AI? I mean, the ones that come to my mind are massive dislocations in employment if and when AI becomes supremely capable.

    Kinniment: When we’re talking about catastrophic risks, we’re not just talking about mass unemployment. We’re talking about things that are more like this: if everybody became unemployed or you just didn’t need human workers for the vast majority of things, you might not need human workers to maintain your military, or much fewer humans. That could make it easier for somebody to perform a coup, essentially. Or, if you have a vast quantity of geniuses in a data center, then that would make you a very powerful person. If you use that to produce military hardware, it’s possible we could get a concentration of power, and you might not have a democratic state anymore.

    All this would happen, obviously, without any form of consciousness. These would be machines that would have the capability to scheme and plot and plan, but without the kind of consciousness that characterizes human ability to do this. Consciousness isn’t necessary for this.

    Kinniment: Consciousness is a hard problem. I’m not sure if consciousness is necessary for any particular behavior. It feels a bit above my pay grade. I also think it’s not crazy that they could be conscious at this point. They would be very intelligent.

    So you think it’s possible that they may be conscious at some point in the future?

    Kinniment: I mean, if they’re as intelligent as you and I, then it doesn’t seem quite crazy. It doesn’t seem crazy for them to not be, and it doesn’t seem crazy for them to be.
