    What Apple’s controversial research paper really tells us about LLMs

    By Techurz | June 17, 2025 | Updated: May 10, 2026 | 7 min read

    Generative AI models quickly proved capable of performing technical tasks well. Adding reasoning to those models seemed to unlock even more, enabling them to think through complex questions and produce higher-quality, more accurate responses. Or so we thought.

    Last week, Apple released a research report called “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.” As the title reveals, the 30-page paper dives into whether large reasoning models (LRMs), such as OpenAI’s o1 models, Anthropic’s Claude 3.7 Sonnet Thinking (which is the reasoning version of the base model, Claude 3.7 Sonnet), and DeepSeek R1, are capable of delivering the advanced “thinking” they advertise. 

    (Disclosure: Ziff Davis, ZDNET’s parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

    Apple carried out the investigation by designing a series of puzzle-based experiments that tested models beyond the scope of traditional math and coding benchmarks. The results showed that even the smartest models hit a point of diminishing returns: they scale up their reasoning effort as a problem gets more complex, but only up to a limit.

    I encourage you to read it if you are remotely interested in the subject. However, if you don’t have the time and just want the bigger themes, I unpack it for you below.  

    What are large reasoning models (LRMs)? 

    In the research paper, Apple uses “large reasoning models” when referring to what we would typically just call reasoning models. This type of large language model (LLM) was first popularized by the release of OpenAI’s o1 model, which was later followed by its release of o3. 

    The concept behind LRMs is simple. Humans are encouraged to think before they speak so that what they say is more valuable. Similarly, when a model is encouraged to spend more time working through a prompt, its answer quality should be higher, and that extra processing should let the model handle more complex prompts well.

    Methods such as “Chain-of-Thought” (CoT) also enable this extra thinking. CoT encourages an LLM to break down a complex problem into logical, smaller, and solvable steps. The model sometimes shares these reasoning steps with users, making the model more interpretable and allowing users to better steer its responses and identify errors in reasoning. The raw CoT is often kept private to prevent bad actors from seeing weaknesses, which could tell them exactly how to jailbreak a model. 

    This extra processing means these models require more compute power and are therefore more expensive or token-heavy, and take longer to return an answer. For that reason, they are not meant for broad, everyday tasks, but rather reserved for more complex or STEM-related tasks.  

    This also means that the benchmarks used to test these LRMs are typically related to math or coding, which is one of Apple’s first qualms in the paper. The company said that these benchmarks emphasize the final answer, focus less on the reasoning process, and are also subject to data contamination. As a result, Apple set up a new experiment paradigm.

    The experiments

    Apple set up four controllable puzzles: Tower of Hanoi, which involves transferring disks across pegs; Checker Jumping, which involves positioning and swapping checkers pieces; River Crossing, which involves getting shapes across a river; and Blocks World, which involves rearranging stacks of colored blocks.
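
    To see why a puzzle like Tower of Hanoi offers a controllable difficulty dial, here is a minimal Python sketch. It is not taken from the paper, and the function name and peg numbering are illustrative assumptions. The rules are identical for any number of disks, but the shortest solution has 2^n - 1 moves, so each added disk roughly doubles the work.

```python
def hanoi_moves(n, source=0, target=2, spare=1):
    """Return the shortest move list for n disks as (disk, from_peg, to_peg) tuples.

    Hypothetical helper for illustration; pegs are numbered 0, 1, 2.
    """
    if n == 0:
        return []
    # Move n-1 disks out of the way, move the largest disk, then restack.
    return (
        hanoi_moves(n - 1, source, spare, target)
        + [(n, source, target)]
        + hanoi_moves(n - 1, spare, target, source)
    )


if __name__ == "__main__":
    for n in range(1, 11):
        moves = hanoi_moves(n)
        # The optimal solution length is always 2**n - 1.
        assert len(moves) == 2**n - 1
        print(f"{n} disks -> {len(moves)} moves")
```

    The same recursive procedure solves every instance, which is the sense in which the logic stays constant while the problem size grows.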

    Understanding why the experiments were chosen is key to understanding the paper’s results. Apple chose puzzles to better understand the factors that influence what existing benchmarks identify as better performance. Specifically, the puzzles allow for a more “controlled” environment where the underlying reasoning stays the same even as the difficulty level is adjusted.

    “These environments allow for precise manipulation of problem complexity while maintaining consistent logical processes, enabling a more rigorous analysis of reasoning patterns and limitations,” the authors explained in the paper. 

    The experiments compared the “thinking” and “non-thinking” versions of popular models: Claude 3.7 Sonnet with and without thinking, and DeepSeek’s R1 and V3. The authors controlled the difficulty by increasing the problem size.

    The last important element of the setup is that all the models were given the same maximum token budget (64k). Then, 25 samples were generated with each model, and the average performance of each model across them was recorded. 
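
    As a rough illustration of this kind of scoring, the sketch below checks each sampled answer for one Tower of Hanoi instance by simulating its moves, then averages over the samples. It is an assumption about how such an evaluation could be set up, not Apple’s actual harness; the function names and the hand-written sample answers are hypothetical.

```python
def is_valid_hanoi_solution(n, moves):
    """Simulate (from_peg, to_peg) moves; True if all are legal and every disk ends on peg 2."""
    pegs = [list(range(n, 0, -1)), [], []]  # peg 0 starts with disks n..1, largest at the bottom
    for src, dst in moves:
        if src not in (0, 1, 2) or dst not in (0, 1, 2):
            return False  # unknown peg
        if not pegs[src]:
            return False  # nothing to move from the source peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # a larger disk may not sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))


def mean_accuracy(n, sampled_answers):
    """Fraction of sampled answers that solve the n-disk puzzle."""
    return sum(is_valid_hanoi_solution(n, a) for a in sampled_answers) / len(sampled_answers)


if __name__ == "__main__":
    correct = [(0, 1), (0, 2), (1, 2)]  # valid 2-disk solution
    wrong = [(0, 2), (0, 2)]            # second move puts disk 2 on top of disk 1
    samples = [correct] * 20 + [wrong] * 5  # 25 made-up samples, not real model output
    print(mean_accuracy(2, samples))
```

    Checking every move, rather than only the final state, is one way to score the reasoning path and not just the answer.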

    The results 

    The findings showed that thinking and non-thinking models each have advantages at different levels of complexity. In the first regime, where problem complexity is low, non-thinking models can perform as well as, if not better than, thinking models while being more time-efficient.

    The biggest advantage of thinking models lies in the second, medium-complexity regime, as the performance gap between thinking and non-thinking models widens significantly (illustrated in the figure above). Then, in the third regime, where problem complexity is the highest, the performance of both model types fell to zero. 

    “Results show that while thinking models delay this collapse, they also ultimately encounter the same fundamental limitations as their non-thinking counterparts,” said the authors. 

    They observed a similar collapse when testing five state-of-the-art thinking models on the same puzzles used in the first experiment: o3-mini (medium and high configurations), DeepSeek R1, DeepSeek R1 Qwen 32B, and Claude 3.7 Sonnet Thinking. As complexity grew, accuracy fell, eventually bottoming out at zero.

    Figure 6: Accuracy and thinking tokens vs. problem complexity for reasoning models across puzzle environments. As complexity increases, reasoning models initially spend more tokens while accuracy declines gradually, until a critical point where reasoning collapses—performance drops sharply and reasoning effort decreases.

    Even more interesting is the change in the number of thinking tokens used. Initially, as the puzzles grow in complexity, the models spend more tokens, in line with what the harder problems demand. However, as the models approach the point where their accuracy drops off, they also start reducing their reasoning effort, even though the problems are harder and would be expected to need more tokens.

    The paper identifies other shortcomings. For example, even when the prompt supplied the steps needed to solve the problem, thinking models still could not carry them out accurately, even though following given steps should be easier than devising them.

    What does this mean?

    The public’s perception of the paper has been split on what it really means for users. While some users have found comfort in the paper’s results, saying it shows that we are further from AGI than tech CEOs would have us believe, many experts have identified methodology issues. 

    The main objection is that the higher-complexity problems would need more tokens to solve than the 64k cap Apple gave each model. Others noted that some models that might have performed well, such as o3-mini and o4-mini, weren’t included in the experiment. One user even fed the paper to o3 and asked it to identify methodology issues. The model raised a few critiques, such as the token ceiling and statistical soundness, as seen below.
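
    The token-ceiling objection can be made concrete with a back-of-envelope calculation. The sketch below assumes a Tower of Hanoi answer must list all 2^n - 1 moves, that each written move costs a fixed number of tokens (the figures of 5 and 10 are hypothetical), and that “64k” means 64,000 tokens. It ignores the tokens spent on thinking, so real limits would be tighter.

```python
def max_disks_within_budget(token_budget, tokens_per_move):
    """Largest n whose full move list, (2**n - 1) moves, fits in the token budget.

    Illustrative helper; both arguments are assumptions, not figures from the paper.
    """
    n = 0
    while (2 ** (n + 1) - 1) * tokens_per_move <= token_budget:
        n += 1
    return n


if __name__ == "__main__":
    budget = 64_000  # assumed reading of the "64k" cap
    for tokens_per_move in (5, 10):
        n = max_disks_within_budget(budget, tokens_per_move)
        print(f"{tokens_per_move} tokens/move -> at most {n} disks")
```

    Under these assumptions, a complete answer stops fitting in the budget at around a dozen disks, no matter how well the model reasons.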

    I asked o3 to analyse and critique Apple’s new “LLMs can’t reason” paper. Despite its inability to reason I think it did a pretty decent job, don’t you? pic.twitter.com/jvwqt3NVrt

    — rohit (@krishnanrohit) June 9, 2025

    My interpretation: If you take the paper’s results at face value, the authors do not explicitly say that LRMs are not capable of reasoning or that it is not worth using them. Rather, the paper points out that there are some limitations to these models that could still be researched and iterated on in the future — a conclusion that holds true for most advancements in the AI space.

    The paper serves as yet another good reminder that none of these models are infallible, regardless of how advanced they claim to be or how well they perform on benchmarks. Evaluating an LLM by benchmark poses an array of issues in itself, as benchmarks often test only specific, higher-level tasks that don’t translate accurately into everyday applications of these models.
