
Apple's Research Reveals the Limits of AI Reasoning Models


For the past year, the AI industry has been captivated by a new frontier: reasoning models. Led by OpenAI's powerful "o-series" and Google's Gemini, these models promised to do more than just generate text; they could "think," breaking down complex problems and arriving at more accurate answers. But a recent research paper from Apple suggests this revolution might be built on an "illusion." The paper, which found that today's best reasoning models experience a "complete accuracy collapse" when faced with novel, complex puzzles, is sending a quiet shockwave through an industry built on the premise of exponential progress.

What are these "reasoning models" supposed to do?

The leap from standard large language models (LLMs) to large reasoning models (LRMs) was marked by the introduction of techniques like "Chain-of-Thought" (CoT). Instead of answering a query instantly, models like OpenAI's o-series, Anthropic's Claude Thinking, and DeepSeek's R1 are encouraged to generate an internal monologue or "thinking process" before producing a final answer.

The theory is simple: more thinking leads to better answers. This extra compute time allows the model to explore different paths, self-verify its steps, and avoid the kind of "hallucinations" that plague simpler models. The results were initially stunning, with models solving complex math and coding problems that were previously out of reach. But this power comes at a steep price. These models require vastly more compute power for inference, making them significantly more expensive to run. As noted in a McKinsey report, inference on OpenAI's o1 model costs six times more than on its non-reasoning counterpart, GPT-4o.
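To make the distinction concrete, here is a minimal sketch of the difference between a direct prompt and a chain-of-thought-style prompt. The `call_model` function and the example question are hypothetical placeholders for whichever chat-completion API you use; commercial reasoning models build this behavior in through training rather than relying on prompting alone, but the basic idea is the same.

```python
# Minimal sketch contrasting a direct prompt with a chain-of-thought prompt.
# `call_model` is a hypothetical placeholder for any chat-completion API.

def call_model(prompt: str) -> str:
    """Placeholder for a model call (e.g. an HTTP request to a hosted LLM)."""
    raise NotImplementedError("wire up your model provider here")

question = "A train travels 120 km in 1.5 hours. What is its average speed?"

# Standard LLM usage: ask for the answer directly.
direct_prompt = f"{question}\nAnswer with a single number."

# Reasoning-style usage: ask the model to write out intermediate steps
# (its 'thinking') before committing to a final answer.
cot_prompt = (
    f"{question}\n"
    "Think step by step, showing your reasoning, then give the final "
    "answer on a line starting with 'Answer:'."
)

# answer = call_model(cot_prompt)  # the extra 'thinking' tokens are what drive up inference cost
```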

Why did Apple use puzzles to test AI?

Apple's researchers argued that standard AI benchmarks, which are often focused on math and coding problems, have a critical flaw: data contamination. It's difficult to know if a model is genuinely "reasoning" or if it has simply seen the problem or a similar one in its vast training data. To get around this, Apple created a series of controllable puzzles (like Tower of Hanoi and Blocks World) where the underlying logic is consistent, but the complexity can be precisely increased. This allowed them to test the models' pure problem-solving ability in an environment they had never seen before.
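To see why these puzzles make good test beds, consider the snippet below: an illustrative sketch, not the paper's evaluation harness, of a standard recursive Tower of Hanoi solver. A single parameter, the number of disks, controls difficulty. The rules never change, but the optimal solution length grows as 2^n − 1, so complexity can be dialed up precisely.

```python
# Standard recursive Tower of Hanoi solver. The rules never change, but the
# number of disks is a single knob that precisely controls problem complexity.

def hanoi(n: int, source: str = "A", target: str = "C", spare: str = "B") -> list[tuple[str, str]]:
    """Return the optimal move sequence for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    moves = hanoi(n - 1, source, spare, target)   # park n-1 disks on the spare peg
    moves.append((source, target))                # move the largest disk
    moves += hanoi(n - 1, spare, target, source)  # stack the n-1 disks back on top
    return moves

# The optimal solution has 2**n - 1 moves, so each extra disk roughly doubles
# the length of a correct answer -- an easily tunable measure of difficulty.
for n in range(1, 11):
    assert len(hanoi(n)) == 2 ** n - 1
    print(f"{n} disks -> {2 ** n - 1} moves")
```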

What did Apple's research find?

The results were stark. While the reasoning models performed well on problems of low and medium complexity, their performance fell off a cliff as the puzzles became harder. Apple's paper describes this as a "complete accuracy collapse," where even the most advanced models failed entirely beyond a certain complexity threshold.

Perhaps more surprisingly, the research uncovered a "counter-intuitive scaling limit." As problems got progressively harder, the models didn't try harder. Instead, after a certain point, their reasoning effort, measured by the number of "thinking" tokens they generated, actually declined, despite having an adequate token budget to continue. It's as if the models recognized a problem was too hard and simply gave up. Here's a key passage from the research paper's abstract:

Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counter-intuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget.
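Measuring "reasoning effort" here essentially means counting the tokens in the model's thinking trace at each complexity level. A harness for that kind of sweep might look like the sketch below; `solve_puzzle` is a hypothetical placeholder for a model call plus an answer checker, not code from the paper, and no real results are included.

```python
# Illustrative harness (not the paper's code): sweep puzzle complexity and
# record both accuracy and the number of 'thinking' tokens spent per instance.

def solve_puzzle(num_disks: int) -> tuple[bool, int]:
    """Hypothetical placeholder: prompt a reasoning model with a Tower of Hanoi
    instance of the given size, verify the returned move sequence against the
    rules, and count the tokens in the model's thinking trace."""
    raise NotImplementedError("wire up a reasoning-model API and a move checker")

results = []
for num_disks in range(1, 15):                    # progressively harder instances
    correct, thinking_tokens = solve_puzzle(num_disks)
    results.append((num_disks, correct, thinking_tokens))

# Apple's finding: beyond a certain size, accuracy collapses to zero and the
# thinking-token count starts to fall rather than rise, even though the
# models still have token budget left to keep reasoning.
```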

In addition, the research found that LRMs do not necessarily outperform standard LLMs, despite the additional "thinking" involved:

By comparing LRMs with their standard LLM counterparts under equivalent inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles.

This suggests a fundamental limitation in the current approach to AI reasoning, challenging the "more compute equals more intelligence" scaling law that has driven the industry.

Is the AGI bubble starting to burst?

Apple's findings add to a growing body of evidence that the race to Artificial General Intelligence (AGI) is hitting a wall. This isn't a new concern. As far back as late 2024, reports indicated that major AI labs, including OpenAI and Google, were struggling to achieve significant performance gains with their newest models, partly due to a scarcity of high-quality training data. At the time, industry analysts were already questioning whether the AGI bubble had begun to deflate.

Apple's paper provides a new, more rigorous lens on this problem. It suggests that the "reasoning" demonstrated by current models may be a sophisticated form of pattern matching, not a generalizable problem-solving capability. This has significant implications for an industry that has staked tens of billions of dollars on the promise of super-intelligent systems. It calls into question the value of paying a premium for these advanced models if they fail at the very tasks they are marketed for. It also lends credibility to the strategies of companies like DeepSeek, which have focused on achieving near-frontier performance with radical compute efficiency, rather than just scaling up raw thinking time.

Apple's research paper doesn't mean AI progress is over. But it does serve as a crucial reality check, suggesting that the next major breakthrough won't come from simply making existing models think longer, but from inventing entirely new ways for them to think at all.


The Reference Shelf

  • What Apple's controversial research paper really tells us about LLMs (ZDNet)
  • The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity (Apple)
  • OpenAI, Google and Anthropic Are Struggling to Build More Advanced AI (Bloomberg)