AI: Too Smart or Too Dumb to Think Like Us?
- Dennis Hulsebos
- 4 days ago
- 5 min read
Why “The Illusion of Thinking” makes for a great clickbait title but a poor reading of the evidence.

The Headline Hype: How a Stress-Test Became a Doom-Post
Most headlines follow the same tone: “AI reasoning fatally flawed,” “Apple exposes the illusion of intelligence.” These headlines reference Apple’s recent paper, The Illusion of Thinking, but rarely do they engage with the paper’s actual content.
So what did the paper actually test? It benchmarked models on a narrow class of symbolic puzzles: Tower of Hanoi, Blocks World, River Crossing, and Checker Jumping. These puzzles were chosen because they offer objective ways to check correctness. Even so, the authors make clear that this represents only a narrow slice of reasoning. It is not intended as a comprehensive evaluation of intelligence.
More critically, all reasoning in the experiment had to occur via text alone. Models were not permitted to use calculators, external memory, or code interpreters. These are all standard parts of today’s language model toolkits, but were stripped away for the purpose of the study. The goal was to measure what models can accomplish using only their internal “working memory.”
This design choice has deep consequences. If a person were asked to solve dozens of logic puzzles mentally without using a notepad or digital tools, and they failed, we would not conclude they are incapable of thinking. But in the case of AI, this constraint is somehow being used to make precisely that claim. In truth, this says more about the experiment's artificial setup than it does about any hard cognitive limits in the models.
Failures only occurred at extreme scales. The puzzles were deliberately extended to require hundreds of tightly sequenced steps. This adds a level of complexity far beyond normal human working memory, and arguably beyond the demands of almost any real-world task.
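To get a feel for how quickly “tightly sequenced” becomes extreme, take Tower of Hanoi, one of the puzzles in the study: the optimal solution for n disks takes 2^n − 1 moves, so every added disk doubles the length of the move sequence that has to be produced without a single slip. A minimal sketch of that scaling:

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the optimal move list for an n-disk Tower of Hanoi.

    The optimal solution always takes 2**n - 1 moves, so the sequence
    a solver must emit flawlessly doubles with every extra disk.
    """
    if n == 0:
        return []
    return (hanoi_moves(n - 1, source, spare, target)     # park n-1 disks on the spare peg
            + [(source, target)]                          # move the largest disk
            + hanoi_moves(n - 1, spare, target, source))  # restack the n-1 disks on top

for disks in (3, 8, 10):
    print(f"{disks} disks -> {len(hanoi_moves(disks))} moves")
# 3 disks -> 7 moves, 8 disks -> 255 moves, 10 disks -> 1023 moves
```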
In summary, the paper does not prove that AI systems cannot think. It shows that current models have trouble scaling to high-complexity symbolic problems without assistance. That nuance is often ignored because it does not lend itself to viral headlines.
Why ‘Thinking’ Is a Moving Target
The assertion that AI cannot think assumes we already have a clear understanding of what thinking is. But there is no such settled definition. Different academic fields frame thinking in fundamentally different ways, and even within each field, major disagreements persist.

This diversity of perspectives highlights why comments like “predicting the next word is not thinking” are philosophical judgments, not empirical conclusions. That statement aligns with Searle’s Chinese Room criticism, but it overlooks Turing’s behavioural approach and dismisses functionalist arguments that evaluate thought by its patterns and outputs.
Scientific disciplines are equally divided. Neuroscience defines thinking as brain activity, measurable by electrical and chemical changes in specific regions. Yet this only identifies the location of activity, not the content or nature of thought. Cognitive science describes thinking in terms of symbol manipulation, attention, memory, and planning. These models often borrow metaphors from computing, but they struggle to capture subjective experience.
These divergent perspectives are not trivial. They reflect real and unresolved tensions. Is thinking something that happens privately, or something we observe in action? Does it require consciousness, or just coherent output? Must it be embodied, or can it emerge from disembodied computation?
There is no agreed-upon answer. The most widely accepted minimum definition is vague: thinking is the manipulation of information in a way that supports goal-directed behaviour. That definition is useful but not definitive. It highlights why the Apple paper cannot close the debate. It simply explores one corner of a complex map.
Does Token Prediction Count as Reasoning?
At their mathematical core, modern language models still select the next token that best fits the context. Yet the surrounding architecture has changed dramatically in the past year, and those changes matter for any claim about reasoning. Contemporary frontier systems, such as OpenAI’s GPT-4o and o3-series reasoning models, Google’s Gemini family (now headed by Gemini 2.5), Anthropic’s Claude 3 line, and independent entrants like Mistral Large, do far more than roll out an endless string of guesses. They maintain vector memories so they can refer back to earlier dialogue, they run internal deliberation passes that generate multiple candidate chains of thought and vote among them, and they decide on the fly when to call specialised tools such as a calculator, a search retriever or a code interpreter. The raw token predictor has become the engine inside a broader vehicle built for planning, reflection and execution.
Because of these additions, a single model can outline a twelve-step product launch, simulate competing market scenarios, test its own calculations in Python and then rewrite the plan after spotting a bottleneck. Each sub‑task still produces tokens, but the larger loop resembles the deliberative cycle we associate with human reasoning: set a goal, search for options, evaluate, act and review. Philosophers call this instrumental rationality, the capacity to pursue ends through coherent means. Whether the model understands in any phenomenological sense remains an open debate, yet its behaviour now meets the functional criteria that many cognitive scientists apply when they study reasoning in animals or humans.
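To make that loop concrete, here is a toy, runnable sketch of the goal, search, evaluate, act, review cycle. Every function in it (propose_plans, self_score, calculator_tool, goal_satisfied) is a hypothetical stand-in invented for illustration, not any vendor’s actual API.

```python
import random

# Toy sketch of the deliberative loop: set a goal, sample candidate plans,
# vote among them, act through an external tool, then review the result.
# All components are illustrative stand-ins, not a real model or framework.

def propose_plans(goal: str, n: int = 3) -> list[str]:
    """Stand-in for sampling several candidate chains of thought."""
    return [f"{goal}: candidate plan {i}" for i in range(n)]

def self_score(plan: str) -> float:
    """Stand-in for the model scoring or voting on its own candidates."""
    return random.random()

def calculator_tool(expression: str) -> float:
    """Stand-in for a specialised tool call (here: plain arithmetic)."""
    return eval(expression, {"__builtins__": {}})

def goal_satisfied(history: list) -> bool:
    """Stand-in for the review step: accept once a tool result exists."""
    return len(history) > 0

def deliberate(goal: str, subproblem: str, max_rounds: int = 3):
    history = []
    for _ in range(max_rounds):
        plan = max(propose_plans(goal), key=self_score)      # search + evaluate
        history.append((plan, calculator_tool(subproblem)))  # act via a tool
        if goal_satisfied(history):                          # review, then stop or revise
            break
    return history[-1]

print(deliberate("estimate quarterly revenue", "3 * 1250 + 2 * 980"))
```

The stubs are deliberately trivial; the point is the shape of the loop, which is the same shape the paragraph above describes.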
The functionalist interpretation therefore gains strength. If a system can allocate attention, chain intermediate results, verify its own work and revise when evidence shifts, then it performs a process we ordinarily label thinking. Refusing to use that word because the substrate is silicon rather than neurons becomes less about empirical evidence and more about philosophical preference.
Why It Still Matters
So why does this paper matter? Not because it delivers a fatal blow to machine reasoning. It matters because it shows where the limits currently lie, identifying the thresholds beyond which today’s models need support to keep performing.
It helps us ask better questions. What assumptions are built into the task? What aids are permitted? How long is the reasoning chain before failure sets in? These questions are crucial, not just for benchmarking, but for designing better systems.
It also shows us what to build next. The future of AI is likely to involve modular systems: memory components, retrieval mechanisms, and planning agents. These additions are not cheating. They are how humans think, too. We think with notes, reminders, collaboration, and tools.
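As a rough sketch of what “modular” could look like, assuming entirely made-up component names rather than any existing framework, here is a toy pipeline that wires a scratchpad memory and a keyword retriever around a placeholder planner:

```python
from dataclasses import dataclass, field

@dataclass
class Scratchpad:
    """External memory component: a place to park intermediate notes."""
    notes: list[str] = field(default_factory=list)

    def write(self, note: str) -> None:
        self.notes.append(note)

def retrieve(query: str, corpus: dict[str, str]) -> str:
    """Toy retrieval component: return the document with the most word overlap."""
    return max(corpus.values(),
               key=lambda doc: len(set(query.split()) & set(doc.split())))

def plan_and_answer(question: str, corpus: dict[str, str], pad: Scratchpad) -> str:
    """Placeholder planner: fetch context, note it down, draft an answer."""
    context = retrieve(question, corpus)        # retrieval mechanism
    pad.write(f"context: {context}")            # memory component
    return f"Answer drafted from: {context!r}"  # a real system would call a model here

corpus = {"q3": "quarterly revenue grew in q3", "hr": "hiring slowed in march"}
print(plan_and_answer("what happened to quarterly revenue", corpus, Scratchpad()))
```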
Finally, the paper reminds us just how hard it is to define what thinking really means. As long as scientists, philosophers, and engineers use different yardsticks, statements like “AI can’t think” will remain rhetorical rather than scientific.
What lies ahead is not a clean line between thinkers and non-thinkers, but a continuum. A landscape of systems that reason, some more narrowly than others, some more creatively, some more usefully. The task is not to declare whether AI can think, but to understand the ways it does and build accordingly.