Reading 'The Illusion of Thinking'
I saw this research paper getting some attention and it made me realise it had been a while since I sat down and read a research paper. So that is what I did. I have read just a handful of LLM-related papers before, so I am by no means an expert. Keep that in mind as you read my opinions. Also keep that in mind if I write out obvious-to-you things; I am just trying to find my way through the terminology. Also keep that in mind if I say something wrong. You are welcome to correct me though. Finally, I am somewhat writing this as my “notes and thoughts post reading”, not necessarily with the intention of it being a nice and coherent blog post for you, the reader.
The paper in question is “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity” by Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The paper is not published in a conference or journal at the moment, as far as I can tell. So does that make it a technical report instead of a paper? In the fast-moving AI world, the line between the two seems a bit blurred. The authors were all working for Apple during this research. You can find the paper on this Apple CDN.
I remember reading an opinion stating that Apple would be biased against reasoning models given how far behind they are in the “AI race”. If they can show that the “dumber” models are actually pretty good, then it gives them an excuse. I can see some logic in that; it does feel like Apple has missed the boat so far.
Note: I read some further taking apart of this paper afterwards. I have added a summary of that breakdown at the end of this post. You may want to skim that first. Some of their remarks have been added to my original musings.
Some Terminology
To not interrupt the rest of the text, I will throw explanations of some words in here, at least how I currently understand them.
- Large Language Models (LLM) you are probably familiar with by now. The default approach to all this AI hype. Trains on large amounts of data. Uses that training to guess the next word (well, token) based on words (well, tokens) that precede it. Somehow this works quite well.
- Large Reasoning Models (LRM). Basically an LLM with reasoning? Reasoning here meaning that it requires more training, such as having humans verify output during training, having humans write out the step-by-step thinking process for something, or having more mathematically-oriented training data, which would also contain step-by-step explanations of what is happening in, say, a proof. These models are then also expected to “think harder” when answering a question. I assume this is just a consequence of that kind of training data, though I imagine you can also just tell models to be overly verbose (as if they weren’t already!). I am unsure whether “write out a thinking process” is part of the base prompt used by AI companies. Note: I am not sure whether “LRM” is used outside of this paper, but I will use it in this blog post to be consistent with the paper.
- Chain of Thought. Basically the ramblings these models have prior to an answer. Having this be long seems to be inherent to the reasoning models, if you train one on data that focuses on step by step walking through ideas. These things are so annoyingly verbose because it makes them less error prone. Reasoning models even more so. I know in Gemini (Pro version?), you can see this chain of thought, which is different from (or builds up to) the final answer you get.
- Temperature. A number. At zero, there is no randomness in results: the same input produces the same output. The higher it is, the more randomness (creativity, if you will) the answers have. Around 1 seems to be the default. Presumably this is some parameter in an equation. Honestly this only gets a mention in the appendix (it is set to 1.0), but I had forgotten which direction it scaled, so I decided to write it out here. There is a small sketch of the idea right after this list.
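To remind myself which way temperature scales, here is a minimal sketch of temperature-scaled sampling. This is purely my own illustration (the vocabulary and logits are made up, and real models are obviously far more involved), not anything from the paper:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=np.random.default_rng(0)):
    """Sample a token index from logits scaled by a temperature.

    Temperature 0 is treated as greedy decoding: the same input always gives
    the same output. Higher temperatures flatten the distribution, so the
    answers become more random ("creative").
    """
    if temperature <= 0:
        return int(np.argmax(logits))          # no randomness at all
    scaled = np.asarray(logits) / temperature  # divide logits by temperature
    probs = np.exp(scaled - scaled.max())      # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Made-up four-word vocabulary and logits, just to show the effect.
vocab = ["the", "cat", "sat", "flew"]
logits = [2.0, 1.0, 0.5, -1.0]
for t in (0.0, 0.5, 1.0, 2.0):
    picks = [vocab[sample_next_token(logits, t)] for _ in range(8)]
    print(t, picks)
```

At temperature 0.0 the output is always “the”; at 2.0 the less likely words start showing up regularly.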
Summary
The paper aims to compare LLMs to LRMs. The authors do not want to use mathematical problems for this, as they fear those particular problems to be part of the training data, which would skew results when you talk about actual thinking, for some definition of actual thinking. Instead, they use four puzzles:
- Towers of Hanoi
- Some game of moving checkers left-right
- River crossing problem (there are a few variations of this that you might have heard of, like a wolf, a sheep, and a cabbage having to get across a river in a boat)
- Blocks world, which appears to be a well-known testing problem in the AI world and requires the agent to do some planning to solve the problem.
Another stated advantage of these four puzzles is that their complexity can be varied by increasing a certain n: the number of disks in the Towers of Hanoi, the number of checkers in the checkers puzzle, the number of individuals in the river crossing, and the number of blocks in blocks world.
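To make that scaling concrete for Towers of Hanoi: the well-known recursive algorithm is identical for every n, only the number of moves grows (2^n - 1). A minimal sketch, my own illustration rather than anything taken from the paper:

```python
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Classic recursive Towers of Hanoi: move n disks from source to target."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # move the n-1 smaller disks out of the way
    moves.append((n, source, target))           # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # move the smaller disks back on top
    return moves

# The procedure never changes with n; only the output length does: 2^n - 1 moves.
for n in (3, 8, 15):
    print(n, len(hanoi(n)), 2**n - 1)
```

That exponential growth in the number of moves is worth keeping in mind for the token-usage discussion later on.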
To compare LLMs to LRMs, the authors ask both (paired by vendor, so regular Claude versus Claude thinking) to solve each puzzle at varying levels of complexity. The authors break down the comparisons and results into three bands of complexity, where the categorisation of complexity is derived from the trends they observe in each puzzle, not from any predefined notion of complexity.
- Low complexity. On low complexities, both types of model get the same results. Because LRMs are trained to be more verbose in getting to their answer, they are less efficient here. The reasoning models also “overthink” the problem. More tokens (read: more processing; read: more energy; read: more money) are used to get to the same result. Since this is a waste, the authors conclude that LLMs win out at low complexity.
- Medium complexity. As mentioned, the categorisation of complexity is derived from observed trends. It seems that low complexity ends when the regular model starts making (too many) mistakes: there is a noticeable drop off in the accuracy of its answers. In the reasoning model, this drop off is delayed and stretched out a bit more. In this category, reasoning models win out.
- High complexity. Finally, there is a point where reasoning models also collapse entirely and their accuracy bottoms out. Everything beyond that point is classified as high complexity. In this category, both models are wrong.
The authors make a curious observation: up to a point, the reasoning models will use more tokens. However, as the complexity increases, this trend will stop and in some cases reverse, using fewer tokens at the higher complexities. Of course, at those higher complexities, the answers were wrong. The authors cannot explain why this happens. The predefined token limit was not reached.
The authors further look into the “thinking” of the models, i.e., the wall of text they spew out prior to formulating the solution they give the user. Here, they notice that at low complexities, the correct solution actually does appear in the reasoning text relatively early. The reasoning model seems to keep going beyond that point, not realising things are solved already. This is referred to as “overthinking”. In medium complexity, the correct solution appears closer to the end of that thought process, after having gone through some incorrect solutions.
Finally, the authors repeat the puzzles, but provide the reasoning model with the algorithm to solve the puzzle in question. They find no (meaningful) difference in the accuracy of the answers.
My Comments, Questions
The Accuracy Drop Off
The authors classify low, medium, and high complexities. At medium complexity, regular LLMs start doing really badly (for some definition of really badly). At high complexity, so do their reasoning variants. For the Tower of Hanoi problem I am definitely inclined to agree with this complexity breakdown of the graph. In the other problems, however, this is less noticeable. For a side-by-side comparison of them all, check out Figure 4 on page 7. In checker jumping, the collapse to near zero accuracy happens in both at nearly the same point; only Claude thinking just scrapes out 60% accuracy at complexity 2. In river crossing, there is no low complexity band: it starts at medium complexity right from complexity 2 (i.e., the LLM never gets it right with sufficient accuracy). Here the thinking models still perform well, at 80%, while the regular LLM gets just ~10%. At higher complexities they are both useless. Blocks world does have somewhat of a medium complexity band too: thinking models hover around 20-40% accuracy for a while, while the regular model becomes useless early on.
My gut reaction here is that Towers of Hanoi is the most commonly known of the four problems (my subjective opinion). I would expect it to appear more often in training data, and I would expect written-out solutions of it to appear more often too, especially for models touted to be trained on step-by-step data. Is this just a matter of “up to this complexity, this exact problem had solutions in the training data”? It makes me skeptical that this is a “thinking” issue. It feels more like “can this model retrieve the solutions it has seen plenty of times for these instances of this puzzle?”.
The Token Usage Drop Off
Past a certain complexity, the models start using fewer tokens again. They have not reached the token limit, they just do not seem to even try.
My gut reaction when seeing this result was to think that there is some extra service running to estimate the complexity of a question. If it estimates it to be too much, it might just have the main model not bother trying to solve the problem. Would this even work right now? Would AI companies charging users money by the token even care about this? (I guess the larger question there is whether an AI company makes money yet from users paying by the token or whether it is still venture capital vibes).
That explanation also would not hold for DeepSeek which, as far as I understand, can be run locally and thus would not have such a service running.
I am a bit stumped for an explanation, and I find it a pity that the authors did not delve deeper into this.
Others (see addendum at end of this post) have pointed out that the models have (a) a limit to their number of output tokens that is exceeded by many of the higher complexity puzzles and (b) explicitly stop outputting long repetitive output, so they are to some extent aware of their own output limits.
Providing the Algorithm
For these known puzzles, I would expect the algorithm to solve any of them to have appeared in the training data plenty of times. As such, I can see why providing the algorithm to the model does not seem to increase accuracy.
I still somewhat see these models as Markov chains. Much, much, much better ones, mind, but Markov chains all the same: given a set of words, predict the next one, with that prediction based on the training data. Experiments like these failing so badly on something seemingly simple raises one big question for me: are the steps to solve the easier versions of these problems simply written out often enough in the training data already? This would also be in line with the models performing noticeably better on the various complexities of Towers of Hanoi, the puzzle that, in my subjective experience, is much more commonly known.
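For what I mean with that Markov chain analogy, here is a toy bigram next-word “model”. It is obviously nothing like a real LLM (no neural network, no context beyond one word), the corpus is made up, and it is only meant to illustrate “predict the next word based on what followed it in the training data”:

```python
import random
from collections import defaultdict

# Tiny made-up "training corpus".
corpus = "move disk one to peg three then move disk two to peg two".split()

# Count which words were seen following which: the entire "model" is this table.
transitions = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev].append(nxt)

def generate(start, length=8):
    """Repeatedly pick a next word among the words seen after the current one."""
    words = [start]
    for _ in range(length):
        options = transitions.get(words[-1])
        if not options:
            break  # the current word never appeared mid-corpus, so stop
        words.append(random.choice(options))
    return " ".join(words)

print(generate("move"))
```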
Other Notes
- The authors conclude that LLMs win out at low complexity (in terms of quicker answers). However, complexity was derived from how the regular and the reasoning model compared to each other. To be even more correct, it is defined as the point that the regular model starts answering incorrectly considerably more than before. For a user of an LLM, this is not actionable information, they are unlikely to know when that inflexion point is reached. For researchers it creates an opportunity: how to identify this point? How to do this automatically or consistently?
- I am not sure I entirely agree with how they vary complexity. At least for something like Towers of Hanoi, I do not think the algorithm changes in any way for more disks; there are just more steps to walk through. It is constantly the same concept applied to the current state of the puzzle. I think the same is likely true for the checkers and river crossing puzzles. I am not familiar enough with blocks world to make a statement. Others have confirmed this feeling, see addendum at the end of this post.
- Note that the provided solution does not need to be optimal. Instead every step needs to be allowed (given the rules of the puzzle) and the final state needs to be the desired state.
- Parsing out the steps is done with some regex after telling the LLM to use a certain format for outputting steps. Manual cleaning of the data is performed prior to analysis. A small sketch of this parse-and-validate idea follows after this list.
- In a way, LLMs being so bad at this when there are simple algorithms seems counter-intuitive to how we (or at least, I) have been thinking about software programs. Software programs of old might have been bad at thinking up a proper way to solve a problem, but once given an algorithm they sure were way better at executing it than the average human. Messing up this badly on rather simple puzzles while still being able to point execution in the right direction is the polar opposite of that. If the two approaches meet, then you have something standalone and very powerful indeed. I guess this is where things are perhaps headed, with LLMs and friends thinking up solutions, programming code, and having other systems execute that code. And then we all hope it works well enough.
- I still wonder whether reasoning models are also explicitly asked in their prompts to be overly verbose and step by step detailed. Is that a part of how companies create chatbot reasoning models?
- The authors note their aim is not to create a new benchmark. That said, the appendix does seem to describe the puzzles and setup in enough detail to use them as such a benchmark.
- I still need to look up “pass@k”, something they referenced a few times.
- To count tokens, they use the same tokeniser as the corresponding model. Here they specifically mention cl100k_base, which leads me to tiktoken, a tokeniser by OpenAI. A tiny example of that follows after this list.
- On page 10, figure 7, I was somewhat confused at first. Some notes that I added to clarify it to myself. A “solution” is a set of moves. A solution is correct if it solves the problem entirely. I think the distribution per complexity is made up of parsed solutions (correct or incorrect). It is unclear whether this is a combination of all solutions across all runs of the experiment (at that complexity) or whether this is first aggregated per run. The ✓ and ✗ seem to be the mean of the distributions of correct and incorrect solutions.
- In that same figure 7, I wonder the following. A higher complexity implies there are more steps to solve the problem, so the number of tokens used will have to be higher. Presumably, the text between solutions does not grow with complexity. Does this affect the position of solutions in any way? Incorrect solutions can vary more in length the longer the actual solution is. Or rather, there is more chance of them doing so.
- At the bottom of page 10, the authors state “This likely suggests that examples of River Crossing with N > 2 are scarce on the web, meaning LRMs may not have frequently encountered or memorized such instances during training”. This seems to affirm my worry mentioned before: is there really thinking happening? Does this directly have anything to do with complexity? Or is it just an information retrieval machine that has not encountered that information?
- The appendix provides a bit too much detail about their puzzle step validation programs. Felt irrelevant (of course you have a program to check the step validity for you), but I guess it was just in the appendix.
- While reading the prompts for the puzzles, I wondered: could it be that some are just confusingly worded? I wondered that in particular when trying to parse the prompt for the river crossing, coincidentally the problem the models did horribly on. Others have confirmed this feeling, see the addendum at the end of the post.
- Temperature is set at 1.0 for all. Does this have any implications? Would varying it provide better or worse results?
- A remark in the appendix points out that the models perform better on Tower of Hanoi than river crossing puzzles even though the river crossing puzzles they fail on have a lower “compositional depth” than the Tower of Hanoi puzzles they succeed on. Compositional depth here is the number of steps needed to go from initial state to solution state. This feels like it again affirms that Tower of Hanoi would just have more written out solutions in the training data? I know, I am a bit too focused on that as a big explainer of things.
- Similarly, when checking at what point in a solution a model messes up, the authors note that a 100+ move solution for complexity 8 in Towers of Hanoi can be fine, while failure happens before move 50 in the same puzzle at complexity 15.
- It is noted that thinking models exhibit high variance in their solutions. On average, they still do better than regular models of course. They might just be less reliable. I wonder if this can have a negative effect on users. If you are asking the model about things you do not know, then you are likely unable to recognise an incorrect solution as well. With a high variance, you can have great success or complete failure, and you cannot find out. This harkens back to one of the main complaints that users have about these models. They sound confident, but how can you be confident they are correct?
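As promised a few bullets up, here is a small sketch of what such a parse-and-validate setup could look like for Tower of Hanoi. The output format and regex are my own guesses, not the paper's actual setup, and the check only verifies that every move is legal and that the final state is the goal state, not that the solution is optimal.

```python
import re

# Hypothetical output format: lines like "move disk 2 from A to C".
MOVE_RE = re.compile(r"move disk (\d+) from ([ABC]) to ([ABC])", re.IGNORECASE)

def parse_moves(text):
    """Pull (disk, source, target) tuples out of the model's free-form answer."""
    return [(int(d), s.upper(), t.upper()) for d, s, t in MOVE_RE.findall(text)]

def validate_hanoi(moves, n):
    """Replay the moves: every move must be legal, the end state must be the goal."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # disk n at the bottom of A
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # that disk is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # cannot place a disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))  # everything ended up on the target peg

answer = "First move disk 1 from A to C, then move disk 2 from A to B, ..."
print(validate_hanoi(parse_moves(answer), n=2))  # False: this answer never finishes
```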
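And since the tokeniser bullet above mentions cl100k_base: counting tokens with tiktoken amounts to something like the snippet below. This is my own example, not the authors' code, and it assumes the tiktoken package is installed.

```python
import tiktoken

# cl100k_base is the encoding the appendix mentions; tiktoken ships it.
enc = tiktoken.get_encoding("cl100k_base")
text = "Move disk 1 from peg A to peg C."
tokens = enc.encode(text)
print(len(tokens), tokens[:5])
```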
Other Questions
Finally some questions I jotted down at the end that perhaps could be more research.
- Can this paper be reproduced with puzzles that are not common?
- How do the models handle a sufficiently new puzzle? (Side question: can we think up a sufficiently new puzzle)
- What if you give the algorithm for that completely new puzzle?
- There is less training data in language X. Does it make those models dumber then? (Provocative title incoming, “Dutch LLMs are idiots”)
- Are there any puzzles common in language X, but not in English? Would those then lead to different results when fed to a model trained mostly on English?
Interesting Papers Cited
(yes yes, I should add authors and such to this, but I got lazy copy-pasting out of PDF)
- Used as a reference for “are [these models just] leveraging different forms of pattern matching?” So that is in line with the worry I have stated a few times in these notes.
- “GSM-symbolic: Understanding the limitations of mathematical reasoning in large language models”. 2025.
- I am mostly curious what became of this. I had not heard about it before (and still have not), but it sounds interesting.
- “Phi-3 technical report: A highly capable language model locally on your phone”. 2024.
- Honestly, the title just caught my eye.
- “Understanding large language models through the problem they are trained to solve”. 2023.
- “Alice in Wonderland: Simple tasks showing complete reasoning breakdown in state-of-the-art large language models”. 2024.
- Used as references for “Generating a Chain of Thought (CoT) has been shown to improve model performance”
- “Chain-of-thought prompting elicits reasoning in large language models”. 2022.
- “Lambada: Backward chaining for automated reasoning in natural language”. 2022.
- “Teaching algorithmic reasoning via in-context learning”. 2022.
- “Large language models are zero-shot reasoners”. 2022.
- Used as references for “Incorporating self-verification prior to the final answer has been shown to improve model performance”
- “Large language models are better reasoners with self-verification”. 2023.
- “Making language models better reasoners with step-aware verifier”. 2023.
- “Sample, scrutinize and scale: Effective inference-time search by scaling verification”. 2025.
- I am curious what “pause tokens” means.
- “Think before you speak: Training language models with pause tokens”. 2024.
- A thinking trace is not necessarily equal to the final result.
- “Reasoning models don’t always say what they think”. 2025.
- “LLMs can easily learn to reason from demonstrations; structure, not content, is what matters!”. 2025.
- I am curious what this means. Also, I just noticed this is by someone from my alma mater, so maybe I will give it extra attention next.
- “The relationship between reasoning and performance in large language models – o3 (mini) thinks harder, not longer”. 2025.
- Some cited benchmarks
- “Puzzles: A benchmark for neural algorithmic reasoning”. 2024.
- “Large language models still can’t plan (A benchmark for LLMs on planning and reasoning about change)”. 2022.
- “LLMs still can’t plan; can LRMs? A preliminary evaluation of openai’s o1 on planbench”. 2024.
Conclusion
I hope this made some sense. As mentioned before, this is in big part just me taking notes during and after reading. Maybe glancing through it will encourage you to read the paper yourself. Or tell you enough about it that you don’t think you need to bother. I thought about explicitly writing down pros and cons, perhaps even a reject / weak reject / weak accept / accept rating, but that was always the part of reviewing papers for the academic community that I disliked the most, so I cannot be bothered right now. I will leave it at this: the paper presents some interesting data points, but falls short of poking at an explanation for those points.
Anyway, I still enjoy reading research papers. When deciding to read this one, I had set myself the goal of actually writing out my understanding / thoughts / questions as a way to force myself to think deeper about it. Writing it out is also fun, to a point, but it is quite the time sink. I will aim to do this whole exercise again with another paper, but I do not know yet how often I will be able to do so.
You might think: why even bother? Just have an LLM generate a summary and roll with that. At that point I might as well just stick to reading summaries that were already online. The entire point of this exercise is to train myself :)
Update a day later as I decide to read some other, more thorough, reviews of this paper.
- This arxiv.org paper brings together several arguments raised online
- This Twitter post by @scaling01 points out that models have a maximum output size, that the output sometimes explicitly says something along the lines of “I’ll stop here, rest is all similar”, and that long output sequences are bound to produce a small error in models working on probability
- This follow-up Twitter post by @scaling01 expands on the previous
They raise several issues; I will just provide a quick summary here (following the structure of the arxiv paper).
- Models have an output limit. @scaling01 points out this limit is reached when hitting 12 and 13 disks, without any reasoning text in-between.
- Models recognise this output limit, stating “The pattern continues, but to avoid making this too long, I’ll stop here”. Supposedly the current Claude model is explicitly trained to not be verbose. This is somewhat in line with me wondering whether something might be estimating complexity and breaking off the “thinking”. I guess it is simpler than that: the model just notices “hey, this is a lot of useless output, let’s stop”. Similarly, models might also skip intermediate steps of a solution during the reasoning process, further affecting the paper’s analysis of the models’ thinking process.
- Evaluation of a solution is done by checking whether the entire set of moves leads to the goal state. A probabilistic model that is correct 0.9999 of the time per move (well, per token) will, over a solution of several thousand tokens, realistically mess up at least once. @scaling01 shows that the graph of “chance of messing up at least once” as a function of the number of tokens required at each Tower of Hanoi complexity looks pretty similar to the accuracy graph in the original paper. A tiny illustration of that effect follows at the end of this post.
- Apparently the way the river crossing puzzle is posed, there are no solutions for N ≥ 6 with boat capacity of 3.
- The arxiv paper mentions that the models manage to create a program to solve the problem, which I don’t think anyone was doubting at this point.
- It is pointed out that complexity as defined in the paper is not a well-defined metric. As I mentioned in my own rambling: Tower of Hanoi is a pretty straightforward algorithm to apply. There is no extra complexity, it just takes longer. The review paper points out that blocks world does increase in difficulty, as more planning is required. This is mentioned as an explanation as to why the models fail on “simpler” blocks world problems while proceeding further on “harder” Tower of Hanoi problems. The original paper’s “complexity” was simply ill-defined.
- I had wondered about the way the prompts are crafted. It turns out I had glossed over a bigger flaw in the blocks world prompt. Here, the original authors state that an optimal solution must be found. A much harder problem than finding just any solution.
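As a tiny illustration of that error-compounding point (my own numbers, only loosely inspired by @scaling01’s argument): if each output token is independently correct with probability p, an output of T tokens is flawless with probability p^T, which collapses quickly once Tower of Hanoi solutions grow into the thousands of tokens.

```python
# Probability of a completely error-free output of T tokens, assuming each token
# is independently correct with probability p. Purely illustrative numbers.
p = 0.9999            # per-token accuracy, the figure used in @scaling01's argument
tokens_per_move = 10  # rough assumption for how many output tokens one move takes

for disks in (5, 8, 10, 12, 15):
    moves = 2**disks - 1  # minimal Tower of Hanoi solution length
    tokens = moves * tokens_per_move
    print(disks, moves, round(p**tokens, 4))
```

The exact numbers depend heavily on the assumptions, but the shape of the drop is the point.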