Look, I said I was going to read some more papers again — and I did — but then summer travel hit and general busy times and I never got around to rewriting notes into a blog post. Which is also something I claimed I would start doing. So here we are, you reading me trying to remember what I thought about “The Relationship Between Reasoning and Performance in Large Language Models — O3 (mini) Thinks Harder, Not Longer” by Marthe Ballon, Andres Algaba, and Vincent Ginis. Find it on arXiv.

Why this paper? Because when it was first posted online, I saw the first author share it on LinkedIn. She is not a connection of mine, but she works at my old university and I’m quite sure she is/was a member of my old student group. Either that or I am mixing up people, which, honestly, is quite possible too. Either way, that made me decide to read this one even if it might not be on many people’s radar.

What can I still say after all this time? Well, I did take notes in the PDF while reading, so I am hoping I can stitch my opinion together from those. I am not starting over, so whatever I can get out of those notes is what you will get out of this. (Just picture a massive vertical gap under this paragraph, to make it look like we’re both getting nothing out of this. What a great joke.)

What’s It About?

The paper compares two or three OpenAI models, depending on how you define model: O1-mini, O3-mini with reasoning set to its default (they refer to this as “(m)” for medium), and O3-mini with reasoning set to high (they refer to this as “(h)”).

Comparison is done by running the models on questions from Omni-MATH, a benchmark of “Olympiad level” questions. This is a paper from the math department, after all. The benchmark assigns each question a difficulty from 1 (easy) to 10 (hard). The paper makes this rating coarser, grouping questions into four categories with difficulties [1, 4], (4, 5], (5, 6], and (6, 10]. Answers are judged by… another model (Omni Judge), but that is a pretty common approach.
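
If you want a concrete picture of that bucketing, here is a quick sketch of how I read it. This is my own rendering in Python, not code from the paper.

```python
def difficulty_tier(difficulty: float) -> str:
    """Map an Omni-MATH difficulty rating (1-10) to the paper's four bins,
    as I understand them: [1, 4], (4, 5], (5, 6] and (6, 10]."""
    if difficulty <= 4:
        return "[1, 4]"
    if difficulty <= 5:
        return "(4, 5]"
    if difficulty <= 6:
        return "(5, 6]"
    return "(6, 10]"

print(difficulty_tier(4.5))  # -> "(4, 5]"
```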

O3-mini (h) does better than O3-mini (m), which does clearly better than O1-mini, which does way better than gpt-4o. Oh yeah, we’re pulling in gpt-4o for this part, but it is largely irrelevant to the main point of the paper beyond telling you that O1 and onwards are clearly better.

The paper further looks at the “chain of thought”, aka the rambling (I’m sorry, reasoning) a model does to get to its final answer. Specifically, it considers the number of reasoning tokens used. It finds that O3-mini uses about the same number of tokens as O1-mini while getting better results. The high version gets slightly better results still, but uses a lot of extra tokens (read: computing power) to achieve this.

My General Opinion

The main result boils down to: the newer model has better accuracy for the same token usage. I would hope so? If OpenAI released a new model and claimed it was better, but you got the same accuracy from the same token usage (read: the same money you are paying), then I think there would have been an uproar pretty quickly. A good amount of text in the paper is spent repeating this conclusion.

The secondary result says that higher reasoning effort and a (much higher) maximum token budget lead to better results, but likely not enough of an improvement for most use cases. That’s a fair result, and it is good to have it quantified.

This study is an OK first look at accuracy versus token usage, but I am curious whether the pattern holds. It could be extended with a fresh look at more recent models, both within OpenAI’s lineup and within other model families (Llama, Qwen, …). It also needs to use max_reasoning_tokens as an extra variable, as I will get into below.

Further Remarks / Nitpicking / Questions

  • O3-mini (high) has a higher max_reasoning_tokens limit: 100k for O3-mini (h) vs 25k for O3-mini (m) and O1-mini. Note that this is a different setting from the one they use to define “high reasoning”, which is reasoning_effort (see the sketch after this list). This choice of token limit feels rather consequential, yet it is kind of glossed over. I would have been more curious to see the same max_reasoning_tokens used for all models or, even better, max_reasoning_tokens treated as another variable in the experiment. As it stands, the paper draws conclusions about the (h) model using more tokens while only mentioning this different limit in the appendix. I am left wondering whether any conclusions about the (h) model’s token usage can be taken at face value, and with more questions besides. Is higher reasoning effort enough to get more accuracy? Or is the higher token limit the main contributor?

    This difference in token budget, caused by a setting, bothers me in a few different spots in the paper.

  • I think the paper could have used a bulleted list of its actual conclusions, though I am not sure there are any daring statements we can really draw from it. Right now, various mini takeaways are scattered all over the text, and I find the structure more confusing than it needs to be. I am reminded of my prof telling me to use boxed environments at the end of sections to summarise the takeaways in a sentence or two.
  • The paper finds that accuracy decreases as the chain of thought grows, even when controlling for question difficulty. This happens for all tested models. My gut reaction here is that question difficulty is not some absolute metric. Perhaps the definition needs to go the other way around: if the chain of thought grows, then it was a more difficult question (for that model). The paper does touch on this in its discussion.
  • The paper does point out that the decrease in accuracy is less pronounced in better models. Read: the downward slope is not as steep. From this it concludes that allowing a higher token budget has a better chance of being useful with a stronger model than with a worse one, where the dropoff makes it pointless. This is a nice takeaway.
  • The paper points out that the Discrete Mathematics category uses more tokens while Calculus and Algebra use fewer. It does not delve deeper into why that might be. I had hoped a look at the reasoning output would have given some insights here.
  • “Each reasoning model refused to answer a few questions (flagged as invalid prompts), which were subsequently omitted from analysis.” Did this happen for different questions? The same questions across models? Also, “refused”? Why? And did that remove a question from the entire analysis, or was it just skipped in the reporting for that particular model?
  • Figure 2 reports on “relative token usage”, but does not elaborate on what it is relative to. Presumably relative to whichever question had the highest token usage, and then only within each model, since otherwise (h) would be much higher? These are assumptions I can only make after reading the entire paper though, which indicates that the provided explanation is insufficient (for this reader :) ).
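
To make the distinction in that first bullet concrete, here is roughly what the two knobs look like if you call O3-mini yourself. This is a minimal sketch against the OpenAI Chat Completions API as I know it: reasoning_effort is a request parameter there, and I am assuming the paper’s max_reasoning_tokens corresponds to the max_completion_tokens cap (which covers reasoning plus the visible answer). The values mirror the setup described in the paper’s appendix; none of this is the paper’s own code.

```python
from openai import OpenAI  # the official openai Python package, v1.x style

client = OpenAI()  # picks up OPENAI_API_KEY from the environment


def ask(question: str, effort: str, token_budget: int):
    """One call per experimental condition. `effort` is the reasoning_effort
    knob ("medium" vs "high"); `token_budget` caps reasoning plus visible
    output, which is how I read the paper's max_reasoning_tokens setting."""
    response = client.chat.completions.create(
        model="o3-mini",
        reasoning_effort=effort,             # the "(m)" vs "(h)" distinction
        max_completion_tokens=token_budget,  # the limit the paper sets per model
        messages=[{"role": "user", "content": question}],
    )
    answer = response.choices[0].message.content
    reasoning_tokens = response.usage.completion_tokens_details.reasoning_tokens
    return answer, reasoning_tokens


question = "Prove that the square root of 2 is irrational."
# O3-mini (m): default effort with a 25k budget; O3-mini (h): high effort with 100k.
answer_m, tokens_m = ask(question, "medium", 25_000)
answer_h, tokens_h = ask(question, "high", 100_000)
```

The point of the bullet still stands: in the paper these two knobs move together, so you cannot tell from the results which one is buying the extra accuracy.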

Referenced Papers

I only list those that seemed interesting enough to take another look at, based on the description in the text or just the title.