Look, I said I was going to read some more papers again — and I did — but then summer travel hit and general busy times and I never got around to rewriting notes into a blog post. Which is also something I claimed I would start doing. So here we are, you reading me trying to remember what I thought about “The Relationship Between Reasoning and Performance in Large Language Models — O3 (mini) Thinks Harder, Not Longer” by Marthe Ballon, Andres Algaba, and Vincent Ginis. Find it on arXiv.

Why this paper? Because when it was first posted online, I saw the first author share it on LinkedIn. She is not a connection of mine, but she works at my old university and I’m quite sure she is/was a member of my old student group. Either that or I am mixing up people, which, honestly, is quite possible too. Either way, that made me decide to read this one even if it might not be on many people’s radar.

What can I still say after all this time? Well, I did take notes in the PDF while reading, so I am hoping I can stitch my opinion together from those. I am not starting over, so whatever I can get out of those notes is what you will get out of this. (Just picture a massive vertical gap under this paragraph, to make it look like we’re both getting nothing out of this. What a great joke.)

What’s It About?

The paper compares two or three OpenAI models, depending on how you define model: O1-mini, O3-mini with reasoning set to its default (they refer to this as “(m)” for medium), and O3-mini with reasoning set to high (they refer to this as “(h)”).

Comparison is done by running the models on questions from Omni-MATH, a benchmark of “Olympiad level” questions. This is a paper from the math department, after all. The benchmark assigns each question a difficulty from 1 (easy) to 10 (hard). The paper makes this rating coarser, grouping questions into four categories with difficulties [1, 4], (4, 5], (5, 6], and (6, 10]. Answers are judged by… another model (Omni Judge), but that is a pretty common approach.
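
If you want a concrete picture of that bucketing, here is a quick sketch of how I read it. This is my own rendering in Python, not code from the paper.

```python
def difficulty_tier(difficulty: float) -> str:
    """Map an Omni-MATH difficulty rating (1-10) to the paper's four bins,
    as I understand them: [1, 4], (4, 5], (5, 6] and (6, 10]."""
    if difficulty <= 4:
        return "[1, 4]"
    if difficulty <= 5:
        return "(4, 5]"
    if difficulty <= 6:
        return "(5, 6]"
    return "(6, 10]"

print(difficulty_tier(4.5))  # -> "(4, 5]"
```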

O3-mini (h) does better than O3-mini (m), which does clearly better than O1-mini, which does way better than gpt-4o. Oh yeah, we’re pulling in gpt-4o for this part, but it is largely irrelevant to the main point of the paper beyond telling you that O1 and onwards are clearly better.

The paper further looks at the “chain of thought”, aka the rambling (I’m sorry, reasoning) a model does to get to its final answer. Specifically, it considers the number of reasoning tokens used. It finds that O3-mini uses about the same number of tokens as O1-mini while getting better results. The high version gets slightly better results still, but uses a lot of extra tokens (read: computing power) to achieve this.

My General Opinion

The main result boils down to: the newer model has better accuracy for the same token usage. I would hope so? If OpenAI released a new model and claimed it was better, but you got the same accuracy from the same token usage (read: the same money you are paying), then I think there would have been an uproar pretty quickly. A good amount of text in the paper is spent repeating this conclusion.

The secondary result says that higher reasoning effort and a (much higher) maximum token budget lead to better results, but likely not enough of an improvement for most use cases. That’s a fair result, and it is good to have it quantified.

This study is an OK first look at accuracy versus token usage, but I am curious whether the pattern holds. It could be extended with a fresh look at more recent models, both within OpenAI’s lineup and within other model families (Llama, Qwen, …). It also needs to use max_reasoning_tokens as an extra variable, as I will get into below.

Further Remarks / Nitpicking / Questions

  • O3-mini (high) has a higher max_reasoning_tokens limit: 100k for O3-mini (h) vs 25k for O3-mini (m) and O1-mini. Note that this is a different setting from the one they use to define “high reasoning”, which is reasoning_effort (see the sketch after this list). This choice of token limit feels rather consequential, yet it is kind of glossed over. I would have been more curious to see the same max_reasoning_tokens used for all models or, even better, max_reasoning_tokens treated as another variable in the experiment. As it stands, the paper draws conclusions about the (h) model using more tokens while only mentioning this different limit in the appendix. I am left wondering whether any conclusions about the (h) model’s token usage can be taken at face value, and with more questions besides. Is higher reasoning effort enough to get more accuracy? Or is the higher token limit the main contributor?

    This difference in token budget, caused by a setting, bothers me in a few different spots in the paper.

  • I think the paper could have used a bulleted list of its actual conclusions, though I am not sure there are any daring statements we can really draw from it. Right now, various mini takeaways are scattered all over the text, and I find the structure more confusing than it needs to be. I am reminded of my prof telling me to use boxed environments at the end of sections to summarise the takeaways in a sentence or two.
  • The paper finds that accuracy decreases as the chain of thought grows, even when controlling for question difficulty. This happens for all tested models. My gut reaction here is that question difficulty is not some absolute metric. Perhaps the definition needs to go the other way around: if the chain of thought grows, then it was a more difficult question (for that model). The paper does touch on this in its discussion.
  • The paper does point out that the decrease in accuracy is less pronounced in better models. Read: the downward slope is not as steep. From this it concludes that allowing a higher token budget has a better chance of being useful with a stronger model than with a worse one, where the dropoff makes it pointless. This is a nice takeaway.
  • The paper points out that the Discrete Mathematics category uses more tokens while Calculus and Algebra use fewer. It does not delve deeper into why that might be. I had hoped a look at the reasoning output would have given some insights here.
  • “Each reasoning model refused to answer a few questions (flagged as invalid prompts), which were subsequently omitted from analysis.” Did this happen for different questions? The same questions across models? Also, “refused”? Why? And did that remove a question from the entire analysis, or was it just skipped in the reporting for that particular model?
  • Figure 2 reports on “relative token usage”, but does not elaborate on what it is relative to. Presumably relative to whichever question had the highest token usage, and then only within each model, since otherwise (h) would be much higher? These are assumptions I can only make after reading the entire paper though, which indicates that the provided explanation is insufficient (for this reader :) ).
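
To make the distinction in that first bullet concrete, here is roughly what the two knobs look like if you call O3-mini yourself. This is a minimal sketch against the OpenAI Chat Completions API as I know it: reasoning_effort is a request parameter there, and I am assuming the paper’s max_reasoning_tokens corresponds to the max_completion_tokens cap (which covers reasoning plus the visible answer). The values mirror the setup described in the paper’s appendix; none of this is the paper’s own code.

```python
from openai import OpenAI  # the official openai Python package, v1.x style

client = OpenAI()  # picks up OPENAI_API_KEY from the environment


def ask(question: str, effort: str, token_budget: int):
    """One call per experimental condition. `effort` is the reasoning_effort
    knob ("medium" vs "high"); `token_budget` caps reasoning plus visible
    output, which is how I read the paper's max_reasoning_tokens setting."""
    response = client.chat.completions.create(
        model="o3-mini",
        reasoning_effort=effort,             # the "(m)" vs "(h)" distinction
        max_completion_tokens=token_budget,  # the limit the paper sets per model
        messages=[{"role": "user", "content": question}],
    )
    answer = response.choices[0].message.content
    reasoning_tokens = response.usage.completion_tokens_details.reasoning_tokens
    return answer, reasoning_tokens


question = "Prove that the square root of 2 is irrational."
# O3-mini (m): default effort with a 25k budget; O3-mini (h): high effort with 100k.
answer_m, tokens_m = ask(question, "medium", 25_000)
answer_h, tokens_h = ask(question, "high", 100_000)
```

The point of the bullet still stands: in the paper these two knobs move together, so you cannot tell from the results which one is buying the extra accuracy.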

Referenced Papers

I only list those that seemed interesting enough to take another look at, based on the description in the text or just the title.