Arithmetic ability is considered an emergent phenomenon of LLMs [1]: it seems to improve as a model's overall reasoning ability grows. I seem to recall a paper noting that LLMs that are better at numeric reasoning are also better at general conversational reasoning, so the two appear to go hand in hand.
However, we don't know the internals of ChatGPT-4, so they may be using agents to improve performance, or extra fine-tuning at training time. My assumption is that the training itself has been improved.
The results from playing with this are really bizarre (sorry, the formatting is hacked up a bit):
To calculate 7^1.83, you can use a scientific calculator or an exponentiation function in programming or math software. Here is the step-by-step calculation using a scientific calculator:
Input the base: 7
Use the exponentiation function (usually labeled as ^ or x^y).
Input the exponent: 1.83
Compute the result.
Using these steps, you get:
7^1.83 ≈ 57.864
So, 7^1.83 ≈ 57.864
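For what it's worth, the quoted answer is easy to check with any language that has an exponentiation operator:

```python
# Sanity check on ChatGPT's quoted answer: 7^1.83 is nowhere near 57.864.
result = 7 ** 1.83
print(result)  # roughly 35.2, not the 57.864 in the transcript
```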
Given this, and the recent announcement of data analysis features, I'm guessing GPT-4o is wired up to use various tools, one of which is a calculator. Except that, if you ask it, it blatantly lies about how it's using a calculator, and it also sometimes makes up answers (e.g. 57.864, when the true value is about 35.2).
I imagine some trickery in which the LLM has been trained to output math in some format that the front end can pretty-print, but that there’s an intermediate system that tries (and doesn’t always succeed) to recognize things like “expression =” and emits the tokens for the correct value into the response stream. When it works, great — the LLM magically has correct arithmetic in its output! And when it fails, the LLM cheerfully hallucinates.
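That kind of intermediate system could be as simple as a regex pass over the output stream. A purely speculative sketch (the pattern and function names here are my own invention, not anything OpenAI has documented):

```python
import re

# Speculative sketch: scan model output for "base^exp = value" or
# "base^exp ≈ value" patterns and splice in the correctly computed
# value, as the hypothesized intermediate system might.
PAT = re.compile(r"(\d+(?:\.\d+)?)\^(\d+(?:\.\d+)?)\s*(?:=|≈)\s*[\d.]+")

def patch_arithmetic(text: str) -> str:
    def fix(m: re.Match) -> str:
        base, exp = float(m.group(1)), float(m.group(2))
        # Re-emit the expression with the actual computed value.
        return f"{m.group(1)}^{m.group(2)} ≈ {base ** exp:.3f}"
    return PAT.sub(fix, text)

print(patch_arithmetic("So, 7^1.83 ≈ 57.864"))  # hallucinated value replaced
```

When the regex matches, the response stream gets the right number; when it doesn't (an unusual notation, a multi-step expression), the hallucinated value passes through untouched, which would explain the inconsistent behavior described above.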
[1]: https://arxiv.org/pdf/2206.07682