I'll give an argument against this, with the caveat that it applies only if these are pure LLMs without heuristics or helper models (I do not believe that to be the case with o1).
The problem with auditing is not only that the outputs can be incorrect, but that the "inputs" of the chained steps have no fundamental logical connection to the outputs. A statistical connection, yes, but not a causal one.
For the trail to be auditable, processing would have to take place at the symbolic level of what the tokens represent in the steps. But that is not what happens. The transformer(s) (because these systems now sample from multiple models) are finding the most likely set of tokens that reinforces a training objective, which here is a completed set of training chains. The system is fundamentally operating below the symbolic or semantic level of the text.
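To make that concrete, here is a toy Python sketch of what that kind of objective looks like (my own illustration; `prob_of` is a hypothetical stand-in for the model, and real training uses batched cross-entropy over a large vocabulary). The point is that the score only rewards assigning high probability to whatever token actually came next; nothing in it checks whether that token is logically entailed by the preceding text:

```python
import math

def sequence_loss(token_ids, prob_of):
    """Toy next-token objective.

    prob_of(prefix) is assumed to return a dict mapping each candidate
    next token to the probability the model assigns it.
    """
    loss = 0.0
    for t in range(1, len(token_ids)):
        p_next = prob_of(token_ids[:t]).get(token_ids[t], 1e-12)
        loss += -math.log(p_next)  # credit only for "that token was likely next"
    return loss
```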
This is why anthropomorphizing these models is so dangerous. The model isn't actually "explaining" its work. The CoT is essentially one large output, broken into parts. The RL training objective does two useful things: (1) it breaks the problem into much smaller parts, which drops the error significantly, since error scales exponentially with token length, and (2) it provides better coverage of training data for common subproblems. Both of those are valuable. Obviously, in many cases the reasons actually match the output. But hallucinations can happen anywhere in the chain, in ways that are essentially nondeterministic.
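On point (1), a back-of-the-envelope sketch (my own numbers, purely illustrative): if each generated token is independently correct with some probability p, the chance of a fully error-free output decays exponentially with length, which is why shorter chained steps help so much:

```python
p = 0.999  # assumed per-token reliability, purely illustrative
for n in (100, 1000, 10000):
    # probability that an n-token output contains no error anywhere
    print(f"{n:6d} tokens: P(no error) ~= {p**n:.3f}")
# 100 tokens: ~0.905, 1000 tokens: ~0.368, 10000 tokens: ~0.000
```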
An intermediate step can produce a bad token and blithely ignore it while still arriving at a correct answer. If you look at intermediate training of addition in pure LLMs, you'll get lots of results that look something like:
> "Add 123 + 456 and show your work"
> "First we add 6 + 3 in the single digits which is 9. Moving on we have 5 + 2 which is 8 in the tens place. And in the hundreds place, we have 5. This equals 579."
The above is very hand-wavy. I do not know if the actual prompts look like that. But there's an error in the intermediate step (5 + 2 = 8) that does not actually matter to the output. Lots of "emergent" properties of LLMs—arguably all of them—go away when partial credit is given for some of the tokens. And this scales predictably without a cliff [1]. This is also what you would expect if LLMs were "just" token predictors.
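For what it's worth, here is a minimal sketch of the scoring difference being described, using the addition example above. The actual metrics in that kind of analysis are defined over model tokens and log-likelihoods rather than raw characters; this toy version just shows why the choice of metric changes the picture:

```python
def exact_match(pred: str, target: str) -> float:
    # the all-or-nothing metric behind most "emergence" plots
    return 1.0 if pred == target else 0.0

def partial_credit(pred: str, target: str) -> float:
    # fraction of positions that agree: crude per-token credit
    hits = sum(a == b for a, b in zip(pred, target))
    return hits / max(len(pred), len(target), 1)

target = "579"   # 123 + 456
pred   = "589"   # a single wrong digit in the middle

print(exact_match(pred, target))     # 0.0   -> looks like total failure
print(partial_credit(pred, target))  # 0.667 -> most of the work was right
```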
But if LLMs are really just token predictors, then we should not expect intermediate results to matter in a way that deterministically changes the output. It isn't just that the CoT can chaotically change future tokens; earlier tokens can "hallucinate" inside an otherwise valid output statement.
I believe that is the case. Out of curiosity, I had this model try to solve a very simple Sudoku puzzle in ChatGPT, and it failed spectacularly.
It goes on and on making reasoning mistakes, and always ends up claiming that the puzzle is unsolvable and apologizing. I didn’t expect it to solve the puzzle, but the whole reasoning process seems fraught with errors.