Looking at o1's behavior, it seems there's a key architectural limitation: while...

skissane · 2025-01-18T21:19:46 1737235186

> This will only improve when o1's context windows grow large enough to maintain all its intermediate thinking steps, we're talking orders of magnitude beyond current limits.

Rather than retaining all those steps, what about just retaining a summary of them? Or put them in a vector DB so on follow-up it can retrieve the subset of them most relevant to the follow-up question?
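A rough sketch of the retrieval idea, to make it concrete: store each intermediate step, then pull back only the ones most similar to the follow-up question. Everything here is hypothetical. `StepStore` is an in-memory stand-in for a vector DB, and `embed` is a toy bag-of-words counter where a real system would presumably use a learned embedding model.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class StepStore:
    """Hypothetical in-memory stand-in for a vector DB of reasoning steps."""

    def __init__(self):
        self.steps = []  # list of (text, vector) pairs

    def add(self, step):
        self.steps.append((step, embed(step)))

    def retrieve(self, query, k=2):
        """Return the k stored steps most similar to the query."""
        q = embed(query)
        ranked = sorted(self.steps, key=lambda s: cosine(q, s[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = StepStore()
store.add("The cache key must include the user id to avoid leaking data")
store.add("Sorting the list first makes the merge step linear")
store.add("The database migration has to run before the deploy")

# On a follow-up, only the most relevant earlier step is put back in context.
relevant = store.retrieve("why does the cache key need the user id?", k=1)
print(relevant)
```

Whether retrieved snippets carry enough of the original reasoning to be useful is exactly the open question; this only shows the mechanics.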

throwup238 · 2025-01-19T03:50:50 1737258650

That’s kind of what (R/C)NNs did before the "Attention Is All You Need" paper introduced the Transformer architecture. One of the breakthroughs that enabled GPT is letting every token attend directly to every other token through self-attention, instead of letting earlier tokens get attenuated in some sort of summarization mechanism.

refulgentis · 2025-01-18T21:14:11 1737234851

Is that relevant here? The post discussed writing a long prompt to get a good answer, not issues like step #2 forgetting what was done in step #1.

inciampati · 2025-01-20T20:51:10 1737406270

https://platform.openai.com/docs/guides/reasoning/advice-on-... This explains the bug: o1 can't see its past thinking. That would seem to limit the expressivity of the chain of thought. Maybe within one step it's a UTM, but with the loss of memory, extra steps are needed to make sure the right information is passed forward. The model is likely to forget key ideas it had but didn't write down in the output. That will tend to make it drift, focusing more on its final statements and less (or not at all) on some of the things that led to them.
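A toy illustration of that failure mode, assuming (per the reading of the linked guide above) that hidden reasoning is dropped between turns and only the visible answer is carried forward. `Turn` and `next_turn_context` are made-up names for illustration, not anything from OpenAI's API.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    user: str
    reasoning: str  # hidden chain of thought produced during the turn
    answer: str     # visible output returned to the user


def next_turn_context(history, keep_reasoning=False):
    """Build the messages the model would see at the start of the next turn."""
    messages = []
    for turn in history:
        messages.append(("user", turn.user))
        if keep_reasoning:
            messages.append(("reasoning", turn.reasoning))
        messages.append(("assistant", turn.answer))
    return messages


history = [
    Turn(
        user="Is 91 prime?",
        reasoning="91 = 7 * 13, so it has nontrivial factors.",
        answer="No.",
    )
]

# Assumed default: the factorization is gone, only "No." survives.
context = next_turn_context(history)
print(context)
```

If the follow-up is "what are its factors?", the model has to redo the work, because the step that found them was never written into the answer.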

inciampati · 2025-01-19T13:11:18 1737292278

Yes it is, because the post discussed this approach precisely because unrolling the actual chain of thought in interactive chat does not work.

And it's doubly relevant because chain of thought lets transformers break out of TC0 complexity and be UTM. This matters because TC0 is pattern matching while UTM is general intelligence. Forgetting what the model thought breaks this and (ironically) probably forces the model back into one-shot pattern matching. https://arxiv.org/abs/2310.07923