Then increase N (N is almost always increased when a model is scaled up) and train or write things down and continue.
A limitless iteration machine (without external aid) is currently an idea of fiction. Brains can't do it so I'm not particularly worried if machines can't either.
Increasing the number of layers isn't a smart way to solve it. In order to reason effectively and efficiently, the model needs to use as much, or as little, compute as needed for a given task. Completing "1+1=" should take fewer compute steps than "A winning sequence for white here is ...".
This lack of "variable compute" is a widely recognized shortcoming of transformer-based LLMs, and there are plenty of others. The point apropos this thread is that you can't just train an LLM to be something that it is not. If the generating process required variable compute (maybe 1000's of steps) - e.g. to come up with a chess move - then no amount of training can make the LLM converge to model this generative process... the best it can do is to model the outcome of the generative process, not the process itself. The difference is that without having learnt the generative process, the model will fail when presented with a novel input that it didn't see during training, and therefore didn't memorize the "cheat sheet" answer for.
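The fixed-versus-variable compute contrast can be sketched in a few lines of toy Python (illustrative only; all names are invented). A transformer-style forward pass does the same amount of work for every input, while an iterative process does input-dependent work:

```python
# Toy contrast: fixed-compute forward pass vs. variable-compute process.

def transformer_forward(tokens, n_layers=12):
    """Fixed compute: always exactly n_layers steps, whatever the input."""
    steps = 0
    for _ in range(n_layers):      # depth is frozen at training time
        tokens = list(tokens)      # stand-in for one layer's work
        steps += 1
    return tokens, steps

def iterative_solve(n):
    """Variable compute: step count depends on the input."""
    steps = 0
    while n > 1:
        n //= 2
        steps += 1
    return steps

_, easy = transformer_forward("1+1=")
_, hard = transformer_forward("A winning sequence for white here is")
assert easy == hard == 12          # same work regardless of difficulty
assert iterative_solve(2) < iterative_solve(1024)  # input-dependent work
```

The point of the toy is just that the depth of the first function is a constant picked before any input is seen, while the second loops until the task is actually done.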
>Increasing number of layers isn't a smart way to solve it.
The "smart way" is a luxury. Solving the problem is what matters. Think of a smart way later if you can. That's how a lot of technological advancement has worked.
>In order to reason effectively and efficiently, the model needs to use as much, or as little, compute as needed for a given task. Completing "1+1=" should take fewer compute steps than "A winning sequence for white here is ...".
Same thing. Efficiency is nice but a secondary concern.
>If the generating process required variable compute (maybe 1000's of steps) - e.g. to come up with a chess move - then no amount of training can make the LLM converge to model this generative process.
Every inference problem itself has a fixed number of compute steps it needs (yes, even your chess move). Variability between inferences (maybe move 1 required 500 steps but move 2 only 240, etc.) is a nice thing, but never a necessary thing.
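That claim can be made concrete with a toy sketch (hypothetical example, assumed step bound): any loop whose step count is bounded can be padded with no-ops to a fixed budget without changing its answer.

```python
def solve_dynamic(n):
    """Stops as soon as the answer is reached: input-dependent steps."""
    while n > 1:
        n //= 2
    return n

def solve_fixed(n, budget=64):
    """Runs exactly `budget` steps, padding with no-ops once done.
    Gives the same answer as solve_dynamic for any n < 2**budget."""
    for _ in range(budget):
        if n > 1:
            n //= 2
        # else: no-op padding step, compute is "wasted" but harmless
    return n

# Same answers, fixed compute:
for n in (2, 37, 1024):
    assert solve_dynamic(n) == solve_fixed(n) == 1
```

The trade is efficiency, not capability: the fixed-budget version burns unnecessary steps on easy inputs, but it never gets a different answer.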
3.5-turbo-instruct plays chess consistently at 1800 Elo, so clearly the N of the current SOTA is already enough to play non-trivial chess at a level beyond most humans.
There is an N large enough for every GI problem humans care about. Not to sound like a broken record, but once again: limited != trivial.