> which you point out LLMs cannot do, would not be an issue in an appropriate RL setup.
Hm? It's pretty trivial to use a sampler for LLMs that does a beam search and will effectively 'backtrack' out of a 'bad' selection.

It just doesn't normally help: by construction, an LLM sampled normally already approximates the correct overall distribution for the entire output, without any search.

I'd assume beam search does help when your sampler has non-trivial constraints (e.g. the output must satisfy some grammar or pass an algebraic test, or even just top-n sampling), since those token-by-token adjustments produce a different approximate distribution than the original distribution filtered by the constraints.
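A minimal sketch of what I mean, with a toy stand-in for the model's next-token log-probs (a real LLM forward pass would go there) and an example constraint. The 'backtracking' is just the beam pruning: a prefix that scored well locally but can't be extended under the constraint gets dropped in favor of a worse-looking prefix that can.

```python
import math

VOCAB = ["a", "b", "<eos>"]

def next_logprobs(prefix):
    # Hypothetical stand-in for an LLM forward pass over a tiny vocab.
    scores = {"a": 0.5, "b": 0.4, "<eos>": 0.1}
    if prefix and prefix[-1] == "a":
        scores = {"a": 0.1, "b": 0.7, "<eos>": 0.2}
    return {t: math.log(p) for t, p in scores.items()}

def satisfies(prefix):
    # Example constraint: no two 'a' tokens in a row.
    return not any(x == "a" == y for x, y in zip(prefix, prefix[1:]))

def beam_search(width=3, max_len=6):
    beams = [([], 0.0)]  # (prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, lp in beams:
            if prefix and prefix[-1] == "<eos>":
                candidates.append((prefix, lp))  # finished beam carries over
                continue
            for tok, tok_lp in next_logprobs(prefix).items():
                new = prefix + [tok]
                if satisfies(new):  # prune constraint-violating extensions
                    candidates.append((new, lp + tok_lp))
        # Keeping only the top-`width` candidates is the effective
        # "backtracking": a locally attractive token whose continuations
        # all violate the constraint simply falls out of the beam.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams

best_prefix, best_lp = beam_search()[0]
```

Without the constraint, `width=1` (greedy) and wide beams converge on the same kind of sample the plain sampler already draws; it's only the `satisfies` filter that makes the search earn its keep.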