> which you point out LLMs cannot do, would not be an issue in an appropriate RL setup.
Hm? It's pretty trivial to use a sampler for LLMs that does a beam search and will effectively 'backtrack' out of a 'bad' selection.

It just doesn't normally help: by construction, an LLM sampled normally already approximates the correct overall distribution for the entire output, without any search.

I'd assume beam search does help when your sampler has non-trivial constraints (e.g. the output must satisfy some grammar or pass an algebraic test, or even just top-n sampling), since those token-by-token adjustments produce a different approximate distribution than the original distribution filtered by the constraints.
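A minimal sketch of what I mean, with a toy stand-in for the model's next-token log-probs (a real LLM forward pass would go there) and an example constraint. The 'backtracking' is just the beam pruning: a prefix that scored well locally but can't be extended under the constraint gets dropped in favor of a worse-looking prefix that can.

```python
import math

VOCAB = ["a", "b", "<eos>"]

def next_logprobs(prefix):
    # Hypothetical stand-in for an LLM forward pass over a tiny vocab.
    scores = {"a": 0.5, "b": 0.4, "<eos>": 0.1}
    if prefix and prefix[-1] == "a":
        scores = {"a": 0.1, "b": 0.7, "<eos>": 0.2}
    return {t: math.log(p) for t, p in scores.items()}

def satisfies(prefix):
    # Example constraint: no two 'a' tokens in a row.
    return not any(x == "a" == y for x, y in zip(prefix, prefix[1:]))

def beam_search(width=3, max_len=6):
    beams = [([], 0.0)]  # (prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, lp in beams:
            if prefix and prefix[-1] == "<eos>":
                candidates.append((prefix, lp))  # finished beam carries over
                continue
            for tok, tok_lp in next_logprobs(prefix).items():
                new = prefix + [tok]
                if satisfies(new):  # prune constraint-violating extensions
                    candidates.append((new, lp + tok_lp))
        # Keeping only the top-`width` candidates is the effective
        # "backtracking": a locally attractive token whose continuations
        # all violate the constraint simply falls out of the beam.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams

best_prefix, best_lp = beam_search()[0]
```

Without the constraint, `width=1` (greedy) and wide beams converge on the same kind of sample the plain sampler already draws; it's only the `satisfies` filter that makes the search earn its keep.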