
In reinforcement learning, Q* represents the optimal action-value function
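For reference, in standard textbook notation (not from the parent comment), Q* is the fixed point of the Bellman optimality equation:

    Q^*(s, a) = \mathbb{E}\left[\, r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \;\middle|\; s_t = s,\ a_t = a \,\right]

i.e. the expected return from taking action a in state s and then acting optimally afterwards.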



Which makes sense: you can pretty easily imagine the problem of "selecting the next token" as a tree of states, with actions transitioning from one state to another, just like in a game. And you already have naive scores for each of those states (the logits for the tokens).

It's not hard to imagine applying well-known tree-search strategies, like Monte Carlo tree search, minimax, etc. Or, in the case of Q*, maybe training another (smaller) action-value model that guides the progress of the LLM. A minimal sketch of that idea is below.
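Here is a toy sketch of the "small value model guiding decoding" idea (all names and both models are hypothetical stand-ins, nothing from OpenAI): a best-first search over the token tree that scores each candidate by the LLM's log-probability plus a weighted estimate from a separate value model.

    import math
    import heapq

    # Toy stand-ins for the two models (hypothetical, for illustration only):
    # - lm_logprobs(prefix): log-probability of each candidate next token (the "logits")
    # - value_model(prefix): a scalar estimate of how promising the prefix is overall
    VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]

    def lm_logprobs(prefix):
        # Pretend language model: mildly penalize repeating the previous token.
        scores = [(-2.0 if prefix and tok == prefix[-1] else 0.0) for tok in VOCAB]
        log_z = math.log(sum(math.exp(s) for s in scores))
        return {tok: s - log_z for tok, s in zip(VOCAB, scores)}

    def value_model(prefix):
        # Pretend learned value head: likes sequences that end at a reasonable length.
        return 1.0 if len(prefix) >= 4 and prefix[-1] == "<eos>" else -0.1 * len(prefix)

    def guided_search(max_len=6, beam_width=3, value_weight=0.5):
        # Best-first search over the token tree, mixing the immediate log-probs
        # (the naive scores) with the value model's look-ahead estimate.
        frontier = [(0.0, [])]          # heapq is a min-heap, so scores are negated
        best = ([], float("-inf"))
        while frontier:
            neg_score, prefix = heapq.heappop(frontier)
            score = -neg_score
            if prefix and (prefix[-1] == "<eos>" or len(prefix) >= max_len):
                if score > best[1]:
                    best = (prefix, score)
                continue
            logps = lm_logprobs(prefix)
            children = []
            for tok, logp in logps.items():
                child = prefix + [tok]
                children.append((score + logp + value_weight * value_model(child), child))
            # Only keep the top beam_width children of this node.
            for child_score, child in sorted(children, reverse=True)[:beam_width]:
                heapq.heappush(frontier, (-child_score, child))
        return best

    if __name__ == "__main__":
        tokens, score = guided_search()
        print(" ".join(tokens), round(score, 3))

Swapping the best-first loop for MCTS or plain beam search is a matter of how you expand and keep nodes; the key ingredient is the extra value estimate on top of the per-token logits.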


Absolutely. Maximizing conditional probabilities is easily modeled as a Markov decision process, which is why RL works so well for training Transformers (hence RLHF; I've also been experimenting with RL-based training of Transformers for other applications, and it's promising!). Using a Transformer as the policy and having RL choose tokens that maximize the overall sequence likelihood, rather than just the immediate conditional likelihood, is something I imagine many people have experimented with, but I can see it being tricky enough for OpenAI to be the only ones to pull it off.
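To make that MDP framing concrete, here is a toy, self-contained sketch (everything here is a hypothetical stand-in: a tabular policy instead of a Transformer, and a hand-written scorer instead of a reward model): the state is the prefix (reduced to the previous token for brevity), the action is the next token, the reward only arrives at the end of the sequence, and a plain REINFORCE update trains the policy.

    import math
    import random

    VOCAB = ["a", "b", "c"]
    START = "<s>"
    SEQ_LEN = 4
    # Tabular "policy": one logit per (previous token, next token) pair,
    # standing in for a Transformer predicting the next token from the prefix.
    logits = {prev: {tok: 0.0 for tok in VOCAB} for prev in [START] + VOCAB}

    def probs(prev):
        total = sum(math.exp(v) for v in logits[prev].values())
        return {tok: math.exp(v) / total for tok, v in logits[prev].items()}

    def sample_episode():
        seq, prev = [], START
        for _ in range(SEQ_LEN):
            p = probs(prev)
            tok = random.choices(VOCAB, weights=[p[t] for t in VOCAB])[0]
            seq.append(tok)
            prev = tok
        return seq

    def reward(seq):
        # Stand-in for a sequence-level scorer (e.g. a reward model in RLHF):
        # reward alternating "a"/"b" pairs, so single-step greediness isn't enough.
        return sum(1 for x, y in zip(seq, seq[1:]) if {x, y} == {"a", "b"}) / (len(seq) - 1)

    def train(steps=5000, lr=0.2):
        for _ in range(steps):
            seq = sample_episode()
            r = reward(seq)
            # REINFORCE: push up the log-prob of each chosen action, scaled by the return.
            prev = START
            for tok in seq:
                p = probs(prev)
                for t in VOCAB:
                    logits[prev][t] += lr * r * ((1.0 if t == tok else 0.0) - p[t])
                prev = tok

    if __name__ == "__main__":
        train()
        print(sample_episode())

The point of the toy reward is that the sequence-level objective can't be reached by greedily picking the locally most likely token, which is exactly the gap RL-style training is meant to close.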



