
In reinforcement learning, Q* represents the optimal action-value function
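For reference, in standard textbook notation (not from the parent comment), Q* is the fixed point of the Bellman optimality equation:

    Q^*(s, a) = \mathbb{E}\left[\, r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \;\middle|\; s_t = s,\ a_t = a \,\right]

i.e. the expected return from taking action a in state s and then acting optimally afterwards.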



Which makes sense: you can pretty easily imagine the problem of "selecting the next token" as a tree of states, with actions transitioning from one state to another, just like in a game. And you already have naive scores for each of those states (the logits for the tokens).

It's not hard to imagine applying well-known tree-search strategies, like Monte Carlo tree search, minimax, etc. Or, in the case of Q*, maybe training another (smaller) action-value model that guides the progress of the LLM. A minimal sketch of that idea is below.
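Here is a toy sketch of the "small value model guiding decoding" idea (all names and both models are hypothetical stand-ins, nothing from OpenAI): a best-first search over the token tree that scores each candidate by the LLM's log-probability plus a weighted estimate from a separate value model.

    import math
    import heapq

    # Toy stand-ins for the two models (hypothetical, for illustration only):
    # - lm_logprobs(prefix): log-probability of each candidate next token (the "logits")
    # - value_model(prefix): a scalar estimate of how promising the prefix is overall
    VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]

    def lm_logprobs(prefix):
        # Pretend language model: mildly penalize repeating the previous token.
        scores = [(-2.0 if prefix and tok == prefix[-1] else 0.0) for tok in VOCAB]
        log_z = math.log(sum(math.exp(s) for s in scores))
        return {tok: s - log_z for tok, s in zip(VOCAB, scores)}

    def value_model(prefix):
        # Pretend learned value head: likes sequences that end at a reasonable length.
        return 1.0 if len(prefix) >= 4 and prefix[-1] == "<eos>" else -0.1 * len(prefix)

    def guided_search(max_len=6, beam_width=3, value_weight=0.5):
        # Best-first search over the token tree, mixing the immediate log-probs
        # (the naive scores) with the value model's look-ahead estimate.
        frontier = [(0.0, [])]          # heapq is a min-heap, so scores are negated
        best = ([], float("-inf"))
        while frontier:
            neg_score, prefix = heapq.heappop(frontier)
            score = -neg_score
            if prefix and (prefix[-1] == "<eos>" or len(prefix) >= max_len):
                if score > best[1]:
                    best = (prefix, score)
                continue
            logps = lm_logprobs(prefix)
            children = []
            for tok, logp in logps.items():
                child = prefix + [tok]
                children.append((score + logp + value_weight * value_model(child), child))
            # Only keep the top beam_width children of this node.
            for child_score, child in sorted(children, reverse=True)[:beam_width]:
                heapq.heappush(frontier, (-child_score, child))
        return best

    if __name__ == "__main__":
        tokens, score = guided_search()
        print(" ".join(tokens), round(score, 3))

Swapping the best-first loop for MCTS or plain beam search is a matter of how you expand and keep nodes; the key ingredient is the extra value estimate on top of the per-token logits.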


Absolutely. Maximizing conditional probabilities is easily modeled as a Markov decision process, which is why RL works so well for training Transformers (hence RLHF; I've also been experimenting with RL-based training of Transformers for other applications, and it's promising!). Using a Transformer as the policy and having RL choose tokens that maximize the overall sequence likelihood, rather than just the immediate conditional likelihood, is something I imagine many people have experimented with, but I can see it being tricky enough for OpenAI to be the only ones to pull it off.
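To make that MDP framing concrete, here is a toy, self-contained sketch (everything here is a hypothetical stand-in: a tabular policy instead of a Transformer, and a hand-written scorer instead of a reward model): the state is the prefix (reduced to the previous token for brevity), the action is the next token, the reward only arrives at the end of the sequence, and a plain REINFORCE update trains the policy.

    import math
    import random

    VOCAB = ["a", "b", "c"]
    START = "<s>"
    SEQ_LEN = 4
    # Tabular "policy": one logit per (previous token, next token) pair,
    # standing in for a Transformer predicting the next token from the prefix.
    logits = {prev: {tok: 0.0 for tok in VOCAB} for prev in [START] + VOCAB}

    def probs(prev):
        total = sum(math.exp(v) for v in logits[prev].values())
        return {tok: math.exp(v) / total for tok, v in logits[prev].items()}

    def sample_episode():
        seq, prev = [], START
        for _ in range(SEQ_LEN):
            p = probs(prev)
            tok = random.choices(VOCAB, weights=[p[t] for t in VOCAB])[0]
            seq.append(tok)
            prev = tok
        return seq

    def reward(seq):
        # Stand-in for a sequence-level scorer (e.g. a reward model in RLHF):
        # reward alternating "a"/"b" pairs, so single-step greediness isn't enough.
        return sum(1 for x, y in zip(seq, seq[1:]) if {x, y} == {"a", "b"}) / (len(seq) - 1)

    def train(steps=5000, lr=0.2):
        for _ in range(steps):
            seq = sample_episode()
            r = reward(seq)
            # REINFORCE: push up the log-prob of each chosen action, scaled by the return.
            prev = START
            for tok in seq:
                p = probs(prev)
                for t in VOCAB:
                    logits[prev][t] += lr * r * ((1.0 if t == tok else 0.0) - p[t])
                prev = tok

    if __name__ == "__main__":
        train()
        print(sample_episode())

The point of the toy reward is that the sequence-level objective can't be reached by greedily picking the locally most likely token, which is exactly the gap RL-style training is meant to close.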



