Seems like they have made progress in combining reinforcement learning and LLMs. Andrej Karpathy mentions it in his new talk (~38 minutes in) [1], and Ilya Sutskever talks about it in a lecture at MIT (~29 minutes in) [2]. It would be a huge breakthrough to find a proper reward function to train LLMs in a reinforcement learning setup, and to train a model to solve math problems in a similar fashion to how AlphaGo used self-play to learn Go.
Q* may also be a nod to the well-known A* search algorithm, with the Q referring to Q-learning, which would further back the reinforcement learning theory. https://en.wikipedia.org/wiki/Q-learning
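For anyone unfamiliar, here's a minimal sketch of the textbook tabular Q-learning update (purely illustrative, not a guess at whatever OpenAI actually built):

    # Textbook tabular Q-learning: epsilon-greedy action selection plus the
    # standard temporal-difference update of Q(s, a).
    import random
    from collections import defaultdict

    alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount, exploration rate
    Q = defaultdict(float)                    # Q[(state, action)] -> estimated return

    def update(state, action, reward, next_state, actions):
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

    def pick_action(state, actions):
        # epsilon-greedy policy over the current Q estimates
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])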
Wonder what a "self-play" equivalent would look like for LLMs, since they have no easy criterion for evaluating how well they are doing, unlike in Go (as mentioned in the videos).
I expect self-consistency might be one useful reward function.
Of course in the real world, for a real intelligent system, reality is the feedback/reward system, but for an LLM limited to its training set, with nothing to ground it, maybe this is the best you can do ...
The idea is essentially that you have to assume (garbage in, garbage out applies, of course) that most of the training data is reasonable, whether in terms of facts or logic, and therefore that anything you can deduce from the training data that is consistent with the majority of it should be held as similarly valid (and vice versa).
Of course this critically hinges on the quality of the training data in the first place. Maybe it would work best with "tiers" of training data tagged with different levels of presumed authority and reasonableness. Let the better data serve as a proxy for ground truth to "police" the lower-quality data.
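Concretely, a self-consistency reward along these lines could be as simple as sampling several answers and rewarding agreement with the majority. A rough sketch, where sample_answers(prompt, n) is a hypothetical hook for drawing n completions from the model:

    # Rough sketch of a majority-vote self-consistency reward (not anyone's actual method).
    from collections import Counter

    def self_consistency_reward(prompt, answer, sample_answers, n=16):
        samples = sample_answers(prompt, n)      # n independent completions for the same prompt
        counts = Counter(samples)
        # Reward the answer by how often independent samples agree with it;
        # an answer matching the majority of samples scores highest.
        return counts[answer] / n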
Maybe I'm off the mark here, but it seems like video footage of real life would be a massively beneficial data set: the model can watch these videos, predict what will happen one second into the future, and then check whether it was correct. And it can do this over millions of hours of footage, yielding billions of data points.
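That objective would basically be next-frame prediction: a toy version of the loss might look like this, where model(frame) is a stand-in for whatever predictor you'd actually use:

    # Toy next-frame prediction loss: predict the frame one step ahead and
    # score the prediction against what actually happened in the footage.
    import numpy as np

    def next_frame_loss(model, frames):
        # frames: array of shape (T, H, W, C); model(frame) is a hypothetical
        # predictor returning a guess at the following frame.
        losses = []
        for t in range(len(frames) - 1):
            predicted = model(frames[t])
            actual = frames[t + 1]
            losses.append(np.mean((predicted - actual) ** 2))  # per-frame MSE
        return float(np.mean(losses))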
It seems plausible you could have the LLM side call upon its knowledge of known problems and answers to quiz the q-learning side.
While this would still rely on a knowledge base in the LLM, I would imagine it could simplify the effort required to train reinforcement learning models, while widening the domains it could apply to.
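A loop like that might look roughly like the following, where llm_generate_qa and the learner interface are hypothetical stand-ins; the point is just that the LLM supplies both the problems and the reward signal:

    # Sketch of the "LLM quizzes the learner" idea: the LLM produces problems it
    # already knows the answers to, and grading against those answers is the reward.
    def quiz_training_loop(llm_generate_qa, learner, steps=1000):
        # llm_generate_qa(): hypothetical hook returning (question, reference_answer)
        # learner.answer(q) / learner.update(q, a, r): hypothetical RL-side interface
        for _ in range(steps):
            question, reference = llm_generate_qa()
            attempt = learner.answer(question)
            reward = 1.0 if attempt.strip() == reference.strip() else 0.0
            learner.update(question, attempt, reward)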
ChatGPT does have some feedback that could be used for evaluation: the thumbs up/down buttons, which probably almost nobody uses, and the positive/negative tone of follow-up messages. People often say "thanks" or "perfect!" in their replies, including very smart people who frequent here.
ChatGPT was trained (in an additional step after supervised training of the base LLM) with reinforcement learning from human feedback (RLHF), where contractors were presented with two LLM outputs for the same prompt and had to decide which one was better. This was a core ingredient of the system's performance.
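The usual way those pairwise judgments become a training signal is a reward model trained with a Bradley-Terry style loss: push the score of the preferred output above the rejected one. A minimal sketch in PyTorch:

    # Pairwise preference loss commonly used to train RLHF reward models:
    # minimize -log(sigmoid(r(chosen) - r(rejected))).
    import torch
    import torch.nn.functional as F

    def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
        # reward_*: scalar reward-model scores for the preferred / dispreferred output
        return -F.logsigmoid(reward_chosen - reward_rejected).mean()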
Well, you could post a vast number of comments on social media and see whether and how others react. It's still humans doing the work, but they wouldn't even know it.
If this was actually done (and this is just wild baseless speculation), this would be a good reason to let Sam go.
[1] https://www.youtube.com/watch?v=zjkBMFhNj_g&t=2282s
[2] https://www.youtube.com/watch?v=9EN_HoEk3KY&t=1705s