
Seems like they have made progress in combining reinforcement learning and LLMs. Andrej Karpathy mentions it in his new talk (~38 minutes in) [1], and Ilya Sutskever talks about it in a lecture at MIT (~29 minutes in) [2]. It would be a huge breakthrough to find a proper reward function for training LLMs in a reinforcement learning setup, and to train a model to solve math problems much as AlphaGo used self-play to learn Go.

[1] https://www.youtube.com/watch?v=zjkBMFhNj_g&t=2282s

[2] https://www.youtube.com/watch?v=9EN_HoEk3KY&t=1705s




Q* may also be a nod to the well-known A* search algorithm, with the Q referring to Q-learning, which further backs the reinforcement learning theory. https://en.wikipedia.org/wiki/Q-learning
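
For anyone unfamiliar, the tabular Q-learning update itself is pretty simple. A minimal Python sketch (the `env` interface here is made up for illustration, not any particular library):

    import random
    from collections import defaultdict

    def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
        # Q maps (state, action) pairs to an estimated return.
        Q = defaultdict(float)
        for _ in range(episodes):
            state = env.reset()          # hypothetical environment interface
            done = False
            while not done:
                # Epsilon-greedy action selection.
                if random.random() < epsilon:
                    action = random.choice(env.actions(state))
                else:
                    action = max(env.actions(state), key=lambda a: Q[(state, a)])
                next_state, reward, done = env.step(action)
                # Bellman update toward reward + discounted best next value.
                best_next = max((Q[(next_state, a)] for a in env.actions(next_state)), default=0.0)
                Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
                state = next_state
        return Q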


Thanks for the links, very interesting.

I wonder what a "self-play" equivalent would look like for LLMs, since they have no easy criterion for evaluating how well they are doing, unlike in Go (as mentioned in the videos).


I expect self-consistency might be one useful reward function.

Of course in the real world, for a truly intelligent system, reality is the feedback/reward system, but for an LLM limited to its training set, with nothing to ground it, maybe this is the best you can do ...

The idea is essentially that you assume (garbage in, garbage out, of course) that most of the training data is factual and reasonable, in terms of both facts and logic, and therefore that anything you can deduce from the training data that is consistent with the majority of it should be held as similarly valid (and vice versa).

Of course this critically hinges on the quality of the training data in the first place. Maybe it would work best with differently tagged "tiers" of training data carrying different levels of presumed authority and reasonableness. Let the better data be used as a proxy for ground truth to "police" the lower-quality data.
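
A toy version of a self-consistency reward might look like this (purely a sketch; `sample_answer` is a hypothetical function that draws one answer from the model at nonzero temperature):

    from collections import Counter

    def self_consistency_reward(sample_answer, prompt, n_samples=16):
        # Sample several answers and reward agreement with the majority answer.
        answers = [sample_answer(prompt) for _ in range(n_samples)]
        majority, count = Counter(answers).most_common(1)[0]
        # Reward: the fraction of samples that agree with the consensus answer.
        return count / n_samples, majority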


Maybe I'm off the mark here, but it seems like video footage of real life would be a massively beneficial data set: the model could watch these videos, predict what will happen one second into the future, and then check whether it was correct. It could do this over millions of hours of footage and get billions of data points.
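
Roughly, the training signal would be the prediction error against what actually happened. A rough sketch, assuming some hypothetical video model that maps past frames to predicted future frames:

    import torch
    import torch.nn.functional as F

    def next_second_loss(model, clip, fps=30):
        # clip: tensor of frames with shape [T, C, H, W].
        context = clip[:-fps]          # everything up to time t
        target = clip[fps:]            # the same frames shifted one second forward
        prediction = model(context)    # hypothetical video model predicting future frames
        # Prediction error against what actually happened is the training signal.
        return F.mse_loss(prediction, target)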


Yes, that would help, but only to a limited degree if it is just part of the training set.

1) You really need runtime prediction feedback, not just pretraining.

2) You really need feedback on the results of your own (prediction-driven) actions (including speech), not just on passive "what will happen next" observations.


In math specifically, one could easily imagine a reward signal coming from an automated theorem-proving engine.
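
Something along these lines, as a sketch: hand the model's candidate proof to a checker and reward it only if the proof verifies (Lean is used here purely as an example; the exact invocation and file setup are assumptions):

    import subprocess
    import tempfile

    def proof_reward(theorem_statement, candidate_proof):
        # Write the model's candidate proof to a file and ask a proof checker to verify it.
        with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
            f.write(theorem_statement + "\n" + candidate_proof)
            path = f.name
        result = subprocess.run(["lean", path], capture_output=True)
        # Binary reward: 1 if the checker accepts the proof, 0 otherwise.
        return 1.0 if result.returncode == 0 else 0.0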


Yeah. I went into some detail of how it might work here: https://news.ycombinator.com/item?id=38036986


One could generate arbitrarily many math problems for which the solution is known.
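
For example, simple arithmetic problems can be generated programmatically with ground-truth answers, giving an exact-match reward (just a toy sketch):

    import operator
    import random

    OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

    def make_problem(max_operand=1000):
        # Generate a simple arithmetic problem together with its known answer.
        a, b = random.randint(1, max_operand), random.randint(1, max_operand)
        symbol, fn = random.choice(list(OPS.items()))
        return f"What is {a} {symbol} {b}?", str(fn(a, b))

    def exact_match_reward(model_answer, true_answer):
        # Binary reward: does the model's final answer match the known solution?
        return 1.0 if model_answer.strip() == true_answer else 0.0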


It seems plausible you could have the LLM side call upon its knowledge of known problems and answers to quiz the Q-learning side.

While this would still rely on the LLM's knowledge base, I would imagine it could reduce the effort required to train reinforcement learning models, while widening the domains they could apply to.


ChatGPT does have some feedback that can be used for evaluation, in the form of thumbs up/down buttons, which probably nobody uses, and positive/negative responses to its messages. People often say "thanks" or "perfect!" in replies, including very smart people who frequent this site.
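
As a crude sketch of how that implicit signal could be mined (the phrase lists here are made up):

    POSITIVE = ("thanks", "perfect", "great", "that worked")
    NEGATIVE = ("wrong", "that's not", "doesn't work", "still broken")

    def implicit_feedback(followup_message):
        # Heuristic: read the user's next message for sentiment and turn it
        # into a weak +1 / -1 / 0 reward label.
        text = followup_message.lower()
        if any(p in text for p in POSITIVE):
            return 1.0
        if any(n in text for n in NEGATIVE):
            return -1.0
        return 0.0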


ChatGPT was trained (in an additional step after supervised learning of the base LLM) with reinforcement learning from human feedback (RLHF), where contractors were presented with two LLM outputs for the same prompt and had to decide which one was better. This was a core ingredient of the system's performance.
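
For reference, those pairwise comparisons are typically used to fit a reward model with a Bradley-Terry style loss, something like this sketch (the `reward_model` interface is an assumption):

    import torch.nn.functional as F

    def pairwise_preference_loss(reward_model, prompts, chosen, rejected):
        # Bradley-Terry style objective for fitting a reward model from
        # "which of these two outputs is better" comparisons.
        r_chosen = reward_model(prompts, chosen)      # scalar scores, hypothetical interface
        r_rejected = reward_model(prompts, rejected)
        # Maximize the log-probability that the preferred output scores higher.
        return -F.logsigmoid(r_chosen - r_rejected).mean()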


They could also look at use of the regenerate button, which I do use often and which would serve the same purpose.


The veil of ignorance has been pushed back and the frontier of discovery forward


Well, you could post a vast number of comments to social media and see if and how others react to them. It's still humans doing the work, but they would not even know it.

If this were actually done (and this is just wild, baseless speculation), it would be a good reason to let Sam go.


I see a lot of comments on Reddit these days that are very clearly written by language models, so it's probably already happening on a large scale.


Have you got an example you could show? I'm curious.



