Seems like they have made progress in combining reinforcement learning and LLMs. Andrej Karpathy mentions it in his new talk (~38 minutes in) [1], and Ilya Sutskever talks about it in a lecture at MIT (~29 minutes in) [2]. It would be a huge breakthrough to find a proper reward function to train LLMs in a reinforcement learning setup, and to train a model to solve math problems in a similar fashion to how AlphaGo used self-play to learn Go.
Q* may also be a nod to the well-known A* search algorithm, with the Q referring to Q-learning, which would further back the reinforcement learning theory. https://en.wikipedia.org/wiki/Q-learning
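For anyone unfamiliar, here's a minimal sketch of the textbook tabular Q-learning update (purely illustrative, not a guess at whatever OpenAI actually built):

    # Textbook tabular Q-learning: epsilon-greedy action selection plus the
    # standard temporal-difference update of Q(s, a).
    import random
    from collections import defaultdict

    alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount, exploration rate
    Q = defaultdict(float)                    # Q[(state, action)] -> estimated return

    def update(state, action, reward, next_state, actions):
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

    def pick_action(state, actions):
        # epsilon-greedy policy over the current Q estimates
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])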
Wonder what a "self-play" equivalent would look like for LLMs, since they have no easy criterion for evaluating how well they are doing, unlike in Go (as mentioned in the videos).
I expect self-consistency might be one useful reward function.
Of course in the real world, for a real intelligent system, reality is the feedback/reward system, but for an LLM limited to its training set, with nothing to ground it, maybe this is the best you can do ...
The idea is essentially that you have to assume (garbage in, garbage out applies, of course) that most of the training data is reasonable, whether in terms of facts or logic, and therefore that anything you can deduce from the training data that is consistent with the majority of it should be held as similarly valid (and vice versa).
Of course this critically hinges on the quality of the training data in the first place. Maybe it would work best with "tiers" of training data tagged with different levels of presumed authority and reasonableness. Let the better data serve as a proxy for ground truth to "police" the lower-quality data.
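Concretely, a self-consistency reward along these lines could be as simple as sampling several answers and rewarding agreement with the majority. A rough sketch, where sample_answers(prompt, n) is a hypothetical hook for drawing n completions from the model:

    # Rough sketch of a majority-vote self-consistency reward (not anyone's actual method).
    from collections import Counter

    def self_consistency_reward(prompt, answer, sample_answers, n=16):
        samples = sample_answers(prompt, n)      # n independent completions for the same prompt
        counts = Counter(samples)
        # Reward the answer by how often independent samples agree with it;
        # an answer matching the majority of samples scores highest.
        return counts[answer] / n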
Maybe I'm off the mark here, but it seems like video footage of real life would be a massively beneficial data set: the model can watch these videos, predict what will happen one second into the future, and then check whether it was correct. And it can do this over millions of hours of footage, yielding billions of data points.
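That objective would basically be next-frame prediction: a toy version of the loss might look like this, where model(frame) is a stand-in for whatever predictor you'd actually use:

    # Toy next-frame prediction loss: predict the frame one step ahead and
    # score the prediction against what actually happened in the footage.
    import numpy as np

    def next_frame_loss(model, frames):
        # frames: array of shape (T, H, W, C); model(frame) is a hypothetical
        # predictor returning a guess at the following frame.
        losses = []
        for t in range(len(frames) - 1):
            predicted = model(frames[t])
            actual = frames[t + 1]
            losses.append(np.mean((predicted - actual) ** 2))  # per-frame MSE
        return float(np.mean(losses))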
It seems plausible you could have the LLM side call upon its knowledge of known problems and answers to quiz the q-learning side.
While this would still rely on a knowledge base in the LLM, I would imagine it could simplify the effort required to train reinforcement learning models, while widening the domains it could apply to.
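A loop like that might look roughly like the following, where llm_generate_qa and the learner interface are hypothetical stand-ins; the point is just that the LLM supplies both the problems and the reward signal:

    # Sketch of the "LLM quizzes the learner" idea: the LLM produces problems it
    # already knows the answers to, and grading against those answers is the reward.
    def quiz_training_loop(llm_generate_qa, learner, steps=1000):
        # llm_generate_qa(): hypothetical hook returning (question, reference_answer)
        # learner.answer(q) / learner.update(q, a, r): hypothetical RL-side interface
        for _ in range(steps):
            question, reference = llm_generate_qa()
            attempt = learner.answer(question)
            reward = 1.0 if attempt.strip() == reference.strip() else 0.0
            learner.update(question, attempt, reward)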
ChatGPT does have some feedback that could be used for evaluation: the thumbs up/down buttons, which probably almost nobody uses, and the positive/negative tone of follow-up messages. People often say "thanks" or "perfect!" in their replies, including very smart people who frequent here.
ChatGPT was trained (in an additional step after supervised training of the base LLM) with reinforcement learning from human feedback (RLHF), where contractors were presented with two LLM outputs for the same prompt and had to decide which one was better. This was a core ingredient of the system's performance.
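The usual way those pairwise judgments become a training signal is a reward model trained with a Bradley-Terry style loss: push the score of the preferred output above the rejected one. A minimal sketch in PyTorch:

    # Pairwise preference loss commonly used to train RLHF reward models:
    # minimize -log(sigmoid(r(chosen) - r(rejected))).
    import torch
    import torch.nn.functional as F

    def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
        # reward_*: scalar reward-model scores for the preferred / dispreferred output
        return -F.logsigmoid(reward_chosen - reward_rejected).mean()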
Well, you could post a vast number of comments on social media and see whether and how others react. It's still humans doing the work, but they wouldn't even know it.
If this was actually done (and this is just wild baseless speculation), this would be a good reason to let Sam go.
[1] https://www.youtube.com/watch?v=zjkBMFhNj_g&t=2282s
[2] https://www.youtube.com/watch?v=9EN_HoEk3KY&t=1705s