This was also the issue with RLHF models. The next-token prediction loss is straightforward to minimize, since we know which weights are responsible for a token being right or wrong. Identifying which tokens actually contributed most to a good response for a prompt is not straightforward; that's the credit assignment problem.
For thinking, you might generate 32k thinking tokens followed by 96k solution tokens, and do this many times. Then look at the solutions, rank them by quality, and bias the model towards better thinking by adjusting the weights for the first 32k tokens. But I'm sure o1 is way past this approach.
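A minimal sketch of that loop, not o1's actual method: sample N rollouts, score only the solution portion, then apply a REINFORCE-style update restricted to the thinking tokens' log-probs. Everything here is an assumption for illustration, including the toy model, the token lengths, and the placeholder `solution_quality` reward.

```python
import torch

# Toy stand-in for an LLM (assumption, just to make the sketch runnable;
# a real system would use a transformer conditioned on the full prefix).
vocab, hidden = 100, 16
model = torch.nn.Sequential(torch.nn.Embedding(vocab, hidden),
                            torch.nn.Linear(hidden, vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

THINK_LEN = 8    # stands in for the ~32k thinking tokens
TOTAL_LEN = 24   # thinking + solution tokens
N_SAMPLES = 4    # rollouts per prompt to rank

def sample_rollout():
    """Autoregressively sample thinking + solution tokens, keeping log-probs."""
    tokens, logps = [torch.randint(vocab, (1,))], []
    for _ in range(TOTAL_LEN):
        logits = model(tokens[-1])  # toy: condition on last token only
        dist = torch.distributions.Categorical(logits=logits)
        tok = dist.sample()
        logps.append(dist.log_prob(tok))
        tokens.append(tok)
    return torch.cat(tokens[1:]), torch.stack(logps)

def solution_quality(tokens):
    """Placeholder reward over the solution tokens only. A real system
    would grade the solution (unit tests, a verifier, human ranking)."""
    return (tokens[THINK_LEN:] % 2).float().mean()

# Sample N rollouts, score the solutions, and push the weights toward
# the thinking that preceded the better ones: REINFORCE with a
# mean-reward baseline, applied only to the first THINK_LEN tokens.
rollouts = [sample_rollout() for _ in range(N_SAMPLES)]
rewards = torch.stack([solution_quality(t) for t, _ in rollouts])
baseline = rewards.mean()
loss = -sum((r - baseline) * lp[:THINK_LEN].sum()
            for (_, lp), r in zip(rollouts, rewards))
opt.zero_grad(); loss.backward(); opt.step()
```

The key trick is the `lp[:THINK_LEN]` mask: solution tokens determine the reward but receive no direct gradient, so only the thinking gets reinforced.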