Iterative reasoning preference optimization (arxiv.org)
19 points by Jimmc414 16 days ago | 4 comments



Generate chain-of-thought candidates with the LLM, form preference pairs based on whether the final answers are correct, train with DPO plus an NLL term, then repeat the cycle.
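For anyone skimming, here is a minimal PyTorch sketch of how I read the per-pair objective: a standard DPO term plus a length-normalized NLL term on the chosen (correct) response. The function and argument names are mine, not the paper's, and the inputs are assumed to be summed per-sequence log-probs.

    import torch.nn.functional as F

    def iterative_rpo_loss(policy_chosen_logp, policy_rejected_logp,
                           ref_chosen_logp, ref_rejected_logp,
                           chosen_len, beta=0.1, alpha=1.0):
        # DPO term: prefer the correct chain-of-thought over the incorrect one,
        # measured relative to a frozen reference model.
        margin = (policy_chosen_logp - ref_chosen_logp) \
               - (policy_rejected_logp - ref_rejected_logp)
        dpo_loss = -F.logsigmoid(beta * margin)

        # NLL term on the chosen response, normalized by its length,
        # which keeps probability mass on the winning answers.
        nll_loss = -policy_chosen_logp / chosen_len

        return (dpo_loss + alpha * nll_loss).mean()

After each round, the newly trained model generates fresh candidates and the cycle repeats.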

Apparently this takes Llama-2-70B-Chat from 55.6% to 81.6% on the GSM8K benchmark.


Also similar to Orca-Math, but without a teacher model. The Orca-Math authors likewise followed an iterative DPO/KTO scheme, but without the length-normalized NLL loss term.


If we had a magical (fast) oracle for grading responses, have people done search / expert iteration for LLMs?

Specifically for codegen, I am playing with an iterative interpreter that can quickly (re)evaluate a tree of similar responses; a rough sketch of the caching idea is below.
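Not the actual tool, just a hedged sketch under my own assumptions (a Python test harness, a fixed timeout): run each candidate against the same test script in a subprocess and memoize the result by content hash, so identical or unchanged nodes in the tree of responses are never re-executed.

    import hashlib
    import subprocess
    import sys

    _cache = {}  # grading results keyed by candidate+test hash

    def grade(candidate_source, test_source, timeout=5):
        # Identical candidates are looked up instead of re-executed.
        key = hashlib.sha256((candidate_source + "\0" + test_source).encode()).hexdigest()
        if key not in _cache:
            try:
                proc = subprocess.run(
                    [sys.executable, "-c", candidate_source + "\n" + test_source],
                    capture_output=True, timeout=timeout,
                )
                _cache[key] = (proc.returncode == 0)  # oracle signal: tests pass or not
            except subprocess.TimeoutExpired:
                _cache[key] = False
        return _cache[key]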


thank you for your service



