Iterative reasoning preference optimization (arxiv.org)
19 points by Jimmc414 16 days ago | 4 comments



Generate chain-of-thought candidates with the LLM, form preference pairs based on whether the final answers are correct, train with DPO plus an NLL term, then repeat the cycle.
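For anyone skimming, here is a minimal PyTorch sketch of how I read the per-pair objective: a standard DPO term plus a length-normalized NLL term on the chosen (correct) response. The function and argument names are mine, not the paper's, and the inputs are assumed to be summed per-sequence log-probs.

    import torch.nn.functional as F

    def iterative_rpo_loss(policy_chosen_logp, policy_rejected_logp,
                           ref_chosen_logp, ref_rejected_logp,
                           chosen_len, beta=0.1, alpha=1.0):
        # DPO term: prefer the correct chain-of-thought over the incorrect one,
        # measured relative to a frozen reference model.
        margin = (policy_chosen_logp - ref_chosen_logp) \
               - (policy_rejected_logp - ref_rejected_logp)
        dpo_loss = -F.logsigmoid(beta * margin)

        # NLL term on the chosen response, normalized by its length,
        # which keeps probability mass on the winning answers.
        nll_loss = -policy_chosen_logp / chosen_len

        return (dpo_loss + alpha * nll_loss).mean()

After each round, the newly trained model generates fresh candidates and the cycle repeats.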

Apparently this takes Llama-2-70B-Chat from 55.6% to 81.6% on the GSM8K benchmark.


Also similar to Orca-Math, but without a teacher model. The Orca-Math authors likewise followed an iterative DPO/KTO scheme, but without the length-normalized NLL loss term.


If we had a magical (fast) oracle for grading responses, have people done search / expert iteration for LLMs?

Specifically for codegen, I am playing with an iterative interpreter that can quickly (re)evaluate a tree of similar responses; a rough sketch of the caching idea is below.
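Not the actual tool, just a hedged sketch under my own assumptions (a Python test harness, a fixed timeout): run each candidate against the same test script in a subprocess and memoize the result by content hash, so identical or unchanged nodes in the tree of responses are never re-executed.

    import hashlib
    import subprocess
    import sys

    _cache = {}  # grading results keyed by candidate+test hash

    def grade(candidate_source, test_source, timeout=5):
        # Identical candidates are looked up instead of re-executed.
        key = hashlib.sha256((candidate_source + "\0" + test_source).encode()).hexdigest()
        if key not in _cache:
            try:
                proc = subprocess.run(
                    [sys.executable, "-c", candidate_source + "\n" + test_source],
                    capture_output=True, timeout=timeout,
                )
                _cache[key] = (proc.returncode == 0)  # oracle signal: tests pass or not
            except subprocess.TimeoutExpired:
                _cache[key] = False
        return _cache[key]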


thank you for your service



