Apparently this takes Llama-2-70B-chat from 55.6% to 81.6% on the GSM8k benchmark
Specifically for codegen, I am playing with an iterative interpreter that can quickly (re)evaluate a tree of similar responses, so sibling completions only pay for the part where they actually differ.
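Roughly what I mean, as a minimal sketch rather than the actual implementation: candidate completions are organized as a tree of statements, and each branch reuses the evaluated namespace of its shared prefix, so only the differing tail gets (re)run. All names here (`Node`, `evaluate`) are made up for illustration, and a real version would need a proper state snapshot instead of a shallow copy.

```python
class Node:
    """One statement in the tree of candidate completions."""
    def __init__(self, stmt=None):
        self.stmt = stmt          # a source statement; None for the root
        self.children = []
        self.env = None           # cached namespace after running this node

    def add(self, stmt):
        child = Node(stmt)
        self.children.append(child)
        return child


def evaluate(node, parent_env=None):
    """Run this node's statement on a copy of the parent's namespace and
    cache it, so siblings never re-execute the shared prefix.
    (Shallow copy only: shared mutable objects would leak across branches.)"""
    env = dict(parent_env) if parent_env is not None else {}
    if node.stmt is not None:
        try:
            exec(node.stmt, env)
        except Exception as err:
            print(f"branch failed: {node.stmt!r} ({err})")
            return
    node.env = env
    for child in node.children:
        evaluate(child, env)


# Two candidate completions that differ only in their final line.
root = Node()
prefix = root.add("xs = [1, 2, 3]")
prefix.add("total = sum(xs)")
prefix.add("total = sum(x * x for x in xs)")
evaluate(root)
for leaf in prefix.children:
    print(leaf.stmt, "->", leaf.env["total"])
```

Running it prints 6 for the first leaf and 14 for the second, while the shared `xs = [1, 2, 3]` line is executed only once.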