I am confused about how the MCTSr algorithm actually works. It is not clear how it is better than simply mutating potential answers (via an LLM) and sorting them by LLM self-eval.
I have a hard time understanding how many LLM evals MCTSr actually does. How the rollout limit is implemented is not described at all. It doesn't seem like it can mean the same thing as in normal MCTS, because there is no definitive "end" to the tree search. Additionally, MCTS is normally limited by the number of nodes expanded.
Aside from theoretical concerns, it is not clear why they have not included a > 8 rollout version in the tables or used an LLM stronger than Llama 3 8B. If the concept scales well, it should be able to beat GPT-4 and friends by a rather large margin.
Obviously marrying search with LLMs is a ripe area for research, but I find it hard to actually take anything away from this paper.

EDIT: Added a missing greater than sign
Two ways MCTS (or their MCTSr) differs from what you describe: First, treating a single self-eval as an authoritative score, rather than as one sample from a distribution, increases the likelihood that one low-quality (i.e. inaccurate) evaluation throws off the entire search process. Second, with MCTS a low-quality action (in MCTSr, a solution) whose weakness only becomes apparent several refinement rounds later doesn't throw off the entire search either, since the tree search can go back toward the root and experiment with a different earlier branch.
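To make the first point concrete, here is a minimal Python sketch of scoring a candidate by sampling the self-eval several times instead of trusting a single number. This is my own illustration, not anything from the paper, and `llm_score` is a hypothetical function returning a numeric grade.

```python
import statistics

def sampled_score(llm_score, answer, samples=3):
    # Treat the self-eval as a noisy sample rather than an oracle:
    # ask several times and aggregate, so one inaccurate evaluation
    # only shifts the estimate instead of deciding the branch outright.
    scores = [llm_score(answer) for _ in range(samples)]
    return statistics.mean(scores)
```

MCTS then compounds this during backpropagation by averaging rewards over visits, which is what makes a single bad evaluation recoverable.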
A single rollout is the full process described in section 3: "The algorithm iterates through these stages until a termination condition T is met, including rollout constraints or maximum exploration depth".
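If that reading is right, a rollout would look roughly like the Python sketch below. This is my reconstruction of the section 3 loop, not code from the paper; `llm_refine` and `llm_score` are hypothetical stand-ins for the self-refine and self-evaluation prompts, and the selection details are guesses.

```python
import math

class Node:
    def __init__(self, answer, parent=None):
        self.answer = answer          # a candidate solution (text)
        self.parent = parent
        self.children = []
        self.visits = 0
        self.reward_sum = 0.0

def uct(node, c=1.4):
    # Exploit high average reward, with an exploration bonus that keeps
    # sparsely visited earlier branches selectable.
    if node.visits == 0:
        return float("inf")
    total = node.parent.visits if node.parent is not None else node.visits
    return (node.reward_sum / node.visits
            + c * math.sqrt(math.log(total + 1) / node.visits))

def all_nodes(node):
    yield node
    for child in node.children:
        yield from all_nodes(child)

def one_rollout(root, llm_refine, llm_score):
    # One "rollout" under this reading: select the most promising node
    # anywhere in the tree, self-refine it into one new child,
    # self-evaluate that child, and backpropagate the reward. There is
    # no random simulation to a terminal state, unlike classic MCTS.
    node = max(all_nodes(root), key=uct)
    child = Node(llm_refine(node.answer), parent=node)  # 1 refine call
    node.children.append(child)
    reward = llm_score(child.answer)                    # >= 1 self-eval calls
    while child is not None:                            # backpropagate to the root
        child.visits += 1
        child.reward_sum += reward
        child = child.parent
```

If that is what's happening, an 8-rollout run adds one node per rollout and costs roughly 8 refine calls plus 8x however many self-eval samples are taken, on top of the initial answer.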
I can't say why they didn't try using a bigger model than Llama 3 8B, or more rollouts.
> They did include an 8-rollout version in the tables? I can't say why they didn't try using a bigger model than Llama 3 8B.
That was a typo. I meant a > 8 rollout version. It doesn't seem like they have hit massively diminishing returns yet.
> A single rollout is the full process described in section 3: "The algorithm iterates through these stages until a termination condition T is met, including rollout constraints or maximum exploration depth".
A rollout is not the entire process. The paper's summary of normal MCTS correctly identifies rollouts as "random simulations by selecting moves arbitrarily until a game’s conclusion is reached, thereby evaluating the node’s potential". Nothing resembling an actual rollout is ever described.
Typically MCTS is limited by the number of nodes expanded, which is likely what they mean, but because they correctly described rollouts, it seems like I am missing something. They also mention AlphaGo, which replaces rollouts with a neural evaluation; maybe that is relevant.
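For contrast, here is roughly what the two kinds of node evaluation look like (hand-wavy sketch; `random_policy`, `is_terminal`, `outcome`, and `value_net` are made-up stand-ins):

```python
def classic_rollout(state, random_policy, is_terminal, outcome):
    # Classic MCTS: play arbitrary moves until the game ends and use
    # the final result as the value estimate for the expanded node.
    while not is_terminal(state):
        state = random_policy(state)
    return outcome(state)

def alphago_style_value(state, value_net):
    # AlphaGo-style: no simulation at all; a learned value function
    # scores the position directly. MCTSr's LLM self-evaluation seems
    # to fill this slot, which would explain why nothing resembling a
    # literal rollout is ever described.
    return value_net(state)
```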
If rollouts means nodes expanded, 4 and 8 are both just really low numbers.
Has anyone implemented this successfully? I'm 90% of the way there but I'm not sure I fully understand the node expansion process. Results are so-so. Improved for sure, but not solving my toy math problem...