The search algorithm shares a lot in common with our Pluribus poker AI (https://ai.facebook.com/blog/pluribus-first-ai-to-beat-pros-...), but we added "retrospective belief updates" which makes it way more scalable. We also didn't use counterfactual regret minimization (CFR) because in cooperative games you want to be as predictable as possible, whereas CFR helps make you unpredictable in a balanced way (useful in poker).
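To give a rough feel for what a retrospective belief update does (this is only a minimal sketch, not our actual code; `possible_hands` and `blueprint_action` are made-up names): given a shared blueprint policy, you can filter your belief over the partner's hidden information by keeping only the hypotheses under which the blueprint would have produced the action you actually observed.

```python
# Minimal sketch of a retrospective belief update, assuming a shared,
# deterministic blueprint policy both players know. Names are hypothetical
# placeholders, not the real codebase's API.

def update_beliefs(possible_hands, observed_action, blueprint_action, public_state):
    """Keep only the private hands under which the partner's blueprint
    policy would have chosen the action we actually observed."""
    return [
        hand for hand in possible_hands
        if blueprint_action(public_state, hand) == observed_action
    ]
```

Because everyone agrees on the blueprint, this filtering step is exactly why predictability helps in a cooperative game: the more deterministic the policy, the more information each observed action reveals.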
The most surprising takeaway is just how effective search was. People were viewing Hanabi as a reinforcement learning challenge, but we showed that adding even a simple search algorithm can lead to larger gains than any existing deep RL algorithm could achieve. Of course, search and RL are completely compatible, so you can combine them to get the best of both worlds, but I think a lot of researchers underestimated the value of search.
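By "a simple search algorithm" I mean something in the spirit of the sketch below (hedged: `rollout_value` is a hypothetical helper that plays the game out under the blueprint and returns the final score, and the sampling details are simplified): at your turn, sample hidden hands from your current belief, evaluate each legal action by rolling out the blueprint for all players, and pick the action with the best average value.

```python
import random

def search_action(legal_actions, belief_hands, rollout_value, num_samples=100):
    """One-ply Monte Carlo search on top of a blueprint policy (sketch).
    `belief_hands` is the current set of hands consistent with the belief;
    `rollout_value(action, hand)` is a hypothetical helper that plays the
    rest of the game under the blueprint and returns the final score."""
    best_action, best_value = None, float("-inf")
    for action in legal_actions:
        samples = [random.choice(belief_hands) for _ in range(num_samples)]
        value = sum(rollout_value(action, hand) for hand in samples) / num_samples
        if value > best_value:
            best_action, best_value = action, value
    return best_action
```

Even something this simple, layered on top of a good blueprint, already closes a lot of the gap that people were trying to close with ever-larger RL models.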
I just spent three weeks going through your research. Thank you for that work, especially the supplementary materials. I wish I'd known how much the ideas in the Pluribus paper depended on reading the Libratus paper.
I see what you're saying about the real-time search (which took me quite some time to understand). I came up with a way to do it from disk due to memory limitations. It limits the number of search iterations, but so far it doesn't seem to have a large negative impact on quality.
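For anyone curious, the disk trick is roughly this (a minimal sketch under my own assumptions; the filename, shape, and helper are made up, and I'm assuming the precomputed table was saved with np.save): memory-map the big precomputed table instead of loading it into RAM, and accept that each search iteration pays a disk-read cost.

```python
import numpy as np

# Sketch: memory-map a large precomputed blueprint/value table rather than
# loading it into RAM. The filename and layout are illustrative only.
table = np.load("blueprint_table.npy", mmap_mode="r")

def lookup(state_indices):
    # Only the rows actually touched are read from disk via the OS page
    # cache, so memory stays bounded at the cost of slower iterations.
    return np.asarray(table[state_indices])
```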