Yes. I was disappointed to find that they needed a huge labeled dataset of Diplomacy games to train the language model, and that despite this it still generated a lot of nonsense (as usual for language models), which they then had to filter out with 16 other ad-hoc models. It's super cool that they got it to work, but it's nothing like a general method for communicating and collaborating with humans on any task.
Hopefully there will be follow-up work to increase the generality by reducing the amount of labeled data and task-specific tweaking required, similar to the progression of AlphaGo->AlphaGo Zero->AlphaZero->MuZero.