I think the biases are pretty obvious, but the most serious shortcoming of this benchmark is that their result (30% WER on CV) is not reproducible: it's not clear what they trained their model on, and the model itself is not available, so you just have to take their word for it.
More than "kind of like" -- you nailed it: exactly like. This has been an infamous issue in machine learning for decades; researchers and developers can fall into it quite accidentally if they're not careful.
The thing is that training data is very often hard to come by, because of monetary or other costs, so it's extremely tempting to share some data between training and testing -- yet that's a cardinal sin for the reason you said.
Historically, a number of the best machine learning packages (OCR, speech recognition, and more) have owed their edge more to the quality of their data (including a clean separation of training and test sets) than to the underlying algorithms themselves.
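To make that concrete, here's a minimal sketch (plain Python, with a made-up record format and hypothetical speaker IDs -- not anyone's actual pipeline) of the kind of speaker-disjoint split that guards against accidental leakage in speech data: whole speakers go to either train or test, never both, so the model can't be graded on voices it already heard.

```python
import hashlib

# Hypothetical record format: (audio_path, transcript, speaker_id).
# The point is to split by speaker, not by individual clip, so that no
# speaker's voice appears in both the training and the test set.

def split_by_speaker(records, test_fraction=0.1):
    """Deterministically assign whole speakers to train or test."""
    train, test = [], []
    for audio_path, transcript, speaker_id in records:
        # Hash the speaker ID so the assignment is stable across runs
        # and independent of how many clips each speaker contributed.
        digest = hashlib.sha256(speaker_id.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100
        if bucket < test_fraction * 100:
            test.append((audio_path, transcript, speaker_id))
        else:
            train.append((audio_path, transcript, speaker_id))
    return train, test

# Example usage with made-up records:
records = [
    ("clip_001.wav", "hello world", "speaker_a"),
    ("clip_002.wav", "good morning", "speaker_a"),
    ("clip_003.wav", "open the door", "speaker_b"),
]
train_set, test_set = split_by_speaker(records)
```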
Anyway, it's important to note that developers can and do fall into this trap naively, not only when they're trying to cheat -- they can have good intentions and still get it wrong.
Therefore, discussions of methodology, like the one above, are pretty much always in order, and not inherently some kind of challenge to the honesty and honor of the devs.