I think this is self-conflicting. If the evaluation is proprietary then it is most certainly not reputable. We'd want open metrics whose limitations we can analyze. Of course, we'd need open data too, but that's exceptionally rare these days. Plus, a metric isn't really going to tell us whether we have spoilage or not. You can get some evidence of spoilage from a trained model, but it's less direct and fuzzier, and it tells us more about what information the model was able to memorize than about whether the data was spoiled.
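To make that concrete, here's roughly the kind of indirect memorization probe I mean — a minimal sketch, assuming a Hugging Face causal LM ("gpt2" and the eval_items list are just placeholders). It only hints at possible train/eval overlap; it doesn't prove the data was spoiled:

```python
# Rough memorization probe: feed the model the first half of a benchmark item
# and see how confidently it predicts the rest verbatim. High confidence is
# weak, indirect evidence of contamination, not proof of spoilage.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; swap in the model under suspicion
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

eval_items = [  # hypothetical benchmark strings
    "The quick brown fox jumps over the lazy dog.",
]

for text in eval_items:
    ids = tok(text, return_tensors="pt").input_ids[0]
    split = max(1, len(ids) // 2)
    target = ids[split:]
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]
    # log-prob the model assigns to each target token given everything before it
    logps = torch.log_softmax(logits[:-1], dim=-1)
    target_logps = logps[split - 1:, :].gather(1, target.unsqueeze(1)).squeeze(1)
    mean_lp = target_logps.mean().item()
    print(f"{mean_lp:7.3f}  {text[:60]!r}")
    # An unusually high mean log-prob (near 0) on the held-out half suggests
    # the model may have memorized this item.
```

Note this tells you what the model memorized, not how the eval data got there — which is exactly why I call it fuzzier than just being able to inspect the data and metric directly.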
i don’t buy that premise. in practice we’re seeing a lot of evidence that you can’t trust the open evals because of contamination (maybe accidental, though there’s definitely incentive to cheat and move up the leaderboards).
closed/subjective ranking and evaluation have been around for as long as there have been critics. yes it’s hard to bootstrap trust, but i can’t see a way around it, because the open evals can’t really be trusted either.
I find this argument weird. I'm not saying you can trust the open evals; I'm just saying you can know their limits. With closed evals you're a lot more blind.