> This would easily be solved by using the same distribution as a live USB, sharing testing sets, and compiling things from scratch with predefined options, but nobody seems to want to go to that much effort to get coherent comparisons.

Alternatively, you could have a sort of "shootout CI server" where people upload their compiled binaries as Docker images, and the CI server runs each one against a random subset drawn from a (hidden) fixed pool of test datasets, averaging the results. (Random and hidden so that you can't just overfit to the test set; fixed so that it's still mostly measuring the same thing.)

I think the Netflix Prize sort of worked like this?
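
To make the scoring idea concrete, here's a rough sketch of the "random subset of a fixed hidden pool, then average" part. Everything in it is made up for illustration: the names `pick_subset` and `score_submission`, the dataset names, and the `measure` callback, which stands in for "run the uploaded image on this dataset and return a number" (a real server would launch the container there).

```python
import random
import statistics
from typing import Callable, Sequence


def pick_subset(pool: Sequence[str], k: int, rng: random.Random) -> list[str]:
    """Draw k distinct datasets from the fixed, hidden pool."""
    return rng.sample(list(pool), k)


def score_submission(
    measure: Callable[[str], float],
    pool: Sequence[str],
    k: int,
    rng: random.Random,
) -> float:
    """Average one submission's measurement over a random subset of the pool."""
    subset = pick_subset(pool, k, rng)
    return statistics.mean([measure(name) for name in subset])


if __name__ == "__main__":
    # Hypothetical dataset names; a real server would keep these private.
    hidden_pool = [f"dataset_{i:02d}" for i in range(20)]

    # Stand-in for "run the uploaded image on this dataset and time it".
    def fake_measure(name: str) -> float:
        return 1.0 + int(name[-2:]) / 10

    rng = random.Random()  # unseeded, so each run sees a different subset
    print(score_submission(fake_measure, hidden_pool, k=5, rng=rng))
```

Since the pool stays fixed, scores from different runs should land in the same ballpark; since the subset changes per run, tuning against any one draw doesn't buy you much.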