It usually only matters when the code being compared scores similarly, but it is important to know whether the difference between two benchmark results is statistically significant[1]. This is even more of a concern when the benchmarked code runs on a machine that is not well isolated and dedicated to running the benchmark, since other activity on the machine adds noise to the timings. I wrote a library[2] a while back to check for this.
[1]: http://en.wikipedia.org/wiki/Statistical_significance
[2]: http://github.com/Pistos/better-benchmark
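
To make the idea concrete, here is a rough sketch of one way to check whether two sets of benchmark timings differ by more than noise. It is a generic permutation test in plain Python, not the method or API of the library linked above. The function name `permutation_p_value`, the parameters, and the timing numbers are all made up for illustration.

```python
import random


def permutation_p_value(a, b, rounds=10_000, seed=0):
    """Two-sided permutation test on the difference of mean timings.

    Returns an estimated p-value: the fraction of random relabelings of
    the pooled timings whose mean difference is at least as large as the
    one actually observed.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    n_a = len(a)
    observed = abs(sum(a) / n_a - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        left, right = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (rounds + 1)


# Hypothetical wall-clock timings in seconds, several runs per variant.
variant_a = [1.02, 0.98, 1.01, 0.99, 1.00, 1.03, 0.97, 1.01]
variant_b = [1.21, 1.19, 1.22, 1.18, 1.20, 1.23, 1.17, 1.21]

p = permutation_p_value(variant_a, variant_b)
print(f"p = {p:.4f}")
print("significant at 0.05" if p < 0.05 else "could easily be noise")
```

The smaller the p-value, the less likely it is that the gap between the two variants came from run-to-run noise alone. On a busy, shared machine the timings spread out more, so it takes more runs per variant before a real difference shows up as significant.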