Some benchmarks are designed to measure absolute performance, to answer questions like "How many servers should we buy if we expect to handle X hits/second?" or "What's the limit of our app stack's scalability, in hits/second?"
But the O.P. here is benchmarking toward a different purpose: comparing relative performance. These tests are designed to answer a different kind of question: "Which of these app stacks performs best, given the same hardware budget for each?" or "How many extra servers do we need to buy if we want to use Apache/mod_php instead of Nginx/php-fpm?"
With relative-performance benchmarks, we have to assume that it's valid to extrapolate from small servers to large, and from one server to many. That is, if Nginx beats Apache by 1000% on a lonely 500MHz Pentium 3 box, what can we predict about Nginx vs. Apache performance on a dozen-strong cluster of quad-core, dual-socket 3.6GHz machines? In more general terms: how well does each application scale up and out?
The answer depends on the application type and software architecture. For example, most modern web servers are multi-process/multi-threaded apps with minimal shared state between workers, and modern web stacks generally push any cross-request state into a separate datastore layer. As a result, modern web apps tend to scale roughly linearly, right up to the performance limits of the datastore layer: until your database becomes the bottleneck, you can expect that 2x web servers == 2x hits/second.
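To make that concrete, here's a minimal sketch of the scaling model I'm describing (the per-server rate and the datastore cap are made-up numbers, purely for illustration):

    # Toy model: a shared-nothing web tier scales linearly until the
    # shared datastore becomes the bottleneck.
    def predicted_throughput(servers, hits_per_server=500, datastore_cap=4000):
        """Predicted cluster hits/second for a given web-server count."""
        return min(servers * hits_per_server, datastore_cap)

    for n in (1, 2, 4, 8, 16):
        print(n, "servers ->", predicted_throughput(n), "hits/sec")
    # 1 to 8 servers scale linearly (500, 1000, 2000, 4000 hits/sec);
    # at 16 servers the datastore cap, not the web tier, sets the limit.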
He's benchmarking relative performance, but we don't know what else was running on the host machine during the benchmark. What if another user on the host was running a very CPU- or network-intensive process during only one of his test runs?
Because we obviously don't all have a datacenter in our basement. And a common mistake people make when benchmarking is running the servers on their own machine and then using that same machine to generate the load against them.
You need multiple (powerful) machines for this. Besides, spinning up machines in the cloud is quite easy, and it lets other people reproduce your test results because they have access to exactly the same environment.
>You need to have multiple (powerful) machines for this
Oh really? For a simple HTTP 'hello world' benchmark, where you're interested in relative numbers rather than absolute ones?
All you need is one old, slow laptop (running the test contenders) and one modern, mighty one (running the test script). The only thing you have to be sure of is that the test script can generate more load than the contenders can handle. And if the old laptop isn't slow enough, you can add some predictable, stable load to whatever is the bottleneck for them (CPU, disks, network) - either with purpose-built tools or with quick & dirty hacks like a 'while(1) {do some math}' one-liner that, run at high system priority, effectively turns your 2-core CPU into a 1-core one.
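For example, here's a minimal sketch of that kind of quick & dirty CPU burner (Unix-only; the nice value is just an illustration, and raising priority usually requires root):

    # Quick & dirty CPU burner: pegs one core with pointless math.
    # Run one copy per core you want to take away from the contenders.
    import os

    try:
        os.nice(-10)  # raise priority so we win CPU from the test servers (needs root)
    except PermissionError:
        pass  # fall back to normal priority if we can't

    x = 0.0
    while True:
        x = (x * 1.0000001 + 1.0) % 1e9  # "do some math"

Then drive the contenders from the fast machine with something like ab -n 100000 -c 100 http://old-laptop:8080/ (hostname and port are placeholders) and compare the relative numbers.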
You could use cluster compute instances, which use hardware virtualisation and (I think) give you the whole machine. With spot pricing, you could run them for about 21¢ an hour.
Unless you're looking for absolute performance numbers, doing it on a VM on someone else's server is probably the most realistic deployment scenario anyway.