It would have been great if, instead of just pointing out the flaws in testing methodologies, the OP had picked out a few publications that do run the tests accurately. That way we could judge how seriously to take some of the claims made about the X25-M's successors, like the X25-E Extreme.
For example, Anandtech seems to be the only publication that benchmarks SandForce-based SSDs (from OCZ, Crucial, ADATA, etc.) correctly, by taking care to defeat the controller's transparent compression and deduplication features. They did that by patching IOMeter to use blocks of random bytes instead of constant bytes.
Anandtech tests both cases, 100% random data and non-random data, to give a range of expected performance on real-world workloads.
However, they only do this for tests run with IOMeter, because most other benchmarking tools are unable to write random data. This brings me back to one of my points: many benchmarking tools are flawed in that they don't take potential deduplication into account.
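A quick way to see why this matters is to check how compressible a tool's test data is. Purely as an illustration (file names and sizes are arbitrary), on any Linux box with coreutils and gzip:

    # Constant bytes, as many benchmarking tools write: compresses to almost
    # nothing, so a SandForce controller hardly touches the flash and the
    # benchmark reports inflated numbers.
    dd if=/dev/zero of=constant.bin bs=1M count=256
    gzip -c constant.bin | wc -c    # a few hundred kB at most

    # Random bytes, as the patched IOMeter writes: essentially incompressible,
    # so the controller has to write all of it and you measure real performance.
    dd if=/dev/urandom of=random.bin bs=1M count=256
    gzip -c random.bin | wc -c      # roughly the full 256 MB

Any benchmark whose test data behaves like the first case will overstate the performance of a compressing/deduplicating controller.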
Great article. I would be interested in more detail on how you specifically benchmark storage. I benchmark a lot of enterprise SAN storage for Oracle database clusters and I've found a tool called ORION that works surprisingly well. It tests small random I/Os, large sequential I/Os, or any combination thereof, and uses the Oracle asynchronous I/O libraries so you can simulate a real database workload.
The nice thing is that I can plug in my workload mix (90% reads, 10% writes), point it at all of my raw devices, and it will test combinations of small random and large sequential I/Os at multiple queue depths, and give me CSV output files showing I/Os per second, MB per second, and latency for every data point.
You can also tell it the size of your storage array's or disks' cache, which it adds to the Linux kernel cache, and it pre-warms that total with data from /dev/random. For example, if I'm testing on a server with 64 GB of RAM and 8 GB of storage cache, it will pre-warm the Linux kernel cache plus 8 GB, so it might pre-warm 60-70 GB of data before each data point to get accurate, repeatable results without the effects of cached reads/writes.
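For reference, an invocation looks roughly like this (the numbers are only an example and the options are from memory, so check orion -help for your version; the raw devices to test are listed in mytest.lun, one per line):

    # 90% reads / 10% writes, 8 GB array cache, sweep small random and
    # large sequential I/Os across a range of queue depths, 60 s per point.
    ./orion -run advanced -testname mytest -num_disks 8 \
            -write 10 -matrix detailed -cache_size 8192 -duration 60

    # Results end up in mytest_iops.csv, mytest_mbps.csv and mytest_lat.csv.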
I use scripts I wrote myself in combination with Linux CLI tools (dd, iostat). They give me 6 pieces of information: sustained sequential read and write throughput (at different LBA offsets), read and write latency (4kB I/O with queue depth of 1), and random read and write IOPS (4kB I/O with queue depth of at least 32, and, if necessary, using random data to characterize the effectiveness of any dedup/compression feature).
That's it. These 6 synthetic results give me more and better information than 9 out of 10 benchmarking articles on the Net.
I usually don't need more complex synthetic benchmarks (e.g. x% read / y% write mixes). After that I simply test the real workload.
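For illustration, the sequential-throughput part boils down to dd with direct I/O at a couple of offsets; roughly something like this (the device name and sizes are placeholders, and the write test destroys data):

    DEV=/dev/sdX    # the SSD under test (placeholder)

    # Sustained sequential read: 1 GB at the start of the drive, then 1 GB
    # at a higher LBA offset (skip is in units of bs, so 40000 is ~40 GB in).
    dd if=$DEV of=/dev/null bs=1M count=1024 iflag=direct
    dd if=$DEV of=/dev/null bs=1M count=1024 iflag=direct skip=40000

    # Sustained sequential write: write pre-generated random data, not zeros,
    # so a compressing controller (SandForce) cannot cheat. DESTROYS DATA.
    dd if=/dev/urandom of=random.bin bs=1M count=1024
    dd if=random.bin of=$DEV bs=1M oflag=direct
    dd if=random.bin of=$DEV bs=1M oflag=direct seek=40000

    # Meanwhile, iostat in another terminal confirms the device-level numbers.
    iostat -mx 1

The latency and random-IOPS measurements need a bit more scripting (single 4kB direct I/Os for latency, many concurrent ones for IOPS), but the principle is the same.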
One could also argue that if it's really that hard to tune your benchmarking setup to get the optimal numbers, then those numbers aren't really relevant to any real-life task...
I really think real-world app testing is the best way to test these things: important things like app launch times, boot times, and the same measurements after the SSD has been aged. Average numbers matter way more than peak numbers, since peak rates may be sustained for only a fraction of the time that other measured rates are.
It seems that for most purposes, a simple benchmark that represents some realistic scenarios is preferable to benchmarks designed to maximize performance numbers.