> Like quite a few other Win32 APIs (WriteFileGather, for example), it looks tailor-made for database journaling.
Indeed, WriteFileGather and ReadFileScatter are specifically tailored for writing from and reading into the buffer pool. The IO unit is the sequential layout of an extent (8 pages of 8 KB each), but in memory the pages are not sequential, so they have to be 'scattered' on read and 'gathered' on write.
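To make the shape of that concrete, here is a minimal sketch of a scattered read (not SQL Server's actual code; the file name is made up, the system page size is assumed to be 4 KB, and error handling is trimmed):

    /* One sequential 64 KB extent (8 database pages of 8 KB) is read into
       buffers scattered around memory. ReadFileScatter fills one system
       memory page per segment element, so the extent needs 16 elements here. */
    #include <windows.h>
    #include <stdio.h>

    #define SYS_PAGE   4096                      /* assumed system memory page */
    #define EXTENT     (64 * 1024)               /* 8 x 8 KB database pages */
    #define SEGMENTS   (EXTENT / SYS_PAGE)       /* 16 */

    int main(void)
    {
        /* Scatter/gather requires an unbuffered, overlapped file handle. */
        HANDLE file = CreateFileA("extent.dat", GENERIC_READ, FILE_SHARE_READ, NULL,
                                  OPEN_EXISTING,
                                  FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, NULL);
        if (file == INVALID_HANDLE_VALUE) return 1;

        /* One page-aligned buffer per element; the buffers do not have to be
           contiguous, which is the whole point. The array is NULL-terminated. */
        FILE_SEGMENT_ELEMENT segments[SEGMENTS + 1] = {0};
        for (int i = 0; i < SEGMENTS; i++) {
            void *page = VirtualAlloc(NULL, SYS_PAGE, MEM_COMMIT | MEM_RESERVE,
                                      PAGE_READWRITE);
            segments[i].Buffer = PtrToPtr64(page);
        }

        OVERLAPPED ov = {0};                               /* file offset 0 */
        ov.hEvent = CreateEventA(NULL, TRUE, FALSE, NULL);

        /* One request: sequential on disk, scattered in memory. */
        if (!ReadFileScatter(file, segments, EXTENT, NULL, &ov) &&
            GetLastError() != ERROR_IO_PENDING)
            return 1;

        DWORD bytes = 0;
        GetOverlappedResult(file, &ov, &bytes, TRUE);      /* wait for completion */
        printf("scattered %lu bytes into %d pages\n", (unsigned long)bytes, SEGMENTS);

        CloseHandle(ov.hEvent);
        CloseHandle(file);
        return 0;
    }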
You also have to keep in mind that the entire IO stack, Windows and SQL Server, was designed in the days of spinning media, where sequential access was ~80x faster than random access. SSD media has very different behavior, and I'm not sure the typical 'journaling' IO pattern is capable of driving it to the upper bound of physical speed.
As a side note, I was close to some folks who worked on ALOJA (http://hadoop.bsc.es/), and I had a very interesting discussion with them: the default Java/Hadoop configuration provided the best IO performance on Linux out of the box. The same configuration was a disaster on Windows, and basically every parameter had to be 'tuned' to achieve decent performance. This paper has some of their conclusions: https://www.bscmsrc.eu/sites/default/files/bsc-msr_aloja.pdf
SSDs also like nice big sequential writes, although it matters less. It's easier on their wear-leveling firmware and erase blocks. And fewer commands back and forth to the drive reduce latency.
Sure, but the point of the remark is that random IO on spinning rust is absolutely dreadful compared to sequential (on consumer drives a difference of two orders of magnitude or more is common). The difference is significantly smaller on SSDs, and SSDs have a ton of IOPS, so random accesses can happen concurrently, leading to much lower effective latency.
You're absolutely correct. Wear leveling can affect total IOPS and drive longevity though, so it's still something to consider for those eking out the best total performance from the drive.
Windows is fast at some things and slow at some things.
For instance, metadata reads on the filesystem are much slower on Windows than Linux, so it takes much longer to do the moral equivalent of "find /"
Circa 2003 I would say the Apache Web Server ran about 10x faster on Linux than Solaris, but that a Solaris mail server was 10x faster than the Linux server.
Turns out that Linux and Apache grew up together to optimize performance for the forked process model, but fsync()-and-friends performance was much worse than on Solaris at that time, which matters if you want to meet the specifications for reliable delivery.
https://news.ycombinator.com/item?id=11864211 contains an interesting discussion from the author of PyParallel about how doing IO on NT in the style preferred by Linux is slow. This leads people to believe "NT IO is slower than Linux". However, doing IO on NT in the style preferred by NT is faster than Linux doing its preferred thing.
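The "style preferred by NT" in that discussion is overlapped (asynchronous) IO driven by a completion port rather than a blocking read/write loop. A minimal sketch, with a made-up file name and error handling trimmed:

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE file = CreateFileA("test.dat", GENERIC_READ, FILE_SHARE_READ, NULL,
                                  OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
        if (file == INVALID_HANDLE_VALUE) return 1;

        /* Associate the handle with a completion port; completions are
           posted to the port instead of blocking the issuing thread. */
        HANDLE iocp = CreateIoCompletionPort(file, NULL, 0, 0);
        if (iocp == NULL) return 1;

        char buffer[4096];
        OVERLAPPED ov = {0};   /* read from file offset 0 */

        /* Issue the read; it typically returns immediately with
           ERROR_IO_PENDING and completes in the background. */
        if (!ReadFile(file, buffer, sizeof buffer, NULL, &ov) &&
            GetLastError() != ERROR_IO_PENDING)
            return 1;

        /* A worker thread (here: the same thread) dequeues the completion. */
        DWORD bytes = 0;
        ULONG_PTR key = 0;
        OVERLAPPED *done = NULL;
        if (GetQueuedCompletionStatus(iocp, &bytes, &key, &done, INFINITE))
            printf("read %lu bytes asynchronously\n", (unsigned long)bytes);

        CloseHandle(iocp);
        CloseHandle(file);
        return 0;
    }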
That discussion and others have led me to the conclusion that the NT kernel has an excellent design and a subpar implementation (since only Microsoft's team can work on it), whereas Linux has a crappy design and an excellent implementation (being constantly refined and iterated by anyone). Kind of makes you wonder what could be possible if Microsoft would ever open-source it.
Here's [1] the current block I/O code from the kernel, annotated with the authors who last touched each line.
While I don't have time right now to actually go and count the number of distinct people involved, that's a lot of hands that have touched that source file over the years. And this view's only showing the changes that make up the current version of that file-- there are authors not credited there because their code has since been overwritten or edited by someone else.
Often the file system performance is also measured while an on-access virus scanner is active on Windows while measuring/testing the same thing on Linux without on-access scanner.
At that point one is comparing apples and oranges.
Been a while since I used AWS in anger, but EC2 instances were massively (hair-pullingly) variable from one moment to the next. I can't see any detail on either blog post (GNU/Linux or Microsoft Windows) regarding how they catered for this, how many runs they did of their custom benchmark code, and what kind of variances they were seeing in each iteration.
"Jonathan, I highly doubt that. We see similar results when running on physical hardware. I'm posting the results of EC2 instances here to ensure that they can be easily reproduced, but we are we have two identical boxes in the office that sit there are show Windows being much faster in this kind of thing." -- from the comments
As a former Windows/C++/C# dev who has been working on linux for five years now, I have never automatically assumed Windows was slower than linux. The main advantages of linux over windows are not in the performance area, imo, but in any case I think you'd have to average a lot of runs to make sure of getting reasonably meaningful numbers from an ec2 instance.
The Linux version was benchmarked with gettimeofday() while the Windows one used QueryPerformanceCounter. The former has a much coarser resolution (on the order of 10 microseconds), so the benchmarks are not directly comparable.
gettimeofday() is totally inappropriate for benchmarking. Any time the system clock is being adjusted by NTP, for instance, your benchmark timing will be skewed.
They should be using the following API if they're going to use the system time to measure time differences:
clock_gettime(CLOCK_MONOTONIC, &timespec);
It's a really good clue that a timing function is inappropriate for benchmarking when the man page talks about time zones.
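Something along these lines, with a made-up file path and an iteration count chosen to mirror the 64k-write benchmark:

    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <fcntl.h>

    int main(void)
    {
        char page[4096] = {0};
        int fd = open("/tmp/bench.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return 1;

        struct timespec start, end;
        /* CLOCK_MONOTONIC is immune to wall-clock steps (NTP/settimeofday). */
        clock_gettime(CLOCK_MONOTONIC, &start);

        for (int i = 0; i < 65536; i++)
            if (write(fd, page, sizeof page) < 0) return 1;

        clock_gettime(CLOCK_MONOTONIC, &end);

        double ms = (end.tv_sec - start.tv_sec) * 1e3 +
                    (end.tv_nsec - start.tv_nsec) / 1e6;
        printf("total: %.3f ms, per write: %.6f ms\n", ms, ms / 65536);

        close(fd);
        return 0;
    }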
gettimeofday has microsecond precision and has whatever resolution the underlying clock has. On most recent machines, the clock is the TSC and you really do get microseconds out.
An 80% difference in a microbenchmark is not nothing, but it's hardly unusual. In a real application, this kind of tiny difference may well be much less dramatic, especially if you consider that most apps will be tuned to the OS they're designed for and thus pick the "happy path" for that OS.
And that 80% difference is in the buffered case, which is also the least relevant - you can use user-space buffering (which is normal anyhow, on both OSes) to amortize the system call costs.
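Standard C stdio already does this kind of user-space buffering; a tiny sketch (file name, buffer size and record size are made up):

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("out.dat", "wb");
        if (!f) return 1;

        /* Give stdio a 64 KB buffer; the kernel only sees a write()
           roughly every 64 KB instead of one per record. */
        static char buf[64 * 1024];
        setvbuf(f, buf, _IOFBF, sizeof buf);

        char record[100] = {0};
        for (int i = 0; i < 65536; i++)
            fwrite(record, sizeof record, 1, f);   /* no syscall on most iterations */

        fclose(f);   /* final flush */
        return 0;
    }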
Either I am reading this wrong or something is not right here.
Buffered:
windows = 0.006
linux = 0.03
80% Win for windows?
But where do those numbers come from?
The time in ms for linux was 522, the time for windows was 410. That's not an 80% win.
where does the "Write cost" number come from?
In general, for the other numbers, I don't think they are comparing the same things. I don't think it is a coincidence that both systems had write times of about 10s and 20s across the different tests. Where Linux took 20s and Windows took 10s, I'd bet they were comparing different behaviors.
Time is the total running time for 64k writes; write cost is for a single write, so dividing the time by 65536 gives the write cost. This holds for all the listed tests on Linux and Windows except two: the two buffered tests on Linux. I am not sure what went wrong there; assuming the total times are correct, both write cost values are about 3.75 times too high (e.g. 522 ms / 65536 ≈ 0.008 ms per write, not the listed 0.03 ms).
It's well known that Windows performs better in these sorts of situations, probably for the reasons mentioned: Microsoft has its own products that rely on good performance from these code paths.