I'm surprised that neither GNU yes nor GNU cat uses splice(2). Here's what I get. Note that ./rust is the final Rust program from the article modified to print stats and exit after a set byte count[1], ./splice is a simple C program that uses splice to copy the input string from a generated anonymous temporary file to stdout, and ./consume uses splice to move data from stdin to /dev/null. In all tests the default string of "y\n" is used.
splice is Linux-only. Most systems have sendfile(2), including Linux, but I didn't test it. The implementation and semantics of sendfile vary across platforms.
[1] I did this before I wrote ./consume (which takes an optional byte count limit), having assumed that GNU cat was using splice. As cat doesn't have a way to limit the number of bytes read/written, and other tools like head or tail definitely don't use splice, the simplest way to limit the benchmark without introducing a bottleneck was to have the producers themselves print stats and exit. The relevant diff is
+ let mut count = 0u64 as usize;
+ let bytes = (1u64 << 33) as usize;
...
-while locked.write_all(filled).is_ok() {}
+while locked.write_all(filled).is_ok() {
+ count += filled.len();
+ if count >= bytes {
+ break;
+ }
+}
> I'm surprised that neither GNU yes nor GNU cat uses splice(2). ... splice is Linux-only.
The fact that splice is Linux only is most likely why. GNU programs are portable to many different OSes, many of which no longer exist in any real form. They try to not use any 'single OS' specific feature if at all possible.
splice(2) and tee(2) were basically tailor made for cat(1) and tee(1), and for some reason I was under the impression that GNU tee was using splice(2) and/or tee(2). Using these could be trivial--just a few extra lines of code that could fall-through to the existing methods, for a huge speed-up in performance. (Performance matters because the consumers could be CPU bound, and an inefficient cat or tee might be taking away resources that could be used by the consumer.)
Regarding portability, GNU tail uses the Linux-specific inotify(2) to respond faster to writes. Like alot of OSS, coreutils uses the BSD .d_type member extension[1] of struct dirent to avoid unnecessary stat() calls. There are many other more intrusive OS-specific details baked into coreutils, but often it's the nature of the problem--in many situations you're dependent on platform-specific details or features. For the most part, these nitty-gritty platform-specific details are far more intrusive in terms of code complexity than the performance optimizations.
[1] Missing from Solaris, and probably most other SysV derivatives.
1) Because splice(2) requires either the source or sink to be a pipe. The source is a regular file so the sink has to be a pipe.
2) Because the example Rust program(s), emulating yes(1), had no [simple] way to measure throughput except by piping to another program. We can't fairly compare program A that writes directly to /dev/null with program B that writes to a pipe even if program A can measure its throughput.
2a) What jwilk said.
3) For some reason I thought that glibc had optimizations to elide fwrites to /dev/null, and some of my code was using stdio (e.g. for the final trailing bytes less than the pagesize). I could've sworn either glibc or bash did this, but I can't find any mention of it, now. I realize it would be crazy difficult and ugly for glibc to do this (because of dup2, etc), but glibc does alot of crazy things, and in any event I didn't bother checking beforehand.
Mostly it comes down to fairly comparing benchmarks and kernel facilities. Otherwise, yes, those would be classic examples of Useless Use of Cat.
splice is Linux-only. Most systems have sendfile(2), including Linux, but I didn't test it. The implementation and semantics of sendfile vary across platforms.
[1] I did this before I wrote ./consume (which takes an optional byte count limit), having assumed that GNU cat was using splice. As cat doesn't have a way to limit the number of bytes read/written, and other tools like head or tail definitely don't use splice, the simplest way to limit the benchmark without introducing a bottleneck was to have the producers themselves print stats and exit. The relevant diff is