If you read the description you'll see that it's optimizing around certain syscalls like memcpy (which would otherwise be the bottleneck and cut speed by about 5x). I think it's less the language choice and more the low-level optimization that makes the difference here.
(To check, I wrote a Go program that does nothing but fmt.Print($largebody) in a for loop, and it tops out at around 10 GB/s.)
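Roughly this shape, as a sketch; the payload contents and the ~1 MiB size are my stand-ins for $largebody:

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// Stand-in for $largebody; the contents and size are assumptions.
	body := strings.Repeat("fizzbuzz\n", 1<<17)
	for {
		// os.Stdout is unbuffered, so each Print turns into write()
		// syscalls that copy the buffer into the pipe.
		fmt.Print(body)
	}
}
```

Measured the obvious way: `go run main.go | pv > /dev/null`.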
Note that memcpy is not a syscall, though common syscalls like write() do copy data. The fast implementation uses the vmsplice syscall as a clever way to pump data into a pipe, because it doesn't copy.
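For anyone curious, here's a minimal sketch of the vmsplice trick in Go (golang.org/x/sys/unix exposes the syscall on Linux). It only works when stdout is actually a pipe, and the buffer handling is deliberately simplistic:

```go
package main

import (
	"bytes"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// vmsplice hands the kernel references to our pages instead of
	// copying them, so a real generator must not rewrite a buffer
	// that may still be in flight. Alternating two buffers is the
	// simplest discipline; here they're constant, so the alternation
	// is just shown for shape.
	bufs := [2][]byte{
		bytes.Repeat([]byte("fizzbuzz\n"), 1<<13),
		bytes.Repeat([]byte("fizzbuzz\n"), 1<<13),
	}
	fd := int(os.Stdout.Fd()) // must be a pipe, or vmsplice errors out
	for i := 0; ; i ^= 1 {
		buf := bufs[i]
		for len(buf) > 0 {
			iov := []unix.Iovec{{Base: &buf[0]}}
			iov[0].SetLen(len(buf))
			n, err := unix.Vmsplice(fd, iov, 0)
			if err != nil {
				panic(err) // e.g. stdout wasn't a pipe
			}
			buf = buf[n:]
		}
	}
}
```

Run it as `./prog | pv > /dev/null`; writing to a terminal or a regular file won't work.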
Indeed. On most systems today, memcpy is in fact a compiler intrinsic. C's standard library offers a memcpy, but the reality is that your C standard library was likely supplied by your C compiler vendor, and when you ask for memcpy() in C you get their preferred intrinsic inlined into your program unless you've been very firm in declining that optimisation.
In Rust this memcpy intrinsic is required (for the core language, not just the standard library) because Rust has arbitrarily large data structures you can literally copy, so somebody has to take responsibility for doing that work; Rust says it's the compiler back-end's job. On a toy CPU the memcpy might really just be a trivial read->register->write loop, because that's adequate.
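Go behaves the same way, which makes for an easy illustration in this thread's language: a plain assignment of a large value compiles down to the runtime's bulk-copy routine (runtime.memmove in the gc toolchain), with no copy call visible in the source. The type here is made up:

```go
package main

import "fmt"

// Big is a hypothetical value type; assigning one copies a megabyte.
type Big struct {
	data [1 << 20]byte
}

func main() {
	var a Big
	a.data[0] = 42
	// No memcpy appears in the source, but the compiler lowers this
	// assignment to a bulk copy under the hood.
	b := a
	fmt.Println(b.data[0]) // 42
}
```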