
This is not the full truth: "rep movsb" is fast only up to another threshold, beyond which either normal or non-temporal stores are faster.

All thresholds are described in https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch...

And they are not final; Noah Goldstein still updates them every year.
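For readers unfamiliar with the mechanism those thresholds select between: on CPUs with ERMSB ("Enhanced REP MOVSB"), the fast path glibc dispatches to is essentially a single instruction. A minimal sketch (GCC/Clang inline asm, x86-64 only; the function name is mine, not glibc's):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration: the core of the ERMSB memcpy path is just
 * "rep movsb", with RDI = dst, RSI = src, RCX = byte count.  The "+"
 * constraints tell the compiler all three registers are clobbered. */
static void copy_rep_movsb(void *dst, const void *src, size_t n) {
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
}
```

The real glibc routine only branches to this path for sizes inside the tunable window; outside it, SIMD loops (cached or non-temporal) are used instead.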




Which of these is "faster" depends greatly on whether you have the very rare memcpy-only workload, or whether your program actually does something useful. Many people believe, often with good evidence, that the most important property of memcpy is occupying as few instruction cache lines as practical, rather than branching all over kilobytes of machine code. For comparison, see the x86 implementations in LLVM libc.

https://github.com/llvm/llvm-project/blob/main/libc/src/stri...


It depends on the CPU. There is no good reason for "rep movsb" to be slower at any sufficiently large data size.

On a Zen 3 CPU, "rep movsb" becomes faster than or the same as anything else above a length slightly greater than 2 kB.

However, there is a range of multi-megabyte lengths, corresponding roughly to sizes that fit in the L3 cache but exceed the L2 cache, where for some weird reason "rep movsb" becomes slower than SIMD non-temporal stores.

At lengths exceeding the L3 size, "rep movsb" again becomes the fastest copy method.

Intel CPUs behave differently.
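The SIMD non-temporal alternative mentioned above can be sketched with SSE2 streaming stores, which bypass the cache hierarchy so a huge copy does not evict the working set. A simplified illustration (my own function name; a real memcpy also handles unaligned heads/tails and sizes that are not multiples of 16):

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics, baseline on x86-64 */
#include <stdalign.h>
#include <stddef.h>
#include <string.h>

/* Copy n bytes using non-temporal (streaming) stores.  Requires dst
 * to be 16-byte aligned and n to be a multiple of 16 in this sketch. */
static void copy_nt(void *dst, const void *src, size_t n) {
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; i++) {
        __m128i v = _mm_loadu_si128(s + i);  /* normal (cached) load */
        _mm_stream_si128(d + i, v);          /* store bypassing cache */
    }
    _mm_sfence();  /* order streamed stores before any later accesses */
}
```

This is the kind of loop glibc switches to above its non-temporal threshold; below that point the cache pollution it avoids is cheaper than the write-combining traffic it costs.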





