Which of these is "faster" depends greatly on whether you have the very rare memcpy-only workload, or whether your program actually does something useful. Many people believe, often with good evidence, that the most important thing is for memcpy to occupy as few instruction cache lines as practical, instead of being something that branches all over kilobytes of machine code. For comparison, see the x86 implementations in LLVM libc.
It depends on the CPU. There is no good reason for "rep movsb" to be slower at any big enough data size.
On a Zen 3 CPU, "rep movsb" becomes faster than or the same as anything else above a length slightly greater than 2 kB.
However, there is a range of multi-megabyte lengths, corresponding roughly to sizes that fit within the L3 cache but exceed the L2 cache, where for some weird reason "rep movsb" becomes slower than SIMD non-temporal stores.
At lengths exceeding the L3 size, "rep movsb" again becomes the fastest copy method.
All thresholds are described in https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch...
And they are not final, i.e. Noah Goldstein still updates them every year.