x86 does, it's called REP MOVS, and it copies entire cachelines at a time (if the size of the copy lets it.) AFAIK it has done this since the P6 family (Pentium II), and has only gotten better over time. As a bonus, it is only 2 bytes long. I believe the Linux kernel uses it as its default memcpy()/memmove().
It's funny enough that the REP family of instructions has existed at least since the original 8086/8088. It's very much a relic of the CISC heritage of the x86 ISA, and it was probably meant to make writing assembly code easier and to save space - rather to improve performance.
I remember it was actually quite common in early PC programs, but I don't remember well enough whether it was also generated by compilers, or just existed in hand-optimized assembly (which was of course extremely common back then).
As I tried to indicate, the microbenchmark-guided unrolled loops look better than the simple `rep` instruction because the benchmarks aren't realistic. They don't suffer from mispredicted branches, they don't experience icache pressure. Basically the microarchitectural reality is absent in a small benchmark.
There's also the fact that `rep` startup cost was higher in the past than it is now. I think it started to get really fast around Ivy Bridge.
This is all discussed in Intel's optimization manual, by the way.
Here's some very interesting discussion about it: https://stackoverflow.com/questions/43343231/enhanced-rep-mo...