Intel has `rep movs`, which is recommended for general use, though you can beat it for sizes < 128B with a specialized loop. A very good general-purpose memcpy would just branch on the size to either a specialized small copy or `rep movs` for larger sizes.
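To make that concrete, here is a rough sketch of the branch-on-size idea. It is not how glibc does it. The names `sketch_memcpy`, `copy_small`, `copy_rep_movsb` and the `SMALL_COPY_MAX` cutoff are made up for illustration, the small path is a plain byte loop standing in for a properly tuned one, and the inline asm assumes x86-64 with GCC or Clang (it falls back to the library `memcpy` elsewhere).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff, taken from the "< 128B" figure above; tune per CPU. */
#define SMALL_COPY_MAX 128

/* Small path: a simple byte loop as a stand-in for a specialized copy. */
static void *copy_small(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Large path: `rep movsb`, which copies RCX bytes from [RSI] to [RDI].
 * Assumes the direction flag is clear, as the x86-64 System V ABI requires. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    void *d = dst;
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return dst;
#else
    /* Assumption: on other targets just defer to the library. */
    return memcpy(dst, src, n);
#endif
}

/* Branch once on the size, then hand off to one of the two paths. */
void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    if (n < SMALL_COPY_MAX)
        return copy_small(dst, src, n);
    return copy_rep_movsb(dst, src, n);
}
```

Like `memcpy`, this assumes the buffers don't overlap. Where the crossover actually sits depends on the microarchitecture, so treat 128 as a starting point to benchmark, not a constant.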
The glibc version is just bananas: https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=sysde...
Compare the newer ERMS implementation: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86...