M1 can do 3 loads or 2 stores per cycle and has 6 ALU ports, so a loop doing 2 bytes/cycle taking 2 cycles on average is not really out of the question.
But with the constant overheads present it's kind of useless to compare (the C benchmarks in OP test 56-char strings, cycling through 10000 of them, so the inputs themselves, roughly 560 KB in total, don't fit in L1; I'd imagine the actual SSE/AVX code by itself should be much faster, certainly faster than the ~1B/cycle that the benchmark reports).
A difference between this and the OP is that the OP doesn't pass in a length, instead terminating on the first invalid character (which is the trailing null byte in the benchmark). That means the array-out-of-bounds check can't be abused.
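To make the two loop shapes concrete, here's a rough sketch in C. This is not the OP's code: the function names and the "ASCII digits only" validity check are made up for illustration, and the real benchmark may validate something else entirely.

```c
#include <stddef.h>

/* Hypothetical validity predicate: ASCII digits only (an assumption,
 * not taken from the OP's benchmark). Note that '\0' is invalid. */
static inline int is_valid(unsigned char c) {
    return c >= '0' && c <= '9';
}

/* OP-style: no length is passed in. The loop stops at the first invalid
 * byte, and since the trailing null byte is itself invalid, it doubles
 * as the terminator. Returns the number of leading valid bytes. */
size_t scan_sentinel(const char *s) {
    const unsigned char *p = (const unsigned char *)s;
    while (is_valid(*p)) {
        p++;
    }
    return (size_t)(p - (const unsigned char *)s);
}

/* Length-passing style: every iteration needs a bounds comparison in
 * addition to the validity check. */
size_t scan_bounded(const char *s, size_t len) {
    size_t i = 0;
    while (i < len && is_valid((unsigned char)s[i])) {
        i++;
    }
    return i;
}
```

In the first version the only exit condition is the validity check, so there's no separate bounds comparison in the loop at all; in the second, the `i < len` test is there on every iteration.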