M1 can do 3 loads or 2 stores per cycle and has 6 ALU ports, so a loop doing 2 bytes/cycle taking 2 cycles average is not really out of question.
But with the constant overheads present it's kind of useless to compare (the C benchmarks in OP test 56-char strings, cycling though 10000 of such, so the inputs themselves don't fit in L1; I'd imagine the actual SSE/AVX code by itself should be much faster, certainly faster than the ~1B/cycle that the benchmark reports).
A difference between this and the OP is that OP doesn't pass in a length, instead terminating on the first invalid character (which is the trailing null byte in the benchmark). Means that the array-out-of-bounds check can't be abused.
Hexadecimal is a number system that wasn't designed for any particular computation or conversion with anything but to fit efficiently in nibbles and bytes. You maybe conflating encodings like ASCII and EBCDIC as "designed" for mechanical conversion.
Firstly, how are you going to select whichever of the following is between 0..9 branchlessly and cheaply?
x0 = c - '0'
result = (x0 < 10) ? x0 : (c - 'A' + 10)
Answer: You're not.
Instead,
DELTA_CHECK and DELTA_REBASE aren't specified
M128I V // 16 hex chars as ASCII/Latin-1 bytes
M128I VM1, HASH_KEY, CHECK, T3, T5
UINT M
// 2. Subtract 1 from all character values
FOR I := 0 TO 15 DO
VM1[I; I8] = V[I; I8] - 1
DONE
// 3. Using a 4-bit shift and a mask, select the most significant 4 bits of each byte value.
FOR I := 0 TO 3 DO
HASH_KEY[I; U32] = (VM1[I; U32] >> 4) AND 0x0F0F0F0F
DONE
// 4. Do a vectorized table lookup using the most significant 4 bits, and add the result to the value. This inexpensive computation can detect any non-hexadecimal character: any such character would map to a byte value with the most significant bit set. We later branch out based on a movemask which detects these bytes.
FOR I := 0 TO 15 DO
IF HASH_KEY[I; I8] & 0x80 != 0 THEN
CHECK[I; I8] = VM1[I; I8]
ELSE
CHECK[I; I8] = VM1[I; I8] + DELTA_CHECK[(HASH_KEY[I; I8] & 0x0F); I8]
ENDIF
DONE
// 5. We do a similar computation, a vectorized table lookup using the most significant 4 bits, and add the result to the value, to map the character to a 4-bit value (between 0 and 16 [15 sic]).
FOR I := 0 TO 15 DO
IF HASH_KEY[I; I8] & 0x80 != 0 THEN
V[I; I8] = VM1[I; I8]
ELSE
V[I; I8] = VM1[I; I8] + DELTA_REBASE[(HASH_KEY[I; I8] & 0x0F); I8]
ENDIF
DONE
// [check MSB of each byte in CHECK for error]
FOR I := 0 TO 15 DO
IF (CHECK[I; U8] >> 7) != 0 THEN
ERROR
ENDIF
DONE
// 6. We use a multiply-add and a pack instruction to pack the 4-bit values, going from 16 bytes to 8 bytes.
// b = 0x0110_0110_0110_0110_0110_0110_0110_0110
FOR I := 0 TO 7 DO
// dst[i+15:i] := a[i+15:i+8]*b[i+15:i+8] + a[i+7:i]*b[i+7:i]
// dst[i+15:i] := a[i+15:i+8]*1 + a[i+7:i]*16
T3[I; I16] = MIN(V[2*I+1; U8] + (V[2*I; U8] << 4), 0xffff)
DONE
FOR I := 0 TO 7 DO
T5[I; U8] = MIN(T3[2*I; I16], 0xff)
T5[I+7; U8] = MIN(T3[2*I; I16], 0xff) // duplicate, thrown away in the final result
DONE
// 7. We write the 8 bytes to the output. [The later code ignores the top 8 bytes of t5]
am i missing something?