Decoding base16 sequences quickly (lemire.me)
49 points by greghn on Aug 5, 2023 | 13 comments



> the first 10 digits are 0, 1, 2, 3, 5, 6, 7, 8, 9

am i missing something?


With the latest Graal, my hex decoder does 3.6 GB/s on an Apple M1. No AVX coding required.

short val = DECODE_TABLE[extractor.applyAsInt(offset) << 7 | extractor.applyAsInt(offset+1)];

The decode lookup table is 26318 bytes long.

I believe that CPU runs at 3.2 GHz, so it's decoding at less than one cycle per input byte: ~14.2 cycles per 16-byte string.
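The two-character table trick described above can be sketched in Go (illustrative only; the table layout is inferred from the snippet, and the helper names are mine):

```go
package main

import "fmt"

// decodeTable maps a packed pair of ASCII hex characters (c1<<7 | c2) to the
// decoded byte value, or -1 for invalid pairs. Sized for all 7-bit pairs here;
// trimming it to ('f'<<7|'f')+1 = 13159 short entries gives the 26318 bytes
// mentioned above.
var decodeTable = buildTable()

func buildTable() []int16 {
	t := make([]int16, 1<<14)
	for i := range t {
		t[i] = -1
	}
	for c1 := byte(0); c1 < 0x80; c1++ {
		for c2 := byte(0); c2 < 0x80; c2++ {
			hi, lo := nibble(c1), nibble(c2)
			if hi >= 0 && lo >= 0 {
				t[int(c1)<<7|int(c2)] = hi<<4 | lo
			}
		}
	}
	return t
}

// nibble decodes a single hex digit, or returns -1 if c is not one.
func nibble(c byte) int16 {
	switch {
	case c >= '0' && c <= '9':
		return int16(c - '0')
	case c >= 'a' && c <= 'f':
		return int16(c-'a') + 10
	case c >= 'A' && c <= 'F':
		return int16(c-'A') + 10
	}
	return -1
}

// decodePair turns two hex characters into one byte with a single table load.
func decodePair(c1, c2 byte) int16 {
	return decodeTable[int(c1)<<7|int(c2)]
}

func main() {
	fmt.Println(decodePair('f', 'f')) // 255
	fmt.Println(decodePair('0', '1')) // 1
	fmt.Println(decodePair('0', 'g')) // -1
}
```

The cost per pair is one shift, one OR, and one load; invalid input falls out of the same lookup for free.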


It sounds surprising if a gather-based approach beats doing the math. Can you post something that readers can run?


The M1 can do 3 loads or 2 stores per cycle and has 6 ALU ports, so a loop that consumes 2 bytes per iteration in ~2 cycles on average is not really out of the question.

But with the constant overheads present it's kind of useless to compare (the C benchmarks in the OP test 56-char strings, cycling through 10,000 of them, so the inputs themselves don't fit in L1; I'd imagine the actual SSE/AVX code by itself should be much faster, certainly faster than the ~1 byte/cycle that the benchmark reports).


That LUT would fit into the M1's L1 cache, so in a microbenchmark you'd expect it to be pretty fast.

The question is how does it perform when there's surrounding code competing for cache space?


Well, you'll also only hit a tiny part of the LUT, but it still takes a lot more instructions because there is no gather instruction in NEON.


The JIT is doing all the work, it appears.

https://pastebin.com/TvJ8Y9cq

    private static final byte[] _16 = new byte[16];
    @Setup(Level.Trial)
    public void setUp() {
        final byte[] _eight = new byte[_16.length / 2];
        new Random(System.currentTimeMillis() + System.nanoTime())
                .nextBytes(_eight);
        FastHex.encodeBytes(_eight, 0, _eight.length, _16, 0);
    }

    @Benchmark
    @Fork(value = 1, warmups = 1)
    @BenchmarkMode(Mode.Throughput)
    @Warmup(iterations = 1)
    @Measurement(iterations = 3)
    public void largeHexFast(Blackhole blackhole) {
        blackhole.consume(FasterHex.decode(0, _16.length, i -> _16[i]));
    }


A difference between this and the OP is that the OP doesn't pass in a length; instead it terminates on the first invalid character (which is the trailing null byte in the benchmark). That means the array out-of-bounds check can't be abused.


Yeah, the methodologies aren't directly comparable (JMH is dark magic as well). For the record I used JMH 1.37


only getting 69 MB/s with my crappy Go code:

    package hex

    import (
        "bytes"
        "encoding/hex"
        "testing"
    )

    func Benchmark_Hex(b *testing.B) {
        src := bytes.Repeat([]byte("00"), 100_000_000)
        // Decode into a separate buffer: decoding in place overwrites src
        // with raw bytes, which are no longer valid hex on the next iteration.
        dst := make([]byte, len(src)/2)
        b.SetBytes(int64(len(src)))
        b.ResetTimer()
        for n := 0; n < b.N; n++ {
            if _, err := hex.Decode(dst, src); err != nil {
                b.Fatal(err)
            }
        }
    }


why is a lookup necessary? surely we just need a subtraction and conditional move - this is how hexadecimal is designed.


Hexadecimal is a number system; it wasn't designed for any particular computation or conversion, other than fitting efficiently into nibbles and bytes. You may be conflating it with encodings like ASCII and EBCDIC, which were "designed" with mechanical conversion in mind.

Firstly, how are you going to select whichever of the following is between 0..9 branchlessly and cheaply?

    x0 = c - '0'
    result = (x0 < 10) ? x0 : (c - 'A' + 10)
Answer: You're not.
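For reference, here is what the scalar version of that selection looks like in Go. A compiler can typically lower the branch to a conditional move for a single character; what is missing is an equally cheap per-lane equivalent across a 16-byte vector, which is what motivates the table-lookup approach below. (My own sketch; no validation is attempted.)

```go
package main

import "fmt"

// hexVal mirrors the two-step selection above: subtract '0', and if the
// result is not a decimal digit, rebase from the letter range instead.
// Assumes c is a valid hex digit.
func hexVal(c byte) int {
	x0 := int(c - '0')
	if x0 < 10 { // compilers typically turn this into a conditional move
		return x0
	}
	return int((c|0x20)-'a') + 10 // fold 'A'..'F' onto 'a'..'f'
}

func main() {
	fmt.Println(hexVal('7'), hexVal('b'), hexVal('F')) // 7 11 15
}
```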

Instead,

    DELTA_CHECK and DELTA_REBASE aren't specified

    M128I V // 16 hex chars as ASCII/Latin-1 bytes
    M128I VM1, HASH_KEY, CHECK, T3, T5
    UINT M

    // 2. Subtract 1 from all character values

    FOR I := 0 TO 15 DO
        VM1[I; I8] = V[I; I8] - 1
    DONE

    // 3. Using a 4-bit shift and a mask, select the most significant 4 bits of each byte value.

    FOR I := 0 TO 3 DO
        HASH_KEY[I; U32] = (VM1[I; U32] >> 4) AND 0x0F0F0F0F 
    DONE

    // 4. Do a vectorized table lookup using the most significant 4 bits, and add the result to the value. This inexpensive computation can detect any non-hexadecimal character: any such character would map to a byte value with the most significant bit set. We later branch out based on a movemask which detects these bytes.

    FOR I := 0 TO 15 DO
        IF HASH_KEY[I; I8] & 0x80 != 0 THEN
            CHECK[I; I8] = VM1[I; I8]
        ELSE
            CHECK[I; I8] = VM1[I; I8] + DELTA_CHECK[(HASH_KEY[I; I8] & 0x0F); I8] 
        ENDIF
    DONE

    // 5. We do a similar computation, a vectorized table lookup using the most significant 4 bits, and add the result to the value, to map the character to a 4-bit value (between 0 and 16 [15 sic]).

    FOR I := 0 TO 15 DO
        IF HASH_KEY[I; I8] & 0x80 != 0 THEN
            V[I; I8] = VM1[I; I8]
        ELSE
            V[I; I8] = VM1[I; I8] + DELTA_REBASE[(HASH_KEY[I; I8] & 0x0F); I8] 
        ENDIF
    DONE

    // [check MSB of each byte in CHECK for error]
    FOR I := 0 TO 15 DO
        IF (CHECK[I; U8] >> 7) != 0 THEN
            ERROR
        ENDIF
    DONE

    // 6. We use a multiply-add and a pack instruction to pack the 4-bit values, going from 16 bytes to 8 bytes.
    // b = 0x0110_0110_0110_0110_0110_0110_0110_0110
    FOR I := 0 TO 7 DO
        // dst[i+15:i] := a[i+15:i+8]*b[i+15:i+8] + a[i+7:i]*b[i+7:i] 
        // dst[i+15:i] := a[i+15:i+8]*1 + a[i+7:i]*16 
        T3[I; I16] = MIN(V[2*I+1; U8] + (V[2*I; U8] << 4), 0xffff)
    DONE

    FOR I := 0 TO 7 DO
        T5[I; U8]   = MIN(T3[I; I16], 0xff)
        T5[I+8; U8] = MIN(T3[I; I16], 0xff) // duplicate, thrown away in the final result
    DONE

    // 7. We write the 8 bytes to the output. [The later code ignores the top 8 bytes of t5]
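Steps 6 and 7 can be emulated lane by lane in scalar code. A Go sketch (my own emulation, assuming each input byte already holds a 4-bit value, with the high nibble at the even index as in the pseudocode):

```go
package main

import "fmt"

// packNibbles emulates PMADDUBSW (with the 0x01/0x10 multiplier pattern)
// followed by PACKUSWB: each pair of 4-bit values becomes one output byte.
func packNibbles(v [16]uint8) [8]uint8 {
	var t3 [8]uint16
	for i := 0; i < 8; i++ {
		// PMADDUBSW lane: v[2i]*16 + v[2i+1]*1
		t3[i] = uint16(v[2*i])<<4 + uint16(v[2*i+1])
	}
	var out [8]uint8
	for i := 0; i < 8; i++ {
		// PACKUSWB lane: saturate each 16-bit value to an unsigned byte
		if t3[i] > 0xFF {
			out[i] = 0xFF
		} else {
			out[i] = uint8(t3[i])
		}
	}
	return out
}

func main() {
	// "deadbeefdeadbeef" as 16 nibble values
	nib := [16]uint8{0xd, 0xe, 0xa, 0xd, 0xb, 0xe, 0xe, 0xf,
		0xd, 0xe, 0xa, 0xd, 0xb, 0xe, 0xe, 0xf}
	fmt.Printf("%x\n", packNibbles(nib)) // deadbeefdeadbeef
}
```

In the real SSE code the pack instruction produces 16 bytes and the upper 8 are discarded, as the pseudocode notes.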

Related:

Base32 and base16 vectorized decoding in simdzone

https://github.com/NLnetLabs/simdzone/issues/28

https://github.com/NLnetLabs/simdzone/issues/26


The lookup is one of the cheapest instructions the CPU has.



