It sounds surprising if a gather-based approach beats doing the math. Can you po...

dzaima · on Aug 5, 2023

M1 can do 3 loads or 2 stores per cycle and has 6 ALU ports, so a loop doing 2 bytes/cycle, taking 2 cycles on average, is not really out of the question.

But with the constant overheads present it's kind of useless to compare: the C benchmarks in OP test 56-char strings, cycling through 10000 of them, so the inputs themselves don't fit in L1. I'd imagine the actual SSE/AVX code by itself should be much faster, certainly faster than the ~1B/cycle that the benchmark reports.

jsheard · on Aug 5, 2023

That LUT would fit into the M1's L1 cache, so in a microbenchmark you'd expect it to be pretty fast.

The question is how it performs when there's surrounding code competing for cache space.

anonymoushn · on Aug 5, 2023

Well, you'll also only hit a tiny part of the LUT, but it still takes a lot more instructions because there's no gather in NEON.
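For anyone following along, here is a minimal scalar sketch of the two approaches being compared: a 256-entry table lookup versus "doing the math". It is not the OP's code and says nothing about the SIMD versions; the class and method names (`HexNibble`, `viaLut`, `viaMath`) are made up for illustration. It does show why only a tiny part of the LUT is ever touched: just 22 of the 256 entries are valid hex digits.

```java
final class HexNibble {
    // Hypothetical 256-entry table: 0..15 for hex digits, -1 for everything else.
    private static final byte[] LUT = new byte[256];
    static {
        java.util.Arrays.fill(LUT, (byte) -1);
        for (int c = '0'; c <= '9'; c++) LUT[c] = (byte) (c - '0');
        for (int c = 'a'; c <= 'f'; c++) LUT[c] = (byte) (c - 'a' + 10);
        for (int c = 'A'; c <= 'F'; c++) LUT[c] = (byte) (c - 'A' + 10);
    }

    // Table approach: one dependent load per input character.
    static int viaLut(byte c) {
        return LUT[c & 0xFF];
    }

    // Arithmetic approach: range checks and subtraction, no memory access.
    static int viaMath(byte c) {
        int x = c & 0xFF;
        if (x >= '0' && x <= '9') return x - '0';
        int lower = x | 0x20; // fold 'A'..'F' onto 'a'..'f'
        if (lower >= 'a' && lower <= 'f') return lower - 'a' + 10;
        return -1;
    }
}
```

Both return the same value for every possible byte, so any speed difference comes down to loads versus ALU work on the target core.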

iraqmtpizza · on Aug 5, 2023

The JIT is doing all the work, it appears.

https://pastebin.com/TvJ8Y9cq

    // Buffer for the 16 hex characters that the benchmark decodes.
    private static final byte[] _16 = new byte[16];
    @Setup(Level.Trial)
    public void setUp() {
        // Fill 8 random bytes once per trial and hex-encode them into _16.
        final byte[] _eight = new byte[_16.length / 2];
        new Random(System.currentTimeMillis() + System.nanoTime())
                .nextBytes(_eight);
        FastHex.encodeBytes(_eight, 0, _eight.length, _16, 0);
    }

    @Benchmark
    @Fork(value = 1, warmups = 1)
    @BenchmarkMode(Mode.Throughput)
    @Warmup(iterations = 1)
    @Measurement(iterations = 3)
    public void largeHexFast(Blackhole blackhole) {
        // Input is read through an index-to-byte accessor; the explicit length bounds the loop.
        blackhole.consume(FasterHex.decode(0, _16.length, i -> _16[i]));
    }

dzaima · on Aug 5, 2023

A difference between this and the OP is that the OP doesn't pass in a length, instead terminating on the first invalid character (which is the trailing null byte in the benchmark). That means the array-out-of-bounds check can't be abused.
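To make the two loop shapes concrete, here is a rough Java sketch of a decoder that stops at the first invalid character next to one that takes an explicit length. The names (`SentinelHex`, `decodeUntilInvalid`, `decodeWithLength`) are hypothetical and this is not the OP's or the pastebin's implementation. The sentinel version assumes the caller guarantees a terminating non-hex byte somewhere in the input.

```java
final class SentinelHex {
    // Returns 0..15 for a hex digit, -1 otherwise.
    static int nibble(int x) {
        if (x >= '0' && x <= '9') return x - '0';
        int lower = x | 0x20; // fold upper case onto lower case
        if (lower >= 'a' && lower <= 'f') return lower - 'a' + 10;
        return -1;
    }

    // Sentinel style: no length argument, stop at the first non-hex byte.
    // Assumes `in` contains a terminator (e.g. a trailing 0 byte).
    static int decodeUntilInvalid(byte[] in, byte[] out) {
        int n = 0;
        for (int i = 0; ; i += 2) {
            int hi = nibble(in[i] & 0xFF);
            if (hi < 0) break;
            int lo = nibble(in[i + 1] & 0xFF);
            if (lo < 0) break;
            out[n++] = (byte) ((hi << 4) | lo);
        }
        return n; // number of bytes written
    }

    // Length style: the trip count is known before the loop starts.
    static int decodeWithLength(byte[] in, int len, byte[] out) {
        int n = 0;
        for (int i = 0; i + 1 < len; i += 2) {
            int hi = nibble(in[i] & 0xFF);
            int lo = nibble(in[i + 1] & 0xFF);
            if ((hi | lo) < 0) break; // either nibble invalid
            out[n++] = (byte) ((hi << 4) | lo);
        }
        return n;
    }
}
```

In the sentinel version the exit condition depends on the data, so the loop bound is only discovered while decoding; in the length version it is available up front.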

iraqmtpizza · on Aug 5, 2023

Yeah, the methodologies aren't directly comparable (JMH is dark magic as well). For the record, I used JMH 1.37.