But there are 1 billion rows... So plenty of parallelism. Your 32 lanes can be w...

dzaima · 2024-03-10T19:23:18 1710098598

Processing 32 inputs at 1-byte granularity means that you'll need to expand each of those 32 inputs across 5 registers, and means you lose the ability to trivially shift them along (would need 15 blends to align the 5 input bytes to 3 output bytes for the 32×i8 approach, vs a single vpsllvd or some shuffle for 8×i32), or do any arithmetic wider than 8-bit (i.e. no 64-bit multiply at the core of the method in the article). More lanes just for the sake of it makes no sense (not counting unrolling as you can unroll anything to any amount (give or take register pressure)).

dzaima · 2024-03-10T22:16:38 1710108998

Also, fwiw, re: your gather/scatter approach at https://news.ycombinator.com/item?id=38867074:

I believe you can't get away with just 16+16-bit per state transition case - you have to validate that the input is what the entry was for, and your proposed 16-bit fields don't include anything for that. (having a full 2^32-element table won't help you, and would be a massive 16GB by itself anyways).

I don't see how 2^16 counters can work, as there are clearly 400×400 = 160000 different name×temperature combinations to track; lowering to 160 temperatures is awful. (maybe you could squeeze that together with the output state bits though, which are somewhat fewer, but doesn't help much).

Which leaves just the option of increasing to 8 bytes per entry.

That ends up as a massive table: e.g. a window of "4\nAb" has 400×40 possible input states and 10×141 cases, for a total of 22560000 transitions, or 180MB at 8 bytes per entry; and there are more windows than just this one, and you need some empty space to reduce collisions (though some enormous L3-s maybe could still cover it. maybe.)
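To make that estimate easy to re-run with other numbers, here is a rough sketch of the arithmetic. The factors are copied from the comment as-is; the variable names are mine and are only labels, not a claim about what each factor encodes.

```python
# Back-of-envelope size of the transition table for the one window "4\nAb".
# The four factors are the ones quoted in the comment; names are illustrative.
STATE_FACTORS = (400, 40)   # "400x40 possible input states"
CASE_FACTORS = (10, 141)    # "10x141 cases"
ENTRY_BYTES = 8             # 8 bytes per table entry

input_states = STATE_FACTORS[0] * STATE_FACTORS[1]
window_cases = CASE_FACTORS[0] * CASE_FACTORS[1]
transitions = input_states * window_cases
table_bytes = transitions * ENTRY_BYTES

print(f"{transitions} transitions, {table_bytes / 1e6:.1f} MB")
```

This covers only the single window shape and leaves out any slack for collision avoidance, so the real table would be larger.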

The current best native solution works at ~2 cycles per input byte per core, so your method per 4 bytes has to beat 8 cycles/element. Your described solution is (cycle counts on Intel, as Zen 4 has much slower gather/scatter):

1. gather the next block of 4 bytes for each element; let's say 0.33c/elt as these are all in L1.

2. gather 8-byte values from the state transition table; as the table is massive and random access, that's limited by the RAM to cache interface; though you only need 8 bytes, the CPU will fetch 64B cache lines, so even with DDR5 at 64GB/s that's just 1 billion reads per second (64GB/s ÷ 64B), or 3c/elt at 3GHz.

3. vpconflictd for when elements hit the same histogram bucket (~1.2c/elt on Intel; 0.08c/elt on Zen 4; also needs some fallback for when the collisions do happen (luckily unlikely if you mask out states that don't terminate the record); also means that these steps 3&4&5 cannot be interleaved without a separate histogram table for each).

4. gather the histogram previous value (0.5c/elt?).

5. scatter the incremented histogram values (0.7c/elt?).
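For anyone unfamiliar with why step 3 is needed, here is a scalar Python model of steps 3 to 5. It is only a sketch: `conflict_masks` imitates what vpconflictd reports (for each lane, a bitmask of earlier lanes holding the same index), and `histogram_update` is a hypothetical helper showing the gather/increment/scatter with a scalar fallback for the duplicated lanes. Real code would use the AVX-512 instructions directly.

```python
def conflict_masks(indices):
    """Scalar model of vpconflictd: bit j of result[i] is set when an
    earlier lane j < i holds the same index as lane i."""
    out = []
    for i, idx in enumerate(indices):
        mask = 0
        for j in range(i):
            if indices[j] == idx:
                mask |= 1 << j
        out.append(mask)
    return out


def histogram_update(hist, indices):
    """Increment hist[idx] once per lane (hypothetical helper).
    Conflict-free lanes go through gather + scatter; lanes that duplicate
    an earlier lane are deferred to a scalar fallback."""
    masks = conflict_masks(indices)                # step 3
    gathered = [hist[idx] for idx in indices]      # step 4: gather old counts
    for lane, idx in enumerate(indices):           # step 5: scatter new counts
        if masks[lane] == 0:
            hist[idx] = gathered[lane] + 1
    for lane, idx in enumerate(indices):           # fallback for duplicates
        if masks[lane] != 0:
            hist[idx] += 1
    return hist


print(conflict_masks([3, 5, 3, 7]))
print(histogram_update([0] * 8, [3, 5, 3, 7]))
```

Without the conflict check, both lanes pointing at bucket 3 would gather the same old value and scatter the same new one, losing an increment.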

Furthermore, to handle the rarer temperatures/collisions (the distribution of names is uniform so there's no point in handling only a subset of those), you need some fallback mechanism. Targeting 400 temperatures gives you a 5% chance of a failed lookup per entry (i.e., for a 16-element register, there's a 56% chance that at least one will hit a bad temperature), and you must restore the failed lane to a working one very fast, or otherwise all lanes will end up dying quite quickly. Some options:

6.a. do a scalar ctz+blsr loop over a mask of the failed entries (a bunch of cycles and very branch-mispredicty);

6.b. increase to ~600 targeted temperatures, for only 0.2% chance of a failed lookup (still 3.1% over a 16-element vector), and do a scalar fallback when it does happen (still somewhat mispredicty);

6.c. store all failed entries to a buffer (scatter, 0.7c/elt, probably even with most entries being masked off? could be wrong though); and then of course whatever code for actually handling these 5% of more complex entries.
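The per-vector failure odds and the loop in option 6.a can be sketched as below. This is an illustration only: it assumes lanes fail independently, `handle_failed_lanes` and its `handler` callback are hypothetical names, and the two bit tricks stand in for the ctz (tzcnt) and blsr instructions.

```python
def p_any_lane_fails(p_fail, lanes=16):
    """Chance that at least one of `lanes` independent lookups fails."""
    return 1.0 - (1.0 - p_fail) ** lanes


def handle_failed_lanes(mask, handler):
    """Option 6.a: visit each set bit of `mask`, lowest first."""
    while mask:
        lane = (mask & -mask).bit_length() - 1   # ctz: index of lowest set bit
        handler(lane)
        mask &= mask - 1                         # blsr: clear lowest set bit


print(f"{p_any_lane_fails(0.05):.3f}")    # 5% per-entry failure
print(f"{p_any_lane_fails(0.002):.3f}")   # 0.2% per-entry failure

visited = []
handle_failed_lanes(0b10100100, visited.append)
print(visited)
```

The loop runs once per failed lane, so its trip count varies from vector to vector, which is where the branch mispredictions come from.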

So that's somewhere around 0.33 + 3 + 1.2 + 0.5 + 0.7 + 0.7 = 6.43 cycles per 4 bytes, not far from the 8 cycles of current solutions. And that's not counting any of the intermediate arithmetic, while assuming ideal cache & RAM performance (and that the CPU can actually OoO over all of the RAM latency and not flop on the gather/scatter of the histogram not having known addresses for a while); it also doesn't leave much for the table initialization.
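For reference, the same tally as a small script, so the individual guesses can be swapped out. The step costs are the estimates from this comment (several of them marked with "?" above), not measurements; the dictionary keys are just labels.

```python
CLOCK_HZ = 3e9          # 3GHz
RAM_BYTES_PER_S = 64e9  # DDR5 at 64GB/s
LINE_BYTES = 64         # one cache line fetched per random table access

# Step 2: one cache-line fetch per element, bound by RAM bandwidth.
table_gather = CLOCK_HZ / (RAM_BYTES_PER_S / LINE_BYTES)

steps = {
    "1. gather input bytes": 0.33,
    "2. gather transition entry": table_gather,
    "3. vpconflictd": 1.2,
    "4. gather histogram": 0.5,
    "5. scatter histogram": 0.7,
    "6.c. scatter failed entries": 0.7,
}

total = sum(steps.values())
budget = 2 * 4  # ~2 cycles/byte for the current best, times 4 bytes/element

print(f"total {total:.2f} c/elt, budget {budget} c/elt, "
      f"headroom {budget - total:.2f}")
```

With these inputs the headroom is about 1.57 cycles per element, which is what has to absorb everything the tally leaves out.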