Hacker News new | past | comments | ask | show | jobs | submit login

Can you vectorize this with SSE. Most of the magic could be done with vector of four 32 bit integers.

The question is if the cost of setting up the initial vector and extracting the result isnt prohibitive.




It can, and various other 1BRC implementations have (though I doubt hotspot would manage to do it itself, never mind that most 1BRC submissions were run with Graal for the lower startup overhead). The 32×32→64-bit multiplication poses an issue on the default SSE2 as it has no 32-bit or 64-bit multiplication, but SSE4.1 adds exactly the needed thing, pmuldq (although, as that returns a 64-bit result, you'd need two of such to process a full vector of 32-bit ints).


Did the challenge limit submissions to only using SSE2? Seems odd, given the prevalence of SSE4.2 support.

PMULUDQ is in SSE2, though I haven't checked if that's usable for the problem here. There's also PMULLD in SSE4.1 if you only need a 32-bit result. But for summing digits, perhaps SSE2's PMADDWD could be sufficient?


The official 1BRC was Java-only, so no using any architecture-specific SIMD at all; the test system did have AVX2 though, and that's what most non-competing native solutions (including mine) targeted.

Completely forgot about pmuludq, that works too for SSE2. But a 32-bit result is insufficient for the magic number method, needs to be at least 36-bit. I originally used vpmaddubsw+vpmaddwd, but switched to vpmuldq for the reduced register pressure, and I was already only parsing 4 numbers in ymm registers so the 64-bit result didn't affect me (after parsing 4 temperatures I immediately did the respective hashmap stuff for each).


The temperature fields are interleaved with name fields, so I don't think you'd get any extra benefit from SSE. Also, since the temperature field is variable-sized, it would probably not pay off even if it was stored by column.

SSE was successfully applied to finding the delimiter between the name and the temperature, though.


It can still be beneficial - yes you need to load the temperatures individually, but there's enough operations afterward to make it worth it.


I imagine code like this gets auto-vectorized either right from the start or after Hotspot detects a hotspot


I wonder what you mean here. What code exactly would get auto-vectorized? Parsing the temperature surely not, since it's not in an array of fixed-size fields.


once you've done the work to implement parsing as branchless bitwise operations on integers I think that code can be extended automatically to operate on larger registers


I think it would work if that was the only code in the loop. But the loop spans several more nontrivial operations, including hashtable insertion.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: