It can, and various other 1BRC implementations have (though I doubt hotspot would manage to do it itself, never mind that most 1BRC submissions were run with Graal for the lower startup overhead). The 32×32→64-bit multiplication poses an issue on the default SSE2 as it has no 32-bit or 64-bit multiplication, but SSE4.1 adds exactly the needed thing, pmuldq (although, as that returns a 64-bit result, you'd need two of such to process a full vector of 32-bit ints).
Did the challenge limit submissions to only using SSE2? Seems odd, given the prevalence of SSE4.2 support.
PMULUDQ is in SSE2, though I haven't checked if that's usable for the problem here. There's also PMULLD in SSE4.1 if you only need a 32-bit result. But for summing digits, perhaps SSE2's PMADDWD could be sufficient?
The official 1BRC was Java-only, so no using any architecture-specific SIMD at all; the test system did have AVX2 though, and that's what most non-competing native solutions (including mine) targeted.
Completely forgot about pmuludq, that works too for SSE2. But a 32-bit result is insufficient for the magic number method, needs to be at least 36-bit. I originally used vpmaddubsw+vpmaddwd, but switched to vpmuldq for the reduced register pressure, and I was already only parsing 4 numbers in ymm registers so the 64-bit result didn't affect me (after parsing 4 temperatures I immediately did the respective hashmap stuff for each).
The temperature fields are interleaved with name fields, so I don't think you'd get any extra benefit from SSE. Also, since the temperature field is variable-sized, it would probably not pay off even if it was stored by column.
SSE was successfully applied to finding the delimiter between the name and the temperature, though.
I wonder what you mean here. What code exactly would get auto-vectorized? Parsing the temperature surely not, since it's not in an array of fixed-size fields.
once you've done the work to implement parsing as branchless bitwise operations on integers I think that code can be extended automatically to operate on larger registers
The question is if the cost of setting up the initial vector and extracting the result isnt prohibitive.