
Hi all, author here.

Several people have raised questions about SIMD. For many workloads, SIMD is obviously a huge win. But in my experience, for problems like this, the gains are marginal and the additional complexity often isn't worth it. For example: do you offer a build-time option? Check at runtime on every call? Use templates to check once at startup and dispatch to a different code path?
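
For what it's worth, here is a minimal sketch of the "check once at startup and run a different code path" option, using a function pointer cached on first use. The names and the GCC/Clang builtins are illustrative only, not what sajson actually does:

    #include <cstddef>

    // Hypothetical scalar and AVX2 variants of the same routine.
    static std::size_t scan_scalar(const char* p, std::size_t n) { /* scalar loop */ return n; }
    static std::size_t scan_avx2(const char* p, std::size_t n)   { /* AVX2 loop */   return n; }

    std::size_t scan_string(const char* p, std::size_t n) {
        // The static initializer runs once (thread-safely in C++11); after
        // that, every call is an indirect call with no feature check.
        static const auto impl =
            (__builtin_cpu_init(), __builtin_cpu_supports("avx2"))
                ? scan_avx2 : scan_scalar;
        return impl(p, n);
    }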

That said, it never hurts to try, so I implemented the fast string path in AVX2 (see https://github.com/chadaustin/sajson/commit/e87da27883ffe0a3...). It actually benchmarks slower than the scalar code: https://gist.github.com/chadaustin/6fe8d250ac8402ad70a724e7a...

I couldn't tell you exactly why - it's possible something could be tweaked to make it execute faster, or that enabling -mavx2 causes other code to get slower - but I wanted to illustrate that enabling SIMD is not always an automatic go-fast button.

I did look at the instructions generated and they seem reasonable:

    .LBB5_118:
        # Main loop: scan the string 32 bytes at a time.
        vmovdqu	ymm5, ymmword ptr [rbx - 32]
        # Compare each byte against the precomputed constants in ymm0-ymm3
        # (the byte classes that end the fast scan) and OR the results.
        vpcmpgtb	ymm6, ymm5, ymm0
        vpcmpeqb	ymm7, ymm5, ymm1
        vpcmpeqb	ymm8, ymm5, ymm2
        vpcmpeqb	ymm5, ymm5, ymm3
        vpxor	ymm5, ymm5, ymm4
        vpor	ymm6, ymm8, ymm6
        vpor	ymm5, ymm5, ymm6
        vpor	ymm5, ymm5, ymm7
        # One mask bit per flagged byte; nonzero means this chunk contains
        # a character that needs attention.
        vpmovmskb	esi, ymm5
        test	esi, esi
        jne	.LBB5_148
        mov	rbp, rax
        sub	rbp, rbx
        mov	rsi, rbx
        lea	rbx, [rbx + 32]
        cmp	rbp, 31
        jg	.LBB5_118
        jmp	.LBB5_120

    .LBB5_148:
        # A flagged byte was found in the last 32-byte chunk.
        add	rbx, -32
        # bsr locates a flagged byte within the chunk (the edit below notes
        # this should be bsf, i.e. the first flagged byte, not the last).
        bsr	eax, esi
        xor	eax, 31
        lea	rsi, [rbx + rax]
        mov	bl, byte ptr [rbx + rax]
    .LBB5_125:
        mov	r10, qword ptr [rsp + 24] # 8-byte Reload
        mov	r9b, byte ptr [rsp + 39] # 1-byte Reload
        sub	rcx, r8
        movzx	eax, bl
        cmp	eax, 34
        jne	.LBB5_127

I suppose it's also worth pointing out that, while string scanning is an important part of JSON parsing, the instruction histogram is generally spread across the entire parser. That is, if string scanning accounts for maybe 20-40% of total parse time, doubling its speed only improves overall parse times by 10-20%.

Edit: Upon thinking about it, it's probably slower because there's a dependency chain through a pile of high-latency vector instructions, all the way from the load to the bsr (which should be bsf; I fixed it on the branch) and on to the next loop iteration's load. The branch predictor doesn't have much opportunity to help us out here. You could look at Agner Fog's tables and calculate the critical path latency of this loop, but I'm guessing it's not pretty.
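
For a rough sense of what that calculation looks like, here's a back-of-the-envelope tally of the hit path's dependency chain. The cycle counts are approximate Skylake-class figures (roughly in line with Agner Fog's tables), not measurements of this code:

    vmovdqu 256-bit load (L1 hit)     ~7 cycles
    vpcmpeqb / vpcmpgtb                1
    vpor x2 (on the critical path)     2
    vpmovmskb                         ~2-3
    bsr                                3
    lea + dependent byte load         ~5-6
                                      ------
                                      ~20 cycles before the next dependent load can begin

Even if individual numbers are off by a cycle or two, the chain is long and entirely serial.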


I think this is marginally more expensive than my suggestion, but I'm not 100% sure, and I think the critical paths are similar. My suggestion requires (after the load):

1 shift (4 bits right); unfortunately, no byte-granularity shift is available.

2 PAND operations to ensure that the 'magic high bit' (0x80) is switched off (this bit is magic to PSHUFB)

2 PSHUFB operations to look up a table

1 PAND to combine the two table lookups.

... and then rejoins your implementation at the VPMOVMSKB.
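
Putting those steps together, here's a minimal AVX2 sketch of the nibble-lookup idea as I understand it, flagging the bytes a JSON string scan has to stop on (quote, backslash, and control characters below 0x20). The table contents and the function name are my own illustration, not taken from sajson or from the code above:

    // Classify 32 bytes at once: two PSHUFB table lookups keyed on the low
    // and high nibbles, ANDed together. A byte is "special" if any class
    // bit survives the AND.
    #include <immintrin.h>
    #include <cstdint>

    uint32_t find_special_bytes(const unsigned char* p) {
        const __m256i input = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(p));

        // Class bits: 1 = quote (0x22), 2 = backslash (0x5C), 4 = control (< 0x20).
        // Each 16-entry table is replicated into both 128-bit lanes for VPSHUFB.
        const __m256i lo_table = _mm256_setr_epi8(
            4, 4, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 6, 4, 4, 4,
            4, 4, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 6, 4, 4, 4);
        const __m256i hi_table = _mm256_setr_epi8(
            4, 4, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            4, 4, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
        const __m256i nibble_mask = _mm256_set1_epi8(0x0f);

        // 1 shift: there is no byte-granularity shift, so shift 16-bit lanes...
        // 2 PANDs: ...and mask down to a nibble, which also clears the 0x80
        // bit that would otherwise zero the PSHUFB result.
        const __m256i lo_nibble = _mm256_and_si256(input, nibble_mask);
        const __m256i hi_nibble =
            _mm256_and_si256(_mm256_srli_epi16(input, 4), nibble_mask);

        // 2 PSHUFBs: look up class bits for each nibble.
        const __m256i lo_class = _mm256_shuffle_epi8(lo_table, lo_nibble);
        const __m256i hi_class = _mm256_shuffle_epi8(hi_table, hi_nibble);

        // 1 PAND to combine the two lookups.
        const __m256i special = _mm256_and_si256(lo_class, hi_class);

        // One extra compare turns nonzero bytes into 0xFF, then VPMOVMSKB
        // rejoins the mask-and-branch logic from the listing above.
        const __m256i hits = _mm256_cmpgt_epi8(special, _mm256_setzero_si256());
        return static_cast<uint32_t>(_mm256_movemask_epi8(hits));
    }

Bytes with the high bit set (UTF-8 continuation bytes and the like) map to high nibbles 8-15, where the table is zero, so they are never flagged.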

I don't think your instructions are high latency unless you are using something surprising - VPOR/VPXOR/VCMP*B are all latency 1 on most recent Intel architectures.

Whether these techniques help depends on context. I'd be surprised if vector operations do that well against short-to-medium strings. We found that a few of our sequences that are designed to 'find the first match' saw no benefit from moving from SSE to AVX2 due to similar effects.


> Do you have a build-time option? Check at runtime at every call? Use templates to check at startup and run a different code path?

With glibc there's also the IFUNC option:

https://sourceware.org/glibc/wiki/GNU_IFUNC
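
For illustration, a minimal sketch of what that looks like with GCC's ifunc attribute under glibc; the function names and the scalar/AVX2 split here are hypothetical, not something sajson does:

    #include <cstddef>

    // Two hypothetical implementations of the same routine.
    static std::size_t scan_scalar(const char* p, std::size_t n) { /* scalar loop */ return n; }
    static std::size_t scan_avx2(const char* p, std::size_t n)   { /* AVX2 loop */   return n; }

    // The resolver runs once, when the dynamic linker binds the symbol.
    // extern "C" keeps its symbol name unmangled so the ifunc attribute
    // below can refer to it.
    extern "C" std::size_t (*resolve_scan_string())(const char*, std::size_t) {
        __builtin_cpu_init();
        return __builtin_cpu_supports("avx2") ? scan_avx2 : scan_scalar;
    }

    // Callers simply call scan_string(); there is no per-call feature check.
    std::size_t scan_string(const char* p, std::size_t n)
        __attribute__((ifunc("resolve_scan_string")));

Compared with checking CPU features on every call, the dispatch cost collapses to an ordinary indirect call through the PLT.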


If you're using AVX instructions, make sure to insert a vzeroupper after you're done; otherwise you take a big speed hit transitioning between SSE and AVX code. The compiler will generate these automatically around AVX instructions it emits itself:

https://godbolt.org/g/jNXJyA

Here's an intrinsic for it:

http://technion.ac.il/doc/intel/compiler_c/main_cls/intref_c...
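
For a concrete (hypothetical) picture of where the intrinsic goes when you're writing the AVX code by hand rather than letting the compiler emit it:

    #include <immintrin.h>
    #include <cstddef>

    void fill_ones(float* dst, std::size_t n) {
        std::size_t i = 0;
        const __m256 ones = _mm256_set1_ps(1.0f);
        for (; i + 8 <= n; i += 8)
            _mm256_storeu_ps(dst + i, ones);   // 256-bit AVX stores
        _mm256_zeroupper();  // clear the upper ymm halves before SSE-era code runs
        for (; i < n; ++i)   // the scalar tail now avoids the transition penalty
            dst[i] = 1.0f;
    }

As the godbolt link shows, when the compiler itself emits the AVX instructions it places the vzeroupper for you.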


Little known fact about AVX: every time you use an AVX instruction, Intel processors reduce the core frequency by a few percent for about a millisecond to avoid overheating (I guess because they need to turn on an otherwise unused functional unit).

As a result, if you sprinkle some AVX instructions in your code, but not enough to produce a significant efficiency win, you might actually end up reducing your capacity (and hurting latency).


Can you provide a citation for that or a link to more reading? I'm not asking because I don't believe you, but because I would like to learn more. :-)


Sure, see this doc from Intel: https://computing.llnl.gov/tutorials/linux_clusters/intelAVX...

"Intel AVX instructions require more power to run. When executing these instructions, the processor may run at less than the marked frequency to maintain thermal design power (TDP) limits."

...

"When the processor detects Intel AVX instructions, additional voltage is applied to the core. With the additional voltage applied, the processor could run hotter, requiring the operating frequency to be reduced to maintain operations within the TDP limits. The higher voltage is maintained for 1 millisecond after the last Intel AVX instruction completes, and then the voltage returns to the nominal TDP voltage level."



