> not add the AVX2 compiler flag It is a better idea to do instruction selection...

dalke · on Sept 8, 2022

I do that, using manual CPUID tests, along with allowing environment variables to override the default path choices.

But if the compiler by default doesn't enable AVX2 then it will fail to compile the AVX2 intrinsics unless I add -mavx2.

Even worse was ~10 years ago when I had an SSSE3 code path, with one file using SSSE3 intrinsics.

I had to compile only that file for SSSE3, and not the rest of the package, as otherwise the compiler would issue SSSE3 instructions where it decided was appropriate. Including in code that wasn't behind a CPUID check.

Thus crash on hardware without SSSE3.

See https://stackoverflow.com/questions/15527611/how-do-i-specif... for more info about my solution. Someone last year contributed a solution for MS Windows.

pabs3 · on Sept 8, 2022

See the wiki page, the function multi-versioning stuff means you can use AVX2 in select functions without adding -mavx2. And using SIMD Everywhere you can automatically port that to ARM NEON, POWER AltiVec etc.

dalke · on Sept 8, 2022

EDIT: after I wrote the below I realize I could use automatic multi-versioning solely to configure the individual functions, along with with a stub function indicating "was compiled for this arch?" I think that might be more effective should I need to revisit how I support multiple processor architecture dispatch. I will still need the code generation step.

Automatic multi-versioning doesn't handled what I needed, at least not when I started.

I needed a fast way to compute the popcount.

10 years ago, before most machines supported POPCNT, I implemented a variety of popcount algorithms (see https://jcheminf.biomedcentral.com/articles/10.1186/s13321-0... ) and found that the fastest version depended on more that just the CPU instruction set.

I ended up running some timings during startup to figure out the fastest version appropriate to the given hardware, with the option to override it (via environment variables) for things like benchmark comparisons. I used it to generate that table I linked to.

Function multi-versioning - which I only learned about a few month ago - isn't meant to handle that flexibility. To my understanding.

I still have one code path which uses __popcountll built-in intrinsics and another which has inline POPCNT assembly, so I can identify when it's no longer useful to have the inline assembly.

(Though I used AVX2 if available, I've also read that some of the AMD processors have several POPCNT execution ports, so may be faster than using AVX2 for my 1024-bit popcount case. I have the run-time option to choose which to use, if I ever have access to those processors.)

Furthermore, my code generation has one path for single-threaded use and one code path for OpenMP, because I found single-threaded-using-OpenMP was slower than single-threaded-without-OpenMP and it would crash on multithreaded macOS programs, due to conflicts between gcc's OpenMP implementation and Apple's POSIX threads implementation.

The AVX2 popcount is from Muła, Kurz, and Lemire, https://academic.oup.com/comjnl/article-abstract/61/1/111/38... , with manually added prefetch instructions (implemented by Kurz). It does not appear that SIMD Everywhere is the right route for me.

pabs3 · on Sept 8, 2022

If you implement your own ifunc instead of using the compiler-supplied FMV ifunc, you could do your benchmarks from your custom ifunc that runs before the program main() and choose the fastest function pointer that way. I don't think FMV can currently do that automatically, theoretically it could but that would require on additional modifications to GCC/LLVM. From the sounds of it, running an ifunc might be too early for you though, if you have to init OpenMP or something non-stateless before benchmarking.

SIMD Everywhere is for a totally different situation; if you want to automatically port your AVX2 code to ARM NEON/etc without having to manually rewrite the AVX2 intrinsics to ARM ones.