BLAS inner loops are usually explicitly hand-vectorized, so Fortran’s autovectorization advantages don’t help. I’ve written many BLAS kernels in assembly over the years.



Not arguing with that, but I think the jury is out on whether they need to be hand-vectorized. With recent GCC, generic C for BLIS' DGEMM gives about 2/3 the performance of the hand-coded version on Haswell, and it may even be somewhat pessimized by hand-unrolling rather than letting the compiler do the unrolling. The remaining difference is thought to come mainly from prefetching, but I haven't verified that. (Details are scattered across the BLIS issue tracker from earlier this year.)
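
For concreteness, here's a minimal sketch (my own illustration, not BLIS's actual kernel; the block sizes and packed layouts are assumptions) of the kind of generic C microkernel at issue. Compiled with something like gcc -O3 -march=haswell, recent GCC autovectorizes the inner loops into AVX2 FMAs:

    /* C[MR x NR] += A[MR x k] * B[k x NR], with the A panel packed
       column-major and the B panel packed row-major, in the style of
       the BLIS microkernel interface. Illustrative sketch only. */
    #include <stddef.h>

    enum { MR = 8, NR = 6 };   /* plausible register-block sizes */

    void dgemm_ukernel_ref(size_t k,
                           const double *restrict a,  /* packed MR x k */
                           const double *restrict b,  /* packed k x NR */
                           double *restrict c,
                           size_t rs_c, size_t cs_c)  /* C strides */
    {
        double ab[MR * NR] = {0};   /* accumulator block */

        for (size_t p = 0; p < k; ++p)      /* one rank-1 update per p */
            for (size_t j = 0; j < NR; ++j)
                for (size_t i = 0; i < MR; ++i)
                    ab[j * MR + i] += a[p * MR + i] * b[p * NR + j];

        for (size_t j = 0; j < NR; ++j)     /* accumulate into C */
            for (size_t i = 0; i < MR; ++i)
                c[i * rs_c + j * cs_c] += ab[j * MR + i];
    }

Left rolled like this, the compiler picks the unroll-and-vectorize strategy itself, which is exactly the hand-unrolling question above.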

For the information of anyone who doesn't know about the performance of level 3 BLAS: it comes not just from vectorization but also from careful cache use, with multiple levels of blocking (and prefetch). See the material under https://github.com/flame/blis. The other levels -- not matrix-matrix -- are less amenable to fancy optimization, having lower arithmetic intensity to pit against memory bandwidth, and BLIS mostly hasn't bothered with them, though OpenBLAS has.
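
To make the blocking concrete, here is a schematic (hypothetical block sizes; packing and edge handling omitted) of the Goto-style loop nest BLIS uses, with the outer loops sized so the working panels stay resident in the right cache levels:

    #include <stddef.h>

    /* Illustrative cache-block and register-block sizes. */
    enum { NC = 3072, KC = 256, MC = 96, MR = 8, NR = 6 };

    /* Stand-in for the vectorized MR x NR microkernel (see above);
       computes C[MR x NR] += A[MR x kc] * B[kc x NR], row-major. */
    static void microkernel(size_t kc,
                            const double *A, size_t lda,
                            const double *B, size_t ldb,
                            double *C, size_t ldc)
    {
        for (size_t j = 0; j < NR; ++j)
            for (size_t i = 0; i < MR; ++i)
                for (size_t p = 0; p < kc; ++p)
                    C[i * ldc + j] += A[i * lda + p] * B[p * ldb + j];
    }

    /* C (m x n) += A (m x k) * B (k x n), all row-major; m, n, k are
       assumed to be multiples of the block sizes to keep this short.
       A real implementation packs the A and B panels into contiguous
       buffers at the pc/ic levels and prefetches ahead of the kernel. */
    void dgemm_blocked(size_t m, size_t n, size_t k,
                       const double *A, const double *B, double *C)
    {
        for (size_t jc = 0; jc < n; jc += NC)          /* B panel -> L3 */
            for (size_t pc = 0; pc < k; pc += KC)      /* depth block  */
                for (size_t ic = 0; ic < m; ic += MC)  /* A block -> L2 */
                    for (size_t jr = jc; jr < jc + NC; jr += NR)
                        for (size_t ir = ic; ir < ic + MC; ir += MR)
                            microkernel(KC,
                                        &A[ir * k + pc], k,
                                        &B[pc * n + jr], n,
                                        &C[ir * n + jr], n);
    }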


Haswell is now a six-year-old core, so compiler cost models and vectorization support for AVX2 + FMA have mostly caught up. The state of autovectorization targeting AVX-512, or even NEON on new cores, is quite a bit less satisfactory for now.

It's long been the case that compilers generate adequate vector code for five-year-old cores. A large part of the win from hand-vectorization (and assembly) comes from targeting cores that haven't even been released yet, or have only just become available.

Also, 2/3 is ... significantly worse than I would expect, actually. My experience writing GEMM (I did it professionally for a decade) is that getting to 80% of peak is mechanical; the last 10-15% is where all the real work is.

It's also a mistake to ignore level 1 and level 2 BLAS: for data that fits in the cache hierarchy, they become important, and you can absolutely squeeze significant gains out of hand optimization there. For the HPC community such small problems are uninteresting, but for everyday consumer computing the benefits can be huge.
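
As a concrete level-1 example, here's a sketch of a hand-vectorized daxpy (y += alpha*x), assuming AVX2 + FMA; this is my illustration, not OpenBLAS's actual kernel:

    #include <immintrin.h>
    #include <stddef.h>

    /* y += alpha * x, 4 doubles per iteration via AVX2 FMA.
       Build with e.g. gcc -O2 -mavx2 -mfma. */
    void daxpy_avx2(size_t n, double alpha,
                    const double *restrict x, double *restrict y)
    {
        __m256d va = _mm256_set1_pd(alpha);
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m256d vx = _mm256_loadu_pd(x + i);
            __m256d vy = _mm256_loadu_pd(y + i);
            _mm256_storeu_pd(y + i, _mm256_fmadd_pd(va, vx, vy));
        }
        for (; i < n; ++i)          /* scalar remainder */
            y[i] += alpha * x[i];
    }

For in-cache sizes, details like alignment handling, unrolling to cover FMA latency, and shaving off per-call overhead entirely are where the hand-tuned versions pull ahead.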


I wasn't in a position to make realistic comparisons on SKX at the time, but Zen seems to be similar to Haswell (and Haswell/Broadwell is still pretty relevant in HPC). I'll eventually get back to it and try on SKX, and I expect you can do better with the generic code, as suggested.

The point is that the received wisdom says you can't just rely on compiler vectorization but need carefully hand-rolled (or hand-unrolled) code. The story around BLIS was that GCC wouldn't vectorize certain loops that at least GCC 6 actually does vectorize (better than the SSE3 code in the how-to-optimize-dgemm tutorial). The suggestion then was to force vectorization with the OpenMP simd pragma, which is worse on Haswell because it doesn't use FMA, though I don't know whether that's relevant for, say, POWER using the generic kernels.

I'm not actually ignoring levels 1 and 2, although my interest is HPC. BLIS' reduction-type loops in generic C do at least get vectorized now via -funsafe-math-optimizations. Small dgemm is important in some HPC applications too, for which BLIS doesn't do too well, but there's libxsmm.
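
For reference, the kind of reduction loop in question, with the OpenMP simd pragma mentioned above (a generic sketch, not BLIS's exact source), looks like this:

    #include <stddef.h>

    /* Dot product, the archetypal level-1 reduction. Without
       -funsafe-math-optimizations (or -ffast-math), GCC won't
       vectorize the plain loop because reassociating the sum
       changes rounding; the pragma (with -fopenmp or -fopenmp-simd)
       grants that permission for this loop only. */
    double ddot_ref(size_t n, const double *x, const double *y)
    {
        double sum = 0.0;
        #pragma omp simd reduction(+:sum)
        for (size_t i = 0; i < n; ++i)
            sum += x[i] * y[i];
        return sum;
    }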



