BLAS is a set of subroutine interfaces that originated in Fortran, but can be implemented in any language; most mainstream implementations use one or more of Fortran, C, C++, assembly, and DSLs.
Right, though a LAPACK implementation is probably going to be mostly Fortran taken from the reference version, with some optimized routines swapped in -- that's essentially what OpenBLAS does. Something to note is that, unfortunately, the interface is the obsolete Fortran77 style, not modern Fortran, though that does make it more likely you can interchange implementations at run time as appropriate.
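To make the Fortran77-style point concrete, here's a sketch (not a header to rely on) of what the dgemm interface looks like from C: every argument is passed by pointer, the symbol conventionally gets a trailing underscore, and matrices are column-major. Some ABIs also append hidden length arguments for the character parameters, so real code should use the implementation's own header or CBLAS.

    /* Hypothetical hand-written binding for the F77-style dgemm. */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    /* C = A*B for column-major m x k A, k x n B, m x n C. */
    void matmul(int m, int n, int k,
                const double *a, const double *b, double *c)
    {
        const double one = 1.0, zero = 0.0;
        dgemm_("N", "N", &m, &n, &k, &one, a, &m, b, &k, &zero, c, &m);
    }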
It's my impression that common implementations of BLAS have Fortran at the core because Fortran (more so than C) lends itself to automatic vectorization.
This is partially true (or at least once was; the performance gap is pretty slim nowadays), but most of the high-performance BLAS implementations like OpenBLAS or MKL have their hot-path code written in hand-tuned assembly anyway.
BLAS inner loops are usually explicitly hand-vectorized, so Fortran’s autovectorization advantages don’t help. I’ve written many BLAS kernels in assembly over the years.
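For anyone wondering what hand-vectorized means here, a minimal sketch of an axpy-style inner loop in AVX2/FMA intrinsics (assuming n is a multiple of 4; real kernels add unrolling over several accumulators, remainder handling, prefetching, and usually drop to assembly for full register control):

    #include <immintrin.h>

    /* y := alpha*x + y, four doubles per iteration.
       Assumes n % 4 == 0; compile with -mavx2 -mfma. */
    void daxpy_avx2(int n, double alpha, const double *x, double *y)
    {
        const __m256d va = _mm256_set1_pd(alpha);
        for (int i = 0; i < n; i += 4) {
            __m256d vx = _mm256_loadu_pd(x + i);
            __m256d vy = _mm256_loadu_pd(y + i);
            vy = _mm256_fmadd_pd(va, vx, vy);   /* vy = va*vx + vy in one op */
            _mm256_storeu_pd(y + i, vy);
        }
    }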
Not arguing with that, but I think the jury is out on whether they need to be hand-vectorized. With recent GCC, generic C for BLIS' DGEMM gives about 2/3 the performance of the hand-coded version on Haswell, and it may be somewhat pessimized by hand-unrolling rather than letting the compiler do the unrolling. The remaining difference is thought to be mainly from prefetching, but I haven't verified that. (Details are scattered around the BLIS issue tracker from earlier this year.) The sort of generic C in question is sketched below.
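Roughly, the generic C is shaped like this BLIS-style microkernel sketch (MR and NR are illustrative here; BLIS picks them per architecture, and the packed panel layout is what makes the accumulator updates vectorizable):

    /* C(MR x NR) += A*B, with A packed in MR-wide column panels
       and B in NR-wide row panels; edge cases omitted. */
    enum { MR = 4, NR = 4 };   /* hypothetical register-block sizes */

    void dgemm_ukr_generic(int k, const double *restrict a,
                           const double *restrict b,
                           double *restrict c, int ldc)
    {
        double ab[MR][NR] = {{0.0}};
        for (int p = 0; p < k; p++)           /* the hot loop */
            for (int i = 0; i < MR; i++)
                for (int j = 0; j < NR; j++)
                    ab[i][j] += a[p*MR + i] * b[p*NR + j];
        for (int i = 0; i < MR; i++)
            for (int j = 0; j < NR; j++)
                c[i + j*ldc] += ab[i][j];
    }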
For the benefit of anyone who doesn't know where level 3 BLAS performance comes from: it isn't just vectorization, but also cache use via multiple levels of blocking (and prefetch); see the material under https://github.com/flame/blis, or the sketch below. The other levels -- vector-vector and matrix-vector, not matrix-matrix -- are less amenable to fancy optimization, having lower arithmetic intensity to pit against memory bandwidth, and BLIS mostly hasn't bothered with them, though OpenBLAS has.
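A hedged sketch of that blocking structure, minus the packing and edge handling a real implementation needs (block sizes are illustrative; the idea is that each loop level keeps one operand block resident in a particular cache level):

    /* Cache-blocked C += A*B, column-major, assuming m, n, k are
       multiples of the block sizes. Real code also packs the blocks
       into contiguous buffers and prefetches ahead of the kernel. */
    enum { MC = 64, KC = 256, NC = 512 };   /* illustrative */

    void dgemm_blocked(int m, int n, int k,
                       const double *A, int lda,
                       const double *B, int ldb,
                       double *C, int ldc)
    {
        for (int jc = 0; jc < n; jc += NC)           /* NC-wide panel of B */
            for (int pc = 0; pc < k; pc += KC)       /* KC x NC block of B */
                for (int ic = 0; ic < m; ic += MC)   /* MC x KC block of A */
                    for (int j = jc; j < jc + NC; j++)
                        for (int p = pc; p < pc + KC; p++)
                            for (int i = ic; i < ic + MC; i++)
                                C[i + j*ldc] += A[i + p*lda] * B[p + j*ldb];
    }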
Haswell is now a six-year-old core, so compiler cost models and vectorization support for AVX2 + FMA have mostly caught up. The state of autovectorization targeting AVX-512, or even NEON on new cores, is quite a bit less satisfactory for now.
It's long been the case that compilers generate adequate vector code for five-year-old cores. A large part of the win from hand-vectorization (and assembly) comes from targeting cores that haven't even been released yet, or have only just become available.
Also, 2/3 is ... significantly worse than I would expect, actually. My experience writing GEMM (I did it professionally for a decade) is that getting to 80% of peak is mechanical, and the remaining 10-15% is where all the real work is.
It's also a mistake to ignore level 1 and level 2 BLAS; for data that fits in the cache hierarchy they become important, and you can absolutely squeeze significant gains out of hand optimization. For the HPC community such small problems are uninteresting, but in everyday consumer computing you can reap huge benefits here.
I wasn't in a position to make realistic comparisons on SKX (Skylake-SP) at the time, but it seems that Zen is similar to Haswell (and Haswell/Broadwell is still pretty relevant in HPC). I'll eventually get back to it and try on SKX.
I expect you can do better with the generic code, as suggested. The point is that the received wisdom says you can't just rely on compiler vectorization but need carefully hand-(un?!)rolled code. The story around BLIS was that GCC wouldn't vectorize the relevant loops, which at least GCC 6 actually does (better than the SSE3 code in the how-to-optimize-gemm tutorial). The suggestion was then to force vectorization with the OpenMP simd pragma (sketched below), which is worse on Haswell because it doesn't use FMA, though I don't know whether that's relevant for, say, POWER using the generic kernels.
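For reference, the pragma looks like this (a sketch: GCC needs -fopenmp or -fopenmp-simd to honor it, and whether the result uses FMA depends on flags like -mfma and -ffp-contract):

    /* Forcing vectorization past the compiler's cost model.
       Illustrative loop; compile with e.g. -O2 -fopenmp-simd. */
    void fused_update(int n, const double *restrict a,
                      const double *restrict b, double *restrict c)
    {
        #pragma omp simd
        for (int i = 0; i < n; i++)
            c[i] += a[i] * b[i];
    }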
I'm not actually ignoring levels 1 and 2, although my interest is HPC. BLIS' reduction-type loops in generic C at least get vectorized now via -funsafe-math-optimizations. Small dgemm is important in some HPC applications too, and there BLIS doesn't do too well, but there's libxsmm.
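The reduction case means loops like a dot product, where vectorizing reorders the floating-point additions, so GCC wants explicit permission:

    /* BLAS1-style reduction. Plain -O3 won't vectorize this because
       the vector sum changes the order of additions;
       -funsafe-math-optimizations (or
       "#pragma omp simd reduction(+:s)") allows it. */
    double ddot_generic(int n, const double *restrict x,
                        const double *restrict y)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += x[i] * y[i];
        return s;
    }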
You'll typically need restrict-qualified arguments for vectorization in C, but given those, GCC usually does vectorize the loops OK in BLAS-like code. There may be further advantages from Fortran arrays; I forget the details.
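As a sketch of the aliasing point, take a level-2-style rank-1 update: restrict tells the compiler the output can't overlap the inputs, which Fortran's rules for dummy arguments guarantee for free; without it, GCC has to emit runtime overlap checks or fall back to scalar code.

    /* A += alpha * x * y^T, column-major. With restrict on all three
       pointers GCC vectorizes the inner loop at -O3; drop the
       qualifiers and compare the generated code. */
    void dger_generic(int m, int n, double alpha,
                      const double *restrict x, const double *restrict y,
                      double *restrict a, int lda)
    {
        for (int j = 0; j < n; j++) {
            double t = alpha * y[j];
            for (int i = 0; i < m; i++)
                a[i + j*lda] += t * x[i];
        }
    }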