
That's simply implausible for any proper measurement of the large-dimension level-3 operations that normally matter (and probably of others I haven't measured). OpenBLAS certainly uses AVX2 assembler on Zen if it's built and run correctly, and you can get >60% of MKL's AVX2 performance with a plain C micro-kernel. The only plausible cause of an order-of-magnitude difference on large GEMM would be comparing a multi-threaded run against a serial one.

I keep saying there's far too much mythology around MKL, much of it disproved experimentally, and if you're on Zen I don't know why you wouldn't use AMD's version of BLIS. It doesn't even make sense to talk about MKL performance on Zen without specifying the MKL version, since the story keeps changing.

OpenBLAS and BLIS actually are free software and cross-platform. MKL is still proprietary, and certainly doesn't run on the POWER platform I support. Also note that MKL only gained the small-matrix performance relevant to TensorFlow after libxsmm showed the way.




