I hate the overuse of acronyms (IHTOOA). One way to get back is just to start making up your own -- throw a few random TLA's out at your next meeting. It's fun.
Also, I would like to push for an anti-acronym day: a full day where you have to use full and complete terminology. The ironic thing is, you don't add that many syllables when you say "Proof of Concept" instead of POC, and the same goes for similarly absurd, overused abbreviations.
To be fair, in the supercomputing community (or, more generally, "high performance computing"--HPC), certain acronyms for libraries are very well-known: MPI, HDF5, FFTW, BLAS, MKL--no one is going to bat an eye at seeing those names without an expansion.
Think you could make it one day without using those acronyms?
Use of acronyms is usually an audience-dependent thing. I am frequently in the audience of talks, or among the readers of papers, where I don't know all the acronyms. I think the general rule for writing should be: if a reasonable percentage (say more than 10%) of readers won't know an acronym, define it on first use. When speaking, give some context or define the terms early in your talk, to help folks along.
I sat in a meeting the other day with marketing folks. They may be the worst offenders. Techies are pretty bad too.
I don't agree. The article quite clearly refers to Intel MKL, which is widely known at least in the number crunching world. Plus, it's just a blog post.
Tagging onto this to mention another informative Reddit post on the subject, posted today (01/09/2020), which suggests that a previous workaround is no longer possible:
Wow, why is Intel supporting Zen kernels in MKL? That seems... really interesting. One of the few places where Intel still has a somewhat clear advantage is in high performance numerical code that can't be easily multiprocessed (e.g. off-the-shelf ML code) because MKL is much faster than AMD alternatives.
Also, does anyone know if one can use patchelf on, e.g., the Python binary (or numpy's compiled extensions?) to get MKL/Zen support? I don't have my Zen CPU on me to test.
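For what it's worth, the approach people usually describe isn't patchelf but an LD_PRELOAD shim that overrides MKL's internal vendor check. A minimal sketch, assuming MKL still routes dispatch through the undocumented mkl_serv_intel_cpu symbol (which, per the post linked above, newer MKL releases may have changed):

```c
/* fakeintel.c -- hypothetical LD_PRELOAD shim.
 * Overrides MKL's internal CPU-vendor check so the optimized code paths
 * are selected on non-Intel CPUs.  mkl_serv_intel_cpu is an undocumented
 * internal symbol and may not exist or be honored in newer MKL releases.
 *
 * Build: gcc -shared -fPIC -o libfakeintel.so fakeintel.c
 * Use:   LD_PRELOAD=./libfakeintel.so python your_numpy_script.py
 */
int mkl_serv_intel_cpu(void) {
    return 1;  /* claim the host is a genuine Intel CPU */
}
```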
" ... Intel must not ... Intentionally include design/engineering elements in its products that artificially impair the performance of any AMD microprocessor."
If you read Agner Fog's old "Cripple AMD" thread from 2009, when he actually served as an FTC witness, you will see that his later updates in 2010-11 show no reduction in this kind of behavior on Intel's side. If Intel had complied, this would have been a solved problem.
Indeed, I had three theories, from optimistic to cynical:
1. They are sharing an increasing amount of code between oneDNN (formerly MKL-DNN) and Intel MKL, so it simply pays off from a development-cost perspective.
2. Intel sees AArch64 as a serious competitor. AMD may be an enemy, but at least an enemy on the same architecture. Better to strengthen x86_64 than to give AArch64 too much momentum in HPC. (The #1 supercomputer in TOP500 is AArch64 [1])
3. Write Zen kernels, make them slightly less efficient than the Intel ones. They still beat OpenBLAS et al. in benchmarks, even on Zen. But to HPC customers they can show that Intel CPUs are faster.
I think (1) is somewhat optimistic, given the huge benefit that they had with MKL in HPC. If (1) is not the case, I definitely hope (2) is.
> 2. Intel sees AArch64 as a serious competitor. AMD may be an enemy, but at least an enemy on the same architecture. Better to strengthen x86_64 than to give AArch64 too much momentum in HPC. (The #1 supercomputer in TOP500 is AArch64 [1])
Historically, this makes the most sense to me: AMD has pooped the bed before, and Intel was there to pick up the pieces and reclaim its crown. I think there's a legitimate worry that if ARM takes enough market share from x86/x64, Intel will never get it back.
> 3. Write Zen kernels, make them slightly less efficient than the Intel ones. They still beat OpenBLAS et al. in benchmarks, even on Zen. But to HPC customers they can show that Intel CPUs are faster.
IMO, this is more on AMD than Intel. If Intel's kernel is faster than the open alternative, then Intel is doing AMD a favor, even if they're not doing the absolutely best job they possibly could.
> IMO, this is more on AMD than Intel. If Intel's kernel is faster than the open alternative, then Intel is doing AMD a favor, even if they're not doing the absolutely best job they possibly could.
If AMD chips are slow on the Intel-optimized kernel, then it's partly on AMD.
If Intel writes an "AMD-optimized" kernel that's worse on AMD than the "Intel-optimized" kernel, that's definitely not on AMD. And in that case it's not doing them a favor to make that kernel when they could just use the same code on everything.
> 4. This is one person (or a small group) at Intel who considers it the right thing to do.
> Maybe?
I guess it depends on where you work. Most places would probably fire someone who "took the initiative" to improve a competitor's product on company time.
Not on company time, but after leaving Intel, some former employees added ARM support to ispc[0]. In the years since then, current Intel employees have maintained and even expanded the ARM support, and Intel’s website even mentions it now[1].
I bet it's #3, and not necessarily maliciously. Higher-end Intel SKUs do outperform AMD quite easily on practical floating point workloads, and that's a fact that could be highlighted by testing _Intel_ MKL side by side. I don't really think it matters a whole lot in the grand scheme of things - all the serious compute has been on the GPUs for the past decade, and that's not about to change. It is critical for Intel now to make their Xe GPUs viable for HPC - something they're well positioned to do, but will fuck up anyway because they're a huge, bureaucratic behemoth with lots of political infighting, where people at the top don't have the faintest clue as to what's going on, nor the ability to acquire such a clue.
I think the real question one should be asking is "why is AMD's support for high-performance computing so bad?"
This applies both to CPUs (where they have to rely on Intel for this) and to GPUs (where their offerings are under-resourced, buggy, and lagging behind what NVIDIA offers).
I work for AMD on our GPU math libraries. If you have specific complaints, I will gladly listen. I can't promise more than that, but I will read and consider whatever you write.
I'm a true believer in ROCm HIP. It has tremendous potential. I joined the company specifically because I wanted to help ensure its success... mostly by writing fast, reliable code, but also by listening to our users.
If you (or anyone else) would prefer to respond privately, my email is my HN username at gmail.
This isn't too hard to explain, I think - until recently AMD was very strapped for cash, and all their software engineering bandwidth went towards optimizing games (where they're still keeping up with NVIDIA, and where progress doesn't seem to have slowed).
How is that good news when your investigation shows that in reality "Intel seems to be adding cripple-Zen kernels"? Compared to spoofing Intel: 382 GF/s vs 430 GF/s.
There is no difference. The operations these kernels use are all well-defined and work identically on all CPUs that implement them. In general, FP/SIMD is not nearly as much of a crapshoot as people seem to expect it to be. Beyond timings, there are generally no user-space visible differences in operation between any AVX/SSE instructions in Intel/AMD cpus.
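To illustrate the point that dispatch can key off CPU features rather than the vendor string, here is a toy sketch (using the GCC/Clang builtin __builtin_cpu_supports; the "kernels" are just stand-in stubs of mine, not MKL's):

```c
/* Toy illustration of feature-based dispatch: choose a code path from what
 * the CPU reports it supports, not from its vendor string. */
#include <stdio.h>

static void kernel_avx2(void) { puts("AVX2 path"); }
static void kernel_sse2(void) { puts("SSE2 path"); }

int main(void) {
    if (__builtin_cpu_supports("avx2"))
        kernel_avx2();   /* behaves identically on any AVX2-capable CPU */
    else
        kernel_sse2();
    return 0;
}
```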
And that's exactly the key question. The original kernel that Intel uses on its own CPUs works perfectly fine on AMD; otherwise MATLAB wouldn't have validated it in their 2020a production release. So the key question is: why is Intel going to the effort of implementing a specific kernel for AMD that is somewhat slower? I guess there aren't too many answers out there that make sense.
The entire industry should develop a genuine interest in pushing OSS alternatives. And all OSS projects should honestly think twice about adopting closed-source software like MKL as a standard.
Why this obsession with MKL on AMD hardware? I've long made measurements of OpenBLAS on "large"-dimension serial DGEMM, at least. Even on Intel hardware from Westmere to SKX, with the exception of KNL, it was always at least within the typical noise level of HPC jobs relative to MKL performance, and it was always better than ACML on the generations of Opterons we used.
There are results for AVX2 systems with older versions of OpenBLAS and BLIS (which is AMD's BLAS) at https://github.com/flame/blis/blob/master/docs/Performance.m... I haven't seen or made measurements for EPYC2 yet (and don't have results with the current versions online for Haswell and SKX), but I'd be surprised if AMD BLIS doesn't perform equally well on EPYC2. OpenBLAS serial DGEMM is very similar to MKL on SKX, for instance, with BLIS not far behind. I think AMD's work has contributed to BLIS Haswell performance; their BLAS certainly supports Intel hardware decently, as well as aarch64, at least.
If you're interested in small dimension matrix multiplication on AVX2 hardware, consider libxsmm and AMD's recent "SUP" support in BLIS, e.g. https://github.com/flame/blis/blob/master/docs/PerformanceSm... MKL only got good at small dimensions because of libxsmm.
Off-topic for AMD, but in lieu of detailed figures, here are the first few points from the measurements I have to hand on SKX serial square DGEMM with BLIS 0.7, OpenBLAS 0.3.10, and MKL 2021.1-beta06, using the framework behind the figures on the BLIS site:
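For anyone wanting to run a comparable measurement themselves, a minimal serial square DGEMM timing loop could look like the sketch below; the matrix size, repeat count, and CBLAS interface are illustrative choices of mine, not the framework used for the BLIS-site figures.

```c
/* dgemm_bench.c -- illustrative serial square DGEMM timing sketch.
 * Build against any CBLAS-providing library, e.g.:
 *   gcc -O2 dgemm_bench.c -o dgemm_bench -lopenblas
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

int main(void) {
    const int n = 2000;   /* square problem size: an arbitrary choice */
    const int reps = 5;
    double *a = malloc((size_t)n * n * sizeof *a);
    double *b = malloc((size_t)n * n * sizeof *b);
    double *c = malloc((size_t)n * n * sizeof *c);
    for (long i = 0; i < (long)n * n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* DGEMM does 2*n^3 floating-point operations per call */
    printf("n=%d: %.1f GFLOP/s\n", n, 2.0 * n * n * (double)n * reps / secs / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```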
People sometimes run code written, or even compiled, by somebody else. Strange but true. ("At least I think it's strange, and I am assured it is true." -- DA)
Stop me if this is just too weird, but I believe if you rely on proprietary software then you should choose the platform where you get the best performance of that software.
There are only a finite number of HPC clusters available to any given researcher with a given grant. Each European country might have 1-10 available, and you have to apply for compute time.
If you’re in a commercial setting, your company might have 1 cluster and N simulation programs.
To add to that, I believe OpenBLAS surpassed MKL in benchmarks a while ago. Even if I had an Intel CPU, I would probably use OpenBLAS if I had the choice.
I benchmarked some of my large transformer networks the last few days and MKL is still 50% faster than OpenBLAS.
What's even worse in real-world applications is that OpenBLAS misbehaves when an application uses threads. This is also described in the OpenBLAS FAQ:
If your application is already multi-threaded, it will conflict with OpenBLAS multi-threading. Thus, you must set OpenBLAS to use single thread as following.
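The FAQ then lists the concrete options (paraphrasing from memory here, so check the FAQ itself for the exact wording): set OPENBLAS_NUM_THREADS=1 in the environment, call openblas_set_num_threads(1) at runtime, or build a single-threaded OpenBLAS. A minimal sketch of the runtime call:

```c
/* Pin OpenBLAS to a single thread when the application manages its own
 * threads.  OpenBLAS's cblas.h declares openblas_set_num_threads(); the
 * environment variable OPENBLAS_NUM_THREADS=1 achieves the same thing
 * without code changes, and USE_THREAD=0 builds a serial library. */
#include <cblas.h>

void init_math(void) {
    /* Call before the application spawns its own worker threads. */
    openblas_set_num_threads(1);
}
```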
So what are the results with libxsmm and the current AMD BLAS, since that must be for small dimensions?
The reason it's serial BLAS that mainly matters is that HPC codes are usually parallelized above the BLAS; why do you want the nesting? Swapping in threaded OpenBLAS or BLIS is something you might do with basically serial stuff like vanilla R, e.g. https://loveshack.fedorapeople.org/blas-subversion.html#_add... OpenBLAS threading has been somewhat buggy, but the main problem with its OpenMP support currently seems to be that using OMP_PLACES kills it.
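Roughly, the pattern looks like the sketch below: the application's own threads (here an OpenMP loop over independent blocks; the block count, sizes, and the OpenBLAS build line are illustrative assumptions of mine) each call a serial DGEMM, so nested BLAS threading only adds overhead.

```c
/* above_blas.c -- sketch of "parallelism above the BLAS": each OpenMP
 * thread runs a *serial* DGEMM on its own independent block.
 * Build e.g. with: gcc -O2 -fopenmp above_blas.c -lopenblas
 */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    const int nblocks = 16, bs = 256;          /* illustrative sizes */
    size_t per = (size_t)bs * bs;              /* elements per block  */
    double *a = malloc(nblocks * per * sizeof *a);
    double *b = malloc(nblocks * per * sizeof *b);
    double *c = malloc(nblocks * per * sizeof *c);
    for (size_t i = 0; i < nblocks * per; i++) { a[i] = 1.0; b[i] = 1.0; }

    #pragma omp parallel for                    /* application-level threads */
    for (int k = 0; k < nblocks; k++)
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    bs, bs, bs, 1.0, a + k * per, bs, b + k * per, bs,
                    0.0, c + k * per, bs);

    printf("c[0] = %g (expect %d)\n", c[0], bs);
    free(a); free(b); free(c);
    return 0;
}
```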
The new mkl_serv_intel_cpu_true() function seems to have been known since Agner Fog's 2019 update. I am quite surprised that no changes to the feature indicator were needed, though.
If you are publishing an application, I still recommend using the intel_dispatch_patch.zip.
* Intel's “cripple AMD” function (2019)
https://news.ycombinator.com/item?id=24307596