AVX-512, what’s useful for us (obe.tv)
93 points by rbultje on Dec 22, 2017 | 48 comments



If I'm understanding it correctly, they're not actually using the 512-bit (ZMM) registers, because using them can cause overall system slowdown. It seems to me they're only really useful if you're doing an AVX-512-intensive workload. And do those really exist? For something like bulk matrix multiplications, GPGPU is going to be much better, both in throughput and in operations per joule. I have yet to be convinced that the ecological niche occupied by SIMD is significant, let alone expanding.


> For something like bulk matrix multiplications, GPGPU is going to be much better, both in throughput and in operations per joule.

But you have to get all of your matrix out onto the system bus, and over to the GPU, then start the kernel, and then copy it all the way back again, to use that. ZMM is just a register. You can operate on it immediately and stream data from memory while you do the multiply.
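
To make that concrete, here's a minimal sketch (my own made-up function, assuming n is a multiple of 16) of streaming a fused multiply-add through ZMM registers with intrinsics; no bus transfer, no kernel launch:

    #include <immintrin.h>
    #include <stddef.h>

    /* c[i] += a[i] * b[i], 16 floats per iteration, straight through ZMM. */
    void fma_avx512(float *c, const float *a, const float *b, size_t n)
    {
        for (size_t i = 0; i < n; i += 16) {
            __m512 va = _mm512_loadu_ps(a + i);   /* stream from memory */
            __m512 vb = _mm512_loadu_ps(b + i);
            __m512 vc = _mm512_loadu_ps(c + i);
            vc = _mm512_fmadd_ps(va, vb, vc);     /* 16 FMAs per instruction */
            _mm512_storeu_ps(c + i, vc);
        }
    }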


AMD's HSA is intended to enable memory sharing between the CPU and GPU.


I work in an HPC lab (computational chemistry) and we have a hand-coded AVX-512 codepath. I can't give any specifics (because I don't know them) but I know there is a non-trivial speedup versus the standard codepath (or AVX2).

However, GPUs blow it out of the water at a much lower price, since we only need FP32. I think the main reason we invested time adding support was for the Xeon Phi cards. I guess it could be worthwhile for some FP64 pipelines from a cost/performance perspective.


If you look at the Skylake-SP architecture vs. recent GPUs, the chip design at first glance doesn't seem so different anymore between the two; CPUs are just much less focused, which pays a 2x price in theoretical performance for the same die space, even using Intel's superior process technology. Now that being said, I think the GPU/SIMT model of vector computing is just much smarter. Why make me jump through all these hoops of masking and compiler optimizations if all I want is a branch and an early exit for a specific set of values? GPU schedulers and drivers make this easy to use, with somewhat predictable performance results. Furthermore (and probably more importantly), why is Intel putting this amount of compute power on a CPU without significantly upgrading memory bandwidth? A 28-core Skylake-SP using full vectorization now has 3x (!) the FLOP/byte system balance compared to NVIDIA P100. Seriously? System balance was once an argument against GPUs, but not anymore apparently...


> Now that being said, I think the GPU/SIMT model of vector computing is just much smarter.

I'm not sure. For an argument in favor of vectors, see https://riscv.org/wp-content/uploads/2015/06/riscv-vector-wo...

> Why let me jump through all these hoops of masking and compiler optimizations if all I want is a branch and an early exit for a specific set of values? GPU schedulers and drivers make this easy to use and with somewhat predictable performance results.

If the underlying hw is SIMD (vectors) and not SIMT anyway, as Nvidia hw apparently is, why should I have to go through the effort of rewriting my code in CUDA, and hope that some opaque driver will manage to turn that into efficient vector code?

I mean, ideally I'd just like to write C/C++/Fortran/Julia/Haskell/whatever code, and the compiler would autovectorize it.
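
As a trivial example of the sort of loop that ought to just vectorize (the function is mine, nothing from the article; gcc or clang with -O3 -march=skylake-avx512 will typically turn it into ZMM loads, FMAs and a masked tail):

    /* Plain portable C; the compiler is free to vectorize this itself. */
    void saxpy(float *y, const float *x, float a, int n)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }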

> Furthermore (and probably more importantly), why is Intel putting this amount of compute power on a CPU without significantly upgrading memory bandwidth?

Flops are cheap, bw expensive. But yeah, certainly there are many applications that would benefit from a much better bw/flops ratio.

Then again, with the latest Teslas you have 16 GB with awesome bw, after that you're trying to feed the firehose through the PCIe straw.


Fun fact: AVX-512 came from the design that is now known as Xeon Phi.

You can read Tom Forsyth's story on it on his ask.fm (https://ask.fm/tom_forsyth)

I paste it here because I couldn't figure out how to link to the ask.fm post:

"Q: How did you get involved in the development of Larrabee, and what concepts do you have to know to do such a thing? by Foo Bar

8 months ago

A: It was a rather convoluted process!

* Michael Abrash and Mike Sartain at Rad Game Tools were working with Intel to make a new processor called Simple, Massive Array of Cores (SMAC)

* I was also at Rad Game Tools working on Granny3D and was asked to be the DirectX expert, since I knew the details of the API very well.

* As well as doing general coding on the whole software rendering stack, I helped capture a bunch of shader workloads from existing games, and then wrote a compiler for the SSE-based instruction set SMAC used at the time to prove it would be efficient at running these shaders.

* It wasn't.

* The compiler helped us add instructions to SSE to make it more efficient.

* It still kinda sucked.

* We threw SSE away and started again with a new vector instruction set we called "SMAC New Instructions" (SMACNI). Didn't really know what it had to be, except "not SSE".

* All of us contributed ideas to the new instruction set, and then I'd make the compiler understand each idea, and we'd see how well it worked on these real shader workloads.

* Feature by feature we created SMACNI, and bit by bit I became more of an instruction architect, less of a software coder. I learned a ton about hardware on the way, mainly by asking real architects stupid questions and trying to understand their answers. Hardware is nothing at all like software people imagine it is.

* At some point, SMAC was given an official codename "Larrabee", SMACNI became Larrabee New Instructions (LRBNI), and I stopped working for Rad being a contractor for Intel, and instead became a full time Intel hardware architect - although I still sat at the same desk doing the same job with the same people.

* We made Larrabee 1, aka Knights Ferry, and I started work on the next version of the instruction set and architecture.

* We made Larrabee 2, aka Knights Corner, aka the first Xeon Phi. All exactly the same bit of silicon, just running slightly different software.

* At this point there was a big push to make the next chip, Knights Landing, run all the existing MMX, SSE and AVX code (KNF and KNC didn't run any of those, it was just x86-64 and LRBNI), and conversely to push LRBNI onto the mainstream Intel cores. So I worked with all the rest of the architects at Intel in a massive board for a couple of years to hammer out how to merge these two instruction sets and encodings. The result was AVX512.

* Those meetings were exhausting frustrating work and moved so agonizingly slowly, that once it was completed, it didn't take much persuading from Michael Abrash to go to Valve and work on virtual reality with him instead.

* AVX512 has now shipped inside Knights Landing (the latest Xeon Phi chip), and inside the Skylake Xeon cores. Hopefully we'll see it in the mainstream desktop cores shortly. It's pretty cool seeing an instruction set I designed shipping in so many high-profile cores."


Define significant. I write software that couldn't exist without SIMD, as do all my competitors. (Note: it could exist, but wouldn't offer nearly the same level of capability, i.e. it would be a different product.) By specifying the last several generations of Intel hardware I can guarantee customers can run my software (AVX1 minimum) with consumer-level desktops/laptops. AVX-512 hardware support hasn't reached the average consumer yet, but I am ready to take advantage of it when it does. (signal/video processing)


I don't want to overstate it; I've written a lot of SIMD in my life too. If you've got audio processing that needs to run in a realtime thread, I can see how SIMD is appealing, because we don't really have mechanisms to achieve realtime on the GPU (yet). However, for a lot of seriously heavy computation, it seems like doing it on the GPU is a win. For the cases where memory transfer back and forth to a discrete GPU is expensive, even an integrated GPU should offer significantly more computational resources than AVX-512.

One of the use cases I'm thinking of is font rendering, where I published a SIMD-heavy prototype a couple years ago, and it's blown out of the water by a newer GPU-based approach: http://pcwalton.github.io/blog/2017/02/14/pathfinder/


Have you looked at RISC-V Vector proposal yet? Same person who did AVX-512 for Intel.



They only cause slowdown on some models; if you pay Intel enough, they won't. The advantage of AVX512 is that you can use it on every core (rather than trying to work out how to partition a GPU), and it is useful for lots of general-purpose computations.


What models don't slow down? Are you claiming the slowdown isn't actually necessary to keep the chip stable?



From your links, gold/plat has a ~15% clock speed hit when using ymm muls, then another ~20% hit when going to zmm muls.

So for 512-bit-wide code not to slow things down on Intel's chips, you need ~30% of the runtime on all cores to already be in AVX.


Xeon Phi, maybe? Since the entire raison d'être of that chip is doing AVX-512 computation. Or if you're a glass-half-empty kind of person, you'd argue that it doesn't speed up when running scalar code. :)


It will be a while before AVX-512 becomes practical, however. AMD doesn't support it (so any Ryzen or Threadripper fans will miss out), and even Intel's 8th-gen Coffee Lake doesn't support it.

Only the Intel Extreme i9 and Xeon Silver / Gold / Platinum seem to support it. So the market for this instruction set is quite limited.


FWIW we're only one generation away from AVX512 on consumer CPUs; Intel's upcoming Cannon Lake architecture will support it.

https://www.anandtech.com/show/11928/intels-document-points-...


Well, one generation away from consumers being able to buy the chip. And maybe 5 years away before a sizable number of consumers upgrade to that chip (or newer)... since the typical desktop is at LEAST 5 years old in my experience...

The users who really need the feature are likely upgrading to AVX-512 computers already, i.e. the Mac Pro crowd. So adoption is not as bad as my hyperbole above. But it's still going to be a while before we can assume AVX512 support on machines.

Hell, with so many people running Sandy Bridge (i7-2xxx series), you can't even assume AVX2 support today.


That is the beauty of JIT compilers; some JVMs, like Azul's, already support it.


Some of the AVX512 instruction set seems very well suited for automatic JIT compilers. I'm sure Azul immediately jumped on board for the "Conflict Detection" instruction set and is auto-vectorizing tons more loops.

But other AVX512 instructions don't seem very easy for compilers to apply automatically. In particular: Scatter/Gather instructions have implications for the most efficient way to lay out data in memory. I'm sure an auto-vectorizer can take advantage of them slightly, but it'd take an AVX512 expert to determine the best memory layout for various data structures in this new AVX512 world.

Although I guess an auto-vectorizer could use those instructions to handle more cases... smart programmers would still have to tune their code (or really: their data structures) so that the auto-vectorizer / optimizing compilers can utilize these instructions.
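
To illustrate the layout point (made-up structs, not from any real codebase): with an array-of-structs layout, a 16-wide load of one field needs a gather, while a struct-of-arrays layout gets away with a plain contiguous load:

    #include <immintrin.h>

    struct particle { float x, y, z, w; };   /* array-of-structs layout */

    /* The x fields of 16 consecutive particles are 16 bytes apart, so
       loading them needs a gather with explicit element indices. */
    __m512 load_x_aos(const struct particle *p)
    {
        const __m512i idx = _mm512_setr_epi32(0, 4, 8, 12, 16, 20, 24, 28,
                                              32, 36, 40, 44, 48, 52, 56, 60);
        return _mm512_i32gather_ps(idx, &p->x, 4);   /* scale = 4 bytes */
    }

    /* With a struct-of-arrays layout the same 16 values are contiguous
       and a plain, much cheaper load is enough. */
    __m512 load_x_soa(const float *x)
    {
        return _mm512_loadu_ps(x);
    }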


AFAIK, scatter/gather has been available in supercomputer vector ISAs since the mid-1970s. And even with the state of compiler technology back then, the Cray Fortran compiler was able to use scatter/gather to vectorize loops with indirect addressing (e.g. a[ind[i]]), such as occur in sparse-matrix-style computations.
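
For instance, a loop like y[i] += a[ind[i]] * x[i] maps directly onto the AVX-512 gather (a rough hand-written sketch, assuming n is a multiple of 16):

    #include <immintrin.h>

    void sparse_axpy(float *y, const float *a, const int *ind,
                     const float *x, int n)
    {
        for (int i = 0; i < n; i += 16) {
            __m512i vind = _mm512_loadu_si512(ind + i);
            __m512  va   = _mm512_i32gather_ps(vind, a, 4);  /* a[ind[i]] */
            __m512  vx   = _mm512_loadu_ps(x + i);
            __m512  vy   = _mm512_loadu_ps(y + i);
            _mm512_storeu_ps(y + i, _mm512_fmadd_ps(va, vx, vy));
        }
    }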


Couldn't you use something like GCC function multi-versioning?
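
Something like this, I mean (an untested sketch; the function name is just for illustration). GCC builds one clone per target and dispatches to the best one at load time:

    /* GCC function multi-versioning via target_clones. */
    __attribute__((target_clones("avx512f", "avx2", "default")))
    void scale(float *y, const float *x, float a, int n)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i];
    }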


AVX512 has so many more features above and beyond Intel's typical SIMD implementation. Feature-wise, it's beginning to be competitive with NVidia's PTX CUDA architecture. Like, AVX512 is a really, really good instruction set (or I guess: a really good set of instruction sets).

Assuming AVX512 F, CD, VL, DQ, and BW (the expected AVX512 instructions in Cannon Lake):

* AVX512F -- "Standard" 512-bit arithmetic already has major improvements, above and beyond the 256-bit -> 512-bit upgrade. AVX512 has 32 registers per core (where AVX2 and earlier only have 16). The new set of opmask instructions also allows way more code to be turned into "branch-free" code, which is friendly for pipelines (see the sketch after this list). This alone is already a major step forward, with huge implications for multimedia code.

* AVX512-CD: Conflict Detection. These instructions allow auto-vectorizers to "resolve loop conflicts" and auto-vectorize more code.

* VL, DQ -- Extend AVX512 to Bytes, Shorts, Longs, Long Longs.

* BW -- Extend AVX512 to operate on only 256-bit and 128-bits at a time.
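
As a small sketch of the opmask point above (my own example, nothing from the article): a per-lane "if" becomes a k-register mask plus a masked move, with no branch at all:

    #include <immintrin.h>

    /* Branch-free "if (x[i] < 0) x[i] = 0" across 16 floats. */
    void relu16(float *x)
    {
        __m512    v   = _mm512_loadu_ps(x);
        __mmask16 neg = _mm512_cmp_ps_mask(v, _mm512_setzero_ps(), _CMP_LT_OQ);
        v = _mm512_mask_mov_ps(v, neg, _mm512_setzero_ps());   /* zero only where negative */
        _mm512_storeu_ps(x, v);
    }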

--------------------

I'm certain that some code, which could not be vectorized in AVX2 (or lower), will be vectorized with AVX512. Maybe even automatically as compiler writers implement high-level features / auto-vectorizers.


I wonder why they did the BW thing instead of just defining a vector length register like other vector ISAs (which would have allowed getting rid of a remainder loop, leading to less code bloat and more efficient execution for short loops where the number of iterations is not an integer multiple of the ISA vector length).


VL - Extend AVX512 to operate on only 256-bit and 128-bits at a time. (vector length extension)

DQ - Extend AVX512 to Longs, Long Longs. (double word and quadword extension)

BW - Extend AVX512 to Bytes, Shorts. (byte and word extension)


Intel scored an own goal here by limiting availability of AVX512 to select Xeon SKUs.


Yep. If you use Intel ISPC to write vector code (and you should; it's seriously underrated), it can also target multiple instruction sets and dispatch to the best supported one at runtime.


Yes, that or manual runtime dispatching. Which is doable if there are only a few hot-spots in the code.

It'd be so sweet if multi-versioning became well supported across all major compilers (including MSVC).
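
A rough sketch of the manual version (hypothetical kernel names, using GCC's __builtin_cpu_supports):

    /* Hypothetical kernels, each compiled with matching target attributes. */
    void kernel_avx512(float *y, const float *x, int n);
    void kernel_avx2  (float *y, const float *x, int n);
    void kernel_scalar(float *y, const float *x, int n);

    /* Pick the best implementation once, e.g. at startup. */
    void (*pick_kernel(void))(float *, const float *, int)
    {
        if (__builtin_cpu_supports("avx512f")) return kernel_avx512;
        if (__builtin_cpu_supports("avx2"))    return kernel_avx2;
        return kernel_scalar;
    }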


Skylake-X Core i7 CPUs also support AVX-512

https://ark.intel.com/products/123767/Intel-Core-i7-7820X-X-...

That's still a pretty limited selection of processors though.


Only if you're writing general-purpose software. We're taking advantage of AVX-512 in HPC, since re-writing the software to take advantage of the hardware is a smart investment given the large capital costs involved.


    document.querySelector('#k2Container').style.color = 'black';
and the blog post becomes almost readable.

Other than that, nice intro.


Yeah, sorry about that, that site's getting replaced soon.


Might want to also think about SSL. Certificates are free from Let's Encrypt.

That said, props for the IPv6 support!


It's still far better than the 2x font size sites.


I changed some Golang code to AVX in my last project. In isolation that code ran like 2-4x faster, but as part of the full program, the program was 5% slower overall. Could never make sense of it. Any thoughts on how to determine the cause?


AVX code is known to make the CPU run way hotter than usual. Perhaps that caused throttling that made general code running at the same time, or within a short span thereafter, perform worse?


That is one theory I had, but I'm not sure how to determine whether the CPU is throttling (on Ubuntu Linux).


You can use lscpu, which is less accurate, but the best way is to check:

cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq

where instead of cpu0 you can write any core number, and it will give you the current frequency of that core in kHz.


I'll try that out. Thanks.


Note that with pstate, /proc/cpuinfo is not reflective of the actual clock (not suggested here, but in the past the MHz field used to change to reflect the scaling speed). You could also look at 'powertop'.


One way I've found to isolate this is to turn off turbo boost and underclock the CPU, while making sure the AVX offset is set to 100%. While you would never want to run a production system like this, it does help to eliminate any issues with CPU throttling. If the AVX-512 version of the program still runs slower, then something else is interfering besides CPU throttling.


Did you remember to put a vzeroupper instruction at the end of your AVX functions? If not then you pay a performance penalty when transitioning from AVX to SSE code, and the rest of your program might be inadvertently touching SSE code hidden inside the Go runtime.
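
For reference, this is the pattern I mean (a hand-rolled sketch; compilers generating AVX from intrinsics usually insert the vzeroupper for you, hand-written assembly must not forget it):

    #include <immintrin.h>

    /* Small AVX routine: clear the upper YMM halves before returning so
       later SSE code doesn't pay the state-transition penalty. */
    void scale8(float *y, const float *x, float a)
    {
        __m256 va = _mm256_set1_ps(a);
        _mm256_storeu_ps(y, _mm256_mul_ps(va, _mm256_loadu_ps(x)));
        _mm256_zeroupper();   /* emits vzeroupper */
    }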


Yes I already do that.


Running AVX-512 on all cores will lower the all-core operating frequency, in order to maintain the TDP of the processor.

When you run it in isolation there's more headroom, since a single core can run at a higher frequency while using the AVX512 registers.

Also, the power license ensures the cores running AVX512 code run at a lower frequency.

Make sure to guard your AVX512 code block with VZEROUPPER when exiting it.


Cache invalidation may be a reason.


Doesn't mention what I find to be the coolest part of AVX-512: the conflict detection instructions. Finally a way to vectorize loops with indirect loads!
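
A rough sketch of the idea (my own example; a real implementation would resolve the conflicting lanes in-register rather than bailing out to scalar):

    #include <immintrin.h>

    /* hist[idx[i]]++ for 16 indices at a time. vpconflictd reports which
       lanes share an index; if none do, gather/add/scatter is safe. */
    void histogram16(int *hist, const int *idx)
    {
        __m512i vidx = _mm512_loadu_si512(idx);
        __m512i conf = _mm512_conflict_epi32(vidx);
        if (_mm512_test_epi32_mask(conf, conf) == 0) {   /* all lanes distinct */
            __m512i v = _mm512_i32gather_epi32(vidx, hist, 4);
            v = _mm512_add_epi32(v, _mm512_set1_epi32(1));
            _mm512_i32scatter_epi32(hist, vidx, v, 4);
        } else {
            for (int i = 0; i < 16; i++)   /* fall back for this chunk */
                hist[idx[i]]++;
        }
    }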


for slowing down awkward code?





