AVX-512: when and how to use these new instructions (lemire.me)
188 points by ingve on Sept 7, 2018 | 52 comments



I read a post yesterday on a generalized notion of compositionality [0]. It was neat and extolled the virtues of modularity and compositionality and being able to reason about a system by reasoning about its parts.

If I'm understanding OP, this means that to use the AVX-512 instructions well, a compiler has to think about instruction speed as a function of what other instructions are around it. It might be faster to write operation X with these instructions than without, but only if you don't also write operation Y with them, because then the CPU would get too hot.

That sounds so much harder! Hot damn! I know CPUs are complicated and 1 instruction = 1 cycle is wrong in many ways, but this just sounds especially difficult.

[0] https://news.ycombinator.com/item?id=17923075


The compiler would need to know what other programs are running on the core, or the computer (depending on CPU/instructions). It has no hope!

This is why it is advisable for SIMD libraries or libraries that use SIMD to always offer some levers for this kind of thing.

This makes it a pain as you may have to write the same function 2 or 3 times in different SIMD instruction sets, or use a library to do that for you:

https://github.com/jackmott/simdeez
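
A minimal hand-rolled version of that dispatch in C, for concreteness (a sketch assuming GCC/clang builtins; the `sum_*` names and the `allow_wide` lever are illustrative):

    #include <immintrin.h>
    #include <stddef.h>

    /* The same kernel written twice; the AVX2 body needs AVX2 codegen. */
    __attribute__((target("avx2")))
    static float sum_avx2(const float *x, size_t n) {
        __m256 acc = _mm256_setzero_ps();
        size_t i = 0;
        for (; i + 8 <= n; i += 8)
            acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));
        float buf[8], s = 0.0f;
        _mm256_storeu_ps(buf, acc);
        for (int j = 0; j < 8; j++) s += buf[j];
        for (; i < n; i++) s += x[i];
        return s;
    }

    static float sum_scalar(const float *x, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++) s += x[i];
        return s;
    }

    /* The "lever": pick once at startup, and let the caller veto the
       wide variant if the frequency cost isn't worth it. */
    float (*sum)(const float *, size_t);
    void init_simd(int allow_wide) {
        sum = (allow_wide && __builtin_cpu_supports("avx2"))
                  ? sum_avx2 : sum_scalar;
    }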


This situation gets absolutely awful when you consider that the Bronze and Silver Xeons do even more aggressive down throttling. Bronze speed plummets if only one core is doing AVX instructions.

Compilers can't hope to realistically handle this. JITs at least have a chance, but adding in handling for this behaviour surely requires a lot more complexity than I'd imagine most runtime developers would want to add to their code.


Since bronze and silver largely only have one AVX-512 FP unit, running AVX-512 in the L2 license is almost totally pointless: you'd often be better off running twice as many AVX/AVX2 instructions on the two 256-bit units since you run at a higher frequency and the FLOP/cycle is the same.

The exception would be if your kernel can make some good use of other wide instructions such as memory access or shuffles.
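
Back-of-the-envelope for fp32, counting an FMA as two FLOPs:

    1 unit  x 512-bit FMA = 16 lanes x 2 = 32 FLOP/cycle, at the L2 frequency
    2 units x 256-bit FMA = 16 lanes x 2 = 32 FLOP/cycle, at the L0/L1 frequency

Same per-cycle throughput either way, so the higher AVX2 clock wins.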


For a relevant JIT library, see libxsmm, once submitted here with no take-up.


Essentially a compiler would have to add energy pressure alongside its existing scheduling for memory latency/pressure.

Doable, but someone will have to take the first jump.


Not really.

The power throttling and voltage gating that goes on takes a long time: at least microseconds, up to a few milliseconds. The scheduling concerns that compilers deal with play out over tens to hundreds of clock cycles, a factor of well over a thousand smaller.


Sure, but it's not really a scheduling decision. I think the GP is correct in as much as compilers now have to make the hard choice of whether to use any AVX at all, and it's a global trade-off: even though using a few 64-byte moves might be locally optimal, you now need a higher license and hence a slower CPU, and you can only evaluate whether that trade-off makes sense in the scope of the larger program: how many such speedups do you get, and do they compensate for the lower frequency?
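
To make that concrete with made-up but plausible numbers: say 10% of your cycles get a 1.3x per-cycle speedup from AVX-512, but the whole program drops to 85% frequency under the L1 license:

    new_time / old_time = (0.90 + 0.10 / 1.3) / 0.85 ≈ 0.977 / 0.85 ≈ 1.15

That's a 15% net slowdown even though the vectorized part was locally faster.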


Curious: does any compiler implement any kind of general algorithm for "memory pressure"? For register allocation (hence register pressure) they do, I think, but the memory layout, at least in lower-level languages, is mostly fixed by the source, so I didn't think there was much flexibility there.


It seems like this should enable JIT compilers (java/.net/js) to realise even more of their theoretical gains.


At the cost of some potentially significant complexity in the code. I wonder if the trade off would be worth it.


I wonder if one could generate multiple implementations of a code block and select at runtime which is the best one given the current state of the CPU. Obviously this would require some architecture changes to have a multi-address-select jump or whatever, but this fundamentally seems like a problem only solvable with information known at runtime.

...though, come to think about it, this would be pretty easy with a tracing JIT.


It’s called an indirect jump. If you use C++ virtual functions, C function pointers, or Go interfaces it happens all the time.


That is entirely different functionality from what I am suggesting. The select would be based on some internal CPU state other than a register.


That explains what I observed while optimizing convolution functions in OpenBLAS, adding support for AVX-512 where AVX2 support was already there. On a 36-core Xeon Platinum there was no measurable gain when moving to AVX-512, maybe even a slight loss, when running intensive convolutions full of fused multiply-adds (FMA). I was puzzled and thought it was due to hitting the memory bandwidth limit (but prefetching more didn't help), or due to the pipeline, but again, unrolling the loop and reshuffling AVX-512 instructions in between "lighter" instructions didn't seem to help either. I should have monitored the CPU clock changes; I had no idea there was this kind of dependency. Thank you for the article!

I now wonder what the average ratio of heavy 512-bit AVX-512 instructions vs lighter 256-bit instructions should be to avoid getting into permanent L2 mode and stay at L1. Maybe interleaving one 512-bit loop unroll with several 256-bit unrolls may still yield some gains under heavy AVX usage, something like the sketch below.
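
(A hypothetical sketch of what I mean; the 1:2 heavy-to-light ratio is a pure guess and would have to be benchmarked per chip:)

    #include <immintrin.h>
    #include <stddef.h>

    /* Hypothetical dot-product mix: 32 floats per iteration, one "heavy"
       512-bit FMA per two 256-bit FMAs. */
    __attribute__((target("avx512f")))
    float dot_mixed(const float *a, const float *b, size_t n) {
        __m512 acc0 = _mm512_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps(), acc2 = _mm256_setzero_ps();
        size_t i = 0;
        for (; i + 32 <= n; i += 32) {
            acc0 = _mm512_fmadd_ps(_mm512_loadu_ps(a + i),
                                   _mm512_loadu_ps(b + i), acc0);
            acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16),
                                   _mm256_loadu_ps(b + i + 16), acc1);
            acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24),
                                   _mm256_loadu_ps(b + i + 24), acc2);
        }
        float s = _mm512_reduce_add_ps(acc0), buf[8];
        _mm256_storeu_ps(buf, _mm256_add_ps(acc1, acc2));
        for (int j = 0; j < 8; j++) s += buf[j];
        for (; i < n; i++) s += a[i] * b[i];
        return s;
    }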


> I now wonder what the average ratio of heavy 512-bit AVX-512 instructions vs lighter 256-bit instructions should be to avoid getting into permanent L2 mode and stay at L1.

One of the co-authors talks more about the exact limits here: https://www.realworldtech.com/forum/?threadid=179654&curpost...

The quick answer is that on the W-2104 system he tested, you can sustain one FMA every 2 cycles while still remaining in the medium speed L1 state.

His avx-turbo tool (https://github.com/travisdowns/avx-turbo) can be used to check the situation for your particular processor.


Does the OpenBLAS work follow BLIS and libxsmm? There's been at least some discussion of AVX-512 trade-offs around them, and the authors would probably advise. (Not that there's a single "avx512"...)


These instructions will be around for a long time, but their performance attributes will change in 5 minutes when Intel releases the next wave of processors.

I think given the current state of things it would be irresponsible for compilers to generate heavy instructions unless asked. Forget trying to be smart about it ... we already fail to be smart about things that are much simpler and more visible.

More interestingly, this may be what all CPU behavior looks like in 10 years, because if Intel has to resort to this kind of design now, why would that change any time soon? Instead of worrying primarily about keeping the execution units full, people trying to write fast code may be primarily concerned with keeping them NOT full so that the chip doesn't slow down. Which sounds crazy and hard to deal with.


Fwiw, autovectorizers tend to be pretty conservative about it.

So to use cutting-edge instructions you generally have to hand-code them (either in intrinsics, like Lemire did there, or in asm).


Especially for FP instructions, where compilers are heavily restricted by the standard and IEEE754 semantics: vectorization often changes the order of operations, and FP math is generally not associative.

For integer operations, auto-vectorization is more prevalent since everything can be reordered more freely. clang especially auto-vectorizes a ton of stuff even at -O2.
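
For instance, something as plain as this (an illustrative example; check your own compiler's output, the -march value is arbitrary):

    /* clang -O2 -march=haswell typically turns this loop into 256-bit
       vpaddd with no extra flags or pragmas. */
    void add_i32(int *restrict dst, const int *restrict a,
                 const int *restrict b, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }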


But that's precisely why most language standards don't require strict IEEE754 semantics. And why people mostly compile with -ffast-math or equivalent when they do.


You won't get floating-point SIMD of any kind without the -ffast-math flag on gcc, msvc, or llvm, or probably anything.
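
The classic case is a reduction, where the required left-to-right evaluation order blocks vectorization (a minimal example):

    /* Vectorizing this turns (((s+x0)+x1)+x2)+... into independent
       per-lane partial sums, a reassociation IEEE754 semantics forbid
       by default; -ffast-math (or -fassociative-math) permits it. */
    float sum(const float *x, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += x[i];
        return s;
    }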


A modern CPU running an instruction set specifically designed for high-performance computing, working under proper operational conditions and with water cooling, can still overheat and trigger its internal throttling. And since Turbo Boost, throttling is explicitly used not only as a power-saving or protective measure, but as part of normal operation.

So we are now at the closest point to the CPU power wall in history ...


I'm no CPU engineer, but this smells like some sort of chip-level macro function that cuts corners even more than the spectre/meltdown issue.

I remember joke opcodes back in the day. One was "Halt and catch fire". Is that seriously what this is doing?


It would be easy for Intel to make a CPU that never throttled down: they would just have to clock it low enough that it would never run into thermal problems no matter how much load the CPU was experiencing. But then it would be much worse for everybody's use case except maybe CPU reviewers.

Changing the clock speed based on thermal headroom is really hard to do well, but Intel does it well, and most other chip-makers are trying to duplicate Intel. This is really the opposite of those old chips that would sometimes burst into flame: a modern CPU which throttles based on temperature will never burst into flame even if you remove the heat sink and put it in a 150-degree oven.


AVX-512 does not get split into micro-ops (on Intel server-class CPUs; it does on some Intel consumer CPUs).

Amusingly, Zen splits AVX/AVX2 into micro-ops (two 128-bit halves) and it performs better under some workloads.

The real issue is that propagating 512 bits' worth of signal is extremely non-trivial and costs a shitload of power. Just `mov`'ing into the AVX-512 registers (initially, when the unit is not warm) can stall the CPU for 10,000+ cycles as it tries to power on all those registers.
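
If that's right, code that cares would warm the unit up before a timed kernel, something like this hypothetical sketch (the iteration count is a guess, not a measured number):

    #include <immintrin.h>

    /* Issue throwaway 512-bit ops so the unit is powered up before
       the real kernel runs. */
    __attribute__((target("avx512f")))
    static void warm_up_avx512(void) {
        __m512i v = _mm512_set1_epi32(1);
        for (int i = 0; i < 10000; i++)
            v = _mm512_add_epi32(v, v);
        volatile int sink = _mm512_reduce_add_epi32(v);  /* defeat DCE */
        (void)sink;
    }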


Someone analyzed AVX512 performance on the elusive Lenovo laptop? Link!

Also, source on the power up stall? AVX(2) didn’t have that and I’m highly surprised AVX512 would. Agner at least claims the same reduced throughput during warmup, but I think he only has early silicon.


AVX(2) definitely had the power-up stall on many chips, including all client Skylake I think.


No, it had reduced throughput of AVX instructions while the ALUs powered up. Not a stall.


Yeah maybe you are right for Skylake client, I haven't tested carefully there, but I'll probably get around to it. This thread [1] indicates that it may have only been Haswell that had the halted portion.

As for Skylake-SP, however, that chip is reported to have both reduced throughput and fully halted periods in [2].

Some have speculated it has to do with whether chips have an integrated IVR: the models with an integrated IVR having less capability of handling high dI/dt events. I don't know about that though (Skylake-SP still has an external VR, right?).

[1] https://www.agner.org/optimize/blog/read.php?i=378#378 [2] https://software.intel.com/en-us/comment/1926876#comment-192...


> cuts corners even more than the spectre/meltdown issue.

Indeed. If you have some piece of code in a different security context that conditionally executes a heavy instruction based on a decision made over some sensitive data, doesn't this provide a way to obtain information about that data?


> If you have some piece of code in a different security context that conditionally executes a heavy instruction based on a decision made over some sensitive data, doesn't this provide a way to obtain information about that data?

Yes. http://www.numberworld.org/blogs/2018_6_16_avx_spectre/


Yes, this is basically called NetSpectre. Well, NetSpectre describes two side channels, but the faster of the two is an AVX clock/transition-related side channel that relies on this CPU downclocking behavior.


If that's the case then you're leaking a lot more information via the power draw and timing than you are via the clock speed.


Is this a realistic scenario with AVX-512?


Oh now I get the name of that TV show! I never realised it was the name of a joke opcode before, thanks :)


Sometimes it's only partly a joke. https://en.wikipedia.org/wiki/Halt_and_Catch_Fire


I'm still upset about the permanent-until-vzeroupper "transition" penalty on non-VEX instructions on Skylake after any VEX instruction.


Do you know anything about that? I know some ABIs have a vzeroupper at the end of every method - what is this for, and do you know how expensive it is?


Looks to be a complicated issue, with very different performance profiles across different microarchitectures:

https://software.intel.com/en-us/forums/intel-isa-extensions...
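
Roughly what it looks like in intrinsics (a sketch; compilers following the ABI insert the vzeroupper automatically at function exit):

    #include <immintrin.h>

    __attribute__((target("avx")))
    void avx_work(float *p) {
        __m256 v = _mm256_loadu_ps(p);   /* dirties the upper 128 bits */
        v = _mm256_add_ps(v, v);
        _mm256_storeu_ps(p, v);
        _mm256_zeroupper();              /* emits vzeroupper: clears the
                                            upper halves so later SSE code
                                            avoids the transition penalty */
    }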


> Intel cores can run in one of three modes: license 0 (L0) is the fastest (and is associated with the turbo frequencies “written on the box”), license 1 (L1) is slower and license 2 (L2) is the slowest.

Wow, let's give the word "license" semantics not related to software licensing, and cause confusion between L1 meaning "L1 cache" and "license 1".


AIUI this is because the power scheduler gives the core the license to actually execute those instructions. Until it is granted, the core must emulate them as multiple vector operations of smaller stride size.


The terminology isn't great, but I think it's at least better than the marketing terms for the levels which use something like "non-AVX turbo", "AVX2 turbo" and "AVX-512 turbo" for L0,1,2 respectively. This is really confusing because most AVX code will actually run in the faster non-AVX turbo and similarly for AVX-512 as described. So the marketing names are actually too pessimistic.


This post cites sources that actively contradict the points it attempts to make. It aggregates a wealth of useful information against its own points.

Let us begin:

    However, there are also deterministic frequency 
    reductions based specifically on which instructions you
    use and on how many cores are active (downclocking).
As you are likely using AVX-512 in a cloud deployment, you don't have access to any of this information, and you are likely sharing that hardware with tenants who may not apply the same engineering rigor.

Also, nobody is setting their CPU affinity to ensure the down-clocking stays on a single core; you have to pray the scheduler doesn't shuffle your workload around. This requires a lot of platform-specific C, and most people are writing Java/Go/Ruby/Python where doing bit-twiddling NUMA management is impossible. Furthermore, the information you have access to in a cloud environment (which is where you'll be using advanced AVX-512 unless you work for Amazon, Google, Intel, or Cloudflare) may just lie about core count and NUMA architecture.

Also, this is just false [1]. Running AVX-512 adjusts the BASE clock for the package. There are throttling attempts to ensure every other core on the package throttles back. This is a package-wide effect, not a per-core effect.

    Light instructions include integer operations other than
    multiplication, logical operations, data shuffling
    (such as vpermw and vpermd) and so forth.
This is false according to Cloudflare [2], which you've linked. They test your "light" adds, shifts, and XORs (these are the only operations in ChaCha20 [4]). It cost too much.
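
For reference, the ChaCha20 quarter-round really is nothing but 32-bit adds, XORs, and rotates (per RFC 7539):

    /* All "light" operations by the article's own taxonomy. */
    #define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))
    #define QUARTERROUND(a, b, c, d)          \
        a += b; d ^= a; d = ROTL32(d, 16);    \
        c += d; b ^= c; b = ROTL32(b, 12);    \
        a += b; d ^= a; d = ROTL32(d, 8);     \
        c += d; b ^= c; b = ROTL32(b, 7);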

    We have chosen to only include two columns.
I'll include the whole thing [3]. Wow, yeah, the entire package's power curve is changing: the base clock, the cores not affected, all their clocks are changing. It's almost like having 1 out of 24 cores active still affects all 24 cores...

    For example, the openssl project used heavy AVX-512
    instructions to bring down the cost of a particular
    hashing algorithm (poly1305) from 0.51 cycles per byte
    (when using 256-bit AVX instructions) to 0.35 cycles
    per byte, a 30% gain on a per-cycle basis. They have
    since disabled this optimization.
The literal example meant to show AVX-512 is good ends with a statement that people using AVX-512 are now actively avoiding it.

This is less than content; do you have an agenda, or are you just an idiot?

[1] https://en.wikichip.org/wiki/intel/xeon_silver/4116

[2] https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...

[3] https://en.wikichip.org/wiki/intel/xeon_gold/5120

[4] https://en.wikipedia.org/wiki/Salsa20


> Also, this is just false [1]. Running AVX-512 adjusts the BASE clock for the package. There are throttling attempts to ensure every other core on the package throttles back. This is a package-wide effect, not a per-core effect.

The explanatory page accompanying that table contains text that contradicts your claim: https://en.wikichip.org/wiki/intel/frequency_behavior

> The frequency of each core is determined independently based on the workload described above. That is, cores running Non-AVX workloads can enjoy the full regular turbo frequency, whereas cores executing AVX-512 or AVX2 will operate at their own designated turbo frequencies. [...]

> In Haswell, an AVX2 workload on one core meant all cores were capped at AVX2 Turbo frequency. This had the undesirable effect of reducing performance for non-AVX workloads on cores that were unrelated to the cores executing AVX2 workloads. This behavior was changed with Broadwell which grouped cores executing AVX2 workloads together and cores executing non-AVX workloads separately, allowing the former cores group to execute at the lower AVX2 turbo frequency while having the later cores group execute at full non-AVX2 turbo.

Also:

> The literal example to show AVX-512 is good at ends with statement that people using AVX-512 are now actively avoiding it.

It's an example to show that cycle-by-cycle speedups due to the use of AVX-512 are not always worthwhile, especially in library code. Which was one of the points the article was trying to make. It's fine if that was an obvious point from your perspective, but it doesn't contradict the article and it doesn't make it "less than content".


Please don't break the HN guidelines by becoming uncivil.

https://news.ycombinator.com/newsguidelines.html


please stop blinding me by making downvoted comments one shade away from the background color


If you want to read one in glorious black, simply click on its timestamp to go to its page.


This is a reasonable criticism.

I'm not a fan of the delivery, but I absolutely agree with you.


Agreed as well. It's annoying because they think they're de-emphasizing downvoted comments by fading them out, when they're actually calling more attention to them by requiring the reader to invest more effort to read them.

It's reminiscent of the infamous "disemvoweling" strategy used on a few other forums, where the reader is forced to decide whether they want to painstakingly reconstruct offensive and abusive comments or blindly trust someone else to restrict what they see.

Life would be so much easier if they just displayed the comment score like most other moderated forums and let the reader decide the merits of the comments based on visible information.


So what I got from the article was simply be careful of AVX2 & AVX-512 because it can result in frequency reduction (both of the nominal core and potentially other cores). Is this an inaccurate reading of the situation?


> Also, this is just false [1]. Running AVX-512 adjusts the BASE clock for the package. There are throttling attempts to ensure every other core on the package throttles back. This is a package-wide effect, not a per-core effect.

This was true of server CPUs before Skylake-SP, but on Skylake-SP this effect is per core. This was widely touted by Intel since it was an improvement over the old behavior.

What the other cores are doing still matters since the count of "active cores" is used to look up the turbo frequency for each other core, depending on its license: but for this purpose it only matters if the others are running or halted, not _what_ they are running.

If you don't believe it, I've shared a benchmark you can try yourself if you have access to a Skylake server [1]. Run with --spec avx512_fma_t/1,scalar_iadd/3, for example, to kick off 1 core of heavy FMA ops in parallel with 3 cores of scalar-only ops. You'll see only the FMA core drop down to the L2 license.

    Light instructions include integer operations other than
    multiplication, logical operations, data shuffling
    (such as vpermw and vpermd) and so forth.
> This is false according to Cloudflare [2], which you've linked. They test your "light" adds, shifts, and XORs (these are the only operations in ChaCha20 [4]). It cost too much.

Again, you can test this yourself with avx-turbo [1]; there are a variety of tests there showing that everything except FP operations and integer multiplications (which execute on the FP unit) is treated as light. Of course, the tests aren't exhaustive, but they hit the main categories of instructions.

Note that even light AVX-512 instructions cause the chip to transition to a lower frequency (the so-called L1 license, which is usually about halfway between the fastest L0 and slowest L2 speed). So even if ChaCha doesn't use any heavy instructions, any AVX-512 at all will slow down your frequency. They only reported a 5% to 7% reduction in performance, which could easily be consistent with a downclock to the L1 frequency.

> The literal example meant to show AVX-512 is good ends with a statement that people using AVX-512 are now actively avoiding it.

I think the example is intended to show that AVX-512 indeed significantly speeds up the parts of the code where it is applied: but if that makes up a small part of your code overall, it might not be worth it, because the rest of your code may suffer a frequency penalty.

Unlike some earlier ISA extensions, there is no simple answer like "use it" or "don't": there are complex tradeoffs. That's what you should get out of this article.

[1] https://github.com/travisdowns/avx-turbo



