AVX-512: when and how to use these new instructions (lemire.me)
188 points by ingve on Sept 7, 2018 | 52 comments



I read a post yesterday on a generalized notion of compositionality [0]. It was neat and extolled the virtues of modularity and compositionality and being able to reason about a system by reasoning about its parts.

If I'm understanding OP, this means that to use the AVX-512 instructions well, a compiler has to think about instruction speed as a function of what other instructions are around it. It might be faster to write operation X with these instructions than without, but only if you don't also write operation Y with them, because then the CPU would get too hot.

That sounds so much harder! Hot damn! I know CPUs are complicated and 1 instruction = 1 cycle is wrong in many ways, but this just sounds especially difficult.

[0] https://news.ycombinator.com/item?id=17923075


The compiler would need to know what other programs are running on the core, or the computer (depending on CPU/instructions). It has no hope!

This is why it is advisable for SIMD libraries or libraries that use SIMD to always offer some levers for this kind of thing.

This makes it a pain as you may have to write the same function 2 or 3 times in different SIMD instruction sets, or use a library to do that for you:

https://github.com/jackmott/simdeez
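
A minimal hand-rolled version of that dispatch in C, for concreteness (a sketch assuming GCC/clang builtins; the `sum_*` names and the `allow_wide` lever are illustrative):

    #include <immintrin.h>
    #include <stddef.h>

    /* The same kernel written twice; the AVX2 body needs AVX2 codegen. */
    __attribute__((target("avx2")))
    static float sum_avx2(const float *x, size_t n) {
        __m256 acc = _mm256_setzero_ps();
        size_t i = 0;
        for (; i + 8 <= n; i += 8)
            acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));
        float buf[8], s = 0.0f;
        _mm256_storeu_ps(buf, acc);
        for (int j = 0; j < 8; j++) s += buf[j];
        for (; i < n; i++) s += x[i];
        return s;
    }

    static float sum_scalar(const float *x, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++) s += x[i];
        return s;
    }

    /* The "lever": pick once at startup, and let the caller veto the
       wide variant if the frequency cost isn't worth it. */
    float (*sum)(const float *, size_t);
    void init_simd(int allow_wide) {
        sum = (allow_wide && __builtin_cpu_supports("avx2"))
                  ? sum_avx2 : sum_scalar;
    }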


This situation gets absolutely awful when you consider that the Bronze and Silver Xeons do even more aggressive down throttling. Bronze speed plummets if only one core is doing AVX instructions.

Compilers can't hope to realistically handle this. JITs at least have a chance, but adding in handling for this behaviour surely requires a lot more complexity than I'd imagine most runtime developers would want to add to their code.


Since bronze and silver largely only have one AVX-512 FP unit, running AVX-512 in the L2 license is almost totally pointless: you'd often be better off running twice as many AVX/AVX2 instructions on the two 256-bit units since you run at a higher frequency and the FLOP/cycle is the same.

The exception would be if your kernel can make some good use of other wide instructions such as memory access or shuffles.
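
Back-of-the-envelope for fp32, counting an FMA as two FLOPs:

    1 unit  x 512-bit FMA = 16 lanes x 2 = 32 FLOP/cycle, at the L2 frequency
    2 units x 256-bit FMA = 16 lanes x 2 = 32 FLOP/cycle, at the L0/L1 frequency

Same per-cycle throughput either way, so the higher AVX2 clock wins.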


For a relevant JIT library, see libxsmm, once submitted here with no take-up.


Essentially a compiler would have to add energy pressure alongside its existing scheduling for memory latency/pressure.

Doable, but someone will have to take the first jump.


Not really.

The power throttling and voltage gating that goes on takes a long time: at least microseconds, up to a few milliseconds. The scheduling concerns that compilers deal with play out over tens to hundreds of clock cycles, a factor of well over a thousand smaller.


Sure, but it's not really a scheduling decision. I think the GP is correct in as much as compilers now have to make the hard choice of whether to use any AVX at all, and it's a global trade-off: even though using a few 64-byte moves might be locally optimal, you now need a higher license and hence a slower CPU, and you can only evaluate whether that trade-off makes sense in the scope of the larger program: how many such speedups do you get, and do they compensate for the lower frequency?
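
To make that concrete with made-up but plausible numbers: say 10% of your cycles get a 1.3x per-cycle speedup from AVX-512, but the whole program drops to 85% frequency under the L1 license:

    new_time / old_time = (0.90 + 0.10 / 1.3) / 0.85 ≈ 0.977 / 0.85 ≈ 1.15

That's a 15% net slowdown even though the vectorized part was locally faster.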


Curious: does any compiler implement any kind of general algorithm for "memory pressure"? For register allocation (hence register pressure) they do, I think, but the memory layout, at least in lower-level languages, is mostly fixed by the source, so I didn't think there was much flexibility there.


It seems like this should enable JIT compilers (java/.net/js) to realise even more of their theoretical gains.


At the cost of some potentially significant complexity in the code. I wonder if the trade off would be worth it.


I wonder if one could generate multiple implementations of a code block and select at runtime which is the best one given the current state of the CPU. Obviously this would require some architecture changes to have a multi-address-select jump or whatever, but this fundamentally seems like a problem only solvable with information known at runtime.

...though, come to think about it, this would be pretty easy with a tracing JIT.


It’s called an indirect jump. If you use C++ virtual functions, C function pointers, or Go interfaces it happens all the time.


That is entirely different functionality from what I am suggesting. The select would be based on some internal CPU state other than a register.


That explains what I observed while optimizing convolution functions in OpenBLAS, adding support for AVX-512 where AVX2 support was already there. On a 36-core Xeon Platinum there was no measurable gain when moving to AVX-512, maybe even a slight loss, when running intensive convolutions full of fused multiply-adds (FMA). I was puzzled and thought it was due to hitting the memory bandwidth limit (but prefetching more didn't help), or due to the pipeline, but again, unrolling the loop and reshuffling AVX-512 instructions in between "lighter" instructions didn't seem to help either. I should have monitored the CPU clock changes; I had no idea there was this kind of dependency. Thank you for the article!

I now wonder what the average ratio of heavy 512-bit AVX-512 instructions vs lighter 256-bit instructions should be to avoid getting into permanent L2 mode and stay at L1. Maybe interleaving one 512-bit loop unroll with several 256-bit unrolls may still yield some gains under heavy AVX usage, something like the sketch below.
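
(A hypothetical sketch of what I mean; the 1:2 heavy-to-light ratio is a pure guess and would have to be benchmarked per chip:)

    #include <immintrin.h>
    #include <stddef.h>

    /* Hypothetical dot-product mix: 32 floats per iteration, one "heavy"
       512-bit FMA per two 256-bit FMAs. */
    __attribute__((target("avx512f")))
    float dot_mixed(const float *a, const float *b, size_t n) {
        __m512 acc0 = _mm512_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps(), acc2 = _mm256_setzero_ps();
        size_t i = 0;
        for (; i + 32 <= n; i += 32) {
            acc0 = _mm512_fmadd_ps(_mm512_loadu_ps(a + i),
                                   _mm512_loadu_ps(b + i), acc0);
            acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16),
                                   _mm256_loadu_ps(b + i + 16), acc1);
            acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24),
                                   _mm256_loadu_ps(b + i + 24), acc2);
        }
        float s = _mm512_reduce_add_ps(acc0), buf[8];
        _mm256_storeu_ps(buf, _mm256_add_ps(acc1, acc2));
        for (int j = 0; j < 8; j++) s += buf[j];
        for (; i < n; i++) s += a[i] * b[i];
        return s;
    }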


> I now wonder what the average ratio of heavy 512-bit AVX-512 instructions vs lighter 256-bit instructions should be to avoid getting into permanent L2 mode and stay at L1.

One of the co-authors talks more about the exact limits here: https://www.realworldtech.com/forum/?threadid=179654&curpost...

The quick answer is that on the W-2104 system he tested, you can sustain one FMA every 2 cycles while still remaining in the medium speed L1 state.

His avx-turbo tool (https://github.com/travisdowns/avx-turbo) can be used to check the situation for your particular processor.


Does the OpenBLAS work follow BLIS and libxsmm? There's been at least some discussion of AVX-512 trade-offs around them, and the authors would probably advise. (Not that there's a single "avx512"...)


These instructions will be around for a long time, but their performance attributes will change in 5 minutes when Intel releases the next wave of processors.

I think given the current state of things it would be irresponsible for compilers to generate heavy instructions unless asked. Forget trying to be smart about it ... we already fail to be smart about things that are much simpler and more visible.

More interestingly, this may be what all CPU behavior looks like in 10 years, because if Intel has to resort to this kind of design now, why would that change any time soon? Instead of worrying primarily about keeping the execution units full, people trying to write fast code may be primarily concerned with keeping them NOT full so that the chip doesn't slow down. Which sounds crazy and hard to deal with.


Fwiw, autovectorizers tend to be pretty conservative about it.

So to use cutting-edge instructions you generally have to hand-code them (either in intrinsics, like Lemire did there, or in asm).


Especially for FP instructions, where compilers are heavily restricted by the standard and IEEE754 semantics: vectorization often changes the order of operations, and FP math is generally not associative.

For integer operations, auto-vectorization is more prevalent since everything can be reordered more freely. clang especially auto-vectorizes a ton of stuff even at -O2.
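
For instance, something as plain as this (an illustrative example; check your own compiler's output, the -march value is arbitrary):

    /* clang -O2 -march=haswell typically turns this loop into 256-bit
       vpaddd with no extra flags or pragmas. */
    void add_i32(int *restrict dst, const int *restrict a,
                 const int *restrict b, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }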


But that's precisely why most language standards don't require strict IEEE754 semantics. And why people mostly compile with -ffast-math or equivalent when they do.


You won't get floating-point SIMD of any kind without the -ffast-math flag on gcc, msvc, or llvm, or probably anything.
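
The classic case is a reduction, where the required left-to-right evaluation order blocks vectorization (a minimal example):

    /* Vectorizing this turns (((s+x0)+x1)+x2)+... into independent
       per-lane partial sums, a reassociation IEEE754 semantics forbid
       by default; -ffast-math (or -fassociative-math) permits it. */
    float sum(const float *x, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += x[i];
        return s;
    }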


A modern CPU running an instruction set specifically designed for high-performance computing, working under proper operational conditions and with water cooling, can still overheat and trigger its internal throttling. And since Turbo Boost, throttling is explicitly used not only as a power-saving or protective measure, but as part of normal operation.

So we are now at the closest point to the CPU power wall in history ...


I'm no CPU engineer, but this smells like some sort of chip-level macro function that cuts corners even more than the spectre/meltdown issue.

I remember joke opcodes back in the day. One was "Halt and catch fire". Is that seriously what this is doing?


It would be easy for Intel to make a CPU that never throttled down: they would just have to clock it low enough that it would never run into thermal problems no matter how much load the CPU was experiencing. But then it would be much worse for everybody's use case except maybe CPU reviewers.

Changing the clock speed based on thermal headroom is really hard to do well, but Intel does it well, and most other chip-makers are trying to duplicate Intel. This is really the opposite of those old chips that would sometimes burst into flame: a modern CPU which throttles based on temperature will never burst into flame even if you remove the heat sink and put it in a 150-degree oven.


AVX-512 does not get split into micro-ops (on Intel server-class CPUs; it does on some Intel consumer CPUs).

Amusingly, Zen splits AVX/AVX2 into micro-ops (two 128-bit halves) and it performs better under some workloads.

The real issue is that propagating 512 bits' worth of signal is extremely non-trivial and costs a shitload of power. Just `mov`'ing into the AVX-512 registers (initially, when the unit is not warm) can stall the CPU for 10,000+ cycles as it tries to power on all those registers.
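
If that's right, code that cares would warm the unit up before a timed kernel, something like this hypothetical sketch (the iteration count is a guess, not a measured number):

    #include <immintrin.h>

    /* Issue throwaway 512-bit ops so the unit is powered up before
       the real kernel runs. */
    __attribute__((target("avx512f")))
    static void warm_up_avx512(void) {
        __m512i v = _mm512_set1_epi32(1);
        for (int i = 0; i < 10000; i++)
            v = _mm512_add_epi32(v, v);
        volatile int sink = _mm512_reduce_add_epi32(v);  /* defeat DCE */
        (void)sink;
    }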


Someone analyzed AVX512 performance on the elusive Lenovo laptop? Link!

Also, source on the power up stall? AVX(2) didn’t have that and I’m highly surprised AVX512 would. Agner at least claims the same reduced throughput during warmup, but I think he only has early silicon.


AVX(2) definitely had the power-up stall on many chips, including all client Skylake I think.


No, it had reduced throughput of AVX instructions while the ALUs powered up. Not a stall.


Yeah maybe you are right for Skylake client, I haven't tested carefully there, but I'll probably get around to it. This thread [1] indicates that it may have only been Haswell that had the halted portion.

As for Skylake-SP, however, that chip is reported to have both reduced throughput and fully halted periods in [2].

Some have speculated it has to do with whether chips have an integrated IVR: the models with an integrated IVR having less capability of handling high dI/dt events. I don't know about that though (Skylake-SP still has an external VR, right?).

[1] https://www.agner.org/optimize/blog/read.php?i=378#378 [2] https://software.intel.com/en-us/comment/1926876#comment-192...


> cuts corners even more than the spectre/meltdown issue.

Indeed. If you have some piece of code in a different security context that conditionally executes a heavy instruction based on a decision made over some sensitive data, doesn't this provide a way to obtain information about that data?


> If you have some piece of code in a different security context that conditionally executes a heavy instruction based on a decision made over some sensitive data, doesn't this provide a way to obtain information about that data?

Yes. http://www.numberworld.org/blogs/2018_6_16_avx_spectre/


Yes, this is basically called NetSpectre. Well, NetSpectre describes two side channels, but the faster of the two is an AVX clock/transition-related side channel that relies on this CPU downclocking behavior.


If that's the case then you're leaking a lot more information via the power draw and timing than you are via the clock speed.


Is this a realistic scenario with AVX-512?


Oh now I get the name of that TV show! I never realised it was the name of a joke opcode before, thanks :)


Sometimes it's only partly a joke. https://en.wikipedia.org/wiki/Halt_and_Catch_Fire


I'm still upset about the permanent-until-vzeroupper "transition" penalty on non-VEX instructions on Skylake after any VEX instruction.


Do you know anything about that? I know some ABIs have a vzeroupper at the end of every method - what is this for, and do you know how expensive it is?


Looks to be a complicated issue, with very different performance profiles across different microarchitectures:

https://software.intel.com/en-us/forums/intel-isa-extensions...
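
Roughly what it looks like in intrinsics (a sketch; compilers following the ABI insert the vzeroupper automatically at function exit):

    #include <immintrin.h>

    __attribute__((target("avx")))
    void avx_work(float *p) {
        __m256 v = _mm256_loadu_ps(p);   /* dirties the upper 128 bits */
        v = _mm256_add_ps(v, v);
        _mm256_storeu_ps(p, v);
        _mm256_zeroupper();              /* emits vzeroupper: clears the
                                            upper halves so later SSE code
                                            avoids the transition penalty */
    }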


> Intel cores can run in one of three modes: license 0 (L0) is the fastest (and is associated with the turbo frequencies “written on the box”), license 1 (L1) is slower and license 2 (L2) is the slowest.

Wow, let's give the word "license" semantics not related to software licensing, and cause confusion between L1 meaning "L1 cache" and "license 1".


AIUI this is because the power scheduler gives the core the license to actually execute those instructions. Until it is granted, the core must emulate them as multiple vector operations of smaller stride size.


The terminology isn't great, but I think it's at least better than the marketing terms for the levels which use something like "non-AVX turbo", "AVX2 turbo" and "AVX-512 turbo" for L0,1,2 respectively. This is really confusing because most AVX code will actually run in the faster non-AVX turbo and similarly for AVX-512 as described. So the marketing names are actually too pessimistic.


This post cites sources that actively contradict the points it attempts to make. It aggregates a wealth of useful information against its own points.

Let us begin:

    However, there are also deterministic frequency 
    reductions based specifically on which instructions you
    use and on how many cores are active (downclocking).
As you are likely using AVX-512 in a cloud deployment, you don't have access to any of this information, and you are likely sharing that hardware with tenants who may not apply the same engineering rigor.

Also, nobody is setting their CPU affinity to ensure the down-clocking stays on a single core; you have to pray the scheduler doesn't shuffle your workload around. This requires a lot of platform-specific C, and most people are writing Java/Go/Ruby/Python where doing bit-twiddling NUMA management is impossible. Furthermore, the information you have access to in a cloud environment (which is where you'll be using advanced AVX-512 unless you work for Amazon, Google, Intel, or Cloudflare) may just lie about core count and NUMA architecture.

Also, this is just false [1]. Running AVX-512 adjusts the BASE clock for the package. There are throttling attempts to ensure every other core on the package throttles back. This is a package-wide effect, not a per-core effect.

    Light instructions include integer operations other than
    multiplication, logical operations, data shuffling
    (such as vpermw and vpermd) and so forth.
This is false according to Cloudflare [2], which you've linked. They test your "light" adds, shifts, and XORs (these are the only operations in ChaCha20 [4]). It cost too much.
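
For reference, the ChaCha20 quarter-round really is nothing but 32-bit adds, XORs, and rotates (per RFC 7539):

    /* All "light" operations by the article's own taxonomy. */
    #define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))
    #define QUARTERROUND(a, b, c, d)          \
        a += b; d ^= a; d = ROTL32(d, 16);    \
        c += d; b ^= c; b = ROTL32(b, 12);    \
        a += b; d ^= a; d = ROTL32(d, 8);     \
        c += d; b ^= c; b = ROTL32(b, 7);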

    We have chosen to only include two columns.
I'll include the whole thing [3]. Wow, yeah, the entire package's power curve is changing: the base clock, the cores not affected, all their clocks are changing. It's almost like having 1 out of 24 cores active still affects all 24 cores...

    For example, the openssl project used heavy AVX-512
    instructions to bring down the cost of a particular
    hashing algorithm (poly1305) from 0.51 cycles per byte
    (when using 256-bit AVX instructions) to 0.35 cycles
    per byte, a 30% gain on a per-cycle basis. They have
    since disabled this optimization.
The literal example meant to show AVX-512 is good ends with a statement that people using AVX-512 are now actively avoiding it.

This is less than content; do you have an agenda, or are you just an idiot?

[1] https://en.wikichip.org/wiki/intel/xeon_silver/4116

[2] https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...

[3] https://en.wikichip.org/wiki/intel/xeon_gold/5120

[4] https://en.wikipedia.org/wiki/Salsa20


> Also, this is just false [1]. Running AVX-512 adjusts the BASE clock for the package. There are throttling attempts to ensure every other core on the package throttles back. This is a package-wide effect, not a per-core effect.

The explanatory page accompanying that table contains text that contradicts your claim: https://en.wikichip.org/wiki/intel/frequency_behavior

> The frequency of each core is determined independently based on the workload described above. That is, cores running Non-AVX workloads can enjoy the full regular turbo frequency, whereas cores executing AVX-512 or AVX2 will operate at their own designated turbo frequencies. [...]

> In Haswell, an AVX2 workload on one core meant all cores were capped at AVX2 Turbo frequency. This had the undesirable effect of reducing performance for non-AVX workloads on cores that were unrelated to the cores executing AVX2 workloads. This behavior was changed with Broadwell which grouped cores executing AVX2 workloads together and cores executing non-AVX workloads separately, allowing the former cores group to execute at the lower AVX2 turbo frequency while having the later cores group execute at full non-AVX2 turbo.

Also:

> The literal example to show AVX-512 is good at ends with statement that people using AVX-512 are now actively avoiding it.

It's an example to show that cycle-by-cycle speedups due to the use of AVX-512 are not always worthwhile, especially in library code. Which was one of the points the article was trying to make. It's fine if that was an obvious point from your perspective, but it doesn't contradict the article and it doesn't make it "less than content".


Please don't break the HN guidelines by becoming uncivil.

https://news.ycombinator.com/newsguidelines.html


please stop blinding me by making downvoted comments one shade away from the background color


If you want to read one in glorious black, simply click on its timestamp to go to its page.


This is a reasonable criticism.

I'm not a fan of the delivery, but I absolutely agree with you.


Agreed as well. It's annoying because they think they're de-emphasizing downvoted comments by fading them out, when they're actually calling more attention to them by requiring the reader to invest more effort to read them.

It's reminiscent of the infamous "disemvoweling" strategy used on a few other forums, where the reader is forced to decide whether they want to painstakingly reconstruct offensive and abusive comments or blindly trust someone else to restrict what they see.

Life would be so much easier if they just displayed the comment score like most other moderated forums and let the reader decide the merits of the comments based on visible information.


So what I got from the article was simply be careful of AVX2 & AVX-512 because it can result in frequency reduction (both of the nominal core and potentially other cores). Is this an inaccurate reading of the situation?


> Also, this is just false [1]. Running AVX-512 adjusts the BASE clock for the package. There are throttling attempts to ensure every other core on the package throttles back. This is a package-wide effect, not a per-core effect.

This was true of server CPUs before Skylake-SP, but on Skylake-SP this effect is per core. This was widely touted by Intel since it was an improvement over the old behavior.

What the other cores are doing still matters since the count of "active cores" is used to look up the turbo frequency for each other core, depending on its license: but for this purpose it only matters if the others are running or halted, not _what_ they are running.

If you don't believe it, I've shared a benchmark you can try yourself if you have access to a Skylake server [1]. Run with --spec avx512_fma_t/1,scalar_iadd/3, for example, to kick off 1 core of heavy FMA ops in parallel with 3 cores of scalar-only ops. You'll see only the FMA core drop down to the L2 license.

    Light instructions include integer operations other than
    multiplication, logical operations, data shuffling
    (such as vpermw and vpermd) and so forth.
> This is false according to Cloudflare [2], which you've linked. They test your "light" adds, shifts, and XORs (these are the only operations in ChaCha20 [4]). It cost too much.

Again, you can test this yourself with avx-turbo [1]; there are a variety of tests there showing that everything except FP operations and integer multiplications (which execute on the FP unit) is treated as light. Of course, the tests aren't exhaustive, but they hit the main categories of instructions.

Note that even light AVX-512 instructions cause the chip to transition to a lower frequency (the so-called L1 license, which is usually about halfway between the fastest L0 and slowest L2 speed). So even if ChaCha doesn't use any heavy instructions, any AVX-512 at all will slow down your frequency. They only reported a 5% to 7% reduction in performance, which could easily be consistent with a downclock to the L1 frequency.

> The literal example meant to show AVX-512 is good ends with a statement that people using AVX-512 are now actively avoiding it.

I think the example is intended to show that AVX-512 indeed significantly speeds up the parts of the code where it is applied: but if that makes up a small part of your code overall, it might not be worth it, because the rest of your code may suffer a frequency penalty.

Unlike some earlier ISA extensions, there is no simple answer like "use it" or "don't": there are complex tradeoffs. That's what you should get out of this article.

[1] https://github.com/travisdowns/avx-turbo



