The least interesting part about AVX-512 is the 512 bits vector width (gamedev.place)
192 points by luu on June 19, 2023 | 107 comments



Summary:

AVX-512 adds many instructions that can replace what used to take 3 less efficient instructions.

This instruction set also doubles the number of available SIMD (single instruction multiple data) registers.

Those instructions are very useful even on 128-bit vectors, and not a lot of people actually need 512-bit vectors.

Because of the number of registers and the 512-bit width, it takes a lot of space in silicon. This makes it costly, so it is reserved for more expensive CPUs.

Had it been limited to, or also offered in, a 256-bit version, this instruction set would most likely have been included in many more CPUs, making it much more useful.


> Because of the number of registers and the 512-bit width, it takes a lot of space in silicon. This makes it costly, so it is reserved for more expensive CPUs.

> Had it been limited to, or also offered in, a 256-bit version, this instruction set would most likely have been included in many more CPUs, making it much more useful.

Someone is blatantly ignoring AMD CPUs...


I merely summarized the article, don't shoot the messenger :)


AMD supports a 256-bit-wide subset of the AVX-512 ISA?


When AMD launched AVX-512 support, it was present on all cores of that generation. I think it was Zen 4, but I might be misremembering. If I recall correctly it was using 256-bit vectors internally, so full 512-bit operations were slower than Intel's.


It takes a double-pumped approach, using two cycles of a 256-bit-wide unit to finish an AVX-512 instruction. So it sort of does.


The physical register file is much larger than the logical register count. A budget option could simply reduce the amount of renaming done to save space.

In Intel's case, Cannon Lake did have AVX-512, but was blocked from being a mainstream part due to 10nm yields. And then their rushed efficiency-core strategy effectively disabled AVX-512 just as they were getting back on track.

I don't think there's an intrinsic reason you couldn't have efficiency cores run AVX-512, albeit slowly, and I expect we'll see just that.


> And then their rushed efficiency core strategy effectively disabled AVX-512 just as they were getting back on track.

I partly blame Linux for that. I remember asking at Kernel Recipes about supporting truly heterogeneous multi-processor systems and got a shrugged "don't buy broken hardware". Back then it was for a Broadcom home gateway product, which has an asymmetrical dual core, one core with an FPU and the other without. Since then we have seen many examples of such asymmetry: most HMP smartphones have asymmetrical instruction sets. Mono (and probably all JIT VMs) hit issues with varying cache line sizes, so the perfect abstraction is already gone. And now we have Intel E vs P cores. This is a rather hard problem, I won't pretend otherwise, but the amount of dead silicon and lost power efficiency adds up significantly.


At least on x86 the CPUID instruction is part of the problem. Userspace can do feature-detection and then rely on that. But if it's inconsistent between cores then thread migration would cause illegal instruction faults.

If the kernel tried to fix that by moving such faulting threads to P-cores, that would lead to a memcpy routine with AVX-512 instructions causing all threads to be moved off the E-cores.

So first Intel would have to introduce new CPUID semantics to indicate that e.g. AVX-512 is not supported by default, then a separate flag indicating that it's specifically supported on this core, and then userspace would have to pin the thread if it wants to use those instructions, or stick to the default set if it wants to be migratable.
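To make the first point concrete, a minimal sketch of the check-once-then-rely pattern userspace uses today (using GCC/clang's cpuid.h; the cached result is exactly what goes stale if a thread later lands on a core without the feature):

    #include <cpuid.h>

    // Query CPUID once at startup and cache the answer; every thread then
    // trusts it. AVX512F is reported in CPUID.(EAX=7,ECX=0):EBX bit 16.
    static int cpu_has_avx512f(void) {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
            return 0;
        return (ebx >> 16) & 1;
    }
    // If this returned 1 on a P-core and the thread later migrates to a core
    // without AVX-512, the fast path it selected faults with SIGILL.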


I don't really know how CPUID works, but I'm guessing it can be trapped by Linux. So I think that a first "stupid" implementation would be for Linux to report in CPUID the intersection of features across the CPUs on which the process is allowed to run. So if you want to run AVX-512, you first need to pin that process to an AVX-512 CPU. You would be able to find an AVX-512 CPU by checking /proc/cpuinfo. (Even this "simple" variant is far from trivial, because the cpuset can be changed dynamically in various ways; Android, for example, moves a process from foreground CPUs to background CPUs using cgroups.)
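Roughly this, as a sketch (Linux/glibc; the cpu number is assumed to have been found beforehand, e.g. from /proc/cpuinfo):

    #define _GNU_SOURCE
    #include <sched.h>

    // Pin the calling thread to a single CPU known to support AVX-512.
    // pid 0 means "the calling thread"; returns 0 on success, -1 on error.
    int pin_to_avx512_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return sched_setaffinity(0, sizeof(set), &set);
    }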


Not sure if you can trap on cpuid, but the kernel does have control over which cpuid bits are exposed to your application. So requiring pinning to see all the bits could work, but then the issue is what happens if the affinity is changed. A static list of required capabilities in some ELF header would probably be better.


> A static list of required capabilities in some ELF header would probably be better.

I think I agree. The thing is that it's a kind-of security issue. I suggested pinning because it requires CAP_SYS_NICE, which is a feature: if you allow apps to freely declare their usage, they will end up being scheduled unfairly, because the system will stick them to P-cores.

That being said, you could indeed have an ELF header declaring this, and then ignore it if the caller doesn't have CAP_SYS_NICE. I do feel using an ELF header for that is weird, but my knowledge of ELF is way too limited to judge.

Another thing that could work is using file-system attributes or mode bits (like setuid), but I think FS support for attributes is spotty at best, and I doubt the mode bits can be extended.


I don't think sched_setaffinity requires CAP_SYS_NICE unless you want to set it on a process you don't own.


Maybe I'm dumb, and for sure I'm not an expert on this subject, but wouldn't we need an executable containing both an AVX-512 code path and an alternative plain one, plus a way to switch code paths according to the core the code is running on? The same memory page could run on a P-core or on an E-core. Inefficient because of the extra checks?


userspace can be preempted at any instruction, so you have a TOCTOU problem.


Sure, but one can first pin the thread to a core, or a "don't move me between core types" flag could be added to OSes.


Right, thanks.


Or maybe a new system call to allow a thread to temporarily enter a “performance mode” where it can only be scheduled on the powerful cores. Pinning sounds a bit too strict.


You can already pin to a set of cores instead of a single one. But anyway, my point is that currently userspace interacts directly with CPU features without intermediation from the kernel. So intel would have to think about how to coordinate with userspace too, not just rely on the kernel to patch things up (or not).


Since Android is Linux, won't manufacturers of such smartphone contribute solutions?

Big companies like Samsung should have more than enough resources and interest in doing so. Unlike the guy who answered you at Kernel Recipes, I guess.


When mainline tells you "your hardware is broken", what kind of contributions do you expect exactly?


Something that proves that it is not?


I don't believe ARM has this problem. They are careful to ensure the same instruction set is available on all cores in the chip. This is a botched launch from Intel. Software is not the solution.


Samsung once had this problem with their in-house cores.


IIRC what you’re describing here is indeed what shipped. All the AVX-512 instructions are available for 128-bit, 256-bit and 512-bit registers (xmm, ymm, zmm). If not strictly all, essentially all?


Base AVX-512, on the first Phis, did in fact only work for 512-bit registers (it also used a slightly different encoding). A later extension is AVX512VL (included on all 'normal' CPUs implementing AVX-512), which adds support for the instructions on smaller vector sizes. But there is no standard mode which allows the hardware to support the instructions only for 256- or 128-bit registers; they must be supported at least for 512-bit registers or else not at all.


Oh! Yeah! I do forget about “base” avx-512 sometimes. No one but HPC folks really ever had any proximity to that. In practice if you’ve got a computer that supports avx-512, it supports VL.


The Phis inherited that from Larrabee: the processor simply didn't have any 128/256-bit SIMD or even "real" x87 (FPU). It did have a lot more FMA variants, including an instruction that did "single cycle next step rasterization" (faddsets).


You can microcode them on top of high density SRAM if need be.


Right, but the issue is you can't "just" offer AVX-512 with a 256-bit vector length. You have to also offer the 512-bit options too, which has costs that your processor vendor may not be willing to pay. So you end up only getting AVX2 support.


Hmmm, there are only two desktop processor vendors. Is AVX-512 at all available on mobile?

From the two desktop vendors, AMD has AVX-512 support on all their AM5 CPUs. Intel has AVX-512 support on all 11th gen CPUs and on some 12th gen CPUs. The support is there in the silicon on all P-cores in 12th/13th gen CPUs, just disabled in microcode.

So AMD and Intel have already paid the cost.


Previously Intel had AVX-512 support in 3 generations of mobile CPUs, Cannon Lake U, Ice Lake U, Tiger Lake H/U, but only the last generation had widespread availability.

Starting with Alder Lake, Intel has dropped the AVX-512 support in non-server CPUs.

On the other hand, AMD has just launched their Phoenix mobile CPUs (Ryzen x 7x40 HS or U), which have excellent AVX-512 support.


AMD - until AM5 - has not been willing to pay the cost. Intel was not willing to pay the cost for their E-cores. AVX-512 is almost 10 years old at this point and because of adoption issues still can't be relied upon.


Or use double pumping like AMD does; that seems like the solution for "efficiency cores" to me.


VPTERNLOGD is a mouthful but is fun to think about.

Notice there are 4 functions mapping 1 bit -> 1 bit: const 0, const 1, copy, invert.

There are 16 functions mapping 2 bits -> 1 bit. Think 4 possible inputs, 2 possible output values for each input (0 or 1), so there are 2^4 = 16 such functions.

Likewise 256 functions mapping 3 bits -> 1 bit: 8 possible inputs, 2 possible output values for each input, so 2^8 = 256 such functions.

This means that the set of functions 3 bits -> 1 bit may be indexed by a single byte! With this instruction, you specify the byte as an immediate, it is interpreted as an index into a function table, and so you get any 3-input boolean function.
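A scalar model of the lookup (assuming the operand-to-bit ordering shown in the truth table further down this thread, i.e. bit (4a + 2b + c) of the immediate is the output for inputs (a, b, c)):

    // Evaluate a 3-input boolean function described by an 8-bit immediate,
    // one bit at a time.
    static inline unsigned ternlog3(unsigned imm8, unsigned a, unsigned b, unsigned c) {
        return (imm8 >> ((a << 2) | (b << 1) | c)) & 1u;
    }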

I wonder if we'll see a similar instruction for 2 bits -> 2 bits? Could be useful!


> There are 16 functions mapping 2 bits -> 1 bit. Think 4 possible inputs, 2 possible output values for each input (0 or 1), so there are 2^4 = 16 such functions.

When people talk about algebraic data types, they tend to forget about functions.

The tuple `(a, b)` is a product type because it has #a * #b possible values. The union type `a | b` is a sum type because it has #a + #b possible values. The function `a -> b` is an exponential type because it has #b ^ #a possible values.


> I wonder if we'll see a similar instruction for 2 bits -> 2 bits? Could be useful!

No, because the way all modern high-performance CPUs work implies that an instruction with two destinations cannot really be any faster than two instructions with a single destination. So you can implement 3 bits -> 2 bits with two VPTERNLOGDs and can't do any better than that for 2 -> 2.


We've seen (+, and) multiply (that's just standard multiplication) and (xor, and) multiply ('carryless multiply'); it would be interesting to see a carryless multiply where both combining functions are user-controllable. (See also: <https://twitter.com/moon_chilled/status/1639829366304821249>; I forget why this works, though.)


You might find this post interesting: https://bitmath.blogspot.com/2023/05/grevmul.html


> I wonder if we'll see a similar instruction for 2 bits -> 2 bits? Could be useful!

You don't even need to support all 16 functions, as a number of them are duplicates or could be implemented with the others. The following list covers all the possibilities:

- AND, OR, XOR [supported basically everywhere]

- AND-NOT (a&~b), OR-NOT (a|~b) [sometimes supported]

- NOT-AND (~(a&b)), NOT-OR (~(a|b)), NOT-XOR (aka XNOR)

So only a few basic instructions need to exist to support all combinations of 2-operand bitwise logic.


That is why the IBM POWER ISA includes only the 8 of the 16 functions of 2 Boolean variables that cannot be obtained from the others by reversing the operand order or by using the same register for both operands (which is enough to obtain the maximum instruction-count reduction in comparison with ISAs that only include AND/OR/XOR).



Fun fact, the third byte of the last argument to the Windows BitBlt function describes a 3 bits -> 1 bit function (inputs being source, destination and brush) in exactly this way.

However, the lower two bytes encode the formula for the function, so that the transfer could be JITted more easily. This is not really necessary, since you could just use 512 bytes for the mapping, but memory tradeoffs were different back then...


It was also how you fed the operation to the Amiga blitter hardware. A nice introduction to minterms for many teenage hackers at the time... (https://en.wikipedia.org/wiki/Canonical_normal_form#Minterm)


The masked variants of most operations are a killer AVX-512 feature for me. Vectorised conditional execution was/is the last piece of the puzzle.

It baffles me that clang in particular disregards them. Clang's intrinsics and builtins generally use the unmasked forms and fake it with subsequent combining operations. This always benchmarks slower in loops, and often demands an extra register. I haven't delved deeply, but it feels like either the cost model is mispredicting a potential k-register bottleneck, or it doesn't know about masked AVX-512 instructions at all. In comparison, GCC does, but it falls down (on my code at least) in needing more explicit vectorisation than clang.
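For illustration, the two shapes being compared look roughly like this (a sketch in intrinsics, not what any particular compiler emits for a given loop):

    #include <immintrin.h>

    // One merge-masked add: inactive lanes keep src.
    __m512 masked_add(__m512 src, __mmask16 k, __m512 a, __m512 b) {
        return _mm512_mask_add_ps(src, k, a, b);
    }

    // Unmasked op plus a separate combining step: two operations and a
    // full-width intermediate that wants its own register.
    __m512 unmasked_then_blend(__m512 src, __mmask16 k, __m512 a, __m512 b) {
        __m512 sum = _mm512_add_ps(a, b);
        return _mm512_mask_blend_ps(k, src, sum);
    }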


> This always benchmarks slower in loops

Really? It's a forced read-write dependency on the destination register. Which makes sense for cores with limited superscalar. But for ops with >1 cycle latency or >1/cycle throughput, chained masks are likely to inhibit ILP and be slower...


A big benefit comes from having branchless code. For example, if you have an if/else statement where the consequent acts on some elements of the vector and the alternative acts on the others, you can perform them all with no branch, by taking the mask resulting from the condition for the consequent instructions, then complementing the mask and applying the alternative to the same registers. This can also have predictable performance, because all instructions from both branches are executed each time, and there are no branch mispredictions to worry about. It's very useful for timing-sensitive code (cryptography), and situations where you want a measurable WCET.
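A tiny sketch of that pattern (made-up example: r = 2*x where x >= 0, r = -x otherwise):

    #include <immintrin.h>

    __m512 both_branches(__m512 x) {
        __mmask16 k = _mm512_cmp_ps_mask(x, _mm512_setzero_ps(), _CMP_GE_OQ);
        // consequent: lanes where the condition holds get 2*x
        __m512 r = _mm512_mask_mul_ps(x, k, x, _mm512_set1_ps(2.0f));
        // alternative: complement the mask, same registers, lanes get 0 - x
        r = _mm512_mask_sub_ps(r, _mm512_knot(k), _mm512_setzero_ps(), x);
        return r;
    }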


Masked instructions vs subsequent merging are both branchless and have no implicit data-dependent timing relative to each other.


It is possible that even masked vectorised branchless code is susceptible to side-channel attacks based on power consumption, nor would I rule out timing attacks if you can somehow get subnormal or exceptional values loaded. Is it a joy to code in this style? Perhaps. Is it a silver bullet? It is not.


Mask with zeroing would solve that. The EVEX prefix supports both merge and zero masking.


All that solves is changing the merging instruction from a masked merge to a maskless OR.


Zero merging breaks the dependency chain on the destination; all masked out lanes are set to zero. What do you mean a "maskless OR"?


Zero masking doesn't merge. If you're discarding lanes with a separate merge op, it doesn't matter what the discarded value was.


I think the point GP is making is that zero-mask ops prevent/break false dependencies on the destination register, and moreover, that this becomes a useful tool the more conditionally-executed-by-masking vectorized code you have in an algorithm body, and may also (caveat reader: I am now speculating) be a reason why AVX-512 came with so many damn registers, because they're super useful for intermediate/partial results.
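For concreteness, a tiny sketch of the two masking forms (the zeroing one never reads a prior destination value):

    #include <immintrin.h>

    // Merge masking: inactive lanes keep src, so the result depends on src.
    __m512 merge_masked(__m512 src, __mmask16 k, __m512 a, __m512 b) {
        return _mm512_mask_add_ps(src, k, a, b);
    }

    // Zero masking: inactive lanes become 0.0f; no prior value is read.
    __m512 zero_masked(__mmask16 k, __m512 a, __m512 b) {
        return _mm512_maskz_add_ps(k, a, b);
    }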

Unfortunately the SysV ABI interferes with compilers allocating upper SIMD registers, since they're all call-clobbered. This motivates bigger functions: almost all my intentionally vectorized/vectorizable code is declared inline and very occasionally I've resorted, reluctantly, to inline asm. Whether the ABI design is actually a mistake, and then how/whether it might be remediated, remains a matter of opinion.

Digression:

The consequence of all this is there's often More Than One Way To Do It, which no matter how much mechanical sympathy you might hope to innately possess still means punching lots of variants on your code into uica/iaca et al to paint anything like a decent picture about bottlenecks, as well as doing your damnedest to ensure that any benchmarking of loops/computation you care to perform during development actually corresponds to real execution. The holy grail, viz. writing C or other HLL that auto-vectorizes well on more than one compiler and more than one architecture (because you wanted to support NEON, too, right?), becomes a near-bottomless programmer time sink.

There are real benefits to be had, but given the additional time-investment required to obtain those benefits, it's little wonder that AVX-512 is shortchanged on intentional adoption, and that's even before Intel started crippling Alder Lake. In the long run, only greater strides in compiler auto-vectorization capabilities will fix this for everyday code.


If it's a false dependency, then it doesn't matter what's in the inactive lanes and a simple unpredicated instruction will break the dependency just as well as a zero masking one. Which is exactly what you said the compiler already did.

AVX-512 has 32 registers because the Pentium core Larrabee was developed against was in-order. In a real sense, the P5 core dictated much of AVX-512's design.

There isn't a useful way to define a general ABI with callee saved vector registers without saying something like "only bits [127:0] are saved"


I actually reached for these (_mm*_cmpeq_epi8_mask) this morning in my (Rust) code, only to find that they are still `unstable` and therefore unavailable to me, along with so many other SIMD things in Rust.

Portable SIMD aside (which is sitting forever unstable & unavailable), the actual intrinsics I feel should not be. Quite frustrating, and along with missing allocator_api (still!) makes me feel sometimes like 'reverting' back to C++.

https://doc.rust-lang.org/stable/core/arch/x86_64/fn._mm256_...


Fascinating. I wonder why intel never released “AVX256”. Was it to drive adoption of their new extra wide SIMD hardware? Do the extra instructions add a lot of complexity outside of just the increased register size?

Either way I recently had to write a SIMD implementation in both SSE (Intel) and Neon (Arm). This was my first time writing SIMD. I found the neon instruction set much more intuitive and complete than SSE. There are all these weird limitations in SSE (such as trying to do a reduction sum across a vector or shifting across vectors) that made it feel incomplete. Never had a chance to try out AVX.


That was around the time when Intel was struggling to get 10nm out of the door. When they were designing AVX-512 it might have seemed that the initial implementation on 14nm was just a stop-gap and it would be more practical to implement such wide units on the next process that was just around the corner. But little did they know they would end up running in circles rehashing Skylake/14nm for the next six years while they waited for 10nm to finally come online. If they had known things would go that way, perhaps they would have done "AVX256".


>10nm

aka 10nm Enhanced SuperFin aka Intel 7 (12000 and 13000 series)

Which funnily enough don't support AVX512, unlike the previous 10000 (14nm++) and 11000 (14nm+++) series.


That's a whole other mess, the 12th and 13th gen P-core design does technically support AVX512 but the smaller E-cores don't, and rather than try to reconcile that mismatch in software they just disabled AVX512 altogether to make the cores all behave the same. If they hadn't decided to implement E-cores then 12th/13th gen would have had AVX512 support.

Some motherboards allowed you to enable AVX512 on those chips if you disabled the E-cores, but then Intel started permanently fusing off AVX512 in hardware on later batches.


Indeed, and to add on to this, I suspect that they had their own plans about where to take the client space in general, which involved more AVX-512 (which a couple of generations prior to Alder Lake supported!), but depended on their having a near-monopoly. The competitiveness of Zen forced them to pivot, and AVX-512 fell by the wayside, largely because of its lack of users (which in turn is in large part because no one really cares about performance on clients).


Intel kinda messed up the ISA by requiring AVX-512F across all AVX-512 subsets. If AVX-512VL didn't depend on F, you could have a 256-bit only variant of "AVX-512".

(and instead of making a new feature that allows VL without F, Intel's "solution" seems to be about piecemeal backporting EVEX instructions to VEX (e.g. VNNI, IFMA))

NEON is generally a more "complete" SIMD ISA than SSE/AVX, though it has less "fancy" stuff. AVX-512 fills in a bunch of gaps that was missing in earlier ISAs, but still has odd omissions (like no 8-bit bitwise shift).


Cool that you started with SIMD :) FYI github.com/google/highway allows you to write your code once and target many instruction sets. It also fills in many of the gaps, including reductions.

Disclosure: I am the main author.


I am a happy owner of a Tigerlake (Intel 11th Gen) Framework laptop. I've considered upgrading to a 12th or 13th Gen motherboard, and while I have no doubt they'd be great for me as a Gentoo developer with the greatly increased core counts, my hesitation is that the new CPUs have AVX-512 disabled.

Maybe this doesn't matter, almost certainly wouldn't for most people, but I'm compiling the whole system myself so the compiler at least has the freedom to use AVX-512 wherever it pleases. Does anyone know if AVX-512 actually makes a difference in workloads that aren't specifically tuned for it?

My guess is that given news like https://www.phoronix.com/news/GCC-AVX-512-Fully-Masked-Vecto... that compilers basically don't do anything interesting with AVX-512 without hand-written code.


The promise of the AVX-512 instruction set really was that it would be much easier to (auto-)vectorize code that wasn’t written with vectorization in mind, with tools like masked execution and gather/scatter that either didn’t exist at all before (SSE) or were very minimal (AVX).

The tools are there in the instruction set, but that still leaves the issues of time and effort to implement in compilers, and enough performance improvement on enough machines in some market (browsers, games, etc) capable of running it all before any of this possibility becomes real.

The skylake-xeon/icelake false start here really can’t have helped. It’s still a much more pragmatic thing to target the haswell feature set that all the intel chips and most amd chips can run (and run well).


Funny that if you want AVX-512 now, it's AMD that's offering it and Intel that isn't.

Sometimes the second comer to a game has the advantage of taking their time to implement something, with fewer compromises and a better overall fit.


The compiler will only choose to use AVX-512 if you give it the right `-m` flags. Most people who are running generic distros that target the basic k8 instructions benefit from AVX-512 only when some library has runtime dispatch that detects the presence of the feature and enables optimized routines. This is common in, for example, cryptography libraries.
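The dispatch itself is usually just a pattern like this (names are made up; real libraries often use ifunc resolvers or per-function pointers, but the idea is the same):

    #include <stddef.h>
    #include <string.h>

    // Two implementations of the same routine; the AVX-512 one would normally
    // live in a file compiled with -mavx512f (both are stand-ins here).
    static void copy_portable(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }
    static void copy_avx512(void *dst, const void *src, size_t n)   { memcpy(dst, src, n); }

    static void (*copy_impl)(void *, const void *, size_t) = copy_portable;

    // Called once at startup: pick the fast routine only if the CPU reports it.
    static void init_dispatch(void) {
        if (__builtin_cpu_supports("avx512f"))
            copy_impl = copy_avx512;
    }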


Right. Since I'm using Gentoo and compiling my whole system with `-march=tigerlake`, the compiler is free to use AVX-512.

My question is just... does it? (And does it use AVX-512 profitably?)


It will not use AVX-512 if you have CFLAGS="-march=tigerlake -O2". You will, at the very least, need CFLAGS="-march=tigerlake -O3" to get it to actually use AVX2, and tigerlake's AVX512 implementation is so poor (clock throttling etc) that gcc will not use AVX-512 on tigerlake. AVX-512 is used if you have -march=znver4 though, so the support for autovectorizing to AVX-512 is clearly there.

https://godbolt.org/z/1a39Mf3bv
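A toy example of the kind of loop this is about (hypothetical, not the case from the godbolt link): with gcc -O3 -march=znver4 a loop like this gets auto-vectorized with zmm registers, while at -O2 or with a tigerlake target it stays narrower.

    // dst[i] = a[i] * 2 + b[i]; simple enough for the vectorizer to handle.
    // How wide it goes depends on the -O level and the -march/-mtune target.
    void scale_add(float *dst, const float *a, const float *b, int n) {
        for (int i = 0; i < n; ++i)
            dst[i] = a[i] * 2.0f + b[i];
    }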


Is it actually that bad on Tiger Lake? Or just for really high-width vectors? On my old Ice Lake laptop, single-core AVX-512 workloads do not decrease frequency at all even with wider registers, and multi-core workloads will result in clock speed degradation of a small amount, maybe 100Mhz or so.

Depends on a couple of factors (e.g. Ice Lake client only has one FMA unit) but I'd be surprised if Tiger Lake was a major regression relative to Ice Lake. It seems like they had it in an OK spot by then.


In my experience it depends on the compiler. clang seems far more willing to autovectorise than gcc. Also, when writing the code you have to write it in a way that strongly hints to the compiler that it can be autovectorised. So lots of handholding.


I guess a better question is why you rebuild the system without a rational basis to expect benefits.


Are you familiar with source-based distributions?

I'm not rebuilding specifically for this one potential optimization.


Why not use -march=native?


Surprisingly, -march=native doesn’t always expand to the locally optimal build flags we might expect, particularly with gcc on non-Linux platforms.


Oh interesting. Is this one of those things where backwards compatibility eventually got in the way of the intended purpose?


I actually do. I just said -march=tigerlake to make it clear what CPU family the compiler was targeting.


Why not use -march=snark?


> I've considered upgrading to a 12th or 13th Gen motherboard, and while I have no doubt they'd be great for me as a Gentoo developer with the greatly increased core counts, my hesitation is that the new CPUs have AVX-512 disabled.

Unless you have a very specific AVX-512 workload or you need to run AVX-512 code for local testing, you won’t see any net benefit of keeping your older AVX-512 part.

Newer parts will have higher clock speed and better performance that will benefit you everywhere. Skipping that for the possibility of maybe having some workload in the future where AVX-512 might help is a net loss.


Now you may choose a new AMD Phoenix-based laptop, with great AVX-512 support (e.g. with Ryzen 7 7840HS or Ryzen 9 7940HS or Ryzen 7 7840U).

AMD Phoenix is far better than any current Intel mobile CPU anyway, so it is an easy choice (and it compiles code much faster than Intel Raptor Lake, which counts for a Gentoo user or developer).

The only reason to not choose an AMD Phoenix for an upgrade would be to wait for an Intel Meteor Lake a.k.a. Intel Core Ultra. Meteor Lake will be faster in single-thread (the relative performance in multi-thread is unknown) and it will have a bigger GPU (with 1024 FP32 ALUs vs. 768 for AMD).

However, Meteor Lake will not have AVX-512 support.

For compiling code, the AVX-512 support should not matter, but it should matter a lot for the code generated by the compiler, as it enables the efficient auto-vectorization of many loops that cannot be vectorized efficiently with AVX2.

While gcc and clang will never be as smart as hand-written code, their automatic use of AVX-512 can be improved a lot, and announcements like the one you linked show progress in this direction.


> Does anyone know if AVX-512 actually makes a difference in workloads that aren't specifically tuned for it?

I know game console emulators use it to great effect with significant performance increases.


Incidentally, that's another case where the 512-bit-ness is the least interesting part: the new instructions are useful for efficiently emulating ARM NEON (Switch) and Cell SPU (PlayStation 3) code, but those platforms are themselves only 128 bits wide, so I don't believe the emulators have any use for the 512-bit (or even 256-bit?) variants of the AVX-512 instructions.


I haven't looked into the code for these but are they possibly pipelining multiple ops per clock? If it's not dependency chained they probably calculate a few cycles at once.


Specifically RPCS3 had a huge speedup using AVX-512 [1]

1: https://www.tomshardware.com/news/ps3-emulation-i9-12900k-vs...


RPCS3 is a big fan of esoteric CPU features; it was also one of the very few applications which used Intel's TSX before Intel killed it off.


Game console emulators are of course specifically tuned for this.


What other emulators beside rpcs3 use it?


AVX-512 is specifically the first x86 vector extension for which compilers should eventually be able to emit reasonable code. Thanks to gather and masked execution, with AVX-512 vectorizing a simple loop doesn't always mean blowing up code size to 10x.

However, compilers have so far been slow to implement this, with the relevant patches only going into GCC right now.
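For example, a loop of this shape (hypothetical) is where masking and gather help: the if() can become a lane mask and the indexed load a gather, so the compiler can emit one compact vector body instead of a long blend-and-remainder sequence.

    // Conditional, indirect update: with AVX-512 the condition maps to a mask
    // register and table[idx[i]] maps to a gather when vectorized.
    void gather_if(float *out, const float *table, const int *idx,
                   const float *x, int n) {
        for (int i = 0; i < n; ++i)
            if (x[i] > 0.0f)
                out[i] += table[idx[i]];
    }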


I never realized that just the registers for AVX-512 were so large compared to L1. (Though some newer chips have 48 KB instead of 32.) But I think I'm mostly surprised because I hadn't thought of the size of the register file (which is comparable) when considering renaming.

Some interesting replies too, eg https://mastodon.gamedev.place/@TomF/110572967731705754

A story I would have believed was that the instruction set was designed with some useful seeming instructions and a hope that compilers would improve. But it sounds like it was designed much more closely with actual example programs and a compiler, just not the kind that attempts to vectorise scalar code.


AVX-512F also introduced embedded rounding and exception control in the instruction itself, which addresses a long-standing pain point for everyone doing accurate maths (e.g. interval arithmetic). It's a great shame that Intel made all the good bits of AVX-512 hostage to the less important 512-bit vector width.


It's a little annoying that valgrind still doesn't support AVX-512 (patches were first posted in 2017). We like to test our binaries using valgrind, so we need to compile everything with -mno-avx512f.

https://bugs.kde.org/show_bug.cgi?id=383010


An often overlooked gem of AVX-512 is its (optional) conflict detection instructions. Using these enables vectorizing gather-modify-scatter operations that may alias at the element/lane level. Algorithms that have irregular data-level parallelism (DLP) as opposed to perfect or embarrassingly parallel DLP can now take advantage of the SIMD registers/FUs.

https://upcommons.upc.edu/bitstream/handle/2117/77204/VSR%20... shows how radix sort can be vectorized much more efficiently using these types of instructions.
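A minimal sketch of the idea (assumes AVX512F + AVX512CD; instead of resolving conflicts in-register as the paper does, this version simply falls back to scalar for any block containing duplicate indices):

    #include <immintrin.h>
    #include <stddef.h>

    // hist[idx[i]] += val[i], vectorized where the indices don't alias.
    void add_indexed(float *hist, const int *idx, const float *val, size_t n) {
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m512i vidx = _mm512_loadu_si512(idx + i);
            __m512i conf = _mm512_conflict_epi32(vidx);     // per lane: mask of earlier equal lanes
            if (_mm512_test_epi32_mask(conf, conf) == 0) {  // no duplicate indices in this block
                __m512 vv  = _mm512_loadu_ps(val + i);
                __m512 cur = _mm512_i32gather_ps(vidx, hist, 4);
                _mm512_i32scatter_ps(hist, vidx, _mm512_add_ps(cur, vv), 4);
            } else {
                for (int j = 0; j < 16; ++j)                // scalar fallback for aliasing lanes
                    hist[idx[i + j]] += val[i + j];
            }
        }
        for (; i < n; ++i)
            hist[idx[i]] += val[i];
    }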


Didn't some of the DEC Alpha and IBM POWER designs use a cached register file, for lack of a better term, allowing them to support more architectural state than the wide multi-ported register file could fit? Would a small double/quad-pumped AVX-512 implementation with a dual-ported register file, plus caching/queuing of recently produced results at the functional units that produced them, save die size and still allow useful throughput?


> * VPTERNLOGD (your swiss army railgun for bitwise logic, can often fuse 2 or even 3 ops into one)

I wonder if this would be useful for implementing cryptographic algorithms.


I doubt it would be much use, as cryptographic operations tend to mainly use xor on two inputs.

VPTERNLOGD basically works by constructing a truth table for 3 inputs.

    | A | B | C |  R
    | 0 | 0 | 0 |  x
    | 0 | 0 | 1 |  x
    | 0 | 1 | 0 |  x
    | 0 | 1 | 1 |  x
    | 1 | 0 | 0 |  x
    | 1 | 0 | 1 |  x
    | 1 | 1 | 0 |  x
    | 1 | 1 | 1 |  x
You pick the values you want for R, then pass this 8-bit value as the immediate operand to the instruction along with the 3 input values.

For example, A ∧ B ∧ C would be 0b10000000. A ∧ ¬B ∧ ¬C would be 0b00010000.

There are 256 such tables and many of them can be represented by multiple boolean expressions.


Here's a fancy trick from LLVM source: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Targ...

  #define A 0xf0
  #define B 0xcc
  #define C 0xaa
And then you can build the immediate for a VPTERNLOG operation by writing the bitwise expression with the A/B/C values in source code.

For example, A^B^C=150. A^(~B&C)=210. And so on...
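A sketch of how that plugs into the intrinsic (same constants; operand order assumed to be first/second/third source = A/B/C):

    #include <immintrin.h>

    #define A 0xf0   /* truth-table pattern of the first operand  */
    #define B 0xcc   /* second operand                            */
    #define C 0xaa   /* third operand                             */

    // r = a ^ (~b & c) in one VPTERNLOGD; the immediate is the same expression
    // evaluated over the constants (== 210, matching the example above).
    __m512i xor_andnot(__m512i a, __m512i b, __m512i c) {
        return _mm512_ternarylogic_epi32(a, b, c, A ^ (~B & C));
    }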

Also mentioned by Fabian here: https://twitter.com/rygorous/status/1187032693944410114


This is basically the same way the basic primitive of an FPGA, a LUT (look-up table), works. It's a small ROM or RAM of size 2^N x 1 that is looked up by the N-bit "address" formed from the inputs. Modern FPGAs tend to use N=4 to N=6, with some additional fanciness occasionally present to make the up-to-64 bits of ROM/RAM useful in other ways as well.


That is delightful. Thanks for explaining it so clearly.


Yeah – ARM specifically added EOR3 and BCAX instructions to accelerate SHA-3 hashes, both of which can be handled by VPTERNLOGD.


Very useful. In fact, it speeds up a single instance (i.e. not taking advantage of SIMD) of MD5 by 20%: https://github.com/animetosho/md5-optimisation#x86-avx512-vl...


See also: Why Is AVX 512 Useful for RPCS3?[1] Many of the instructions mentioned by Fabian are used to great effect when emulating the PS3's SPUs.

[1]: https://whatcookie.github.io/posts/why-is-avx-512-useful-for...


> This, combined with the 512b vectors that are the "-512" part, quadruples the amount of architectural FP/SIMD state, which is one of the main reasons we _don't_ get any AVX-512 in the "small" cores.

That’s an interesting point. Does anyone know—the Knights Landing Phi had AVX-512 and was based on Atom cores. Did they bolt on all these extra registers?


Yes. Phi was a massive vector processor bolted onto a tiny scalar processor. This seems to have worked pretty well, for what it was. But it wouldn't be suitable for the applications where intel is using gracemont, because client workloads tend not to use avx512.


Note that quadrupling the architectural state doesn't mean quadrupling the actual state.

In fact, it looks like Haswell and Skylake-X had the same number of physical registers, 168. So that's a straightforward doubling from 256x168 to 512x168.

But further into the thread it looks like the first gen E cores had about 200 128-bit register lines, so trying to fit 512x32 would have been very tight.

To put some of that a different way: The vector design headed for E cores was 128 bits stretching to 256 bits. If it had been 256 bits all the way through, it's likely they would have added AVX-512 support, even if they couldn't increase the size of the register file at all.


All variants of Phi were quite small/simple cores (I think the early ones were derivatives of early Pentium designs(!), just lots of them in more modern processes) with massive vector units strapped to the side. Which was fine for that purpose, since they really only were intended to feed the vector units and not expected to be any good at general-purpose computing tasks.


They actually separated architectural registers from physical ones, due to being SMT4.

There's a chips and cheese article on this.


This is my first time opening a mastodon link and wow. It's not asking me to sign in, to download the app, just straight to the point. I like it!



