Hacker News new | past | comments | ask | show | jobs | submit login

Summary:

AVX-512 adds many instructions that can replace what previously took three less efficient instructions.

This instruction set also doubles the number of available SIMD (single instruction, multiple data) registers.

Those instructions are very useful even on 128-bit vectors, and not many people actually need 512-bit vectors.

Because of the number of registers and the 512-bit width, it takes up a lot of silicon area. This makes it costly, so it is reserved for more expensive CPUs.

Had it been limited to, or also offered in, a 256-bit version, this instruction set would most likely have been included in many more CPUs, making it much more useful.




> Because of the number of registers and the 512-bit width, it takes up a lot of silicon area. This makes it costly, so it is reserved for more expensive CPUs.

> Had it been limited to, or also offered in, a 256-bit version, this instruction set would most likely have been included in many more CPUs, making it much more useful.

Someone is blatantly ignoring AMD CPUs...


I merely summarized the article, don't shoot the messenger :)


Does AMD support the 256-bit-wide subset of the AVX-512 ISA?


When AMD launched AVX-512 support, it was present on all cores of that generation. I think it was Zen 4, but I might be misremembering. If I recall correctly it was using 256-bit vectors internally, so full 512-bit operations were slower than on Intel.


It takes a double-pumped approach, using two cycles of 256-bit-wide execution to finish an AVX-512 instruction. So it sort of does.


The physical register file is much larger than the logical register count. A budget option could simply reduce the amount of renaming done to save space.

In Intel's case, Cannon Lake did have AVX-512, but was blocked from being a mainstream part due to 10nm yields. And then their rushed efficiency-core strategy effectively disabled AVX-512 just as they were getting back on track.

I don't think there's an intrinsic reason you couldn't have efficiency cores run AVX-512, albeit slowly, and I expect we'll see just that.


> And then their rushed efficiency core strategy effectively disabled AVX-512 just as they were getting back on track.

I partly blame Linux for that. I remember asking at Kernel Recipes about supporting truly heterogeneous multiprocessor systems and got a shrug: "don't buy broken hardware". Back then, it was for a Broadcom home gateway product, which had an asymmetrical dual core, one core with an FPU, the other without. Since then we have seen many examples of such asymmetry: most HMP smartphones have asymmetrical instruction sets. Mono (and probably all JIT VMs) hit issues with varying cache line sizes, so the perfect abstraction is already gone. And now we have Intel E vs. P cores. This is a rather hard problem, I won't pretend otherwise, but the amount of dead silicon and lost power efficiency accumulates significantly.


At least on x86, the CPUID instruction is part of the problem. Userspace can do feature detection and then rely on that. But if it's inconsistent between cores, then thread migration would cause illegal-instruction faults.

If the kernel tried to fix that by moving such faulting threads to P-cores, a memcpy routine with AVX-512 instructions would cause all threads to be moved off the E-cores.

So first Intel would have to introduce new CPUID semantics to indicate that e.g. AVX-512 is not supported by default, plus a separate flag indicating that it is specifically supported on this core. Then userspace would have to pin the thread if it wants to use those instructions, or stick to the default set if it wants to remain migratable.


I don't really know how CPUID works, but I'm guessing it can be trapped by Linux. So I think a first "stupid" implementation would be for Linux to report in CPUID the intersection of the feature sets of the CPUs on which the process is allowed to run. So if you want to run AVX-512, you first need to pin that process to an AVX-512 CPU. You would be able to find an AVX-512 CPU by checking /proc/cpuinfo. (Even this "simple" variant is far from trivial, because the cpuset can be changed dynamically in various ways; Android, for example, moves a process from foreground CPUs to background CPUs using cgroups.)


Not sure you can trap on CPUID, but the kernel does have control over which CPUID bits are exposed to your application. So requiring pinning to see all the bits could work, but then the issue is what happens if the affinity is changed. A static list of required capabilities in some ELF header would probably be better.


> A static list of required capabilities in some ELF header would probably be better.

I think I agree. The thing is, it's kind of a security issue. I suggested pinning because it requires CAP_SYS_NICE, which is a feature: if you allow apps to freely declare their usage, they will end up being scheduled unfairly, because the system will stick them to P-cores.

That being said, you could indeed have an ELF header mentioning this, and then ignore it if the caller doesn't have CAP_SYS_NICE. I do feel using an ELF header for that is weird, but my knowledge of ELF is way too limited to judge.

Another thing that could work is using file-system attributes or the file mode (like setuid), but I think FS support for extended attributes is spotty at best, and I doubt modes can be extended.


I don't think sched_setaffinity requires CAP_SYS_NICE unless you want to set it on a process you don't own.


Maybe I'm dumb, and I'm certainly no expert on this subject, but wouldn't we need an executable containing both an AVX-512 code path and a plain alternative, plus a way to switch code paths according to the core the code is running on? The same memory page would run on a P-core or an E-core. Inefficient because of the extra checks?


Userspace can be preempted at any instruction, so you have a TOCTOU (time-of-check to time-of-use) problem.


Sure, but one can first pin the thread to a core, or a "don't move me between core types" flag could be added to OSes.


Right, thanks.


Or maybe a new system call to allow a thread to temporarily enter a “performance mode” where it can only be scheduled on the powerful cores. Pinning sounds a bit too strict.


You can already pin to a set of cores instead of a single one. But anyway, my point is that currently userspace interacts directly with CPU features without intermediation from the kernel. So Intel would have to think about how to coordinate with userspace too, not just rely on the kernel to patch things up (or not).


Since Android is Linux, won't manufacturers of such smartphones contribute solutions?

Big companies like Samsung should have more than enough resources and interest in doing so. Unlike the guy who answered you at Kernel Recipes, I guess.


When mainline tells you "your hardware is broken", what kind of contributions do you expect exactly?


Something that proves that it is not?


I don't believe ARM has this problem. They are careful to ensure the same instruction set is available on all cores in the chip. This is a botched launch from Intel. Software is not the solution.


Samsung once had this problem with their in-house cores.


IIRC what you’re describing here is indeed what shipped. All the AVX-512 instructions are available for 128-bit, 256-bit and 512-bit registers (xmm, ymm, zmm). If not strictly all, essentially all?


Base AVX-512, on the first Phis, did in fact only work on 512-bit registers (it also used a slightly different encoding). A later extension is AVX512VL (included on all 'normal' CPUs implementing AVX-512), which adds support for the instructions on smaller vector sizes. But there is no standard mode that allows the hardware to support the instructions only for 256- or 128-bit registers; they must be supported at least for 512-bit registers or not at all.


Oh! Yeah! I do forget about “base” avx-512 sometimes. No one but HPC folks really ever had any proximity to that. In practice if you’ve got a computer that supports avx-512, it supports VL.


The Phis inherited that from Larrabee: the processor simply didn't have any 128/256-bit SIMD or even "real" x87 (FPU). It did have a lot more FMA variants, including an instruction that did "single-cycle next-step rasterization" (faddsets).


You can microcode them on top of high density SRAM if need be.


Right, but the issue is you can't "just" offer AVX-512 with a 256-bit vector length. You have to also offer the 512-bit options too, which has costs that your processor vendor may not be willing to pay. So you end up only getting AVX2 support.


Hmmm, there are only two desktop processor vendors. Is AVX-512 at all available on mobile?

Of the two desktop vendors, AMD has AVX-512 support on all their AM5 CPUs. Intel has AVX-512 support on all 11th-gen CPUs and on some 12th-gen CPUs. The support is there in silicon on all P-cores in 12th/13th-gen CPUs, just disabled in microcode.

So AMD and Intel have already paid the cost.


Previously, Intel had AVX-512 support in three generations of mobile CPUs (Cannon Lake U, Ice Lake U, Tiger Lake H/U), but only the last generation had widespread availability.

Starting with Alder Lake, Intel has dropped the AVX-512 support in non-server CPUs.

On the other hand, AMD has just launched their Phoenix mobile CPUs (Ryzen x 7x40 HS or U), which have excellent AVX-512 support.


AMD, until AM5, was not willing to pay the cost. Intel was not willing to pay the cost for their E-cores. AVX-512 is almost 10 years old at this point and, because of adoption issues, still can't be relied upon.


Or use double pumping like AMD does; that seems like the solution for "efficiency cores" to me.



