
I agree. I work in a similar field, and the value of AVX512 is clearly there - it just hasn't been worth implementing for the tiny percentage of market penetration. This is directly due to the market segmentation strategy Intel applied. AMD has raised the ante for AVX512 with two excellent implementations in a row, and for the first time ever I'm definitely considering building AVX512 targets.

Just as a small example from current code, the much more powerful AVX512 byte-granular two-register-source shuffles (vpermt2b) are very tempting for hashing/lookup-table code, turning a current perf bottleneck into something that doesn't even show up in the profiler. And according to (http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardo...) Zen5 has not one but _TWO_ of them, at a throughput quadrupling Intel's best effort.
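
For a flavor of what that looks like, here's a minimal sketch (assuming AVX512_VBMI is available; the function name is mine): a 128-entry byte lookup table held entirely in two ZMM registers, indexed with a single vpermt2b.

    #include <immintrin.h>

    // Hypothetical sketch: map 64 input bytes through a 128-entry byte table
    // kept in two ZMM registers (requires AVX512_VBMI, e.g. -mavx512vbmi).
    // Index values 0..63 select from 'lo', 64..127 select from 'hi'.
    __m512i lookup128(__m512i indices, __m512i lo, __m512i hi) {
        // One vpermt2b replaces what would otherwise be a gather or a
        // chain of pshufb/blend steps on AVX2.
        return _mm512_permutex2var_epi8(lo, indices, hi);
    }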


Dynamic dispatch adds headaches to the build process; they are surmountable for sure, but in my experience the build wrangling to make it all happen is harder than the original work of rewriting your code with intrinsics!

The other major problem I have with dynamic dispatch, at least for the SIMD code I've written, is that you have to do so at a fairly high level of granularity. Most optimized routines are doing as much fusion & cache tiling as possible and so the dispatch has to happen at the level of the higher-level function rather than the more array-op-like components within it. And mostly, that means you've written your (often quite complicated) procedure several times uniquely instead of simply accelerating components within it.
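
As a rough sketch of the pattern I mean (function names and build flags are illustrative, not from any real project): the whole high-level routine gets compiled once per target, typically in separate translation units built with different -march flags, and the selection happens exactly once rather than inside every array-op.

    #include <cstddef>

    // Per-target versions of the *whole* routine, each defined in its own
    // translation unit built with the matching -march / -mavx* flags.
    void process_block_avx512(float* dst, const float* src, std::size_t n);
    void process_block_avx2(float* dst, const float* src, std::size_t n);
    void process_block_scalar(float* dst, const float* src, std::size_t n);

    using ProcessFn = void (*)(float*, const float*, std::size_t);

    ProcessFn select_process_block() {
        // GCC/Clang builtin; the target is resolved once, not per call.
        if (__builtin_cpu_supports("avx512f")) return process_block_avx512;
        if (__builtin_cpu_supports("avx2"))    return process_block_avx2;
        return process_block_scalar;
    }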

I have not used Highway - if it dramatically simplifies the above, that's excellent!


:) Yes indeed, no changes to the build required. Example: https://gcc.godbolt.org/z/KM3ben7ET

I agree dispatch should be at a reasonably high level. This hasn't been a problem in my experience, we have been able to inline together most SIMD code and dispatch infrequently.


Setting aside the complexity of "detachable" ALU clusters, the bigger problem is that essentially, we can currently drive all those ALUs _because_ they're SIMD and executing the same instruction.

For example, if you could un-gang even a single Zen5 ALU and do 16 independent scalar FMAs at once, you now need the front-end of the processor to decode and issue 16 instructions/cycle for each 1 instruction that it currently does! That's hopelessly wider than the front-end of the CPU can process. It needs to decode the instructions and schedule them, and both of those are complex operations that can be power hungry / slow (not to mention the instruction fetch bandwidth is now through the roof!).

SIMD bypasses that by doing lots of operations with a single instruction. It would be extremely difficult to achieve with just scalar instructions.
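
As a toy illustration of that instruction-count argument (my own sketch, nothing from a real codebase): the loop below does one FMA per element. Compiled scalar, that's one fetched/decoded/scheduled instruction per element; compiled for AVX-512, the compiler can cover 16 elements with a single 512-bit vfmadd per iteration.

    #include <cstddef>

    // Scalar build: ~1 FMA instruction per element through the front end.
    // AVX-512 build (e.g. -O3 -mavx512f): one 512-bit FMA per 16 elements.
    void fma_loop(float* a, const float* b, const float* c, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            a[i] = a[i] + b[i] * c[i];
    }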


Agreed. My current best-case-scenario hope is that the success of the Zen4/5/etc processors will force Intel to adapt their strategy towards AMDs, and finally move us out of the avx512 mess they've segmented us into.

Assuming Intel is changing direction right now, unfortunately they will face 2-3 years of latency to implement that.

AMD's avx512 implementation is just lovely and they seem to be firing on all cylinders for it. Zen4 was already great, 'double pumped' or no.

It looks like Zen5's support is essentially the dream - all EUs and load/store are expanded to 512 bits, so you can sustain two 512-bit FMAs and two 512-bit adds every cycle. There also appears to be essentially no transition penalty to the full-power state, which is incredible.

The only thing sad here is that all this work to enable full-width AVX512 is going to be mostly wasted as approximately 0% of all client software will get recompiled to an AVX512 baseline for decades if ever. But if you can compile for your own targets, or JIT for it.. it looks really good.


> The only thing sad here is that all this work to enable full-width AVX512 is going to be mostly wasted as approximately 0% of all client software will get recompiled to an AVX512 baseline for decades if ever.

Well, the other thing is that for workloads where you're cramming 512b vectors through 2xFMAs every cycle -- there's a good chance you can (and have been) just buying GPUs to handle that problem. So, I think that space has been eaten up a bit in recent times.

I don't think it will be decades of waiting though. AVX2 is a practical baseline today IMO and Haswell is what, barely 10 years old? Intel dragged their feet like crazy of course, but "decades" from now is a bit much. And AVX-512's best feature -- its much more coherent and regular design -- means a lot of vectorization opportunities might be easier to do, even automatically (e.g. universal masking and gather/scatter make loop optimizations more straightforward). We'll have to see how it shakes out.


The GPUs that can do FP64 operations are priced out of the range acceptable for small businesses or individuals.

The consumer GPUs are suitable only for games, graphics and ML/AI.

There are also other applications, like in engineering, where the only cost-effective way is to use CPUs with good AVX-512 support, like Zen 5.

A 9950X has FP64 throughput similar to that of the last GPUs that still had acceptable prices, from 5 years ago (Radeon VII).

Even for FP32, the throughput of a 9950X is similar to that of a very good integrated GPU (512 FP32 FMA per cycle, but at roughly double the clock frequency, so equivalent to a GPU doing 1024 FP32 FMA per cycle), even if it is no match for a discrete GPU.

There are also applications where the latency of transferring the data to the GPU, then doing only short computations, can reduce the performance below what can be achieved on the CPU.

Obviously, there are things better done on a GPU, but there are enough cases where a high throughput CPU like a desktop Zen 5 is better.


I haven't paid attention for the past decade, do modern C/C++ compilers generate any decent AVX512 code if told to do so? Or do you still need to do it by hand via intrinsics or similar?

The short answer is "Yes, sometimes".

Clever hand-written SIMD code is still consistently better, sometimes dramatically better. But generally speaking, I've found Clang to be pretty good at auto-vectorizing code when I've "cleared the path" so to speak, organizing the data access in ways that are SIMD friendly.
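
As a rough example of what "clearing the path" means in practice (an illustrative sketch, not code from anywhere in particular): contiguous arrays, no potential aliasing, no data-dependent control flow. Clang will typically auto-vectorize this cleanly with -O3 and an appropriate -march flag.

    #include <cstddef>

    // SIMD-friendly: unit-stride access, __restrict rules out aliasing,
    // and the loop body has no branches or cross-iteration dependencies.
    void saxpy(float* __restrict y, const float* __restrict x,
               float a, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            y[i] += a * x[i];
    }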

On the Windows platform, in my experience MSVC is a disaster in terms of auto-vectorization. I haven't been able to get it to consistently vectorize anything above toy examples.


Those aren't the only two options. You can use libraries that are made to take advantage of SIMD and you can use ISPC which is specifically about SIMD programming.

Those languages are too SIMD-hostile.

Besides C#, what languages do you think are not SIMD-hostile?

Languages where the semantics have considered parallelism and this kind of optimization. ISPC, OpenCL, Chapel, Futhark, etc.

Thanks. With ISPC and OpenCL it's a given...I was thinking more general-purpose programming languages where it is easy to exploit CPU-side SIMD.

Maybe the CPython JIT can make use of it. That might move the needle past 1%...

I honestly didn't realize how performant the decade-old 2013 Haswell architecture is on vector workloads.

250 GFLOP/s per core is no joke. He also cross-compared it to an M1 Pro, which, when not using the secret matrix coprocessor, achieves effectively the same vector throughput a decade later...


250 is nonsense. Haswell does 2x 256-bit FMA per cycle = 32 FP32 Flop/cycle; at ~4.5 GHz that's 32 * 4.5 = ~144 GFLOP/s per core.

Beating cuBLAS is unlikely. You probably made a mistake. Last I tested it, it was even better than MKL in efficiency.


Yes, the figure was 250 GFLOP/s for 4 cores instead of per core; I misread. Still impressive, but more reasonable.


The floating-point FMA throughput per desktop CPU socket and per clock cycle has been doubled every few years in the sequence: Athlon 64 (2003) => Athlon 64 X2 (2005) => Core 2 (2006) => Nehalem (2008) => Sandy Bridge (2011) => Haswell (2013) => Coffee Lake Refresh (2018) => Ryzen 9 3950X (2019) => Ryzen 9 9950X (2024), going from 1 FP64 FMA/socket/clock cycle until 256 FP64 FMA/socket/clock cycle, with double numbers for FP32 FMA (1 FMA/s is counted as 2 Flop/s).
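
For the last entry in that sequence, the 256 figure follows from the per-core resources (assuming the commonly reported Zen 5 desktop configuration of two 512-bit FMA pipes per core):

    16 cores x 2 FMA pipes/core x 8 FP64 lanes/pipe = 256 FP64 FMA/socket/clock
    (double that for FP32, and 512 FP64 Flop/clock when 1 FMA counts as 2 Flop)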

I wish memory bandwidth could also have been doubled that often on desktops. Instead of 256x (or even more, due to 2-3 times higher core frequency), it's only a 14x increase: DDR-400 at 6.4 GB/s => DDR5-5600 at 89.6 GB/s. The machine balance keeps falling even further.

While flash memory has become so fast in recent years, I haven't heard of any breakthrough technology prototypes bringing comparable progress to RAM. Let alone RAM latency, which has remained constant (+/- a few ns) through all these years.


You are right, which is why in modern CPUs the maximum computational throughput can be reached only when a large part of the operands can be reused, so they can be taken from the L1 or from the L2 cache memories.

Unlike the bandwidth of the main memory or of the shared L3 cache, the memory bandwidth of the non-shared L1 and L2 caches has been increased in exactly the same ratio as the FMA throughput. Almost all CPUs have been able to do exactly the same number of FMA operations and L1 loads per clock cycle (simultaneously with, typically, half that number of L1 stores per clock cycle).

Had this not been true, the computational execution units of the cores would have become useless.

Fortunately, the solution of systems of linear equations and the multiplication of matrices are very frequent operations and these reuse most of their operands, so they can reach the maximum computational throughput.
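
As an illustrative sketch of that operand reuse (my own example, not from the comment): a register-blocked matrix-multiplication micro-kernel keeps a 4x16 block of C in registers for the whole inner loop, so every B vector loaded from L1 feeds four FMAs and the accumulators never touch memory until the end. Production BLAS kernels use larger blocks to push the load-to-FMA ratio even lower.

    #include <immintrin.h>
    #include <cstddef>

    // Assumes AVX-512F; A is row-major with row length k, B rows have stride
    // ldb, C rows have stride ldc. Names and blocking factors are illustrative.
    void kernel_4x16(const float* A, const float* B, float* C,
                     std::size_t k, std::size_t ldb, std::size_t ldc) {
        __m512 c0 = _mm512_loadu_ps(C + 0 * ldc);
        __m512 c1 = _mm512_loadu_ps(C + 1 * ldc);
        __m512 c2 = _mm512_loadu_ps(C + 2 * ldc);
        __m512 c3 = _mm512_loadu_ps(C + 3 * ldc);
        for (std::size_t p = 0; p < k; ++p) {
            const __m512 b = _mm512_loadu_ps(B + p * ldb);  // one L1 vector load...
            c0 = _mm512_fmadd_ps(_mm512_set1_ps(A[0 * k + p]), b, c0);  // ...reused by
            c1 = _mm512_fmadd_ps(_mm512_set1_ps(A[1 * k + p]), b, c1);  // four FMAs
            c2 = _mm512_fmadd_ps(_mm512_set1_ps(A[2 * k + p]), b, c2);
            c3 = _mm512_fmadd_ps(_mm512_set1_ps(A[3 * k + p]), b, c3);
        }
        _mm512_storeu_ps(C + 0 * ldc, c0);
        _mm512_storeu_ps(C + 1 * ldc, c1);
        _mm512_storeu_ps(C + 2 * ldc, c2);
        _mm512_storeu_ps(C + 3 * ldc, c3);
    }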


I agree. It's not even that we over-weight being technically correct. Correctness and kindness are not mutually exclusive at all.

I suspect you're right and it boils down to the fact that a lot of us tech types have substantial pride. Worse, we derive a substantial amount of our self-worth from being knowledgeable and correct, and so of course we have to both demonstrate and defend it. I recognize that truth in most of my real-life nerd friends as well as in myself, even as I try to temper it.

I don't know what the solution is for a community like this that will always be filled with people of so many unique stripes and neurotypes. Other than striving to argue our disagreements from curiosity instead of superiority.


Neat and performant code like the article's makes me very curious how the competition will shake out between AMD's AVX512 implementation and Intel's upcoming AVX10. The entire point of AVX10 seems to be to resolve Intel's P-vs-E-core situation, while AMD seems to have taken the better approach of using either full width (Zen5) or double-pumped 256-bit (Zen4, Zen5 mobile) as appropriate to the situation, and keeping the API seamless.

The big gains delivered in the article are all on a double-pumped Zen4 core! AVX512 brings a lot to the table, so it's quite frustrating that Intel market-segmented support for it so heavily as to completely inhibit its adoption in broad-based client code.


If Intel actually implements AVX10/256 on every CPU they ship going forwards, it will eventually win simply by availability. The market has repeatedly and thoroughly rejected dispatching to different code paths based on CPU, so the only SIMD implementation that actually matters is the lowest common denominator. And since AVX10.1/256 and AVX512VL have a shared subset, that will be what people will eventually target once enough time has passed and nearly everyone has a CPU that can support it.

AVX512 will continue to give AMD some easy wins on the few benchmarking apps that were actually updated to support it, but if Intel sticks with the AVX10 plan I expect that AMD will eventually just use the double-pumped SIMD pipes for everything, just because they are the more efficient way to support AVX10/256 while retaining AVX512 compatibility.

Intel made a lot of bad choices in the past decade, but segmenting the market based on instruction set has to be one of the worst. They just chose to kill all the momentum and interest in their newest and best innovations. Hopefully they actually add AVX10/256 support to the whole lineup, because the width is the least interesting part about AVX512; the masked operations especially are a lot more important.


Dispatching is actually heavily used. Image/video codecs, cryptography, and ML libraries routinely use it, because the lowest common denominator is very low indeed.


The things you listed are probably less than 1% of all client loads. Video codecs and ML mostly run on accelerators and cryptography is a tiny proportion of all loads.


Cryptography is every download you ever do in a browser, bitlocker or equivalent encrypted disk access. Add to that gzip or brotli compression (common in HTTP). Hardware decoders for newer video codecs (such as AV1) are not that common (less than 50% of devices) so most youtube watch time on desktops/laptops is software decoding. It's a lot more than 1%.


> Cryptography is every download you ever do in a browser, bitlocker or equivalent encrypted disk access

For IO bandwidth-bound heavy lifting these things typically use the AES algorithm. Hardware support for that algorithm has been widely available inside CPU cores for more than a decade: https://en.wikipedia.org/wiki/AES_instruction_set#x86_archit... That hardware support is precisely what enabled widespread use of HTTPS and full disk encryption. Before AES-NI it was too slow, or it required the specialized accelerator chips found in web servers in the 2000s that needed to encrypt/decrypt HTTPS traffic.

I don’t think people use AVX2 or AVX512 for AES, because AES-NI is way faster. The runtime dispatch needs just a few implementations: hardware-based AES to use on 99% of the computers, and a couple of legacy SSE-only versions.


The original AES-NI instructions (which were SSE instructions), and also their initial counterparts in AVX, performed 128-bit operations.

Later, 256-bit AES instructions were introduced, but significantly later than AVX2. Such 256-bit AES instructions, which double the AES throughput, are available in more recent CPUs, like AMD Zen 3 and Intel Alder Lake (the so-called VAES instructions).

Some of the more recent CPUs with AVX-512 support have added 512-bit AES instructions, for an extra doubling of the AES throughput.

Zen 5 (desktop and server) doubles the AES throughput in comparison with Zen 4, similarly with the double throughput for other vector operations.

In conclusion, on x86 CPUs there are many alternatives for AES, which have different throughputs: 128-bit SSE instructions since Westmere, 128-bit AVX instructions since Sandy Bridge, 256-bit VAES AVX instructions since Zen 3 and Alder Lake and 512-bit AVX-512 instructions since Ice Lake, but only in the CPUs with AVX-512 support.
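
For illustration, one AES round step at each of those widths looks like this in intrinsics (a sketch only; key schedule and mode of operation omitted, and each call performs one round on every 128-bit lane of its register):

    #include <immintrin.h>

    // 128-bit AES-NI (SSE encoding), one block per call.
    __m128i round_aesni(__m128i block, __m128i rk)    { return _mm_aesenc_si128(block, rk); }
    // 256-bit VAES, two independent blocks per call.
    __m256i round_vaes256(__m256i blocks, __m256i rk) { return _mm256_aesenc_epi128(blocks, rk); }
    // 512-bit VAES + AVX-512, four independent blocks per call.
    __m512i round_vaes512(__m512i blocks, __m512i rk) { return _mm512_aesenc_epi128(blocks, rk); }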


Surely h264 encode and decode is substantial, given the large amount of video consumed?


I don’t believe many people encode, or especially decode, video with CPU-running codecs.

Modern GPUs include hardware accelerators for popular video codecs, these are typically way more power efficient than even AVX-512 software implementations.


> I don’t believe many people encode, or especially decode, video with CPU-running codecs.

Believe what you want but as soon as realtime is not a concern and quality matters you'll be using CPU-based encoders unless you have special hardware for your use case.

> Modern GPUs include hardware accelerators for popular video codecs

... with shit quality encoders because they are designed for speed first.

Decoding is a different matter, but even there older hardware can easily end up not supporting codecs (or profiles of them) that you come across.


> The market has repeatedly and thoroughly rejected dispatching to different code paths based on CPU, so the only SIMD implementation that actually matters is the lowest common denominator.

RHEL just moved up to x86_64-v2, equivalent to 2009 level CPUs. And they’re an early mover, Fedora/Ubuntu/Arch have not done the same.

Glibc has used CPU dispatching for str* and mem* functions for over a decade.


FreeBSD added AVX512 dispatch into its C library AFAIK.


> The market has repeatedly and thoroughly rejected dispatching to different code paths based on CPU

What do you mean? At least numpy and pytorch (the only numeric libraries I'm familiar with) both use runtime dispatching.


I agree. I still hesitate to ship code that requires even AVX1, even though it was first introduced in 2011(!).

AVX512 really improves the instruction set. Not just the masking, but it fills some really big holes in terms of instructions that AVX2 doesn't have a good solution for.

At the same time I also DO have plenty of code that could definitely use the compute throughput improvement of 512-bit vectors. But it's definitely a more niche usage. You have to at least nominally satisfy that you: 1) benefit from 2x the ALU throughput, 2) live mostly in the cache, and 3) are not worth running on the GPU instead.


It depends on the product and the market.

I’ve been shipping software that requires AVX2 and FMA3 because it is a professional CAM/CAE application. Our users typically don’t use sub-$100 Intel processors like Pentium and Celeron. The key is communication: you need to make sure users are aware of the hardware requirements before they buy.

Another example, in the server market, you can almost always count on having at least AVX2 support because most servers today are cloud-based. For people running these cloud services, power efficiency is crucial due to the high cost of electricity, so they tend to replace old hardware regularly.

On the other hand, for desktop software aimed at a broad audience, there are valid reasons to hesitate before shipping code that requires AVX1.


I don't get saying mask operations are more important than width?

Mask operations can be trivially emulated with vblend; it's one extra instruction.

Width can't be emulated, you just are stuck running half speed.

This take keeps getting repeated, but doesn't appear to be backed up by reality.

Intel hasn't even put AVX10 on their upcoming chips (Skymont), so it appears to be going nowhere.


> Mask operations can be trivially emulated with vblend, it is one extra instruction..

For unaligned loads where you can't guarantee that the entire vector is on a mapped page?


The important feature of AVX-512 demonstrated in my blog post is masked loads and stores, which can't be emulated with vblend.
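
A minimal sketch of that pattern (my example, assuming AVX-512F; names are mine): the loop tail is handled with a masked load/store, and because masked-out lanes never touch memory, the full-width access is safe even when the 16-element span would run past the end of the buffer.

    #include <immintrin.h>
    #include <cstddef>

    void scale(float* x, float s, std::size_t n) {
        const __m512 vs = _mm512_set1_ps(s);
        std::size_t i = 0;
        for (; i + 16 <= n; i += 16)
            _mm512_storeu_ps(x + i, _mm512_mul_ps(_mm512_loadu_ps(x + i), vs));
        // Tail of 0..15 elements: set the low (n - i) bits of the mask.
        // Masked-out lanes are neither read nor written, so no fault can
        // occur past the end of the array -- a vblend-based emulation would
        // still need a full 512-bit load here.
        const __mmask16 m = static_cast<__mmask16>((1u << (n - i)) - 1u);
        const __m512 v = _mm512_maskz_loadu_ps(m, x + i);
        _mm512_mask_storeu_ps(x + i, m, _mm512_mul_ps(v, vs));
    }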


Zen 4's AVX512 implementation is not double-pumped, and tech journos need to stop calling it that, because that term has a specific meaning which does not match what takes place.

It simply decodes operations on ZMM registers into multiple uOPS and schedules them to free 256b units. In addition, Zen 4 has special handling of 512b full-width shuffles, with dedicated hardware to avoid doing very expensive emulation. As a result, Zen 4 with its 4 256b SIMD units still acts like a very strong 2x512b core. There is nothing cheap about this implementation and it is probably the best rendition for consumer hardware so far.


I don't understand why Intel doesn't make their E-cores use double-pumped AVX512 to solve this issue (or just make P-core-only CPUs for desktops like they should). They have had years to fix this by now. It's annoying that despite AMD's support, market share keeps adoption from happening anyway, and the AVX10 thing will unfortunately allow them to hold back the world even longer.

What I like to see (for desktop) is: better cores, more cores, and well-standardized instruction sets that unlock useful things (wide SIMD, float16, gather/scatter, ...). AMD is doing pretty well at this. What Intel is doing instead: put weak cores alongside decent cores; cripple the decent cores to keep up with the weak cores; release CPUs with the same number of cores as before for many generations in a row; use the weak cores to make it sound like they have more cores than they do; bring out way too many variants of instruction sets to ever have a useful common set; drop support for their own promising-sounding instructions.

I just really dislike anything Intel has come out with or plans to come out with lately :p

My cycle of preference of manufacturer (for desktop computers) has been: 90s: Intel. Early 2000s: AMD (Pentium 4 was so meh). Late 2000s + 2010s: Intel. Now: AMD again. What will Intel do to gain a foothold again (that isn't sabotaging the other)? We need to keep the competition going, or the other side may get too comfortable.


A credible but unconfirmed rumor I've read is that Intel didn't want to do it because of the 512-bit shuffles. The E-cores (including those using the upcoming Skymont microarchitecture) are natively 128-bit and already double-pump AVX2, so to quad-pump AVX-512 would result in many-uops 512-bit shuffle instructions made out of 128-bit uops.

This would render the shuffles unusable, because they would unpredictably cost anywhere from 1 uop/cycle to 10-30 uops/cycles depending on which core you are on at the moment. A situation similar to PEXT/PDEP, which cost almost nothing on Intel and dozens of cycles on AMD until a couple of generations ago.

Why does Zen 4 not have this problem? First, they're only double-pumping instead of quad-pumping. Secondly, while most of their AVX-512 implementation is double-pumped, there seems to be a full-length shuffle unit in there.


Interesting stuff! If a shuffle is just crossed wires without any logic (as far as I know shuffle is a completely predetermined bit permutation), wouldn't it fit on the E-cores easily enough?


AVX-512 allows arbitrary shuffles, e.g., shuffle the 64 bytes in zmm0 with indices from zmm1 into zmm2. Simple shuffles like unpacks etc aren't really an issue.


Worse yet (for wiring complexity or required uops, anyway), AVX-512 also has shuffles with two data inputs, i.e. each of the 64 bytes of result can come from any of 128 different input bytes, selected by another 64-byte register.


Which is also why it's so attractive. :)

Those large shuffles are really powerful for things like lookup tables. Large tables are suddenly way more feasible in-register, letting you replace a costly gather with an in-register permute.


What they did is actually even worse: they briefly introduced AVX-512 FP16 in Alder Lake (excellent, turbo-fast instructions for AI), only to immediately drop it.


I've run into the same behavior with clang and intrinsics. While I appreciate the fact that they're trying to optimize intrinsics usage, there really does need to be a flag or pragma you can pass that says something along the lines of "no really, give me what I asked for." In some cases I have found that the code it produces is a significant pessimization compared to what I had selected.


For a fully compute-bound workload, you're certainly correct.

That's rare though. All it takes is a couple of stalls waiting on memory, where a second thread is able to make progress, to make that "ideal speedup" decidedly nonzero.


Regardless though why would it potentially being higher in newer architectures be viewed as a good thing?


Because most code is not running anywhere near saturation of the available resources, and the problem is only getting worse as cores get wider. I mean, look at the Zen5 block diagram - there are 6 ALUs and 4 AGUs on the integer side alone! That's almost two entire Zen1 cores worth of execution resources, which is pretty amazing. Very, very little real world code is going to be able to get anywhere near saturating that every cycle. SMT helps improve the utilization of the execution resources that have already been paid for in the core.

I'll give another example from my own experience. I write a lot of code in the computer graphics domain. Some of the more numeric-oriented routines are able to saturate the execution resources, and get approximately 0% speedup from SMT.

Importantly though, there are other routines that make heavy use of lookup tables. Even though the tables reside completely within L1 cache, there are some really long dependency chains where the 3-4 cycle wait for L1 chains up and causes some really low utilization of the ALUs. Or at least, that's my theory. :) Regardless, in that code SMT provides about a 30% speedup "for free", which is quite impressive.

I was uncertain for a while whether SMT had a future, but I think for x86 in general it provides some pretty significant gains, for a design complexity that has already been 'paid' for.


With the continuous improvement of out-of-order execution, the SMT gains have been diminishing from Zen 1 to Zen 4.

However you are right that Zen 5, like also the Intel Lion Cove core, has a jump in the number of available execution resources and it is likely that out-of-order execution will not be enough to keep them busy.

This may lead to a higher SMT gain on Zen 5, perhaps on average around 30% (from typically under 20% with Zen 3 or Zen 4), like in the Intel presentation where they compared a Lion Cove without SMT with a Lion Cove with SMT. In the context of a hybrid CPU, where the MT-performance can be better provided by efficient cores than by SMT, Intel has chosen to omit SMT, for better PPA (performance per power and area), but in the context of their future server CPU with big cores, they will use cores with SMT (and with wider SIMD execution units, to increase the energy efficiency).


> Regardless though why would it potentially being higher in newer architectures be viewed as a good thing?

Because SMT getting faster is a nearly free side-effect. We didn't add extra units to speed up SMT at the cost of single-thread speed. We added extra units to speed up the single thread, and they just happened to speed up SMT even more (at least for the purpose of this theoretical). That's better than speeding up SMT the same percent, or not speeding up SMT at all.

Imagine if I took a CPU and just made SMT slower, no other changes. That would be a bad thing even though it gets the speedup closer to 0%, right? And then if I undo that it's a good thing, right?


this doesn't seem to reflect the reality of the way hardware is actually being added to the cores, where Zen5 features a fourth ALU which is only useful to a single thread in the low-single-digits.

https://old.reddit.com/r/hardware/comments/1ee7o1d/the_amd_r...

this isn't adding more units to speed up single-thread and SMT being a nice "incidental" gain, this is actively targeting wide architectures that have lots of pipeline bubbles for SMT to fill.

and that's fine, but it's also a very different dynamic than, e.g., Apple silicon, where the focus is running super deep speculation and reordering on a single thread.


What do you think they should add instead of another ALU?

The returns are diminishing, but a single-digit bump is still pretty good when your goal is faster single threads.

Adding nothing to save space is obviously not the answer, because that leads to having more slower cores.

Also the topic of the post, the 2-ahead branch predictor, exists specifically to get more ALUs running in parallel!

> apple silicon, where the focus is running super deep speculation

And to make that speculation profitable, the M1 cores have 6 ALUs among other wideness.

