Zen5's AVX512 Teardown and More (numberworld.org)
143 points by todsacerdoti 34 days ago | 87 comments



Intel's handling of SIMD is representative of the value-subtracting principles that have caused the company to stagnate over the past 15 years or so.

It is an intrinsic problem with SIMD as we know it that you have to recode your application to support new instructions, which is a big hassle. Most people and companies will give up on supporting the latest and greatest and will ship binaries that meet the lowest common denominator. For instance, it took forever for Microsoft to rely on instructions that had been available for almost 15 years.

As Charlie Demerjian has pointed out for years, consumers are waking up to the fact that a 2024 craptop isn't much better than a 2014 crapbook, and there is zero credibility in claims about "Ultrabooks", "AI PCs", etc. What could make a difference is a coordinated end-to-end effort to deploy the latest developments as quickly as possible across as much of the product line as possible, to get tooling support for them, and to drive developers to adopt them quickly. As it is, Intel boasts about how the national labs are blowing up H-bombs in VR faster than ever, and how Facebook is profiling users more efficiently than before, without realizing that customers don't believe these advances will make a difference for them. So instead of buying a new PC that might deliver better performance when software (maybe) catches up in 7-8 years, they hold on to old machines longer.


Worse still is Intel's rollout of AVX512 specifically, which started nearly a decade ago but to this day still isn't available across their whole product stack, so the countdown to it becoming ubiquitous hasn't even started yet. They painted themselves into a corner by making 512-bit vectors a mandatory feature, then decided that isn't feasible to support in their small E-cores, so now they're walking it all back with a new "AVX10" spec, which is just a redux of AVX512 except 512-bit vectors are optional this time.

Then we'll have to wait another decade or so for AVX10 to become the baseline, so AVX2 will probably be old enough to drink (in the US) before it's fully phased out.


While it does seem that AVX10 was mainly designed for consumer CPUs so they could use modern vector instructions without 512-bit vectors, the upcoming Arrow Lake will not have it.[1]

I guess we will have to wait for at least one more generation.

[1] - According to Intel® Architecture Instruction Set Extensions Programming Reference: https://cdrdv2-public.intel.com/826290/architecture-instruct...


Not only does Arrow Lake not have AVX10, but even Panther Lake, the 2025/2026 Intel CPU, does not have it.

Panther Lake will introduce FRED (Flexible Return and Event Delivery), a new mechanism for handling interrupts, exceptions and system calls.

FRED will bring tremendous changes to operating system kernels, but it will have little influence on user programs, except that the computer will spend less time running OS kernel code than it does now.

For now it is expected that Intel will introduce AVX10 in its consumer CPUs only in Nova Lake, the Intel 2026/2027 CPU.

Meanwhile, AMD Zen 4 and Zen 5 already happily support AVX10, except for implementing the CPUID AVX10 flags. AVX10.1 differs from AVX-512 only by adding a simpler method for identifying which instructions are supported. AVX10.2 will add only a few instructions that are not needed on CPUs that support the 512-bit AVX-512 instructions, like Zen 4 and Zen 5. AVX10.3 has not been defined yet and is far in the future.


Thanks for nerd-sniping me into FRED! :)


Intel is making Xeons out of E-cores (up to 288 of them on one chip) so I assume those will also be motivating the rollout of AVX10, not just their consumer parts.


But surely they could just double pump like AMD does on Zen 4(c) and also on (some?) Zen 5c.

It's weird to see Intel so... broke? That they're seemingly forced to recycle old architectures endlessly.


Another concern besides register file size is the shuffle instructions, which can move any byte of a 512-bit register to any other position (or, in the two-register variant vpermt2b, select from any byte across two such registers, i.e. choosing from a 128-byte table and doing 64 such selections in one instruction).

You can't emulate that with just two regular 256-bit uops; you need four (maybe more for blending the results together). And if you don't have the two-register-table 256-bit variant (e.g. Tiger Lake doesn't, though it does for 512-bit of course; it splits that into three uops), that'd end up at a rather massive 12 uops.
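(In intrinsic form, for reference -- a minimal sketch assuming AVX512_VBMI; the function name is made up:)

  #include <immintrin.h>

  // vpermt2b: each of the 64 output bytes is selected by the low 7 bits of
  // the corresponding index byte, from the 128-byte table formed by {a, b}.
  __m512i two_table_byte_shuffle(__m512i a, __m512i idx, __m512i b) {
    return _mm512_permutex2var_epi8(a, idx, b);
  }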


I think Intel's E-cores are quite a bit smaller than the Zen 4c/5c cores; maybe at that scale it's prohibitive to even double up the register file? That's required even if the logic is double-pumped. AIUI the small Zen cores are mostly the same design as the big ones, just with less cache, the silicon layout retuned for density rather than speed, and the 3D V-Cache stacking vias removed, while Intel's small cores are clean-sheet designs with next to nothing in common with their big cores, so they have the opportunity to shrink them a lot more.


Yes, while the big Intel cores are much bigger than the big AMD cores (e.g. 5 square mm in Meteor Lake vs. 3.8 square mm for Zen 4), the Intel small cores are much smaller than the AMD compact cores (e.g. 1.5 square mm in Meteor Lake vs. 2.5 square mm for Zen 4c).

The smaller size of the Intel E-cores is due not only to their different microarchitecture, but also to the fact that only their L1 caches are private, while their L2 caches are shared within groups of 4 E-cores.

The shared L2 cache may not matter much for many general-purpose programs, but for other multi-threaded programs, which depend on high total L2 transfer throughput, the performance of a group of 4 E-cores becomes similar to that of a single core instead of being 4 times greater.

The AMD compact cores have the same non-shared cache memories as the big cores. Only the shared L3 cache blocks that service a group of compact cores are smaller than for the same number of big cores.


My non-expert brain immediately jumped to double-pumping + maybe working with their thread director to have tasks using a lot of AVX512 instructions prefer P cores more. It feels like such an obvious solution to a really dumb problem that I assumed there was something simple I was missing.

The register file size makes sense, I didn't think they were that much of the die on those processors but I guess they had to be pretty aggressive to meet power goals?


> The register file size makes sense, I didn't think they were that much of the die on those processors

https://i.imgur.com/WdMPX8S.jpeg

According to this, Zen 4's FP register file is almost as big as its FP execution units. It's a pretty sizable chunk of silicon.


I was having trouble finding an E Core die shot, but that helps put it into perspective a bit anyway. Thanks!


If/once they follow through on their x86S architecture, maybe they'll have the transistor budget to support proper AVX512 on their efficiency cores.


Skymont little cores have 4x 128-bit execution. They could quadruple-pump.

But it looks more like they're giving up on people writing code for wide vectors, instead settling for trying to make existing code faster.


Well, they don't support it either. According to the document I linked, neither the just-released Sierra Forest, nor the planned Clearwater Forest support AVX10.


AVX10 is still pretty much in the proposal phase, and has been recently updated based on feedback Intel has received. It takes several years to get from that stage to shipping hardware.


Granite Rapids, to be launched in a few months, is said by Intel to support AVX10.1/512 (which is identical to the ISA supported by Zen 5, except for a few additional flags reported by CPUID; Zen 4 lacks only VP2INTERSECT of AVX10.1).

Only the availability of AVX10/256 in Intel's consumer CPUs and in its server CPUs with E-cores is in the proposal phase (mainly because Intel has yet to design and launch, as the successor of Skymont that is being launched now, an E-core supporting AVX10/256; this is expected only in H2 2026).


I don't think you can phase out AVX2; it's the base of AVX512, because you can't always go 512-wide, and you'd have no backwards compatibility.


I know AVX2 will continue to exist in hardware forever for backwards compatibility. By "fully phased out" I mean the eventual point when software no longer has to maintain a dedicated path for hardware which supports AVX2 but doesn't support AVX10, because all relevant hardware supports AVX10.


The EVEX prefix can address XMM/YMM/ZMM registers, so you can apply the AVX512 instruction set to 128-bit and 256-bit registers too.
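For example (a minimal sketch, assuming a CPU with AVX-512F + AVX-512VL; the function name is made up), this is an EVEX-encoded, masked operation that only ever touches 256-bit YMM registers:

  #include <immintrin.h>

  // Merge-masked add on 256-bit vectors: lanes whose mask bit is 0 keep the
  // corresponding value from 'a'. AVX-512 masking, YMM-wide operands.
  __m256 masked_add(__m256 a, __m256 b, __mmask8 k) {
    return _mm256_mask_add_ps(a, k, a, b);
  }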


I do not understand why ubiquity or baselines should be required in order to use CPU features :)

For many years now, performance-critical libraries have used runtime/dynamic dispatch.

Our github.com/google/highway intrinsics even automate this. You can write your code once and it is compiled for each instruction set, and the best codepath is selected at runtime.
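Roughly, usage looks like this (an abbreviated sketch of the foreach_target pattern from the project's examples; the exact boilerplate may differ, the names MulAdd/CallMulAdd are made up, and the loop assumes n is a multiple of the vector length):

  // muladd.cc -- sketch of Highway's write-once, per-target compilation.
  #define HWY_TARGET_INCLUDE "muladd.cc"
  #include "hwy/foreach_target.h"  // re-includes this file once per target
  #include "hwy/highway.h"

  HWY_BEFORE_NAMESPACE();
  namespace HWY_NAMESPACE {
  namespace hn = hwy::HWY_NAMESPACE;

  // y[i] = a*x[i] + y[i]; the vector width follows the compiled target
  // (SSE4, AVX2, AVX-512, ...). Remainder handling omitted for brevity.
  void MulAdd(const float* x, float* y, size_t n, float a) {
    const hn::ScalableTag<float> d;
    const auto va = hn::Set(d, a);
    for (size_t i = 0; i < n; i += hn::Lanes(d)) {
      hn::Store(hn::MulAdd(va, hn::Load(d, x + i), hn::Load(d, y + i)),
                d, y + i);
    }
  }

  }  // namespace HWY_NAMESPACE
  HWY_AFTER_NAMESPACE();

  #if HWY_ONCE
  HWY_EXPORT(MulAdd);
  void CallMulAdd(const float* x, float* y, size_t n, float a) {
    HWY_DYNAMIC_DISPATCH(MulAdd)(x, y, n, a);  // best codepath for this CPU
  }
  #endif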


Dynamic dispatch adds headaches to the build process; they are surmountable for sure, but in my experience the build wrangling to make it all happen is harder than the original work of rewriting your code with intrinsics!
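(For concreteness, a generic sketch of the pattern I mean, with made-up kernel names: the dispatcher itself is trivial; it's that each variant has to live in its own translation unit compiled with different -m flags, and wiring those per-file flags into the build system is where the wrangling happens.)

  // dispatch.cc -- kernel_avx2.cc is built with -mavx2, kernel_avx512.cc
  // with -mavx512f etc.; only this file is compiled for the baseline.
  #include <cstddef>

  void kernel_scalar(float* dst, const float* src, std::size_t n);
  void kernel_avx2(float* dst, const float* src, std::size_t n);
  void kernel_avx512(float* dst, const float* src, std::size_t n);

  using KernelFn = void (*)(float*, const float*, std::size_t);

  static KernelFn select_kernel() {
    // GCC/Clang builtin; MSVC would need __cpuid plus a small wrapper.
    if (__builtin_cpu_supports("avx512f")) return kernel_avx512;
    if (__builtin_cpu_supports("avx2"))    return kernel_avx2;
    return kernel_scalar;
  }

  void run_kernel(float* dst, const float* src, std::size_t n) {
    static const KernelFn fn = select_kernel();  // chosen once, on first call
    fn(dst, src, n);
  }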

The other major problem I have with dynamic dispatch, at least for the SIMD code I've written, is that you have to do so at a fairly high level of granularity. Most optimized routines are doing as much fusion & cache tiling as possible and so the dispatch has to happen at the level of the higher-level function rather than the more array-op-like components within it. And mostly, that means you've written your (often quite complicated) procedure several times uniquely instead of simply accelerating components within it.

I have not used Highway - if it dramatically simplifies the above, that's excellent!


:) Yes indeed, no changes to the build required. Example: https://gcc.godbolt.org/z/KM3ben7ET

I agree dispatch should be at a reasonably high level. This hasn't been a problem in my experience, we have been able to inline together most SIMD code and dispatch infrequently.


I'd say 2018 ~= 2024 instead of 2014 (those craptops all too often had dual-core i5's and i7's, 15" TN 768 screens, and HDDs), but yeah things have slowed down a bit.


I don't think you can make a credible case that a 2024 PC is equal to a 2014 one. In 2014 you could get 4 Haswell cores in a 65W TDP, for 410 2014 US dollars. For the same power and less money in 2024 you get a web browser platform that is around 6x faster, or a code compilation platform that is 5-20x faster.


A lot of the performance gains of the past 10 years aren't obvious from the headline specs.

The desktop PC I built in 2015 for £1082.98

  i7-4790K (4 cores, "4.00 GHz base 
         4.40 GHz Turbo")
  32GB DDR3 RAM
  Samsung 250GB SSD
  nvidia GTX 960 2GB GPU
The desktop PC Dell will sell me, today, for £1,174.80 [1]

  i7-14700 (8 performance cores, 12 efficient
         cores, "2.1 GHz Base, 5.4 GHz Turbo")
  32GB DDR5 RAM
  512 GB SSD 
  Intel Integrated Graphics
Sure, it's better. But on paper those spec changes don't look game-changing, considering it's been an entire decade. Especially if you're mistrustful of efficiency cores.

But what those headline specs don't mention is that the RAM is 4x faster, the SSD is now nvme and faster, the pcie lanes are 4x faster, and the CPU cache has quadrupled.

[1] https://www.dell.com/en-uk/shop/desktop-computers/new-optipl...


The other big difference is if you are comparing DIY vs DIY, you can now get a 2TB SSD for $100 which is pretty great compared to a decade ago.


A good 2TB TLC NVMe 4.0 drive went for 100 USD when the prices bottomed out, but Samsung and friends cut the supply.

You can get 2TB for cheap now, but those are DRAM-less QLC drives. I suggest paying an extra 30-50 USD to get one with TLC flash and a DRAM cache.


Well, if you accept an asspull reasoning, you could guess a typical task consists of equal parts non-CPU-dependent work (disk, network), which hasn't gotten faster; single-threaded work, which is allegedly twice as fast; and parallelizable work, which I assume to be infinitely fast. That gives us a timing of 1/3*(1+0.5+0), which is a 2x speedup, with the maximum achievable compared to the baseline being 3x.
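Spelled out (same assumptions, just made explicit):

  old time = 1/3 + 1/3     + 1/3 = 1
  new time = 1/3 + (1/3)/2 + 0   = 1/2   -> 2x speedup
  limit    = 1/3 + 0       + 0   = 1/3   -> 3x at best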

So no, you are right, your computer probably isn't that much faster in practice.


Those are the specs you'd expect from a $500 mini PC. Step up to $700 and you will get a better CPU plus 2TB nvme SSD.

https://www.geekom.de/geekom-a5-mini-pc/


On paper those specs definitely look game-changing to me. The newer one has three of the older CPUs on the side, for free. The i9 is even more ridiculous, having what amounts to 4 quad-core Skylake CPUs as coprocessors (efficiency cores, in the parlance).

But people are underestimating the compound effect of 10 years of 15% generational improvements. The CPUs in the article will run your web browser 4x faster than the older CPU you mentioned, about 2x faster than a Ryzen 5 5500 that is only 2 years old.
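(The compounding works out to roughly 1.15^10 ≈ 4, which is where that 4x figure comes from.)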


> On paper those specs definitely look game-changing to me.

The reality is that the newer processor is better in all benchmarks. Like 4-5x on the benchmarks I looked up, yes.

But back in the day, the 'turbo' frequency was only available for a few seconds for thermal and power reasons, and the 'base' clock speed was how it would actually perform on big, compute-intensive tasks.

If you only consider "8 performance cores, 2.1 GHz Base" vs "4 cores, 4.0 GHz base" and you also discount the 'efficiency cores', a person might think performance had barely changed.


> Sure, it's better.

Which one? The one that's 13th/14th gen and isn't getting recalled?


The N100 that's out now is roughly on par in performance with an i5-6500T, and only a little bit more power-efficient.

A 2013 core. https://www.intel.com/content/www/us/en/products/sku/88183/i...

The price point is far better at least (if we look at new inventory only).


> In 2014 you could get 4 Haswell cores in a 65W TDP, for 410 2014 US dollars.

Is there a typo here? I bought a 4-core Haswell in 2014 for around $200.

It definitely didn't cost anywhere near $410, except maybe for the 6-core HEDT cpu?


I was just looking at the launch MSRP of one of the higher-end hyperthreaded ones, but sure there was a range including $182 for the i5-4460S without hyperthreading, $224 for the i5-4690S, etc.


I can still use my casual desktop PC with a dual-core i7 for web browsing and video streaming without much noticeable degradation. I use that PC to play StarCraft 2 on medium graphics. I can do light web, Java/Kotlin dev on it, but I wouldn't try doing any kind of mobile dev.


The only reason I no longer use my craptops from 2010 is that they died.

For the kind of stuff I do at home outside work, even programming, they were perfectly fine, and had they not died, I would still be using them.


Great article. It really drives home what a damn shame Intel's persistent mishandling of AVX512 has been ever since its introduction. I don't even know if it has a future outside of extremely niche libraries given how scattered hardware support for it is on Intel's side.

On a completely different topic: I wasn't expecting the portion of the article redacted due to AMD's embargo to take away too much, but the first half (up until the discussion about AVX512) would clearly be much more interesting with the censored-out parts. I guess someone will have to resubmit this come August 14th!


Mishandling aside, the issue I've seen is there really isn't consumer demand for this. Prior to AMD having AVX512, most of the comments were around wasting the silicon on SIMD, rather than improving other aspects of the CPU. I'm pretty sure there was good reason to think it was largely a dark area of the chip.

From what I've seen, but haven't heard discussed much, the naive implementation vs. AVX512 is a huge gain, but AVX2 vs. AVX512 was not very impressive for the application I was looking at. The complexity this code added, and the cases where we needed it to run on AMD (for other reasons), basically made taking advantage of the feature undesirable for a single-digit gain.

Things like VNNI or AMX are better wins, but they are only needed in very specific cases. VNNI in particular looked to be a 30% improvement in a BERT workload.


Isn't it a bit weird to expect consumer demand for CPU instruction set extensions?

Obviously there's very little of that, but what should matter is the developer uptake and thus better end-user experience that can be delivered? (I'd also hope for even better autovectorization in compilers.)

It's in my opinion kind-of insane that we're still building so much software for ancient baselines and leaving quite a bit of performance on the table across the entire system. (How much has Apple won in terms of performance by forcing everyone to build for new ARM targets using new toolchains?)


Consumers want faster processing; the instructions are just the method to get there. And they aren't necessarily the best method, since the area dedicated to the instructions could be used for something else.

It is insane, especially if you think emulation is performant enough to allow for a switch.


The most interesting bit of this article for me is the "transition time" to get the power needed to use AVX-256 or AVX-512, which is present on Intel but not AMD Zen 4/Zen 5. It explains some behavior that I saw years ago when implementing kTLS on FreeBSD, and validates our design of having per-core kTLS crypto worker threads, rather than doing the crypto in the context of sosend() or sendfile's tcp_usr_ready().


Any chance you’d move to using Intel for your content servers in the foreseeable future?

Or this further cements the use of AMD?


It doesn't matter so much anymore, since we use kTLS offload NICs.


Do you only offload aes-gcm or do the NICs handle chacha20-poly1305 too?


AMD's avx512 implementation is just lovely and they seem to be firing on all cylinders for it. Zen4 was already great, 'double pumped' or no.

It looks like Zen5's support is essentially the dream - all EUs and load/store are expanded to 512 bits, so you can sustain 2x 512-bit FMAs and 2x 512-bit adds every cycle. There also appears to be essentially no transition penalty to the full-power state, which is incredible.

The only thing sad here is that all this work to enable full-width AVX512 is going to be mostly wasted, as approximately 0% of all client software will get recompiled to an AVX512 baseline for decades, if ever. But if you can compile for your own targets, or JIT for it... it looks really good.


> The only thing sad here is that all this work to enable full-width AVX512 is going to be mostly wasted as approximately 0% of all client software will get recompiled to an AVX512 baseline for decades if ever.

Well, the other thing is that for workloads where you're cramming 512b vectors through 2xFMAs every cycle -- there's a good chance you can (and have been) just buying GPUs to handle that problem. So, I think that space has been eaten up a bit in recent times.

I don't think it will be decades of waiting though. AVX2 is a practical baseline today IMO and Haswell is what, barely 10 years old? Intel dragged their feet like crazy of course, but "decades" from now is a bit much. And AVX-512's best feature -- its much more coherent and regular design -- means a lot of vectorization opportunities might be easier to exploit, even automatically (e.g. universal masking and gather/scatter make loop optimizations more straightforward). We'll have to see how it shakes out.


The GPUs that can do FP64 operations are priced out of the range acceptable for small businesses or individuals.

The consumer GPUs are suitable only for games, graphics and ML/AI.

There are also other applications, like in engineering, where the only cost-effective way is to use CPUs with good AVX-512 support, like Zen 5.

A 9950X has FP64 throughput similar to that of the last GPUs that still had acceptable prices, from 5 years ago (Radeon VII).

Even for FP32, the throughput of a 9950X is similar to that of a very good integrated GPU (512 FP32 FMA per cycle, but at double the clock frequency, so equivalent to a GPU doing 1024 FP32 FMA per cycle), even if it is no match for a discrete GPU.
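(For the arithmetic, assuming 2 512-bit FMA pipes per core: 2 pipes x 16 FP32 lanes x 16 cores = 512 FMA per cycle.)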

There are also applications where the latency of transferring the data to the GPU, then doing only short computations, can reduce the performance below what can be achieved on the CPU.

Obviously, there are things better done on a GPU, but there are enough cases where a high throughput CPU like a desktop Zen 5 is better.


Some people in HPC groups are working on better toolchains for that. Guix/Nix may (I can't say how far they've got) be able to describe very precisely how you want to recompile your whole stack to leverage your CPU features as much as possible.


I haven't paid attention for the past decade, do modern C/C++ compilers generate any decent AVX512 code if told to do so? Or do you still need to do it by hand via intrinsics or similar?


The short answer is "Yes, sometimes".

Clever hand-written SIMD code is still consistently better, sometimes dramatically better. But generally speaking, I've found Clang to be pretty good at auto-vectorizing code when I've "cleared the path" so to speak, organizing the data access in ways that are SIMD friendly.
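For example, something in this shape (contiguous accesses, no aliasing, no loop-carried dependency) is what I mean by clearing the path -- a generic sketch with a made-up name, and whether it actually vectorizes depends on flags like -O3 and -march (check with -Rpass=loop-vectorize on Clang):

  // y[i] = a*x[i] + y[i]; __restrict tells the compiler the arrays
  // don't alias, which is often the difference between vector and scalar code.
  void saxpy(float a, const float* __restrict x, float* __restrict y, int n) {
    for (int i = 0; i < n; ++i) {
      y[i] = a * x[i] + y[i];
    }
  }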

On the Windows platform, in my experience MSVC is a disaster in terms of auto-vectorization. I haven't been able to get it to consistently vectorize anything above toy examples.


Those aren't the only two options. You can use libraries that are made to take advantage of SIMD and you can use ISPC which is specifically about SIMD programming.


Those languages are too SIMD-hostile.


Besides C#, what languages do you think are not SIMD-hostile?


Languages where the semantics have considered parallelism and this kind of optimization. ISPC, OpenCL, Chapel, Futhark, etc.


Thanks. With ISPC and OpenCL it's a given... I was thinking more of general-purpose programming languages where it is easy to exploit CPU-side SIMD.


Maybe the CPython JIT can make use of it. That might move the needle past 1%...


Seeing 2 consumer CPU generations in a row not only support but improve AVX512 capabilities will hopefully go a long way towards regaining the confidence of the developers that use AVX512 in the consumer space. I know I personally have been holding back as I watched Intel fumble AVX512 for the last 10 years. With their even more recent fumbles, there could be a near future where AMD CPUs have the majority market share in both desktop and mobile. Great news for developers that can use AVX512.


Agreed. My current best-case-scenario hope is that the success of the Zen 4/5/etc. processors will force Intel to adapt their strategy towards AMD's, and finally move us out of the AVX512 mess they've segmented us into.


Assuming Intel is changing direction right now, unfortunately they will face 2-3 years of latency to implement that.


AMD fixed vpcompressd in Zen5:

> Hazards Fixed:

> V(P)COMPRESS store to memory is fixed. (3 cycles/store to non-overlapping addresses)

> The super-alignment hazard is fixed.

I initially tried searching for the string, but the () thwarted that.

It used to be 142 cycles/instruction for "vpcompressd [mem]{k}, zmm" in zen4.
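(For reference, the operation in question in intrinsic form -- a minimal sketch with a made-up name, not code from the article:)

  #include <immintrin.h>

  // vpcompressd to memory: store only the dwords of 'v' whose mask bit is
  // set, packed contiguously at 'dst'. Returns how many were written.
  int compress_store(int* dst, __mmask16 k, __m512i v) {
    _mm512_mask_compressstoreu_epi32(dst, k, v);
    return _mm_popcnt_u32(k);
  }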


Mystical (the author) does such fantastic work for the CS community. I really like that guy. A compilation of his stack overflow answers on SIMD would be better than any available book.

His intelligence and openness, despite no one paying him for it, shines such a bad light on the terrible state of academia. That he was considered a "bad student" is near-proof in and of itself that our system judges people catastrophically poorly.


Looks like he got a master's degree from UIUC and did some research on FFT implementations. Seems to have been successful. What makes you say he was 'considered a "bad student"'?

(This is a genuine question. I've never met Alex in person, but if an applicant to my lab spent their free time diving into SIMD implementations and breaking records for computing mathematical constants, I'd rush to hire them. Not that either of those two things is a requirement, of course.)


It's in his bio.

"However, ever since grade school, I've always sucked in terms of grades and standardized tests. I graduated from Palo Alto High School in the bottom quartile among all the college-bound students. My GPA was barely a 3.0 at graduation, so it was somewhat miraculous that I got accepted into Northwestern University at all."[1]

I know he has a masters and all, but he is spectacular. Hundreds of thousands of people have masters degrees. He is more impressive than 99% of professors, and academia doesn't even acknowledge him as a peer of them.

[1] http://www.numberworld.org/about/ayee/


The fact that he got into Northwestern is proof in itself that the “system” didn’t consider him to be a bad student. It actually seems like the system did a pretty good job of identifying sheer intellectual horsepower and potential despite the self-professed low GPA and standardized test scores.


In the grand hierarchy of college admissions committees ranking people, "getting into northwestern" means roughly "the 20,000th best student in his class year."


and that was a "catastrophic" outcome? When I think of "catastrophic", it would be something like ending up institutionalized, or dead, not ranked in the top 1% of college applicants.


I'll put it this way. He clearly loves teaching. He's better at it than almost any teacher I ever had. He's clearly very intelligent. He loves doing work that interests him, as opposed to lucrative professional tasks.

I'm not in his head, but it sounds like his ideal job would be professor.

Let's presume that second paragraph is true. He doesn't have that job, let alone have it at a top university. He hasn't been given a PhD (a de facto minimum requirement for that job). UIUC is great, but in a tier below other schools, who presumably didn't admit him (or made him pay too much).

But saying he's "ranked in the top 1%" is very disingenuous to this conversation. The 99th-percentile basketball player doesn't make a college team. And I'm saying he belongs on the NBA All-Star team. And yes, it would be a catastrophically bad process if Anthony Davis couldn't have made a college team. But the difference between the 99th percentile and the 99.9999th percentile is enormous, and mistaking one for the other in any context is terrible.


The part of the article I found most amusing:

"Intel added AVX512-VP2INTERSECT to Tiger Lake. But it was really slow. (microcoded ~25 cycles/46 uops) It was so slow that someone found a better way to implement its functionality without using the instruction itself. Intel deprecates the instruction and removes it from all processors after Tiger Lake. (ignoring the fact that early Alder Lake unofficially also had it) AMD adds it to Zen5. So just as Intel kills off VP2INTERSECT, AMD shows up with it. Needless to say, Zen5 had probably already taped out by the time Intel deprecated the instruction. So VP2INTERSECT made it into Zen5's design and wasn't going to be removed.

But how good is AMD's implementation? Let's look at AIDA64's dumps for Granite Ridge:

AVX512_VP2INTERSECT :VP2INTERSECTQ k1+1, zmm, zmm L: [diff. reg. set] T: 0.23ns= 1.00c

Yes, that's right. 1 cycle throughput. ONE cycle. I can't... I just can't...

Intel was so bad at this that they dropped the instruction. And now AMD finally appears and shows them how it's done - 2 years too late."


It's in fact very common to microcode instructions at early iterations of an arch. https://uops.info/table.html is a nice place if certain instructions being slow brings joy to your life.


I want to add a fact about which the author was not aware.

While, as he says, Intel deprecated VP2INTERSECT after Tiger Lake, they have since changed their mind and added it again in the server CPU Granite Rapids, which will be launched in a few months.

Moreover, the ISA of Granite Rapids is considered to be AVX10.1 and all its instructions, including VP2INTERSECT, will be a mandatory part of the ISA of all future Intel CPUs from 2026 on.

Therefore it is good that AMD has achieved an excellent implementation of VP2INTERSECT, which they will be able to carry into their future designs.

It remains to be seen whether the Intel Granite Rapids implementation of VP2INTERSECT is also good.


Something I've wondered about is why CPUs have separate scalar and vector ALUs. Would it be possible to simply have 16x scalar ALUs that can be used either individually or ganged together in groups of 4, 8, or 16 to execute the various vector instructions?


Setting aside the complexity of "detachable" ALU clusters, the bigger problem is that essentially, we can currently drive all those ALUs _because_ they're SIMD and executing the same instruction.

For example, if you could un-gang even a single Zen5 ALU and do 16 independent scalar FMAs at once, you now need the front-end of the processor to decode and issue 16 instructions/cycle for each 1 instruction that it currently does! That's hopelessly wider than the front-end of the CPU can process. It needs to decode the instructions and schedule them, and both of those are complex operations that can be power hungry / slow (not to mention the instruction fetch bandwidth is now through the roof!).

SIMD bypasses that by doing lots of operations with a single instruction. It would be extremely difficult to achieve with just scalar instructions.


Intel has shared ports across scalar and SIMD for a long time (only recently have they split them apart). I don't know whether they share silicon though; quite likely they don't.

The complexity of using parts of vector ALUs might not be worth the saved silicon (even for something as expensive as float division, which has both SIMD and scalar versions (both operate in the same register file, just differing in the number of elements processed), both Intel and AMD have the same throughput for the scalar version and 128-bit SIMD version[0]).

It additionally allows separating bypass networks, e.g. the output of a SIMD add will never be directly used as a memory address (or, in the case of a gather, an extra cycle or two of latency is acceptable considering the amount of loads that will follow and in general the throughput-centric nature of SIMD), whereas for a scalar op that's an extremely important path. Or, in general, it allows making different tradeoffs between scalar & SIMD.

[0]: https://uops.info/table.html?search=div%20xmm)&cb_lat=on&cb_...


Good question. I imagine that for instruction encoding density one could for example encode instructions so that "add4 ra0, rb4" would do a 4-wide addition ra0+rb4 up to ra3+rb7.

Possibly with restrictions like only "aligned" register access, ie an 8-way op can only use regs 0-7 or 8-15.

Swizzle instructions can then handle reordering ala SSE.

Don't forget about 2-way ops as well.


I was just thinking of the back end ALUs but you make a good point: it might also make sense to unify scalar and vector instructions. If the right encoding is used it might allow scalar code to retain its efficiencies.


For unified scalar and vector instructions, MRISC32 might be interesting: https://mrisc32.bitsnbites.eu/. Though it still has separate vector & scalar register files, and I don't think it's intended for the type of thing discussed above.


For "just" the back-end I was thinking the instruction count overhead would outweigh the gain. That is you'd get less effective instruction cache and require far more decoding units from all the instructions required to fill all those ALUs.

So yea, that's why I was imagining an ISA where you'd have instructions that were 1, 2, 4 etc wide, encoded in the instruction. You'd only need two bits for up to 2^4=16 wide instructions.

Of course could be I am terribly mistaken, I'm by no means an expert in this.


When a team at IBM coined the word "superscalar" in 1987, they wrote a widely cited research paper in which they argued that a "superscalar" CPU is better than a "vector" CPU ("vector" CPUs were well known at that time and had been used for more than a decade, especially in supercomputers), and that therefore all vector CPUs should be replaced with superior "superscalar" CPUs.

Their theory was in essence that when you have 16x scalar ALUs it is always better to be able to use them individually, instead of having them used by a single instruction.

If you have the hardware that allows using the 16 scalar ALUs individually, by 16 independent instructions, that is obviously much more powerful. It can handle all the cases that can be handled by SIMD, but also many other cases.

The IBM research paper was extremely influential. Before it, the target for most CPU design teams was to make a pipelined CPU able to execute 1 instruction per clock cycle; after it, everybody switched to attempting to design CPUs with as high an IPC as possible. Even Intel, who prioritized CPU production cost over CPU performance, evolved through the 80486 (1989), Pentium (1993) and Pentium Pro (1995), eventually reaching the stage of having a high-performance superscalar CPU.

Nevertheless, around 1995 the initial hype about superscalar CPUs began to dissipate. It became understood that even if a superscalar CPU is always faster than a vector CPU with the same number of ALUs, the cost of the superscalar CPU increases super-linearly with the number of ALUs. For high enough counts, doubling the number of ALUs in a superscalar CPU increases both the area and the power consumption by a factor much greater than 2.

The result of understanding the limitations of superscalar CPUs was a small step back, resurrecting vector execution, but in combination with superscalar execution. Now all modern CPUs use this combination instead of only one of the two design variants.

For instance, Zen 5 contains 32 FP64 adders. However, it can do only 4 independent FP64 operations simultaneously.

Therefore it can do 4 scalar FP64 additions per cycle. Or it can gang pairs of ALUs and it can do 4 additions of length-2 FP64 vectors per cycle. Or it can gang groups of 4 ALUs and it can do 4 additions of length-4 FP64 vectors per cycle. Or it can gang groups of 8 ALUs and it can do 4 additions of length-8 FP64 vectors per cycle.

Only in the last case all the 32 existing FP64 adders are used.

The same is true in all modern CPUs. All have 2 limits: one is the total number of ALUs, and the other is the number of simultaneous independent operations, a.k.a. the number of execution ports.

The number of execution ports limits the number of independent scalar operations. The total number of ALUs limits the number of equivalent scalar operations when they are not independent, but the ALUs are ganged for vector operations.


What an excellent writeup, thx for sharing


Does Zen 5 have cldemote or senduipi ?


Probably not.

The gcc Zen 5 patch adds as new instructions only AVX-VNNI, MOVDIRI, MOVDIR64B, AVX-512-VP2INTERSECT and PREFETCHI.

Zen 4 already had AVX-512 VNNI (for ML/AI); AVX-VNNI is only an alternate encoding for programs that use only AVX (because they have been compiled for Intel Alder Lake/Raptor Lake).

MOVDIRI and MOVDIR64B can be very useful in some device drivers, less often in general-purpose programs.


out of curiosity, what applications might I see this used for?


Ugh, half of this article is

> This section has been redacted until August 14.

Could you repost it then?


I think you need to wait one year before being able to post an exact link again. You can ask dang to reset it if you want, but maybe take some initiative yourself rather than giving other people attitude and then asking them to do you favors?



