Hacker News

For a fully compute-bound workload, you're certainly correct.

That's rare, though. All it takes is a couple of stalls waiting on memory, during which a second thread can make progress, for that "ideal speedup" to become decidedly nonzero.




Regardless, though, why would its potentially being higher in newer architectures be viewed as a good thing?


Because most code is not running anywhere near saturation of the available resources, and the problem is only getting worse as cores get wider. I mean, look at the Zen5 block diagram - there are 6 ALUs and 4 AGUs on the integer side alone! That's almost two entire Zen1 cores worth of execution resources, which is pretty amazing. Very, very little real world code is going to be able to get anywhere near saturating that every cycle. SMT helps improve the utilization of the execution resources that have already been paid for in the core.

I'll give another example from my own experience. I write a lot of code in the computer graphics domain. Some of the more numeric-oriented routines are able to saturate the execution resources, and get approximately 0% speedup from SMT.

Importantly though, there are other routines that make heavy use of lookup tables. Even though the tables reside completely within L1 cache, there are some really long dependency chains where the 3-4 cycle wait for each L1 load stacks up and causes really low utilization of the ALUs. Or at least, that's my theory. :) Regardless, in that code, running with SMT provides about a 30% speedup "for free", which is quite impressive.

For a while I was uncertain whether SMT had a future, but I think for x86 in general it provides some pretty significant gains, for a design complexity that has already been 'paid' for.


With the continuous improvement of out-of-order execution, the SMT gains have been diminishing from Zen 1 to Zen 4.

However, you are right that Zen 5, like Intel's Lion Cove core, has a jump in the number of available execution resources, and it is likely that out-of-order execution alone will not be enough to keep them busy.

This may lead to a higher SMT gain on Zen 5, perhaps around 30% on average (up from typically under 20% with Zen 3 or Zen 4), as in the Intel presentation comparing a Lion Cove without SMT against a Lion Cove with SMT. In the context of a hybrid CPU, where MT performance can be provided more cheaply by efficient cores than by SMT, Intel has chosen to omit SMT for better PPA (performance per power and area); but in their future server CPUs with big cores, they will use cores with SMT (and with wider SIMD execution units, to increase energy efficiency).


> Regardless, though, why would its potentially being higher in newer architectures be viewed as a good thing?

Because SMT getting faster is a nearly free side effect. We didn't add extra units to speed up SMT at the cost of single-thread speed. We added extra units to speed up the single thread, and they just happened to speed up SMT even more (at least for the purposes of this theoretical exercise). That's better than speeding up SMT by the same percentage, or not speeding up SMT at all.

Imagine if I took a CPU and just made SMT slower, no other changes. That would be a bad thing even though it gets the speedup closer to 0%, right? And then if I undo that it's a good thing, right?


This doesn't seem to reflect the reality of how hardware is actually being added to the cores: Zen5's extra ALUs are only worth low-single-digit gains to a single thread.

https://old.reddit.com/r/hardware/comments/1ee7o1d/the_amd_r...

This isn't adding more units to speed up a single thread with SMT as a nice "incidental" gain; this is actively targeting wide architectures that have lots of pipeline bubbles for SMT to fill.

And that's fine, but it's also a very different dynamic from, e.g., Apple silicon, where the focus is running super-deep speculation and reordering on a single thread.


What do you think they should add instead of another ALU?

The returns are diminishing, but a single-digit bump is still pretty good when your goal is faster single threads.

Adding nothing to save space is obviously not the answer, because that leads to having more but slower cores.

Also, the topic of the post, the 2-ahead branch predictor, exists specifically to get more ALUs running in parallel!

> apple silicon, where the focus is running super deep speculation

And to make that speculation profitable, the M1 cores have 6 ALUs among other wideness.



