It's interesting to see that modern processor optimization still revolves around balancing hardware for specific tasks. In this case, the vector scheduler has been separated from the integer scheduler, and the integer pipeline has been made much wider. I'm sure it made sense for this revision, but I wonder if things will change in a few generations and the pendulum will swing back to simplifying and integrating more parts of the arithmetic scheduler(s) and ALUs.
It's also interesting to see that FPGA integration hasn't gone far, and good vector performance is still important (if less important than integer). I wonder what percentage of consumer and professional workloads make significant use of vector operations, and how much GPU and FPGA offload would alleviate the need for good vector performance. I only know of vector operations in the context of multimedia processing, which is also suited for GPU acceleration.
> good vector performance is still important (if less important than integer)
This is in part (major part IMHO) because few languages support vector operations as first class operators. We are still trapped in the tyranny of assuming a C abstract machine.
And so, because so few languages support vectors, the instruction mix doesn't emphasize them; therefore there's less incentive to work on new language paradigms, and we remain trapped in a suboptimal loop.
I’m not claiming there are any villains here, we’re just stuck in a hill-climbing failure.
It's not obvious that that's what's happened here. E.g. vector scheduling is separated, but there are more units for actually doing certain vector operations. It may be that lots of vector workloads are more limited by memory bandwidth than ILP, so adding another port to the scheduler mightn't add much. Being able to run other parts of the CPU faster when vectorised instructions aren't being used could be worth a lot.
I’m not saying that you cannot write vector code, but that it’s typically a special case. CUDA APIs and annotations are bolted onto existing languages rather than reflecting languages with vector operations as natural first class operations.
C or Java have no concept of `a + b` being a vector operation the way a language like, say, APL does. You can come closer in C++, but in the end the memory model of C and C++ hobbles you. FORTRAN is better in this regard.
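To make that concrete, here's roughly what "coming closer in C++" can look like: a small, purely illustrative sketch using std::valarray, where `a + b` really is an element-wise operation (the function and its name are made up).

```cpp
// Illustrative only: std::valarray is about as close as standard C++ gets to
// APL-style "a + b is a vector operation".
#include <valarray>

std::valarray<double> axpy(double alpha,
                           const std::valarray<double>& x,
                           const std::valarray<double>& y) {
    // Element-wise multiply-add with no explicit loop; the compiler may vectorize it,
    // but nothing in the language guarantees SIMD code generation.
    return alpha * x + y;
}
```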
It is always possible to inline assembler in C, and present vector operators as functions in a library.
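As a rough sketch of that library approach (using AVX intrinsics rather than inline asm; the function name and the scalar fallback are just for illustration):

```cpp
// Hypothetical "vector operator as a library function": callers only see add_f32().
#include <immintrin.h>
#include <cstddef>

void add_f32(float* dst, const float* a, const float* b, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX__)
    for (; i + 8 <= n; i += 8) {                         // 8 floats per 256-bit register
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
#endif
    for (; i < n; ++i) dst[i] = a[i] + b[i];             // scalar tail / non-AVX fallback
}
```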
Otherwise, R does understand vectors, so another language that performs well might be a better choice. Julia comes to mind, but I have little familiarity with it.
With Java, linking native code via JNI would be an (ugly) option.
When the data is generated on the CPU, shoveling it to the GPU to do possibly just one or a few vector operations and then shoveling it back to the CPU to continue is most likely going to be more expensive than the time saved.
No - a CUDA program consists of parts that run on the CPU as well as on the GPU, but the CPU (aka host) code is just orchestrating the process - allocating memory, copying data to/from the GPU, and queuing CUDA kernels to run on the GPU. All the work (i.e. running kernels) is done on the GPU.
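A minimal CUDA C++ sketch of that host/device split might look like the following (the kernel and buffer names are made up for illustration; it needs nvcc and a CUDA-capable GPU):

```cpp
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

// All the real work happens here, on the GPU.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    std::vector<float> host(n, 1.0f);

    float* dev = nullptr;
    cudaMalloc((void**)&dev, n * sizeof(float));                              // host: allocate device memory
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);  // host: push data to the GPU
    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);                            // host: queue a kernel
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);  // host: pull results back
    cudaFree(dev);

    std::printf("%f\n", host[0]);
}
```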
There are other libraries (e.g. OpenMP, Intel's oneAPI) and languages (e.g. SYCL) that do let the same code be run on either CPU or GPU.
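For example, with OpenMP target offload the same loop can run on the host or be offloaded to a GPU depending on how it is compiled (a hedged sketch; the function is hypothetical):

```cpp
#include <cstddef>
#include <vector>

// Without -fopenmp (or without an offload-capable compiler) the pragma is ignored
// and this is an ordinary CPU loop; with offload enabled it runs on the GPU.
void scale(std::vector<float>& v, float factor) {
    float* p = v.data();
    std::size_t n = v.size();
    #pragma omp target teams distribute parallel for map(tofrom: p[0:n])
    for (std::size_t i = 0; i < n; ++i)
        p[i] *= factor;
}
```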
When you use a GPU, you are using a different processor with a different ISA, running its own barebones OS, with which you communicate mostly by pushing large blocks of memory through the PCIe bus. It’s a very different feel from, say, adding AVX512 instructions to your program flow.
The CPU vector performance is important for throughput-oriented processing of data e.g. databases. A powerful vector implementation gives you most of the benefits of an FPGA for a tiny fraction of the effort but has fewer limitations than a GPU. This hits a price-performance sweet spot for a lot of workloads and the CPU companies have been increasingly making this a first-class "every day" feature of their processors.
AMD tried that with HSA in the past; it doesn't really work. Unless your CPU can magically offload vector processing to the GPU or another sub-processor, you are still reliant on new code to get this working, which means you break backward compatibility with previously compiled code.
The best case scenario here is if you can have the compiler do all the heavy lifting but more realistically you’ll end up having to make developers switch to a whole new programming paradigm.
I understand that you can't convince developers to rewrite/recompile their applications for a processor that breaks compatibility. I'm wondering how many existing applications would be negatively impacted by cutting down vector throughput. With some searching, I see that some applications, like Firefox, make mild use of it. However, there are applications that would be negatively affected, such as noise suppression in Microsoft Teams, and crypto acceleration in libssl and the Linux kernel. Acceleration of crypto functions seems essential enough to warrant not touching vector throughput, so it seems vector operations are here to stay in CPUs.
Sure; but it’s hard to do and very few programs get optimised to this point. Before reaching for vector instructions, I’ll:
- Benchmark, and verify that the code is hot.
- Rewrite from Python, Ruby, JS into a systems language (if necessary). Honorary mention for C# / Go / Java, which are often fast enough.
- Change to better data structures. Bad data structure choices are still so common.
- Reduce heap allocations. They're more expensive than you think, especially when you take into account the effect on the CPU cache (see the sketch below).
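For the allocation bullet above, a tiny illustrative C++ sketch of the kind of change meant (names are made up):

```cpp
#include <vector>

// Many small reallocations vs. one up-front reservation: same result,
// very different allocation and cache behaviour.
std::vector<int> squares_naive(int n) {
    std::vector<int> out;                 // grows geometrically: repeated realloc + copy
    for (int i = 0; i < n; ++i) out.push_back(i * i);
    return out;
}

std::vector<int> squares_reserved(int n) {
    std::vector<int> out;
    out.reserve(n);                       // one allocation, no reallocation churn
    for (int i = 0; i < n; ++i) out.push_back(i * i);
    return out;
}
```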
Do those things well, and you can often get 3 or more orders of magnitude improved performance. At that point, is it worth reaching for SIMD intrinsics? Maybe. But I just haven’t written many programs where fast code written in a fast language (c, rust, etc) still wasn’t fast enough.
I think it would be different if languages like Rust had a high-level wrapper around SIMD that gave you similar performance to hand-written SIMD. But right now, SIMD is horrible to use and debug. And you usually need to write it per architecture. Even Intel and AMD need different code paths, because Intel has dropped AVX-512 from its consumer parts.
Outside generic tools like Unicode validation, JSON parsing and video decoding, I doubt modern SIMD gets much use. LLVM does what it can, but…
Indeed, people really fixate on “slow languages”, but for all but the most demanding applications, the right algorithm and data structures make the lion's share of the difference.
Reaching for SIMD intrinsics or an abstraction has been historically quite painful in C and C++. But cross-platform SIMD abstractions in C#, Swift and Mojo are changing the picture. You can write a vectorized algorithm in C# and practically not lose performance versus hand-intrinsified C, and CoreLib heavily relies on that.
Newer SoCs come with co-processors such as NPUs so it’s just a question of how long it would take for those workloads to move there.
And this would highly depend on how ubiquitous they’ll become and how standardized the APIs will be so you won’t have to target IHV specific hardware through their own libraries all the time.
Basically we need a DirectX equivalent for general purpose accelerated compute.
It's a lot more work to push data to a GPU or NPU than to just do a couple of vector ops. Crypto is important enough that many architectures have hardware accelerators just for it.
For servers no, but we’re talking about endpoints here. Also this isn’t only about reducing the existing vector bandwidth but also about not increasing it outside of dedicated co-processors.
I think the answer here is dedicated cores of different types on the same die.
Some cores will be high-performance, OoO CPU cores.
Now you make another core with the same ISA, but built for a different workload. It should be in-order. It should have a narrow ALU with fairly basic branch prediction. Most of the core will be occupied with two 1024-bit SIMD units and an 8-16x SMT implementation to hide latency across the threads.
If your CPU and/or OS detects that a thread is packed with SIMD instructions, it will move the thread over to the wide, slow core with latency hiding. Normal threads with low SIMD instruction counts will be put through the high-performance CPU core.
I actually think the CPU and GPU meeting at the idea of SIMT would be very apropos. AVX-512/AVX10 has mask registers which work just like CUDA lanes in the sense of allowing lockstep iteration while masking off lanes where it “doesn’t happen” to preserve the illusion of thread individuality. With a mask register, an AVX lane is now a CUDA thread.
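Concretely, that per-lane masking looks something like this with AVX-512 intrinsics (a rough, hypothetical sketch, not anyone's production code):

```cpp
#include <immintrin.h>

// Each bit of the __mmask16 plays the role of one "thread" being active or
// masked off, much like divergent CUDA lanes in a warp.
__m512i add_where_positive(__m512i acc, __m512i x) {
    __mmask16 active = _mm512_cmpgt_epi32_mask(x, _mm512_setzero_si512());  // per-lane "if (x > 0)"
    return _mm512_mask_add_epi32(acc, active, acc, x);                      // masked-off lanes keep acc
}
```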
Obviously there are compromises in terms of bandwidth but it’s also a lot easier to mix into a broader program if you don’t have to send data across the bus, which also gives it other potential use-cases.
But, if you take the CUDA lane idea one step further and add Independent Thread Scheduling, you can also generalize the idea of these lanes having their own “independent” instruction pointer and flow, which means you’re free to reorder and speculate across the whole 1024b window, independently of your warp/execution width.
The optimization problem you solve is now to move all instruction pointers until they hit a threadfence, with the optimized/lowest-total-cost execution. And technically you may not know where that fence is specifically going to be! Things like self-modifying code are another headache that isn't allowed in GPGPU either - there will certainly be some idioms that don't translate well, but I think that stuff is thankfully rare in AVX code.
This is what's happening now with NPUs and other co-processors. It's just not fully OS-managed/directed yet, but Microsoft is most likely working on that part at least.
The key part is that there are now far more use cases than there were in the early Bulldozer days, and that the current main CPU design does not compromise on vector performance like the original AMD design did (outside of extreme cases of very wide vector instructions).
And they are also targeting new use cases such as edge compute AI rather than trying to push the industry to move traditional applications towards GPU compute with HSA.
I've had thoughts along the same lines, but this would require big changes in kernel schedulers, ELF to provide the information, and probably other things.
+1: Heterogeneous/non-uniform core configurations always require a lot of very complex adjustment to the kernel schedulers and core-binding policies. Even now, after almost a decade of big.LITTLE configurations (from Arm) and/or chiplet designs (from AMD), the (Linux) kernel scheduling still requires a lot of tuning for things like games etc...
Adding cores with very different performance characteristics would probably require the thread scheduling to be delegated to the CPU itself, with only hints from the kernel scheduler.
Static analysis would probably work in this case because the in-order core would be very GPU-like while the other core would not.
In cases where performance characteristics are closer, the OS could switch cores, monitor the runtimes, and add metadata about which core worked best (potentially even about which core worked best at which times).
Persuading people to write their C++ as a graph for heterogeneous execution hasn't gone well. The machinery works though, and it's the right thing for heterogeneous compute, so it should see adoption from XLA / PyTorch etc.
As CPU cores get larger and larger, it makes sense to always keep looking for opportunities to decouple things. AMD went with separate schedulers in the Athlon three architectural overhauls ago and hasn't reversed that decision.
Removing a core's SMT aka "hyperthreading" brings some modest hardware savings, but the biggest cost of SMT is that it makes testing and validation much more complicated. Given the existence of Intel's e-cores, I'm not surprised they're getting rid of it.
From Intel's perspective, I doubt that's true, when you take into consideration the constant stream of side-channel vulnerabilities they have had to deal with.
You can always do like Sun and IBM and dilute the side channel among so many other threads that it becomes unreliable. IIRC, both POWER10 and SPARC do 8 threads per core.
Lion Cove doesn't remove hyper-threading in general. Some variants will have it so Intel still has to validate it (eventually).
But the first part shipping, Lunar Lake, removes hyper-threading. It may have saved them validation time for this particular part but they'll still have to do it.
Intel has said that they have designed two different versions of the Lion Cove core.
One with SMT, for server CPUs, and one in which all circuits related to SMT are completely removed, for smaller area and power consumption, to be used in hybrid CPUs for laptops and desktops, together with Skymont cores.
I think you’re saying that testing smt makes it expensive, which sounds mostly right to me, though I can imagine some arguments that it isn’t much worse. When I first read your comment, I thought it said the opposite – that removing smt requires expensive testing and validation work.
The dropping of hyperthreading is interesting. I'm hoping x64 will trend in the other direction - I'd like four or eight threads scheduled across a pool of execution units as a way of hiding memory latency. Sort of a step towards the GPU model.
This has not been successful historically because software developers struggle to design code that has mechanical sympathy with these architectural models. These types of processors will run idiomatic CPU code with mediocre efficiency, which provides an element of backward compatibility, but achieving their theoretical efficiency requires barrel-processing style data structures and algorithms, which isn't something widely taught or known to software developers.
I like these architectures but don't expect it to happen. Every attempt to produce a CPU that works this way has done poorly because software developers don't know how to use them properly. The hardware companies have done the "build it and they will come" thing multiple times and it has never panned out for them. You would need a killer app that would strongly incentivize software developers to learn how design efficient code for these architectures.
I think another huge issue is the combination of general-purpose/multi-tenant computing, especially with virtualization in the mix, and cache. Improvements gained from masking memory latency by issuing instructions from more thread execution contexts are suddenly lost when each of those thread contexts is accessing a totally different part of memory and evicting the cache line that the next thread wanted anyway.
In many ways barrel processing and hyperthreading are remarkably similar to VLIW - they're two very different sides of the same coin. At the end of the day, both work well for situations where multiple simultaneous instruction streams need to access relatively small amounts of adjacent data. Realtime signal processing, dedicated databases, and gaming are obvious applications here, which I think is why VLIW has done so well in DSP and hyperthreading did well in gaming and databases. Once the simultaneous instruction streams are accessing completely disparate parts of main memory, it all blows up.
Plus, in a multi-tenant environment, the security issues inherent to sharing even more execution state context (and therefore side channels) between domains also become untenable fairly quickly.
Hyperthreading is also useful for simply making a little progress on more threads. Even if you're not pinning a core by making full use of two hyperthreads, you can handle double the threads at once. Now, I don't know how important it is, but I assume that for desktop applications, this could lead to a snappier experience. Most code on the desktop is just waiting for memory loads and context switching.
Of course, the big elephant in the room is security - timing attacks on shared cores is a big problem. Sharing anything is a big problem for security conscious customers.
Maybe it's the case of the server leading the client here.
Just this morning I benchmarked my 7960X on RandomX (i.e. Monero mining). That's a 24-core CPU. With 24 threads, it gets about 10 kH/s and uses about 100 watts extra. With 48 threads, about 15 kH/s and 150 watts. That does make sense given the nature of RandomX.
Another benchmark I've done is .onion vanity address mining. Here it's about a 20% improvement in total throughput when using hyperthreading. It's definitely not useless.
However, I didn't compare to a scenario with hyperthreading disabled in the BIOS. Are you telling me the threads get 20-50% faster, each, with it disabled?
I've been disabling HT for security reasons on all my machines ever since the first vulnerabilities appeared, with no discernible negative impact on daily usage.
I'm running a quite old CPU currently, a 6-core Haswell-E i7-5930K. Disabling HT gave me a huge boost in gaming workloads. I'm basically never doing anything that pegs the entire CPU so getting that extra 10-15% IPC for single threaded tasks is huge.
And like you said, the vulnerability consideration makes HT a hard sell for workloads that would benefit (hypervisors).
That i7 is a huge downgrade from what I was running before (long story) so I'm looking forward to Arrow Lake and I like everything I've read. In addition to removing HT, they're supporting Native JEDEC DDR5-6400 Memory so XMP won't be necessary. I've never liked XMP/Expo...
To summarize Intel's reasoning, the extra hardware (and I guess associated firmware or whatever it'd be called) required to manage hyperthreading in their P-cores takes up package space and power, meaning the core can't boost as high.
And since hyperthreading only increases IPC by 30% (according to Intel), they're planning on making up the loss of threads with more E-cores.
But we'll have to see how that turns out, especially since Intel's first chiplet design (the Core Ultra series 1) had performance degradations compared to its 13th Gen mobile counterparts.
I'm very interested to see independent testing of cores without SMT/hyperthreading. Of course, it's one less function for the hardware and thread scheduler to worry about. But hyperthreading was a useful way to share resources between multiple threads that had light-to-intermediate workloads. Synthetic benchmarks might show an improvement, but I'm interested to see how everyday workloads, like web browsing while streaming a video, will react.
I was surprised that disabling SMT improved the Geekbench 6 multi-threaded results by a few percent on a Zen 3 (5900X) CPU.
While there are also other tasks where SMT does not bring advantages, for the compilation of a big software project SMT does bring an obvious performance improvement, of about 20% for the same Zen 3 CPU.
In any case, Intel has said that they have designed 2 versions of the Lion Cove core, one without SMT for laptop/desktop hybrid CPUs and one with SMT for server CPUs with P cores (i.e. for the successor of Granite Rapids, which will be launched later this year, using P-cores similar to those of Meteor Lake).
> there is hardly a reason to keep hyperthreading around.
Performance is still a reason. Anecdote: I have a pet project that involves searching for chess puzzles, and hyperthreading improves throughput 22%. Not massive, but definitely not nothing.
There are definitely workloads where turning off SMT improves performance.
SMT is a crutch. If your frontend is advanced enough to take advantage of the majority of your execution ports, SMT adds no value. SMT only adds value when your frontend can't use your execution ports, but at that point, maybe you're better off with two more simple cores anyway.
With Intel having small e-cores, it starts to become cheaper to add a couple e-cores that guarantee improvement than to make the p-core larger.
As always, the answer is “it depends”. If you are getting too many cache misses and are memory bound, adding more threads will not help you much. If you have idling processor backends, with FP, integer, or memory units sitting there doing nothing, adding more threads might extract more performance from the part.
Generally HT/SMT has never been favored for high utilization needs or low wattage needs.
On the high utilization end, stuff like offline rendering or even some realtime games would see significant performance degradation when HT/SMT was enabled. It was incredibly noticeable when I worked in film.
And on the low wattage end, it ends up causing more overhead versus just dumping the jobs on an E core.
It doesn’t because a lot of low wattage silicon doesn’t support HT/SMT anyway.
The difference is that now low wattage doesn’t have to mean low performance, and getting back that performance is better suited to E cores than introducing HT.
Saying "no" doesn't magically remove your contradiction. E cores didn't exist in laptop/PC/server CPUs before 2022 and using HT was a decent way to increase capacity to handle many (e.g. IO) threads without expensive context switches. I'm not saying E cores are a bad solution, but somehow you're trying to erase historical context of HT (or more likely just sloppy writing which you don't want to admit).
No, you haven't explained the contradiction, you just talk over it. Before E cores were a thing, HT was a decent approach to cheaply support more low utilization threads.
Backend-bound workloads that amount to hours of endless multiplication are not that common. For workloads that are just grab-bags of hundreds of unrelated tasks on a machine, which describes the entire "cloud" thing and most internal crud at every company, HT significantly increases the apparent capacity of the machine.
The need for hyperthreading has diminished with increasing core counts and shrinking power headroom. You can just run those tasks on E cores now and save energy.
I don't think the L0 is the "new" cache. It's still 48K like the old L1. The 9-cycle 192K cache, about halfway between the old L1 and L2, is really the new cache level in size and latency. And that's now called L1.
Yes, Apple makes expensive CPUs and has much more cache, one of the reasons why they have been faster and why Intel adds much more caching each generation now.
> The removal of Hyperthreading makes a lot of sense for Lunar Lake to both reduce the die size of the version of Lion Cove found in Lunar Lake along with simplifying the job of Thread Director.
And, you know, stop the security vulnerability bleeding.
I don't think hyperthreading was the bulk of the attack surface. It definitely presented opportunities for processes to get out of bounds, but I think preemptive scheduling is the bulk of the issue. That genie is not going back in the bottle; it's another way to significantly improve processor performance for the same number of instructions.
I think the real problem is cache sharing, and hyperthreading kind of depends on it, so it was only ever secure to run two threads from the same security domain on the same core.
Newbie question: if the cores share an L3 cache, does that factor into the branch prediction vulnerabilities? Or does the data affected by the vulnerability stay in caches closer to the individual core? I assume so, otherwise all cores would be impacted, but I'm unclear on where it sits.
Cheese is talking about different chips (e.g. laptop, desktop, and server) that contain Lion Cove cores. Intel doesn't really reuse chips between different segments.
I'm just wondering, how much of this analysis is real? I mean, how much of this analysis is a true weighing up of design decisions and performance, and how much is copy/pasted Intel marketing bumf?
Nothing real. They cherry-pick for whatever they're trying to sell you this time - the advent of the 2nd core, +HT, E-cores, -HT, you-name-it-next-time. Without an independent benchmark plus additional interpretation for your particular workload, you can safely ignore all their claims.
That’s typically the question with all marketing (probably cherry-picked) benchmarks. Have to wait until independent reviewers get their hands on them.
Now that the floating point vector units have gotten large enough to engage in mitosis, I wonder if there will be room in the future for little SIMD units stuck on the integer side. Bring back SSE, haha.
And the main question: Is it overall better (faster and less power hungry) than a reduced instruction set, powerful laptop ARM based CPU which is around the corner (Qualcomm)? Guessing not...
There's still an insurmountable number of apps that are x86-exclusive, both new and legacy. So the chip not beating the best of ARM in benchmarks is largely irrelevant.
A Ferrari will beat a tractor in every test-bench number and on every track, but I can't plow a field with a Ferrari, so any improvement in tractor technology is still welcome even though tractors will never beat Ferraris.
I hear this argument, but I don't really believe it.
If you're talking about apps from before, say, 2015 (10 years ago), they can be emulated on ARM faster than they ran natively. That rules out 95% of the backward compatibility argument.
Most more recent apps are very portable. They were written in a managed language running on a cross-platform runtime. The source code is likely stored in git so it can be tracked down and recompiled.
Over 15 years of modern smartphones has ensured that most low-level libraries have support for ARM and other ISAs too as being ISA-agnostic has once again become important. Apple's 4 years of transition aren't to be underestimated either. Lots of devs/creatives use ARM machines and have ensured that pretty much all of the biggest pro software runs very well on non-x86 platforms.
Yes, some stuff remains, but I don't think the remaining stuff is as big a deal as some people claim.
> Some CPUs have been specifically designed to have a very small set of instructions—but these designs are very different from classic RISC designs, so they have been given other names such as minimal instruction set computer (MISC) or transport triggered architecture (TTA).
It seems that "RISC" has just become a synonym for "load-store architecture"
Non-embedded POWER implementations are around 1000 opcodes, depending on the features supported, and even MIPS eventually got a square-root instruction.
What are you guessing from? Historically, generation for generation, x86 is good at performance and awful at power consumption. Even when Apple (not aarch64 in general, just Apple) briefly pulled ahead on both, subsequent x86 chips kept winning on raw performance, even as they got destroyed on perf per Watt.
The 13900K lost a couple of % in single-thread performance, which led to the 14900K being so overclocked/overvolted that it became useless for what it's made for - crunching numbers. See https://www.radgametools.com/oodleintel.htm.