Intel's Lion Cove Architecture Preview (chipsandcheese.com)
175 points by zdw 5 months ago | 105 comments



It's interesting to see that modern processor optimization still revolves around balancing hardware for specific tasks. In this case, the vector scheduler has been separated from the integer scheduler, and the integer pipeline has been made much wider. I'm sure it made sense for this revision, but I wonder if things will change in a few generations and the pendulum will swing back to simplifying and integrating more parts of the arithmetic scheduler(s) and ALUs.

It's also interesting to see that FPGA integration hasn't gone far, and good vector performance is still important (if less important than integer). I wonder what percentage of consumer and professional workloads make significant use of vector operations, and how much GPU and FPGA offload would alleviate the need for good vector performance. I only know of vector operations in the context of multimedia processing, which is also suited for GPU acceleration.


> good vector performance is still important (if less important than integer)

This is in part (major part IMHO) because few languages support vector operations as first class operators. We are still trapped in the tyranny that assumes a C abstract machine.

And so because so few languages support vectors, the instruction mix doesn't emphasize them, therefore there's less incentive to work on new language paradigms, and we remain trapped in a suboptimal loop.

I’m not claiming there are any villains here, we’re just stuck in a hill-climbing failure.


It's not obvious that that's what's happened here. E.g. vector scheduling is separated, but there are more units for actually doing certain vector operations. It may be that lots of vector workloads are more limited by memory bandwidth than ILP, so adding another port to the scheduler mightn't add much. Being able to run other parts of the CPU faster when vectorised instructions aren't being used could be worth a lot.


That matches with recent material I've read on vectorized workloads: memory bandwidth can become the limiting factor.


Always nice to see people rediscovering the roofline model.
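
To make that concrete: the roofline model caps attainable throughput at min(peak compute, arithmetic intensity × memory bandwidth). With made-up illustrative numbers, a core with 100 GFLOP/s of peak vector throughput behind 50 GB/s of memory bandwidth, running a kernel that performs 0.25 FLOPs per byte moved, tops out at 0.25 × 50 = 12.5 GFLOP/s no matter how many vector ports the scheduler feeds.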


But isn't that why we have things like CUDA? Who exactly is "we" here, people who only have access to CPUs? :)


I’m not saying that you cannot write vector code, but that it’s typically a special case. CUDA APIs and annotations are bolted onto existing languages rather than reflecting languages with vector operations as natural first class operations.

C or Java have no concept of `a + b` being a vector operation the way a language like, say, APL does. You can come closer in C++, but in the end the memory model of C and C++ hobbles you. FORTRAN is better in this regard.
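
To illustrate the gap, here is a sketch assuming GCC/Clang vector extensions, which are compiler-specific rather than part of the C or C++ standard:

    // Sketch using GCC/Clang vector extensions; not standard C/C++.
    // Eight packed floats; the compiler lowers the + to SSE/AVX/NEON as available.
    typedef float f32x8 __attribute__((vector_size(32)));

    f32x8 add(f32x8 a, f32x8 b) {
        return a + b;   // element-wise, unlike the scalar semantics of plain C
    }

In APL or Fortran array syntax that element-wise meaning of + is just the ordinary one; in C or C++ it needs an extension, intrinsics, or a library.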


I see two options from this perspective.

It is always possible to inline assembler in C, and present vector operators as functions in a library.

Otherwise, R does perceive vectors, so another language that performs well might be a better choice. Julia comes to mind, but I have little familiarity with it.

With Java, linking native code in via JNI would be an (ugly) option.


Makes sense. I guess that’s why some python libs use it under the hood


What about Rust?


When the data is generated on the CPU, shoveling it to the GPU to do possibly a single vector operation or a few, and then shoveling it back to the CPU to continue, is most likely going to be more expensive than the time saved.

And CUDA is Nvidia specific.


Doesn’t CUDA also let you execute on the CPU? I wonder how efficiently.


No - a CUDA program consists of parts that run on the CPU as well as on the GPU, but the CPU (aka host) code is just orchestrating the process - allocating memory, copying data to/from the GPU, and queuing CUDA kernels to run on the GPU. All the work (i.e. running kernels) is done on the GPU.

There are other libraries (e.g. OpenMP, Intel's oneAPI) and languages (e.g. SYCL) that do let the same code be run on either CPU or GPU.
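
As a rough sketch of that single-source style (OpenMP target offload here; assumes a compiler built with offload support, and the loop simply runs on the host CPU when no GPU is present):

    // Minimal OpenMP offload sketch; falls back to the CPU if no device exists.
    #include <cstddef>

    void scale(float* x, std::size_t n, float s) {
        // map() handles the host<->device copies that raw CUDA makes you write by hand.
        #pragma omp target teams distribute parallel for map(tofrom: x[0:n])
        for (std::size_t i = 0; i < n; ++i)
            x[i] *= s;
    }

Compiled with plain -fopenmp and no offload target, the same code just runs as a parallel loop on the CPU.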


When you use a GPU, you are using a different processor with a different ISA, running its own barebones OS, with which you communicate mostly by pushing large blocks of memory through the PCIe bus. It’s a very different feel from, say, adding AVX512 instructions to your program flow.


The CPU vector performance is important for throughput-oriented processing of data e.g. databases. A powerful vector implementation gives you most of the benefits of an FPGA for a tiny fraction of the effort but has fewer limitations than a GPU. This hits a price-performance sweet spot for a lot of workloads and the CPU companies have been increasingly making this a first-class "every day" feature of their processors.


AMD tried that with HSA in the past; it doesn't really work. Unless your CPU can magically offload vector processing to the GPU or another sub-processor, you are still reliant on new code to get this working, which means you break backward compatibility with previously compiled code.

The best case scenario here is if you can have the compiler do all the heavy lifting but more realistically you’ll end up having to make developers switch to a whole new programming paradigm.


I understand that you can't convince developers to rewrite/recompile their applications for a processor that breaks compatibility. I'm wondering how many existing applications would be negatively impacted by cutting down vector throughput. With some searching, I see that some applications, like Firefox, make mild use of it. However, there are applications that would be negatively affected, such as noise suppression in Microsoft Teams, and crypto acceleration in libssl and the Linux kernel. Acceleration of crypto functions seems essential enough to warrant not touching vector throughput, so it seems vector operations are here to stay in CPUs.


Modern hash table implementations use vector instructions for lookups:

- Folly: https://github.com/facebook/folly/blob/main/folly/container/...

- Abseil: https://abseil.io/about/design/swisstables
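
Roughly, the trick looks like this (a simplified SwissTable-style sketch with SSE2, not the actual Abseil or Folly code):

    // Simplified sketch of SwissTable-style group probing (SSE2); not the real library code.
    #include <emmintrin.h>
    #include <cstdint>

    // Returns a 16-bit mask with bit i set if group[i] == h2 (the short hash fragment).
    inline std::uint16_t match_group(const std::int8_t* group, std::int8_t h2) {
        __m128i ctrl  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(group));
        __m128i match = _mm_cmpeq_epi8(ctrl, _mm_set1_epi8(h2));
        return static_cast<std::uint16_t>(_mm_movemask_epi8(match));
    }

One load, one compare and one movemask filter 16 candidate slots at a time; the probe loop then only visits slots whose bit is set.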


Sure; but it’s hard to do and very few programs get optimised to this point. Before reaching for vector instructions, I’ll:

- Benchmark, and verify that the code is hot.

- Rewrite from Python, Ruby, JS into a systems language (if necessary). Honorary mention for C# / Go / Java, which are often fast enough.

- Change to better data structures. Bad data structure choices are still so common.

- Reduce heap allocations. They're more expensive than you think, especially when you take into account the effect on the CPU cache.

Do those things well, and you can often get 3 or more orders of magnitude improved performance. At that point, is it worth reaching for SIMD intrinsics? Maybe. But I just haven’t written many programs where fast code written in a fast language (c, rust, etc) still wasn’t fast enough.

I think it would be different if languages like Rust had a high-level wrapper around SIMD that gave you similar performance to hand-written SIMD. But right now, SIMD is horrible to use and debug. And you usually need to write it per-architecture. Even Intel and AMD need different code paths because Intel has dropped AVX-512 on its consumer parts.
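
For what it's worth, C++ has a rough cut of such a wrapper in std::experimental::simd (shipping in libstdc++ since GCC 11); a minimal sketch of width-agnostic code with it, as an illustration rather than a claim that it matches hand-written intrinsics:

    // Sketch using std::experimental::simd (libstdc++ / GCC 11+); still a TS, API may change.
    #include <cstddef>
    #include <experimental/simd>
    namespace stdx = std::experimental;

    float sum(const float* data, std::size_t n) {
        using V = stdx::native_simd<float>;        // whatever vector width the target supports
        V acc = 0.0f;
        std::size_t i = 0;
        for (; i + V::size() <= n; i += V::size()) {
            V v;
            v.copy_from(data + i, stdx::element_aligned);
            acc += v;
        }
        float total = stdx::reduce(acc);           // horizontal sum of the lanes
        for (; i < n; ++i) total += data[i];       // scalar tail
        return total;
    }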

Outside generic tools like Unicode validation, JSON parsing and video decoding, I doubt modern SIMD gets much use. LLVM does what it can but ….


Indeed, people really fixate on "slow languages", but for all but the most demanding of applications, the right algorithm and data structures make the lion's share of the difference.


Reaching for SIMD intrinsics or an abstraction has been historically quite painful in C and C++. But cross-platform SIMD abstractions in C#, Swift and Mojo are changing the picture. You can write a vectorized algorithm in C# and practically not lose performance versus hand-intrinsified C, and CoreLib heavily relies on that.


Newer SoCs come with co-processors such as NPUs so it’s just a question of how long it would take for those workloads to move there.

And this would highly depend on how ubiquitous they’ll become and how standardized the APIs will be so you won’t have to target IHV specific hardware through their own libraries all the time.

Basically we need a DirectX equivalent for general purpose accelerated compute.


It's a lot more work to push data to a GPU or NPU than to just do a couple of vector ops. Crypto is important enough that many architectures have hardware accelerators just for that.


For servers no, but we’re talking about endpoints here. Also this isn’t only about reducing the existing vector bandwidth but also about not increasing it outside of dedicated co-processors.


I think the answer here is dedicated cores of different types on the same die.

Some cores will be high-performance, OoO CPU cores.

Now you make another core with the same ISA, but built for a different workload. It should be in-order. It should have a narrow ALU with fairly basic branch prediction. Most of the core will be occupied with two 1024-bit SIMD units and an 8-16x SMT implementation to hide the latency of the threads.

If your CPU and/or OS detects that a thread is packed with SIMD instructions, it will move the thread over to the wide, slow core with latency hiding. Normal threads with low SIMD instruction counts will be put through the high-performance CPU core.


Different vector widths for different cores isn't currently feasible, even with SVE. So all cores would need to support 1024-bit SIMD.

I think it's reasonable for the non-SIMD focused cores to do so via splitting into multiple micro-ops or double/quadruple/whatever pumping.

I do think that would be an interesting design to experiment with.


I actually think the CPU and GPU meeting at the idea of SIMT would be very apropos. AVX-512/AVX10 has mask registers which work just like CUDA lanes in the sense of allowing lockstep iteration while masking off lanes where it “doesn’t happen” to preserve the illusion of thread individuality. With a mask register, an AVX lane is now a CUDA thread.
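
A toy illustration of that correspondence with AVX-512 intrinsics (assumes an AVX-512F target and a 16-float buffer; the mask register plays the role of the warp's set of active lanes):

    // Toy AVX-512 sketch: "x[i] += 1.0f, but only where x[i] < limit", one 16-lane group.
    #include <immintrin.h>

    void bump_below(float* x, float limit) {
        __m512    v      = _mm512_loadu_ps(x);
        __mmask16 active = _mm512_cmp_ps_mask(v, _mm512_set1_ps(limit), _CMP_LT_OQ);
        // Masked-off lanes keep their old value, like threads disabled in a divergent warp.
        v = _mm512_mask_add_ps(v, active, v, _mm512_set1_ps(1.0f));
        _mm512_storeu_ps(x, v);
    }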

Obviously there are compromises in terms of bandwidth but it’s also a lot easier to mix into a broader program if you don’t have to send data across the bus, which also gives it other potential use-cases.

But, if you take the CUDA lane idea one step further and add Independent Thread Scheduling, you can also generalize the idea of these lanes having their own “independent” instruction pointer and flow, which means you’re free to reorder and speculate across the whole 1024b window, independently of your warp/execution width.

The optimization problem you solve is now to move all instruction pointers until they hit a threadfence, with the optimized/lowest-total-cost execution. And technically you may not know where that fence is specifically going to be! Things like self-modifying code etc. are another headache that isn't allowed in GPGPU either - there certainly will be some idioms that don't translate well, but I think that stuff is at least thankfully rare in AVX code.


This is what's happening now with NPUs and other co-processors. It's just not fully OS-managed / directed yet, but Microsoft is most likely working on that part at least.

The key part is that now there are far more use cases than there were in the early dozer days and that the current main CPU design does not compromise on vector performance like the original AMD design did (outside of extreme cases of very wide vector instructions).

And they are also targeting new use cases such as edge compute AI rather than trying to push the industry to move traditional applications towards GPU compute with HSA.


I've had thoughts along the same lines, but this would require big changes in kernel schedulers, ELF to provide the information, and probably other things.


+1: Heterogeneous/non-uniform core configurations always require a lot of very complex adjustments to the kernel schedulers and core-binding policies. Even now, after almost a decade of big.LITTLE (from Arm) configurations and/or chiplet designs (from AMD), the (Linux) kernel scheduling still requires a lot of tuning for things like games etc... Adding cores with very different performance characteristics would probably require the thread scheduling to be delegated to the CPU itself, with only hints from the kernel scheduler.


There are a couple methods that could be used.

Static analysis would probably work in this case because the in-order core would be very GPU-like while the other core would not.

In cases where performance characteristics are closer, the OS could switch cores, monitor the runtimes, and add metadata about which core worked best (potentially even about which core worked best at which times).


Persuading people to write their C++ as a graph for heterogeneous execution hasn't gone well. The machinery works though, and it's the right thing for heterogeneous compute, so should see adoption from XLA / pytorch etc.


As CPU cores get larger and larger it makes sense to always keep looking for opportunities to decouple things. AMD went with separate schedulers in the Athlon three architectural overhauls ago and hasn't reversed their decision.


> It's interesting to see that modern processor optimization still revolves around balancing hardware for specific tasks

Asking sincerely: what’s specifically so interesting about that? That is what I would naively expect.


It's also important to note that in modern hardware the processor core proper is just one piece in a very large system.

Hardware designers are adding a lot of speciality hardware, they're just not putting it into the core, which also makes a lot of sense.

https://www.researchgate.net/figure/Architectural-specializa...


Removing a core's SMT aka "hyperthreading" brings some modest hardware savings, but the bigger cost of SMT is that it makes testing and validation much more complicated. Given the existence of Intel's E-cores, I'm not surprised they're getting rid of it.


From Intel's perspective, I doubt that's true, when taking into consideration the constant stream of side channel vulnerabilities they were needing to deal with.


It's exactly the potential for side channel vulnerabilities that makes SMT so hard to get right.


You can always do like Sun and IBM and dilute the side channel across so many other threads that it's no longer reliable. IIRC, both POWER10 and SPARC do 8 threads per core.


It's also a matter of workload. For a database where threads are often waiting on trips to RAM, SMT can provide a very large boost to performance.


Lion Cove doesn't remove hyper-threading in general. Some variants will have it so Intel still has to validate it (eventually). But the first part shipping, Lunar Lake, removes hyper-threading. It may have saved them validation time for this particular part but they'll still have to do it.


Intel has said that they have designed two different versions of the Lion Cove core.

One with SMT, for server CPUs, and one in which all circuits related to SMT are completely removed, for smaller area and power consumption, to be used in hybrid CPUs for laptops and desktops, together with Skymont cores.


I think you’re saying that testing smt makes it expensive, which sounds mostly right to me, though I can imagine some arguments that it isn’t much worse. When I first read your comment, I thought it said the opposite – that removing smt requires expensive testing and validation work.


The dropping of hyperthreading is interesting. I'm hoping x64 will trend in the other direction - I'd like four or eight threads scheduled across a pool of execution units as a way of hiding memory latency. Sort of a step towards the GPU model.


This has not been successful historically because software developers struggle to design code that has mechanical sympathy with these architectural models. These types of processors will run idiomatic CPU code with mediocre efficiency, which provides an element of backward compatibility, but achieving their theoretical efficiency requires barrel-processing style data structures and algorithms, which isn't something widely taught or known to software developers.

I like these architectures but don't expect it to happen. Every attempt to produce a CPU that works this way has done poorly because software developers don't know how to use them properly. The hardware companies have done the "build it and they will come" thing multiple times and it has never panned out for them. You would need a killer app that would strongly incentivize software developers to learn how design efficient code for these architectures.


I think another huge issue is the combination of general-purpose/multi-tenant computing, especially with virtualization in the mix, and cache. Improvements gained from masking memory latency by issuing instructions from more thread execution contexts are suddenly lost when each of those thread contexts is accessing a totally different part of memory and evicting the cache line that the next thread wanted anyway.

In many ways barrel processing and hyperthreading are remarkably similar to VLIW - they're two very different sides of the same coin. At the end of the day, both work well for situations where multiple simultaneous instruction streams need to access relatively small amounts of adjacent data. Realtime signal processing, dedicated databases, and gaming are obvious applications here, which I think is why VLIW has done so well in DSP and hyperthreading did well in gaming and databases. Once the simultaneous instruction streams are accessing completely disparate parts of main memory, it all blows up.

Plus, in a multi-tenant environment, the security issues inherent to sharing even more execution state context (and therefore side channels) between domains also become untenable fairly quickly.


Hyperthreading is also useful for simply making a little progress on more threads. Even if you're not pinning a core by making full use of two hyperthreads, you can handle double the threads at once. Now, I don't know how important it is, but I assume that for desktop applications, this could lead to a snappier experience. Most code on the desktop is just waiting for memory loads and context switching.

Of course, the big elephant in the room is security - timing attacks on shared cores are a big problem. Sharing anything is a big problem for security-conscious customers.

Maybe it's the case of the server leading the client here.


Just this morning I benchmarked my 7960X on RandomX (i.e. Monero mining). That's a 24-core CPU. With 24 threads, it gets about 10kH/s and uses about 100 watts extra. With 48 threads, about 15kH/s and 150 watts. It does make sense with the nature of RandomX.

Another benchmark I've done is .onion vanity address mining. Here it's about a 20% improvement in total throughput when using hyperthreading. It's definitely not useless.

However, I didn't compare to a scenario with hyperthreading disabled in the BIOS. Are you telling me the threads get 20-50% faster, each, with it disabled?


I've been disabling HT for security reasons on all my machines ever since the first vulnerabilities appeared, with no discernible negative impact on daily usage.


I'm running a quite old CPU currently, a 6-core Haswell-E i7-5930K. Disabling HT gave me a huge boost in gaming workloads. I'm basically never doing anything that pegs the entire CPU so getting that extra 10-15% IPC for single threaded tasks is huge.

And like you said, the vulnerability consideration makes HT a hard sell for workloads that would benefit (hypervisors).

That i7 is a huge downgrade from what I was running before (long story) so I'm looking forward to Arrow Lake and I like everything I've read. In addition to removing HT, they're supporting Native JEDEC DDR5-6400 Memory so XMP won't be necessary. I've never liked XMP/Expo...


The gain from HT for doing large builds has been 20% at most. Daily usage is indistinguishable, as you say.


HT really only benefits you if you spend a lot of time waiting on IO. E.g. file serving where you're waiting on slow spinning disks.


IO wait happens on OS kernel level, not on CPU level.

RAM fetch latency is what happens on CPU level.


For security reasons, OpenBSD disables hyperthreading by default.


To summarize Intel's reasoning, the extra hardware (and I guess associated firmware or whatever it'd be called) required to manage hyperthreading in their P-cores takes up package space and power, meaning the core can't boost as high.

And since hyperthreading only increases IPC by 30% (according to Intel), they're planning on making up the loss of threads with more E-cores.

But we'll have to see how that turns out, especially since Intel's first chiplet design (the Core Ultra Series 1) had performance regressions compared to their 13th Gen mobile counterparts.


That's what SPARC and Power architectures do now, with 8 threads per core.

The only issue with these architectures is that they are priced as "high-end server stuff" while x64 is priced like a commodity.


I'm very interested to see independent testing of cores without SMT/hyperthreading. Of course it's one less function for the hardware and thread scheduler to worry about. But hyperthreading was a useful way to share resources between multiple threads that had light-to-intermediate workloads. Synthetic benchmarks might show an improvement, but I'm interested to see how everyday workloads, like web browsing while streaming a video, will react.


I was surprised that disabling SMT improved the Geekbench 6 multi-threaded results by a few percent on a Zen 3 (5900X) CPU.

While there are also other tasks where SMT does not bring advantages, for the compilation of a big software project SMT does bring an obvious performance improvement, of about 20% for the same Zen 3 CPU.

In any case, Intel has said that they have designed 2 versions of the Lion Cove core, one without SMT for laptop/desktop hybrid CPUs and one with SMT for server CPUs with P cores (i.e. for the successor of Granite Rapids, which will be launched later this year, using P-cores similar to those of Meteor Lake).


Probably because the benchmark is not using all cores so the cores hit the cache more often.


Since side-channel attacks became a common thing, there is hardly a reason to keep hyperthreading around.

It was a product of its time, a way to get cheap multi-cores when getting real cores was too expensive for regular consumer products.

Besides the security issues, hyperthreads have always been a problem for high-performance workloads, stealing resources across shared CPU units.


> there is hardly a reason to keep hyperthreading around.

Performance is still a reason. Anecdote: I have a pet project that involves searching for chess puzzles, and hyperthreading improves throughput 22%. Not massive, but definitely not nothing.


You mean 4 cores 8 threads give 22% more throughput than 8 cores 8 threads or 4 cores 4 threads?


Remember that core-to-core coordination takes longer than coordination between threads of the same core.


4c/8t gives more throughput than 4c/4t.


There are definitely workloads where turning off SMT improves performance.

SMT is a crutch. If your frontend is advanced enough to take advantage of the majority of your execution ports, SMT adds no value. SMT only adds value when your frontend can't use your execution ports, but at that point, maybe you're better off with two more simple cores anyway.

With Intel having small e-cores, it starts to become cheaper to add a couple e-cores that guarantee improvement than to make the p-core larger.


My experience with high performance computing is that the shared execution units and smaller caches are worse than dedicated cores.


As always, the answer is "it depends". If you are getting too many cache misses and are memory bound, adding more threads will not help you much. If you have idling processor backends, with FP, integer or memory units sitting there doing nothing, adding more threads might extract more performance from the part.


For what it's worth, for security reasons, OpenBSD disables hyperthreading by default.


Generally HT/SMT has never been favored for high utilization needs or low wattage needs.

On the high utilization end, stuff like offline rendering or even some realtime games, would have significant performance degradation when HT/SMT are enabled. It was incredibly noticeable when I worked in film.

And on the low wattage end, it ends up causing more overhead versus just dumping the jobs on an E core.


> And on the low wattage end, it ends up causing more overhead versus just dumping the jobs on an E core.

For most of HT's existence there weren't any E cores, which conflicts with your "never" in the first sentence.


It doesn’t because a lot of low wattage silicon doesn’t support HT/SMT anyway.

The difference is that now low wattage doesn’t have to mean low performance, and getting back that performance is better suited to E cores than introducing HT.


> It doesn’t

Saying "no" doesn't magically remove your contradiction. E cores didn't exist in laptop/PC/server CPUs before 2022 and using HT was a decent way to increase capacity to handle many (e.g. IO) threads without expensive context switches. I'm not saying E cores are a bad solution, but somehow you're trying to erase historical context of HT (or more likely just sloppy writing which you don't want to admit).


I’ve explained what I meant. You’ve interjected your own interpretation of my comment and then gotten huffy about it.

We could politely discuss it or you can continue being rude by making accusations of sloppy writing and denials.


No, you haven't explained the contradiction, you just talk over it. Before E cores were a thing, HT was a decent approach to cheaply support more low utilization threads.


Backend-bound workloads that amount to hours of endless multiplication are not that common. For workloads that are just grab-bags of hundreds of unrelated tasks on a machine, which describes the entire "cloud" thing and most internal crud at every company, HT significantly increases the apparent capacity of the machine.


The need for hyperthreading has diminished with increasing core counts and shrinking power headroom. You can just run those tasks on E cores now and save energy.


The main thing for me: they take the Apple hint and again increase caches (and add another cache layer, L0).


I don't think the L0 is the "new" cache. It's still 48K like the old L1. The 9 cycle 192K cache about halfway between the old L1 and L2 is really the new cache level in size and latency. And that's now called L1.


L0 latency is 4 cycles while old L1 was 5 cycles, so it's definitely a new cache even though the size is the same.

Apple does a lot better. I'm not sure about newer chips, but the M1 has the same 192 KB of L1 with the 4-cycle latency of Intel's tiny 48 KB cache.


Yes, Apple makes expensive CPUs and has much more cache, one of the reasons why they have been faster and why Intel adds much more caching each generation now.


The end of the unified scheduler is big news; Intel has had one since Core. Both AMD and Apple have split the scheduling up for a long time.


> The removal of Hyperthreading makes a lot of sense for Lunar Lake to both reduce the die size of the version of Lion Cove found in Lunar Lake along with simplifying the job of Thread Director.

And, you know, stop the security vulnerability bleeding.


I don't think hyperthreading was the bulk of the attack surface. It definitely presented opportunities for processes to get out of bounds, but I think speculative execution is the bulk of the issue. That genie isn't going back in the bottle - it's another way to significantly improve processor performance for the same amount of instructions.


I think the real problem is cache sharing and hyperthreading kind of depends on it, so it was only ever secure to run two threads from the same security domain in the same core


Newbie question: if the cores share an L3 cache, does that factor into the branch prediction vulnerabilities? Or does the data affected by the vulnerability stay in caches closer to the individual core? I assume so, otherwise all cores would be impacted, but I'm unclear on where it sits.


> Moving to a more customizable design will allow Intel to better optimize their P-cores for specific designs moving forward

Amateur question: is that due to the recent advances in interconnect or are they designing multiple versions of this chip?


Cheese is talking about different chips (e.g. laptop, desktop, and server) that contain Lion Cove cores. Intel doesn't really reuse chips between different segments.


I'm just wondering, how much of this analysis is real? I mean, how much of this analysis is a true weighing up of design decisions and performance, and how much is copy/pasted Intel marketing bumf?


Nothing real. They cherry-pick for what they're trying to sell you this time - advent of 2nd core, +HT, E-cores, -HT, you-name-it-next-time. Without independent benchmark + additional interpretation for your particular workload you can safely ignore all their claims.


That’s typically the question with all marketing (probably cherry-picked) benchmarks. Have to wait until independent reviewers get their hands on them.


Once (and if) Intel manages to get out of the woods with lithography, they will have an incredibly strong product lineup.


Now that the floating point vector units have gotten large enough to engage in mitosis, I wonder if there will be room in the future for little SIMD units stuck on the integer side. Bring back SSE, haha.


And the main question: is it overall better (faster and less power hungry) than the reduced-instruction-set, powerful ARM-based laptop CPUs that are around the corner (Qualcomm)? Guessing not...


There's still an insurmountable number of apps that are x86-exclusive, both new and legacy. So the chip not beating the best of ARM in benchmarks is largely irrelevant.

A Ferrari will beat a tractor on every test-bench number and on every track, but I can't plow a field with a Ferrari, so any improvement in tractor technology is still welcome even though tractors will never beat Ferraris.


I hear this argument, but I don't really believe it.

If you're talking apps before say 2015 (10 years ago), they can be emulated on ARM faster than they ran natively. That rules out 95% of the backward compatibility argument.

Most more recent apps are very portable. They were written in a managed language running on a cross-platform runtime. The source code is likely stored in git so it can be tracked down and recompiled.

Over 15 years of modern smartphones has ensured that most low-level libraries have support for ARM and other ISAs too as being ISA-agnostic has once again become important. Apple's 4 years of transition aren't to be underestimated either. Lots of devs/creatives use ARM machines and have ensured that pretty much all of the biggest pro software runs very well on non-x86 platforms.

Yes, some stuff remains, but I don't think the remaining stuff is as big a deal as some people claim.


Amazing that you can look at an ISA like ARM and say "reduced instruction set". It has 1300+ opcodes.


I used to think this too, but apparently RISC isn't about the number of instructions, but the complexity or execution time of each; as https://en.wikipedia.org/wiki/Reduced_instruction_set_comput... puts it,

> The key operational concept of the RISC computer is that each instruction performs only one function (e.g. copy a value from memory to a register).

and in fact that page even mentions at https://en.wikipedia.org/wiki/Reduced_instruction_set_comput... that

> Some CPUs have been specifically designed to have a very small set of instructions—but these designs are very different from classic RISC designs, so they have been given other names such as minimal instruction set computer (MISC) or transport triggered architecture (TTA).


It seems that "RISC" has just become a synonym for "load-store architecture"

Non-embedded POWER implementations are around 1000 opcodes, depending on the features supported, and even MIPS eventually got a square-root instruction.


Aarch64 looks a lot like x86-64 to me. Deep pipelines, loads of silicon spent on branch prediction, vector units.


At best ARM is regular rather than reduced


What are you guessing from? Historically, generation for generation, x86 is good at performance and awful at power consumption. Even when Apple (not aarch64 in general, just Apple) briefly pulled ahead on both, subsequent x86 chips kept winning on raw performance, even as they got destroyed on perf per Watt.


>kept winning on raw performance

The 13900K lost a couple of % in single-thread performance, which led to the 14900K being so overclocked/overvolted that it ended up useless for what it's made for - crunching numbers. See https://www.radgametools.com/oodleintel.htm.


IIRC they claimed that it supposedly is


Would be great if true! Competition always good for us consumers.


From an investor perspective, I'd rather hear: "we pay more for talent than our competition". That is the first step toward greatness.



