AMD’s 7950X3D: Zen 4 Gets VCache (chipsandcheese.com)
100 points by zdw on April 23, 2023 | 86 comments



From the article -- "Unfortunately, there’s been a trend of uneven performance across cores as manufacturers try to make their chips best at everything, and use different core setups to cover more bases."

I don't find this unfortunate. Engineering is compromise and being able to make things that do a particular thing well can get you more performance per $ and per watt than you might otherwise see. The whole GPU thing that kneecapped Intel[1] is an example I use of how a compute element optimized for one thing can boost overall system performance.

I have worked on a number of bits of code for "big/little" ARM chips and while it does make scheduling more complex, overall we've been able to deliver more capability per $ and per watt. That is perhaps more important in portable systems but it works in data centers too.

I had an interesting discussion inside Google about storage and whether or not using a full-up motherboard for GFS nodes was ideal. The prevailing argument at the time was that uniformity of "nodes" meant everything was software, nodes could be swapped at will, and you could write software to do whatever you needed. But when it comes to efficiency, which is to say what percentage of available resource is used to deliver the services, you found a lot of wasted watts/CPU cycles. It all depends on what part of the engineering design space you are trying to optimize.

[1] The "kneecapped" situation is that from the PC/AT (80286) on, Intel generally made the most margin of all the chips that went into a complete system. Now it is often NVIDIA.


Unfortunate in how it demonstrates that there doesn't seem to be any lower-hanging fruit left. Actually, it doesn't so much feel like picking higher-hanging fruit as stretching hard for the last withered leaves, long after the last fruit have disappeared.


Yeah, pretty much. There was never any real expectation that Moore's observation would result in infinitely improving compute performance, and Intel famously demonstrated that reality was much closer to Jim Gray's "smoking hairy golf ball" future. [1]

[1] "Jim: Here it is. The concept is that the speed of light is finite and a nanosecond is a foot. So if you buy a gigahertz processor, it’s doing something every nanosecond. That’s the event horizon. But that’s in a vacuum, the processor is not a vacuum, and signals don’t go in a straight line and the processor is running at 3 gigahertz. So you don’t have a foot. You’ve got four inches. And the speed of light in a solid is less than that. So this is the event horizon. If something happens on one side of this thing, the clock is going to tick before to the signal gets to the other side. That’s why processors of the future have to be small and in fact golf ball– size. Why are they smoking? Well, because they have to run on a lot of electricity. The way you get things to go fast is you put a lot of power into them. So heat dissipation is a big problem. Now it’s astonishing to me that Intel has decided that this is a big problem only recently, because people knew that we were headed towards this heat cliff a long time ago. And why is it hairy? Because you’ve got to get signals in and out of it, so this thing is going to be wrapped in pins." -- https://amturing.acm.org/pdf/GrayTuringTranscript.pdf


I often use the speed of light and nanoseconds as an example of why we can't have infinite performance. But for me that only went back to 2013 at best, while this goes back to 1998. I just wish I had known this quote earlier.

We are now definitely near the end of the S curve. Maybe another 10 years of node advances and small incremental IPC progress. I often wonder what's next. Or are we... nearly done and it's all software?


> nearly done

With silicon based physical chips, I imagine this is true. I would guess, like some speculative science fiction, that the next leap in performance would be an entirely new concept. Who knows what is possible?


We still got one unused dimension. We're not done until our chips are cubes or even spheres.


It's a lot harder to get the heat out. That's why we stack things like DRAM (HBM) or SRAM (AMD V-Cache), and use 3D intensively in flash, but not yet for heavy-duty processing.

The more we go 3D on compute, the more we have to cut GHz to reduce power, and that puts more burden on the programmer. Which we'll probably do some day.


It sounds like this would be great for single user contexts, but really unpredictable for servers running multiple duplicate tasks.

Can someone who knows better than me please comment on the Linux server scheduling issues with a CPU like this?

At this stage I’m assuming I’d be better off with a CPU with all cores the same.


I'd rather have more cache than less. There are some corner cases where things will go wrong that we can dig up I'm sure, but generally, most tasks will execute at least as fast as they would have without the lopsided extra cache, and some will operate faster.

There's definitely a lot of iteration & growing we could do to get better here. Yet... it feels like searching for problems to worry about when some tasks run faster than they would have before (but not all tasks). In most deployment models, it should be fine. I think, for now, this slightly favors two use cases: one big monolith running on all cores, which will let work get consumed as it's ready; or dividing your workers onto the different chiplets and using a good load-balancing strategy that is somehow utilization aware.

Personally I'm lacking in Fear, Uncertainty or Doubt over this making things worse.


Just a small reminder that the CCD with V-Cache has way lower boost clocks and therefore performs worse in almost everything besides gaming. For casual use this is irrelevant, I think, but it is still significant in many workloads. The only workload I know of where V-Cache benefits more than it is hurt by the lower boost clocks is gaming.

The advantage of the dual-CCD design is that this mostly only hurts in highly threaded workloads, at least.


The benchmarks show it making plenty of things worse in the non-gaming domain on this particular CPU. On a server CPU that is running lower clocks anyway, it's probably near impossible for it to hurt.


> Can someone who knows better than me please comment on the Linux server scheduling issues with a CPU like this.

The scheduler was more or less designed around symmetric multiprocessing. big.LITTLE asymmetric systems will still have obviously preferred cores; if you're optimizing for throughput, add one task to each fast core first, then to each slow core, then maybe move tasks around to fit policies, etc.

With the 7900X3D and the 7950X3D, it's trickier, because one chiplet has a lower clock speed but more cache. Tasks that fit into the smaller cache will do better on the lower-cache chiplet, tasks that fit into the larger cache but not the smaller cache will do better on the larger-cache chiplet, and tasks that don't fit into either cache will probably do better on the faster chiplet, but it kind of depends. In order to make good decisions, the scheduler would need more information about the task's memory access patterns, and I don't think that's something schedulers tend to keep track of; but if this type of chip is common in the future, it will need to happen.
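
As a toy sketch of that decision logic (cache sizes roughly matching a 7950X3D, working-set numbers made up; the real scheduler has none of this information today):

  # Hypothetical placement heuristic for an asymmetric-cache part.
  FREQ_CCD_L3_MB = 32      # plain CCD: less L3, higher boost clocks
  VCACHE_CCD_L3_MB = 96    # stacked-cache CCD: more L3, lower clocks

  def pick_ccd(working_set_mb):
      if working_set_mb <= FREQ_CCD_L3_MB:
          return "frequency CCD"   # fits in either cache, so take the clocks
      if working_set_mb <= VCACHE_CCD_L3_MB:
          return "V-Cache CCD"     # only the stacked cache keeps it out of DRAM
      return "frequency CCD"       # misses to DRAM either way; clocks usually win, but it depends

  print(pick_ccd(20), pick_ccd(64), pick_ccd(500))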

For Epyc, AMD's server processor line, I believe the plan for X3D is to add cache to all the chiplets, keeping them roughly symmetric; there's still the modern situation that some cores will boost higher than others, and moving tasks to different cores can be very expensive if the process has memory still resident in cache on the old core, and the new core doesn't share that cache, etc.


This platform exposes each CCX (a group of cores which share the L3 between them; 8 per CCD on Zen 4) as a NUMA domain if you want. This means that if your workload really takes a huge performance penalty from the ~10% lower (still very high) clock speed, or only part of it really enjoys the cache, you can manually tell the OS to stay where you want.

Scheduling for this kind of chip is not super, but it might improve. Meanwhile, you can enjoy almost all of the performance for the specific workloads where it matters by doing this.
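
For example, on Linux you can steer a process onto one CCD from user space with CPU affinity; a minimal sketch in Python (the core numbering is an assumption -- check lscpu or /sys/devices/system/cpu to see which cores belong to the V-Cache CCD on your system):

  import os

  # Assumption: logical CPUs 0-15 (8 cores plus SMT siblings) are the V-Cache CCD.
  VCACHE_CPUS = set(range(16))

  os.sched_setaffinity(0, VCACHE_CPUS)   # pid 0 = the calling process
  print(os.sched_getaffinity(0))         # confirm the new CPU mask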


Kinda depends on whatever is running on it. I can imagine just... putting different workloads on cores with spicy cache vs normal ones.

... but overall it would be far easier to just buy 5 faster machines and 10 slower ones instead of having 15 "half fast half slow" and fuck around with scheduling

The aforementioned 7950X3D is a very weird proposition. The 7800X3D (8 cores, all with spicy cache) has gaming performance that's nearly the same in most games, so I guess the value proposition is "you can just run other workloads" (streaming) on the other cores, but that's not many people's use case...


Linux multi-threading code isn't the most elegant or robust out there, but I doubt it would catastrophically fail, in the sense of performing worse on the newer AMD design, outside of a few exceptional cases.


>> Linux multi-threading code isn't the most elegant or robust out there

Evidence?


You want evidence of my personal estimation of Linux?


How about Linux will move your process to another core for no good reason?


It usually moves it for good reason, and most bad reasons were eventually weeded out of the scheduler.

For example, if a task previously running on a core is waiting for something, the choice is either to allow something else to run there (and hope it finishes before the waiting thread comes back), or not to use the core at all and waste CPU capacity.

Obviously the scheduler will choose to use that now-idle core, but that might cause the previous process (which the scheduler has no idea when it will come back) to be rescheduled somewhere else, usually on a sibling core (as in closest in the memory hierarchy).

Or power saving. If, say, a core was idle (its thread was waiting on something), it might actually be quicker to run the thread on another core (especially one close in the cache hierarchy) than to wake up the idle core and then ramp up its clock speed.

Here is some interesting if a bit outdated info about the nitty gritty: https://www.usenix.org/system/files/login/articles/login_win...


How about running your process in a cgroup attached to a cpuset with the cores/NUMA nodes you want? How about cgroups v2, which has a threaded mode so a process can put its threads into different subgroups?
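
A rough sketch of the cpuset route with cgroups v2 (run as root; the mount point and CPU list are assumptions for a system where the V-Cache CCD is CPUs 0-15):

  import os

  CG = "/sys/fs/cgroup/vcache-jobs"   # assumes cgroup2 mounted at /sys/fs/cgroup

  # Enable the cpuset controller for children of the root cgroup, then create our group.
  with open("/sys/fs/cgroup/cgroup.subtree_control", "w") as f:
      f.write("+cpuset")
  os.makedirs(CG, exist_ok=True)

  with open(f"{CG}/cpuset.cpus", "w") as f:
      f.write("0-15")                 # assumed: the V-Cache CCD's logical CPUs
  with open(f"{CG}/cpuset.mems", "w") as f:
      f.write("0")                    # NUMA node 0

  # Move the current process (and its future children) into the cpuset.
  with open(f"{CG}/cgroup.procs", "w") as f:
      f.write(str(os.getpid()))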


Recent discussion on that topic: https://news.ycombinator.com/item?id=35656374


It's a pretty similar shape of problem to NUMA, which servers have managed for quite some time. (Perhaps more similar to big.LITTLE, which is not so common in servers but which Linux should have decent support for thanks to ARM SoCs.)

It's a bit of an unnecessary hassle and annoyance, though. I'm pretty sure all the V-Cache EPYCs will have cache on all their dies, so there won't be this issue.


It sounds great when paired with an APU. No such part was announced though.


The GPU cannot access the L3 though so V-cache would not help APUs. Maybe in the future AMD will have V-Infinity Cache.


What a pity!


APUs are very slow compared to discrete GPUs.


A lot of that is due to the low memory bandwidth on mainstream CPU sockets: 128-bit bus, going up to around 6GT/s currently. That's half the memory bandwidth of NVIDIA's current entry-level laptop discrete GPU. And more L3 cache on a CPU directly addresses that memory bandwidth limitation (provided it's accessible to the iGPU, which would certainly be the case if AMD put 3D cache on an APU).


All of the Ryzen 7000 series are APUs.


With AMD finally taking a page from Intel with regards to providing an iGPU to everyone, Intel expected to integrate ARC into their CPUs with Meteor Lake (14th gen), and discrete GPUs from Nvidia and AMD being as bloody expensive as they are, I wonder if we're on the verge of a turning point in desktop computing as discrete GPUs go down the path once trodden by sound cards?


Fairly common for even pedestrian discrete GPUs to use over 200W. The ludicrous ones can break 1000W. In a given generation, GPU performance is not far from linear in power consumption.

The problem is the existing iGPUs are unnecessarily anemic. Okay, the Radeon RX 6950 XT has 80 CU at 335W. You can't fit that in Socket AM5. But the Radeon RX 6400 has 12 CU at 53W. You could add that to a Ryzen 7 7700X (105W) and still have a lower TDP than a Ryzen 9 7900X (170W). You could double that and add it to a Ryzen 7 7700 (65W) and still be under 170W. You could even add that to the 7950X with the caveat that it has the base clock of the 7945HX when the iGPU is under load.

But the Ryzen 7000 series iGPU doesn't have 24 CU, or 12, it has 2. Intel's have been even slower.

The question is, why bother? 2 CU is plenty to run a monitor if you're not doing anything that benefits from a fast GPU. If you are, 12 CU is still on the low side.

But AI could change things, because then even people who don't play games will have a use for that kind of hardware. Professionals will still pay thousands of dollars for a big hot discrete GPU but there is now a case for having ten times as many CU in every business PC.

It's also starting to make more sense to put the CPU and GPU together because of unified memory. The full version of LLaMA requires ~122GB of RAM. That's petty cash in PC memory but a capital expenditure for GPUs. If AMD wanted to make a big splash, they'd drop a new Threadripper with a high TDP socket, a ton of memory channels, support for RDIMMs and a fast iGPU.
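
As a sanity check on that figure, assuming the 65B-parameter model held as 16-bit weights:

  # Back-of-the-envelope: weight memory for a 65B-parameter model at fp16.
  params = 65e9
  bytes_per_param = 2                      # fp16/bf16
  print(params * bytes_per_param / 2**30)  # ~121 GiB, before activations/KV cache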

This is the thing that may finally unseat Nvidia on AI, because if every CPU from Intel or AMD starts to include an iGPU that can do AI as fast as a $300+ Nvidia discrete GPU, people are going to make the code run on them.


Keep in mind that PC sockets have very little memory bandwidth. Even the fancy new AM5 socket with DDR5 and the "recommended" overclocking to DDR5-6000 leaves you with 96GB/sec. Even the lowly RX 6400 has a dedicated 128GB/sec, which isn't trying to keep any CPU cores happy. Doubling the CUs requires double the bandwidth that brings you up to 256GB/sec, almost as much as the new Intel Xeons (307GB/sec) with 8 channels of DDR5-4800 memory.
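
The arithmetic behind those numbers is just bus width times transfer rate; a rough sketch (peak theoretical figures, ignoring controller efficiency):

  def peak_gbs(bus_bits, gigatransfers_per_s):
      return bus_bits / 8 * gigatransfers_per_s   # bytes per transfer x transfers/s

  print(peak_gbs(128, 6.0))    # AM5, dual-channel DDR5-6000  -> 96 GB/s
  print(peak_gbs(64, 16.0))    # RX 6400, GDDR6 at 16 GT/s    -> 128 GB/s
  print(peak_gbs(512, 4.8))    # 8-channel DDR5-4800 Xeon     -> ~307 GB/s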

So basically this is why Intel/AMD iGPUs stink, and why Apple's do not. Apple gives you a choice of 100, 200, and 400GB/sec in a laptop and 400 or 800GB/sec in a desktop.


DDR5-8000 is already available. DDR5-6000 is 75% as fast, which is not that different. But if you want it you can get the same 128GB/sec from socket AM5 as the RX 6400 right now. Which is the class of GPU they could be integrating, instead of the existing ones which are <17% as fast. It's relatively uncommon for the CPU and GPU to be memory bound on the same workload. Generally it's either one or the other.

And there are GPUs, like the RX 6900 XT, that have more CUs per GB/s than the RX 6400. At the same ratio you could have 20 CU with the bandwidth of dual channel DDR5-8000, which will no doubt become more common and less expensive as socket AM5 matures. So the problem isn't the memory bandwidth, it's that the iGPU could be bigger.

And nothing stops them from making arbitrarily much memory bandwidth available even in existing sockets by including a few GB of HBM on the CPU package. That's essentially what Apple is doing to the exclusion of even having separate memory, but that's not optimal either. Now you can get a Macbook Pro with 96GB of very fast memory (if you want to pay $4000+), but then it can't run the full ~122GB LLaMA model which anyone can do with ~$200 worth of DDR4 in an ordinary desktop PC.

The right answer is to provide a few GB of HBM on the CPU package which can be used as an L4 cache and to provide the bandwidth needed for a fast iGPU, then have memory slots to add an arbitrary amount of ordinary memory so that anything that won't fit there can run at lower speed, instead of not at all. That isn't on offer yet, but it's better, and it's possible, so give it a minute.


There have been numerous attempts at GPU-friendly memory systems for CPUs. Iris Pro had off-chip but in-package memory to help; I forget if it was cache, RAM, or eDRAM. The current Apple solution looks like the best I've seen (100, 200, 400, or 800GB/sec).

Frustratingly, the PS5 and Xbox Series X both had pretty nice memory systems for CPU+iGPU, but during the entire GPU shortage AMD never brought that to market for PCs/laptops. Although there is hope: the AMD Strix Halo, due in the 2nd half of 2024, does claim to have a 256-bit-wide memory bus.

Sure, a fancy hybrid that uses a few fast stacks of HBM and some slower DIMM slots is possible, but it's far from sure such a design would hit the size, power, performance, and cost constraints of the market. Clearly the 400GB/sec Apple solution fits in a thin/light laptop with pretty good battery life.


What Apple is doing is inflexible and expensive. If you need 128GB of RAM, it isn't available. If you need 32GB of RAM, their least expensive offering appears to be a $1700 Mini. You can get a PC laptop with 32GB for under $500 and a PC desktop with 128GB for the same. You could add a $1000 discrete GPU to that and have money left over.

> Sure a fancy hybrid that uses a few fast stacks of HBM and some slower dimm slots is possible, but it's far from sure if such a design would hit the size, power, performance, and cost constraints of the market.

Slower memory costs less and uses less power than higher bandwidth configurations. Moreover, if you offered various CPUs with e.g. 8-32GB of HBM, systems that needed no more memory than that could leave their DIMM slots empty. But then systems with e.g. 16GB of HBM and 64GB of ordinary RAM would be possible, resulting in 80GB of usable memory the most active 16GB of which has quadruple the bandwidth or more. For a much lower price than 80GB of HBM and similar performance for anything with a <16GB working set.


> What Apple is doing is inflexible and expensive.

High bandwidth, High capacity, and cheap ... pick 2.

If you need memory bandwidth for more than 64GB, Apple is unmatched at anywhere close to the price. Apple has a unique combination of GPU-like memory bandwidth without the GPU-like memory limits.

> You could add a $1000 discrete GPU to that and have money left over.

Yes, but you'd also have 1/8th the CPU memory bandwidth and a max GPU ram of 16-20GB.

> [ proposed hybrid HBM + dram]

Possible, but it's a large, complex solution that will significantly increase hardware + development costs for a niche market, even if you leave the DIMMs empty. The trick is: will the HBM actually improve performance? Say you have a 128GB LLM, how much will adding 16GB of HBM improve performance? If the fast memory (HBM) is 16GB, it's going to be very tempting to just use a GPU.

> 128GB of RAM, it isn't available

The M1 Ultra Mac Studio has a 128GB config for $4,800. It hasn't been refreshed for the M2 yet, but the gen-2 maximums are 24GB (M2), 48GB (M2 Pro), and 96GB (M2 Max). Stands to reason the M2 Ultra will have twice the RAM of the M2 Max, just like the M1 Ultra allows double the M1 Max. So likely 192GB real soon. Rumors claim a release at WWDC in June, and recent OS updates mention 3 new desktop models.

800GB/sec to 192GB of RAM is quite unique in today's market. There are some similar solutions like the ARM-based A64FX, which has multiple stacks of HBM memory but a max RAM of 32GB. As you might imagine, they aren't cheap either.


DDR5 is something like 10x slower than (or 10% of the bandwidth of) the GDDR6 or HBM that's used on GPUs.

There's no point going much wider on iGPUs, because they're all throttled by DDR4/DDR5 anyway. The exception: the PS5 and Xbox both have GDDR (graphics RAM), which has 10x more bandwidth than your typical DDR4/DDR5. Since they are no longer memory-bandwidth constrained, they can go much bigger and remain practical.


> DDR5 is something like 10x slower (or 10% of the bandwidth) of GDDR6 or HBM that's used on GPUs.

DDR5 is something like 50% of the bandwidth of GDDR6. DDR5 up to 8 GT/s, GDDR6 in the range of 14-21 GT/s. For example, the Radeon RX 6400 has GDDR6 at 16 GT/s. But it also has a 64-bit memory bus, compared to 128-bit for Socket AM5. So they have the same memory bandwidth, given the fastest available DDR5.

Higher end GPUs have a wider memory bus than that, but those don't fit in the CPU socket's TDP anyway. It wouldn't prevent the iGPU from having 12+ CU instead of 2.

Moreover, higher end CPUs have a wider memory bus than that too. Socket SP5 has 12 memory channels. And a max TDP of 360W. That's as much memory bandwidth as many midrange discrete GPUs, especially anything that would fit in its TDP with a reasonable power budget left to the CPU.


They could also add some HBM as a chiplet to the CPU package. That would provide enough bandwidth to feed a fast iGPU and give the CPU a very nice L4 cache.


More like 2x. Apple, for instance, uses LPDDR5 for 100, 200, 400, or 800GB/sec.

It's mostly bus width, not the memory technology that makes a big difference.


> With AMD finally taking a page from Intel with regards to providing an iGPU to everyone

Eh, it's wasted silicon for most that will be buying those CPUs.

I'd rather see a separate chipset that integrates a GPU for those that need it, and drop a few bucks off the CPU price.


I've had my bacon saved more than enough times by integrated graphics that I consider it downright heresy to call them a waste.


Look at the size of a 4090 and think again.


The soundcard analogy may not be that faulty. General purpose soundcards have pretty much been replaced by motherboard components, while high end soundcards (audio interfaces) for DAW (digital audio workstation) use cases are an entirely separate market.

Side note: Most of those high end audio interfaces are now external devices connected via USB (formerly also via Firewire).

And since a DAW typically doesn’t need high end video, I didn’t bother with a separate GPU on my latest DAW build. I will only add a GPU card, if I ever want to do higher end gaming, video production or machine learning on that computer.


Dedicated GPUs are a chip about the same size as a CPU + a big board with power delivery, memory modules, cooling etc. CPUs would be the same size if they had all that stuff on their own discrete board too.

Pairing a 4090 + a CPU on the same chip/side by side would be a stretch, but it has nothing to do with physical size - more the motherboard needing to handle power delivery, and the separate memory not being fast enough. Something like a 3060 is probably more reasonable.

This already exists too - the PS5 and Xbox Series X/S both use APUs with much more powerful GPUs than are available as standalone purchases.


> Dedicated GPUs are a chip about the same size as a CPU + a big board with power delivery, memory modules, cooling etc.

The 4090 has a die area of ~600mm2.

A Ryzen 7 has a 71mm2 CPU die and a 122mm2 IO die.

It's not even fucking comparable.

> CPUs would be the same size if they had all that stuff on their own discrete board too

Oh, oh, I know, we can call the board that has all the power and IO the CPU needs a "motherboard" cos it cradles the CPU in its hold (socket). Oh, wait, that's exactly how it is.


That's because you're comparing one of the biggest GPUs to a midrange CPU. Epyc 9654 has 12xCCD, which is bigger than the 4090. Radeon RX 6400 has a 107mm2 die, which is smaller than the Ryzen IO die.


Ryzen 7 is the high end of the gaming market in AMD's offerings. Buying an EPYC would give you worse performance for more cost and a higher power budget (games still generally benefit more from a modest core count + high single-core speed rather than the high core count + modest single-core speed found in most high-end server/workstation chips).


Ryzen 7 is the midrange processor. Ryzen 9 has a higher boost clock but also has two CCD for not much more money. The high end is Threadripper (though there isn't a Zen4 one (yet?)), and that has up to 8 CCD which is still bigger than the 4090.

You're comparing a ~$350 CPU to a ~$1600 GPU.


No, Ryzen 7 is the high end of their regular desktop chips. Threadripper is a different market segment altogether, and all the ones they've released so far have been worse for gaming outside of a few titles that really benefit from the high thread count.

Intel and Ryzen consumer chips both follow the same marketing formula:

> Celeron/Athlon: ultra low end

> Ryzen 3/i3: low end

> Ryzen 5/i5: mid range

> Ryzen 7/i7: high end

> Ryzen 9/i9: I have money coming out of my ears and want a bigger number than everyone else.

CPUs in general are cheaper than GPUs - spending the same amount on the CPU as the GPU has almost never been the right move.

The fact that you have to move out of the consumer product range and into the workstation stuff to even find CPUs that cost as much as high end consumer GPUs should make that clear - the GPU equivalent in that space would be stuff like Quadros or RTX AXXX, which just like threadrippers and epycs are worse for gaming because they focus on memory and stability rather than raw power.


The market segment for Threadripper is literally called High-End Desktop (HEDT).

The pricing for Ryzen 7 and Ryzen 9 overlap. The Ryzen 9 7900 costs less than the Ryzen 7 7800X3D. And Ryzen 9 has better performance per dollar than Ryzen 7 on threaded workloads. The demarcation for "I have money coming out of my ears and want a bigger number than everyone else" is where dollars start going up faster than performance, which is unambiguously Threadripper/HEDT.

But Threadripper actually is significantly faster for the kind of AI workload that would otherwise be done on a GPU, or for which a fast iGPU would make sense, because it has traditionally had twice as many memory channels. If Zen 4 Threadripper ends up on Socket SP6, it will have three times as many memory channels as Socket AM5.

Zen 4-based Ryzen 3 doesn't exist and there hasn't been a new Zen-based Athlon for desktop since Zen+ three years ago.


That's a server CPU. The 7950X3D isn't that much bigger, only one extra CCD.

> Epyc 9654 has 12xCCD

That's a $7k CPU you're comparing to a $1600 GPU.


I named a big one. It's expensive.

The 9354P is $2,730 and still has 8 CCD. That's >60% more than the 4090, but it's also >60% bigger, because Epyc has a ~400mm2 I/O die on top of the CCDs.


The 4090 was just a specific example; you don't have to drop much before the die size is much more reasonable - e.g. the 4070 is 295mm2.

I'm not really sure what you're arguing so aggressively about - are you saying it's impossible to pair a CPU + a high end GPU on a single board? Because as I mentioned this already exists in the PS5 and XBSX, they could sell this in the PC space if they wanted to.


Grace + Hopper is very similar to a 4090 + CPU. They are each on their own die/chiplet, but share a package. The CPU will of course be ARM not x86-64, but otherwise very similar to what you describe.


I wish they'd also test xz and also RAR compression for vanilla vs V-Cache cores.

It'd be interesting to learn if / how the results differ depending on compression implementation.


Somewhat moot benefit if the scheduler doesn't decide which core to use depending on what the thread is doing.


The smaller Vcache devices (8 cores and fewer) uniformly have access to the large L3 (though not the 16-core 7950 review unit in the article).


Yeah, the point is about the 16-core one.


The scheduler might be able to do it, but apps like games can always pin their threads by hand.


No need - the chipset drivers automatically park the non-VCache CCD cores when a game is running (effectively turning the 7950X3D into a 7800X3D).


How would they know it's a game or not a game? I'm playing my games on Linux anyway. Haven't heard of schedulers using such logic.

I'd expect some kind of predictive AI that analyzes thread behavior to be able to help. But not sure if anyone tried making a scheduler like that.


Windows already has a system for detecting what is and isn't a game for purposes of switchable graphics laptops, so I imagine they reuse that.

You can get pretty good heuristics by looking at graphics API usage.

On Linux you could just put the appropriate taskset or numactl command in your game shortcut, it's pretty easy.


And this works pretty well, there's flickering on screen, but it is switching graphics. Maybe there's also some mouse stuttering or something as threads are moved to the active chiplet. This basically seems like AMD's version of big.LITTLE, except there's only a cache and frequency difference between cores, not altogether different cores.


Under Windows the driver has a whitelist of process names that it recognizes and pins to V-cache. Of course you don't have these problems if you buy the 7800X3D...


Well, the simplest one would be "prioritize whatever talks with input devices and video card"

I guess the hard part would be distinguishing between "a video game getting inputs" and "the Discord app getting a push-to-talk key", but the moment you've moved all of the noninteractive system stuff to slow cores and have only "things using the keyboard" on fast cores, it's already a pretty good approximation.


Not every game benefits from it as well (vs higher core clocks), so it's not really a comprehensive method.


AIUI they currently use a rather basic system on Windows that asks “is the Xbox game bar active?”, and use that to switch off the low-cache cores. I suspect if these sorts of chips become common we might get something a bit more nuanced.


It’s the same mechanism that triggers the game mode in Windows. You can tag a program in the Xbox game bar as a game if it hasn’t recognized it by default.


Pinning anything by hand defeats the purpose. You'd need to benchmark things first to figure out if it helps or not. Too manual.


Some games do run benchmarks to figure out recommended settings for your setup.


Defeats the purpose of what? Core isolation and pinning is very common.


Seems like an obvious question, but is there conventional wisdom on whether compilation/transpilation-heavy workloads are more suited to cache size or to higher clocks? Is this an “it depends” situation? Wondering what to pick for my next workhorse.


I thought I remembered benchmarks from when the 5800X3D came out showing it was good at code compilation but that is, at the very least, not always true.

https://www.phoronix.com/review/amd-ryzen9-7950x3d-linux/13


Assuming the power is reported correctly it looks like an efficiency win:

> While the build times were similar between the 7950X and 7950X3D, with the latter the CPU power consumption was significantly lower.


I'm assuming the vast majority of that benefit is just from the X3D chips having a lower stock TDP. You could achieve the same efficiency by just reducing the 7950X TDP to 120W in BIOS or Ryzen Master.


Yes, but this would reduce performance. The cache makes up for it whilst being able to run at a lower TDP.


Maybe but at that point you paid more for less


Not really. Chips are mostly factory overclocked these days. They dump in all that power to get the coveted #1 spot in the review charts. Turn on eco mode and you get 99% of the performance at a much nicer power limit.

Just pretend it’s 2013 and you got a chip with some nice overclocking potential you don’t really want to use because you don’t like heat or the noise it brings.


It’s kinda random. Every workload is going to have different requirements depending on how they’re implemented and the context of what the program is actually doing.


I have absolutely no idea regarding cpus but would this also speed up database caching?


If your working set is between 32 and 96 MB, it can make a big difference. This is probably not what most people mean when they talk about database caching, though.
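
One way to see that boundary yourself is a pointer-chase microbenchmark over growing working sets; a rough, illustrative sketch in Python (interpreter overhead adds noise, so a C version would show the cliff more sharply):

  import time
  import numpy as np

  def ns_per_access(size_mb, hops=2_000_000):
      n = size_mb * 1024 * 1024 // 8       # 8-byte slots in the working set
      nxt = np.random.permutation(n)       # each slot points at another slot
      i, t0 = 0, time.perf_counter()
      for _ in range(hops):
          i = nxt[i]                       # dependent random access
      return (time.perf_counter() - t0) / hops * 1e9

  for mb in (16, 32, 64, 96, 256):
      print(mb, "MB:", round(ns_per_access(mb), 1), "ns")  # latency should step up once you spill out of L3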


Generally memory has different access speeds. In order for a CPU to actually perform operations on the data, it needs to be loaded into the memory closest to / within the CPU.

L1 cache is the fastest (and closest to the CPU), but there’s very little of it. There are also L2, L3, and L4, each being slightly slower but larger than the previous one.

VCache is a new kind of mechanism where the caching layer is placed vertically “on top” of the CPU. This unlocks new possibilities, and possibly more cache available at lower access times, but there are some technical challenges to make it work.

It will enable your database to perform computations faster, as more data can be stored closer to the CPU.


VCache is more or less just a big L3 cache, with some quirks due to the way it is implemented. It's slightly slower than on-die L3 (~10%); on the bigger chips, it can only be accessed from a single CCD, etc.


Not if your modus operandi is to

  SELECT * FROM t1 WHERE (SELECT * FROM t1 WHERE (SELECT * FROM t1....

EDIT: for those who downvoted: https://news.ycombinator.com/item?id=32720856


... then stream the entire resultset to the server and do the filtering in software. Drives me mad!


It will generally speed up things that do a bunch of operations on a smallish set of data.

Databases usually do a few operations (say a compare, maybe a bit spicy compare) on a big set of data.



