Because a lot of us work in organizations led by Program Managers and directors who expect results within the same financial quarter as the original planning session.
No, it's about replacing static shader register allocation with something more like a cache, allowing the GPU's occupancy to be managed more dynamically. Memory is still memory.
My read of the patent is that it discusses a virtual memory system that allows the GPU to page memory into cache according to various scopes (based on tile or threadgroup, or maybe even shader stage?).
But aren't caches already a dynamic copy of main memory? Don't GPUs already prefetch buffers as needed? What is actually new here?
I don't think this is correct. My understanding is that this is about more efficiently utilizing the GPU's available resources (e.g., cache) at the core-ish level. Like imagine if hyperthreads could share register space based on what they're doing at the time, I think.
In any case, in Intel's DVMT you have a dedicated memory space for graphics. It is a dynamic amount, but it is still dedicated for graphics. You need to copy data from system memory over to graphics memory for it to be handled.
In UMA, no data copying is needed. There's one memory space.
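As a rough sketch of the difference in CUDA terms (purely an analogy - neither DVMT nor Apple's UMA is a CUDA thing, and the names and sizes below are made up): the dedicated-pool model needs an explicit staging copy, while a mapped allocation is visible to both sides with no copy.

    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 64 << 20;              // 64 MB, arbitrary

        // Dedicated-pool model (DVMT-like): allocate device memory, then copy into it.
        float *host, *dev;
        cudaMallocHost((void**)&host, bytes);       // pinned host buffer
        cudaMalloc((void**)&dev, bytes);
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

        // Single-address-space model (UMA-like): one mapped allocation, no staging copy.
        float *shared, *sharedDev;
        cudaHostAlloc((void**)&shared, bytes, cudaHostAllocMapped);
        cudaHostGetDevicePointer((void**)&sharedDev, shared, 0);
        // ...a kernel could read/write sharedDev directly here...

        cudaFree(dev);
        cudaFreeHost(host);
        cudaFreeHost(shared);
        return 0;
    }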
So, how much memory is available to your Intel iGPU? What was the machine?
If it OOMs, then one valid assumption is that it OOMs because it has less memory available than a GPU on an M1 does, but we can't know that if you fail to mention the specs.
Hell, the UMA was considered a downside on the original Xbox because everyone was used to each processor having its own memory bus, so even though the bus was (for the time) a sprightly 6.4 GB/s, it was considered lesser.
These GPUs reserved a variable amount of your system memory for the GPU. So if you had 4 GB of RAM the GPU might use 500 MB, leaving you with 3.5 GB of RAM for the CPU. There was often an option in the BIOS that allowed you to change the amount of memory reserved for the GPU.
There is a hard limit with DVMT that is described in Intel's white papers. Let's say you allow up to 256 MB of RAM to be allocated for graphics. The memory could be partitioned so that 128 MB is fixed memory that is always reserved for the graphics driver, and 128 MB is DVMT memory that the OS can use whenever the graphics driver doesn't need it.
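Putting that split together with the 4 GB machine from the comment above, the worked numbers (illustrative only) come out as:

    4096 MB total RAM
    - 128 MB fixed graphics memory  (never visible to the OS)
    - 0-128 MB DVMT memory          (returned to the OS when the driver doesn't need it)
    = 3840-3968 MB usable by the OS, depending on graphics load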
I know this might seem like a lifetime for SW people, but <3 years from patent to customer for a GPU HW innovation isn't really that long? It's pretty typical, really.
Given what I know about CUDA and assuming it's different from being able to call malloc within a kernel and paging memory between the CPU and GPU, I assume it's adding in some smartness for offloading stuff from the GPU when running multiple kernels. You want to keep stuff in GPU memory as much as possible since transfers can increase latency dramatically, but if you have multiple applications using the GPU then there might not be enough memory for all of them.
However, I'm always suspicious of how innovative patents actually are, so it wouldn't surprise me to find out it's only barely different from what CUDA or unified memory on game consoles can do.
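To make the "what CUDA can already do" baseline concrete, here's a minimal managed-memory sketch (the kernel and sizes are made up; this is just today's cudaMallocManaged demand paging, not a claim about what Apple's hardware does):

    #include <cuda_runtime.h>

    __global__ void scale(float* data, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        // ~1 GiB of floats; on platforms with demand paging this can even
        // exceed device memory and be paged in as the kernel touches it.
        const size_t n = 1u << 28;
        float* data;
        cudaMallocManaged((void**)&data, n * sizeof(float));

        for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // pages resident on the CPU

        // Optional hint: migrate pages to device 0 ahead of the launch.
        cudaMemPrefetchAsync(data, n * sizeof(float), 0);

        scale<<<(unsigned)((n + 255) / 256), 256>>>(data, n);
        cudaDeviceSynchronize();
        cudaFree(data);
        return 0;
    }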
Yeah, vague terms like "local memory" aren't helpful - there are lots of different memories local to the shader unit that are shared resources and so can affect occupancy (the LDS, the register file, TBDR stuff like the tile buffer, and probably any number of other things). Which of these is it referring to? They are separate things in some architectures, but maybe not in all?
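For the register file specifically, the occupancy link is just division. Using numbers typical of an NVIDIA SM purely to show the mechanism (the LDS, warp-slot and other limits cap it too):

    register file per SM:            65,536 x 32-bit registers
    shader needing 64 regs/thread:   65,536 / 64  = 1,024 threads can be resident
    shader needing 128 regs/thread:  65,536 / 128 =   512 threads can be resident

A static allocator has to reserve the worst-case register count for the whole lifetime of a shader, which is presumably the sort of thing a "more dynamic" scheme relaxes - but that still doesn't say which of the memories above the marketing term actually covers.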
There's the linked patent itself [0] - but that specifically seems to refer to using the MMU as the part doing this dynamic allocation and translation, while the majority of the resources above sit "before" address translation, with the MMU often living more at the L2 cache level than embedded within the shader clusters themselves. Maybe this isn't the "normal" MMU but a simpler address translator, specific to those resources, that has been added for this? Or is it something like the parameter buffer (the block of memory used to store the intermediate data between the tiling & rasterization stage and the pixel shaders)? That has been "dynamically allocated" in a similar way since Apple were just taking PowerVR cores directly.
And then what is the cost of this re-allocation? If there's a performance cost to allocating new pages, or (worse) to running out of spare pages, it might mean the (graphics API) user still has to be aware of their resource usage when scheduling shaders - so less of the promise of "just throw things at the hardware and it'll do things optimally" than people might hope.
Real World Tech's mailing list will have a good conversation about it at some point, for sure. (Although I know the community isn't primarily about GPU.)
Is macOS Dynamic Caching similar to Windows hardware-accelerated GPU Scheduling[1]? That feature purports to improve latency and GPU scheduling efficiency. Is the feature here that the OS is delegating more scheduling to the M3's ASC coprocessor[2] - assuming it's similar to the M1's?
It seems to me that every explanation of dynamic caching in terms of memory is "wrong" - as seen here and in several articles written by folks more familiar with PC hardware.
I think where some folks might get it wrong is thinking of Apple silicon as being like PC hardware, where VRAM and RAM are distinct pools of memory and using the CPU to move data between them (or other devices) is very inefficient. PCI-bus-attached devices having separate pools of memory has given rise to a plethora of technologies to allow devices to read directly from RAM (DMA), or to enable a GPU to read from NVMe (DirectStorage on Windows), and so on.
The patent from the article seems to describe unified memory, which was already part of the M1's architecture, not whatever "Dynamic Caching" is; but I'll admit Apple makes it a bit hard to understand what exactly is the case.
There's only one person I'd trust to describe what this feature actually is - I'll wait for Asahi Lina to break down what dynamic caching is and whether this is a hardware or OS feature.
> I think where some folks might get it wrong is thinking of Apple silicon as being like PC hardware, where VRAM and RAM are distinct pools of memory and using the CPU to move data between them
The patent basically describes using a page table to dynamically map pages of RAM between the GPU and CPU, something that Intel's integrated GPUs have been doing for 20 years.
Almost EVERY patent is basically something that has been done for many years, plus some improvement. Improvements are usually inconsequential, but sometimes they are very valuable.
We don't know anything about Apple's implementation, but for perspective: in Nvidia's first implementation of RTX, RT cores occupied about 6-7% of the area of a TPC, and TPCs make up the majority of the die but not all of it, so RT cores are an even smaller (though not by much) portion of the entire die. Maybe 5% or so.
It's possible that the latest gen of RT cores are a slightly larger portion of a TPC (but we don't know), but on the other hand, TPCs are a smaller portion of a die now as TPC arrays do not include L2, and L2 got many times bigger. So chances are, RT over die area is still somewhere on the order of 5%.
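Spelling that estimate out (the TPC share of the die is a rough placeholder, since the parent only says "the majority, but not the entire die"):

    RT-core share of a TPC:    ~6.5%
    TPC share of the die:      ~75-80%   (placeholder)
    RT-core share of the die:  0.065 x 0.75..0.80 ≈ 4.9-5.2%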
And in Nvidia designs, as far as I understand, RT cores sit idle in non-RT workloads, yes.
If anything, Tensor cores are almost twice as large as RT cores, and tensor cores sit idle whenever you don't use DLSS (or professional ML compute).