Because a lot of us work in organizations led by Program Managers and directors who expect results within the same financial quarter as the original planning session.
No, it's about replacing static shader register allocation with something more like a cache, allowing the GPU's occupancy to be managed more dynamically. Memory is still memory.
My read of the patent is that it discusses a virtual memory system that allows the GPU to page memory into cache according to various scopes (based on tile or threadgroup, or maybe even shader stage?).
But aren't caches already a dynamic copy of main memory? Don't GPUs already prefetch buffers as needed? What is actually new here?
I don't think this is correct. My understanding is that this is about more efficiently utilizing the GPU's available resources (e.g., cache) at the core-ish level. Like imagine if hyperthreads could share register space based on what they're doing at the time, I think.
In any case, in Intel's DVMT you have a dedicated memory space for graphics. It is a dynamic amount, but it is still dedicated for graphics. You need to copy data from system memory over to graphics memory for it to be handled.
In UMA, no data copying is needed. There's one memory space.
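As a rough sketch of the difference in CUDA terms (purely an analogy - neither DVMT nor Apple's UMA is a CUDA thing, and the names and sizes below are made up): the dedicated-pool model needs an explicit staging copy, while a mapped allocation is visible to both sides with no copy.

    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 64 << 20;              // 64 MB, arbitrary

        // Dedicated-pool model (DVMT-like): allocate device memory, then copy into it.
        float *host, *dev;
        cudaMallocHost((void**)&host, bytes);       // pinned host buffer
        cudaMalloc((void**)&dev, bytes);
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

        // Single-address-space model (UMA-like): one mapped allocation, no staging copy.
        float *shared, *sharedDev;
        cudaHostAlloc((void**)&shared, bytes, cudaHostAllocMapped);
        cudaHostGetDevicePointer((void**)&sharedDev, shared, 0);
        // ...a kernel could read/write sharedDev directly here...

        cudaFree(dev);
        cudaFreeHost(host);
        cudaFreeHost(shared);
        return 0;
    }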
So, how much memory is available to your Intel iGPU? What was the machine?
If it OOMs, then one valid assumption is that it OOMs because it has less memory available than a GPU on an M1 does, but we can't know that if you fail to mention the specs.
Hell, the UMA was considered a downside on the original Xbox because everyone was used to each processor having its own memory bus, so even though the bus was (for the time) a sprightly 6.4 GB/s, it was considered lesser.
These GPUs reserved a variable amount of your system memory for the GPU. So if you had 4 GB of RAM the GPU might use 500 MB, leaving you with 3.5 GB of RAM for the CPU. There was often an option in the BIOS that allowed you to change the amount of memory reserved for the GPU.
There is a hard limit with DVMT that is described in Intel's white papers. Let's say you allow up to 256 MB of RAM to be allocated for graphics. The memory could be partitioned so that 128 MB is fixed memory that is always reserved for the graphics driver, and 128 MB is DVMT memory that the OS can use whenever the graphics driver doesn't need it.
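Putting that split together with the 4 GB machine from the comment above, the worked numbers (illustrative only) come out as:

    4096 MB total RAM
    - 128 MB fixed graphics memory  (never visible to the OS)
    - 0-128 MB DVMT memory          (returned to the OS when the driver doesn't need it)
    = 3840-3968 MB usable by the OS, depending on graphics load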
I know this might seem like a lifetime for SW people, but <3 years from patent to customer for a GPU HW innovation isn't really that long? It's pretty typical, really.
Given what I know about CUDA and assuming it's different from being able to call malloc within a kernel and paging memory between the CPU and GPU, I assume it's adding in some smartness for offloading stuff from the GPU when running multiple kernels. You want to keep stuff in GPU memory as much as possible since transfers can increase latency dramatically, but if you have multiple applications using the GPU then there might not be enough memory for all of them.
However, I'm always suspicious of how innovative patents actually are, so it wouldn't surprise me to find out it's only barely different from what CUDA or unified memory on game consoles can do.
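To make the "what CUDA can already do" baseline concrete, here's a minimal managed-memory sketch (the kernel and sizes are made up; this is just today's cudaMallocManaged demand paging, not a claim about what Apple's hardware does):

    #include <cuda_runtime.h>

    __global__ void scale(float* data, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        // ~1 GiB of floats; on platforms with demand paging this can even
        // exceed device memory and be paged in as the kernel touches it.
        const size_t n = 1u << 28;
        float* data;
        cudaMallocManaged((void**)&data, n * sizeof(float));

        for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // pages resident on the CPU

        // Optional hint: migrate pages to device 0 ahead of the launch.
        cudaMemPrefetchAsync(data, n * sizeof(float), 0);

        scale<<<(unsigned)((n + 255) / 256), 256>>>(data, n);
        cudaDeviceSynchronize();
        cudaFree(data);
        return 0;
    }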
Yeah, vague terms like "local memory" aren't helpful - there are lots of different memories local to the shader unit that are shared resources and so can affect occupancy (the LDS, the register file, TBDR stuff like the tile buffer, and probably any number of other things). Which of these is it referring to? They are separate things in some architectures, but maybe not in all?
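For the register file specifically, the occupancy link is just division. Using numbers typical of an NVIDIA SM purely to show the mechanism (the LDS, warp-slot and other limits cap it too):

    register file per SM:            65,536 x 32-bit registers
    shader needing 64 regs/thread:   65,536 / 64  = 1,024 threads can be resident
    shader needing 128 regs/thread:  65,536 / 128 =   512 threads can be resident

A static allocator has to reserve the worst-case register count for the whole lifetime of a shader, which is presumably the sort of thing a "more dynamic" scheme relaxes - but that still doesn't say which of the memories above the marketing term actually covers.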
There's the linked patent itself [0] - but that specifically seems to refer to using the MMU as the part doing this dynamic allocation and translation, while the majority of the resources above sit "before" address translation, with the MMU often living more at the L2 cache level than embedded within the shader clusters themselves. Maybe this isn't the "normal" MMU but a simpler address translator, specific to those resources, that has been added for this? Or is it something like the parameter buffer (the block of memory used to store the intermediate data between the tiling & rasterization stage and the pixel shaders)? That has been "dynamically allocated" in a similar way since Apple were just taking PowerVR cores directly.
And then what is the cost of this re-allocation? If there's a performance cost to allocating new pages, or (worse) to running out of spare pages, it might mean the (graphics API) user still has to be aware of their resource usage when scheduling shaders - so less of the promise of "just throw things at the hardware and it'll do things optimally" than people might hope.
Real World Tech's mailing list will have a good conversation about it at some point, for sure. (Although I know the community isn't primarily about GPU.)
Is macOS Dynamic Caching similar to Windows hardware-accelerated GPU Scheduling[1]? That feature purports to improve latency and GPU scheduling efficiency. Is the feature here that the OS is delegating more scheduling to the M3's ASC coprocessor[2] - assuming it's similar to the M1's?
It seems to me that every explanation of dynamic caching in terms of memory is "wrong" - as seen here and in several articles written by folks more familiar with PC hardware.
I think where some folks might get it wrong is thinking of Apple silicon as being like PC hardware, where VRAM and RAM are distinct pools of memory and using the CPU to move data between them (or other devices) is very inefficient. PCI-bus-attached devices having separate pools of memory has given rise to a plethora of technologies to allow devices to read directly from RAM (DMA), or to enable a GPU to read from NVMe (DirectStorage on Windows), and so on.
The patent from the article seems to describe unified memory, which was already part of the M1's architecture, not whatever "Dynamic Caching" is; but I'll admit Apple makes it a bit hard to understand what exactly is the case.
There's only one person I'd trust to describe what this feature actually is - I'll wait for Asahi Lina to break down what dynamic caching is and whether this is a hardware or OS feature.
> I think where some folks might get it wrong is thinking of Apple silicon as being like PC hardware, where VRAM and RAM are distinct pools of memory and using the CPU to move data between them
The patent basically describes using a page table to dynamically map pages of RAM between the GPU and CPU, something that Intel's integrated GPUs have been doing for 20 years.
Almost EVERY patent is basically something that has been done for many years, plus some improvement. Improvements are usually inconsequential, but sometimes they are very valuable.
We don't know anything about Apple's implementation, but for perspective: in Nvidia's first implementation of RTX, RT cores occupied about 6-7% of the area of a TPC, and TPCs make up the majority of the die but not all of it, so RT cores are an even smaller (though not by much) portion of the entire die. Maybe 5% or so.
It's possible that the latest gen of RT cores are a slightly larger portion of a TPC (but we don't know), but on the other hand, TPCs are a smaller portion of a die now as TPC arrays do not include L2, and L2 got many times bigger. So chances are, RT over die area is still somewhere on the order of 5%.
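Spelling that estimate out (the TPC share of the die is a rough placeholder, since the parent only says "the majority, but not the entire die"):

    RT-core share of a TPC:    ~6.5%
    TPC share of the die:      ~75-80%   (placeholder)
    RT-core share of the die:  0.065 x 0.75..0.80 ≈ 4.9-5.2%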
And in Nvidia designs, as far as I understand, RT cores sit idle in non-RT workloads, yes.
If anything, Tensor cores are almost twice as large as RT cores, and tensor cores sit idle whenever you don't use DLSS (or professional ML compute).