By the way, let me ask a couple of stupid questions about GPUs:
Do RISC-V-style free GPU projects exist, or would they be unviable because of some specific properties of GPUs?
Why even bother implementing actual rendering in hardware when you can just implement fast general-purpose calculations and use them to accelerate software rendering? CPU-based software rendering did pretty well for the first versions of Unreal and Half-Life; I imagine it could also make sense today if accelerated with something like CUDA.
> Do RISC-V-style free GPU projects exist, or would they be unviable because of some specific properties of GPUs?
Not that I know of. I don't think they are specifically unviable, but even RISC-V is relatively new and in its infancy, CPUs are in a sense a subset of GPUs (modern GPUs have pretty comprehensive compute ISAs), and additionally there's a much more viable market for low-end CPUs (microcontrollers) that just doesn't exist for low-end GPUs.
> Why even bother implementing actual rendering in hardware when you can just implement fast general-purpose calculations and use them to accelerate software rendering?
The answer is sort of a mixture of (a) that's what we already have, and (b) custom hardware is very hard to beat. Do read up on Larrabee [1].
So, GPUs have been converging on being more and more programmable, but there are a few parts that are still much faster done in silicon and are very widely used: specifically rasterization (though software can be a win with e.g. micropolygons; see UE5's Nanite), texture sampling, and, in the case of ray tracing, we're starting to see BVH traversal. Rasterization in particular is the one that pretty much dictates the structure of the render pipeline, with vertex/fragment shaders and so on, and is broadly what makes it hard to fit into a "general-purpose" pipeline with specialized instructions.
And just keep in mind, CUDA is basically just a compute shader, and a compute shader itself is basically just a fragment shader without rasterization stuck on the front and without render-target blending stuck on the end; that's also specifically the history of how we got to general-purpose GPU computation à la CUDA.
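To make that concrete, here's a minimal sketch (hypothetical code, not from any real engine): a CUDA kernel that "shades" every pixel of a buffer. The per-thread body is essentially what a fragment shader does; the difference is that we pick the thread grid ourselves instead of a rasterizer spawning one invocation per covered pixel, and nothing blends the result into a render target afterwards.

```
#include <cuda_runtime.h>

__global__ void shade(uchar4* image, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // "Fragment shader" body: derive a colour from the pixel coordinate.
    float u = x / (float)width, v = y / (float)height;
    image[y * width + x] = make_uchar4((unsigned char)(u * 255.0f),
                                       (unsigned char)(v * 255.0f), 128, 255);
}

int main()
{
    const int w = 1920, h = 1080;
    uchar4* image = nullptr;
    cudaMalloc(&image, w * h * sizeof(uchar4));

    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);   // grid chosen by us, not by a rasterizer
    shade<<<grid, block>>>(image, w, h);
    cudaDeviceSynchronize();
    cudaFree(image);
    return 0;
}
```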
Intel had a project called Larrabee where they put a lot of small but general purpose x86 cores on a GPU card. I've heard that it was a flop because certain GPU operations weren't fast enough on those little cores - you still needed some hardware acceleration. It wasn't a total failure though, Intel managed to pivot it from a GPU project into a moderately successful datacenter product, the Xeon Phi. Of course, time has passed and maybe it's worth revisiting the idea again.
The biggest barrier for anyone making a purchasable GPU is the current patent minefield dominated by Nvidia, Imagination, and others.
> when you can just implement fast general-purpose calculations and use them to accelerate software rendering?
They are optimizing for different things.
CPUs are optimized for single-threaded performance, i.e. lower latency. They only have a few cores, but these cores are spending billions of transistors implementing cache hierarchy, reordering instructions, predicting branches with neural networks, predicting indirect branches, prefetching data from RAM, executing instructions speculatively with the ability to rollback, etc. All these things are very complicated, but they do help with latency.
GPUs don't care about latency. They only care about throughput on massively parallel workloads. In all these cases where a CPU needs to do something smart to minimize latency, GPU cores simply switch to another thread and do something else in the meantime. That's how they end up with core counts measured in the hundreds, if not thousands.
One more thing: high-end CPUs run at ~4 GHz to minimize latency, while GPUs run at 1-2 GHz to maximize power efficiency.
The result is about an order of magnitude difference in throughput: a Ryzen 9 5950X peaks at about 1.8 TFLOPS of FP32, yet the similarly priced Radeon RX 6800 XT peaks at around 17 TFLOPS.
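Back-of-the-envelope, those headline numbers roughly follow from lane counts times clock (the sustained clocks below are my approximations, so treat the result as ballpark only):

$$\text{5950X: } 16\ \text{cores} \times 32\ \tfrac{\text{FP32 FLOPs}}{\text{cycle}}\ (2\times 256\text{-bit FMA}) \times \sim\!3.5\ \text{GHz} \approx 1.8\ \text{TFLOPS}$$

$$\text{RX 6800 XT: } 72\ \text{CUs} \times 64\ \text{lanes} \times 2\ \tfrac{\text{FLOPs}}{\text{cycle}}\ (\text{FMA}) \times \sim\!1.85\ \text{GHz} \approx 17\ \text{TFLOPS}$$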
I didn't mean using a CPU for rendering. I meant building a GPU without graphics-specific functions. Imagine a 3D graphics driver/engine which takes a normal GPU of today and uses its CUDA/OpenCL functions to calculate everything it needs to render graphics, without using the GPU's OpenGL/Direct3D functions. I wonder if such an engine could be viable, and if we could design a GPU which would have no actual graphics-specific functions (only implementing e.g. OpenCL) and still empower the graphics engine to a reasonable degree.
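The inner loop of such an engine is easy to sketch in pure compute terms. A rough, hypothetical CUDA example (a single hard-coded triangle, one thread per pixel; a real engine would bin triangles per tile instead of testing every pixel against every triangle): this is the edge-function test that fixed-function rasterizers do in silicon, written as an ordinary kernel.

```
#include <cuda_runtime.h>

struct Tri { float2 a, b, c; };

__device__ float edge(float2 p, float2 q, float2 r)
{
    // Signed area test: the sign says which side of edge p->q the point r lies on.
    return (q.x - p.x) * (r.y - p.y) - (q.y - p.y) * (r.x - p.x);
}

__global__ void rasterize(uchar4* target, int width, int height, Tri tri)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float2 p = make_float2(x + 0.5f, y + 0.5f);   // sample at the pixel centre
    float w0 = edge(tri.b, tri.c, p);
    float w1 = edge(tri.c, tri.a, p);
    float w2 = edge(tri.a, tri.b, p);

    // Inside the triangle iff all three edge functions have the same sign.
    if ((w0 >= 0 && w1 >= 0 && w2 >= 0) || (w0 <= 0 && w1 <= 0 && w2 <= 0))
        target[y * width + x] = make_uchar4(255, 255, 255, 255);   // "fragment shader" output
}
```

This is viable (Nanite does a far more sophisticated compute-based version of it for tiny triangles), but it's also exactly the spot where the fixed-function rasterizer, with dedicated coverage and interpolation hardware, tends to win for ordinary triangle sizes.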
IANA GPU expert, but AFAIK modern GPUs pretty much are that already: massively parallel general-purpose CPUs.
It's the GPU driver layer that takes either the compute or graphics instructions and converts them into the particular instruction set/microcode used by the "CPUs" in the GPU. In a sense that code is "recompiled" (but may be cached) every time a "program" is run.
[0] is a post that talks about reverse-engineering the interface to the GPU in the Apple M1, a lot of which consists of what I've just described, so reading that series might help understand.
The reason I'm interested in this is that it seems we wouldn't need a vendor/model-specific 3D graphics driver in such a case. The vendors could just provide libraries implementing a standardized set of general-purpose parallel calculation acceleration functions (e.g. OpenCL), and an app (or a vendor-agnostic 3D driver) would just use the vendor-agnostic interface to accelerate rendering-related calculations and then output to something like a VESA driver.
We still seem to be pretty far from there though.
It is also worth mentioning that actual 3D drivers have always been buggy. Seeing a mess of visual artifacts over, or in place of, the actual picture has never been anything unusual when using a driver which utilizes hardware graphics rendering functions. This is what got the idea into my mind in the first place. Too tired of swapping driver versions and hoping for the best. It's been more than two decades already and this still happens.
Another motivation might be simply reducing chip complexity, which might also help improve power efficiency in some usage scenarios.
What you described is fairly close to how it works on Windows.
Vendor-specific GPU drivers don't implement the complete Direct3D. They only do the low-level, hardware-dependent pieces: VRAM allocation, command queues, and the JIT compiler from DXBC into their proprietary hardware-dependent bytecode (that last one lives in the user-mode half of these drivers).
Direct3D runtimes (all 3 of them, 9, 11 and 12) are largely vendor-agnostic, and are implemented by Microsoft as a part of the OS.
I see, thank you, this is curious to know. However, I tried the Visual Studio 2019 WPF designer on a new laptop recently and it showed black artifacts on the form and complete garbage in place of the components palette. I installed the graphics driver from the GPU vendor's website and that fixed the problem. The driver used before that was a supposedly stable version developed by the same vendor and shipped by Microsoft. This led me to the conclusion that there still is a lot of weird hardware-dependent stuff happening under the hood.
WPF is relatively old tech, built in 2006 on top of DirectX 9.0c, and it has unfortunately stayed that way ever since. That's probably why the bug in the GPU drivers went unnoticed.
In my experience, the newer APIs like D3D11 and D3D12 are generally more reliable these days.
If I were managing the relevant division at Microsoft, I would consider deprecating D3D9 support in drivers and the kernel, and instead rewriting d3d9.dll to emulate the API in user mode on top of D3D12.
Microsoft already did a similar trick long ago for OpenGL. Starting with Windows Vista, they emulated GL (albeit a very old version of GL) on top of the DirectX kernel infrastructure, unless the GPU vendor shipped its own implementation of a newer GL version.
But it seems you have to actively enable it at development time; the OS won't just silently do this on its own, and I doubt Microsoft or anybody else is going to update their software to do so.
As a nit, they run at 1-2 GHz to maximize thermal efficiency. High-end GPUs aren't really power efficient at all; they are built to maximize throughput while avoiding melting.
Unviable. GPUs shine at what they do because they hide memory latency. They do that by having N threads in flight for M compute units, with N > M by a sufficiently large factor (for instance N = 5M, or more like 10M...). "In flight" means all state, including all register values, is kept inside the GPU, in its register files and on-chip memory.
So at any point in time, among the N threads there are M that are ready to execute (i.e. not waiting on memory), and those do execute. Any time a thread needs to access memory, like sampling a texture for instance, the memory lookup is scheduled and the thread is put to sleep, but its state stays on the GPU. Overall there are always M threads that aren't waiting. This is why, if you do performance analysis on a GPU, in some basic cases the memory lookups look like they are free (i.e. take no time).
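A hypothetical CUDA fragment to illustrate the shape of this (names and numbers are made up): each thread does one data-dependent load followed by trivial math, and the host launches far more threads than the GPU has execution lanes. While one warp waits on its load, the scheduler issues other resident warps, and the waiting warps' registers never leave the chip.

```
__global__ void gather_scale(const float* __restrict__ src,
                             const int*   __restrict__ indices,
                             float* dst, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = src[indices[i]];   // long-latency, data-dependent load
    dst[i] = v * k;              // cheap math once the load arrives
}

// Host side: oversubscribe on purpose, e.g. millions of threads on a GPU with only a
// few thousand FP32 lanes. The resident warps per SM (occupancy) are what hide latency:
// gather_scale<<<(n + 255) / 256, 256>>>(src, indices, dst, n, 2.0f);
```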
On the CPU, when you context switch between threads, the kernel saves the CPU state (all register values and ancillary state) to main memory. That means switching threads on a CPU core involves writing the current thread's context to memory and reading the next thread's context from memory, so it even worsens the memory latency issue.
If you design a CPU capable of holding many thread contexts in silicon (on-die memory), you might get closer to a GPU. But you don't gain much from doing so for CPU workloads. Also, at which point is your design more a GPU than a CPU?
> If you design a CPU capable of holding many thread contexts in silicon (on-die memory), you might get closer to a GPU.
From a pure programming-model POV, this is just SMT, which RISC-V supports quite handily - the native "core"-like abstraction is specifically called a 'hardware thread', or "hart" for short. Now, clearly GPGPU adds some features that are not encompassed by this model, such as various sorts of "scratchpads"/"memories", often with restricted addressing. But the general feature is accounted for.
-- RISC-V was created to support data-parallel / vector codes, even though the open aspect is what took off, and all sorts of cool stuff is being built (not an expert, but definitely a fan!)
-- ... but the reason Nvidia >> AMD, Intel, Google TPU in practice is the compounding ecosystem investment over the years: compilers, libs, and now even the rest of the chip/network. Ex: Intel, AMD, and IBM tried to push OpenCL soon after CUDA, yet ~no one runs AI stuff that way; the ecosystem just isn't there in practice.
This seems like a misnomer. It's more about rendering API architectures than GPU architecture.
Which is still important: immediate-mode vs. tile-based is a big shift in overall style, and GPU hardware is designed for particular software architectures (because the CPU will inevitably be invoking calls in a certain pattern).
But it'd probably be more accurate to call this blog post "Rendering Architecture Types Explained" rather than "GPU Architecture". A modern GPU running DirectX 9.0 or OpenGL 2.0 would still be immediate mode, for example.
No, this is about HW architectures. While they are likely evolving towards one another, there are tile-based ones (like Imagination and ARM Mali) and immediate-mode ones (Nvidia, AMD) that both implement the same APIs (OpenGL, Vulkan, etc.). All of these HW architectures are modern and in use.
Basically all modern GPU architectures implement tiled rasterization. NVIDIA has been doing it since Maxwell (2014) and AMD has been doing it since Vega (2017). Even Intel has been doing it for a few years now starting with their Gen 11 (2019) GPUs.
> Specifically, Maxwell and Pascal use tile-based immediate-mode rasterizers that buffer pixel output, instead of conventional full-screen immediate-mode rasterizers.
The parent article already discusses that article, saying those GPUs don't use TBR in areas where the primitive count is too high or something:
> Another class of hybrid architecture is one that is often referred to as tile-based immediate-mode rendering. As dissected in this article[1], this hybrid architecture is used since NVIDIA’s Maxwell GPUs. Does that mean that this architecture is like a TBR one, or that it shares all benefits of both worlds? Well, not really…
> What the article and the video fails to show is what happens when you increase the primitive count. Guillemot’s test application doesn’t support large primitive counts, but the effect is already visible if we crank up both the primitive and attribute count. After a certain threshold it can be noted that not all primitives are rasterized within a tile before the GPU starts rasterizing the next tile, thus we’re clearly not talking about a traditional TBR architecture.
Classic TBDRs typically require multiple passes on tiles with large primitive counts as well. Each tile's buffer containing binned geometry generally has a max size, with multiple passes required if that buffer size is exceeded.
Having watched the video, I'm fairly certain what is being observed is not really tiled.
I'm not sure, however, what a "tile-based immediate-mode rasterizer that buffers pixel output" actually is, but I think that's enough qualifications to make it somewhat meaningless. All modern GPUs dispatch thread groups that could look like "tiles" and have plenty of buffers, likely including buffers between fragment output and render target output/color blending, but that doesn't make them tiled/deferred renderers.
It's often been claimed that Nvidia has switched to tile-based rendering for their desktop GPUs, but I haven't seen a source that confirms this. I suspect this is speculation based on changes in raster order that produce side effects that look tiled even though they aren't.
This has been empirically tested on multiple occasions. There is an article on Real World Technologies discussing this, and similar results have been reported for newer AMD GPUs as well. I have a little tool for macOS that tests these things, and the Navi GPU in my MacBook is definitely a tiler (the Gen10 Intel GPU is not).
It's brought up in multiple other comments, so I won't go into detail, but that empirical testing is flawed and is actually measuring changes in other details of thread launch behavior.
Agreed. For non-movies/games people -- think ML, neural networks, simulations, ETL -- this is far from how we think about GPUs. Instead, the focus is much more on thread divergence, NUMA memory models, consistency models, hw/sw schedulers, latency hiding, the growing variety of DMA modes, funny ISA stacks, etc. The rendering pipeline is only a tiny bit relevant for GPGPU people, e.g. if you're trying to do 1990s-style shoehorning of compute into antiquated WebGL 1/2 rendering primitives because Google/Apple won't let you do the real thing.
I think that the article focuses too much on the academic distinction between immediate renderers and tilers but falls short of discussing how these techniques relate to real-world GPUs. For example, the fact that all contemporary AMD and Nvidia gaming GPUs are tilers with large tiles (that's one of the key reasons why Maxwell and Navi got a big boost in performance). Or that many mainstream mobile GPUs employ various hacks (e.g. vertex shader splitting) in order to simplify the architecture, which ultimately blocks their ability to scale to more advanced applications. Notably missing any mention of TBDR which currently powers the fastest low-power mobile and desktop GPUs on the market.
>For example, the fact that all contemporary AMD and Nvidia gaming GPUs are tilers with large tiles (that's one of the key reasons why Maxwell and Navi got a big boost in performance)
There's a whole section on it near the end:
"Another class of hybrid architecture is one that is often referred to as tile-based immediate-mode rendering. As dissected in this article, this hybrid architecture is used since NVIDIA’s Maxwell GPUs."
>Notably missing any mention of TBDR which currently powers the fastest low-power mobile and desktop GPUs on the market.
Another section mentions:
"There’s a long-standing myth (that luckily slowly disappears) that deferred rendering techniques are not suitable for TBR GPUs. "
> "Another class of hybrid architecture is one that is often
referred to as tile-based immediate-mode rendering. As
dissected in this article, this hybrid architecture is used since NVIDIA’s Maxwell GPUs."
Yes, they do mention it, but I feel that the relevance of this technique would warrant a more in-depth discussion.
> "There’s a long-standing myth (that luckily slowly
disappears) that deferred rendering techniques are not
suitable for TBR GPUs. "
This has nothing to do with TBDR. "Deferred rendering" is a "software" rendering technique used to optimize lighting computation, while "tile-based deferred rendering" describes a hardware GPU architecture that delays fragment shading to the latest possible moment to optimize shader unit efficiency.
> Yes, they do mention it, but I feel that the relevance of this technique would warrant a more in-depth discussion.
They do have a long section on it at the end discussing hybrid stuff.
> This has nothing to do with TBDR
You are right about TBDR vs. deferred rendering on a tile-based architecture, but it still seems the article did discuss TBDR too, just without naming it (probably to avoid confusion with explicit deferred rendering done by the application):
The main technique in there is described in the original article I believe:
[original article] "Some TBR GPUs go even further than that and perform perfect per-pixel hidden surface removal and thus can guarantee that every single pixel gets shaded exactly once throughout the whole subpass. "
vs
[imgtech description of TBDR] "Deferred rendering means that the architecture will defer all texturing and shading operations until all objects have been tested for visibility. The efficiency of PowerVR Hidden Surface Removal (HSR) is high enough to allow overdraw to be removed entirely for completely opaque renders."
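As a toy illustration of that "shade every pixel exactly once" idea (hypothetical kernel and data layout, not how PowerVR hardware actually stores things): suppose a first pass has recorded only a depth and a triangle id for every candidate fragment in a tile; a resolve step like the sketch below keeps the nearest one per pixel, so the expensive shading afterwards runs once per pixel regardless of overdraw.

```
__global__ void hsr_resolve(const float* fragDepth,   // candidate fragment depths, per pixel
                            const int*   fragTri,     // candidate triangle ids, -1 = none
                            int fragsPerPixel, int numPixels,
                            int* winningTri)          // output: the single triangle to shade
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numPixels) return;

    float best = 1e30f;
    int bestTri = -1;
    for (int f = 0; f < fragsPerPixel; ++f) {          // visibility test only, no shading yet
        float z = fragDepth[p * fragsPerPixel + f];
        int   t = fragTri[p * fragsPerPixel + f];
        if (t >= 0 && z < best) { best = z; bestTri = t; }
    }
    winningTri[p] = bestTri;   // shade this triangle at this pixel later, exactly once
}
```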
Regarding Maxwell and Navi: Actually, that's not true.
The micro-benchmark that suggested Maxwell (and later) was a tiled deferred GPU was actually measuring something else. Each GPC gets assigned different screenspace areas, and the concurrency rules between those areas are relaxed (unless stricter ordering is explicitly required by shader atomics).
The result looks somewhat like tiled deferred rendering in that micro-benchmark. But it's still very much immediate mode.
A similar thing happened with Navi.
However, there are mobile GPUs (Qualcomm's Adreno) that dynamically switch between tiled deferred mode and immediate mode on a per renderpass basis, depending on what driver heuristics suggest will be faster.
I don't think anyone has ever suggested that Maxwell or Navi were doing tiled deferred rendering, but all evidence (as well as technical documentation) points to them doing tiled immediate rendering. The tests in question rely on an atomic variable to draw only a limited number of fragments, and it is very clear that rasterization proceeds tile by tile, with obvious geometry binning. Which is the very definition of tiled rendering.
The big difference between a modern desktop GPU and a run-of-the-mill mobile GPU is that the former uses much larger tiles and smaller binning buffers, so you start spilling tiles after just several dozen primitives. Desktop GPUs implement tiling primarily to improve memory access locality. Mobile GPUs, on the other hand, invest heavily in tiling because it is the basis of their existence; they wouldn't be able to do anything within their low power budget otherwise.
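To picture what the "binning buffer" and "spilling tiles" parts mean mechanically, here's a toy CUDA sketch (tile size, bin capacity, and all names invented for illustration): each triangle's screen-space bounds get appended into fixed-capacity per-tile lists, and a full bin is flagged, which is roughly the point where a real tiler has to flush and handle that tile in more than one pass.

```
#include <cuda_runtime.h>

constexpr int TILE    = 32;   // tile size in pixels (made up for this sketch)
constexpr int BIN_CAP = 64;   // primitives a tile can record before we "spill"

struct AABB { float minx, miny, maxx, maxy; };   // screen-space triangle bounds

__global__ void bin_triangles(const AABB* tris, int numTris,
                              int tilesX, int tilesY,
                              int* binCounts, int* binLists, int* overflow)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numTris) return;

    AABB b = tris[t];
    int tx0 = max(0, (int)b.minx / TILE), tx1 = min(tilesX - 1, (int)b.maxx / TILE);
    int ty0 = max(0, (int)b.miny / TILE), ty1 = min(tilesY - 1, (int)b.maxy / TILE);

    // Append this triangle's index to every tile bin its bounding box overlaps.
    for (int ty = ty0; ty <= ty1; ++ty)
        for (int tx = tx0; tx <= tx1; ++tx) {
            int bin  = ty * tilesX + tx;
            int slot = atomicAdd(&binCounts[bin], 1);
            if (slot < BIN_CAP)
                binLists[bin * BIN_CAP + slot] = t;   // binned geometry for this tile
            else
                atomicExch(&overflow[bin], 1);        // bin is full: this tile needs another pass
        }
}
```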
> However, there are mobile GPUs (Qualcomm's Adreno) that dynamically switch between tiled deferred mode and immediate mode on a per-renderpass basis
I've seen multiple mentions that Adreno supports TBDR, but no actual evidence for it. When asked, folks usually give me some link or documentation that describes immediate-mode tiled rendering. AFAIK, "real" TBDR has only ever been achieved by Imagination; Apple inherited it from them and deployed it at a large scale.
Imgtec and Apple use the term "Tile-Based Deferred Rendering" to mean a combination of tiling and deferred shading. Because that's what their GPUs do.
Other vendors, like Qualcomm [1], still use the term "deferred" in regard to their tile-based rendering, simply because the draw calls are deferred. It doesn't mean deferred shading.
Every company appears to make up the terminology as they go. I found an early presentation from ARTX [2] and they use database terminology to describe what we now call vertex buffers.
Gah, so when Qualcomm uses the word "deferred", they just mean "tiled", and each tile still rasterizes each tri in order. Agreed on terrible terminology, this is a disaster.