That sounds like a useful mental model. But it can't be quite right, can it? There aren't enough cores to do _all_ pixels in parallel, so how is that handled? Does it render tiles and compute all edges several times for this?
Pixels are grouped into independent 2x2 blocks called quads, and the derivatives are calculated only with respect to the other pixels in the quad.
At the edges of a triangle, when it doesn't cover all four pixels in a quad, the pixel shader is still evaluated on all four pixels to calculate derivatives, and at the end the results for pixels outside the triangle are thrown away. Your pixel shader had better not blow up outside the triangle, or you'll get bad derivatives for the pixels that are inside it.
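To make the quad mechanism concrete, here's a toy Python sketch (my own illustration, not any real shader API) of how coarse derivatives fall out of the 2x2 layout: each pixel's derivative is just a difference against its quad neighbors.

```python
def quad_derivatives(quad):
    """quad[y][x] holds the value a pixel shader computed at that pixel.

    Returns coarse (dFdx, dFdy): one horizontal and one vertical
    difference shared by the whole quad, which is why all four pixels
    must be evaluated even when the triangle covers only some of them.
    """
    dfdx = quad[0][1] - quad[0][0]  # right pixel minus left pixel
    dfdy = quad[1][0] - quad[0][0]  # bottom pixel minus top pixel
    return dfdx, dfdy

# A quad where the shader computed u = x * 0.25 at each pixel:
quad = [[0.00, 0.25],
        [0.00, 0.25]]
print(quad_derivatives(quad))  # → (0.25, 0.0)
```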
As triangles get smaller, the percentage of pixel shader evaluations happening outside triangles goes up. In the extreme case, if a triangle covers only one pixel, the pixel shader is executed 4 times and 3 of the results are thrown out! This has started to become a problem for games with high geometric detail, and it's one of the reasons why Epic started using software rasterization for their Nanite virtualized geometry system in Unreal 5.
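A back-of-the-envelope way to see this (toy numbers of my own, not from any GPU profiler): every partially covered quad still costs 4 pixel-shader invocations, so the wasted fraction grows as triangles shrink.

```python
def quad_waste(covered_per_quad):
    """covered_per_quad: covered-pixel count (1..4) for each quad a
    triangle touches. Returns the fraction of shader invocations
    whose results get thrown away."""
    total = 4 * len(covered_per_quad)   # hardware always shades full quads
    useful = sum(covered_per_quad)      # pixels actually inside the triangle
    return (total - useful) / total

print(quad_waste([4, 4, 4, 4]))  # big triangle, fully covered quads → 0.0
print(quad_waste([1]))           # one-pixel triangle → 0.75
```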
You might say: "why don't you just stop doing this approximate derivative thing so you don't have to execute pixels that get thrown away? Are derivatives that important?" The answer is that derivatives are used by the hardware during texture sampling to select the correct mip level, which is pretty much essential both for performance and for eliminating aliasing. So you'd have to replace that with manual calculation of derivatives by the application, which would be a big change.
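Here's a simplified sketch of the standard mip-selection idea (roughly what the graphics specs describe, reduced to isotropic filtering; the exact hardware formula varies): the screen-space UV derivatives tell you how many texels one pixel step covers, and log2 of that footprint picks the mip level.

```python
import math

def mip_level(dudx, dvdx, dudy, dvdy, tex_width, tex_height):
    """Pick a mip level from the per-pixel UV derivatives.

    (dudx, dvdx) is how UV changes per pixel step in x; (dudy, dvdy)
    per pixel step in y. These are exactly what the quad differences
    provide for free.
    """
    # Texel-space footprint of one pixel step in x and in y.
    fx = math.hypot(dudx * tex_width, dvdx * tex_height)
    fy = math.hypot(dudy * tex_width, dvdy * tex_height)
    rho = max(fx, fy)
    return max(0.0, math.log2(rho))  # clamp: never sharper than mip 0

# 256-texel texture, UVs stepping 1/64 per pixel: each pixel covers
# 4 texels, so mip 2 (quarter resolution) is the right choice.
print(mip_level(1/64, 0.0, 0.0, 1/64, 256, 256))  # → 2.0
```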
That's correct. There aren't enough physical cores to do all cells in parallel; a 1000x1000 grid would have 1M cells. These are virtual computing units: conceptually there are 1M of them, one per cell, and the thousands of physical cores take on the cells one batch at a time until all of the cells have run. In fact, multiple cores can run the code of one cell in parallel, e.g. a for-loop is unrolled and each core takes on a different iteration of the loop. The scheduling of cores to virtual computing units is like an OS scheduling CPUs to different processes.
Most instructions have no inter-dependency between cells. These instructions can be executed by a core as fast as it can on one computing unit, until an inter-dependent instruction like dFdx is encountered that requires syncing. That computing unit is put in a wait state while the core moves on to another one. When all the computing units have synced up at the dFdx instruction, they're then executed by the cores batch by batch.
No, multiple cores can't run the code of the same cell in parallel. The key here is that these cores run in lockstep, so the program counter is the exact same for all cells being run at once. In practice, each core supports 32 (or sometimes 64) cells at a time. There is also no syncing required for dFdx! Since the quads run in lockstep, the results are guaranteed to be computed at the same time. In practice, these are all computed in vector SIMD registers, and a dFdx simply pulls a register from another SIMD lane.
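The lane-shuffle view can be sketched in Python (an assumed lane layout for illustration, not vendor-documented): a warp's per-pixel values live in one wide register, one lane per pixel, with each quad occupying 4 consecutive lanes. dFdx is then a shuffle plus a subtract; no synchronization is needed because lockstep guarantees every lane's value already exists.

```python
def dfdx_warp(lanes):
    """lanes: per-pixel shader values for one warp, quads at lanes
    [4k..4k+3] in the order (x,y), (x+1,y), (x,y+1), (x+1,y+1).
    Returns the coarse dFdx for every lane."""
    out = []
    for q in range(0, len(lanes), 4):
        d = lanes[q + 1] - lanes[q]   # right minus left (top row of quad)
        out.extend([d, d, d, d])      # coarse: one value shared per quad
    return out

# One quad whose values step by 1.0 per pixel in x:
print(dfdx_warp([0.0, 1.0, 0.0, 1.0]))  # → [1.0, 1.0, 1.0, 1.0]
```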
If you want to execute more tasks than this, throw more cores at it.
All the computing units running in lockstep is only how it appears to the program; they don't have to run in lockstep for every single instruction in reality. SIMT works in a warp with up to 32 threads [1]. Different warps working on different cell regions can run with different program counters.
Also, the newer Volta architecture (section 3.2 in [1]) allows independent thread scheduling across warps, such that each thread can have its own program counter and stack.
There certainly is syncing. See how barriers are used (9.7.12.1 in [1]) to synchronize threads.
This is not how Volta's "Independent Thread Scheduling" works under the hood, though I'm not at liberty to say what's actually going on (unfortunately, one of the upsetting things about doing GPU programming is how much "behind NDA" tribal knowledge there tends to be. If you have an NVIDIA devrel contact, ask for their "GPU Programming Guide", that should hopefully clear things up). So for now, you'll just have to trust me.
Suffice it to say, Volta's "Independent Thread Scheduling" is only used for compute shaders, not for vertex and pixel shaders.
You're right to note that different warps can run with different program counters, and when communicating across warps, those need synchronization. That's what those barriers are necessary for (see GroupMemoryBarrier and friends in the high-level languages). However, the definition of dFdX chosen by these languages basically requires that all threads participating in the derivative be scheduled within the same warp. dFdX will never synchronize or wait on another warp.