Yeah, that's true. Someone at some point started using the term wrong and it stuck. Kind of like the words "occlude" and "albedo", which are also typically used incorrectly in graphics.
It is jargon; I think pretty much any field has those, though graphics and gamedev have a lot of their own (sometimes conflicting) terms.
My favorite is "brush" to mean a convex polyhedron. I remember when I joined a gamedev company years ago, I was looking around the editor and noticed some buttons with a paintbrush on them. Clicking one didn't do anything, and after asking around it turned out it was for making in-editor 3D geometry. But the artist who made the icon (we wanted to release the editor, which meant someone went and remade all the programmer-made icons) didn't know about it or about the terminology (I think the original programmer-made icon had the word "brush"), so he just drew what the button was supposedly about: a brush :-P.
I just went through this a couple of weeks ago. I joined a game dev company recently and I've been experiencing a lot of it. Another one is calling everything a "particle" in physics simulations. Please let me know if you can think of any others.
In permutations, the ordering of the individual elements is important (a different order of exactly the same elements is a different permutation), but ordering is irrelevant in uber shaders. They should just say combinations (different elements are toggled on and off).
With permutations without repetitions you get factorials (n!), while with permutations with repetitions you get n^n.
More precisely, when taking N objects out of M, the number of permutations is computed by multiplying N factors: either all equal to M when repetitions are allowed (i.e. the power M^N), or decreasing by one at each factor when the extracted objects must be unique (i.e. M*(M-1)*(M-2)*...), which gives the factorial in the case of N taken out of N.
With combinations, either with or without repetitions, you also get a product of N factors, but each factor is much smaller: a ratio of two integers instead of just the integer in the numerator. This is usually written in a form that is useless for actual computation, as the ratio between a factorial and the product of two other factorials (the exact form differs between the two kinds of combinations).
The total number of combinations without repetitions, summed over all sizes, is 2^N (when repetitions are allowed, the sum is infinite).
Nope. Combinations are subsets: you either include or exclude each element. So 2^n. With permutations, you have n options for the first element, (n-1) options for the second, and so on. Thus n! possibilities total.
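To put the counts from this subthread side by side, here is a quick summary sketch (n elements, k of them picked):

    \begin{align*}
    \text{permutations (order matters):}\quad & n! \\
    \text{sequences with repetition:}\quad & n^n \\
    \text{combinations of size } k\text{:}\quad & \binom{n}{k} = \frac{n!}{k!\,(n-k)!} \\
    \text{all on/off subsets (the uber shader case):}\quad & \sum_{k=0}^{n} \binom{n}{k} = 2^n
    \end{align*}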
I will say though, as someone who, along with a large portion of my friends, plays quite a few video games: everyone prefers when a game just takes the time to precompile the shaders it needs. I don't care if it needs to spin for 30 mins on first run or after an update. That's preferable to jitter, random hangs, and lag from JIT-ing shaders in-game.
Precompiling all combinations can take many hours, up to days, though, so at some point you need to know exactly what is going to be used, which is not always easy.
In a project a while back we had a problem with too many potential shader variants, which all needed to be precompiled because there was no way to determine in advance which ones were actually used. I think it was well past the 20k mark. Through some clever sharing of feature bits and using flags in the uniform buffer we managed to get it down to 4096, though I suspect we could have gone further and just used a big uber shader, since the shader logic wasn't really that complicated compared to a current-gen PBR-heavy game.
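For anyone curious, here is a minimal sketch of the flags-in-the-uniform-buffer idea (illustrative GLSL; the names and bit layout are made up, not the actual project's code):

    // Hypothetical fragment shader: a minor feature reads a runtime flag
    // from a uniform buffer instead of getting its own compiled permutation.
    #version 450

    layout(std140, binding = 0) uniform MaterialParams {
        uint featureBits;   // e.g. bit 0 = tint (made-up layout)
        vec4 tintColor;
    };

    const uint FEATURE_TINT = 1u << 0;

    layout(location = 0) in vec4 vColor;
    layout(location = 0) out vec4 outColor;

    void main() {
        vec4 color = vColor;
        // A uniform branch: every invocation in a draw takes the same path,
        // so there is no divergence, only a small instruction overhead.
        if ((featureBits & FEATURE_TINT) != 0u) {
            color *= tintColor;
        }
        outColor = color;
    }

The win is that this feature no longer multiplies the variant count; the cost is a few extra instructions and a slightly larger uniform buffer.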
And on top of that you need to add in all the other fixed-function state like blending, pixel formats, etc., unless you can make use of dynamic state.
It's a mind-boggling problem for sure... one which needs collaboration between the technical artists and whoever is working on the renderer (unless you are using Unreal/Unity/Godot, in which case it's all on the artist).
Great article. I struggle to understand when it is fine to create another `if(feature_enabled)`, with feature_enabled being a bool uniform I pass into the shader, and when to split said shader into two.
Even after writing a bunch of them, I still haven't formed an intuition as to where that threshold lies...
...and, going further, whether it makes sense to always execute that feature and make feature_enabled a float multiplier for that effect's intensity instead.
I wonder if there is a better answer than the blanket "benchmark and iterate".
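To make the question concrete, here is a sketch of the three styles in one hypothetical GLSL fragment shader (FEATURE_ENABLED, uFeatureEnabled, and uIntensity are invented names; computeEffect is a stand-in for the real feature):

    // Hypothetical fragment shader contrasting the three toggle styles.
    #version 450

    layout(location = 0) in vec2 uv;
    layout(location = 0) out vec4 outColor;

    layout(std140, binding = 0) uniform Params {
        bool  uFeatureEnabled; // option 2: runtime bool toggle
        float uIntensity;      // option 3: intensity multiplier, 0.0 = off
    };

    vec3 computeEffect(vec2 p) { return vec3(p, 0.5); } // stand-in effect

    void main() {
        vec3 color = vec3(0.0);

    #ifdef FEATURE_ENABLED          // option 1: compile-time split -- the
        color += computeEffect(uv); // CPU picks one of two compiled variants
    #endif

        if (uFeatureEnabled) {      // option 2: one shader, uniform branch
            color += computeEffect(uv);
        }

        color += uIntensity * computeEffect(uv); // option 3: always run,
                                                 // scaled to zero when off
        outColor = vec4(color, 1.0);
    }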
If the inputs or outputs of a shader change, a permutation is needed, and this happens often: think of a variant with vertex colors for wind impact on vertices, or a variant with a normal map vs. without. Your uniform and vertex attributes should also be as small as possible (when aligned) for each use.
You never want to be sampling a zero-texture just to keep the shader code generic, as sampling is very expensive.
I hope this helps a bit, I am no expert in this field.
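A minimal sketch of that kind of input-driven variant (illustrative GLSL; the HAS_NORMAL_MAP define and attribute layout are assumptions): because the define adds a vertex attribute and a texture, it changes the shader's inputs and really does need its own permutation rather than a runtime toggle.

    // Fragment shader: normal-mapped variant vs. plain variant via a define.
    #version 450

    layout(location = 0) in vec3 vNormal;
    layout(location = 1) in vec2 vUV;
    #ifdef HAS_NORMAL_MAP
    layout(location = 2) in vec4 vTangent;           // extra vertex input
    layout(binding = 0) uniform sampler2D normalMap; // extra texture input
    #endif

    layout(location = 0) out vec4 outColor;

    void main() {
    #ifdef HAS_NORMAL_MAP
        vec3 n = normalize(vNormal);
        vec3 t = normalize(vTangent.xyz);
        vec3 b = cross(n, t) * vTangent.w;           // reconstruct bitangent
        vec3 s = texture(normalMap, vUV).xyz * 2.0 - 1.0;
        vec3 normal = normalize(mat3(t, b, n) * s);
    #else
        vec3 normal = normalize(vNormal);
    #endif
        float ndl = max(dot(normal, normalize(vec3(0.4, 0.8, 0.4))), 0.0);
        outColor = vec4(vec3(ndl), 1.0);
    }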
I fail to understand why the inlined pieces of code can't be compiled into intermediate-language pieces separately. It seems like the intermediate language is where this sort of compiler jungle should be solved.
If your feature actually has to perform texture lookups and other operations with effects outside the SM core (even if that just means flushing some cache lines), it will have some cost when run in a float-blend kind of situation compared to simply "shutting it off" with a boolean and if().
The feature on/off case of conditionals is a bit special in that it shouldn't cause a big hit on perf, since partial warps won't have to execute serially for each different if-condition within the warp. But this changed somewhere along Nvidia's development, and I'm not sure anymore exactly how diverging ifs within a warp get penalized (or not):
"Independent Thread Scheduling can lead to a rather different set of threads participating in the executed code than intended if the developer made assumptions about warp-synchronicity of previous hardware architectures. In particular, any warp-synchronous code (such as synchronization-free, intra-warp reductions) should be revisited to ensure compatibility with Volta and beyond."
On the other hand, if you have the "infrastructure" to do it, there is nothing wrong with having 10,000 shaders, I guess. They are small, and the cost of switching between them should be minimal compared to the rendered triangles, unless you have a bad engine doing bad sorting where you need a different shader for every triangle you render, but that shouldn't happen...
As you note, benchmarking is best. Nvidia Nsight is a really amazing free tool that is well worth the time to learn. I've done a lot of very low-level optimization of CUDA code using it, learned a lot about how the SMs and caches work in the process, and managed to answer a lot of questions that pop up all the time, like when to put stuff in the constant cache and when to use local shared memory. With Nsight you test the different versions and observe how each leads to cache misses or pure perf hits in many ways.
It's a bit target-dependent, but for a particular piece of hardware and a given shader use case, a larger shader could mean the GPU can't keep as many copies resident to quickly switch context on an event like a texture read. Context memory is a fixed resource, and sometimes smaller shaders mean you can fit more of them.
So as the other answer states, profiling. It's also good to have an accurate mental model of the real hardware. Of course this is made difficult unless you are doing specific console development.
And it's also usually less useful to do synthetic profiling than to profile in actual use.
You just have to get an understanding of how an if(...) is different from a compile-time #if, and where each ends up costing you time or memory:
- if(...) has a small runtime cost for literally executing the comparison instruction. very small but there nonetheless. if you have a shader that would otherwise be a handful of instructions, but it's blasted to a couple of dozen by runtime-toggling of various features, you will notice a difference!
- if(...) needs more registers, also to perform the instruction, and potentially to store the result for longer if you reuse it (e.g. to check a feature activation with if(a && b) where a and b are also individual features). but this is only relevant if the extra if is added on the already 'most register hungry' path, since registers are allocated statically, so only the most complex path in your shader determines how many registers your gpu program needs in its entirety. after the result is used it is discarded and the register can be reused. also on archs that have scalar and vector registers (most of them by now), this is only relevant for the scalar part since the instructions will likely work on uniform values.
- if(...) has a small instruction cache penalty, obviously, because your code has more instructions. not often talked about but can be relevant if you are running a really large program where different subgroups might be operating in completely different places in the code regularly. adding even more instructions can make it worse. I have not actually seen this in practice but I've seen people claim it exists as an issue :-P
- if(...) implies that you have more uniform data to pass to the program (you need some way to decide which features are enabled...). besides the extra cpu side API overhead for setting those values, making your constant/uniform buffers larger and for example crossing the threshold where it's larger than 1 cache line size can have an impact on small shaders, as this data has to be fetched per thread group, at least once. I've observed ~5% overall performance differences before due to immediate constant cache misses on nvidia on very small programs that needed to operate on a lot of elements but didn't do much per element. there are ways around that obviously (like operating on more data per thread) but as a ground truth you are fetching more data than you otherwise would. it's unlikely to have an effect on large programs, they are bottlenecked by other stuff.
- if(...) can potentially create some optimization weirdness by your drivers. drivers reorder things like data accesses statically in order to optimize for the latency in accessing that data. you would want to fetch a value from some buffer as early as possible so it's already available when you actually want to use it, and do some other stuff in between. this means your driver can pull memory fetches out of an if, put it as close as possible to the start of your program, and do them anyways so they are ready by the time you need the data. however it might also decide to leave it in there and skip the read entirely if the branch isn't taken. now think about what this means when you have like 32 feature enable if checks each of which the compiler technically can't know where the branch is gonna go unless it does some static recompilation of code depending on uniform values. such an issue doesn't exist when the code is statically enabled or not.
those are the things that come to my mind. obviously whether it will make a difference always means some manner of profiling. you should never blindly use one version or the other based on intuition, though intuition can tell you whether it might be worth spending the hours to try it out. I'd argue that because of how much complexity a runtime if saves on the CPU side, it's always worth at least investigating. with some minimal abstraction it's also not a huge pain to implement...
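to illustrate the last point about fetch reordering, a hedged sketch (invented buffer and flag names): the compiler has to guess whether to hoist the load above the branch to hide latency, or leave it inside and skip it when disabled.

    // Hypothetical compute shader: with a runtime flag, the compiler must
    // choose between hoisting the extras[] load above the branch (latency
    // hidden, but the fetch always happens) or keeping it inside the if
    // (fetch skipped when disabled, but latency exposed when enabled).
    #version 450
    layout(local_size_x = 64) in;

    layout(std430, binding = 0) buffer Data  { float values[]; };
    layout(std430, binding = 1) buffer Extra { float extras[]; };

    layout(std140, binding = 2) uniform Flags { bool uUseExtra; };

    void main() {
        uint i = gl_GlobalInvocationID.x;
        float v = values[i];
        if (uUseExtra) {
            v += extras[i]; // the compiler can't know how often this branch
                            // is taken without recompiling per uniform value
        }
        values[i] = v;
    }

with a compile-time #if, this ambiguity simply doesn't exist: the fetch is either unconditionally there or not compiled in at all.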
On the other hand, sometimes you can improve speed by replacing a variant with a branch. I've done this when I know all the threads in a warp will usually take the same path, and removing variants allows me to reduce the number of draw calls.
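A sketch of what that trade can look like (hypothetical GLSL; the per-instance material id is an assumption): one shader with a branch replaces two compiled variants, so both kinds of objects can share a draw call.

    // One shader handles two material paths; vMaterialType comes from a
    // per-instance attribute, so batches no longer split by shader variant.
    #version 450

    layout(location = 0) in vec2 vUV;
    layout(location = 1) flat in uint vMaterialType; // per-instance id
    layout(binding = 0) uniform sampler2D albedoTex;
    layout(location = 0) out vec4 outColor;

    void main() {
        vec4 albedo = texture(albedoTex, vUV);
        // If most instances in a batch share the same type, the warps are
        // mostly coherent and the branch costs little.
        if (vMaterialType == 0u) {
            outColor = albedo;                           // plain path
        } else {
            outColor = vec4(albedo.rgb, albedo.a * 0.5); // faded path
        }
    }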