If your feature actually has to perform texture lookups or anything else with effects outside the SM core (even if it just means touching some cache lines), it will have some cost when you run it in a float-blend kind of situation compared to simply "shutting it off" with a boolean and an if().
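To make the comparison concrete, here is a rough CUDA-style sketch (kernel names, texture usage and numbers are all made up for illustration): the blend version always pays for the detail fetch even at zero weight, while the uniform bool branch can skip it entirely.

```cuda
#include <cuda_runtime.h>

// Blend version: the detail texture fetch happens no matter what,
// even when detailAmount is 0.0f.
__global__ void shadeBlend(cudaTextureObject_t detailTex, float4* colors,
                           int n, float detailAmount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 c = colors[i];
    float4 d = tex2D<float4>(detailTex, (i + 0.5f) / n, 0.5f);
    colors[i] = make_float4(c.x + detailAmount * d.x,
                            c.y + detailAmount * d.y,
                            c.z + detailAmount * d.z, c.w);
}

// Branch version: with a uniform bool the whole warp takes the same path,
// so turning the feature off really does skip the texture fetch (and its
// cache traffic) instead of just multiplying the result by zero.
__global__ void shadeBranch(cudaTextureObject_t detailTex, float4* colors,
                            int n, bool detailOn)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (detailOn) {
        float4 c = colors[i];
        float4 d = tex2D<float4>(detailTex, (i + 0.5f) / n, 0.5f);
        colors[i] = make_float4(c.x + d.x, c.y + d.y, c.z + d.z, c.w);
    }
}
```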
The feature on/off case of conditionals is a bit special in that it shouldn't cause a big hit on perf: when the condition is uniform across the warp, partial warps won't have to execute serially for each different if-branch within the warp. But this changed somewhere along Nvidia's development, and I'm not sure anymore exactly how diverging ifs within a warp get hit (or not):
"Independent Thread Scheduling can lead to a rather different set of threads participating in the executed code than intended if the developer made assumptions about warp-synchronicity of previous hardware architectures. In particular, any warp-synchronous code (such as synchronization-free, intra-warp reductions) should be revisited to ensure compatibility with Volta and beyond."
On the other hand, if you have the "infrastructure" to do it, there is nothing wrong with having 10,000 shaders, I guess. They are small, and the cost of switching between them should be minimal compared to the rendered triangles - unless you have a bad engine doing bad sorting where you need a different shader for every triangle you render, but that shouldn't happen...
As you note, benchmarking is the best option. Nvidia Nsight is a really amazing free tool that is well worth the time to learn. I've done a lot of very low-level optimization of CUDA code with it, learned a lot about how the SMs and caches work in the process, and managed to answer a lot of the questions that pop up all the time, like when to put stuff in the constant cache, when to use shared memory, etc. With Nsight you can test the different versions and see exactly where each one causes cache misses or plain perf hits.
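To give an idea of the kind of thing you end up comparing in Nsight, here is a toy kernel (entirely made-up names and sizes) that uses both options mentioned above: a read-only table in __constant__ memory, which the constant cache broadcasts cheaply when all threads read the same element, and a per-block staging tile in __shared__ memory:

```cuda
// Read-only weights: every thread reads the same element on each iteration,
// which is exactly the broadcast pattern the constant cache is good at.
__constant__ float c_weights[64];

__global__ void weightedSmooth(const float* in, float* out, int n)
{
    __shared__ float tile[256];                  // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage the block's slice of the input once, then reuse it 64 times.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    if (i < n) {
        float acc = 0.0f;
        for (int k = 0; k < 64; ++k)
            acc += c_weights[k] * tile[(threadIdx.x + k) % blockDim.x];
        out[i] = acc;
    }
}
```

Swapping either of those for plain global loads and rerunning the two versions side by side is the sort of experiment where the profiler's memory counters make the difference obvious.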