
You just have to understand how a runtime if(...) differs from a compile-time #if, and where each ends up costing you time or memory:

- if(...) has a small runtime cost for literally executing the comparison instruction. very small, but there nonetheless. if you have a shader that would otherwise be a handful of instructions but is blasted up to a couple of dozen by runtime toggling of various features, you will notice a difference!

- if(...) needs more registers: to perform the comparison, and potentially to hold the result for longer if you reuse it (e.g. checking a feature combination with if(a && b), where a and b are also individual features). but this only matters if the extra if lands on the already most register-hungry path, since registers are allocated statically: the most complex path through your shader determines how many registers your gpu program needs in its entirety. once the result is used it is discarded and the register can be reused. also, on archs that have separate scalar and vector registers (most of them by now), this only affects the scalar side, since the comparisons will likely operate on uniform values.

- if(...) has a small instruction cache penalty, obviously, because your code has more instructions. this is not often talked about, but it can be relevant if you are running a really large program where different subgroups regularly operate in completely different places in the code; adding even more instructions can make it worse. I have not actually seen this in practice, but I've seen people claim it exists as an issue :-P

- if(...) implies that you have more uniform data to pass to the program (you need some way to decide which features are enabled...). besides the extra cpu-side API overhead of setting those values, making your constant/uniform buffers larger (for example crossing the threshold where the buffer is larger than one cache line) can have an impact on small shaders, since this data has to be fetched at least once per thread group. I've observed ~5% overall performance differences due to immediate-constant cache misses on nvidia, on very small programs that had to operate on a lot of elements but didn't do much per element. there are ways around that, obviously (like operating on more data per thread), but the baseline fact is that you are fetching more data than you otherwise would. it's unlikely to matter on large programs; they are bottlenecked by other stuff.

- if(...) can potentially create some optimization weirdness in your drivers. drivers statically reorder things like data accesses to hide the latency of reaching that data: you want to fetch a value from a buffer as early as possible, and do some other work in between, so it's already available when you actually use it. this means your driver can pull a memory fetch out of an if, place it as close as possible to the start of your program, and perform it regardless, so the data is ready by the time you need it. however, it might also decide to leave the fetch inside and skip the read entirely when the branch isn't taken. now think about what this means when you have something like 32 feature-enable if checks, where the compiler can't know which way each branch will go unless it does some static recompilation of the code for specific uniform values. this issue simply doesn't exist when the code is statically compiled in or out.

those are the things that come to my mind. obviously, whether any of it makes a difference always comes down to some manner of profiling: you should never blindly use one version or the other based on intuition, though that intuition can tell you whether it might be worth spending the hours to try it out. I'd argue that because of how much complexity a runtime if saves on the CPU side, it's always worth at least investigating. with some minimal abstraction it's also not a huge pain to implement...




On the other hand, sometimes you can improve speed by replacing a variant with a branch. I've done this when I know all the threads in a warp will usually take the same path, and removing variants lets me reduce the number of draw calls.



