Given how mind-bendingly precise surrogate NNs can be, sometimes you build your pipeline with 'slow' FP16 NNs and it's still faster than the 'direct' operation they stand in for.

I also like that running the tensor cores at the same time as the CUDA cores is how you get 'more' performance. It's a bit like CPU execution ports: if you /don't/ use them, you're leaving performance on the floor. Only here the NVIDIA 'cores' (CUDA, tensor, RT) are a bit higher level than exec ports. Also, it's kind of hard (i.e. fun) to find a use for the RT cores. That means digging up all those '90s papers about exotic applications of ray tracing (I mean, the weed must have been great back then :-)

Interesting times.