Right, I should have mentioned GPUs have a ton of hardware threads. Then again, ...

Right, I should have mentioned GPUs have a ton of hardware threads. Then again, they have to, GDDR5 memory access can take a microsecond. Try latency like that on a generic CPU, hyperthreading or not...

So: GPUs are wide SIMD machines with a lot of hardware threads, massive branch and glacial memory latencies. When there's a branch or memory latency, HW simply switch thread. GPUs don't care about serial execution performance.