The simplified SIMD cores in early GPUs had to fake branching to some extent for their virtual threads: every branch in the shader code would be tested for each virtual flag and that thread (really just a vector component) would be masked out for the instructions of the branch that didn't apply. The GPU would run both branches, relying on the mask. It was workable, but very slow.
What architecture would do that, and how?