It seems trivial for a compiler to use a conditional instruction (instead of a branch) here, so I'm very surprised that it didn't. Any idea why?
Correctly predicted conditional branches are faster than conditional instructions, because they add fewer dependencies. The programmer usually has better knowledge of whether the branch is going to be 50/50 or somewhat predictable, and is thus better suited to make that choice.
While that post is technically correct (conditional moves have high latency because you can't dispatch the instruction until the conditional variable has been evaluated), they're still good in many cases because there's also no immediate dependency on the output of the conditional move. You might end up with 10 conditional moves in your reservation stations, but that's fine if all you're doing is summing up the results. You don't actually act on that sum until the end of your loop, so it's OK if it takes a few cycles after the loop to flush all those pending conditional moves out of the reservation stations.
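To make the summing case concrete, here is a minimal sketch in C. The function names, the `threshold` parameter, and the data layout are my own assumptions for illustration, not anything from the original code. Whether the ternary form actually becomes a `cmov` depends on the compiler, target, and optimization level, so check the generated assembly rather than taking it on faith.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy form: a compiler may emit a conditional jump for the if.
 * (It is also free to turn this into a select; nothing is guaranteed.) */
int64_t sum_branchy(const int32_t *data, size_t n, int32_t threshold) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= threshold)
            sum += data[i];
    }
    return sum;
}

/* Select form: the ternary picks either data[i] or 0, and the add is
 * unconditional. Compilers often, but not always, lower this to cmov.
 * Nothing in the loop branches on the selected value; it only feeds
 * the running sum, which is read once after the loop. */
int64_t sum_select(const int32_t *data, size_t n, int32_t threshold) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t addend = (data[i] >= threshold) ? data[i] : 0;
        sum += addend;
    }
    return sum;
}
```

Both functions return the same value for any input; the only difference is the shape of the code handed to the compiler.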
The issue isn't the `cmov` itself, which is extremely forgiving nowadays (4 per cycle!), but the fact that the `cmov` introduces a dependency on both inputs and the condition. A predicted branch introduces a dependency on only the predicted input: the other input is never calculated and the condition is checked after the fact.
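A rough sketch of that difference in C, with hypothetical `path_a` and `path_b` helpers standing in for whatever work produces each input (they are made up for illustration). The source-level shape only suggests what the compiler might emit; it may still pick a branch or a select for either function.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the work that produces each input. */
static uint32_t path_a(uint32_t x) { return x * 2654435761u; }
static uint32_t path_b(uint32_t x) { return (x >> 3) ^ x; }

/* Branch shape: only the chosen path is evaluated. With a correct
 * prediction, the result depends on that one path, and the condition
 * is verified later. */
uint32_t pick_branch(int cond, uint32_t x) {
    if (cond)
        return path_a(x);
    return path_b(x);
}

/* Select shape: both inputs are computed up front, then one is chosen.
 * The result has to wait for a, b, and cond. */
uint32_t pick_select(int cond, uint32_t x) {
    uint32_t a = path_a(x);
    uint32_t b = path_b(x);
    return cond ? a : b;
}
```

The two functions always agree on the result. What changes is how much has to be ready before the result is available, which matters most when the result feeds the next iteration of a loop.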