The instructions generated make a big difference. Modern processor specifications commonly quote how many instructions of each type can be "retired" per cycle. A core can retire many conditional branches at once, or branches mixed with other ops, when the branches are not taken.
So it matters whether the code generator produces dead branches that can be retired cheaply. Optimizers probably take this into account for built-in operations, but they know less about which path is the happy path in library code.
This is a motivation for the "likely" annotations compilers support. The likely path can then be made the one where the branch is not taken. Code on the unhappy path can be stuck off in some other cache line, or even another MMU page, never fetched in normal operation.
The cost seen here is likely from something else, though. Keeping the array size in a register adds register pressure, while comparing against a word on the stack uses up cache bandwidth. Doing the comparison burns an ALU unit, and propagating the result to a branch instruction via the status register constrains instruction order.
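For concreteness, here is a minimal sketch of the kind of branch being discussed, assuming the overhead in question is a slice bounds check (the function names and the `get_unchecked` comparison are my own illustration, not from any benchmark here):

```rust
// Sketch: the checked version carries a compare-and-branch on every
// iteration; the unchecked version elides it (and with it the register
// and ALU traffic described above), at the price of `unsafe`.
fn sum_checked(data: &[u64], indices: &[usize]) -> u64 {
    let mut total = 0;
    for &i in indices {
        total += data[i]; // compares i against data.len(), branches to a panic path
    }
    total
}

fn sum_unchecked(data: &[u64], indices: &[usize]) -> u64 {
    let mut total = 0;
    for &i in indices {
        // Safety: the caller must guarantee every index is in bounds.
        total += unsafe { *data.get_unchecked(i) };
    }
    total
}
```

Comparing the assembly of the two should show the extra per-iteration compare-and-branch this paragraph describes.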
Even those might not be at fault, because they might not add any extra cycles. Modern processors spend most of their time waiting for words from memory: just a few cycles for L1 cache, many more for L2 or L3, an eternity for actual RAM. They can get a fair bit done when everything fits in registers and L1 cache, and loops fit in the micro-op cache. Blow any of those, and performance goes to hell. So depending how close your code is to such an edge, extra operations might have zero effect, or might tank you.
Results of measurements don't generalize. Change something that looks like it ought to make no difference, and your performance goes up or down by 25%. In that sense, the 10% seen here is effectively noise: it is hard to know in advance what will earn or cost you 10%.
In Rust there is `#[cold]` for functions, as well as (nightly-only) `likely(cond)`/`unlikely(cond)`, and there are tricks to get something similar in stable Rust.
Branch paths that are guaranteed to end in a panic also tend to be treated as unlikely, though I'm not sure to what extent that is guaranteed.
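One such trick on stable Rust is sketched below, under the assumption that calling an empty `#[cold]` function is a strong enough hint (the names `cold_path`, `likely`, and `unlikely` are mine; this is a common community pattern, not a standard-library API):

```rust
// An empty #[cold] function: any path that calls it is treated as
// unlikely, so the optimizer lays that path out as the not-taken branch.
#[cold]
#[inline(never)]
fn cold_path() {}

/// Hint that `b` is usually true.
#[inline(always)]
fn likely(b: bool) -> bool {
    if !b {
        cold_path(); // only reached on the unexpected path
    }
    b
}

/// Hint that `b` is usually false.
#[inline(always)]
fn unlikely(b: bool) -> bool {
    if b {
        cold_path();
    }
    b
}

fn main() {
    let x = 7_i32;
    if likely(x >= 0) {
        println!("happy path, kept inline");
    } else {
        println!("unhappy path, pushed out of line");
    }
}
```

Since this only nudges the optimizer rather than guaranteeing a layout, it's worth checking the generated assembly before relying on it.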