The author seems to be freaking out a wee bit much over branching, which, last time I checked, is quite cheap on the shallow-pipeline ARM cores on the market. Branches get more expensive the deeper the pipeline gets, which means they're hugely important for desktop CPUs (although a little less now than in the days of the "NetBurst" cores from Intel). But for ARM? Meh.
A much bigger deal on ARM is cache behavior. The L1 caches on these cores are very small, and there is no L2. Keeping working set sizes down for both instructions and data is hugely important, which means that tricks like these aren't always a win if you're inlining them everywhere instead of (e.g.) calling a "min()" function.
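To make the comparison concrete, here is a minimal sketch in C of the kind of trade-off I mean. The function names are mine, not the article's, and the bit trick is the well-known XOR/mask version of min, which may or may not be the exact form the author uses. The plain version is what a compiler targeting ARM can usually turn into a compare plus a conditionally executed move, so there may be no real branch to avoid in the first place.

```c
#include <stdint.h>

/* Plain version: on ARM a compiler can typically emit a compare and a
 * conditional move for this, with no branch at all. */
int32_t min_plain(int32_t x, int32_t y)
{
    return (x < y) ? x : y;
}

/* Branchless bit trick (hypothetical name): (x < y) is 1 or 0, so
 * -(x < y) is all ones or all zeros. The mask selects x ^ y or 0,
 * and XORing that back into y yields x or y respectively. */
int32_t min_branchless(int32_t x, int32_t y)
{
    return y ^ ((x ^ y) & -(int32_t)(x < y));
}
```

Both return the same result for every input. Which one costs fewer bytes of instruction cache depends on the compiler and on how often it gets inlined, so it's worth looking at the generated assembly before assuming the clever one wins.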
And (someone correct me if I'm misremembering) ARM doesn't have a physically tagged cache, which means the caches can't survive a change in memory domain like a system call. I know for a fact that syscalls on my Motorola A780 (XScale CPU, Linux 2.4 kernel) cost 20k cycles or more.
The bottom line is that I think the author is missing the point. These are elegant assembly hacks, but they aren't really where performance-conscious programmers should be focusing their efforts.