That got me curious about the ARM latency. I wonder whether it came from particular instructions that added latency, or from transfers between the registers and the memory subsystem internals. Also, on the off chance that you remember, did you use intrinsics inline or let the compiler auto-optimize?
It would be interesting to test this on an ARM Mac and see whether different dependency chains show significant latency penalties, and how they interact with the reorder buffer.
This is for Cortex-A8, which was the chip in the Nexus One. I wrote the original version of sound synthesis directly in ARM assembler[1]. It was very highly optimized: I remember using a cycle-counting app that flagged any dependency chain that would cause the processor to stall, and ultimately utilization was in the 90%+ range. Back in those days, processors were simple enough that you could do this kind of optimization by hand. By the time of Cortex-A15 (Nexus 10 etc.), instruction issue was out-of-order and much harder to reason about.
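To make "dependency chain" concrete, here is a minimal sketch in C rather than the original assembler. The function names are hypothetical and this is not the synthesis code itself. In the first loop every add waits on the previous one; in the second, four independent accumulators give an in-order core like Cortex-A8 other work to issue while an earlier add is still in flight. Note that the two versions add floats in a different order, so results can differ by rounding on general input.

```c
#include <stddef.h>

/* Single accumulator: each add depends on the result of the one before,
 * forming one long dependency chain. */
float sum_single_chain(const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += x[i];
    }
    return acc;
}

/* Four accumulators: the four adds in each iteration are independent of
 * each other, so they can overlap in the pipeline. */
float sum_four_chains(const float *x, size_t n) {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    float acc = (a0 + a1) + (a2 + a3);
    /* Handle the leftover elements when n is not a multiple of 4. */
    for (; i < n; i++) {
        acc += x[i];
    }
    return acc;
}
```

Whether the second form is actually faster depends on the core and on what the compiler does with it, so it is worth measuring rather than assuming.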
The best current info I could find for the latency advice is [2]. Quoting: "Moving data from NEON to ARM registers in Cortex-A8 is expensive." Looking at [3] partially reveals the reason why: the NEON pipeline sits entirely after the integer pipeline, so moves from integer to NEON are cheap, but the reverse direction is potentially a large pipeline stall. This is an unusual design decision that, as far as I know, is not true of any other CPU. Edit: I found [4], which is a more authoritative source.
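A rough sketch of what that advice means in practice, using NEON intrinsics in C. The function names and the choice of 32-bit integer samples are assumptions for illustration, not taken from the original code. The first function pulls a value out of NEON into an ARM register on every iteration; the second keeps the running sum in a NEON register and moves it out once at the end. A scalar fallback is included so the file also builds on targets without NEON.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Pattern to avoid on Cortex-A8: one NEON -> ARM move per iteration. */
int32_t sum_move_each_iter(const int32_t *x, size_t n) {
    int32_t total = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        int32x4_t v = vld1q_s32(x + i);
        int32x2_t h = vadd_s32(vget_low_s32(v), vget_high_s32(v));
        h = vpadd_s32(h, h);            /* both lanes now hold the sum of v */
        total += vget_lane_s32(h, 0);   /* NEON -> ARM register move */
    }
    for (; i < n; i++) total += x[i];   /* leftover elements */
    return total;
}

/* Preferred pattern: accumulate in NEON, move to ARM once after the loop. */
int32_t sum_move_once(const int32_t *x, size_t n) {
    int32x4_t acc = vdupq_n_s32(0);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc = vaddq_s32(acc, vld1q_s32(x + i));
    }
    int32x2_t h = vadd_s32(vget_low_s32(acc), vget_high_s32(acc));
    h = vpadd_s32(h, h);
    int32_t total = vget_lane_s32(h, 0);  /* the only NEON -> ARM move */
    for (; i < n; i++) total += x[i];     /* leftover elements */
    return total;
}

#else  /* No NEON available: plain scalar versions with the same interface. */

int32_t sum_move_once(const int32_t *x, size_t n) {
    int32_t total = 0;
    for (size_t i = 0; i < n; i++) total += x[i];
    return total;
}

int32_t sum_move_each_iter(const int32_t *x, size_t n) {
    return sum_move_once(x, n);
}

#endif
```

An optimizing compiler may restructure either loop, so this only shows the shape of the problem. In hand-written assembler the difference is the placement of the NEON-to-ARM `vmov`.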
Awesome reply, and thank you for the well put together answer linking to resources and for sharing your experience.
For Cortex-A8, going by [4] and the other links you posted, it makes sense to me now how an instruction passing data between registers fills out the pipeline and then stalls.
I will have a peek at the ARMv8/ARMv9 architectures and see what they did there with SVE/SVE2.