I am very familiar with such low level workings. On a modern, fast machine, the amount of computation required by FM synthesis is so small compared with the machine's throughput that it just doesn't matter.
Yes, on modern chips float math will generally be faster than fixed point. This is not so much because the integer units get clogged as because a huge amount of chip area and optimization goes into the float units (often SIMD, and a lot of FM synthesis can benefit from this, though feedback creates data dependencies). For example, multiply-and-add is usually a single fused instruction in float, but generally takes two separate instructions in integer code.
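A minimal sketch of that contrast (the function names are mine, not from the thread): the float path can fold the modulation step into one fused multiply-add, while a Q16.16 fixed-point version needs a widening multiply, a renormalizing shift, and an add, which is a longer dependent chain.

    #include <math.h>
    #include <stdint.h>

    static inline float fm_phase_float(float phase, float mod, float depth) {
        /* fmaf(mod, depth, phase) = mod * depth + phase in one rounding step;
           on FMA-capable hardware this compiles to a single instruction */
        return fmaf(mod, depth, phase);
    }

    static inline int32_t fm_phase_fixed(int32_t phase, int32_t mod, int32_t depth) {
        /* widening multiply, shift back to Q16.16, then add: at least two
           dependent integer instructions on most ISAs */
        return phase + (int32_t)(((int64_t)mod * (int64_t)depth) >> 16);
    }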
My recollection is that older ARM chips had a particular latency problem with data dependencies that flow from the float unit (NEON, which is optimized for SIMD vector operations) back to the integer unit. I suspect this is no longer the case, or is less of an issue.
Historically yes. It is most definitely not true since processors have had SSE3 (~2004), using the FISTTP instruction; I think you can also use packed float-to-integer instructions like CVTTPS2PI as far back as SSE (1999).
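For illustration, a sketch of the SSE-era truncating conversion using CVTTSS2SI, the scalar sibling of the packed CVTTPS2PI mentioned above. It truncates directly, so there is no need to flip the x87 rounding mode (the old, slow fistp dance):

    #include <xmmintrin.h>  /* SSE intrinsics, available since 1999 */

    static inline int ftoi_trunc(float x) {
        return _mm_cvtt_ss2si(_mm_set_ss(x));  /* compiles to cvttss2si */
    }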
Cheers for the explanation. I suppose "modern" should be asterisked, since I was parroting posts that date back to the mid-00s. Things have probably changed considerably since then.
I'm specifically targeting an Intel Atom, which is obviously powerful enough for the task, but may not fit all definitions of "modern" now?
Got me curious about the ARM latency issue; I wonder whether it was related to particular instructions that added latency, or to transfers between the register file and memory subsystem internals. Also, on the off-chance that you remember, did you inline intrinsics or let the compiler auto-optimize?
Would be interesting to test on an ARM Mac, and see whether different dependency chains show significant latency penalties, and how they interact with the reorder buffer.
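A rough probe for that (a sketch, not a rigorous benchmark): time one long chain of dependent multiply-adds, then divide the ns/iter figure by the clock period to estimate the latency of the chain. Splitting the same work across independent accumulators would then expose the throughput side of the unit.

    #include <stdio.h>
    #include <time.h>

    int main(void) {
        const long N = 100000000L;
        struct timespec t0, t1;
        float a = 1.0f;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            a = a * 0.9999999f + 1e-7f;  /* each iteration depends on the last */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
        printf("%.2f ns/iter (result %f keeps the loop live)\n", ns / (double)N, a);
        return 0;
    }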
This is for the Cortex A8, which was the chip in the Nexus One. I wrote the original version of the sound synthesis directly in ARM assembler[1]. It was very highly optimized; I remember using a cycle-counting app that flagged any dependency chain that would cause the processor to stall, and ultimately utilization was in the 90%+ range. Back in those days, processors were simple enough that you could do this kind of optimization by hand. By the time of the Cortex A15 (Nexus 10 etc), instruction issue was out-of-order and much harder to reason about.
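To illustrate the kind of stall a cycle counter would flag (illustrative C; the original work was hand-scheduled ARM assembler): a single accumulator serializes every add on the previous add's latency, while interleaving two independent chains lets the pipeline stay busy.

    float sum_serial(const float *x, int n) {
        float acc = 0.0f;
        for (int i = 0; i < n; i++)
            acc += x[i];              /* each add waits for the previous add */
        return acc;
    }

    float sum_interleaved(const float *x, int n) {
        float a0 = 0.0f, a1 = 0.0f;   /* two independent dependency chains */
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            a0 += x[i];
            a1 += x[i + 1];           /* issues while a0's add is still in flight */
        }
        if (i < n)
            a0 += x[i];               /* odd tail element */
        return a0 + a1;
    }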
The best current info I could find for the latency advice is [2]. Quoting, "Moving data from NEON to ARM registers in Cortex-A8 is expensive." Looking at [3] partially reveals why: the NEON pipeline sits entirely after the integer pipeline, so moves from integer to NEON are cheap, but the reverse direction is potentially a large pipeline stall. This is an unusual design decision that, as far as I know, no other CPU shares. Edit: I found [4], which is a more authoritative source.
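A sketch of the pitfall described in [2]/[4] (function names are mine): reading a NEON lane back into a general-purpose register forces a NEON-to-ARM transfer, which stalls on Cortex-A8 because the NEON pipeline sits after the integer one; the usual workaround is to keep results in NEON registers and go through memory instead.

    #include <arm_neon.h>
    #include <stdint.h>

    int32_t readback_lane(int32x4_t v) {
        return vgetq_lane_s32(v, 0);  /* NEON -> ARM register: the slow path on A8 */
    }

    void store_result(int32x4_t v, int32_t out[4]) {
        vst1q_s32(out, v);            /* NEON -> memory: no cross-unit move; the
                                         integer side picks it up with a plain load */
    }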
Awesome reply, and thank you for the well put together answer linking to resources and for sharing your experience.
For the Cortex-A8, from [4] and the other resources you linked, it now makes sense to me how an instruction passing data between the register files fills out the pipeline and then stalls.
Will have a peek at the ARMv8/ARMv9 architectures and see what they did there regarding SVE/SVE2.