Bit twiddling with Arm Neon: beating SSE movemasks, counting bits and more (arm.com)
96 points by danlark on Aug 29, 2022 | 8 comments



This is a really interesting article. I was expecting some obviously biased and/or marketing horror by virtue of it being on arm.com.

It’s actually an interesting breakdown of ways NEON differs from SSE, and how a “direct” translation may well be suboptimal. Their first example is really illustrative of this. SSE has an instruction that pulls the top (I think?) bit of each byte in a register and creates a 16-bit mask from those. You can do something similar in NEON, but the perf is apparently terrible. But NEON has an instruction that packs some bits from each byte of a register into a 64-bit value, and you can go from that to the masking behaviour you were presumably trying for originally, but much faster.

The other examples and case studies are similarly interesting.
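
For anyone who wants to see the shape of that first example, here is a minimal sketch in C of the shift-and-narrow idea as I understand it, not the article's exact code. The function names `match_nibble_mask` and `first_match_index` are made up for illustration, the scalar branch is only there so the snippet builds off AArch64, and `__builtin_ctzll` assumes GCC or Clang.

```c
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: returns a 64-bit mask with one 4-bit nibble per
   input byte. Nibble i is 0xF if p[i] == needle, otherwise 0x0.
   Reads exactly 16 bytes from p. Assumes a little-endian target. */
static uint64_t match_nibble_mask(const uint8_t *p, uint8_t needle) {
#if defined(__aarch64__)
    uint8x16_t chunk = vld1q_u8(p);
    /* Per-byte compare: each lane becomes 0xFF (equal) or 0x00. */
    uint8x16_t eq = vceqq_u8(chunk, vdupq_n_u8(needle));
    /* Treat the result as 8 x u16, shift each right by 4 and narrow to
       u8. This keeps 4 bits from every original byte, in byte order. */
    uint8x8_t narrowed = vshrn_n_u16(vreinterpretq_u16_u8(eq), 4);
    return vget_lane_u64(vreinterpret_u64_u8(narrowed), 0);
#else
    /* Scalar fallback that produces the same mask layout. */
    uint64_t mask = 0;
    for (int i = 0; i < 16; i++) {
        if (p[i] == needle) {
            mask |= (uint64_t)0xF << (4 * i);
        }
    }
    return mask;
#endif
}

/* Hypothetical helper: index of the first byte equal to needle within
   the 16-byte chunk, or -1 if there is none. */
static int first_match_index(const uint8_t *p, uint8_t needle) {
    uint64_t mask = match_nibble_mask(p, needle);
    if (mask == 0) {
        return -1;
    }
    /* Each byte owns 4 mask bits, so divide the bit position by 4. */
    return __builtin_ctzll(mask) >> 2;
}
```

The difference from an SSE movemask is that every byte gets 4 bits in the mask instead of 1, so anything that counts or indexes bits has to divide by 4 (the `>> 2` above).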


The article matches my own experience porting some media processing code from SSE to NEON for the Apple Silicon transition. I had a library with C and SSE implementations and I wanted to write a NEON implementation with the goal of outperforming the C version (on any arch), as well as the SSE versions running native (on an Intel CPU), ported (via an intrinsic compatibility library), and runtime translated (via Rosetta 2).

I started by studying the SSE code. This ended up not being useful and was even counterproductive. I only began to make good progress when I let myself forget what I knew about the SSE implementation and instead used the C code as a starting point. By letting myself back up and think about what the code was actually doing at a high level, and then thinking about how best to write that in NEON, I was able to come up with quite different approaches compared to the SSE code, and in the end the NEON version was much faster.


It improves string comparison and sorting in ClickHouse by 15%: https://github.com/ClickHouse/ClickHouse/pull/38093


Really interesting, thanks for sharing

From the article also, 10-20% improvement (I guess in Instructions Per Cycle) on some str methods in glibc https://sourceware.org/git/?p=glibc.git;a=commit;h=3c9980698...


Great article, applied it to my parser where I was emulating movemask and it did indeed speed it up a few percent.


Awesome work! We were very happy to receive the patches to zstd to optimize ARM performance!


Can anyone recommend a good practical learning resource on adding vector optimisations to C code?

We could use some further optimisation in the emulated screen rendering code in VICE, particularly on ARM.


Thanks for this informative blog ..

https://www.mywakehealth.website/



