Cache precedence and cache line optimisations are black magic: either you know exactly which CPU you are targeting, or you rely on hopium techniques like cache-oblivious algorithms that try to reap some of the benefits.
The baseline is to measure, always, before and after the optimisation(s). These "data oriented design" approaches are hard to measure and hard to iterate on quickly because they have a profound impact on a codebase: they rarely change "just one thing", and they err on the less intuitive, less readable side.
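To make the "rarely change just one thing" point concrete, here is a minimal sketch of the classic array-of-structs to struct-of-arrays rewrite that data oriented design usually starts with (the `Particle`/`Particles` names are hypothetical, purely for illustration). Every call site that touched a `Particle` has to change along with the layout, which is exactly why a before/after measurement is hard to attribute to any single change.

```cpp
#include <vector>

// Array-of-structs: each particle's fields are interleaved in memory,
// so a loop that only reads `x` still drags `y`, `z` and `mass` into cache.
struct Particle {
    float x, y, z;
    float mass;
};

float total_x_aos(const std::vector<Particle>& ps) {
    float sum = 0.0f;
    for (const Particle& p : ps) sum += p.x;  // strided access, few useful bytes per cache line
    return sum;
}

// Struct-of-arrays: each field lives in its own contiguous array, so the same
// loop streams through cache lines that contain nothing but `x` values
// (and is generally friendlier to auto-vectorisation).
struct Particles {
    std::vector<float> x, y, z;
    std::vector<float> mass;
};

float total_x_soa(const Particles& ps) {
    float sum = 0.0f;
    for (float v : ps.x) sum += v;  // contiguous access, every loaded byte is used
    return sum;
}
```

Whether the SoA version actually wins depends on how the rest of the program accesses the data, which is again why the measurement has to come first.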
https://llvm.org/docs/Vectorizers.html#slp-vectorizer
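For context, the SLP (superword-level parallelism) vectorizer merges similar, independent scalar operations in straight-line code into vector instructions. A sketch of the kind of function it targets, loosely adapted from the example on that page, with optimisations enabled in clang:

```cpp
// Four independent scalar computations with the same shape; the SLP
// vectorizer can bundle them into vector adds/multiplies and a vector store.
void foo(int a1, int a2, int b1, int b2, int* A) {
    A[0] = a1 * (a1 + b1);
    A[1] = a2 * (a2 + b2);
    A[2] = a1 * (a1 + b1);
    A[3] = a2 * (a2 + b2);
}
```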