Yup, something that is often omitted from Big O discussions is the absolute per-item cost and the setup/activation cost.
SIMD is also not easy to reason about, as the formula changes quite a bit once you become bandwidth-bound - Zen 5's AVX-512 implementation lets you blaze through data at 128-256 B/cycle (think >128 B per 0.25 ns), which is absolutely bonkers. But that quickly comes to a halt once your working set exceeds L1 and L2 - streaming data from/to RAM is only about 32 B/cycle.
So memory traffic becomes an important consideration, and at that point algorithms that are "better by Big O classification" but have more granular data access patterns might start to outperform the brute-force streaming approach.