Unfortunately, those SWAR optimizations are only useful for strings that are aligned to an 8-byte address.

If your SWAR algorithm is applied on a non-aligned string, it is often slower than the original algorithm.

And splitting the algorithm into 3 parts (handling the beginning up to an aligned address, then the aligned part, and then the less-than-8-bytes tail) takes even more instructions.
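
For illustration, the three-part split could look roughly like this in Go. This is only a sketch under my own assumptions: isASCII and hiBits are made-up names, the check (all bytes below 0x80) is just a simple stand-in for a SWAR routine, and none of it is taken from the linked PR. The address test needs unsafe, and the 8-byte read goes through binary.LittleEndian.Uint64, which the Go compiler typically turns into a single load on little-endian targets.

```go
package main

import (
	"encoding/binary"
	"unsafe"
)

// hiBits has the top bit of every byte set (hypothetical name).
const hiBits = 0x8080808080808080

// isASCII reports whether every byte of b is below 0x80.
// Hypothetical helper, written only to show the head/body/tail split.
func isASCII(b []byte) bool {
	i := 0

	// Part 1: head. Go byte by byte until the address is 8-byte aligned.
	for i < len(b) && uintptr(unsafe.Pointer(&b[i]))%8 != 0 {
		if b[i] >= 0x80 {
			return false
		}
		i++
	}

	// Part 2: aligned body. Test 8 bytes per iteration with one mask.
	for ; i+8 <= len(b); i += 8 {
		if binary.LittleEndian.Uint64(b[i:])&hiBits != 0 {
			return false
		}
	}

	// Part 3: tail. Fewer than 8 bytes are left, so fall back to bytes.
	for ; i < len(b); i++ {
		if b[i] >= 0x80 {
			return false
		}
	}
	return true
}
```

The two byte loops plus the alignment test are the extra instructions in question: on short inputs they can easily cost more than the word-at-a-time body saves.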

Here is a similar case of a false claim of a faster utf8.IsValid in Go, with benchmarks: https://github.com/sugawarayuuta/charcoal/pull/1




Masked SIMD operations, which are available in AVX-512 and ARM SVE, were intended to solve that problem. Memory operations can then always be aligned and full-width, with a mask selecting only the elements that are valid.

Even if a masked vector-memory operation is unaligned and crosses into an unmapped or protected page, that will not cause a fault if those lanes are masked off. There are even special load instructions that will reduce the vector length to end at the first element that would have caused a fault, for operations such as strlen() where the length is not known beforehand.



