Great work. But what's their goal? Are they trying to make that GeLU approximation go faster? It would probably be a lot faster to just go back to erff().
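For reference, here is a minimal sketch of the two formulations being compared, assuming the approximation in question is the common tanh-based one (the thread does not say which). The function names `gelu_erf` and `gelu_tanh` are made up for illustration, and `std::erf` on a `float` argument stands in for C's `erff()`. This only shows the math; which one is faster depends on the libm and on how well each vectorizes, so it is not a benchmark.

```cpp
#include <cassert>
#include <cmath>

// Exact GeLU: x * Phi(x), with the normal CDF Phi written via erf.
// std::erf(float) is the C++ overload corresponding to C's erff().
inline float gelu_erf(float x) {
    const float inv_sqrt2 = 0.70710678f;  // 1 / sqrt(2)
    return 0.5f * x * (1.0f + std::erf(x * inv_sqrt2));
}

// Tanh-based approximation (assumed here to be the one under discussion).
inline float gelu_tanh(float x) {
    const float k = 0.79788456f;  // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}
```

The two agree to roughly three decimal places over typical activation ranges, so swapping one for the other is mostly a performance question.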
Thank you for such a detailed analysis! Just curious which version of x86-simd-sort you benchmarked: release v1.0 or the current tip of the main branch? (I'm the author of x86-simd-sort.)
Of course! I appreciate all the time you put in. I added a few more optimizations to qsort after that (see https://github.com/intel/x86-simd-sort/pull/33) and just wanted to know whether your analysis took them into account.
No matter how sophisticated the pivot selection is, you always risk hitting some degenerate worst case. I recommend a fallback such as heapsort once a certain recursion limit is reached, as pdqsort, ipnsort and vqsort do (I'm a little fuzzy on what their fallbacks are, but they have one).
Yes, vqsort does indeed resort to Heapsort after too many recursions. I'd be surprised if that happens on real data, though, because we apply a lot more effort to the pivot selection.
Would be curious to see any input distribution that triggers a fallback.
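For anyone following along, the recursion-limit fallback discussed above can be sketched roughly like this. It is a generic introsort-style illustration, not the code from pdqsort, ipnsort, vqsort or x86-simd-sort: the names `sort_impl` and `sort_with_fallback`, the median-of-three pivot, the small-range cutoff of 16 and the budget of about 2*log2(n) are all assumptions chosen for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical helper: quicksort that switches to heapsort once the
// partitioning budget runs out, so the worst case stays O(n log n).
template <typename It>
void sort_impl(It first, It last, int depth_budget) {
    while (last - first > 16) {  // cutoff of 16 is an arbitrary assumption
        if (depth_budget-- == 0) {
            // Too many partitioning rounds: pivots were bad. Heapsort the rest.
            std::make_heap(first, last);
            std::sort_heap(first, last);
            return;
        }
        // Deliberately simple median-of-three pivot, copied by value.
        auto a = *first;
        auto b = *(first + (last - first) / 2);
        auto c = *(last - 1);
        auto pivot = std::max(std::min(a, b), std::min(std::max(a, b), c));

        // Three-way partition: [< pivot | == pivot | > pivot].
        // The middle part is never empty, so both sides strictly shrink.
        It lt = std::partition(first, last,
                               [&](const auto& v) { return v < pivot; });
        It gt = std::partition(lt, last,
                               [&](const auto& v) { return !(pivot < v); });

        // Recurse into the smaller side, keep looping on the larger one.
        if (lt - first < last - gt) {
            sort_impl(first, lt, depth_budget);
            first = gt;
        } else {
            sort_impl(gt, last, depth_budget);
            last = lt;
        }
    }
    // Insertion sort for the small remainder.
    for (It i = first; i != last; ++i)
        std::rotate(std::upper_bound(first, i, *i), i, i + 1);
}

template <typename It>
void sort_with_fallback(It first, It last) {
    int budget = 0;
    for (auto n = last - first; n > 1; n >>= 1) budget += 2;  // ~2*log2(n)
    sort_impl(first, last, budget);
}
```

With a budget this generous the heapsort branch should almost never run on ordinary inputs, which matches the point above: it is a safety net against adversarial or degenerate data rather than part of the fast path. Calling `sort_impl` directly with a budget of 0 forces the fallback and is a cheap way to test it.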