Do low-level optimizations matter? Faster quicksort with cmov (2020) (cantrip.org)
174 points by fanf2 51 days ago | 75 comments



A correction to the history mentioned in the article: CMOV was not added by AMD around 2000.

CMOV was added by Intel in 1995, with the Pentium Pro. It was the most important addition the Pentium Pro made to the x86 ISA. So CMOV was supported by the Pentium Pro and its successors, like the Pentium II or Pentium III, but it was not supported by the Pentium, the Pentium with MMX, or AMD K6 CPUs. AMD added CMOV starting with the Athlon, in 1999.

The Pentium Pro was the first Intel CPU with out-of-order execution and it did much more speculative execution than its predecessors. Therefore it needed CMOV to limit the performance loss caused by branch mispredictions.


The most important bit of this is in the conclusion:

  Before we conclude anything, we should remind ourselves of its limitations. The tests run were on completely random data. Truly random data seldom occurs in real life.
Linus famously ranted about CMOV in https://yarchive.net/comp/linux/cmov.html (2007, so potentially more modern architectures are better at some of this) and he says:

  if you KNOW the branch is totally unpredictable, cmov is often good for
  performance. But a compiler almost never knows that.
As usual with optimizations like this, you have to benchmark, and even then, if your sample isn't representative, it might not mean much.


You can see how dramatically the actual data changes sort performance in e.g. this writeup (a summary of the current unstable sort in Rust, ipnsort):

https://github.com/Voultapher/sort-research-rs/blob/main/wri...

Notice how random_s95 is worse (not by much, but it's there) than fully random. random_s95 is 95% sorted data, but 5% unsorted, simulating a common "sort, do stuff, append, repeat" pattern we see in a lot of software.

In contrast, the sorted cases are almost instant, and random_d20 (only 20 distinct values, chosen at random, so the sorted output needs far fewer comparisons) is very fast.


When Linus made that comment, cmov had something like a 6-cycle latency. For the last decade it has been 1 cycle, and I don't think there is any scenario where cmov is now slower than a branch.


The problem is not the 1 cycle latency, but the data dependency on both values. A correctly-predicted branch cuts the dependency on the value that is not used.

I've definitely measured scenarios where cmov/branchless is slower than a branch for a given algorithm. Especially if the branchless version is doing a bit more work to avoid the branch.
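
To make the dependency point concrete, here is a hypothetical sketch (mine, not from the article or any benchmark): in the branchless form both sides get computed and the select depends on both; in the branchy form a correct prediction means the untaken side is never computed or waited on.

  // Whether the compiler actually emits a branch or a cmov for either
  // function depends on the compiler and flags; the point is the shape
  // of the dependencies, not the exact codegen.
  int pick_branchless(bool c, int x, int y) {
    int a = x * x * x + 7;        // the "extra work", done unconditionally
    int b = y + 1;
    return c ? a : b;             // the select depends on both a and b
  }

  int pick_branchy(bool c, int x, int y) {
    if (c) return x * x * x + 7;  // only the chosen side ever executes
    return y + 1;
  }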


It is both though. At 6+ cycles there are only a few places where CMOV is a win. At 1 cycle you can be more liberal with its use and the lack of dependency breaking is a tradeoff.


cmov turns a control dependency into a data dependency. It is still there, just in a different way. (It's like the Fourier Transform of instructions - lolol.)

I haven't seen cmov lose to a conditional in a few years (on very recent Intel hardware). Maybe there might be some circumstances where it creates a loop-carried dependency that wrecks the optimizer, but I haven't seen one in a minute.

To be fair - cmov versions often involve some extra computation to get things into a form where cmov can be used, and I often just opt for the well-predicted conditional, since those couple of extra instructions to set up the value to move can be avoided.


Good point. It makes me wonder if modern out-of-order processors can skip performing unused computations altogether, if their result registers are overwritten by other data later (in program order).


CPUs could in theory speculate CMOV, reintroducing prediction when predictable. But after Spectre, IIRC Intel now guarantees that CMOV is never speculated.


No, and it feels unlikely that they will either.


A cmov-based approach is necessarily slower than a branch-based approach that was correctly predicted, since cmov requires computing both branches and then selecting the result at the end.


Are you sure? I recall cmov always having single cycle latency.


I know that it was decoded into 2 micro-ops and thus had to be decoded by the wide decoder, so perhaps 2 cycles?


That could be. Agner's tables seem to confirm that.


Per Agner Fog's tables, Ice Lake shows a reg-to-reg cmov decoding to 1 uop with a latency of 1 and a throughput of 0.5. The *mont (Atom) cores all have latency 2. The P4 had a latency of 6, and AMD's K10 a latency of 4.


Linus was specifically ranting about compilers inserting CMOV. These days it is actually a pain to get GCC or clang to generate a CMOV when you specifically want it.

Also, CMOV did indeed get significantly better.


> Back in 2000, AMD included cmov in its 64-bit x86 ISA extensions. Then, Intel had to adopt them when Itanium flopped.

Wasn't "cmov" one of the things added for the pentium pro? So it wasn't instruction compatible - hence the "i686" prefix to a lot of compiler triples?


Intel was so embarrassed by the failure of itanium they invented a Time Machine and went back and added the instruction to a 1995 CPU. Deceptive and anti-competitive!


Intel failed to predict the timeline branch in which i686 arch had cmov so they had to roll back and replay it.


Also known as Speculative CPU manufacturing


Yes, that is correct.


Rather than std::swap_if, which is rather niche, I would prefer to see std::select(cond, if_true, if_false), with the guarantee that, unlike the ternary operator, it eagerly evaluates both arguments and selects between them branchlessly if possible.

Something similar is coming to Rust as well: https://doc.rust-lang.org/nightly/core/intrinsics/fn.select_...
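
For integers you can already approximate such a select with the same mask trick the article uses; a rough sketch of the shape I mean (not a proposed implementation):

  inline long long select_ll(bool c, long long if_true, long long if_false) {
    const long long mask = -static_cast<long long>(c);  // false -> 0, true -> all ones
    return (if_true & mask) | (if_false & ~mask);
  }

Since both arguments are plain by-value parameters, they are evaluated eagerly at the call site, which is the other half of the guarantee.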


Wouldn't the sensible lower-level primitive be Clang's __builtin_unpredictable? If you standardize that, then a library can easily build your select function on top of it. I guess GCC's __builtin_expect_with_probability with probability 0.5 should have the same meaning.
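
Something along these lines, say (the wrapper name is made up for illustration):

  inline int select_unpredictable(bool c, int if_true, int if_false) {
  #if defined(__clang__)
    return __builtin_unpredictable(c) ? if_true : if_false;  // hint: don't trust the predictor here
  #else
    return c ? if_true : if_false;
  #endif
  }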


> I guess GCC's __builtin_expect_with_probability with probability 0.5 should have the same meaning.

But it doesn't! The sequence 1010101010... has a 50% probability of taking either value, yet it is completely predictable.


Also the compiler vendors have seen this movie before and know how it ends. Too often a "performance hint" is garbage and ignoring it will improve performance.

Which is unfortunate because these are genuinely sometimes useful, but it's a "We can't have nice things" situation - for every expert who is carefully applying a hint to a hot path after careful reading of the manual for the specific CPU model they use, there are dozens of idiots who will copy-paste that hint believing that it's magic speed-up juice, never having measured and with no idea what it's really doing or why.

The intrinsics have the advantage that at least you're expressing what you meant, and it is obvious whose fault it is when that's slower because you screwed up - hints make it too easy for programmers to justify a tantrum when their "hint" isn't magic speed-up juice after all.


For what it's worth, I recommended against mentioning `unpredictable` anywhere in the name when we stabilize the intrinsic.

IMO the constant-timeness is much more important, e.g. in audio DSP code or for cryptography you really care that it's a `cmov`, regardless of the predictability of the predicate.


A constant time guarantee would make this unimplementable since CPU vendors do not make that guarantee.



Makes sense.


Compiler vendors are the ones who add both the intrinsics and attribute hints. I also don't see a reason why "programmers" who engage in wishful thinking would limit that to hints and not apply the same to intrinsics.


Communication is the reason. "Hints" are not good communication.

There's a real difference between the wishful thinking which leads to programmers writing incorrect hints and them writing code that outright says what it's doing and then being surprised it did what they asked.


I would expect eliminating branches in a busy inner loop to matter.

The interesting part is how that was done:

> A New Primitive, swap_if

> How can we use this method in our sort? First, let us make a swap_if:

  inline bool swap_if(bool c, int& a, int& b) {
    int ta = a, mask = -c;  // false -> 0, true -> 111..111
    a = (b & mask) | (ta & ~mask);
    b = (ta & mask) | (b & ~mask);
    return c;
  }
> In our partition function, then, we can transform

    if (*right <= pivot) {
      int tmp = *left; *left = *right, *right = tmp;
      ++left;
    }
> into just

    left += swap_if(*right <= pivot, *left, *right);


This trick of turning conditionals into bitmasks and then shuffling masked values around was fairly common back in the day when writing code with SSE intrinsics.

You wouldn't want to have to check each lane individually, since that would spoil the point of using SSE. So you'd write masking ops like this and keep things parallel, then judiciously use any() and all() type intrinsics to check whether the lanes agreed so that certain blocks could be skipped conditionally.

(This was before ISPC and before autovectorization to handle it for you.)
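
For the curious, the pattern looked roughly like this (a reconstruction from memory, not code from any particular project), with movemask standing in for the any()/all() tests:

  #include <emmintrin.h>  // SSE2

  // Per-lane select: mask ? a : b, where each mask lane is all ones or all zeros
  inline __m128i select_lanes(__m128i mask, __m128i a, __m128i b) {
    return _mm_or_si128(_mm_and_si128(mask, a),      // lanes where mask is set
                        _mm_andnot_si128(mask, b));  // lanes where mask is clear
  }

  inline bool any(__m128i mask) { return _mm_movemask_epi8(mask) != 0; }
  inline bool all(__m128i mask) { return _mm_movemask_epi8(mask) == 0xFFFF; }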


Does ISPC + auto vectorization truly remove the need to attend to these details in the most optimized code?


They can, yes.

For example, here's a Compiler Explorer link with a simple loop that conditionally copies from one of two values:

https://godbolt.org/z/TbeK99sx1.

If you look at the assembly code produced, you'll see in lines 9-11 that it's doing something similar to the swap_if() function: ANDing one value with the mask, ANDing the other with the negation of the mask, and then ORing them together. And it produced that automatically from the ternary in the loop.
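
The loop in question is roughly of this shape (my paraphrase, not the exact snippet behind the link):

  void copy_select(int* dst, const int* a, const int* b, const int* cond, int n) {
    for (int i = 0; i < n; ++i)
      dst[i] = cond[i] ? a[i] : b[i];  // the ternary the vectorizer turns into masked ANDs/ORs
  }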

They can also do things like automatically generate epilogues with a serial loop to handle the leftover items if the loop count isn't an even multiple of the SIMD width. That's something else that I remember writing by hand before autovectorizers.

That said, coaxing an autovectorizer to actually vectorize a loop can be an art in itself. They're notoriously finicky. My favorite used to be the one in the old ("classic") ICC compiler where you use compiler flags to get a nice optimization report which would explain exactly why it chose what it did for each loop. You could also set #pragma's which would noisily warn or error out if the associated loop failed to vectorize for some reason.

(And of course, a lot of this stuff you might just use GPU compute for now.)


Normally there is an instruction for this called blend (since SSE4.1).

There shouldn't be a need to do any bitwise magic.
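
For example, assuming integer lanes and a compare-generated mask (a sketch):

  #include <smmintrin.h>  // SSE4.1

  inline __m128i select_lanes(__m128i mask, __m128i a, __m128i b) {
    return _mm_blendv_epi8(b, a, mask);  // per byte: high bit of mask set -> a, clear -> b
  }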


You can get gcc to generate the blend instruction in the example given - you just have to 'force' it to use SIMD more aggressively by, for example, using -march=native.


That's not forcing anything, if you're not getting blend you're just targeting an old ISA without that instruction.

It's up to you to choose your target adequately based on your deployment requirements.


There’s a well-known in-place implementation of swap of the form:

  a ^= b
  b ^= a
  a ^= b
(Here ^ denotes bitwise XOR)

Allowing for the mask, one could do an in-place version of swap_if via

  a ^= (b & mask)
  b ^= (a & mask)
  a ^= (b & mask)
The in-place version of swap is generally discouraged because compilers are smart (https://stackoverflow.com/questions/36906/what-is-the-fastes...) but I do wonder whether the masking in swap_if obscures intent to the compiler enough to close the gap.

Assuming the mask is passed in, Godbolt puts OP’s swap_if at 26 instructions, versus the above swap_if at 17 instructions: https://godbolt.org/z/Wedco5hPv


You have the optimizer disabled and the functions are no-ops. Additionally, masking that way makes the XOR no longer a no-op, so only one masking operation is necessary -- but it seems that GCC already knows this trick:

https://godbolt.org/z/qxMvcbrc8
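
For reference, the single-mask form being alluded to looks like this (a sketch of the trick, not the code behind the link):

  inline bool swap_if_xor(bool c, int& a, int& b) {
    int mask = -int(c);      // false -> 0, true -> all ones
    int t = (a ^ b) & mask;  // zero when c is false, a ^ b when true
    a ^= t;
    b ^= t;
    return c;
  }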


Oof, thank you for the correction.


> The in-place version of swap is generally discouraged because compilers are smart

Isn’t that more because CPUs slow down when there are dependencies between instructions?

Compilers could (and may even do) fairly easily detect that pattern (like they do with some versions of popcnt. See for example https://langdev.stackexchange.com/questions/3942/what-are-th..., discussed here in https://news.ycombinator.com/item?id=40987123) and compile it to whatever is fastest on the target CPU/in the given context (in some contexts, it can be compiled to nothing, just changing the mapping between local variables and the registers they’re stored in)


> Isn’t that more because CPUs slow down when there are dependencies between instructions?

Indeed, while reg-reg moves can be resolved at the rename stage and are effectively free latency-wise.


CPUs have shortcuts in the pipeline to provide the results to the dependent operation in time


Will we see x86 or similar CPUs replace hot instruction blocks with current-processor-specific optimized code during runtime, similar to how certain JIT VMs do?

What I mean is, say you have a similar simple "x = (cond) ? a : b;" which the compiler has not translated to a CMOV.

If this is in a hot loop then the CPU could, in theory, notice that "it's just doing a conditional move, I can do the CMOV faster" and then translate those code bytes at that memory location to a CMOV instead (ie during decoding or something like that).

Not worth the complexity? I imagine it would slow down the decoder or wherever they insert this replacement logic. Or am I hopelessly out of date and they're already doing this?


I think it could only do that in spans of code where interrupts are disabled and purely between values already in registers. CPU state feels like it's in your control, but it's not at all in your control, and even if it was, multiple processes might exist and memory is never in your control.


Look for "idiom detection". Macro uop fusion is a limited form of this.

As a relevant example, POWER CPUs can convert some non-diverging branches into conditional execution if they are not well predictable.

Many RISC CPUs can convert well known LL/SC code patterns into locked atomic operations to guarantee forward progress.


Thanks, keyword helped. Yeah I was thinking about macro uop fusion but taken to another level.


Whether to use a branch or a conditional move isn't really dependent on what the source is doing, but on how predictable the branch is and on its dependencies. Simplifying a bit, an explicit cmov is basically telling the computer that the condition is not easy to predict and not to bother. Modern processors will typically do the same analysis themselves and perform similarly on a branch, with a slight overhead.


> Will we see x86 or similar CPUs replace hot instruction blocks with current-processor-specific optimized code during runtime, similar to how certain JIT VMs do?

Wasn't that one of Transmeta's things?


Kinda, though Transmeta was built to translate everything. I must admit I had forgotten some details, and you're right that my idea is not unlike bolting the Code Morphing stuff onto a regular x86 core.


> I call this bog-standard, but the quicksorts most easily found online are always oddly pessimized, both more complicated and slower. This one uses the “Lomuto partition”, which is simpler than Hoare’s.

The Lomuto partitioning is a naive vandalism of Hoare's original, which causes quicksort to have quadratic behavior on sorted input.
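
For readers who haven't seen it, the Lomuto scheme in question looks roughly like this (a generic sketch, not the article's exact code):

  #include <utility>  // std::swap

  // Partition [first, last) around the last element; returns the pivot's final position.
  int* lomuto_partition(int* first, int* last) {
    int pivot = *(last - 1);
    int* store = first;
    for (int* p = first; p != last - 1; ++p)
      if (*p <= pivot)
        std::swap(*store++, *p);
    std::swap(*store, *(last - 1));
    return store;
  }

With the pivot always taken from one end, already-sorted input puts everything on one side of every partition, which is where the quadratic behavior comes from.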


Fascinating! How does one learn to write C code that will make use of specific asm instructions?

simdjson is another example that comes to mind. The conceit of C is that you don't have control over the underlying machine instructions without inlining them yourself. So how do people write C code with CPU optimizations like this?


There are three approaches:

1) Use intrinsics, if your platform provides intrinsics for the instructions you want to use.

2) Use inline assembly (which I suppose doesn't technically count as writing C code, but is very much part of the story).

3) Write C code carefully against a chosen compiler with chosen optimization flags, and inspect the assembly. Especially for small chunks of code, a tool like godbolt.org is indispensable. Basically, you're writing the assembly you want the compiler to generate (either explicitly, or just in your head), then writing C code that seems likely to generate that sequence of instructions, then tweaking the code until the output you want is generated. If this is a super important optimization, it's also reasonable to add a build step that inspects the generated assembly and fails the build if it doesn't match the desired pattern; but in that case, writing inline assembly is usually easier.


Option 4 is sometimes compiler annotations like __builtin_expect_with_probability, which you could use to assign a branch a probability of 50%, which should coax the compiler into picking cmov (same risk as 3, though, in that you need to inspect the compiler output).
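
E.g. something like this (hypothetical names; as noted, you still have to check what actually comes out):

  inline int select_hint(bool c, int a, int b) {
    // tell GCC the condition is a coin flip; it may then prefer cmov over a branch
    return __builtin_expect_with_probability(c, 1, 0.5) ? a : b;
  }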


Yep, there's an entire book to be written on techniques that you can use to express your understanding of code to the compiler so that the compiler's chosen implementation of it matches yours. Plenty of __builtins and __attributes__, and even just the ordering of if/else statements, grouping statements into local scopes with {}, using or not using a temporary variable...


I would say it is a moving target if the goal is to write C code that compiles to a seemingly magical CPU instruction. As others have pointed out, learning assembly and Compiler Explorer are useful things.

As a way to show the wonderful complexities of compilers, check the TOC of https://shop.elsevier.com/books/optimizing-compilers-for-mod...


Use inline assembly or intrinsics. Read the code of stuff that does these things.

Resources sufficient for implementing simdjson or similar vectorized parsers: https://www.intel.com/content/www/us/en/docs/intrinsics-guid... https://lemire.me/blog/ http://0x80.pl/articles/


Heh, nice! This article reminded me of a great talk I recently noticed:

Rethinking Binary Search: Improving on a Classic with AI Assistance: https://www.youtube.com/watch?v=FAGf5Xr8HZU

The gist is rather simple: assuming that a substantial share of your searches come back with no result, you can bias the binary search to jump farther than just half, improving runtime noticeably.
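
If I understood the trick correctly, it amounts to something like this toy version (my sketch, not the code from the talk):

  #include <cstddef>

  // Probe off-center instead of at the midpoint; the claim is that when most
  // lookups miss, the larger jumps discard more of the range per step on average.
  bool biased_contains(const int* v, std::size_t n, int key) {
    std::size_t lo = 0, hi = n;                    // search in [lo, hi)
    while (lo < hi) {
      std::size_t probe = lo + (hi - lo) / 4 * 3;  // biased: jump farther than half
      if (v[probe] == key) return true;
      if (v[probe] < key)  lo = probe + 1;
      else                 hi = probe;
    }
    return false;
  }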


Matter to whom?

If your primary audience is other developers then it absolutely does not matter. In fact it’s a way to expose deep anger. All that matters is convenience and measurements just piss people off, because measuring anything is just too hard.

If your primary audience is business, especially transactional finance, then absolutely yes. Squeeze absolutely every millisecond out, measure absolutely everything in terms of transaction quantity, focus on concurrent operations.


Convenience is the kind of thing a man values when he isn't pushing against any fundamental limits imposed by the universe. You have to work on computing frontiers like LLMs where people are feeling a performance crunch. Measuring things is what separates software engineering from computer programming. Computer science degenerates into social science when your computer is 100x faster than the thing you're building, since then you have no need to measure things. Without extreme self-imposed discipline, your work in that kind of environment will spiral into bloated decadence, unless you move on to the next computing frontier.


Note that random data is not a common case for sorting algorithms. It'd be interesting to see how the numbers change on partially-, mostly-, and fully-sorted data.


the article does mention this:

> What can we conclude from this discovery? Before we conclude anything, we should remind ourselves of its limitations. The tests run were on completely random data. Truly random data seldom occurs in real life.

it's true that it's not a common case, but it's probably the simplest objectively justifiable case; any particular case of the others requires more elaborate justification for privileging it. why 90% sorted instead of 80%, say? or why do we give 20% weighting to the randomized-data case and 80% weighting to the 90%-sorted case, rather than some other weighting?

and it does at least guarantee exploration of a good part of the algorithm's behavior space. i mean, bogosort is optimal on fully-sorted data, right? as long as you do the check for sortedness before the random shuffle instead of after it

it's important that your sort not explode on mostly-sorted or mostly-reverse-sorted data (myers's 'bog-standard quicksort' uses the last element as the pivot and consequently goes quadratic), but beyond that, the particular weighting of relative importance to assign to random input vs. 99% sorted vs. 90% sorted vs. 90% reverse sorted—that requires some application-specific justification

aside from the ones you mentioned, another interesting case is a large array of records that all have equal keys. median-of-three quicksort or random-pivot quicksort does fine on mostly- or fully-sorted data, but still sucks when everything is equal!


It’s weird to admit that this is an arbitrary dataset that doesn’t mimic real-world use cases, then admit that radix sort is far faster for this situation, and then conclude that low-level optimizations are important and a cmov wrapper should be added to the standard library to make programs faster. It’s a great post; I just disagree with the conclusion.


True. You would think it's maybe not that hard to find a real dataset to sort somewhere.


Here's the article linked as https; I'll email the mods.

https://cantrip.org/sortfast.html


Belatedly changed from http://cantrip.org/sortfast.html. Thanks!


The radix sort implementation is not optimal. Instead of sorting 1 digit at a time, it should be sorting 11 digits at a time to saturate the cache lines.

So the baseline radix sort can be even faster.


From the article - "This radix sort would be easy to make even faster. Instead, let us make one that is slower"

The radix sort was used to provide an estimate of how much the quicksort could be improved.


but why make a slower one when a faster one is around? none of the x-times analysis makes any sense then.


(2020)


Which of course makes it ironic that it says "We ought to be able to sort pretty fast, by now, using a Standard Library sort."

The C++ standard promised (since 1998) that the provided sorts are O(n log n).

But in fact popular C++ implementations ignored this and shipped quicksort - after all, it's "pretty fast" - but unfortunately this means that if bad guys can choose your input (or you get unlucky) you get O(n squared).

Gradually they fixed this; libc++ updated to introsort (last century's best attempt at this) in 2021, in response to a ticket opened by Orson Peters.

These days you're also starting to see a trend towards the Rust-style "fast sort which implements a wide contract" design even in C++ implementations. Rust's sort doesn't actually promise this wide contract arrangement, but it does promise safety and the wide contract turns out to be the simplest way to do that.

What I mean there by "wide contract" is that these sorts don't blow up if your ordering rule is nonsense. They can't "sort" successfully because your ordering rule is nonsense, if A > B && B > A then too bad, we can't sort A and B. But they can and do gracefully give up, while many older C++ stdlibs just cause uncontrolled mayhem in this case.


I'm pretty sure that introsort was already implemented in the original STL last century.

In fact the standard didn't require O(n log n) until C++11; as all implementations were already using introsort, allowing the previous, more permissive O(n^2) bound was no longer necessary.


I don't know which is the "original STL" in this context. Alexander Stepanov's proposal of a standard template library for C++ pre-dates Musser's invention of introspective sorting (introsort). Stepanov is a smart guy but he's also somewhat famous so I'd guess if he invented it first we'd know that.

As I said, Orson filed a bug in LLVM bugzilla for libc++ going quadratic for adversarial input in 2014, and it was eventually fixed (by using introsort) in 2021.

There's a fun sideshow in that ticket. Remember, Orson actually wrote what was probably the best general-purpose unstable sort at the time - but all he's done is open a ticket saying libc++ doesn't meet the minimum requirements. After a while, with the team having dragged their feet and still not even landed the patch to just use introsort, somebody pipes up to report that they have benchmarked sorts and actually libc++ is faster than other popular C++ sorts (in their benchmark). So Orson politely asks them to try PDQsort. That's the first time he explicitly mentions his Pattern-Defeating Quicksort in the ticket, and maybe a harried LLVM bug handler wouldn't notice who filed the bug until then. Just another name.

But it's OK, the LLVM devs are unperturbed and ignore him, focusing instead on the minutiae of the performance numbers for their known-defective unstable sort. Years later somebody eventually adds introsort to libc++ and closes the defect ticket. Nobody answers Orson's question; the answer, of course, is "Oh, yours is much faster".


Musser and Stepanov were working together on the STL. I think they were both at SGI at that time, and introsort was integrated into the SGI STL (around the same time the STL was standardized, I guess introsort was too new to require it in C++98); from there introsort made it into most other standard library implementations which are directly derived from SGI STL.

libc++ is notable for being one of the few from-scratch implementations.



