Questions about Superoptimization

orlp · on Dec 31, 2016

My comments here are not to the posted article, but the one referenced at the top: http://www.oilshell.org/blog/2016/12/23.html

    def LookupKind(id):
         return KIND_TABLE[id]
    def LookupKind(id):
        return 175 & id & ((id ^ 173) + 11)

> The second implementation is faster because it doesn't require any memory accesses.

This is not necessarily true. If this function is in a hot loop, and the entire lookup table fits in L1 cache you can treat KIND_TABLE[id] as only taking ~1-2 cycles.

> Jeff Dean also posed this question in his talk: How long it does it take to quicksort 1 billion 4-byte integers? It surprised me that he estimated it by simply counting the number of branch mispredictions, which is described in this post.

This is no longer true. In state of the art implementations quicksort is implemented in a branchless fashion. See https://github.com/orlp/pdqsort for details (I'm the author of pdqsort, old username on HN). The trick is to first replace directly swapping misplaced elements in a loop by first finding a buffer of elements on the left that should be on the right, and a buffer of elements that are on the right but should be on the left. This can be done in a branchless manner:

    buffer_num = 0; buffer_max_size = 64;
    for (int i = 0; i < buffer_max_size; ++i) {
        // With branch:
        if (elements[i] < pivot) { buffer[buffer_num] = i; buffer_num++; }
        // Without:
        buffer[buffer_num] = i; buffer_num += (elements[i] < pivot);
    }

These buffers can then also be swapped branchlessly.

Coding_Cat · on Dec 31, 2016

>the entire lookup table fits in L1 cache you can treat KIND_TABLE[id] as only taking ~1-2 cycles.

You can do so, but on modern processors it's ~4 cycles to access the L1 cache and

        return 175 & id & ((id ^ 173) + 11)

takes 3 cycles (cylce 1, do the first & and the ^; cycle two, do the +; cycle 3, do the second &).

pmalynin · on Dec 31, 2016

Actually on Intel CPUs ALU operations take 1/3 of a cycle and the processor can schedule them back to back.

Coding_Cat · on Dec 31, 2016

AFAIK this is not quite right. They take only 1/4th of a cycle on average, but that is because of pipelining. if you have a dependency on the result of an ALU you will still have to wait the full latency (1 cycle) before you can continue.

This makes sense as the clock is also somewhat of the 'driving force' for pushing signals through the chip from one part to the other. (some architectures have 'zero cost' operations I believe, but these are usually baked into the pipeline and have to be turned on-or-off depending on need).

gravypod · on Dec 31, 2016

This is unrelated but I wonder what GCC's new platform specific-method generator would do to this section of code [0]. There are a few other bits in here that look like all vector-ops and I'd love to see what would happen to this after the compiler get's a crack at it.

Also if someone could tell me if this is a stupid idea: what would be the result of writing a version of this sort as a const-expr for fixed size arrays. For instacne pdqsort512 to sort a 512 length set. Just a thought. I'd love to see what G++ is able to hob-cobble together.

[0] - https://github.com/orlp/pdqsort/blob/master/pdqsort.h#L227-L...

orlp · on Dec 31, 2016

It's not a vector operation, even though it's written in a fashion similar to it. However, it can be implemented in a very interesting fashion.

Consider when c0, c3 and c6 are true, and the rest are false. Then the end result is this:

    offsets_l[num_l + 0] = i + 0;
    offsets_l[num_l + 1] = i + 3;
    offsets_l[num_l + 2] = i + 6;
    num_l += 3;

On a little-endian machine we could write this as following (assuming there is space after offsets_l):

    *(uint64_t*) (offsets_l + num_l) = (i + 0) + ((i + 3) << 8) + ((i + 6) << 16);

Which is equivalent to:

    *(uint64_t*) (offsets_l + num_l) = (i + 0) + (i + 3)*256 + (i + 6)*65536;

And if we regroup the parenthesis and add common factors:

    *(uint64_t*) (offsets_l + num_l) = 65793 * i + 393984;

If we precalculate these magic constants for each combination of c0..7 we get a 256-element lookup table, and code that looks like this:

    int idx = c0 | c1 << 1 | c2 << 2 | c3 << 3 | c4 << 4 | c5 << 5 | c6 << 6 | c7 << 7;
    *(uint64_t*) (offsets_l + num_l) = muls[idx] * i + adders[idx];
    num_l += popcnt[idx];

Sadly, it turns out that this approach isn't faster. It's cool nonetheless.

duckingtest · on Dec 31, 2016

>If this function is in a hot loop, and the entire lookup table fits in L1 cache you can treat KIND_TABLE[id] as only taking ~1-2 cycles.

In isolation, yes, but is that ever actually the case? It's close to certain there's something else that can use L1 cache.

chubot · on Dec 31, 2016

Thanks for the link on sorting -- that looks really interesting!

I think it's fair to say that the expression is no slower than the lookup table. Even if missing the cache is rare, the penalty is large, so the average case will be significantly worse.

I freely admit the problem is contrived, but I think most people would agree it is actually an optimization! If you really want to nitpick, you might want to count the "opportunity cost" of some other 221 bytes not being in L1 cache.

orlp · on Dec 31, 2016

I haven't read about the rest of your problem much, but I usually think in terms of _cache pressure_. A lookup table can be very fast, but adds to the cache pressure. A large implementation with many special smart cases can also be very fast, but also adds to the cache pressure (instructions also need to be loaded from memory).

Another thing I don't believe you mention is that not every n-instruction solution is equal (even if we only use single cycle instructions as possibilities). This is because of instruction-level parallelism.

(a _ b) _ (c _ d) only takes two cycles to evaluate on a modern CPU.

a _ (b _ (c _ d)) takes three.

smcl · on Dec 31, 2016

What are the underscore operators here? I'm curious, this may be an established convention but I have not seen it before

stygiansonic · on Dec 31, 2016

I presume they are placeholders to represent any binary operator, i.e. +, -, etc. The operator should be the same in all the placeholders or have the same precedence for the two expressions to be equivalent.

orlp · on Dec 31, 2016

They are just arbitrary placeholders for any operator.

codeflo · on Dec 31, 2016

    buffer_num += (elements[i] < pivot);

Is this really branchless though? I'm curious what the generated assembler code looks like.

looki · on Dec 31, 2016

It is. You can use the `setl` instruction to conditionally set a register depending on whether the previous `cmp` was less. Then, you simply do an `add` with the produced value, which is either 0 or 1.

EDIT: You can see this here https://godbolt.org/g/ssy4am

orlp · on Dec 31, 2016

Your example is incorrect. But even if it were correct, a much better way of visualizing this is by unrolling the loop (as is done in my pdqsort implementation): https://godbolt.org/g/qWkmG4

The odd writing style of first doing all comparisons is intentional - it increases the parallelism and reduces the data dependency in the generated code. Note how all the colors are mixed through eachother: this is a good sign.

Also note that there are no branches at all in the generated code. It's 'straight' code that the CPU can just march through at full speed. This is what makes it so stupidly fast.

looki · on Dec 31, 2016

I think you misunderstood the OP (or maybe I have?) - it wasn't about parallelism, just about that one use of a comparison inside the addition. I showed that CPUs don't need to generate conditional jumps for that, that's all.

EDIT: To clarify, my example was completely contrived, I just reused your variable names for some sense of familiarity.

codeflo · on Dec 31, 2016

Exactly, that's what I asked about. I didn't know about setl and thus couldn't figure out how to get around the conditional branch. Thanks for the explanation!

looki · on Dec 31, 2016

I'm glad I understood correctly! No problem.

lgeek · on Dec 31, 2016

That's applicable if you compile for 386 or a newer x86 and your compiler generates the code you expect. On AArch32 (ARM) you could use conditional MOV and on AArch64 you could use CSET. I'm sure that other architectures also have similar instructions, however it's not a guarantee that you'll end up with branchless machine code.

awirth · on Dec 31, 2016

The nop and xchg in the GCC output are padding to make the next function aligned. You can ignore everything after the ret.

The execution of movzx should be negligible. The big difference between the code generated by GCC and the code generated by clang is that the clang code does everything using 8-bit registers, and zero-extends at the end, whereas the GCC code does everything with full 32-bit registers.

I actually had never seen the dil register before this (it's the low 8-bits of the [er]?di register). It's pretty cool that clang uses this when it knows the value in the first argument is byte-sized. I think it's also neat seeing how the compilers use the commutativity of binary-and differently, and end up anding in a different order. Overall I like the clang code better, as it seems closer to the asm someone might write by hand.

dumael · on Dec 31, 2016

> It's pretty cool that clang uses this when it knows the value in the first argument is byte-sized.

Clang is using the 8 bit subregister due to how it legalizes types.

When LLVM-IR is compiled for a target, it undergoes a process called "legalization" where invalid operations for the target are Expanded (replace the invalid operation with a semantically equivalent but legal series of operations), or Promoted (e.g. promote operations on boolean types to character types), Libcall (call out to the likes of libgcc.a) or Legal where the target directly supports the operation.

Since X86(_64) supports 8, 16, 32 (and 64) bit register accesses and operations, operations on variables of those sizes will be matched to the corresponding operation and register sizes.

If you were to compile that code for the likes of MIPS, ARM or PowerPC you'd see fully 32 bit code.

chubot · on Dec 31, 2016

OK thanks, do you know why it's 16-byte aligned? The Clang code is exactly 16 bytes but the GCC code is 20 bytes and it pads it out to 32. For i-cache maybe? (I asked this on reddit too)

awirth · on Dec 31, 2016

It's definitely an optimization thing, but I'm not actually sure off the top of my head why. According to https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Opt... -falign-functions is enabled at -O2 and higher, unless you use -Os. Presumably something about aligned instruction fetch/decoding can be faster.

powera · on Dec 31, 2016

If you're considering doing this, it's important to consider what you will do when you inevitably add additional values to the enum. If you have persistence or partial rollouts (and networked services always have partial rollouts) this is likely to break horribly. If the app never changes or it's used entirely in memory on a single process, you're fine.