1. Better memory aliasing to use SIMD instructions. But you have to give up pointer arithmetic in those cases. Only useful for number-crunching and GPU-like workloads AFAIK.
2. Pre-compute time-consuming constants during compile time. Though I see no reason why you cannot do this in C.
3. JIT and runtime optimization. I doubt, though, given the various overheads of JIT and runtime systems (GC, memory, etc.), that a JIT can beat carefully crafted C. Of course it will make it easier to write many types of programs.
Couldn't carefully crafted C pretty much fix every potential case where another language could beat it? For instance, is there any reason you can't write a memcopy function that copies 16 bytes at a time?
I assumed that a big part of a language being faster than C is really about the compiler generating faster code than a human programmer in either language could without thinking too hard about optimization, and not so much about the theoretical top speed. As I understand it, this is the case with assembly vis-à-vis C.
Couldn't carefully crafted C pretty much fix every potential case where another language could beat it?
Yes, but one runs into the same problems as for carefully crafted assembly:
- People who have the skills necessary to produce "carefully crafted" code don't come cheap.
- All that careful crafting often ties the code to a specific target platform. For example, the gyrations one might go through to make the code more SIMD-friendly might be detrimental to performance on a platform that lacks SIMD.
In C, you would have to code the assumptions into the runtime yourself, as opposed to letting the compiler or JIT do it. This isn't viable if you have time constraints or a lack of knowledge in the required subject area.
There's also the chance that you'll introduce bugs that the compiler/JIT developers have already encountered and fixed.
That kind of metaprogramming is just a manual substitute for deep constant propagation. Your optimiser needs to work across functions and compilations, however. It's also tricky for the optimiser to decide how deep to pursue this propagation (think of it as marking all functions as inline). This is why, for example, memcpy() is typically also a built-in function rather than just an exported library symbol. Knowing the context (alignment guarantees, fixed size) can enable some very significant optimisations.
Still, compilers are increasingly starting to do this.
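To illustrate that built-in point with a small sketch (the helper name float_bits is mine, not from the article): when memcpy is called with a compile-time-constant size, compilers treat it as a built-in and typically emit a single load/store pair instead of a library call. This is also the standard idiom for reinterpreting a float's bits without aliasing violations.

```c
#include <stdint.h>
#include <string.h>

/* With a compile-time-constant size, compilers treat memcpy as a
 * built-in: this 4-byte copy typically compiles to one register move,
 * not a call. */
static uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* size known at compile time */
    return bits;
}
```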
Applying meta-programming techniques to address those issues is just overkill. Pre-computing difficult-to-compute constants should happen at coding time, not compile time, unless you are writing an academic paper about meta-programming. In real-world practice, any serious C programmer who repeatedly writes `fib(20)` as a function call should just go hang themselves immediately.
There is a reason why we have tons of `#define` macros in many C programs.
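A trivial sketch of that coding-time approach (FIB_20 is a made-up macro name): the constant is computed once by hand or with a throwaway script, then hard-coded, with the reference function kept only to document where the number came from.

```c
/* Precomputed at coding time: fib(20) = 6765, hard-coded. */
#define FIB_20 6765

/* Reference implementation, kept only for documentation/testing. */
static int fib(int n) {
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}
```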
The issue isn't programmers writing fib(20) repeatedly, it's that fib(20) may be called repeatedly by the system during the natural course of operation.
A language with the constraints of Haskell allows the JIT to recognise that fib(20) is being called a lot, or will be called again, and simply inline that function call itself. That is completely impossible for a C compiler to achieve unless the developer writes: if (arg == 20) { return X; }
That has two problems:
a) The developer has to profile and look for specific behaviour in the system
b) The optimisation may change depending on the state of the world e.g. the system may pick a random constant every day to compute the Fibonacci sequence for.
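The usual manual substitute in C is a memo table, sketched below (the cache size and sentinel are arbitrary choices of mine). Note that it partially sidesteps problem (b) — unlike a hard-coded `if (arg == 20)`, it helps for whatever values the process actually sees — but the programmer still had to anticipate the hotspot, which is precisely the work a profiling runtime does automatically.

```c
#include <string.h>

/* Manual stand-in for runtime caching: memoize fib so a hot call
 * like fib(20) is only ever computed once per process. */
static long fib_memo(int n) {
    static long cache[64];
    static int initialized;
    if (!initialized) {
        memset(cache, -1, sizeof cache);  /* all-ones bytes == -1 */
        initialized = 1;
    }
    if (n < 2)
        return n;
    if (n < 64 && cache[n] != -1)
        return cache[n];
    long r = fib_memo(n - 1) + fib_memo(n - 2);
    if (n < 64)
        cache[n] = r;
    return r;
}
```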
I think you are confusing the “meta-programming” and “run-time optimization” as described by the original article.
The JIT optimization you mentioned is actually his point 3, while I was addressing his point 2, “meta-programming” during compile time. True, he mentioned that in languages like Lisp, where compile time and runtime can be interleaved, it would be a feasible solution, but he was focusing on statically compiled features more in line with C++ templates, which I doubt would bring anything close to what you were talking about.
C doesn't provide runtime profiling and optimization, but if performance is really critical, offline profiling to find hotspots and replacing them with pre-computed constants is a must. On average, the complexity and additional overhead brought by the JIT techniques probably won't offset the performance gain it buys compared to carefully profiled and optimized C.
You are free to doubt item 3 as much as you like, but reality has already reached the point where runtime optimizations with profiling and JIT are starting to beat even "carefully crafted" statically-compiled programs. Yes, there's overhead, of course, but there's also overhead in C. Much of the gain from JIT and profiling lies in the fact that you can do heuristic optimizations such as live inlining that simply can't be done by a static ahead-of-time compiler (which can't prove that those optimizations are safe). Over the life of a sufficiently long-running process (like a web application, for example), those gains easily outweigh the overhead of the system, and can quite easily beat statically compiled code.
C can be fast (icc) and slow (CINT) depending on compiler/interpreter and OS/platform.
MSVC on Windows can be 40% faster than gcc on the same platform.
I think icc (Intel C Compiler) can be several times faster than gcc for some workloads.
I can write OpenCL code that will be up to 100 times faster than plain C.
EDIT:
The reason for this is that we still use procedural languages rather than declarative ones.
In a declarative language I would say: "The content of this block of memory should be identical to the content of that block of memory." I would also specify that the blocks are non-overlapping, if that cannot be inferred from the block definitions themselves.
Instead, the memcopy C code describes a procedure of iterative byte-by-byte copying. That is only one of many ways to do it, and maybe it was even the fastest way on a PDP-11, but not on modern or future machines.
Similarly, if you try to micromanage your employees, you will not get the most optimized result.
To get the best performance, you do not need to be Close-to-The-Metal(tm). You need to be able to express your desired result in the most abstract, but still correct, way possible.
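For concreteness, the procedural byte-by-byte formulation described above looks like this in C (a sketch, not the article's exact code) — note how it over-specifies a particular copying order the hardware must then be talked out of:

```c
#include <stddef.h>

/* Procedural formulation: prescribes one specific order of single-byte
 * copies, rather than declaring "dst's contents shall equal src's". */
static void byte_copy(unsigned char *dst, const unsigned char *src, size_t n) {
    while (n--)
        *dst++ = *src++;
}
```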
- A very clever and complete type analysis. The compiler can then optimize the code fully, without emitting runtime type-checking code.
- If possible, a flow analysis: by analysing the bounds of each function, the compiler is able to remove a lot of unused code. For instance, if a variable is defined by t = 2*n and long afterwards there is a test checking the primality of t, the compiler can remove the test.
- Compiling code by rewriting algorithms to take memory locality into account, i.e. all the things described by U. Drepper in "What Every Programmer Should Know About Memory": rewriting algorithms to avoid cache misses, etc.
- An automatic template-like mechanism: the compiler is able to partially execute the code, even over arrays, so everything that can be computed because the data is already available should be compiled directly. Imagine you have a regular-expression library and you write your regexp in your code (just the regexp): the compiler should be able to produce an automaton.
- As was also said: being able to use SIMD instructions.
Because of this, it is better to compile a high-level language with very clean semantics: you can reason about the semantics much more easily, and more easily rewrite some algorithms.
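The regular-expression point can be made concrete. For the fixed pattern "ab*c", a partially evaluating compiler could emit something like the following automaton — hand-derived here as an illustration of the idea, not actual compiler output:

```c
#include <stdbool.h>

/* What partial evaluation of regexp_match("ab*c", s) could reduce to:
 * the pattern is gone at runtime; only its automaton remains. */
static bool match_ab_star_c(const char *s) {
    if (*s++ != 'a') return false;  /* state 0: expect 'a' */
    while (*s == 'b') s++;          /* state 1: consume zero or more 'b' */
    if (*s++ != 'c') return false;  /* state 1 -> 2: expect 'c' */
    return *s == '\0';              /* accept only at end of string */
}
```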
The author says "In C99 the restrict keyword was added, which we could use
here to encode that src and dst are different from all other references.
This mechanism helps in some cases, but not in our example."
Doesn't the example below show otherwise? Note the presence of movdqu in both
memcopy_normal and memcopy_restrict and how GCC emits five LEA instructions
without restrict, but only two when restrict is present.
This does not support the author's assertion that "Aliasing information is
the only one, where I am certain about speed improvements, because it is
impossible to reach Fortran-speed in C."
Yes, the author is wrong there; restrict is valid and useful in that memcpy example.
It is true, though, that restrict doesn't solve all of C's aliasing problems. For example:
#include <stdlib.h>

struct array_3d {
    float * restrict data;
    size_t xmin, ymin, zmin;
    size_t xmax, ymax, zmax;
    size_t allocated;
};

static inline float *
array_3d_elem_addr(struct array_3d *a3d, size_t x, size_t y, size_t z) {
    return a3d->data + z +
           y * (a3d->zmax - a3d->zmin) +
           x * (a3d->zmax - a3d->zmin) * (a3d->ymax - a3d->ymin);
}

void array_3d_add(struct array_3d *restrict dst, struct array_3d *restrict src) {
    for (size_t x = dst->xmin; x != dst->xmax; ++x)
        for (size_t y = dst->ymin; y != dst->ymax; ++y)
            for (size_t z = dst->zmin; z != dst->zmax; ++z)
                *array_3d_elem_addr(dst, x, y, z) += *array_3d_elem_addr(src, x, y, z);
}
No amount of restrict keywords can tell the compiler that src->data and dst->data don't alias here (putting restrict on the 'data' member declaration inside array_3d doesn't mean anything here, because of the way restrict is defined).
GCC manages to vectorize this loop, but only conditionally, with runtime alias checks. That's ok for this trivial example, but it isn't always viable in more complex examples. It's also worth noting that there are ways to rewrite this code so that it does fully vectorize, though again it's helped by the example being so trivial. The main point is that in Fortran, the alias rules are strong by default, without the programmer having to do acrobatics.
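One such rewrite — my guess at what was meant, sketched on a simplified struct without the (meaningless) restrict member — hoists the data pointers into restrict-qualified locals. Block-scoped restrict gives the compiler exactly the no-alias guarantee the struct member declaration could not:

```c
#include <stddef.h>

struct array_3d {
    float *data;
    size_t xmin, ymin, zmin;
    size_t xmax, ymax, zmax;
};

/* restrict on the block-scoped locals d and s promises the compiler
 * that, within this function, the two arrays do not overlap. */
void array_3d_add_fast(struct array_3d *dst, const struct array_3d *src) {
    float *restrict d = dst->data;
    const float *restrict s = src->data;
    size_t ny = dst->ymax - dst->ymin;
    size_t nz = dst->zmax - dst->zmin;
    for (size_t x = dst->xmin; x != dst->xmax; ++x)
        for (size_t y = dst->ymin; y != dst->ymax; ++y)
            for (size_t z = dst->zmin; z != dst->zmax; ++z)
                d[z + y * nz + x * nz * ny] += s[z + y * nz + x * nz * ny];
}
```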
The instruction set page linked in the article mentions: To move a double quadword to or from memory locations that are known to be aligned on 16-byte boundaries, use the MOVDQA instruction - would it not be beneficial to use this instruction for most of the memory to be copied and only use the slower variants for the leading and trailing bytes?
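The scheme in this question can be sketched portably (word-sized chunks standing in here for MOVDQA's 16-byte aligned stores; an illustration of the head/middle/tail split, not tuned code):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Head/middle/tail copy: unaligned leading bytes one at a time, the
 * aligned middle in word-sized chunks, then the trailing bytes. */
static void copy_split(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    /* head: advance until dst is word-aligned (or bytes run out) */
    while (n && ((uintptr_t)d % sizeof(size_t)) != 0) {
        *d++ = *s++;
        n--;
    }
    /* middle: word-sized chunks; memcpy dodges strict-aliasing issues */
    while (n >= sizeof(size_t)) {
        memcpy(d, s, sizeof(size_t));
        d += sizeof(size_t);
        s += sizeof(size_t);
        n -= sizeof(size_t);
    }
    /* tail: leftover bytes */
    while (n--)
        *d++ = *s++;
}
```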
Does the C semantics prevent the compiler from issuing a run-time check for the (rare) aliasing case and proceed with the fast version in the common case? (probably unrolled, too?)
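On the runtime-check question: nothing stops a hand-rolled version, sketched below. (The uintptr_t casts are a common practical dodge, since relational comparison of pointers into different objects is undefined in strict C; the non-overlapping branch is where the fast, unrolled or vectorized loop would go.)

```c
#include <stddef.h>
#include <stdint.h>

/* One cheap overlap test up front, then the fast path in the common
 * (non-overlapping) case; a backward copy handles the rest. */
void copy_checked(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    uintptr_t di = (uintptr_t)dst, si = (uintptr_t)src;

    if (di >= si + n || si >= di + n || di < si) {
        /* no overlap, or dst below src: forward copy is safe,
         * and this loop is the one worth vectorizing/unrolling */
        while (n--)
            *d++ = *s++;
    } else {
        /* dst overlaps src from above: copy backwards */
        d += n;
        s += n;
        while (n--)
            *--d = *--s;
    }
}
```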
Out of curiosity: the code uses a special counter variable count which has to be decremented separately - is this faster than testing for dst != behind_last_dst_ptr?
(aside: my gut feeling tells me that if the combination of C/8086 can't pull off his example at the maximum memory-to-processor transfer speed for sufficiently large input vectors, there is something seriously rotten ...)
Personally, I think the most interesting part was the link to a student who used Scheme to generate a really fast C implementation of a matrix-multiply algorithm, which beat all C code except for one that included hand-coded assembly (and with additional improvements could have beat that, too).
His third point is addressable by using LLVM as your target. Optimization can occur at compile time, at link time, and at runtime, if you decide to do the final LLVM-to-machine-language translation at the latter point.
Inter-module optimizations (e.g., removing dead code) can occur at link time as well.
So there are better tools now that compiler writers can target to start making progress in at least one of these areas.
The memcopy example isn't an optimization issue, it's a correctness issue. If the described behavior of memcopy is what was intended then of course the compiler can't use that particular optimization, and it couldn't in any other language, either. However, if the author was intending to write a function with behavior like memmove, then the issue is that the code is incorrect.
This was my initial reaction, but he is talking about a function that happens to be named memcpy() and what the compiler is allowed to do with the provided source code. So you should actually blame him for defining a function with the same name as a standard function (which is undefined behavior). What would make his example correct is to use any name other than a reserved one - that is, not "memmove" either.
If you look more closely, you will also see that his memcopy dereferences a void * (and performs some pointer arithmetic as well). Apologies for my inner pedant getting the better of me.