1. Better memory aliasing to use SIMD instructions. But you have to give up pointer arithmetic in those cases. Only useful for number-crunching and GPU-like workloads AFAIK.
2. Pre-compute time-consuming constants during compile time. Though I see no reason why you cannot do this in C.
3. JIT and runtime optimization. I doubt, though, given the various overheads of JIT and runtime systems (GC, memory, etc.), that a JIT can beat carefully crafted C. Of course it will make it easier to write many types of programs.
Couldn't carefully crafted C pretty much fix every potential case where another language could beat it? For instance, is there any reason you can't write a memcopy function that copies 16 bytes at a time?
I assumed that a big part of a language being faster than C is really about the compiler generating faster code than a human programmer in either language could without thinking too hard about optimization, and not so much about the theoretical top speed. As I understand it, this is the case with assembly vis-à-vis C.
Couldn't carefully crafted C pretty much fix every potential case where another language could beat it?
Yes, but one runs into the same problems as for carefully crafted assembly:
- People who have the skills necessary to produce "carefully crafted" code don't come cheap.
- All that careful crafting often ties the code to a specific target platform. For example, the gyrations one might go through to make the code more SIMD-friendly might be detrimental to performance on a platform that lacks SIMD.
In C, you would have to code the assumptions into the runtime yourself, as opposed to letting the compiler or JIT do it. This isn't viable if you have time constraints or a lack of knowledge in the required subject area.
There's also the chance that you'll introduce bugs that the compiler/JIT developers have already encountered and fixed.
That kind of metaprogramming is just a manual substitute for deep constant propagation. Your optimiser needs to work across functions and compilations, however. It's also tricky for the optimiser to decide how deep to pursue this propagation (think of it as marking all functions as inline). This is why, for example, memcpy() is typically also a built-in function rather than just an exported library symbol. Knowing the context (alignment guarantees, fixed size) can enable some very significant optimisations.
Still, compilers are increasingly starting to do this.
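To illustrate that built-in point with a small sketch (the helper name float_bits is mine, not from the article): when memcpy is called with a compile-time-constant size, compilers treat it as a built-in and typically emit a single load/store pair instead of a library call. This is also the standard idiom for reinterpreting a float's bits without aliasing violations.

```c
#include <stdint.h>
#include <string.h>

/* With a compile-time-constant size, compilers treat memcpy as a
 * built-in: this 4-byte copy typically compiles to one register move,
 * not a call. */
static uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* size known at compile time */
    return bits;
}
```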
Applying meta-programming techniques to address those issues is just overkill. Pre-computing difficult-to-compute constants should happen at coding time, not compile time, unless you are writing an academic paper about meta-programming. In real-world practice, any serious C programmer who repeatedly writes `fib(20)` as a function call should just go hang themselves immediately.
There is a reason why we have tons of `#define` macros in many C programs.
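A trivial sketch of that coding-time approach (FIB_20 is a made-up macro name): the constant is computed once by hand or with a throwaway script, then hard-coded, with the reference function kept only to document where the number came from.

```c
/* Precomputed at coding time: fib(20) = 6765, hard-coded. */
#define FIB_20 6765

/* Reference implementation, kept only for documentation/testing. */
static int fib(int n) {
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}
```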
The issue isn't programmers writing fib(20) repeatedly, it's that fib(20) may be called repeatedly by the system during the natural course of operation.
A language with the constraints of Haskell allows the JIT to recognise that fib(20) is being called a lot, or will be called again, and simply inline that function call itself. That is completely impossible for a C compiler to achieve unless the developer writes: if (arg == 20) { return X; }
That has two problems:
a) The developer has to profile and look for specific behaviour in the system
b) The optimisation may change depending on the state of the world e.g. the system may pick a random constant every day to compute the Fibonacci sequence for.
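The usual manual substitute in C is a memo table, sketched below (the cache size and sentinel are arbitrary choices of mine). Note that it partially sidesteps problem (b) — unlike a hard-coded `if (arg == 20)`, it helps for whatever values the process actually sees — but the programmer still had to anticipate the hotspot, which is precisely the work a profiling runtime does automatically.

```c
#include <string.h>

/* Manual stand-in for runtime caching: memoize fib so a hot call
 * like fib(20) is only ever computed once per process. */
static long fib_memo(int n) {
    static long cache[64];
    static int initialized;
    if (!initialized) {
        memset(cache, -1, sizeof cache);  /* all-ones bytes == -1 */
        initialized = 1;
    }
    if (n < 2)
        return n;
    if (n < 64 && cache[n] != -1)
        return cache[n];
    long r = fib_memo(n - 1) + fib_memo(n - 2);
    if (n < 64)
        cache[n] = r;
    return r;
}
```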
I think you are confusing the “meta-programming” and “run-time optimization” as described by the original article.
The JIT optimization you mentioned is actually his point 3, while I was addressing his point 2, “meta-programming” during compile time. True, he mentioned that in languages like Lisp, where compile time and runtime can be interleaved, it would be a feasible solution, but he was focusing on statically compiled features more in line with C++ templates, which I doubt would bring anything close to what you were talking about.
C doesn't provide runtime profiling and optimization, but if performance is really critical, offline profiling to find hotspots and replacing them with pre-computed constants is a must. On average, the complexity and additional overhead brought by the JIT techniques probably won't offset the performance gain it buys compared to carefully profiled and optimized C.
You are free to doubt item 3 as much as you like, but reality has already reached the point where runtime optimizations with profiling and JIT are starting to beat even "carefully crafted" statically-compiled programs. Yes, there's overhead, of course, but there's also overhead in C. Much of the gain from JIT and profiling lies in the fact that you can do heuristic optimizations such as live inlining that simply can't be done by a static ahead-of-time compiler (which can't prove that those optimizations are safe). Over the life of a sufficiently long-running process (like a web application, for example), those gains easily outweigh the overhead of the system, and can quite easily beat statically compiled code.
C can be fast (icc) and slow (CINT) depending on compiler/interpreter and OS/platform.
MSVC on Windows can be 40% faster than gcc on the same platform.
I think icc (Intel C Compiler) can be several times faster than gcc for some workloads.
I can write OpenCL code that will be up to 100 times faster than plain C.
EDIT:
The reason for this is that we still use procedural languages rather than declarative ones.
In a declarative language I would say: "The content of this block of memory should be identical to the content of that block of memory." I would also specify that the blocks are non-overlapping, if that cannot be inferred from the block definitions themselves.
Instead, the memcopy C code describes a procedure of iterative byte-by-byte copying. That is only one of many ways to do it, and maybe it was even the fastest way on a PDP-11, but not on modern or future machines.
Similarly, if you try to micromanage your employees, you will not get the most optimized result.
To get the best performance, you do not need to be Close-to-The-Metal(tm). You need to be able to express your desired result in the most abstract, but still correct, way possible.
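For concreteness, the procedural byte-by-byte formulation described above looks like this in C (a sketch, not the article's exact code) — note how it over-specifies a particular copying order the hardware must then be talked out of:

```c
#include <stddef.h>

/* Procedural formulation: prescribes one specific order of single-byte
 * copies, rather than declaring "dst's contents shall equal src's". */
static void byte_copy(unsigned char *dst, const unsigned char *src, size_t n) {
    while (n--)
        *dst++ = *src++;
}
```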
- A very clever and complete type analysis. The compiler can then optimize the code fully, without emitting runtime type-checking code.
- If possible, a flow analysis: by analysing the bounds of each function, the compiler is able to remove a lot of unused code. For instance, if a variable is defined by t = 2*n and long afterwards there is a test checking the primality of t, the compiler can remove the test.
- Compiling code by rewriting algorithms to take memory locality into account, i.e. all the things described by U. Drepper in "What Every Programmer Should Know About Memory": rewriting algorithms to avoid cache misses, etc.
- An automatic template-like mechanism: the compiler is able to partially execute the code, even over arrays, so everything that can be computed because the data is already available should be compiled directly. Imagine you have a regular-expression library and you write your regexp in your code (just the regexp): the compiler should be able to produce an automaton.
- As was also said: being able to use SIMD instructions.
Because of this, it is better to compile a high-level language with very clean semantics: you can reason about the semantics much more easily, and more easily rewrite some algorithms.
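The regular-expression point can be made concrete. For the fixed pattern "ab*c", a partially evaluating compiler could emit something like the following automaton — hand-derived here as an illustration of the idea, not actual compiler output:

```c
#include <stdbool.h>

/* What partial evaluation of regexp_match("ab*c", s) could reduce to:
 * the pattern is gone at runtime; only its automaton remains. */
static bool match_ab_star_c(const char *s) {
    if (*s++ != 'a') return false;  /* state 0: expect 'a' */
    while (*s == 'b') s++;          /* state 1: consume zero or more 'b' */
    if (*s++ != 'c') return false;  /* state 1 -> 2: expect 'c' */
    return *s == '\0';              /* accept only at end of string */
}
```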
The author says "In C99 the restrict keyword was added, which we could use
here to encode that src and dst are different from all other references.
This mechanism helps in some cases, but not in our example."
Doesn't the example below show otherwise? Note the presence of movdqu in both
memcopy_normal and memcopy_restrict and how GCC emits five LEA instructions
without restrict, but only two when restrict is present.
This does not support the author's assertion that "Aliasing information is
the only one, where I am certain about speed improvements, because it is
impossible to reach Fortran-speed in C."
Yes, the author is wrong there; restrict is valid and useful in that memcpy example.
It is true, though, that restrict doesn't solve all of C's aliasing problems. For example:
#include <stdlib.h>

struct array_3d {
    float * restrict data;
    size_t xmin, ymin, zmin;
    size_t xmax, ymax, zmax;
    size_t allocated;
};

static inline float *
array_3d_elem_addr(struct array_3d *a3d, size_t x, size_t y, size_t z) {
    return a3d->data + z +
           y * (a3d->zmax - a3d->zmin) +
           x * (a3d->zmax - a3d->zmin) * (a3d->ymax - a3d->ymin);
}

void array_3d_add(struct array_3d *restrict dst, struct array_3d *restrict src) {
    for (size_t x = dst->xmin; x != dst->xmax; ++x)
        for (size_t y = dst->ymin; y != dst->ymax; ++y)
            for (size_t z = dst->zmin; z != dst->zmax; ++z)
                *array_3d_elem_addr(dst, x, y, z) += *array_3d_elem_addr(src, x, y, z);
}
No amount of restrict keywords can tell the compiler that src->data and dst->data don't alias here (putting restrict on the 'data' member declaration inside array_3d doesn't mean anything here, because of the way restrict is defined).
GCC manages to vectorize this loop, but only conditionally, with runtime alias checks. That's ok for this trivial example, but it isn't always viable in more complex examples. It's also worth noting that there are ways to rewrite this code so that it does fully vectorize, though again it's helped by the example being so trivial. The main point is that in Fortran, the alias rules are strong by default, without the programmer having to do acrobatics.
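One such rewrite — my guess at what was meant, sketched on a simplified struct without the (meaningless) restrict member — hoists the data pointers into restrict-qualified locals. Block-scoped restrict gives the compiler exactly the no-alias guarantee the struct member declaration could not:

```c
#include <stddef.h>

struct array_3d {
    float *data;
    size_t xmin, ymin, zmin;
    size_t xmax, ymax, zmax;
};

/* restrict on the block-scoped locals d and s promises the compiler
 * that, within this function, the two arrays do not overlap. */
void array_3d_add_fast(struct array_3d *dst, const struct array_3d *src) {
    float *restrict d = dst->data;
    const float *restrict s = src->data;
    size_t ny = dst->ymax - dst->ymin;
    size_t nz = dst->zmax - dst->zmin;
    for (size_t x = dst->xmin; x != dst->xmax; ++x)
        for (size_t y = dst->ymin; y != dst->ymax; ++y)
            for (size_t z = dst->zmin; z != dst->zmax; ++z)
                d[z + y * nz + x * nz * ny] += s[z + y * nz + x * nz * ny];
}
```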
The instruction set page linked in the article mentions: To move a double quadword to or from memory locations that are known to be aligned on 16-byte boundaries, use the MOVDQA instruction - would it not be beneficial to use this instruction for most of the memory to be copied and only use the slower variants for the leading and trailing bytes?
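The scheme in this question can be sketched portably (word-sized chunks standing in here for MOVDQA's 16-byte aligned stores; an illustration of the head/middle/tail split, not tuned code):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Head/middle/tail copy: unaligned leading bytes one at a time, the
 * aligned middle in word-sized chunks, then the trailing bytes. */
static void copy_split(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    /* head: advance until dst is word-aligned (or bytes run out) */
    while (n && ((uintptr_t)d % sizeof(size_t)) != 0) {
        *d++ = *s++;
        n--;
    }
    /* middle: word-sized chunks; memcpy dodges strict-aliasing issues */
    while (n >= sizeof(size_t)) {
        memcpy(d, s, sizeof(size_t));
        d += sizeof(size_t);
        s += sizeof(size_t);
        n -= sizeof(size_t);
    }
    /* tail: leftover bytes */
    while (n--)
        *d++ = *s++;
}
```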
Does the C semantics prevent the compiler from issuing a run-time check for the (rare) aliasing case and proceed with the fast version in the common case? (probably unrolled, too?)
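On the runtime-check question: nothing stops a hand-rolled version, sketched below. (The uintptr_t casts are a common practical dodge, since relational comparison of pointers into different objects is undefined in strict C; the non-overlapping branch is where the fast, unrolled or vectorized loop would go.)

```c
#include <stddef.h>
#include <stdint.h>

/* One cheap overlap test up front, then the fast path in the common
 * (non-overlapping) case; a backward copy handles the rest. */
void copy_checked(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    uintptr_t di = (uintptr_t)dst, si = (uintptr_t)src;

    if (di >= si + n || si >= di + n || di < si) {
        /* no overlap, or dst below src: forward copy is safe,
         * and this loop is the one worth vectorizing/unrolling */
        while (n--)
            *d++ = *s++;
    } else {
        /* dst overlaps src from above: copy backwards */
        d += n;
        s += n;
        while (n--)
            *--d = *--s;
    }
}
```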
Out of curiosity: the code uses a special counter variable count which has to be decremented separately - is this faster than testing for dst != behind_last_dst_ptr?
(aside: my gut feeling tells me that if the combination of C/8086 can't pull off his example at the maximum memory-to-processor transfer speed for sufficiently large input vectors, there is something seriously rotten ...)
Personally, I think the most interesting part was the link to a student who used Scheme to generate a really fast C implementation of a matrix-multiply algorithm, which beat all C code except for one that included hand-coded assembly (and with additional improvements could have beat that, too).
His third point is addressable by using LLVM as your target. Optimization can occur at compile time, at link time, and at runtime, if you decide to do the final LLVM-to-machine-language translation at the latter point.
Inter-module optimizations (e.g., removing dead code) can occur at link time as well.
So there are better tools now that compiler writers can target to start making progress in at least one of these areas.
The memcopy example isn't an optimization issue, it's a correctness issue. If the described behavior of memcopy is what was intended then of course the compiler can't use that particular optimization, and it couldn't in any other language, either. However, if the author was intending to write a function with behavior like memmove, then the issue is that the code is incorrect.
This was my initial reaction, but he is talking about a function that happens to be named memcpy() and what the compiler is allowed to do with the provided source code. So you should actually blame him for defining a function with the same name as a standard function (which is undefined behavior). What would make his example correct is to use any name other than a reserved one - that is, not "memmove" either.
If you look more closely, you will also see that his memcopy dereferences a void * (and performs some pointer arithmetic as well). Apologies for my inner pedant getting the better of me.