Since we're comparing apples to oranges anyway, how fast could it be if you really wanted to format "%d %d" into a stack buffer 10 million times without a function call... Just for fun:
int main() {
    static const char* digits = "0123456789";
    int i;
    for (i = 0; i < 10000000; i++) {
        char x[44], *p = x, tmp[20];
        /* sign */
        int j;
        if (i < 0) { *p++ = '-'; j = -i; } else { j = i; }
        /* number */
        int pos = 0, spos;
        do {
            tmp[pos++] = digits[j % 10];
            j /= 10;
        } while (j != 0 && pos < 20);  /* pos < 20 keeps tmp[] in bounds */
        spos = pos;
        do { *p++ = tmp[--pos]; } while (pos > 0);
        /* space, sign, number again */
        *p++ = ' ';
        if (i < 0) *p++ = '-';
        do { *p++ = tmp[--spos]; } while (spos > 0);
        *p++ = '\0';
    }
}
$ gcc -O4 -o s s.c
$ time ./s
real 0m0.140s
user 0m0.138s
sys 0m0.001s
In a practical sense I don't really care whether benchmarks are employing the same algorithms or whether every little detail is the same. I do care what the speed of the idiomatic approach to solving a problem is. That means that if the C++ STL's standard sort were bubble sort and Python's were quicksort, then to the 9 out of 10 programmers who will end up using the standard library, Python would be faster at sorting, even though C++ would be much faster using the same algorithm.
That runs in ~0.4s on my system (Xeon E5520, 2.27GHz), but replacing the 'digits' lookup table with simple arithmetic on the ASCII values ('0' + j%10) speeds it up to ~0.23s. Yes, L1 caches are pretty fast, but ALUs are still faster (for integer addition anyway).
Edit: This was with GCC 4.1.2; newer versions probably optimize differently, so who knows.
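To be concrete, the change being described only touches the digit-extraction loop from the benchmark above (same variable names; just a minimal sketch of the substitution, not a separate program):

do {
    tmp[pos++] = '0' + j % 10;  /* arithmetic instead of digits[j % 10] */
    j /= 10;
} while (j != 0 && pos < 20);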
Interesting... I guess I should have mentioned: gcc 4.5.2 on a Xeon X5670, 2.93GHz, with 12MB of cache. Changing it to ('0' + j % 10) makes no difference in overall speed for me.
OK, I just tested with gcc 4.6.0, and unless I've screwed something up, it looks like (at -O4) it actually optimizes these into the exact same code. As in the generated ELFs are byte-for-byte identical. Impressive.
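For anyone who wants to reproduce the comparison, something like the following is enough (the file names here are just placeholders for the lookup-table and arithmetic variants):

$ gcc -O4 -o s_table s_table.c
$ gcc -O4 -o s_arith s_arith.c
$ cmp s_table s_arith && echo identical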
Well, if you really want to go apples-to-oranges and specialize as much as possible for the sequentially increasing case we're dealing with here, you could just do your arithmetic directly on the ASCII string itself, complete with a little ripple-carry loop... I bet that'd be fast. If I'm remembering correctly, x86 actually has dedicated instructions for ASCII arithmetic (though they're probably so unoptimized in a modern microarchitecture that it'd be faster to avoid them).
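For what it's worth, a rough sketch of that ripple-carry idea might look like this (purely illustrative, not from the original benchmark; inc_ascii and its right-aligned, fixed-width digit buffer are assumptions): the counter lives in the buffer as ASCII digits and is bumped in place each iteration, so there's no divide or modulo at all.

/* Hypothetical sketch: add 1 to a right-aligned ASCII decimal number,
   carrying leftwards from the least significant digit. */
static void inc_ascii(char *buf, int len) {
    int k;
    for (k = len - 1; k >= 0; k--) {
        if (buf[k] != '9') { buf[k]++; return; }  /* no carry past this digit */
        buf[k] = '0';                             /* 9 -> 0, carry continues */
    }
}

In the "%d %d" loop above, each iteration would then just increment the string once and copy it into the output buffer twice.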