Since we're comparing apples to oranges anyway, how fast could it be if you really wanted to format "%d %d" into a stack buffer 10 million times without a function call... Just for fun:
int main() {
    static const char* digits = "0123456789";
    int i;
    for (i = 0; i < 10000000; i++) {
        char x[44], *p = x, tmp[20];
        /* sign */
        int j;
        if (i < 0) { *p++ = '-'; j = -i; } else { j = i; }
        /* number */
        int pos = 0, spos;
        do {
            tmp[pos++] = digits[j % 10];
            j /= 10;
        } while (j != 0 && pos < 20);  /* pos < 20 keeps tmp[] in bounds */
        spos = pos;
        do { *p++ = tmp[--pos]; } while (pos > 0);
        /* space, sign, number again */
        *p++ = ' ';
        if (i < 0) *p++ = '-';
        do { *p++ = tmp[--spos]; } while (spos > 0);
        *p++ = '\0';
    }
}
$ gcc -O4 -o s s.c
$ time ./s
real 0m0.140s
user 0m0.138s
sys 0m0.001s
In a practical sense I don't really care whether benchmarks are employing the same algorithms or whether every little detail is the same. I do care what the speed of the idiomatic approach to solving a problem is. That means that if the C++ STL's standard sort were bubble sort and Python's were quicksort, then to the 9 out of 10 programmers who will end up using the standard library, Python would be faster at sorting, even though C++ would be much faster using the same algorithm.
That runs in ~0.4s on my system (Xeon E5520, 2.27GHz), but replacing the 'digits' lookup table with simple arithmetic on the ASCII values ('0' + j%10) speeds it up to ~0.23s. Yes, L1 caches are pretty fast, but ALUs are still faster (for integer addition anyway).
Edit: This was with GCC 4.1.2; newer versions probably optimize differently, so who knows.
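To be concrete, the change being described only touches the digit-extraction loop from the benchmark above (same variable names; just a minimal sketch of the substitution, not a separate program):

do {
    tmp[pos++] = '0' + j % 10;  /* arithmetic instead of digits[j % 10] */
    j /= 10;
} while (j != 0 && pos < 20);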
Interesting... I guess I should have mentioned: gcc 4.5.2 on a Xeon X5670, 2.93GHz, with 12MB of cache. Changing it to ('0' + j % 10) makes no difference in overall speed for me.
OK, I just tested with gcc 4.6.0, and unless I've screwed something up, it looks like (at -O4) it actually optimizes these into the exact same code. As in the generated ELFs are byte-for-byte identical. Impressive.
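For anyone who wants to reproduce the comparison, something like the following is enough (the file names here are just placeholders for the lookup-table and arithmetic variants):

$ gcc -O4 -o s_table s_table.c
$ gcc -O4 -o s_arith s_arith.c
$ cmp s_table s_arith && echo identical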
Well, if you really want to go apples-to-oranges and specialize as much as possible for the sequentially increasing case we're dealing with here, you could just do your arithmetic directly on the ASCII string itself, complete with a little ripple-carry loop... I bet that'd be fast. If I'm remembering correctly, x86 actually has dedicated instructions for ASCII arithmetic (though they're probably so unoptimized in a modern microarchitecture that it'd be faster to avoid them).
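For what it's worth, a rough sketch of that ripple-carry idea might look like this (purely illustrative, not from the original benchmark; inc_ascii and its right-aligned, fixed-width digit buffer are assumptions): the counter lives in the buffer as ASCII digits and is bumped in place each iteration, so there's no divide or modulo at all.

/* Hypothetical sketch: add 1 to a right-aligned ASCII decimal number,
   carrying leftwards from the least significant digit. */
static void inc_ascii(char *buf, int len) {
    int k;
    for (k = len - 1; k >= 0; k--) {
        if (buf[k] != '9') { buf[k]++; return; }  /* no carry past this digit */
        buf[k] = '0';                             /* 9 -> 0, carry continues */
    }
}

In the "%d %d" loop above, each iteration would then just increment the string once and copy it into the output buffer twice.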