
Since we're comparing apples to oranges anyway, how fast could it be if you really wanted to format "%d %d" into the stack 10 million times without a function call... Just for fun:

  int main() {
      static const char* digits = "0123456789";
      int i;
      for (i = 0; i < 10000000; i++) {
          char x[44], *p = x, tmp[20];
  
          /* sign */
          int j;
          if (i < 0) { *p++ = '-'; j = -i; } else { j = i; }
  
          /* number */
          int pos = 0, spos;
          do {
              tmp[pos++] = digits[j % 10];
              j /= 10;
          } while (j != 0 && pos < 20);  /* pos < 20 keeps writes inside tmp[0..19] */
          spos = pos;
          do { *p++ = tmp[--pos]; } while (pos > 0);
  
          /* space, sign, number again */
          *p++ = ' ';
          if (i < 0) *p++ = '-';
          do { *p++ = tmp[--spos]; } while (spos > 0);
          *p++ = '\0';
      }
  }
  
  $ gcc -O4 -o s s.c
  $ time ./s
  real    0m0.140s
  user    0m0.138s
  sys     0m0.001s



In a practical sense I don't really care if benchmarks are employing the same algorithms or if every little detail is the same. I do care what the speed of the idiomatic approach to solving some problem is. That means if the C++ STL's standard sort were bubble sort and Python's were quicksort, then to 9 out of 10 programmers who will end up using the standard library, Python is faster when it comes to sorting, even though C++ would be much faster using the same algorithm.


That runs in ~0.4s on my system (Xeon E5520, 2.27GHz), but replacing the 'digits' lookup table with simple arithmetic on the ASCII values ('0' + j%10) speeds it up to ~0.23s. Yes, L1 caches are pretty fast, but ALUs are still faster (for integer addition anyway).
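
Concretely, the change is just the one line in the parent's digit loop (a sketch, same variable names):

  tmp[pos++] = '0' + j % 10;   /* instead of tmp[pos++] = digits[j % 10] */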

Edit: This was with GCC 4.1.2, newer versions probably optimize differently, so who knows.


Interesting... I guess I should have mentioned this was gcc 4.5.2 on a Xeon X5670, 2.93GHz, which has a 12M cache. Changing it to ('0' + j % 10) makes no difference in overall speed for me.


OK, I just tested with gcc 4.6.0, and unless I've screwed something up, it looks like (at -O4) it actually optimizes these into the exact same code. As in the generated ELFs are byte-for-byte identical. Impressive.


Since you showed interest, using this as the body of the for loop speeds it up even a bit more :)

  /* temp number */
  char tmp[20];
  int pos = 0;
  int j = i > 0 ? i : -i;
  do {
    tmp[pos++] = '0' + j % 10;
    j /= 10;
  } while (j != 0 && pos < 20);  /* pos < 20 stays inside tmp[0..19] */

  /* output both numbers simultaneously */
  char x[44], *p1 = x, *p2 = x + pos + (i < 0 ? 2 : 1);  /* second number starts past the first number, its space, and (if negative) the extra '-' */
  if (i < 0) { *p1++ = '-'; *p2++ = '-'; }
  do {
    int tpos = --pos;
    *p1++ = tmp[tpos]; *p2++ = tmp[tpos];
  } while (pos > 0);
  *p1 = ' ';
  *p2 = '\0';


Well, if you really want to go apples-to-oranges and specialize as much as possible for the sequentially increasing case we're dealing with here, you could just do your arithmetic directly on the ASCII string itself, complete with a little ripple-carry loop... I bet that'd be fast. If I'm remembering correctly, x86 actually has dedicated instructions for ASCII arithmetic (though they're probably so unoptimized in a modern microarchitecture that it'd be faster to avoid them).
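
For what it's worth, here's a minimal sketch of that ripple-carry idea, assuming the buffer has room to grow by one digit (the function name is mine, just for illustration):

  #include <string.h>

  /* Add one to a decimal ASCII string in place, rippling the carry
     from the least significant digit. Assumes buf has room to grow
     by one character on overflow (e.g. "999" -> "1000"). */
  static void inc_ascii(char *buf, int *len) {
      int i = *len - 1;
      while (i >= 0 && buf[i] == '9')
          buf[i--] = '0';               /* trailing 9s roll over to 0 */
      if (i >= 0) {
          buf[i]++;                     /* carry stops here */
      } else {
          memmove(buf + 1, buf, *len);  /* all digits rolled over: prepend a 1 */
          buf[0] = '1';
          (*len)++;
      }
  }

In the benchmark loop you'd keep one such buffer, bump it once per iteration, and copy it into both halves of x, instead of re-dividing by 10 every time.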


Your assignment j = -i doesn't work if i = -2^31: the range of a 32-bit signed int is -2^31 to 2^31-1, so negating INT_MIN overflows (undefined behavior).

(Obviously this code never has i<0 so that branch is never even taken.)
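
For completeness, one common way to handle that edge case (a sketch, not from the original code) is to do the magnitude math in unsigned arithmetic:

  /* 0u - (unsigned)i wraps modulo 2^32, so for i == INT_MIN this
     yields exactly 2^31 -- the magnitude a signed j can't hold. */
  unsigned int j = (i < 0) ? 0u - (unsigned int)i : (unsigned int)i;

The digit loop (j % 10, j /= 10) then works on the unsigned value unchanged.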



