
That runs in ~0.4s on my system (Xeon E5520, 2.27GHz), but replacing the 'digits' lookup table with simple arithmetic on the ASCII values ('0' + j%10) speeds it up to ~0.23s. Yes, L1 caches are pretty fast, but ALUs are still faster (for integer addition anyway).
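To be concrete, the change is just the inner loop; here's a minimal sketch of the two variants (digits[] is my guess at what the lookup table looks like, and the function names are made up):

  /* table version: each digit character is loaded from a small array */
  static const char digits[] = "0123456789";

  static int to_digits_table(unsigned j, char *tmp) {
    int pos = 0;
    do { tmp[pos++] = digits[j % 10]; j /= 10; } while (j != 0);
    return pos;  /* digits stored least-significant first */
  }

  /* arithmetic version: '0'..'9' are contiguous in ASCII, so the digit
     character is just '0' + j % 10 -- an add instead of a table load */
  static int to_digits_arith(unsigned j, char *tmp) {
    int pos = 0;
    do { tmp[pos++] = '0' + j % 10; j /= 10; } while (j != 0);
    return pos;
  }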

Edit: This was with GCC 4.1.2; newer versions probably optimize differently, so who knows.




Interesting... I guess I should have mentioned: gcc 4.5.2 on a Xeon X5670, 2.93GHz, with a 12MB cache. Changing it to ('0' + j % 10) makes no difference in overall speed for me.


OK, I just tested with gcc 4.6.0, and unless I've screwed something up, it looks like (at -O4) it actually optimizes these into the exact same code. As in the generated ELFs are byte-for-byte identical. Impressive.


Since you showed interest: using this as the body of the for loop speeds it up even a bit more :)

  /* temp number: digits of |i|, least significant first */
  char tmp[20];
  int pos = 0;
  int j = i > 0 ? i : -i;
  do {
    tmp[pos++] = '0' + j % 10;
    j /= 10;
  } while (j != 0 && pos < 20);

  /* output both numbers simultaneously: "N N\0" (the second copy starts
     after the separator, one slot later if there's a minus sign) */
  char x[44], *p1 = x, *p2 = x + 1 + pos + (i < 0);
  if (i < 0) { *p1++ = '-'; *p2++ = '-'; }
  do {
    int tpos = --pos;
    *p1++ = tmp[tpos]; *p2++ = tmp[tpos];
  } while (pos > 0);
  *p1 = ' ';
  *p2 = '\0';


Well, if you really want to go apples-to-oranges and specialize as much as possible for the sequentially increasing case we're dealing with here, you could just do the arithmetic directly on the ASCII string itself, complete with a little ripple-carry loop... I bet that'd be fast. If I'm remembering correctly, x86 actually has dedicated instructions for ASCII arithmetic (though they're probably so unoptimized on a modern microarchitecture that it'd be faster to avoid them).
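Something like this, say (a rough sketch of the idea, not anyone's actual benchmark code; ascii_inc and the buffer layout are made up):

  #include <string.h>

  /* Increment a decimal ASCII number in place with a ripple-carry loop.
     buf[0..len-1] holds the digits, most significant first; returns the
     new length.  Caller must leave room for one extra leading digit. */
  static int ascii_inc(char *buf, int len) {
    int k = len - 1;
    while (k >= 0) {
      if (buf[k] != '9') { buf[k]++; return len; }  /* no carry: done */
      buf[k--] = '0';                                /* carry ripples left */
    }
    /* carried out of the most significant digit: prepend a '1' */
    memmove(buf + 1, buf, (size_t)len);
    buf[0] = '1';
    return len + 1;
  }

Since the numbers go up by one each time, nine increments out of ten touch only the last character, which is why I'd expect it to be fast.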



