PyPy is faster than C, again: string formatting (morepypy.blogspot.com)
233 points by Scriptor on Aug 2, 2011 | 76 comments



I love PyPy, those guys are amazing. However, every time I see "faster than C", a bit of trickery/wordplay is almost always involved.

The examples aren't comparable. The equivalent would be to have Python code invoke an external function which sits in a pre-compiled .so file.

The bulk of the work is happening inside of sprintf(), so why handicap C by not letting it compile the code?

The fair comparison would be to place the source of sprintf() nearby and see if the C compiler inlines that call and/or unrolls the loop; otherwise it's just about packaging/linking, not really about code generation.

Edit: I see this became #1 on HN front page today. I want to take advantage of this and say that http://mailgun.net, the programmable email platform, is looking for an engineer who'd find this discussion interesting. See my profile. And we're users of PyPy too! :)


I highly doubt that the performance difference is a result of not inlining/unrolling sprintf(). The difference is most likely due to the fact that sprintf() is parsing the format string over and over, whereas PyPy is only parsing it once.

The printf() family of functions is one of the rare parts of the C standard library that is essentially implementing an interpreter. The format string is a program and the library call runs that program with data that is supplied in the other arguments. So in essence this is comparing an interpreter (sprintf) with a JIT compiler (PyPy generating code for a specific format string at runtime).
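
To make that concrete, here's a rough sketch of the interpreter loop at the heart of any printf-style function (entirely hypothetical, and handling only %d):

  #include <stdarg.h>
  #include <stdio.h>

  /* hypothetical mini-printf: it rescans the format string character by
     character on every single call, dispatching when it sees '%' - that
     re-parsing is the "interpreter" overhead */
  static void mini_printf(const char *fmt, ...) {
      va_list ap;
      const char *p;
      va_start(ap, fmt);
      for (p = fmt; *p != '\0'; p++) {
          if (*p == '%' && *(p + 1) == 'd') {
              printf("%d", va_arg(ap, int));  /* delegate the conversion */
              p++;                            /* skip the 'd' */
          } else {
              putchar(*p);
          }
      }
      va_end(ap);
  }

  int main() {
      mini_printf("%d %d\n", 42, 42);
      return 0;
  }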

I agree it's not a very fair comparison: if this C code was performance critical you would call a specialized function for converting an integer to a string, and this would almost certainly beat PyPy.


>I highly doubt that the performance difference is a result of not inlining/unrolling sprintf(). The difference is most likely due to the fact that sprintf() is parsing the format string over and over, whereas PyPy is only parsing it once.

And the fact that it only runs twice as long is pretty impressive, actually. Sprintf is doing the parsing 9,999,999 more times than PyPy is.


But the whole point of the post was to point out that PyPy can optimize in places that the traditional C model of shared libraries cannot - or, at least, has great difficulty doing so. This is an inherent advantage to optimizing the instructions at runtime.


Shared libraries are an operating system feature and have nothing to do with C. C just happens to be the most popular systems programming language. The "traditional C model" is just a bunch of object files, and it's up to you and your linker to package them any way you want: a shared library, a static library or an executable.
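
For instance, the same objects can be packaged any of these ways (file names made up):

  $ gcc -c foo.c bar.c                          # plain object files
  $ ar rcs libfoo.a foo.o bar.o                 # static library
  $ gcc -shared -fPIC -o libfoo.so foo.c bar.c  # shared library
  $ gcc -o prog main.c foo.o bar.o              # or straight into an executable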

I'm not a GCC expert, but Microsoft C compilers have been able to inline cross-module code since the beginning of time. His example may actually produce different results under MSVC with the /GL flag and static linking enabled.

"...Inline a function in a module even when the function is defined in another module..." http://msdn.microsoft.com/en-us/library/0zza0de8%28v=vs.80%2...


GCC got that option recently with '-flto' (or whatever it was to enable link-time optimization)
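
Roughly, assuming a GCC new enough to have it (file names made up):

  $ gcc -O2 -flto -c fmt.c
  $ gcc -O2 -flto -c main.c
  $ gcc -O2 -flto fmt.o main.o -o prog   # cross-module inlining happens at link time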


Is anyone working on a JIT for object code? I'm thinking we could keep all our .so's as they are now, since they're great for updating versions of libraries with bugs, but have a JIT that takes care of profiling and inlining functions from other modules.


LLVM is designed to be able to do this.


See my reply below. I'm fully aware of the relationship C has to systems programming.


Depends on their GCC options. With the defaults, it should be using the shared library. If they changed it to statically link libc, that should produce a (possibly considerable) performance boost.
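
(For reference, that change would be something like the following; the file name is made up:)

  $ gcc -O4 -static bench.c -o bench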


C the language doesn't have shared libraries. That's something some systems add themselves.

Edit: Is this not the case? I am under the impression that it is, and is relatively new at that.


Which is why I said "the traditional C model of shared libraries." Shared libraries are not a part of the C language itself, but most real systems make heavy use of shared libraries. Even statically linked C programs have this problem, because optimizing across object files is hard - most compilers don't even try.

Not being a part of the C language itself is irrelevant. The traditional linker has been a part of the C ecosystem for decades.


Statically linked programs certainly still have this problem; I'm not contending that at all. I would not, however, call shared libraries a "traditional model" at all, particularly in the fields that C seems to be getting relegated to more and more these days.


If you're running a Unix derived system, most of the programs are using shared libraries. That fits "traditional" in my mind.


A sufficiently modern Unix derived system, sure ;) I have quite commonly seen embedded systems skip shared libraries, however, regardless of heritage.


And I don't consider embedded systems to be traditional. Otherwise, we wouldn't have to qualify it with "embedded."


They are certainly a traditional stronghold of straight up C.


The embedded systems I work on use shared libraries, C++, shell scripts, and even Adobe Flash.

Dynamic linking has been around since the 1960s, and is the model that has been used by almost all currently active developers for their entire careers. Arguments based around specialized microcontrollers, hard real-time systems, life-or-death control systems, etc. are off in the woods. Few developers use such systems in the real world, and their code is statistically irrelevant to comparisons between e.g. Python and C.


"The embedded systems I work on use shared libraries, C++, shell scripts, and even Adobe Flash."

Consider yourself lucky...


PyPy tries to move as much code as possible out of C and into (R)Python for the exact reason that this kind of optimization becomes possible/feasible/easier with a JIT. So it's a packaging win, sure, but it's not an accidental one.


Just including vprintf is not enough. Here are some timings (GCC-4.2.1, system compiler on OpenBSD-5.0-current/amd64 on a Thinkpad SL510; all timings are median-of-three):

Provided program, -O4: ~5.06s

Provided program, -O4 -march=native: ~5.07s (i.e. no difference)

Provided program, includes sprintf and the vprintf family, -O4 -march=native: ~4.84s

Of course, this is just a silly example (realistic programs don't make things that easy on the JIT). Still, quite impressive; congratulations to the PyPy guys!


> However, every time I see "faster than C" almost always a bit of trickery/wordplay is involved.

yeah, it's almost like those "$something in only $few lines of $programming_language" posts where you get highly obfuscated code to read.


Since we're comparing apples to oranges anyway, how fast could it be if you really wanted to format "%d %d" into the stack 10 million times without a function call... Just for fun:

  int main() {
      static const char* digits = "0123456789";
      int i;
      for (i = 0; i < 10000000; i++) {
          char x[44], *p = x, tmp[20];
  
          /* sign */
          int j;
          if (i < 0) { *p++ = '-'; j = -i; } else { j = i; }
  
          /* number */
          int pos = 0, spos;
          do {
              tmp[pos++] = digits[j % 10];
              j /= 10;
          } while (j != 0 && pos < 20);  /* tmp has 20 slots: indices 0..19 */
          spos = pos;
          do { *p++ = tmp[--pos]; } while (pos > 0);
  
          /* space, sign, number again */
          *p++ = ' ';
          if (i < 0) *p++ = '-';
          do { *p++ = tmp[--spos]; } while (spos > 0);
          *p++ = '\0';
      }
  }
  
  $ gcc -O4 -o s s.c
  $ time ./s
  real    0m0.140s
  user    0m0.138s
  sys     0m0.001s


In a practical sense I don't really care if benchmarks are employing the same algorithms or if every little detail is the same. I do care what the speed of the idiomatic approach to solving some problem is. That means if the C++ STL's standard sort were bubble sort and Python's were quicksort, then to the 9 out of 10 programmers who will end up using the standard library, Python is faster when it comes to sorting, even though C++ would be much faster using the same algorithm.


That runs in ~0.4s on my system (Xeon E5520, 2.27GHz), but replacing the 'digits' lookup table with simple arithmetic on the ASCII values ('0' + j%10) speeds it up to ~0.23s. Yes, L1 caches are pretty fast, but ALUs are still faster (for integer addition anyway).

Edit: This was with GCC 4.1.2, newer versions probably optimize differently, so who knows.


Interesting... I guess I should have mentioned gcc 4.5.2 on a Xeon X5670, 2.93GHz, which has a 12M cache. Changing it to ('0' + j % 10) makes no change in overall speed for me.


OK, I just tested with gcc 4.6.0, and unless I've screwed something up, it looks like (at -O4) it actually optimizes these into the exact same code. As in the generated ELFs are byte-for-byte identical. Impressive.
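
(If anyone wants to reproduce that check, something like this should work; file names made up:)

  $ gcc -O4 -o v1 lookup.c
  $ gcc -O4 -o v2 arith.c
  $ cmp v1 v2 && echo identical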


Since you showed interest, using this as the loop body tweaks it even a bit more :)

  /* temp number */
  char tmp[20];
  int pos = 0;
  int j = i > 0 ? i : -i;
  do {
    tmp[pos++] = '0' + j % 10;
    j /= 10;
  } while (j != 0 && pos < 20);  /* tmp has 20 slots: indices 0..19 */

  /* output both numbers simultaneously */
  char x[44], *p1 = x, *p2 = x + 1 + pos;
  if (i < 0) { *p1++ = '-'; *p2++ = '-'; }
  do {
    int tpos = --pos;
    *p1++ = tmp[tpos]; *p2++ = tmp[tpos];
  } while (pos > 0);
  *p1 = ' ';
  *p2 = '\0';


Well, if you really want to go apples-to-oranges and specialize as much as possible for the sequentially increasing case we're dealing with here, you could just do your arithmetic directly on the ASCII string itself, complete with a little ripple-carry loop... I bet that'd be fast. If I'm remembering correctly, x86 actually has dedicated instructions for ASCII arithmetic (though they're probably so unoptimized in a modern microarchitecture that it'd be faster to avoid them).
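
Something like this rough sketch, say - it assumes the buffer is right-aligned and zero-padded with enough headroom to absorb carries:

  #include <stdio.h>

  /* hypothetical: keep the counter as a zero-padded ASCII string and
     increment it in place with a ripple-carry loop */
  static void ascii_increment(char *buf, int len) {
      int pos = len - 1;
      while (pos >= 0) {
          if (buf[pos] < '9') { buf[pos]++; return; }  /* no carry */
          buf[pos--] = '0';                            /* carry ripples left */
      }
  }

  int main() {
      char num[11] = "0000000000";
      int i;
      for (i = 0; i < 10000000; i++)
          ascii_increment(num, 10);
      printf("%s\n", num);  /* prints 0010000000 */
      return 0;
  }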


Your assignment j=-i doesn't work if i=-2^31, since the range of a 32-bit signed int is -2^31 to 2^31-1.

(Obviously this code never has i<0 so that branch is never even taken.)
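
(If you did need to handle it, one common trick is to negate in unsigned arithmetic, where wraparound is well defined; a minimal sketch:)

  #include <limits.h>
  #include <stdio.h>

  int main() {
      int i = INT_MIN;
      /* 0u - (unsigned)i is well defined even for INT_MIN */
      unsigned int j = (i < 0) ? 0u - (unsigned int)i : (unsigned int)i;
      printf("%u\n", j);  /* 2147483648 */
      return 0;
  }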


    char x[44];
    sprintf(x, "%d %d", i, i);
This is fine, except you can't even return x from this function; a fairer comparison might be:

    char *x = malloc(44 * sizeof(char));
    sprintf(x, "%d %d", i, i);
There is a standard (C99) way to do this: asprintf(3).

    char *x;
    asprintf(&x, "%d %d", i, i);
    return x;


There is no asprintf function in my copy of the C99 standard.


Sorry, you are correct; I mis-read the man page. It is however in at least glibc and Darwin/FreeBSD's libc.


asprintf is a GNU extension, I believe.
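
For what it's worth, a portable C99 way to get the same effect is to ask snprintf for the required length first. A rough sketch (the function name is made up):

  #include <stdio.h>
  #include <stdlib.h>

  /* C99 guarantees snprintf(NULL, 0, ...) returns the number of
     characters needed, excluding the terminating NUL */
  char *format_pair(int i) {
      int len = snprintf(NULL, 0, "%d %d", i, i);
      char *x;
      if (len < 0)
          return NULL;
      x = malloc(len + 1);
      if (x != NULL)
          snprintf(x, len + 1, "%d %d", i, i);
      return x;
  }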


The pypy guys continue to do amazing work. This project is one of the reasons why I believe the python community is one of the best open source communities out there: let's work on something incredibly difficult and challenging, something that not long ago was considered nearly impossible, and produce incredible results.

If you're not running pypy in production already, then you probably should be[1].

[1]: Yes, there are some obvious exceptions.

edit: formatting.


Re:

> GCC is unable to inline or unroll the sprintf call, because it sits inside of libc.

If I'm understanding http://gcc.gnu.org/onlinedocs/gcc-4.5.3/gcc/Other-Builtins.h... correctly, sprintf should be handled as a built-in function, rather than linking the libc version, unless you explicitly specify -fno-builtin. In theory that should allow gcc to perform various optimizations; I've seen that happen with printf at least, where e.g. printf-ing a constant string just gets compiled to puts.
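
For instance, this is easy to see in the generated assembly (gcc -O2 -S):

  #include <stdio.h>

  int main() {
      /* with the printf builtin enabled (the default), GCC typically
         compiles this call down to puts("hello") at -O1 and above */
      printf("hello\n");
      return 0;
  }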


Perhaps it is recognizing it, but it's not doing anything interesting with that knowledge; on my machine GCC emits a call to "__sprintf_chk".


There's nothing really stopping gcc from doing as well here, since it ought to be able to spot that the format string never changes & roll out a custom sprintf() that doesn't need to parse it every time. However, right now gcc doesn't do that, so it loses to a language implementation that does.

The really sad thing is that using an ostringstream in C++ is even worse, despite the fact that C++ has all the types available to it & doesn't need to parse any format strings at all: not enough template metaprogramming, clearly!


I'm no python developer but this:

  def main():
      for i in xrange(10000000):
          "%d %d" % (i, i)
  main()

Doesn't seem to be copying the result anywhere, whereas the C example is copying the result to memory... which would explain why it is slower.


Yes, it does seem a little strange that the result is unused; however, changing that to ``x = <stuff>`` would have no effect on the runtime. The compiler isn't quite good enough yet to realize the result is unused.


The more fundamental mismatch is that they're testing an interpreter (libc's printf implementation) against a compiler (pypy's format string jit). A C++0x or D library that parsed format strings at compile time would likely still beat pypy.


But Pypy can optimize dynamic strings, while a static compiler can't.

Replace the constant string with argv[1], and run the code: for Pypy, there is no real difference, but the C/C++/D compilers are unable to optimize.


I think this is still a library problem, not a language problem. "Compiling" a string at runtime for `sprintf` isn't any more difficult than doing it at compile-time, and plenty of regex libraries do very similar things.
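
Something like this rough sketch of the idea - parse once into an op list (much like regcomp does), then replay it in the loop without ever rescanning for '%'; all names made up:

  #include <stdio.h>

  enum op { OP_LIT, OP_INT, OP_END };
  struct step { enum op op; char lit; };

  /* "%d %d" pre-parsed once: int, literal ' ', int */
  static const struct step compiled_fmt[] = {
      { OP_INT, 0 }, { OP_LIT, ' ' }, { OP_INT, 0 }, { OP_END, 0 }
  };

  static void run_fmt(const struct step *s, char *out, int a, int b) {
      int args[2], n = 0;
      args[0] = a; args[1] = b;
      for (; s->op != OP_END; s++) {
          if (s->op == OP_LIT)
              *out++ = s->lit;
          else
              out += sprintf(out, "%d", args[n++]);  /* one conversion, no scan */
      }
      *out = '\0';
  }

  int main() {
      char buf[44];
      run_fmt(compiled_fmt, buf, 7, 7);
      printf("%s\n", buf);  /* prints "7 7" */
      return 0;
  }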

Compiling of printf strings isn't done, though, because nobody cares about string performance unless they're writing UNIX command-line utils. If you're writing C you're probably only dealing with strings for (infrequent) IO, spending the vast majority of your time crunching away on pointers and integer types (floats in niche cases). In the end, you're probably only printing something out because someone needs to read it, and how fast can humans read, anyway?

This goes just as much for the printing of floating point numbers. (http://www.serpentine.com/blog/2011/06/29/here-be-dragons-ad...)


Oh and does anyone else think the malloc example in comparison to actual garbage collection is incredibly unfair?


I've long since lost track of what it means for a comparison between languages to be fair. To put that a different way, how should we have compared them? Should we have downloaded some GC library for C to implement it?


I guess the real problem is that the example serves no real world purpose and can very easily be biased. See my response to scott_s below.


"Fairness" is not really relevant. As long as they're comparing typical C idioms to typical Python idioms, it's valid.


So why not simply do:

  #include <stdio.h>
  #include <stdlib.h>

  int main() {
      int i = 0;
      char *x = malloc(44 * sizeof(char));

      for (i = 0; i < 10000000; i++) {
          sprintf(x, "%d %d", i, i);
      }

      free(x);
  }

Their version was intentionally biased, which is a shame.


Probably because they wanted to simulate calling a function that parsed a string, and such functions will have to allocate and deallocate memory as needed.

It's also worth noting that calling malloc and free in a tight loop where you're always requesting the same amount of memory will be pretty fast. Good implementations of malloc - and glibc's certainly is one - will consistently return the exact same chunk of memory to you, and you will be on the fast-path of the allocation algorithm.
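
Easy to see for yourself - on glibc this will typically print the same address twice:

  #include <stdio.h>
  #include <stdlib.h>

  int main() {
      char *a, *b;
      a = malloc(44);
      printf("%p\n", (void *)a);
      free(a);
      b = malloc(44);  /* typically the very same chunk again */
      printf("%p\n", (void *)b);
      free(b);
      return 0;
  }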


Why would you assume the need to deallocate in the loop? Reuse of an already allocated memory location is C optimisation 101.


You're looking at this from the wrong angle. You're looking at the C code and thinking, "How can I optimize this?"

What you need to do in this case is look at it and say, "How can I optimize this and still retain the essence of what I want to test?" Your optimizations remove that essence - if you're calling a function that is a part of an API, it will have to allocate and free its own memory. That the code is in a loop is an artifact of the experiment.


I'm looking at it from the angle of someone that has spent over a decade in and out of C. The code in the loop is in NO way an "artifact"; it's absolutely critical to the comparison. My example in no way removes the essence, especially if we are using "idioms".


I said the loop is an artifact, not the code in the loop. Their experimental design is "Do x N times, then divide the overall runtime of the program by N to estimate the cost of x." If timing a program that did x once led to good results, they would do that - but it doesn't, so they don't. You're trying to optimize around the loop, but by doing that, you're changing the x that is being measured.


Two things potentially different between the two implementations:

1. The C code is stack allocating a pointer in every loop iteration.

2. Cleaning up the memory without (potentially) triggering Python's GC.

Is the Python GC getting triggered in this scenario? If not, then this is what's actually happening, with the cleanup happening automatically when the process exits:

  char *x[10000000];
  int i;

  for (i = 0; i < 10000000; i++) {
      x[i] = malloc(44 * sizeof(char));
      sprintf(x[i], "%d %d", i, i);
  }
If GC is thrown into the mix, the C code is really:

  char *x[1000];
  int i;
  int j = 0;

  for (i = 0; i < 10000000; i++) {
      x[j] = malloc(44 * sizeof(char));
      sprintf(x[j], "%d %d", i, i);
      j++;
      if (j == 1000) {
          for (j = 0; j < 1000; ++j)
              free(x[j]);
          j = 0;  /* reset so the next batch starts over */
      }
  }


If by x you mean performing a string operation and copying the result to memory... I still fail to see how allocating and deallocating memory on each iteration is a fair comparison; it's not something I'd expect a competent C developer to do, ergo how is that a relevant comparison?

I know the aim is to show off the string operation in and of itself so why not leave it there, why the need to bring garbage collection into the mix?

Edit: it's the way the garbage collection has been thrown into the mix I'm having difficulty understanding the need for.


If you didn't notice, their version was based off of static allocation for the character array, which is the kind of thing a sharp C programmer will put in as an optimization. Note that the Python version is already pretty optimized, according to general Python idioms.

You are arguing that they should have compared against a less efficient C example, which honestly boggles my mind.


Instantiation for side effects happens in Python. Unless PyPy can prove that the string isn't used anywhere (and it can't, really), it has to execute the statement and push the object onto the stack before discarding it.


Why can't it?


PyPy simply doesn't have such a thing yet, that's all. I don't think there's any technical reason that it couldn't be developed at some point in the future.


It's a matter of purity inference. They already trivially know the value isn't used based on the AST, so they need to show that the function being called is pure to elide it. Naturally, in the general case this is undecidable, but there are fast algorithms for very weak purity inference that might do the trick, at least in this case.

There may be other Python-specific concerns I'm missing - my work in this area is in Ruby static analysis - but one other thing is that allocating memory is a side-effect itself. The main side-effect visible to Python is that it might raise an exception for being out of memory - this would have to be special-cased as an acceptably ignored side-effect by any purity analyzer.


Sounds like this would be the equivalent of some theoretical

    exp = CompileSprintfFormat("%d %d");
    for (i = 0; i < 10000000; i++) {
        RunCompiledSprintf(exp, i, i);
    }
All I'm really getting out of this is that PyPy now compiles sprintf formats for you and saves the results, and that there's no equivalent API in libc.


That's not really it. This, at least in my understanding, is a fully generic optimization of operations on constant strings.


From the article:

> In the case of PyPy, we specialize the assembler if we detect the left hand string of the modulo operator to be constant.

So it's very much a targeted optimization at the modulo operator (which is Python's equivalent of sprintf).


I wouldn't say it's particularly targeted, let me show you the code that makes this happen: https://bitbucket.org/pypy/pypy/src/unroll-if-alt/pypy/objsp...


Yeah, the point is that by expressing the algorithm in Python, the JIT gets to go hog wild optimising tight loops like this one.

Which is great: why do all that work writing some kind of custom sprintf generating function when you can just let the JIT do it for you on the fly?


In that case, wouldn't it be more fair to compare it to a C++ re-implementation of sprintf using the new `constexpr` keyword?


You shouldn't have to use such an API, though, and this case could be optimized by any C compiler.


Why don't you take an example from Mike Pall's LuaJIT relying on something more computationally expensive, like scimark - he has a Lua version, and also a few other benchmarks (from the alioth site).

If you really want a fast sprintf(s, "%d %d", ...) - then you might as well craft something specifically for converting decimal numbers to text.

sprintf() is a convenience function, not a performance one.


I'm sure that's one of the larger goals of PyPy, to make idiomatic code that is convenient to write also perform well. They don't have to be mutually exclusive.


What the example shows is that specializing string operations for known inputs can be faster than not doing so. Surprise!

But going from this to "PyPy is faster than C" seems quite a stretch, no?


> compiled with GCC 4.5.2 at -O4 (other optimization levels were tested, this produced the best performance).

I was under the impression that any number greater than 3 had no effect?


Why be so shy? Faster than Assembly language! Faster than Assembly language with cache size alignment and padding of data structures! ^_^


If PyPy were really smart, it would notice the unused result and take 0.00s. :)


How did they measure this? How would it compare to C#? I'm assuming they ignore the startup time, of course.


This title is misleading; instead of `C` it should use `CPython`.


Why would it use that title? CPython is a part of some of the numbers, but the comparison is C vs. PyPy.



