PyPy is faster than C, again: string formatting (morepypy.blogspot.com)
233 points by Scriptor on Aug 2, 2011 | 76 comments



I love PyPy, those guys are amazing. However, every time I see "faster than C", a bit of trickery/wordplay is almost always involved.

The examples aren't comparable. The equivalent would be to have Python code invoke an external function which sits in a pre-compiled .so file.

The bulk of the work is happening inside of sprintf(), so why handicap C by not letting it compile the code?

The fair comparison would be to place the source of sprintf() nearby and see if the C compiler inlines that call and/or unrolls the loop; otherwise it's just about packaging/linking, not really about code generation.

Edit: I see this became #1 on HN front page today. I want to take advantage of this and say that http://mailgun.net, the programmable email platform, is looking for an engineer who'd find this discussion interesting. See my profile. And we're users of PyPy too! :)


I highly doubt that the performance difference is a result of not inlining/unrolling sprintf(). The difference is most likely due to the fact that sprintf() is parsing the format string over and over, whereas PyPy is only parsing it once.

The printf() family of functions is one of the rare parts of the C standard library that is essentially implementing an interpreter. The format string is a program and the library call runs that program with data that is supplied in the other arguments. So in essence this is comparing an interpreter (sprintf) with a JIT compiler (PyPy generating code for a specific format string at runtime).
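
To make that concrete, here's a rough sketch of the interpreter loop at the heart of any printf-style function (entirely hypothetical, and handling only %d):

  #include <stdarg.h>
  #include <stdio.h>

  /* hypothetical mini-printf: it rescans the format string character by
     character on every single call, dispatching when it sees '%' - that
     re-parsing is the "interpreter" overhead */
  static void mini_printf(const char *fmt, ...) {
      va_list ap;
      const char *p;
      va_start(ap, fmt);
      for (p = fmt; *p != '\0'; p++) {
          if (*p == '%' && *(p + 1) == 'd') {
              printf("%d", va_arg(ap, int));  /* delegate the conversion */
              p++;                            /* skip the 'd' */
          } else {
              putchar(*p);
          }
      }
      va_end(ap);
  }

  int main() {
      mini_printf("%d %d\n", 42, 42);
      return 0;
  }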

I agree it's not a very fair comparison: if this C code was performance critical you would call a specialized function for converting an integer to a string, and this would almost certainly beat PyPy.


>I highly doubt that the performance difference is a result of not inlining/unrolling sprintf(). The difference is most likely due to the fact that sprintf() is parsing the format string over and over, whereas PyPy is only parsing it once.

And the fact that it only runs twice as long is pretty impressive, actually. Sprintf is doing the parsing 9,999,999 more times than PyPy is.


But the whole point of the post was to point out that PyPy can optimize in places that the traditional C model of shared libraries cannot - or, at least, has great difficulty doing so. This is an inherent advantage to optimizing the instructions at runtime.


Shared libraries are an operating system feature and have nothing to do with C. C just happens to be the most popular systems programming language. The "traditional C model" is just a bunch of object files, and it's up to you and your linker to package them any way you want: a shared library, a static library or an executable.
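
For instance, the same objects can be packaged any of these ways (file names made up):

  $ gcc -c foo.c bar.c                          # plain object files
  $ ar rcs libfoo.a foo.o bar.o                 # static library
  $ gcc -shared -fPIC -o libfoo.so foo.c bar.c  # shared library
  $ gcc -o prog main.c foo.o bar.o              # or straight into an executable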

I'm not a GCC expert, but Microsoft C compilers have been able to inline cross-module code since the beginning of time. His example may actually produce different results under MSVC with the /GL flag and static linking enabled.

"...Inline a function in a module even when the function is defined in another module..." http://msdn.microsoft.com/en-us/library/0zza0de8%28v=vs.80%2...


GCC got that option recently with '-flto' (or whatever it was to enable link-time optimization)
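
Roughly, assuming a GCC new enough to have it (file names made up):

  $ gcc -O2 -flto -c fmt.c
  $ gcc -O2 -flto -c main.c
  $ gcc -O2 -flto fmt.o main.o -o prog   # cross-module inlining happens at link time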


Is anyone working on a JIT for object code? I'm thinking we could keep all our .so's as they are now, since they're great for updating versions of libraries with bugs, but have a JIT that takes care of profiling and inlining functions from other modules.


LLVM is designed to be able to do this.


See my reply below. I'm fully aware of the relationship C has to systems programming.


Depends on their GCC options. With the defaults, it should be using the shared library. If they changed it to statically link libc, that should produce a (possibly considerable) performance boost.
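
(For reference, that change would be something like the following; the file name is made up:)

  $ gcc -O4 -static bench.c -o bench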


C the language doesn't have shared libraries. That's something some systems add themselves.

Edit: Is this not the case? I am under the impression that it is, and is relatively new at that.


Which is why I said "the traditional C model of shared libraries." Shared libraries are not a part of the C language itself, but most real systems make heavy use of shared libraries. Even statically linked C programs have this problem, because optimizing across object files is hard - most compilers don't even try.

Not being a part of the C language itself is irrelevant. The traditional linker has been a part of the C ecosystem for decades.


Statically linked programs certainly still have this problem; I'm not contending that at all. I would not, however, call shared libraries a "traditional model" at all, particularly in the fields that C seems to be getting relegated to more and more these days.


If you're running a Unix derived system, most of the programs are using shared libraries. That fits "traditional" in my mind.


A sufficiently modern Unix derived system, sure ;) I have quite commonly seen embedded systems skip shared libraries, however, regardless of heritage.


And I don't consider embedded systems to be traditional. Otherwise, we wouldn't have to qualify it with "embedded."


They are certainly a traditional stronghold of straight up C.


The embedded systems I work on use shared libraries, C++, shell scripts, and even Adobe Flash.

Dynamic linking has been around since the 1960s, and is the model that has been used by almost all currently active developers for their entire careers. Arguments based around specialized microcontrollers, hard real-time systems, life-or-death control systems, etc. are off in the woods. Few developers use such systems in the real world, and their code is statistically irrelevant to comparisons between e.g. Python and C.


"The embedded systems I work on use shared libraries, C++, shell scripts, and even Adobe Flash."

Consider yourself lucky...


PyPy tries to move as much code as possible out of C and into (R)Python for the exact reason that this kind of optimization becomes possible/feasible/easier with a JIT. So it's a packaging win, sure, but it's not an accidental one.


Just including vprintf is not enough. Here are some timings (GCC-4.2.1, system compiler on OpenBSD-5.0-current/amd64 on a Thinkpad SL510; all timings are median-of-three):

Provided program, -O4: ~5.06s

Provided program, -O4 -march=native: ~5.07s (i.e. no difference)

Provided program, includes sprintf and the vprintf family, -O4 -march=native: ~4.84s

Of course, this is just a silly example (realistic programs don't make things that easy on the JIT). Still, quite impressive; congratulations to the PyPy guys!


> However, every time I see "faster than C" almost always a bit of trickery/wordplay is involved.

yeah, it's almost like those "$something in only $few lines of $programming_language" posts where you get highly obfuscated code to read.


Since we're comparing apples to oranges anyway, how fast could it be if you really wanted to format "%d %d" into the stack 10 million times without a function call... Just for fun:

  int main() {
      static const char* digits = "0123456789";
      int i;
      for (i = 0; i < 10000000; i++) {
          char x[44], *p = x, tmp[20];
  
          /* sign */
          int j;
          if (i < 0) { *p++ = '-'; j = -i; } else { j = i; }
  
          /* number */
          int pos = 0, spos;
          do {
              tmp[pos++] = digits[j % 10];
              j /= 10;
          } while (j != 0 && pos < 20);  /* tmp has 20 slots: indices 0..19 */
          spos = pos;
          do { *p++ = tmp[--pos]; } while (pos > 0);
  
          /* space, sign, number again */
          *p++ = ' ';
          if (i < 0) *p++ = '-';
          do { *p++ = tmp[--spos]; } while (spos > 0);
          *p++ = '\0';
      }
  }
  
  $ gcc -O4 -o s s.c
  $ time ./s
  real    0m0.140s
  user    0m0.138s
  sys     0m0.001s


In a practical sense I don't really care if benchmarks are employing the same algorithms or if every little detail is the same. I do care what the speed of the idiomatic approach to solving some problem is. That means if the C++ STL's standard sort were bubble sort and Python's were quicksort, then to the 9 out of 10 programmers who will end up using the standard library, Python is faster when it comes to sorting, even though C++ would be much faster using the same algorithm.


That runs in ~0.4s on my system (Xeon E5520, 2.27GHz), but replacing the 'digits' lookup table with simple arithmetic on the ASCII values ('0' + j%10) speeds it up to ~0.23s. Yes, L1 caches are pretty fast, but ALUs are still faster (for integer addition anyway).

Edit: This was with GCC 4.1.2, newer versions probably optimize differently, so who knows.


Interesting... I guess I should have mentioned gcc 4.5.2 on a Xeon X5670, 2.93GHz, which has a 12M cache. Changing it to ('0' + j % 10) makes no change in overall speed for me.


OK, I just tested with gcc 4.6.0, and unless I've screwed something up, it looks like (at -O4) it actually optimizes these into the exact same code. As in the generated ELFs are byte-for-byte identical. Impressive.
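
(If anyone wants to reproduce that check, something like this should work; file names made up:)

  $ gcc -O4 -o v1 lookup.c
  $ gcc -O4 -o v2 arith.c
  $ cmp v1 v2 && echo identical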


Since you showed interest, using this as the loop body tweaks it even a bit more :)

  /* temp number */
  char tmp[20];
  int pos = 0;
  int j = i > 0 ? i : -i;
  do {
    tmp[pos++] = '0' + j % 10;
    j /= 10;
  } while (j != 0 && pos < 20);  /* tmp has 20 slots: indices 0..19 */

  /* output both numbers simultaneously */
  char x[44], *p1 = x, *p2 = x + 1 + pos;
  if (i < 0) { *p1++ = '-'; *p2++ = '-'; }
  do {
    int tpos = --pos;
    *p1++ = tmp[tpos]; *p2++ = tmp[tpos];
  } while (pos > 0);
  *p1 = ' ';
  *p2 = '\0';


Well, if you really want to go apples-to-oranges and specialize as much as possible for the sequentially increasing case we're dealing with here, you could just do your arithmetic directly on the ASCII string itself, complete with a little ripple-carry loop... I bet that'd be fast. If I'm remembering correctly, x86 actually has dedicated instructions for ASCII arithmetic (though they're probably so unoptimized in a modern microarchitecture that it'd be faster to avoid them).
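
Something like this rough sketch, say - it assumes the buffer is right-aligned and zero-padded with enough headroom to absorb carries:

  #include <stdio.h>

  /* hypothetical: keep the counter as a zero-padded ASCII string and
     increment it in place with a ripple-carry loop */
  static void ascii_increment(char *buf, int len) {
      int pos = len - 1;
      while (pos >= 0) {
          if (buf[pos] < '9') { buf[pos]++; return; }  /* no carry */
          buf[pos--] = '0';                            /* carry ripples left */
      }
  }

  int main() {
      char num[11] = "0000000000";
      int i;
      for (i = 0; i < 10000000; i++)
          ascii_increment(num, 10);
      printf("%s\n", num);  /* prints 0010000000 */
      return 0;
  }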


Your assignment j=-i doesn't work if i=-2^31, since the range of a 32-bit signed int is -2^31 to 2^31-1.

(Obviously this code never has i<0 so that branch is never even taken.)
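
(If you did need to handle it, one common trick is to negate in unsigned arithmetic, where wraparound is well defined; a minimal sketch:)

  #include <limits.h>
  #include <stdio.h>

  int main() {
      int i = INT_MIN;
      /* 0u - (unsigned)i is well defined even for INT_MIN */
      unsigned int j = (i < 0) ? 0u - (unsigned int)i : (unsigned int)i;
      printf("%u\n", j);  /* 2147483648 */
      return 0;
  }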


    char x[44];
    sprintf(x, "%d %d", i, i);
This is fine, except you can't even return x from this function; a fairer comparison might be:

    char *x = malloc(44 * sizeof(char));
    sprintf(x, "%d %d", i, i);
There is a standard (C99) way to do this: asprintf(3).

    char *x;
    asprintf(&x, "%d %d", i, i);
    return x;


There is no asprintf function in my copy of the C99 standard.


Sorry, you are correct; I mis-read the man page. It is however in at least glibc and Darwin/FreeBSD's libc.


asprintf is a GNU extension, I believe.
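
For what it's worth, a portable C99 way to get the same effect is to ask snprintf for the required length first. A rough sketch (the function name is made up):

  #include <stdio.h>
  #include <stdlib.h>

  /* C99 guarantees snprintf(NULL, 0, ...) returns the number of
     characters needed, excluding the terminating NUL */
  char *format_pair(int i) {
      int len = snprintf(NULL, 0, "%d %d", i, i);
      char *x;
      if (len < 0)
          return NULL;
      x = malloc(len + 1);
      if (x != NULL)
          snprintf(x, len + 1, "%d %d", i, i);
      return x;
  }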


The pypy guys continue to do amazing work. This project is one of the reasons why I believe the python community is one of the best open source communities out there: let's work on something incredibly difficult and challenging, something that not long ago was considered nearly impossible, and produce incredible results.

If you're not running pypy in production already, then you probably should be[1].

[1]: Yes, there are some obvious exceptions.

edit: formatting.


Re:

> GCC is unable to inline or unroll the sprintf call, because it sits inside of libc.

If I'm understanding http://gcc.gnu.org/onlinedocs/gcc-4.5.3/gcc/Other-Builtins.h... correctly, sprintf should be handled as a built-in function, rather than linking the libc version, unless you explicitly specify -fno-builtin. In theory that should allow gcc to perform various optimizations; I've seen that happen with printf at least, where e.g. printf-ing a constant string just gets compiled to puts.
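
For instance, this is easy to see in the generated assembly (gcc -O2 -S):

  #include <stdio.h>

  int main() {
      /* with the printf builtin enabled (the default), GCC typically
         compiles this call down to puts("hello") at -O1 and above */
      printf("hello\n");
      return 0;
  }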


Perhaps it is recognizing it, but it's not doing anything interesting with that knowledge; on my machine GCC emits a call to "__sprintf_chk".


There's nothing really stopping gcc from doing as well here, since it ought to be able to spot that the format string never changes & roll out a custom sprintf() that doesn't need to parse it every time. However, right now gcc doesn't do that, so it loses to a language implementation that does.

The really sad thing is that using an ostringstream in C++ is even worse, despite the fact that C++ has all the types available to it & doesn't need to parse any format strings at all: not enough template metaprogramming, clearly!


I'm no python developer but this:

  def main():
      for i in xrange(10000000):
          "%d %d" % (i, i)
  main()

Doesn't seem to be copying the result anywhere, whereas the C example is copying the result to memory... which would explain why it is slower.


Yes, it does seem a little strange that the result is unused; however, changing that to ``x = <stuff>`` would have no effect on the runtime. The compiler isn't quite good enough yet to realize the result is unused.


The more fundamental mismatch is that they're testing an interpreter (libc's printf implementation) against a compiler (pypy's format string jit). A C++0x or D library that parsed format strings at compile time would likely still beat pypy.


But Pypy can optimize dynamic strings, while a static compiler can't.

Replace the constant string with argv[1], and run the code: for Pypy, there is no real difference, but the C/C++/D compilers are unable to optimize.


I think this is still a library problem, not a language problem. "Compiling" a string at runtime for `sprintf` isn't any more difficult than doing it at compile-time, and plenty of regex libraries do very similar things.
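
Something like this rough sketch of the idea - parse once into an op list (much like regcomp does), then replay it in the loop without ever rescanning for '%'; all names made up:

  #include <stdio.h>

  enum op { OP_LIT, OP_INT, OP_END };
  struct step { enum op op; char lit; };

  /* "%d %d" pre-parsed once: int, literal ' ', int */
  static const struct step compiled_fmt[] = {
      { OP_INT, 0 }, { OP_LIT, ' ' }, { OP_INT, 0 }, { OP_END, 0 }
  };

  static void run_fmt(const struct step *s, char *out, int a, int b) {
      int args[2], n = 0;
      args[0] = a; args[1] = b;
      for (; s->op != OP_END; s++) {
          if (s->op == OP_LIT)
              *out++ = s->lit;
          else
              out += sprintf(out, "%d", args[n++]);  /* one conversion, no scan */
      }
      *out = '\0';
  }

  int main() {
      char buf[44];
      run_fmt(compiled_fmt, buf, 7, 7);
      printf("%s\n", buf);  /* prints "7 7" */
      return 0;
  }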

Compiling of printf strings isn't done, though, because nobody cares about string performance unless they're writing UNIX command-line utils. If you're writing C you're probably only dealing with strings for (infrequent) IO, spending the vast majority of your time crunching away on pointers and integer types (floats in niche cases). In the end, you're probably only printing something out because someone needs to read it, and how fast can humans read, anyway?

This goes just as much for the printing of floating point numbers. (http://www.serpentine.com/blog/2011/06/29/here-be-dragons-ad...)


Oh and does anyone else think the malloc example in comparison to actual garbage collection is incredibly unfair?


I've long since lost track of what it means for a comparison between languages to be fair. To put that a different way, how should we have compared them? Should we have downloaded some GC library for C to implement it?


I guess the real problem is that the example serves no real world purpose and can very easily be biased. See my response to scott_s below.


"Fairness" is not really relevant. As long as they're comparing typical C idioms to typical Python idioms, it's valid.


So why not simply do:

  #include <stdio.h>
  #include <stdlib.h>

  int main() {
      int i = 0;
      char *x = malloc(44 * sizeof(char));

      for (i = 0; i < 10000000; i++) {
          sprintf(x, "%d %d", i, i);
      }

      free(x);
  }

Their version was intentionally biased, which is a shame.


Probably because they wanted to simulate calling a function that parsed a string, and such functions will have to allocate and deallocate memory as needed.

It's also worth noting that calling malloc and free in a tight loop where you're always requesting the same amount of memory will be pretty fast. Good implementations of malloc - and glibc's certainly is one - will consistently return the exact same chunk of memory to you, and you will be on the fast-path of the allocation algorithm.
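
Easy to see for yourself - on glibc this will typically print the same address twice:

  #include <stdio.h>
  #include <stdlib.h>

  int main() {
      char *a, *b;
      a = malloc(44);
      printf("%p\n", (void *)a);
      free(a);
      b = malloc(44);  /* typically the very same chunk again */
      printf("%p\n", (void *)b);
      free(b);
      return 0;
  }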


Why would you assume the need to deallocate in the loop? Reuse of an already allocated memory location is C optimisation 101.


You're looking at this from the wrong angle. You're looking at the C code and thinking, "How can I optimize this?"

What you need to do in this case is look at it and say, "How can I optimize this and still retain the essence of what I want to test?" Your optimizations remove that essence - if you're calling a function that is a part of an API, it will have to allocate and free its own memory. That the code is in a loop is an artifact of the experiment.


I'm looking at it from the angle of someone that has spent over a decade in and out of C. The code in the loop is in NO way an "artifact"; it's absolutely critical to the comparison. My example in no way removes the essence, especially if we are using "idioms".


I said the loop is an artifact, not the code in the loop. Their experimental design is "Do x N times, then divide the overall runtime of the program by N to estimate the cost of x." If timing a program that did x once led to good results, they would do that - but it doesn't, so they don't. You're trying to optimize around the loop, but by doing that, you're changing the x that is being measured.


Two things potentially different between the two implementations:

1. The C code is stack allocating a pointer in every loop iteration.

2. Cleaning up the memory without (potentially) triggering Python's GC.

Is the Python GC getting triggered in this scenario? If not, then this is what's actually happening, with the cleanup happening automatically when the process exits:

  char *x[10000000];
  int i;

  for (i = 0; i < 10000000; i++) {
      x[i] = malloc(44 * sizeof(char));
      sprintf(x[i], "%d %d", i, i);
  }
If GC is thrown into the mix, the C code is really:

  char *x[1000];
  int i;
  int j = 0;

  for (i = 0; i < 10000000; i++) {
      x[j] = malloc(44 * sizeof(char));
      sprintf(x[j], "%d %d", i, i);
      j++;
      if (j == 1000) {
          for (j = 0; j < 1000; ++j)
              free(x[j]);
          j = 0;  /* reset so the next batch starts over */
      }
  }


If by x you mean performing a string operation and copying the result to memory... I still fail to see how allocating and deallocating memory on each iteration is a fair comparison; it's not something I'd expect a competent C developer to do, ergo how is that a relevant comparison?

I know the aim is to show off the string operation in and of itself so why not leave it there, why the need to bring garbage collection into the mix?

Edit: it's the way the garbage collection has been thrown into the mix I'm having difficulty understanding the need for.


If you didn't notice, their version was based off of static allocation for the character array, which is the kind of thing a sharp C programmer will put in as an optimization. Note that the Python version is already pretty optimized, according to general Python idioms.

You are arguing that they should have compared against a less efficient C example, which honestly boggles my mind.


Instantiation for side effects happens in Python. Unless PyPy can prove that the string isn't used anywhere (and it can't, really), it has to execute the statement and push the object onto the stack before discarding it.


Why can't it?


PyPy simply doesn't have such a thing yet, that's all. I don't think there's any technical reason that it couldn't be developed at some point in the future.


It's a matter of purity inference. They already trivially know the value isn't used based on the AST, so they need to show that the function being called is pure to elide it. Naturally, in the general case this is undecidable, but there are fast algorithms for very weak purity inference that might do the trick, at least in this case.

There may be other Python-specific concerns I'm missing - my work in this area is in Ruby static analysis - but one other thing is that allocating memory is a side-effect itself. The main side-effect visible to Python is that it might raise an exception for being out of memory - this would have to be special-cased as an acceptably ignored side-effect by any purity analyzer.


Sounds like this would be the equivalent of some theoretical

    exp = CompileSprintfFormat("%d %d");
    for (i = 0; i < 10000000; i++) {
        RunCompiledSprintf(exp, i, i);
    }
All I'm really getting out of this is that PyPy now compiles sprintf formats for you and saves the results, and that there's no equivalent API in libc.


That's not really it. This, at least in my understanding, is a fully generic optimization of operations on constant strings.


From the article:

> In the case of PyPy, we specialize the assembler if we detect the left hand string of the modulo operator to be constant.

So it's very much a targeted optimization at the modulo operator (which is Python's equivalent of sprintf).


I wouldn't say it's particularly targeted, let me show you the code that makes this happen: https://bitbucket.org/pypy/pypy/src/unroll-if-alt/pypy/objsp...


Yeah, the point is that by expressing the algorithm in Python, the JIT gets to go hog wild optimising tight loops like this one.

Which is great: why do all that work writing some kind of custom sprintf generating function when you can just let the JIT do it for you on the fly?


In that case, wouldn't it be more fair to compare it to a C++ re-implementation of sprintf using the new `constexpr` keyword?


You shouldn't have to use such an API, though, and this case could be optimized by any C compiler.


Why don't you take an example from Mike Pall's LuaJIT relying on something more computationally expensive, like scimark - he has a Lua version, and also a few other benchmarks (from the alioth site).

If you really want a fast sprintf(s, "%d %d", ...) - then you might as well craft something specifically for converting decimal numbers to text.

sprintf() is a convenience function, not a performance one.


I'm sure that's one of the larger goals of PyPy, to make idiomatic code that is convenient to write also perform well. They don't have to be mutually exclusive.


What the example shows is that specializing string operations for known inputs can be faster than not doing so. Surprise!

But going from this to "PyPy is faster than C" seems quite a stretch, no?


> compiled with GCC 4.5.2 at -O4 (other optimization levels were tested, this produced the best performance).

I was under the impression that any number greater than 3 had no effect?


Why be so shy? Faster than Assembly language! Faster than Assembly language with cache size alignment and padding of data structures! ^_^


If PyPy were really smart, it would notice the unused result and take 0.00s. :)


How did they measure this? How would it compare to C#? I'm assuming they ignore the startup time, of course.


This title is misleading; instead of `C` it should use `CPython`.


Why would it use that title? CPython is a part of some of the numbers, but the comparison is C vs. PyPy.



