Python Patterns - An Optimization Anecdote

nene · on Nov 2, 2010

I decided to try writing the same thing in JavaScript and discovered something really strange.

My first idea was:

    numbers.map(function(x){return String.fromCharCode(x);}).join("");

This was pretty fast already, but why not eliminate the anonymous function completely and pass String.fromCharCode directly to map():

    numbers.map(String.fromCharCode).join("");

I timed it and... ...this was ~100 times slower than the previous version. WTF!

Somehow passing this native function directly to Array.map() is way slower than wrapping it inside another function and passing that to Array.map().

I have tested it so far in Chrome, Firefox and Opera - the results are the same. I also tried forEach(), which behaves similarly.

Does anybody have an idea why this is so?

Update: I tried the same with Math.round and Math.sin, and with these the results were as one would expect: passing the function to Array.map() directly was a little bit faster than using intermediate anonymous function. So it seems the problem is with String.fromCharCode specifically.

nene · on Nov 2, 2010

Found an answer.

The problem is that String.fromCharCode takes multiple arguments and Array.map also passes multiple arguments to the callback, therefore the equivalent of numbers.map(String.fromCharCode) is actually:

    numbers.map(function(x, y, z){return String.fromCharCode(x, y, z);})

Which of course is slower as the end result will actually be array with longer strings in it.

lysium · on Nov 2, 2010

Just a guess: your anonymous function cannot be redefined (because there is no name), but String.fromCharCode could potentially be. Thus, a similar reason as mentioned in the article for global vs. local variables.

One would think that String.fromCharCode is looked up only once, though.

pornel · on Nov 2, 2010

Actually it's the opposite. When you pass String.fromCharCode directly, then you're passing reference (not name) to that particular implementation, and it can't change.

When you pass anonymous function, then every execution of that function needs to look up `String.fromCharCode` (anonymous functions save scope, not references).

I'm surprised by the benchmark as well. I suspect it may be because calls to native functions are handled differently from calls to JS functions, and JS engine is able to optimize call inside anonymous function (create trace/JIT and inline it), but not when calling by reference (and perhaps keeps calling it by some expensive proxy object).

nene · on Nov 2, 2010

That's exactly what I was thinking:

* In the first version each time the anonymous function is executed interpreter has to lookup String and from it the fromCharCode method => 2n lookups.

* In the second version the String and fromCharCode have to be lookud up only once => 2 lookups.

Therefore according to slow-lookups-theory the first version should be slower. Except when I measure it, the opposite turns out to be true.

dish · on Nov 2, 2010

That kind of should be the first train of thought in python too, why using "+" to add strings!? it maybe fixed in py3k but its still not worth it in py2.X

<pre> return ''.join(map(chr, list)) </pre>

Also the liberal use of the words list and string bothers me.

dish · on Nov 2, 2010

Since Guido invented the words "list" and "string" in python at least, I guess him using them doesn't bother me.

edanm · on Nov 2, 2010

I once found myself writing a python program to calculate Poker hands. It was pretty fun to write - it generated, for any given hand you hold, every possible way the game can go forward, and generated statistics about how many times you win (against 1 opponent only, who could be holding anything.)

The program initially ran for 20 minutes on each hand. I worked a day on optimizing it, and got it down to around 1-2 minutes per hand, so I learned a few tricks to make things quicker. Lessons learned:

1. Local lookups are much faster.

2. Anything that you can somehow offload to a c-loop, makes things much faster.

3. Surprisingly, the fastest construct for building a new list was always list comprehension (e.g. [x for x in calc(l) or whatever.) This was far faster than a regular ol' for-loop, of course, but was also faster than using map's and reduce's.

devinj · on Nov 2, 2010

Point 3 is weakened by the fact that map can indeed be faster. http://codepad.org/SzXcubXy -- in fact, point 2 and point 3 are in conflict, because map() does use a C-loop, whereas list comprehensions do not. As it turns out, point 2 was the correct one.

Also, and this is just on a more minor, pedantic note, list comprehensions basically are regular for loops, except with the use of the list-append opcode instead of a getattr-call pair. This produces a minor speedup, but it's nothing to write home about, honest. (using a different pastebin because it has a feature I needed): http://ideone.com/dT1LG . The story is a bit different on 3.x, where list comps are implemented very differently: http://ideone.com/58FGZ

edanm · on Nov 2, 2010

First off, I'm really no expert in Python or optimizations - I'm only reporting on observations.

Having said that, I was pretty surprised that the list comprehensions ran faster than map, exactly because of point 2 - I had assumed that map used a c-loop, whereas list comprehensions did not. Unfortunately, I don't have that code sitting around anywhere so I can't redo the tests - I just remember being very surprised by those results.

deathflute · on Nov 2, 2010

In my experience edanm is correct. map/reduce/filter are faster with built-in functions, but list comprehensions are faster with user-defined ones

For eg. using python 2.5.2

In [9]: l = ['10'] * 5000000

In [10]: timeit -n5 map(int, l)

5 loops, best of 3: 1.29 s per loop

In [11]: timeit -n5 [int(i) for i in l]

5 loops, best of 3: 1.64 s per loop

devinj · on Nov 2, 2010

You mean "incorrect". Also, I did post detailed timing results above.

rix0r · on Nov 2, 2010

The first thing that stuck out to me like a sore thumb was the string-concatenation-in-a-loop. Though the author was aware of its consequences he only addressed it late in the article, optimizing relatively trivial stuff like variable lookups first. And then he did so in a rather strange way, instead of adding string fragments to a list and then join()ing them.

Not sure how fast that loop would be in Python compared to the implied loop&join but I sure would have liked to see them compared.

RiderOfGiraffes · on Nov 2, 2010

  > Not sure how fast that loop would be in Python
  > compared to the implied loop&join but I sure
  > would have liked to see them compared.

Why don't you try it and tell us the results?

stavros · on Nov 2, 2010

Yeah, seriously. "".join((chr(x) for x in chars)) would be the first thing I did (I'm not sure if comprehensions are faster than generators, they probably are, so I'd test for that as well.

Nitramp · on Nov 2, 2010

Indeed seems strange to micro optimize all that stuff without first fixing the obvious quadratic runtime part.

By the way, is there no mutable string object (à la Java's StringBuilder/Buffer) in Python?

danielh · on Nov 2, 2010

It's a whole different story if the input gets longer and the quadratic complexity kicks in. Just change the testdata like this:

  testdata = range(256)*100

Result:

codefisher · on Nov 2, 2010

For those interested, the author of this was actually Guido himself.

I would have used "".join(map(chr, list)) myself. But I am not sure that was support in the version of python available at the time. It is equivalent to f6.

To some of those talking about the dynamic nature of python means you have to look up chr every time, there is a way to stop that, see this project http://github.com/rfk/promise/

RyanMcGreal · on Nov 2, 2010

Why not just:

    def f(list): 
        return ''.join([chr(l) for l in list])

zephyrfalcon · on Nov 2, 2010

Because it's an old article. Admittedly, there's no easy way to know this, because it has no date, but believe me, it's been around for ages. At the very least, it predates the silly str.join method (and new-style classes and string methods altogether).

However, before that, there was the string module, which had the join() function. This article probably relates to it in some way; I think that either it predates string.join (I'm not sure when that was added to Python, but it's been around since at least 1.4), or it clarifies how it was implemented.

The author is Guido van Rossum, by the way (another thing that isn't mentioned in the article).

dish · on Nov 2, 2010

WTF, Guido wrote that!? I will jump out of a window now.

teoruiz · on Nov 2, 2010

It is actually slower. Adding an f8 function using the ''.join() method:

  sikanda-13:30:02:/tmp$ python f.py 
  [...]
  f1 0.07
  f2 0.12
  f3 0.07
  f4 0.06
  f5 0.09
  f6 0.04
  f7 0.02
  f8 0.07

Original code linked in the article: http://www.python.org/doc/essays/f.py

scott_s · on Nov 2, 2010

My results agree with yours, in terms of who wins. Python 2.6 (2.4 is similar, and 3.0 is as well, for the techniques that were still valid) on OSX 10.5, Intel Core Duo at 2 GHz:

  [scotts@silver]$ /opt/local/bin/python2.6 f.py
  f1 0.107
  f2 0.19
  f3 0.106
  f4 0.097
  f5 0.139
  f6 0.073
  f7 0.025
  f8 0.103

edit: I added a ninth function,

  def f9(list):
    return ''.join(map(chr, list))

To test if the difference between f8 and f6 was the list comprehension or the joinfields method. A result of 0.072 tells me it's the list comprehension. The only difference between f8 and f9 is list comprehension versus map. That leads me to conclude that map is indeed a faster way to construct a list than a list comprehension. (Although this may change for a user-defined function. I could test this, but I've spent enough time already.)

Goladus · on Nov 2, 2010

This would be my default approach as well. I'd be interested to see how it compares-- my guess is that it would be equivalent to or slightly faster than f6(), and slower than the version that uses the array library.

petdog · on Nov 2, 2010

Actually, map is still faster:

    In [24]: timeit ''.join(map(chr, ll))
    100000 loops, best of 3: 3.91 us per loop

    In [25]: timeit ''.join([chr(i) for i in ll])
    100000 loops, best of 3: 5.85 us per loop

    In [26]: timeit ''.join(chr(i) for i in ll)
    100000 loops, best of 3: 7.79 us per loop

RyanMcGreal · on Nov 2, 2010

From the article, I'm guessing the author was using a version <2.x.

Incidentally, I ran my function and f6 from the article on a list of 9600 integers using Python 2.6.3. f6 ran in 0.00303250832159 seconds and my function ran in 0.00421450212248 seconds.

rmc · on Nov 2, 2010

Incidentally, are there any programmes that can scan your python functions and suggest improvements that are in this article? e.g. any programme that'll suggest referencing global names to the local namespace, using map etc.?

Devilboy · on Nov 2, 2010

I hope that one day soon our compilers will be smart enough to optimize simple things like this automatically, so we can focus more on writing for clarity. This seems like exactly the kind of thing a next-gen compiler should be able to optimize.

edanm · on Nov 2, 2010

I'm not sure it's so possible, considering the dynamic nature of Python.

Consider one of the biggest problems - dynamic lookups. Calling a function like "chr", which is looked up dynamically and found in the global scope each time, takes a lot of time. Just adding a local variable which has the same value (i.e. lchr = chr) already gives a speedup.

How can an optimizer fix this? Technically, a new lookup must be performed each time, since you never know if "chr" has been mapped to something new during the iterations.

riffraff · on Nov 2, 2010

you can lookup once and inline the method, and only keep a flag to check if you need to invalidate that. Pypy & other jits al do this kind of things all the time :)

But you don't need a jit: you can partially execute the code and determine that no changes were happening in the referenced globals, thus the inlining becomes possible.

There is quite a bit of literature on the issue of using partial evaluation for optimizing dynamic/reflective languages.

edanm · on Nov 2, 2010

Interesting. What are the best sources for learning what kind of optimizations are actually implemented in practice in Python?

riffraff · on Nov 2, 2010

well neither of the above, in standard python, but the pypy project has _a lot_ of interesting stuff http://codespeak.net/pypy/dist/pypy/doc/extradoc.html

derefr · on Nov 2, 2010

> you never know if "chr" has been mapped to something new during the iterations.

Why must this be true? Scan the loop body, find nothing that redefines chr, create a local variable holding it at the beginning of the loop. The same goes for all identifiers mentioned in the loop.

lysium · on Nov 2, 2010

Generally, you cannot scan the loop body to find redefinitions of `chr`. This is known as the halting problem.

Consider, for example, a redefinition of `chr` in an `if`-clause: you would have to actually run the program to know if the `if`-clause is taken or not.

Further, in Python, the redefinition of `chr` does not even have to be in the form of `chr = ...`, but could be an assignment into a global hash, maybe even using a variable as an index.

derefr · on Nov 2, 2010

I'm not suggesting that you have to know specifically that chr is being redefined; to cancel the optimization, you simply have to know that "this loop body is not limited to the subset of operations which are known to not redefine chr" (i.e. basically the same sort of analysis Google NaCl does.) This leaves a large number of optimizations on the table (i.e. any loop body that modifies a global hash will cancel the optimization), but is better than no optimization at all.

lysium · on Nov 2, 2010

I wonder how large (or small) that restricted subset is and how easily it is 'violated', turning off the optimization although it would have been possible (For example, the global hash could be referred to via a local variable which might carry the result of a function call). Do you have any insights on that?

Until then, using a 'dirty flag' on the `chr` definition that turns off the optimization at run time as mentioned elsewhere in this thread, seems to be the appropriate approach for me.

edanm · on Nov 2, 2010

Good points.

What about multi-threading? Isn't it possible that another thread entirely reassigns "chr" to something different, while the for loop is running?

Obviously this would only happen in a pathological program written purely for perpetrating evil on unsuspecting programmers :) But seriously, it's possible, and it's the kind of thing optimizing compilers have to take into consideration.

vilya · on Nov 2, 2010

This is the sort of thing that trace-based optimisation can help with. It's able to observe that the value bound to chr doesn't change between invocations and inline the lookup accordingly. I believe Mark Shannon's HotPy runtime (http://www.dcs.gla.ac.uk/~marks/) attempts to do this.

jpr · on Nov 2, 2010

> Calling a function like "chr", which is looked up dynamically

Shouldn't that be just one or two pointer-dereferences? Or is Python putting globals in hash-table and looking them up each and every time?

zephyrfalcon · on Nov 2, 2010

The latter; otherwise it wouldn't be dynamic. :-)

I don't know the details, but I assume that some of the Python compiler projects do make assumptions like this, i.e. they can compile to faster code if you don't shadow or outright overwrite builtins, etc.

jpr · on Nov 2, 2010

I still don't see why the variable names can't be interned somehow by the parser. This would mean that a global variable would be just a pointer in to the table, and "lookup" would be just dereferencing that pointer. AFAICT, this is how symbols and packages work in Common Lisp.

froydnj · on Nov 2, 2010

You are correct (you do have to check that the symbol is actually bound prior to dereferencing). But they don't (generally) work that way for symbols with a function binding. I think the rationale here is that it's much more common to have BOUNDP symbols than FBOUNDP symbols. So sticking an extra slot for the value in a symbol is worth it, space-wise, whereas an extra slot for the function is not, so function lookups go through some sort of global table.

Note too that Common Lisp forbids you from modifying the function binding of symbols in the COMMON-LISP package for precisely this reason: the compiler can do a better job reasoning about the effects of calls to those functions.