The part that interests me about this one is not that I necessarily want to run a faster Tornado server, but that I've often wondered how much speedup you can get in general from switching from CPython to PyPy.
Granted, it generally makes sense to do your own benchmarking for a particular application, but it's also nice to see other people's results like these from time to time, especially against a rapidly evolving project like PyPy.
I think that's a good point, and it was also heavily commented on when Google's Unladen Swallow released its benchmark numbers (IIRC, on the Django benchmark its binary size grew to 800 megs). That may even have been one of the reasons they stopped working on it (there was a link somewhere, but I cannot find it right now).
Furthermore, I think this "problem" is inherent to JIT compilation in general, since you have to store the generated code somewhere. The situation was/is somewhat similar to the JVM's memory requirements. An interesting alternative to code generation is to optimize interpreters instead.
With SSDs catching up to HDDs in price per GB, could you possibly use an SSD as a JIT cache?
Processors are the most power-hungry part of a machine, and considering that cooling is the biggest bottleneck in a modern datacenter, I wonder whether SSDs acting as caches could amplify what you get out of each processor.
Last time I checked, Tornado was a single process with an event loop of non-blocking sockets. Are you talking about running multiple Tornado processes on different ports and putting a round-robin proxy in front of them?
No, he's talking about multiple HTTP connections being handled by this single-threaded process (meaning you don't need many memory-consuming processes).
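To make that concrete, here's a minimal sketch assuming the Tornado-2.x-era API (the handler is made up): one process, one thread, and the IOLoop multiplexes every open connection over non-blocking sockets.

    import tornado.ioloop
    import tornado.web

    class HelloHandler(tornado.web.RequestHandler):
        def get(self):
            self.write("handled on the single shared event loop")

    app = tornado.web.Application([(r"/", HelloHandler)])
    app.listen(8888)  # one process, one thread, non-blocking sockets
    tornado.ioloop.IOLoop.instance().start()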
While I've also seen it be about twice as fast in my own testing (not with Tornado), it also had much longer GC pause times. (No stats, just watching my program's logs go by -- very noticeable stutters every couple of seconds.)
I would be very impressed if any real GC beat out CPython in terms of GC pause times, as CPython uses reference counting, which is far less subject to GC pauses than other GC methods (but has the disadvantages of worse amortized performance and inability to collect circular data structures).
"Difficulty" collecting circular structures, not "inability". You can still walk the live set with reference counting every so often to catch the circular trash, but you do have to pay through the nose, relatively speaking.
Which is actually what Python does, too: on top of reference counting, CPython runs a supplemental collector (the gc module) that periodically detects and frees reference cycles. Another interesting technique for dealing with this problem is called "trial deletion."
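A quick illustration in plain CPython (nothing PyPy-specific here): a two-object cycle whose refcounts can never reach zero, which only the supplemental collector reclaims.

    import gc

    class Node(object):
        pass

    a, b = Node(), Node()
    a.partner, b.partner = b, a  # each keeps the other's refcount above zero
    del a, b  # the cycle is now unreachable, but refcounting can't free it

    freed = gc.collect()  # the cycle detector inspects container objects
    print("objects reclaimed by the cycle collector: %d" % freed)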
It's interesting that the first run of both PyPy tests was significantly slower than all subsequent runs. I guess PyPy needs to do a bit of work before it's properly warmed up.
That's because PyPy uses a JIT: code starts out interpreted, and hot paths get compiled to machine code at runtime, so the first run pays the compilation cost while subsequent runs reuse the already-compiled code.
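The warm-up is easy to reproduce with a toy loop (a sketch; the workload is arbitrary): run it under PyPy and the first pass is visibly slower while the JIT traces and compiles the hot loop, after which the compiled code is reused.

    import time

    def work():
        total = 0
        for i in range(10000000):
            total += i * i
        return total

    for run in range(5):
        start = time.time()
        work()
        print("run %d: %.3fs" % (run, time.time() - start))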
Major thanks to Justin Peel for finding the memory leak! That was a fun one: memory allocated outside the GC (e.g. something allocated by OpenSSL) wasn't factored in when deciding when to run the GC, which meant the destructors of the objects holding onto the OpenSSL structs were never called, because the GC never saw the correct memory pressure.
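In outline, the failure mode looks something like this sketch (allocate_external/free_external are made-up stand-ins for C-level allocations the GC can't see, and the __pypy__.add_memory_pressure hint is, if I remember right, how PyPy lets wrappers report such memory):

    try:
        from __pypy__ import add_memory_pressure  # PyPy-only hint, IIRC
    except ImportError:
        def add_memory_pressure(estimate):
            pass  # no-op on CPython

    _external_heap = {}  # stand-in for memory malloc'd outside the GC

    def allocate_external(size):
        handle = object()
        _external_heap[handle] = bytearray(size)
        return handle

    def free_external(handle):
        del _external_heap[handle]

    class ExternalBuffer(object):
        def __init__(self, size):
            # The wrapper itself is tiny, so without the hint the GC sees
            # almost no pressure, may never run, and __del__ never fires.
            self._handle = allocate_external(size)
            add_memory_pressure(size)  # make the invisible bytes count
        def __del__(self):
            free_external(self._handle)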
"Pypy used about three times as much memory in both cases, but usage was stable over time (i.e. it's not "leaking" like pypy 1.6 did)."
For a front-end web server, I'd trade CPU perf for memory any day of the week.