The part that interests me about this one is not that I necessarily want to run a faster Tornado server, but that I've often wondered how much speedup you can get in general from switching from CPython to PyPy.
Granted, it generally makes sense to do your own benchmarking for a particular application, but it's also nice to see other people's results like these from time to time, especially against a rapidly evolving project like PyPy.
I think that's a good point, and it was also heavily commented on when Google's Unladen Swallow released its benchmark numbers (IIRC, on the Django benchmark its binary size grew to 800 megs). That may even have been one of the reasons they stopped working on it (there was a link somewhere, but I cannot find it right now).
Furthermore, I think this "problem" is inherent to JIT compilation in general, since you have to store the generated code somewhere. The situation was/is somewhat similar to the JVM's memory requirements. An interesting alternative to code generation is to optimize interpreters instead.
With SSDs catching up to HDDs in price per GB, could you possibly use an SSD as a JIT cache?
Processors are the most power-hungry part of a machine, and considering that cooling is the biggest bottleneck in a modern datacenter, I wonder whether SSDs acting as caches could amplify what you get out of each processor.
Last time I checked, Tornado was a single process with an event loop of non-blocking sockets. Are you talking about running multiple Tornado processes on different ports and putting a round-robin proxy in front of them?
No, he's talking about multiple HTTP connections being handled by this single-threaded process (meaning you don't need many memory-consuming processes).
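To make that concrete, here's a minimal sketch assuming the Tornado-2.x-era API (the handler is made up): one process, one thread, and the IOLoop multiplexes every open connection over non-blocking sockets.

    import tornado.ioloop
    import tornado.web

    class HelloHandler(tornado.web.RequestHandler):
        def get(self):
            self.write("handled on the single shared event loop")

    app = tornado.web.Application([(r"/", HelloHandler)])
    app.listen(8888)  # one process, one thread, non-blocking sockets
    tornado.ioloop.IOLoop.instance().start()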
While I've also seen it be about twice as fast in my own testing (not with Tornado), it also had much longer GC pause times. (No stats, just watching my program's logs go by -- very noticeable stutters every couple of seconds.)
I would be very impressed if any real GC beat out CPython in terms of GC pause times, as CPython uses reference counting, which is far less subject to GC pauses than other GC methods (but has the disadvantages of worse amortized performance and inability to collect circular data structures).
"Difficulty" collecting circular structures, not "inability". You can still walk the live set with reference counting every so often to catch the circular trash, but you do have to pay through the nose, relatively speaking.
Which is actually what Python does, too: on top of reference counting, CPython runs a supplemental collector (the gc module) that periodically detects and frees reference cycles. Another interesting technique for dealing with this problem is called "trial deletion."
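A quick illustration in plain CPython (nothing PyPy-specific here): a two-object cycle whose refcounts can never reach zero, which only the supplemental collector reclaims.

    import gc

    class Node(object):
        pass

    a, b = Node(), Node()
    a.partner, b.partner = b, a  # each keeps the other's refcount above zero
    del a, b  # the cycle is now unreachable, but refcounting can't free it

    freed = gc.collect()  # the cycle detector inspects container objects
    print("objects reclaimed by the cycle collector: %d" % freed)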
It's interesting that the first run of both PyPy tests was significantly slower than all subsequent runs. I guess PyPy needs to do a bit of work before it's properly warmed up.
That's because PyPy uses a JIT: code starts out interpreted, and hot paths get compiled to machine code at runtime, so the first run pays the compilation cost while subsequent runs reuse the already-compiled code.
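The warm-up is easy to reproduce with a toy loop (a sketch; the workload is arbitrary): run it under PyPy and the first pass is visibly slower while the JIT traces and compiles the hot loop, after which the compiled code is reused.

    import time

    def work():
        total = 0
        for i in range(10000000):
            total += i * i
        return total

    for run in range(5):
        start = time.time()
        work()
        print("run %d: %.3fs" % (run, time.time() - start))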
Major thanks to Justin Peel for finding the memory leak! That was a fun one: memory allocated outside the GC (e.g. something allocated by OpenSSL) wasn't factored in when deciding when to run the GC, which meant the destructors of the objects holding onto the OpenSSL structs were never called, because the GC never saw the correct memory pressure.
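In outline, the failure mode looks something like this sketch (allocate_external/free_external are made-up stand-ins for C-level allocations the GC can't see, and the __pypy__.add_memory_pressure hint is, if I remember right, how PyPy lets wrappers report such memory):

    try:
        from __pypy__ import add_memory_pressure  # PyPy-only hint, IIRC
    except ImportError:
        def add_memory_pressure(estimate):
            pass  # no-op on CPython

    _external_heap = {}  # stand-in for memory malloc'd outside the GC

    def allocate_external(size):
        handle = object()
        _external_heap[handle] = bytearray(size)
        return handle

    def free_external(handle):
        del _external_heap[handle]

    class ExternalBuffer(object):
        def __init__(self, size):
            # The wrapper itself is tiny, so without the hint the GC sees
            # almost no pressure, may never run, and __del__ never fires.
            self._handle = allocate_external(size)
            add_memory_pressure(size)  # make the invisible bytes count
        def __del__(self):
            free_external(self._handle)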
"Pypy used about three times as much memory in both cases, but usage was stable over time (i.e. it's not "leaking" like pypy 1.6 did)."
For a front-end web server, I'd trade CPU perf for memory any day of the week.