I'm impressed that there is still room to eke out performance improvements by fiddling with the base data structure. The original author has been working on refining it for quite a while*
It is a long article and requires some time to digest. The lack of comments is a good thing: it implies people are taking the time to read it. I am still in the process of reading ...
It also shows people here are unwilling to comment from a position of ignorance; a smaller percentage of readers will be authoritative on such an advanced topic.
YMMV, but for my Python web app executing mostly OO Python 2.5.x code, I got a 10% performance increase by using jemalloc compared to the malloc in RHEL 5. It's as simple as LD_PRELOAD=/path/to/libjemalloc.so. Memory usage is also better: in a long-running process, millions of objects allocated and then eventually released ended up with a smaller amount of memory used when using jemalloc.
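For anyone who wants to try the LD_PRELOAD approach from a launcher script instead of the shell, here is a minimal sketch. The helper names `preload_env` and `run_with_preload` are made up for illustration, the library path is the same placeholder as above, and I'm assuming a Linux dynamic loader that accepts colon-separated LD_PRELOAD entries.

```python
import os
import subprocess

# Placeholder path, left elided just as in the comment above.
JEMALLOC_PATH = "/path/to/libjemalloc.so"


def preload_env(lib_path, base_env=None):
    """Return a copy of the environment with lib_path first in LD_PRELOAD.

    Hypothetical helper: any existing LD_PRELOAD entries are kept after
    the new one, separated by a colon.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = lib_path if not existing else lib_path + ":" + existing
    return env


def run_with_preload(argv, lib_path=JEMALLOC_PATH):
    """Run argv with lib_path preloaded and return its exit code.

    Returns None without running anything if the library file is missing,
    so a wrong path does not silently fall back to the default malloc.
    """
    if not os.path.exists(lib_path):
        return None
    return subprocess.call(argv, env=preload_env(lib_path))
```

Something like `run_with_preload(["python", "app.py"])` would then start the app with the allocator swapped in. Whether you see a speedup like the one I got depends entirely on your allocation pattern, so measure before and after.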
I tried to include nedmalloc 1.0.5 in the benchmark, but it quite simply isn't polished enough for actual use. To start with, I had to remove an invalid assertion in malloc() and implement posix_memalign(), only to find that it crashes within moments of application startup.
Because if there is one library that needs to be conservatively packaged, it's libc. A critical flaw in libc pushed out around the world could cause nightmares too awful to contemplate. Our entire modern world literally depends upon it.
* http://t-t-travails.blogspot.com/2008/07/treaps-versus-red-b..., http://t-t-travails.blogspot.com/2010/04/red-black-trees-rev...