Scalable memory allocation using jemalloc (2011) (facebook.com)
37 points by luu on Nov 15, 2014 | 15 comments



Maybe this is the right place to ask a meta-question on malloc that I've been wondering about.

Are there any malloc implementations that try to work more cooperatively with the virtual memory system? The systems I've looked into seem to duplicate a lot of what the kernel and hardware are already doing. They don't seem to really distinguish between actively mapped memory and allocated but untouched memory.

With a 64-bit system, malloc could internally just reserve some vast logical address space and let the kernel fault pages in as needed. Instead of having a single frontier, you might have thousands of arenas used for different slab sizes, each being automatically extended as needed. Each arena would have a maximum waste of up to the 4KB page size, but on modern systems this seems unlikely to ever add up to a real burden.

You'd still need record-keeping for allocations within a slab, but the page table itself would be the authoritative record for the arenas themselves. All the pieces for this approach seem fairly straightforward. The only major thing lacking (I think) is an efficient userspace way to request the unmapping of a page that is no longer needed.
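
(For concreteness, here's a rough sketch of what I'm picturing on Linux. The sizes and flags are just illustrative, and madvise(MADV_DONTNEED) is the closest existing primitive I know of to the efficient page-unmapping request I mentioned:)

    // Rough sketch of the reserve-and-fault idea (illustrative only).
    #include <sys/mman.h>
    #include <cstddef>
    #include <cstdio>

    int main() {
        // Reserve a vast span of address space. No physical pages are
        // committed yet; the kernel faults them in on first touch.
        const size_t arena_size = size_t(1) << 36;  // 64 GiB of virtual space
        void* arena = mmap(nullptr, arena_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (arena == MAP_FAILED) { perror("mmap"); return 1; }

        // Touching a page is what actually commits it.
        static_cast<char*>(arena)[0] = 1;

        // The closest thing to "give this one page back": the mapping stays,
        // but the kernel may reclaim the physical page behind it.
        madvise(arena, 4096, MADV_DONTNEED);

        munmap(arena, arena_size);
        return 0;
    }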

Is there research into this approach? What key terms should I be searching for?

Edit:

I think this discussion may be about the same approach that I'm wondering about: http://caml.inria.fr/mantis/view.php?id=6101

And this paper touches on a lot of the same issues, although coming at it from the direction of modifying the hardware: http://research.cs.wisc.edu/multifacet/papers/isca13_direct_...


jemalloc works much like what you're describing. The one missing piece in terms of userland/kernel cooperation is that jemalloc aggressively discards dirty unused pages because it doesn't know whether other processes need those pages. If the kernel were to communicate the severity of memory pressure to userland, it would be possible to more intelligently rate-limit the purging.


You don't need to have a single frontier, and IIRC no one does these days. A stack of slabs per "arena" suffices, and the extra cost is only a read from a cacheline you'll soon need anyway. It looks like jemalloc uses a tree of "runs" for that.

I don't know exactly what you're imagining, but I don't see how you can pass this off to the kernel. You need a slab which has free memory. How do you know where to find it? You could write a nasty new syscall to ask the kernel for ZFOD (zero-fill-on-demand) pages, and then cry because your program is slow.

Also, unless your slabs are huge and your allocator is being used to fuel some kind of pre-caching, faulting pages in is going to be suboptimal. You incur at least twice as many mode switches: one for the mmap call, plus one more for each page that faults. I've measured this, and it gets pretty expensive. I'm kind of dubious about the performance of schemes which stray from the principle of "allocate what you actually need; free it when you don't."

The bottom line is that a slab allocator should be solving a different, mostly less general problem than the kernel. It needs finer granularity and it's optimizing for common block sizes that are small enough to fit into slabs of known sizes. So it should optimize for allocating page chunks in a small subset of possible sizes (namely, the slab size(s)), while the kernel should be worried about unbounded, oddly-sized page chunks.
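
To make the slab idea concrete, here's a hypothetical minimal version (not jemalloc's actual layout): one page carved into fixed-size blocks threaded onto an intrusive free list, so alloc and dealloc are both O(1) pointer pops and pushes.

    // Hypothetical minimal slab: one 4 KiB page of fixed-size blocks on an
    // intrusive free list. Real allocators (e.g. jemalloc's "runs") layer
    // bitmaps, size classes, and per-arena bookkeeping on top of this.
    #include <cassert>
    #include <cstddef>
    #include <cstdlib>

    struct Slab {
        void* free_list = nullptr;

        explicit Slab(size_t block_size) {
            assert(block_size >= sizeof(void*));  // need room for the link
            char* page = static_cast<char*>(std::malloc(4096));  // stand-in for mmap
            // Thread every block onto the list; each free block's first
            // word holds the pointer to the next free block.
            for (size_t off = 0; off + block_size <= 4096; off += block_size) {
                *reinterpret_cast<void**>(page + off) = free_list;
                free_list = page + off;
            }
        }
        void* alloc() {             // pop the free-list head
            void* p = free_list;
            if (p) free_list = *static_cast<void**>(p);
            return p;
        }
        void dealloc(void* p) {     // push the block back
            *static_cast<void**>(p) = free_list;
            free_list = p;
        }
        // The page itself is deliberately leaked here for brevity.
    };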

I toyed with scalable allocators for a course a few years back. It's a little embarrassing, but if you want to see what the costs actually are: http://www.andrew.cmu.edu/user/apodolsk/418/finalreport.html

You can do lockfree slab allocators pretty easily, so the game in scalable malloc is really about avoiding inter-CPU cacheline sharing and fitting your data structures into per-CPU caches. Slabs happen to be really good for that, for non-obvious reasons.
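
The thread-caching front end that makes that work looks roughly like this (a simplified sketch, not any real allocator's code): the hot paths touch only thread-local state, so no cachelines bounce between CPUs.

    // Simplified thread-caching front end: fast paths take no locks and
    // touch only thread_local state; only a cache miss takes the shared lock.
    #include <mutex>
    #include <vector>

    std::mutex g_heap_lock;
    std::vector<void*> g_global_pool;         // stand-in for the shared heap

    thread_local std::vector<void*> t_cache;  // per-thread free-block cache

    void* cached_alloc() {
        if (!t_cache.empty()) {               // fast path: no locks, no sharing
            void* p = t_cache.back();
            t_cache.pop_back();
            return p;
        }
        std::lock_guard<std::mutex> lock(g_heap_lock);  // slow path: refill
        if (g_global_pool.empty()) return nullptr;      // real code grabs a new slab
        void* p = g_global_pool.back();
        g_global_pool.pop_back();
        return p;
    }

    void cached_free(void* p) {
        t_cache.push_back(p);                 // fast path: stays thread-local
        // A real allocator would flush excess blocks back to the global
        // pool past some size limit, which is where the pathologies start.
    }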


It certainly wouldn't be the first time that I've suggested something ridiculous. One easy explanation for the lack of papers I can find is that the idea is so obviously useless that no one has written about it!

That said, it just seems unlikely that the best solution for 32-bit systems with limited virtual address space and slow interrupt-generating system calls is also the best solution for 64-bit systems with practically limitless virtual address space and fast hardware-supported syscall/sysenter.

And I think what I'm suggesting is roughly analogous to what's being done in Azul's (successful?) C4 garbage collection: provide fast userspace access to per-process page table manipulation, and offload as much as possible to the hardware: http://www.azulsystems.com/sites/default/files/images/c4_pap...

But I'm definitely too fuzzy about some of the key concepts to suggest anything particularly coherent at this point. I'll check out your writeup, and try to learn more. Thanks!


Hey folks, jemalloc author here, willing to answer any questions you have about it. I'll be AFK for much of today but will be around all day tomorrow.


It would be interesting to see the performance of TBB malloc included in the mix. For our application (parallel graph evaluation) we found the TBB allocator outperformed jemalloc. The margin was not huge, but enough to make it worthwhile using it. However, a colleague at a company with somewhat similar software found jemalloc outperformed TBB, so your mileage definitely varies depending on application.

It can also happen that these allocators use more memory than regular malloc since they hold onto local pools, so that may be a concern for some use cases.


Interesting. Just today I [read][1] some discussion about the platforms and circumstances under which Rust should use jemalloc instead of libc's malloc.

[1]: http://smallcultfollowing.com/babysteps/blog/2014/11/14/allo...


So a few questions on C and C++ memory allocation for the experts here.

Will typical C++ new (and C malloc) give thread contention issues?

Specifically, I think I have seen multithreading performance die when multiple tasks are finishing and doing lots of freeing at the same time.

- Is it possible that is true, and if so, is that the sort of thing tcmalloc (and jemalloc?) fix?

- If they fix it, what is the best way to get that into C++ programs that use new instead of malloc (as the vast majority do, through STL structures)?


Yes to both (new usually calls malloc), but the degree depends on the platform. In the old days, there was a single lock around the allocator serializing malloc/free. These days, all the major platforms have a memory allocator that tolerates some degree of concurrency, and you can also link in tcmalloc or jemalloc.

Even then, a lot will depend on your allocation patterns. For example, many threaded malloc implementations depend on thread-local caches to avoid having to take a lock on the fast path. This works great for many workloads, but can also fall flat on others.

Consider what happens when a producer thread allocates memory and passes the pointer to a consumer thread, which frees it. The thread-local cache in the producer thread will never be replenished by free() calls, and the thread-local cache in the consumer thread will never be used by malloc() calls. Thus performance can become very dependent on the mechanism the allocator uses to transfer memory to/from the local caches and the global heap.
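
The pattern in question looks roughly like this (a simplified illustration, not tied to any particular allocator): the producer only ever calls malloc and the consumer only ever calls free, so neither thread's cache gets recycled locally.

    // Allocation pattern that stresses thread-caching allocators: one thread
    // only allocates, the other only frees, so per-thread caches are never
    // replenished where they are drained.
    #include <condition_variable>
    #include <cstdlib>
    #include <mutex>
    #include <queue>
    #include <thread>

    std::queue<void*> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void producer() {
        for (int i = 0; i < 1000000; i++) {
            void* p = std::malloc(64);       // only ever allocates
            std::lock_guard<std::mutex> lock(m);
            q.push(p);
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_one();
    }

    void consumer() {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return !q.empty() || done; });
            if (q.empty()) return;           // done and fully drained
            void* p = q.front(); q.pop();
            lock.unlock();
            std::free(p);                    // only ever frees
        }
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }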


Yes. Simpler malloc designs share free-pool tracking structures among threads and use locks to guard against races. Other implementations use thread-local free-pool caches to reduce how often the allocator has to synchronize in the common case. The 'tc' in tcmalloc actually stands for 'thread caching'.

Under the hood, the default std allocator powering new typically calls malloc, so simply linking in a different implementation should work.
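
Conceptually that's because the default global operator new is little more than a thin wrapper over malloc. A simplified sketch (real runtimes also loop over new_handler and handle alignment):

    // Why relinking malloc also reroutes new: the default global operator
    // new is essentially a thin wrapper over malloc (simplified sketch).
    #include <cstdlib>
    #include <new>

    void* operator new(std::size_t size) {
        if (void* p = std::malloc(size)) return p;
        throw std::bad_alloc();
    }

    void operator delete(void* p) noexcept {
        std::free(p);
    }

On ELF platforms you can usually get jemalloc or tcmalloc in either by linking them ahead of libc or by LD_PRELOADing the shared library, with no source changes.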


Does anyone know of a more recent comparison between tcmalloc and jemalloc?

Also of note, if you haven't played with this before, be aware of how you are linking if you are in a more complex project. It is possible (though difficult) to end up with multiple malloc implementations in one binary, so be careful to keep from passing memory allocated by one to the free function of the other :-)



In case FB ever has problems with a mass migration of users to a successor (unlikely in the next 10 years, given the billions of people who still have neither internet access nor a FB account), they can still morph into a software and hardware technology giant...


I think it should be stated in the title that this is a post from 2011.


Wow, I've been looking for something like this. Much better than vanilla malloc. Thanks!



