
You don't need to have a single frontier, and IIRC no one does these days. A stack of slabs per "arena" suffices, and the extra cost is only a read from a cacheline you'll soon need anyway. It looks like jemalloc uses a tree of "runs" for that.
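To make that concrete, here's a toy sketch of the kind of per-arena slab stack I mean (field names and layout are invented for illustration, not jemalloc's actual structures):

    #include <stddef.h>

    struct slab {
        struct slab *next;      /* next slab in this arena's stack */
        void        *free_list; /* intrusive list of free blocks inside the slab */
        size_t       nfree;     /* free blocks remaining */
    };

    struct arena {
        struct slab *partial;   /* top of the stack: slabs that still have space */
    };

    /* Pop a block from the arena's top slab; the caller maps a fresh slab
       (and pushes it here) when this returns NULL. */
    static void *arena_alloc(struct arena *a)
    {
        struct slab *s = a->partial;
        if (s == NULL)
            return NULL;
        void *block = s->free_list;
        s->free_list = *(void **)block;   /* each free block stores the next one */
        if (--s->nfree == 0)
            a->partial = s->next;         /* slab is now full; unlink it */
        return block;
    }

The only extra work over a single bump pointer is reading the slab header at the top of the stack, which is the cacheline you're about to touch anyway.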

I don't know exactly what you're imagining, but I don't see how you can pass this off to the kernel. You need a slab which has free memory. How do you know where to find it? You could write a nasty new syscall to ask the kernel for ZFOD pages, and then cry because your program is slow.

Also, unless your slabs are huge and your allocator is being used to fuel some kind of pre-caching, faulting pages is going to be suboptimal. You incur at least twice as many mode switches: one for the mmap call, one for each faulting page. I've measured this, and it gets pretty expensive. I'm kind of dubious about the performance of schemes which stray from the principle of "allocate what you actually need; free it when you don't."
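For a sense of the two strategies being compared (demand faulting vs. paying for everything inside the mmap call), something like this Linux-only sketch is what I have in mind, with a made-up slab size:

    #include <sys/mman.h>

    #define SLAB_BYTES (1u << 20)   /* placeholder slab size */

    /* One mode switch now, then one minor fault per page as blocks get touched. */
    static void *grab_slab_demand_fault(void)
    {
        return mmap(NULL, SLAB_BYTES, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }

    /* MAP_POPULATE prefaults the pages during the call: one longer syscall,
       no per-page traps afterwards. (A real caller checks for MAP_FAILED.) */
    static void *grab_slab_prefaulted(void)
    {
        return mmap(NULL, SLAB_BYTES, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    }

With the first version you pay a trap per page as the slab gets carved up; with the second you pay it all up front in a single (longer) syscall.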

The bottom line is that a slab allocator should be solving a different, mostly less general problem than the kernel. It needs finer granularity and it's optimizing for common block sizes that are small enough to fit into slabs of known sizes. So it should optimize for allocating page chunks in a small subset of possible sizes (namely, the slab size(s)), while the kernel should be worried about unbounded, oddly-sized page chunks.
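In other words, the user-level allocator only ever has to deal with a handful of block sizes. A sketch of the size-class rounding I mean (the class list is invented for illustration):

    #include <stddef.h>

    static const size_t size_classes[] = { 16, 32, 64, 128, 256, 512, 1024, 2048 };
    #define NCLASSES (sizeof size_classes / sizeof size_classes[0])

    /* Round a request up to a class, or punt oddly-sized large requests
       straight to the kernel (-1 means "just mmap it"). */
    static int size_to_class(size_t n)
    {
        for (size_t i = 0; i < NCLASSES; i++)
            if (n <= size_classes[i])
                return (int)i;
        return -1;
    }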

I toyed with scalable allocators for a course a few years back. It's a little embarrassing, but if you want to see what the costs actually are: http://www.andrew.cmu.edu/user/apodolsk/418/finalreport.html

You can do lock-free slab allocators pretty easily, so the game in scalable malloc is really about avoiding inter-CPU cacheline sharing and fitting your data structures into per-CPU caches. Slabs happen to be really good for that, for non-obvious reasons.
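A toy example of why the lock-freedom is the easy part, using C11 atomics (this skips the ABA protection a real pop needs, e.g. tagged or hazard pointers):

    #include <stdatomic.h>
    #include <stddef.h>

    struct block { struct block *next; };

    static _Atomic(struct block *) free_head;

    static void push_block(struct block *b)
    {
        struct block *old = atomic_load(&free_head);
        do {
            b->next = old;
        } while (!atomic_compare_exchange_weak(&free_head, &old, b));
    }

    static struct block *pop_block(void)
    {
        struct block *old = atomic_load(&free_head);
        while (old != NULL &&
               !atomic_compare_exchange_weak(&free_head, &old, old->next))
            ;   /* a failed CAS reloads 'old' with the current head */
        return old;
    }

The CAS loops are short, but every CPU pushing and popping through that one head pointer bounces the same cacheline around, which is exactly the sharing that a per-CPU slab or cache layer in front of it is there to avoid.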




It certainly wouldn't be the first time that I've suggested something ridiculous. One easy explanation for the lack of papers I can find is that the idea is so obviously useless that no one has written about it!

That said, it just seems unlikely that the best solution for 32-bit systems with limited virtual address space and slow, interrupt-generating system calls is also the best solution for 64-bit systems with practically limitless virtual address space and fast hardware-supported syscall/sysenter.

And I think what I'm suggesting is roughly analogous to what's being done in Azul's (successful?) C4 garbage collection: provide fast userspace access to per-process page table manipulation, and offload as much as possible to the hardware: http://www.azulsystems.com/sites/default/files/images/c4_pap...

But I'm definitely too fuzzy about some of the key concepts to suggest anything particularly coherent at this point. I'll check out your writeup, and try to learn more. Thanks!



