How to allocate memory (sdf1.org)
156 points by ksherlock on Aug 26, 2016 | 45 comments



If you'd like to learn more, here's an MIT research paper on fast memory allocation that has some really clever ideas: http://supertech.csail.mit.edu/papers/Kuszmaul15.pdf

Abstract:

SuperMalloc is an implementation of malloc(3) originally designed for X86 Hardware Transactional Memory (HTM). It turns out that the same design decisions also make it fast even without HTM. For the malloc-test benchmark, which is one of the most difficult workloads for an allocator, with one thread SuperMalloc is about 2.1 times faster than the best of DLmalloc, JEmalloc, Hoard, and TBBmalloc; with 8 threads and HTM, SuperMalloc is 2.75 times faster; and on 32 threads without HTM SuperMalloc is 3.4 times faster. SuperMalloc generally compares favorably with the other allocators on speed, scalability, speed variance, memory footprint, and code size.

SuperMalloc achieves these performance advantages using less than half as much code as the alternatives. SuperMalloc exploits the fact that although physical memory is always precious, virtual address space on a 64-bit machine is relatively cheap. It allocates 2 MiB chunks which contain objects all the same size. To translate chunk numbers to chunk metadata, SuperMalloc uses a simple array (most of which is uncommitted to physical memory). SuperMalloc takes care to avoid associativity conflicts in the cache: most of the size classes are a prime number of cache lines, and nonaligned huge accesses are randomly aligned within a page. Objects are allocated from the fullest non-full page in the appropriate size class. For each size class, SuperMalloc employs a 10-object per-thread cache, a per-CPU cache that holds about a level-2-cache worth of objects per size class, and a global cache that is organized to allow the movement of many objects between a per-CPU cache and the global cache using O(1) instructions. SuperMalloc prefetches everything it can before starting a critical section, which makes the critical sections run fast, and for HTM improves the odds that the transaction will commit.


While it has some new material, it's also not very well written, and some of the benchmarks could be argued to be unrealistic. For example, Brad should also have included results from applications such as Redis (which uses jemalloc by default), MongoDB, and many others. In many other scenarios not detailed in the paper, the allocator can use much more memory than tcmalloc.

The "Vyukov" benchmark is an invented (and I would argue, contrived) scenario that Vyukov himself devised, just by examining the code, to cause memory bloat in Intel's TBB. Whether it actually occurs in a real application is debatable.

As you may have noticed, nowhere in the paper is tcmalloc mentioned :) It is still a viable alternative today.

There are many other papers I would recommend in addition to this one, e.g., "Dynamic Storage Allocation: A Survey and Critical Review" by Wilson et al., despite it being quite old.

This might get me downvotes, but it is wise to remember that just because work has the MIT stamp on it doesn't mean it's always top quality.


The only limitation of "Dynamic Storage Allocation: A Survey and Critical Review" is that it does not include a discussion on multithreaded allocator design, because the paper predates the point when people really started caring. I'm unaware of a survey paper that covers that aspect of memory allocation.

(By the way: yours is a good comment! There's no need to mention you may get downvotes. We ask people not to do that, as it can tend to have a "I dare you" effect: https://news.ycombinator.com/newsguidelines.html)


> There's no need to mention you may get downvotes. We ask people not to do that, as it can tend to have a "I dare you" effect: https://news.ycombinator.com/newsguidelines.html

Oops. Sorry, I will be careful next time. Thanks for pointing that out.

> does not include a discussion on multithreaded allocator design

This is very true. tcmalloc seems to have been the earliest design with thread-local pools. jemalloc didn't originally have this design[1], and over time many allocators adopted it, SuperMalloc among them.

[1] https://www.facebook.com/notes/facebook-engineering/scalable... (search for tcmalloc)
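
To make the idea concrete, here is a minimal single-size-class sketch of a thread-local pool with a shared fallback. All names are made up, and it ignores the hard parts real allocators must handle (remote frees, rebalancing, memory blowup):

    #include <stdlib.h>
    #include <pthread.h>

    #define OBJ_SIZE 64

    typedef struct node { struct node *next; } node_t;

    static _Thread_local node_t *local_free;   /* per-thread, no lock needed */
    static node_t *global_free;                /* shared, lock-protected */
    static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

    void *obj_alloc(void) {
        if (local_free) {                      /* fast path: no locking */
            node_t *n = local_free;
            local_free = n->next;
            return n;
        }
        pthread_mutex_lock(&global_lock);      /* slow path: shared pool */
        node_t *n = global_free;
        if (n) global_free = n->next;
        pthread_mutex_unlock(&global_lock);
        return n ? (void *)n : malloc(OBJ_SIZE);
    }

    void obj_free(void *p) {
        node_t *n = p;                         /* push onto the local list */
        n->next = local_free;
        local_free = n;
    }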


Actually, thread-local pools predate tcmalloc by quite a few years. Cribbing from the related-work section of a 2006 paper I'm a co-author on (http://www.scott-a-s.com/files/ismm06.pdf):

"Streamflow uses segregated object allocation in thread-private heaps, as in several other thread-safe allocators including Hoard [3], Maged Michael’s lock-free memory allocator [18], Tcmalloc from Google’s performance tools [10], LKmalloc [15], ptmalloc [9], and Vee and Hsu’s allocator [25]. In particular, Streamflow uses strictly thread-local object allocation, both thread-local and remote deallocation and mechanisms for recycling free page blocks to avoid false sharing and memory blowup [3, 18]."

[3] E. Berger, K. Mckinley, R. Blumofe, and P. Wilson. Hoard: A Scalable Memory Allocator for Multithreaded Applications. In Proc. of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 117–128, Cambridge, MA, November 2000.

[9] Wolfram Gloger. Dynamic Memory Allocator Implementations in Linux System Libraries. http://www.dent.med.uni-muenchen.de/~wmglo/malloc-slides.html.

[10] Google. Google Performance Tools. http://goog-perftools.sourceforge.net/.

[15] P. Larson and M. Krishnan. Memory Allocation for Long-Running Server Applications. In Proceedings of the First International Symposium on Memory Management, pages 176–185, Vancouver, BC, October 1998.

[18] M. Michael. Scalable Lock-free Dynamic Memory Allocation. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, pages 35–46, Washington, DC, June 2004.

[25] V. Vee and W. Hsu. A Scalable and Efficient Storage Allocator on Shared Memory Multiprocessors. In Proceedings of the 1999 International Symposium on Parallel Architectures, Algorithms and Networks, pages 230–235, Perth, Australia, June 1999.

The earliest appears to be Larson and Krishnan from 1998. In the late '90s and early 2000s the work was SMP-focused, aimed at servers. Then in the early-to-mid 2000s, people (including my advisor) started realizing this whole "multicore" thing was for real, and that system software would have to change.


I had read your paper :) but did not dig enough to find whether tcmalloc had introduced the concept or not. Thanks for pointing those out!


I wasn't sure where it appeared first, either! I had to dig out that old related work section. There may be work that predates the '98 reference, but it may not have gotten much attention. (I had assumed Hoard would be the first in the literature, but that's from 2000.) I think when it shows up is more related to the available hardware at the time, and what people were doing with it. It's not a huge stretch to imagine thread-local pools, but I don't think enough people were paying attention to the problem before then.


If you haven't spent much time reading about allocation strategies and are looking for a good place to start, _Dynamic Storage Allocation: A Survey and Critical Review_ is a fantastic starting point. One of my all-time favorite survey papers.

http://www.cs.northwestern.edu/~pdinda/ics-s05/doc/dsa.pdf


That's a good paper, but it's worth noting that most of the interesting work these days is in multithreaded memory allocators, which weren't as important in 2005. Scaling well under multithreading (i.e. not taking a global malloc lock) changes the design space considerably: you need to have per-thread heaps and rebalance them from time to time, which is itself a very interesting problem.


For anyone interested in this area, take a look at the memory allocator (and GC) in HotSpot, specifically "Hierarchical PLABs, CLABs, TLABs in Hotspot" [0]

[0] http://cs.uni-salzburg.at/~hpayer/


Thanks for the link. Do you mind sharing some of these other all-time favorite survey papers?


On the subject of papers, for those who have not read them Jeff Bonwick's work on the Solaris slab allocator is well worth reading.

https://www.usenix.org/legacy/publications/library/proceedin...

https://www.usenix.org/legacy/event/usenix01/full_papers/bon...

Pretty much every OS kernel has adopted the techniques outlined in these papers.


These ideas are also excellent.

While slab allocators can be very fast and space-efficient, they are meant for scenarios where you know the object sizes in advance, since each slab cache allocates only one kind of object. The kernel is a great use case: you know which structs will be frequently allocated and destroyed, so you create a slab cache for just that one data structure (and other caches for other structs). And the kernel itself does not change frequently, nor does it allocate objects of sizes imposed by user space (as a file system or key-value store would have to).
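
To illustrate, here is a minimal single-threaded sketch of one slab cache for one struct (hypothetical names; real slab allocators add constructors, cache coloring, and per-slab bookkeeping):

    #include <stdlib.h>

    #define SLAB_BYTES 4096

    struct conn { int fd; long bytes_in, bytes_out; };  /* example object */

    union slot { union slot *next; struct conn obj; };

    static union slot *free_list;

    static void grow(void) {
        /* carve a page-sized slab into equal slots, thread them together */
        union slot *slab = malloc(SLAB_BYTES);
        if (!slab) return;
        size_t n = SLAB_BYTES / sizeof(union slot);
        for (size_t i = 0; i < n; i++) {
            slab[i].next = free_list;
            free_list = &slab[i];
        }
    }

    struct conn *conn_alloc(void) {
        if (!free_list) grow();
        if (!free_list) return NULL;       /* out of memory */
        union slot *s = free_list;
        free_list = s->next;
        return &s->obj;
    }

    void conn_free(struct conn *c) {
        union slot *s = (union slot *)c;   /* obj is the union's first view */
        s->next = free_list;
        free_list = s;
    }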

Memcached employs (or at least employed) its own allocator that is almost analogous to a slab allocator. It creates each cache not from knowledge of specific object sizes (it cannot have any, being a key-value store), but from a range of sizes stepped by a runtime constant. This can lead to really severe memory bloat in some instances.

Memory allocators are not easy. Even "general-purpose" allocators are not truly meant for general-purpose allocation; the DSA paper mentioned above states (page 6 top left):

> It has been proven that for any possible allocation algorithm, there will always be the possibility that some application program will allocate and deallocate blocks in some fashion that defeats the allocator's strategy, and forces it into severe fragmentation


The SuperMalloc paper has some really interesting ideas about how to manage metadata. When you do a free(), you need to know the size of the underlying allocation. You also probably need access to some metadata relating to that allocation. Ideally, you even want to be a bit robust against someone freeing something that isn't a pointer to the beginning of a heap object.

Jemalloc uses a radix tree to map addresses to the headers of the 1-2MB "chunks" it uses as its top level of allocation. SuperMalloc also allocates into 2MB chunks, but arranged so that each chunk holds objects of a single size. It then uses an array with one double word per 2MB chunk of address space, holding the size of the objects in that chunk. That also lets you quickly find the metadata, which is stored at the beginning of each chunk.

The array can get big (512MB to cover a 48-bit address space with 2MB chunks), but it's sparse, and the virtual memory system won't map it to real memory until a page of it is actually accessed.
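
Roughly, the lookup described above works like this (a sketch with made-up names, not SuperMalloc's actual code):

    #include <stdint.h>

    #define CHUNK_SHIFT 21                        /* 2 MiB chunks */
    #define NCHUNKS (1ull << (48 - CHUNK_SHIFT))  /* 48-bit address space */

    /* ~512 MiB of virtual space; pages never touched cost no RAM */
    static uint32_t chunk_objsize[NCHUNKS];

    uint32_t objsize_of(void *p) {
        uintptr_t idx = (uintptr_t)p >> CHUNK_SHIFT;
        return chunk_objsize[idx];                /* 0 = not a heap chunk */
    }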


I don't know how fast SuperMalloc is for other people, but in my experience it's not as fast as some other mallocs (like Hoard or tcmalloc), and in the end it's still just another malloc().

Giving up the malloc() interface allows a very simple page-long allocator to outperform every malloc() I've ever tried (including SuperMalloc), and it would truly surprise me to learn there was some exotic memory technique (HTM, caching, etc) that could beat me.


Can you elaborate on what you mean by "Giving up the malloc() interface"?


The API for malloc is very simple:

  void* malloc(size_t len);
  void free(void* ptr);
Basically, you ask for an amount of memory and get back a pointer to it. But the interface does not let you communicate anything except the amount when allocating, and you can only hand over the pointer when freeing. If, for example, malloc had an interface like:

  typedef enum { SHORT, MEDIUM, LONG } duration_t;
  void* malloc(size_t len, duration_t dur);
One can imagine you could optimize based on the given hint. Once you open up one kind of hint, you can imagine many other kinds of things users could communicate explicitly to the memory allocator about the allocation and access patterns of their data.

All implementations of free also need to find the related metadata for the given pointer. We could also imagine an interface for free which required the user to maintain that, but could perhaps speed things up:

  void free(void* ptr, meta_t d);
Or, we could also imagine communicating how important it is to free this memory:

  typedef enum { IMMEDIATE, DELAYABLE } immediacy_t;
  void free(void* ptr, immediacy_t im);
In the latter case, maybe the memory allocator could put off doing delayable requests, and do them in a batch later. (Making it a bit more like garbage collection.)

I'm not sure any of these ideas would help, but the point is that because of the limited interface, we really can't explore them. We can, but then convincing the rest of the world to change such a basic building block of C code is quite hard.
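
To illustrate the last idea, here's a rough sketch of what a hinted free could do with DELAYABLE requests (entirely hypothetical; no real allocator exposes this API):

    #include <stdlib.h>

    typedef enum { IMMEDIATE, DELAYABLE } immediacy_t;

    #define PENDING_MAX 256
    static void *pending[PENDING_MAX];
    static int npending;

    static void flush_pending(void) {
        for (int i = 0; i < npending; i++)
            free(pending[i]);          /* one batched, cache-friendly pass */
        npending = 0;
    }

    void free_hinted(void *ptr, immediacy_t im) {
        if (im == IMMEDIATE) {
            free(ptr);
            return;
        }
        if (npending == PENDING_MAX)   /* queue full: flush the batch */
            flush_pending();
        pending[npending++] = ptr;
    }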


> In the latter case, maybe the memory allocator could put off doing delayable requests, and do them in a batch later. (Making it a bit more like garbage collection.)

Amusingly, when I've used an IMMEDIATE/DELAYABLE style hint, it was for the opposite purpose: I had some batched deallocs that I would either delay (to spread out over multiple frames instead of handling as a single batch, to eliminate the framerate hitch we were getting), or perform immediately as a single batch (to achieve greater throughput when switching scenes as delayed deallocation was adding untenable amounts of overhead.)

> We can, but then convincing the rest of the world to change such a basic building block of C code is quite hard.

Changing such a fundamental building block of C is impossible.

However, providing a second alternative interface, for those applications which could really benefit from such fiddly high performance tweaks, already happens a good bit in games at least. Pool allocators, allocators with extra debug information, allocation of entirely different styles of memory (e.g. write combined memory for texture uploads)... lots of stuff out there. Some low level graphics APIs now make you decide if e.g. you want to put shader constants in their own GPU buffers, or just interleave them into the command buffers themselves...


jemalloc has an alternative API that allows specifying the size of the allocation: sdallocx [1].

[1]: http://www.canonware.com/download/jemalloc/jemalloc-latest/d...


"jemalloc has an alternative API that allows specifying the size of the allocation"

I would like to see an API for malloc where you don't need to specify the size of the allocation :-)

For those who wonder: it takes additional flags specifying alignment, whether to zero memory, whether to store data in a thread-specific cache, or an arena to use.
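
A usage sketch, assuming you build and link against jemalloc (mallocx/sdallocx and the MALLOCX_* flags are jemalloc's non-standard API):

    #include <jemalloc/jemalloc.h>

    void example(void) {
        size_t sz = 4096;
        /* 64-byte aligned, zeroed allocation */
        void *p = mallocx(sz, MALLOCX_ALIGN(64) | MALLOCX_ZERO);
        /* ... use p ... */
        sdallocx(p, sz, MALLOCX_ALIGN(64));   /* sized deallocation: no
                                                 metadata lookup needed */
    }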


Interesting idea.

Another avenue for improvement is to use alloca() for small function-local objects, but there is no direct way to know whether a variable is local or not (in C and C++, at least).

> We can, but then convincing the rest of the world to change such a basic building block of C code is quite hard.

It can be done with static analysis or binary instrumentation.


> but there is no direct way to know if a variable is local or not (in C and C++ at least).

If you only have one stack, and the stack is at the top of memory and it grows down, you can:

    int onstackp(void *x) { char a; return (char *)x > &a; }


That's so simple! Thanks a lot.


malloc(sizeof(foo)) is slower than alloc_foo() because the latter can be a simple pointer increment, but there is no way to tell the malloc() interface that you're going to be allocating nothing but foos for a while.

malloc() can guess this with heuristics, but a good malloc() needs to perform well for a wide variety of use cases. Surely you can appreciate that this balance has a cost the specialised allocator simply doesn't have to pay.
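
For example, a sketch of what alloc_foo() might look like (made-up names; it only wins because it can defer all frees to one bulk reset, an assumption malloc() can never make):

    #include <stddef.h>

    struct foo { int a, b; };

    #define POOL_COUNT 4096
    static struct foo pool[POOL_COUNT];
    static size_t next_foo;

    struct foo *alloc_foo(void) {
        if (next_foo == POOL_COUNT) return NULL;  /* or chain another pool */
        return &pool[next_foo++];                 /* just an increment */
    }

    void reset_foos(void) { next_foo = 0; }       /* free everything at once */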


> SuperMalloc exploits the fact that although physical memory is always precious, virtual address space on a 64-bit machine is relatively cheap. It allocates 2 MiB chunks which contain objects all the same size.

Hmm, this must not interact well with huge pages: either all those (huge-page-sized) chunks actually take up physical memory, or you're stuck with 4 KiB pages.

Transparent huge pages on Linux make a noticeable performance difference in practice. In one of my applications, a coworker discovered that dTLB misses were a significant source of slowness. He pointed out that we had a lot of "holes" in our memory allocation caused by the malloc implementation calling madvise on empty 4 KiB pages, thus preventing them from being coalesced into a 2 MiB huge page (or causing them to migrate back to small pages). Setting a command-line flag to prevent this decreased CPU usage by 7%.
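
For reference, a sketch of how an allocator can at least ask Linux to back a region with huge pages (real mmap/madvise flags, Linux-specific; whether THP actually kicks in depends on the transparent_hugepage system settings):

    #include <stddef.h>
    #include <sys/mman.h>

    void *alloc_thp_region(size_t len) {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return NULL;
        madvise(p, len, MADV_HUGEPAGE);  /* hint: use 2 MiB pages here */
        return p;
    }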


HN discussion on the SuperMalloc paper: https://news.ycombinator.com/item?id=10770781


> SuperMalloc achieves these performance advantages using less than half as much code as the alternatives. SuperMalloc exploits the fact that although physical memory is always precious, virtual address space on a 64-bit machine is relatively cheap.

On the other hand, that can considerably increase the size of the page table and the cost and frequency of TLB misses, so I'm not sure it's an entirely meaningful comparison.


> One major limitation of malloc (and even the best implementations like jemalloc and dlmalloc) is that they try to use a single allocator for each data structure. This is a mistake: A huge performance gain can be had by using a separate allocator for each of your data structures — or rather, for each of your data usage patterns.

I stopped reading the article at this point. This statement has been disproven repeatedly over the years with Hoard and jemalloc. It is counter-intuitive but the data backs it up.

Custom per-data-structure allocators can fragment the global memory arena and cause more CPU cache misses as a result of the extra code involved. The latest/greatest malloc/free implementations use a myriad of optimizations to achieve speed improvements that a custom allocator implementation would rarely use.

It's not an accident that jemalloc is so widely used in major applications - it works extremely well.

https://github.com/jemalloc/jemalloc/wiki/History

https://github.com/jemalloc/jemalloc/wiki/Adoption


> This statement has been disproven repeatedly over the years with Hoard and jemalloc. It is counter-intuitive but the data backs it up.

What data is that? I would honestly love to see it. TBH, it would not change my mind, since I have practical experience of custom allocators working better for me than general ones. But then again, I work in a severely memory constrained environment.

>Custom per-data-structure allocators can fragment the global memory arena and cause more CPU cache misses as a result of the extra code involved.

Well, pretty much all modern allocators have techniques to avoid fragmentation. But that is beside the point.

But your comment does not make sense to me; perhaps it does to others. As the developer, you know the general sizes of your data structures, their usage patterns, their lifetimes, and the frequency/rate at which they are allocated (all of which a general allocator has to guess, or handle with some rudimentary strategy that adds needless metadata). That is exactly where you can design your custom allocator to reduce fragmentation. It may well be that the gain from a custom allocator isn't worth the effort, but I don't quite understand how it can get _worse_, as you claim.

>The latest/greatest malloc/free implementations use a myriad of optimizations to achieve speed improvements that a custom allocator implementation would rarely use.

So what exactly stops you from building those optimizations into a custom allocator? Certain domains do exactly this: game programming, for example. People have been doing it for decades with excellent results.

>It's not an accident that jemalloc is so widely used in major applications - it works extremely well.

Sorry, but you have no way of knowing _why_ everybody else uses something. That by itself is not an argument for anything.


Custom allocators have the disadvantage of not knowing how the rest of the heap is managed. That puts them at a disadvantage compared to a modern, well-designed memory allocator like Hoard or jemalloc.

Feel free to examine the many papers put out by Emery Berger on the subject of memory allocation as well as the design documents of jemalloc.


>Custom allocators have the disadvantage of not knowing how the rest of the heap is managed.

Sure, but they can work out of a fixed region statically allocated on startup.

>Feel free to examine the many papers put out by Emery Berger on the subject of memory allocation as well as the design documents of jemalloc.

https://people.cs.umass.edu/~emery/pubs/berger-oopsla2002.pd...

Do you have links to more? I only found one paper, from 14 years ago. It is interesting, but it comes with a huge YMMV, since they only examined specific projects, which might not have allocation patterns that _require_ a custom strategy anyway. A distribution of the objects, along with a time distribution, would have been nice to include.


> But then again, I work in a severely memory constrained environment

Then you're likely dealing with a single-threaded application, where dlmalloc will work just fine: no multi-threaded contention for allocation/deallocation.


There is a ton of YMMV with these things. dlmalloc works fine, and so do other general-purpose allocators. My custom allocator is and will always be superior because it takes advantage of the fact that I only allocate from a very small set of known sizes, at very specific points. The allocator does next to nothing during dealloc, except at certain times when it can do things in bulk. I've modeled the allocator so it doesn't have to "be prepared" for anything else. I allocate a large chunk of memory up front which is never released to the OS, so I never have to worry about other threads messing up my memory region.


How about fragmentation?

AFAIK jemalloc is horrible when it comes to memory usage.


What makes you say that? jemalloc's fragmentation is very low [1].

[1]: https://blog.pavlov.net/2007/12/04/vlad-and-analysis-of-dtra...


The pictures there are missing (I get error 400 when I try to download them directly).

After 10 minutes of googling I found this [0] (it shows swap), and this [1] from 2012.

Nothing conclusive, especially considering that all the popular mallocs can be tuned.

[0] https://suniphrase.wordpress.com/2015/10/27/jemalloc-vs-tcma... [1] https://juliank.wordpress.com/2012/05/31/memory-allocators/

As a bonus, I found a Phrack article on jemalloc: http://phrack.org/issues/68/10.html#article


My use of jemalloc in multithreaded servers shows the opposite to be true. Memory use is a little higher than with the single-mutex dlmalloc, but you'd expect this with a modern, high-performance, thread-aware malloc implementation. You think MariaDB would choose jemalloc if it wasn't efficient in memory use?


>> You think MariaDB would choose jemalloc if it wasn't efficient in memory use?

I don't think this is a smart way of choosing an allocator.


Not to be a downer, but this is poor and outdated C programming style.

Don't use sbrk(2) unless you're completely reimplementing the standard library.

Don't use alloca(3) unless you want weird headaches. Just use variable-length arrays, which have been part of C for nearly two decades; they are block-scoped and easier to reason about.
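
For example (a trivial sketch):

    #include <string.h>

    void frob(const char *s) {
        size_t n = strlen(s) + 1;
        char tmp[n];               /* VLA: gone when the block exits */
        memcpy(tmp, s, n);
        /* ... mutate tmp locally ... */
    }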

None of the magic numbers are explained. `(p->si_addr + (16LL<<22)) & ~4095`: where do those come from? I'm guessing 4095 is the page size minus one, but DON'T hardcode that!

Don't assign to lvalues in conditionals! (e.g. `while(j<31 && !(h=free_table[j]))`) It's hard to read and bug-prone.

Don't use `&h[1]` to refer to the space after a structure. I'm pretty sure that's undefined behavior, and it often gives the wrong alignment for whatever you want to put after h. Rather, add a final member of flexible array type to the structure (say, `int data[]` if you're storing ints after it). That is guaranteed to have the semantics you're looking for.

Here's my advice:

1) Just use malloc(3). It's already fast and tuned to many allocation scenarios (including all four in this article!), and your application will continue to reap whatever performance improvements its maintainers make. Use aligned_alloc(3) if you need pages.

2) When working in the embedded world, it's often preferable to preallocate pools as large as your app will ever need for each kind of object you have collections of, and pin them in RAM (not necessarily for performance, but so you know you have the space). If you do so, you can reference the objects by index rather than pointer, which is much faster and more space-efficient, especially on 64-bit architectures (see the sketch after this list).

3) Don't be afraid to realloc(3) to grow arrays. It incurs no asymptotic penalty.

4) Pay attention to data layout, to make sure that commonly-accessed stuff is not interspersed with rarely-accessed stuff. E.g. separate indexes from data where possible, if you tend to traverse the index rather than the data.
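
To illustrate point 2, a minimal sketch of an index-referenced pool (hypothetical names and sizes):

    #include <stdint.h>

    #define MAX_SENSORS 1024

    struct sensor {
        int16_t reading;
        uint16_t next;          /* 2-byte index instead of 8-byte pointer */
    };

    static struct sensor sensors[MAX_SENSORS];  /* preallocated once */

    struct sensor *sensor_at(uint16_t idx) {
        return &sensors[idx];
    }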

(Caveat: all my comments above apply to glibc. YMMV on other systems.)


Don't use `&h[1]` to refer to the space after a structure. I'm pretty sure that's undefined behavior

This is legal. '&h[1]' is a synonym for 'h + 1', thanks to C99 section 6.5.3.2 paragraph 3:

    ... if the operand is the result of a [] operator, neither the & operator
    nor the unary * that is implied by the [] is evaluated and the result is
    as if the & operator were removed and the [] operator were changed to a
    + operator.)
and that is valid thanks to C99 section 6.5.6, paragraphs 7 and 8:

    7. For the purposes of these operators, a pointer to an object that is
    not an element of an array behaves the same as a pointer to the first
    element of an array of length one with the type of the object as its
    element type.
    8. When an expression that has integer type is added to or subtracted
    from a pointer, the result has the type of the pointer operand. [...]
    Moreover, if the expression P points to the last element of an array
    object, the expression (P)+1 points one past the last element of the
    array object [...] If both the pointer operand and the result point
    to elements of the same array object, or one past the last element of
    the array object, the evaluation shall not produce an overflow; otherwise,
    the behavior is undefined. If the result points one past the last element
    of the array object, it shall not be used as the operand of a unary *
    operator that is evaluated.
So if 'h' points to an object, it is entirely legal to compute '&h[1]'; whether you can dereference it is a separate question (and will depend on memory layout, et cetera).


Ah, I forgot that "one past the last element" is excepted. I stand corrected. (Similar tricks, such as &h[-1], are indeed undefined behavior.)

Regardless, my point stands that it can give you the wrong alignment, if the elements you're putting after the header have greater alignment requirements than the header itself. E.g.:

    #include <stdint.h>
    struct foo { uint16_t a,b,c; };
    uint64_t *bar(struct foo *x) { return (uint64_t *) &x[1]; }
Gives you a misaligned pointer to an 8-byte object.

    #include <stdint.h>
    struct foo { uint16_t a,b,c; uint64_t d[]; };
    uint64_t *bar(struct foo *x) { return x->d; }
That gives you a properly aligned pointer.


> Don't use `&h[1]` to refer to the space after a structure. I'm pretty sure that's undefined behavior

Not defending the code, but it isn't undefined. Pointers to "one past the last element of an array" are a special case and have specific/defined behavior in the standard.


You are correct, and I was right to doubt my memory. Though my point about misalignment stands.


I don't mean to be spammy about this, but I feel like I should repeat it. If you like this kind of content, consider joining the SDF. It's a free public access computing community with lots of artists, hackers, and grognards of all stripes. Your support is appreciated: https://sdf.org/


sizeof yields size_t, so I'm not sure why he's using `long long`. To be precise, sizeof isn't a function, so it doesn't "return" anything; I guess it "yields" a size_t.



