Hacker News new | past | comments | ask | show | jobs | submit login
The effect of switching to TCMalloc on RocksDB memory use (cloudflare.com)
111 points by g0xA52A2A on Feb 3, 2021 | hide | past | favorite | 48 comments



I wrote a commercial memory allocator (long long gone) for pre MacOS 10 and tested it by writing a ton of memory bashing applications based on every perverse pattern I could think of in order to ensure it was fast, failure free and limited fragmentation. Of course from that era I didn't need to support multiple threads so that made it somewhat easier, but in building it (and studying others at the time) it was obvious that this is far from a solved science, and there were terrible design decisions in many common allocators at the time that made little sense.

It clear that some of these decisions still hang around today and that switching to a better allocator really can make a huge difference, especially given everything is multiple threads/cores today. But a lot of programmers never even look at memory usage or investigate alternatives. Even the JVM has a boatload of options on allocators for different usages; yet most Java programmers I know just go with the defaults.

Computers may be ridiculously fast and have enormous memory today and maybe you can live with the default; but it doesn't hurt to look, and might even save you money (pay less to AWS!) by reducing your need to allocate more or bigger servers.


It's interesting that I saw this post minutes after watching the talk "What's a Memory Allocator Anyway?" by Benjamin Feng. It's ostensibly about Zig's allocation schemes, but it's more of a general overview of the varied (single-thread) allocation strategies in use today. It made me really want to try out arena allocators

It's a quick watch at 1.5x speed, the last third is an interview.

1. https://youtu.be/vHWiDx_l4V0


> Computers may be ridiculously fast and have enormous memory today and maybe you can live with the default

Often these hardware improvements exacerbate problems with memory management. For years the runtime of the default JVM garbage collector would blow up past 48 GB of allocations. Which was probably fine for 99% of software out there, but the day your honking great server blew past that... was a bad day all round.

It pays to know your allocators and garbage collectors inside and out for modern cloud scale computing.


Yeah, it's a huge probelm with Cassandra.

The 4.0 version of cassandra is finally doing the smart thing and spinning up separate threads dedicated to sets/shards of hash ranges held by a node. I'm not sure if they'll run different JVMs per thread as well to do further sharding of the GC generations as well or if they can effectively do that from one JVM, but that would make sense


I'm not quite sure why, but Java/JVM seems to be uniquely bad with regards to memory usage. Nothing else seems to use quite so much memory. It's expected that C/C++/Rust would be better, but similar languages like C# also seems much better. Even Python/Ruby/JavaScript don't seem as bad.


Is heap management that far from the algorithmic headaches of Garbage Collection? Compiled languages typically are "arena allocators" by default and probably don't hit these similar heap management/paring/centralizing/compaction problems until they do a ton of dynamic datastructure allocations and manipulations.

... and they get to punt to the OS too.

I'd certainly believe this isn't a "solved" problem.


I've spent some time really studying memory allocation. It seems to me that almost any high-performance application will take advantage of the fine-details of memory allocators to optimize themselves.

There's something to be said about garbage collection as an actual strategy (!!). I know people think that garbage collection is slow, but... very smart people have put their minds on the memory problem.

A generational garbage collector is designed to tackle fragmentation head-on. Every "generation", the garbage collector recompacts all the data, "squeezing" your fragmentation out of the memory area.

Because some data is "longer lived" than other data, the compacted data is placed into a new generational-pool, where it is left alone for the next collection. IE: All data that survived a collection at time T will be left alone at collect time T+1. Maybe at time T+2 will it be re-checked... and those that survive the check at T+2 will be left alone until T+6.

--------------

This generational strategy minimizes the copying / moving of data around the fragmentation issue. Still though: moving / copying data is the singular catch-all solution to fragmentation.


This sort of thing is why I'm generally not impressed by the perennial "GC vs. manual memory management" debate. Manual memory management usually isn't. Simply using "malloc" isn't manual memory management, there's still a lot of stuff going on behind the scenes that may or may not match your program's use case. Both GC and "manual" memory management aren't single points, they're suites of options, which you can often mix and match beyond that. Statements like "Manual memory management is better than GC" is roughly like saying "Pie tastes better than cake!"... it doesn't even really make sense, there's so many pies and so many cakes that there's no way to universally order them like that, even for a specific person; add in the subjectivity of taste (in this metaphor, the different needs various programs have of memory in the first place) and it's beyond absurd.

(To my mind, "manual memory management" is when you get a big slab of memory from something, doesn't much matter what, and everything else is managed in your application code. If you're not doing that, you've got some sort of "automation" involved, and that automation can mismatch your program's memory needs every bit as much as a "GC" can.)


> "Pie tastes better than cake!"

Indeed. And is Cheesecake a pie or a cake? You bake cheesecake in pie-crust, but it has cake in the name.

The memory-management "its both" scheme is reference-counting. In some contexts, its ref-counts are considered manual (C++ shared_ptr, Rust). In other contexts, its considered automated garbage collection (Python, Lisp).

In all cases: tons of atomic add/subtract + memory-barriers to strictly order reference counts between many threads... making it actually kind of bad for multithreaded code compared to other garbage collection schemes. (Each shared_ptr<> = blah() requires an atomic add and atomic-subtract. So you actually have escalating costs the more you use the shared_ptr). I think the circular-issue is kind of overblown: it certainly happens (doubly-linked lists, graphs, trees which store a "root pointer" or even an "iterator" somewhere inside of them) but people know how to use weak_ptr<> these days and avoid most issues. (Well... except for arbitrary graphs. No way to know where to put weak_ptr<> there. But most "graph heavy" algorithms use other representations: dense matrix or sparse matrix forms... and probably should avoid slow / low-performance general purpose memory allocators anyway)

-----------

Once you recognize the multithreaded-sync issues associated with shared_ptr<> / RefCount, you start building a thread that "centralizes" the ref-counts in a reader/writer queue to ensure that objects are deleted at the right time. Oh wait, that's called a garbage collector thread and you've suddenly moved to mark&sweep accidentally.

Once you get a garbage collector thread, its not too big of a leap to start thinking about multiple garbage-collector threads (scaling to higher garbage collection performance). I mean, yeah, I'm glossing over all sorts of high-performance multithreaded lock-free datastructures here, but lets pretend those implementation details are solved. Lol.

-----------

As usual: its not really the garbage collection or memory-management parts that are hard. Its the multithreaded high-performance (and provably correct!) portion that's really, really hard. Even for simple ideas like ref-counts or even "manual" malloc/free.


"And is Cheesecake a pie or a cake? You bake cheesecake in pie-crust, but it has cake in the name."

I tried to follow the rest of your comment but I was too disturbed by you thinking all cheesecake is baked in pie-crust.


https://www.target.com/p/keebler-graham-cracker-pie-crust-9-...

There's literally a cheesecake on the "serving suggestion" image for the pie crust. I can't be the only one who makes cheesecakes on these...


The keyword in my comment was "all". There are types of cheesecake baked in pie crust but it is far from all cheesecakes. Certainly far enough from far to be posing the original question.


I read such a statement from Andras Kovacs re: compilers https://www.reddit.com/r/haskell/comments/gok70o/simple_hask...

One needs different allocation strategies for immutable (and strict/lazy) languages.


Rust is probably close to a good idea: have the programmer give directions as to the lifetime/behavior of the data and allocated memory, rather than having to guess.

Compilers might be able to do some of that, but at least the standard libraries should be able to do the hints, and almost all dynamic allocations on day-to-day code uses dynamic structures in standard libraries like hashmaps and variable-length lists/vectors/arrays.


This problem hit the Ruby community hard about 5 years ago.

Basically everyone uses jemalloc or tcmalloc in production in the Ruby community nowadays. What frustrates me is the malloc community's sort of blase reaction along the lines of "there's no bug, this is expected behavior".


Why try to optimize for every cases possible ? Use the right tool for the right job.


"multihreaded programs" seems like a pretty big use case, and it seems to occur in just about every major multithreaded C program using glibc malloc.

I would agree with you if it didn't seem like other allocators are solving this without any tradeoffs.


This is really a thing which happens with the glibc allocator and long-lived applications. Back in the day I had a custom chat server which consistently used more and more memory until after about a month it ran out of memory and crashed. I spent quite a while investigating, even annotating every malloc/free in the program to find the leak, only to conclude that there was simply no leak at all in the program and it was some effect in the allocator itself. (Rather than spending time trying other allocators, the "solution" was to schedule a server restart every couple of weeks, early in the morning when no one was using the system.)


for long-running daemon/server programs where reliability is a key priority, I have had a lot of success building in "partial" restarts into the design, which means that periodically the program exits the main event loop and enters it again.

clearly there are circumstances where this is not an option, but for distributed systems that can handle the short-term loss of a given node it works very well. one important thing is to add is some randomized jitter to the periodic restarts to spread out when nodes are offline.

in practice this remedies a wide range of seldom-encountered problems. for instance, I have a program that pulls data from an http apis, and last week its connection to one of the remote servers became hung up somehow, resulting in no new data being pulled from that server. this problem had never happened before in 6 months since the program was deployed. rather than track down the very rare condition that was behind this I just added periodic "partial" restarts to the program, and now it will be able to recover from this if it ever happened again, and many other sorts of problems like slowing increasing memory fragmentation.

for a chat server or other server with client connections, obviously those would need to be maintained across the partial restart, but that seems like it would be pretty easy to do. while there is nothing dramatically different from my approach to what you did with actual, full periodic restarts, the partial restart approach does allow special handling for application-specific issues like that.


This is very common in long-lived Ruby applications as well, where jemalloc seems to be the accepted solution.


I've seen this in python as well. gunicorn has two helpful flags to do this automatically (--max-requests and --max-requests-jitter) which will reboot every X number of web requests handled.


An important part of glibc malloc design is that it expects developers to free memory in a reverse order of allocation, otherwise a lot of memory will be ‘locked’, and never returned to the system.

I'm a C novice and generally inexperienced with languages that require memory management. Nobody ever tells you this stuff!

Is this a very specific GNU concern, or do I have to keep this in mind in general on other systems when using their default allocators?


I think, this particular part of the design is here for historical reasons. The classic way of allocating memory in Unix systems was heap extension with brk/sbrk calls, so glibc malloc from beginning was designed using these syscalls and logic. So with this mechanism you need to free last allocated object to shrink the heap. Now, allocators often get new memory from system using mmap syscall and probably have different expectations.


Defragmentation is one benefit of garbage collected environments I feel like you don't see mentioned often. Every collector I can think of is capable of compacting its heap so that you are less likely to suffer from fragmentation overhead, and it happens as part of the regular GC flow (though typically not the 'fast' collections) instead of being something you have to go out of your way to trigger at ideal times.

Obviously there are downsides to GC, but it's interesting to see how in this scenario malloc+free had the same "uses way more memory than strictly necessary" problem that people usually pin on garbage collectors.


I remember this hit the Varnish Cache project about 7-8 years ago.

We've spent endless hours trying to identify leaks before we finally tried a few alternative allocators. We switched to jemalloc and the positive effects where huge.

Its a bit weird that this is still an issue.


Another option is explicit memory pools when some classes of allocation have well-defined lifetimes. Freeing an entire pool is instant compared to either free()/delete or GC and produces zero fragmentation.

For example, build parse trees in explicit pools and deallocate the whole tree and associated allocations when done with it.

Language support for this is fairly poor or tricky in OO languages AFAIK; but I'm mostly familiar with C++ where destructors and lack of allocator scoping (until C++11, but optional) make this a pain. The language needs support for formally and transitively abandoning objects (or not caring, like C) to support efficient use of memory pools.


It would be interesting to see how jemalloc performs compared to tcmalloc. The article doesn't explain why they choose tcmalloc among other alternative memory allocators.


I've mentioned at the beginning of this article, that we already have several systems running with tcmalloc. So using jemalloc would mean that we would need to bring new dependency. Tcmalloc solves fragmentation issue really well for us, we can get estimation of how much memory is 'lost' due to fragmentation. We saw that potential saving from let's say jemalloc, cannot be higher than 1-3% of memory use and decided it's not worth to bring new dependency for so little value.


Microsoft's mimalloc looks interesting also: https://github.com/microsoft/mimalloc

HN discussion from 2019: https://news.ycombinator.com/item?id=20249743


For sure mimalloc is interesting!

At work we've been able to use mimalloc to great effect, using it as a drop-in replacement for the standard msvc allocator showed 3x runtime (!) improvements on multi-minute workloads. In terms of memory use we also saw improvements, but less dramatic.

Of course these numbers mean nothing without context (and there was much more heap allocation going on than needed), but getting that kind of performance improvement for (comparatively) such little effort really changed my view on memory and optimization.

Not the most user-friendly library though, perhaps that is a result of its selling point: 0 code changes needed.


The blog post sadly doesn't cover what alternatives are out there, and why they settled on their choice.

Besides tcmalloc, there are jmalloc, hoard, the more recent mimalloc, a bunch of commercial options, ...

Would have been interesting to see a comparison.

Edit: background is in this thread: https://news.ycombinator.com/item?id=26013230


The best allocator for this kind of problem would of course be a compacting GC. First it would be much faster, it would be safer, and easier to use.

For my bigger libraries glibc malloc and free has insane runtimes. Eg a final free could last over 1 minute. If you see this with a GC you would throw it out immediately. But apparently people don't measure and spread false rumors about GC's. Even a very poor and primitive GC as in emacs is not that slow as glibc malloc/free. A good one is miles faster. My GC has an upper limit of 10ms, not minutes.


Do you think any of these stand a chance at replacing Doug Lea's malloc in glibc, or does dlmalloc provide more desirable behavior for the general case?


The FSF requires copyright assignment for code added to glibc, which creates an impediment to including code from other projects.


since each memory request chooses the best fit fragment from an individual arena and not the best fit fragment overall

I vaguely remember when working with the DOS memory allocator many years ago, you could choose between the first/next/best/worst fit strategies, and best fit would tend to produce the worst fragmentation (creating lots of tiny fragments); but as others have noted, it highly depends on the allocation pattern of the application.


Lol 15years ago I fixed the same problem in some long running memory hungry multithreaded c++ application in same way .. and i was wondering why such hack was necessary and not default.. i remember reading some guide in some Google blog which was already dated by then... 15 years and not yet fixed(defaulted) ? Btw I'm not using c++ anymore



The author notes that they could have decreased MALLOC_ARENA_MAX but it would have decreased throughput. But how much does switching to TCMalloc affect the throughput of the application compared to decreasing MALLOC_ARENA_MAX? (And if the answer is that it's insignificant, why isn't it the default?)


The main problem with this approach for us is not throughput, MALLOC_ARENA_MAX is a maximum number of arenas per core, and our new servers have 48 cores, meaning we would still have a huge number of arenas and fragmentation problem. Yes, it will be a bit better but not even close to tcmalloc. There is a good post by Heroku with some tests of different MALLOC_ARENA_MAX, that can give more insight on performance hit. https://devcenter.heroku.com/articles/testing-cedar-14-memor...


Why don’t things like malloc get fixed? I’ve known about these issues for years. Does the core team not know?


The FSF requires copyright assignment for code added to glibc, which makes it difficult to add code from other projects.


What if you asked those other projects? I find it tough to believe nobody would contribute. It’s not like the other mallocs are for profit products.


Unless those projects have required copyright assignments (neither does, to my knowledge), then you’d need to get agreement from the copyright holder for every contribution that still has code in the project. If there are any holdouts (or even any people you just couldn’t get in touch with) then you’d need to remove the parts they contributed to proceed.

More practically, anyone who was asked could reasonably respond “Why do you need to own the copyright? It’s already permissively licensed (in jemalloc’s case, with an LGPLv2.1-compatible license) - just use it!” FSF policy is the only obstacle, and they could make an exception if they wanted to.


Because there's no such thing as a truly general purpose memory allocator.

Memory allocation is built to some purpose, and the programmer who uses it should be aware of its problems. There's probably a technical reason why the original glibc malloc does what it does (probably better optimized for low-thread counts)


What you say is technically true but can be used to just hand wave away any issue. Things can almost always be improved.

Allocators like TCMalloc are not significantly worse in single threaded mode. The glibc allocator just isn’t very good.

I have personally fought with the glibc allocator and seen issues with it that don’t exist on other OSes.


In FreeBSD, the malloc libc uses is jemalloc. Works fine for a general-purpose allocator.


Historically the glibc allocator had a number of hooks as well as support for unexec(). I think unexec() has been killed (thankfully) but the glibc allocator API is still not exactly equal to other allocator APIs.


I'm not aware of any production use cases where tcmalloc would be worse than glibc malloc in a way that end users would actually feel. tcmalloc and jemalloc are used all over the place in all sorts of different applications.

If glibc malloc is so bad at something like this I find it hard to believe it would be great at anything important if jemalloc/tcmalloc are not.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: