This vaguely reminded me of the time when our self-hosted installation of a JVM-based app worked fine in DEV and QA (on VMware-based VMs), but in PROD it would error out all the time with "out of memory", even though the machines (bare metal) had far more RAM.
It turned out they also had lots more CPU cores (they were repurposed former DB machines), and by default the memory manager (of the JVM, I suppose?) scaled the allocation block size with the number of CPUs, so it tried to allocate far more memory than necessary, and since this was still a 32-bit JVM it quickly ran out of space for its allocations.
Once we found out what the issue was, it was pretty easy to tune with an environment variable.
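In case anyone hits the same thing: my best guess in hindsight is that the culprit was glibc's per-thread malloc arenas, which by default scale with the CPU count, rather than anything JVM-specific. A minimal sketch of capping them from code, assuming that was indeed the knob (the environment-variable route we used would be the equivalent MALLOC_ARENA_MAX setting):

    /* Sketch: cap glibc malloc arenas so they stop scaling with core count.
       Assumes the problem was glibc's per-thread arenas; for a JVM you would
       set the equivalent MALLOC_ARENA_MAX environment variable instead. */
    #include <malloc.h>   /* glibc-specific: mallopt, M_ARENA_MAX */

    int main(void) {
        mallopt(M_ARENA_MAX, 2);   /* at most two arenas, regardless of CPUs */
        /* ... rest of the application ... */
        return 0;
    }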
Off topic, so apologies in advance. Why do environments so often get capitalised like that? Prod isn’t an acronym, it’s an abbreviation of Production. Similarly Dev is an abbreviation of Development.
I know in the grand scheme of things it’s meaningless, this is just one of those things that’s like someone scratching their fingers down a blackboard when I see it day to day.
I don't know, but I see unnecessary all caps in other words too - it's not infrequent that I see JAVA and sometimes RUST (for the programming languages). Kind of irks me, but obviously doesn't really matter. Ultimately I think it's just a weird niche cultural holdover from the very early days of computing, when lots of random stuff was arbitrarily capitalized. FORTRAN, for instance, was originally all caps, maybe because early character sets only had capital letters.
This sort of thing has long annoyed me too when I see it, so thank you for stepping into the firing line for at least the both of us. See also, "FED" for Federal Reserve.
I'm trying to train myself out of this nonsense, but some habits die hard.
The worst example of this is VISA instead of visa when talking about international travel as opposed to credit/debit cards. (Hats off though to whoever decided on that marketing name...)
My first guess for that particular case is that PROD and DEV are often either parts of hostnames for those particular machines or are referring to hostnames, which are often written in all caps.
Years ago, I had a non-technical stakeholder on a project who always capitalized it as PROD, and I made fun of him behind his back, so now I say it ironically... or at least I hope so. There was a time when I said "bro" ironically too... irony is truly a gateway drug.
It might be a leftover from the time when we had text-only docs (no Courier) and the use of `backticks` was not commonplace. With those restrictions, ALL_CAPS would be a way to signal “this is a name that has meaning for machines”, simply because humans don’t use that convention.
Is production going to be the name of a variable or a value? Given that you can't be running in more than one place at once, I don't see why you'd have an env var like PROD=true instead of something like STAGE=prod, since otherwise you need to decide what to do if you have both PROD=true and DEV=true. Environment variable values certainly don't have to be all caps, so I don't think your explanation makes sense.
"PROD" is an acronym (a shortened form of a more formal name for something). What it isn't is an initialism (a word coined by taking the first initial of a descriptive phrase). Traditionally, both initialisms (eg. CPU) and acronyms (eg. NORAD) use all majescule typography. Sometimes the coined words become common usages and are commonly rendered in capital or miniscule case as appropriate (eg. radar, scuba).
This is, of course, descriptive typography. Prescriptive typography will vary from style manual to style manual.
I think you're getting the definition of acronym mixed up with that of abbreviation. An acronym is an initialism that is pronounced as if it is a natural word. For example, NASA is both an initialism and an acronym. FBI is an initialism, but not an acronym.
Well that's the dictionary definition anyway. In practical terms, I think most people use acronym as a synonym of initialism, at least in the US. This is one of the places where the dictionary definition doesn't match with real world usage.
32 bit and high core count is definitely a nasty combo. Many applications spawn 2-4 threads per core which is a great way to eat up that address space.
But don't all threads of a process share mostly the same address space?
I could imagine that starting many processes would exhaust address space, or allocating a lot of virtual memory / mmap without committing.
That's one of the fun bits: separate processes don't share a memory space. That's one reason you can have 16GiB of RAM on a 32-bit x86 server without much issue. This was done on x86 through a tech called PAE (Physical Address Extension) that let the OS see the extra physical memory and hand it out, but any one process was still limited to 4GiB (usually lower due to the kernel/userspace split) because of 32-bit pointers. So if those threads all allocated memory it was easier to hit the limit, but if you used separate processes then they could each reach 4GiB (or whatever the limit is) independently without causing problems.
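To make that concrete, here's a tiny hypothetical demo (names and sizes are made up): built 32-bit, the threads below all compete for the same roughly 3 GiB of user address space, while running the same work as separate processes would give each its own.

    /* Threads share one address space; on a 32-bit build a few large
       allocations exhaust it, while separate processes each get their own. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NTHREADS 8
    #define CHUNK (512UL * 1024 * 1024)   /* 512 MiB per thread, illustrative */

    static void *grab(void *arg) {
        void *p = malloc(CHUNK);
        printf("thread %ld: %s\n", (long)arg, p ? "got 512 MiB" : "allocation failed");
        return p;   /* deliberately never freed; we only care about address space */
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, grab, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;   /* on 64-bit this all succeeds; the squeeze is 32-bit only */
    }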
Interesting, we had very similar issues with malloc and RocksDB on a production instance, ended up switching to jemalloc and the problem went away. Cloudflare had some similar issues with it as well: https://blog.cloudflare.com/the-effect-of-switching-to-tcmal...
We once ran into a very similar issue, but using the old glibc allocator in bionic. For us the difference was between JVM versions/vendors. The arenas in glibc each allocate memory based on which thread they are in. Because threads in Java are somewhat ubiquitous, this caused massive "memory leaks" (read: allocated memory that wasn't being used), as every arena was getting used, which caused them all to build up their buffers. This in turn caused a large increase in AWS bills, as we had to increase the memory allocated to avoid the OOM killer.
We ended up moving to tcmalloc, mostly because the Debian packages worked the best for us. We did testing between the different malloc replacements, and honestly, there wasn't a huge difference between them for our workloads, although they _did_ have differences, with minor variations in peak/avg memory usage and CPU usage.
It was frustrating that the article ended with no real resolution. It was posted almost a year and a half ago; I wonder if there actually is some follow-up somewhere.
TLDR glibc’s malloc doesn’t work well for server workloads. Nice deep dive into the available options to trim it, but the FUD around switching to something like mimalloc or the most recent tcmalloc is unwarranted considering both have seen large-scale deployments and have sane defaults out of the gate that don’t need tuning (and any tuning you do for glibc would need similar validation requirements).
I assume what the GP actually meant was that glibc's allocator isn't tuned for any specific workloads. It tries to be a jack-of-all-trades, and in doing so, just ends up being okay-ish.
As for what allocator you should be using, it depends on your workload. You'd need to test out some different options to see what works best. And "best" might also be something you have to define. Maybe you're ok with higher baseline memory usage if allocations are faster. Or maybe you want the lower memory usage.
Actually, mimalloc and tcmalloc both show that across almost any workload they outperform glibc. It’s difficult to outperform everywhere of course, but in terms of “works best without knowing the workload a priori”, tcmalloc and mimalloc should handily trounce glibc. Now if you know your workload and start tuning glibc, maybe it can outperform? Not sure, but I’d also be skeptical. Tcmalloc and mimalloc both apply really advanced techniques that also require newer kernel support to implement user-side RCU iirc. Glibc’s allocator by comparison is much longer in the tooth and just can’t compete on that.
Most workloads don’t care of course, but glibc’s penchant for hanging onto RAM when not needed is one very user visible consequence of this.
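If you're stuck on glibc and just want the RAM back, the blunt instrument is malloc_trim; a minimal sketch (glibc-specific, and how much it actually returns depends on fragmentation):

    /* glibc-specific: ask malloc to hand free heap pages back to the OS.
       Useful after a burst of allocations has been freed but RSS stays high. */
    #include <malloc.h>

    void release_unused_memory(void) {
        malloc_trim(0);   /* pad = 0: keep no extra slack at the top of the heap */
    }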
Tcmalloc (not the gperftools variant but the new one) and mimalloc would be the 2 I would try first.
I am honestly surprised no one has bothered replacing the system default for distros as a better default allocator would free up RAM since both tcmalloc and mimalloc do a good job of knowing when to release memory back to the OS (not to mention that they’re generally faster allocators anyway).
In the early days, Java’s allocator was quite a bit faster than C++’s, in part because C++ memory allocation was not fully concurrent, making allocation part of the sequential element of Amdahl’s law.
Where we are now is better, but has its own problems.
Not really. Lots of Rust users come from high-level languages, so when people come in and complain about Rust being unexpectedly slow, allocation issues are in the top 5 at least. It’s incredible how shit the standard allocators are in almost all systems.
Meh. Users not used to tracking allocations keep to the same regime (lots of allocations), but because the platform allocators are utterly terrible, the allocation overhead is orders of magnitude higher than in even a relatively basic runtime, thus the program is dog slow.
This is not inevitable, platforms could provide allocators which are less awful. Obviously they can’t be as fast as specialised runtime facilities but when you see the gains many applications get by just swapping in jemalloc or some such…
> the platform allocators are utterly terrible the allocation overhead is orders of magnitude higher than in even a relatively basic runtime, thus the program is dog slow
Huh. I’ve seen lots of people write slow Rust code because they didn’t realize there were allocations. But the complaint is usually “my rust program is only 5x faster than my equivalent python program instead of 100x how come”?
Can you point out some good blog posts about Rust or C++ being slower than a “basic runtime” due to inferior allocator?
I would simplify it and say it's an issue with developers who don't understand manual memory management (allocating memory pools up front is just 1 such strategy for manual memory management).
This is akin to the idea that a compiler will eventually solve performance problems.
We're still waiting on that to happen.
There's a difference between a developer who understands manual memory management choosing to use a tech that does it for you and a developer choosing to use a tech that does it for you so they don't have to learn manual memory management.
Manual memory management is table stakes for any halfway competent developer.
Otherwise known as Victim Blaming. C developers are that old guy who white knuckles everything and calls people weak for wanting help.
Many of the features in modern programs didn’t exist in the 90’s not because they hadn’t been thought of but because people spent all their energy getting the first fifty features to work reliably. Every app had a couple cool features. Most of them are de rigueur now because we can.
If rust isn’t doing escape analysis for stack vs heap allocation then what is even the fucking point of this language? I would have thought that was the first thing implemented.
Rust doesn’t allocate anything on the heap unless you tell it to. When you tell it to, it puts it on the heap. The target use case is as a systems language.
Escape analysis as you’re alluding to isn’t needed in this model because the number of times this helps you (i.e. you put it on the heap but the compiler can figure out it can live on the stack) is about 0. You need escape analysis in managed-memory languages where everything is nominally a heap allocation and the compiler is responsible for clawing back performance through escape analysis.
Rust used to use jemalloc, which was arguably the better default. For some reason they switched it to the system allocator when they added support for specifying the global allocator.
jemalloc was not the default on every platform, but was on some. It adds a non-trivial amount of binary size to every program, even ones that don't need the performance boost, and that is an area Rust is often criticized vs C.
The number of arguments I've gotten into with other engineers that "manual memory allocation is deterministic, GCs are not!" is too high.
First off, GCs are quite deterministic, unless they are calling rand() someplace. I've walked through GC code under controlled conditions with identical repeated allocations, GCs do the exact same thing under the exact same circumstances every time.
Second off, you have to pay the price of allocation and deal with memory fragmentation somewhere. GCs typically pay the price at the tail end (deallocation, compaction) but have absurdly fast allocators (a handful of instructions), whereas manual memory allocators pay the price for allocation (finding a slab of memory) up front, and unless you know how to work around the details of your allocator, you end up dealing with fragmentation yourself.
FWIW this also means many, many, types of applications can gain a boost from intelligent memory allocation strategies. There is the famous example of the cruise missile that never worried about deallocation because it would explode before it ran out of memory, but scenarios involving bespoke memory management being better than "just use libc!" are actually quite common.
Have a stateless microservice? If you know the max size of memory it'll use, for each connection that comes in just allocate a slab of memory, and when the connection closes free the entire slab; far more efficient than using a GC or any standard allocator. Also, having to calculate the max amount of memory a single connection can use is a great way to figure out how much load an instance can handle (depending on whether you are CPU bound or not).
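A minimal sketch of that per-connection slab, assuming you've worked out the upper bound; all the names here are made up for illustration:

    /* Per-connection arena: grab one slab up front, bump-allocate from it,
       and throw the whole thing away when the connection closes. */
    #include <stddef.h>
    #include <stdlib.h>

    typedef struct {
        char  *base;
        size_t used;
        size_t cap;
    } arena_t;

    static arena_t arena_open(size_t max_bytes) {      /* connection accepted */
        arena_t a = { malloc(max_bytes), 0, max_bytes };
        return a;
    }

    static void *arena_alloc(arena_t *a, size_t n) {    /* per-request allocations */
        n = (n + 15) & ~(size_t)15;                      /* keep 16-byte alignment */
        if (a->base == NULL || a->used + n > a->cap)
            return NULL;                                 /* over the budget you computed */
        void *p = a->base + a->used;
        a->used += n;
        return p;
    }

    static void arena_close(arena_t *a) {                /* connection closed */
        free(a->base);                                   /* one free for everything */
        a->base = NULL;
        a->used = a->cap = 0;
    }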
For certain stages of compilation, compilers can get away with not freeing memory at all and just exiting, letting the OS reclaim everything. (Of course that's less doable nowadays with compilers running as background services recompiling code as it is typed, and whole-program optimization has made it so compilers use a lot more memory, so I don't know if this strategy is still in use in these modern times!)
The tl;dr is that freeing memory is complicated no matter if you have a GC or not. It is something every developer should be thinking about from time to time.
I thought that GCs are often run in their own thread, and because scheduling at the OS layer is seen as non-deterministic from the application's viewpoint (you never know when your own threads are going to get stopped/started), you can't control exactly when the GC is run?
Is that wrong? Or obsolete? Or GC-dependent?
But also, just because something is theoretically deterministic (e.g. it might technically be possible to work out when a GC sweep will run, and multiple runs of the exact same program will cause the exact same set of GC sweeps), actually working out when a GC sweep will run is so non-trivial that the only way to do it is to run the program and watch it happen. It also becomes impossible to predict how any change to a program will affect how the GC runs as a result (other than making the change and running it).
Or are there ways to deduce when GC sweeps will run for a given program, without actually running it?
Though most GCs are threaded, so there is always a level of nondeterminism there. That being said, the same can be said of any application. Unless you are doing actual real-time programming you always have at least a little nondeterminism in how your app will behave based on the rest of what's going on in the OS.
> you can't control exactly when the GC is run?
Depends. A number of SDKs will expose a "trigger the GC" API, though that's generally seen as an anti-pattern.
But speaking of GCs, there are roughly two big categories of GC. The first does a full stop-the-world while the GC is running; the second has the GC running concurrently with the application.
For a full stop-the-world GC, it is fairly predictable how it will behave. GC is pretty much always triggered on an allocation failure. For full stop-the-world GCs, that will involve tracing through the live memory roots and somehow marking what's still used and discarding what isn't. (State of the art, AFAIK, is a moving collector where live data is moved to a new memory region. This has the advantage of always compacting the data and can (but isn't always) be pretty cache friendly, as related data is very often colocated.)
For the second type, generally what's happening is some signal is sent that indicates it's time to collect memory. The world stops for that signal to propagate, but after that happens the app is allowed to continue to run while the garbage collector runs on a parallel thread. The whole app is only completely stopped when you run into a "you are still collecting and I want to allocate, but there's no space" sort of scenario. With this second type of collector it's common for extra GC logic (barriers) to be hoisted into the application when memory is accessed, on the reasoning that "obviously this is still live, so do the GC work as part of the regular app work". This sort of collector would be the hardest to reason about determinism-wise, as what happens when you access a memory location is determined by where the collector currently is and what the application is doing.
The first type of collector can be seen in the JVM's Parallel and Serial collectors.
The second type can be seen in the JVM's G1GC (partially), ZGC, and Shenandoah. It's also fairly closely related to how Go's GC works and how (AFAIK) JavaScript's GC works.
There is one other semi-common category of GC: reference counting. There you have a VERY high level of determinism, as there are no side threads and collection happens purely based on application actions and not any sort of opaque memory investigation. However, RC has its own set of issues that tracing collectors do not.
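A toy illustration of why RC feels so deterministic: the deallocation happens at the exact release that drops the count to zero, on the application's own call stack (everything below is made up, and deliberately ignores thread safety, which is one of those issues):

    /* Toy refcounting: the free happens deterministically at the last release. */
    #include <stdlib.h>

    typedef struct {
        int refs;
        /* ... payload ... */
    } rc_obj;

    static rc_obj *rc_new(void) {
        rc_obj *o = calloc(1, sizeof *o);
        if (o) o->refs = 1;
        return o;
    }

    static rc_obj *rc_retain(rc_obj *o) { o->refs++; return o; }

    static void rc_release(rc_obj *o) {
        if (--o->refs == 0)
            free(o);   /* no collector thread, no pause: it dies right here */
    }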
> GCs typically pay the price at the tail end (deallocation, compaction) but have absurdly fast allocators (a handful of instructions)
I don't know how GCs work. Don't they still need to find free memory somehow during allocation? I always thought of GCs as manual memory allocators + automatic deallocations
For allocations, GCs are basically bump allocators. Asking for more memory is just a factor of advancing a pointer based on how much memory is needed. The only thing they have in addition to the bumping is a check to make sure the bump won't push them past what's available. That's the condition that will usually trigger a partial GC (a minor collection).
There are some GCs that are much more complicated than that. However, typically, especially for new small allocations, it's just a bump allocator. For larger allocations or when talking about being a generational collector that's when things get more complicated.
For JVM's GCs, most of the collectors are "compacting" which means that after a collection the region collected is left with all the live data unfragmented. That means allocating to a given region is simply a process of keeping track of the pointer for that region and bumping it when something needs to go there.
The JVM did have the now-defunct (thank goodness) CMS collector, which would still compact the minor collection regions, but for the old heap regions it'd maintain a skip list. It would only compact the old gen region when, after a major collection, it still failed an allocation. In that case it'd take the swiss-cheese old gen and squish it together.
Go does not have a moving collector, so I'm guessing it is doing something like arena allocation (perhaps just relying on jemalloc and calling free at the right times?). In which case Go's allocation speed will be around C's speed. They could be doing something different, but I didn't stumble across that while googling.
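To make the "handful of instructions" point concrete, here's roughly what a nursery-style bump allocator boils down to; this is a sketch in plain C, not anything lifted from a real VM:

    /* Bump allocation: advance a pointer and check the limit. A failed check
       is what would kick off a minor collection in a real compacting GC. */
    #include <stddef.h>

    typedef struct {
        char *next;    /* current bump pointer          */
        char *limit;   /* end of the (compacted) region */
    } nursery;

    static void *gc_alloc(nursery *n, size_t bytes) {
        bytes = (bytes + 7) & ~(size_t)7;   /* 8-byte alignment */
        if (n->next + bytes > n->limit)
            return NULL;                     /* slow path: collect, then retry */
        void *p = n->next;
        n->next += bytes;
        return p;
    }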
Well, Linus told you ages ago that the "fundamental design principle of Linux is to have fun".
So they change stuff like this for fun. Isn't it fun for you too?
And I will stick to FreeBSD, thank you.
And yes, I understand that technically glibc is not a part of "Linux", but...