You can also optimize memory and CPU management through Linux control groups. Oracle published a pretty good description (see "Example 1: NUMA Pinning") of how to assign dedicated CPUs and memory to a process or group of processes [1], and you can also read about the supporting cpuset and memory cgroup subsystems [2, 3].
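For a taste of it without reading the whole article: the cgroup-v1 cpuset controller is driven entirely through plain files, so a minimal sketch in C is just file writes. (The group name "mygroup" and the CPU/node numbers below are made up for illustration; this assumes the cpuset hierarchy is mounted at /sys/fs/cgroup/cpuset and that you run as root.)

    /* Sketch: confine the current process to CPUs 0-3 and the memory
     * of NUMA node 0 via the cgroup-v1 cpuset controller. The group
     * name "mygroup" is invented for illustration; run as root. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void write_file(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); exit(1); }
        fputs(value, f);
        fclose(f);
    }

    int main(void)
    {
        /* Create the group; fails harmlessly if it already exists. */
        mkdir("/sys/fs/cgroup/cpuset/mygroup", 0755);

        /* Dedicate CPUs 0-3 and node 0's memory to the group. */
        write_file("/sys/fs/cgroup/cpuset/mygroup/cpuset.cpus", "0-3");
        write_file("/sys/fs/cgroup/cpuset/mygroup/cpuset.mems", "0");

        /* Move this process (and its future children) into the group. */
        char pid[32];
        snprintf(pid, sizeof pid, "%d", (int)getpid());
        write_file("/sys/fs/cgroup/cpuset/mygroup/tasks", pid);
        return 0;
    }

The same writes can of course be done with echo from a shell; the point is just that there's no magic API, only files.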
Very interesting note about the reclaiming. Yet another caveat to watch for when using a NUMA system transparently.
NUMA can be a real pain. You can take a 40% hit on remote memory access, and far worse if you're modifying a cache line held by another processor. On one of our VoIP workloads, we saw a major (250%+) increase in performance and CPU stability after splitting a very thread-intensive process into multiple processes, each with its affinity set to a particular core.
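Not our exact setup, but a sketch of that one-process-per-core pattern on Linux (the core count is hard-coded here; a real version would query sysconf(_SC_NPROCESSORS_ONLN) and do proper error handling):

    /* Sketch: fork one worker per core and pin each with
     * sched_setaffinity(2). Linux-specific. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void run_worker(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        if (sched_setaffinity(0, sizeof set, &set) != 0)
            perror("sched_setaffinity");
        printf("worker %d pinned to core %d\n", (int)getpid(), core);
        /* ... the worker's actual event loop would go here ... */
    }

    int main(void)
    {
        int ncores = 4;  /* hard-coded for the sketch */
        for (int core = 0; core < ncores; core++) {
            if (fork() == 0) {
                run_worker(core);
                _exit(0);
            }
        }
        while (wait(NULL) > 0)
            ;  /* reap the workers */
        return 0;
    }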
OSes try to help you, but they seem primarily concerned with scheduling many small processes, not one huge process like a database. Such processes should become NUMA-aware and handle placement themselves for best performance.
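To make "handle placement themselves" concrete, here's a sketch using libnuma (compile with -lnuma); node 0 and the 1 GiB size are arbitrary choices for illustration, and a real database would do this per worker, matched to the node that worker's threads run on:

    /* Sketch: a NUMA-aware process allocating memory on a specific
     * node with libnuma, rather than relying on the kernel's default
     * placement. */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        size_t sz = 1UL << 30;                 /* 1 GiB, arbitrary */
        void *buf = numa_alloc_onnode(sz, 0);  /* place it on node 0 */
        if (!buf) {
            perror("numa_alloc_onnode");
            return 1;
        }
        /* ... touch/use buf from threads pinned to node 0 ... */
        numa_free(buf, sz);
        return 0;
    }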
It might even make sense to ask if you can split the machine on NUMA boundaries and just act like they're separate systems. RAM's getting very cheap, and RAM per core is going up faster than CPU power is (it seems to me, anyway).
Also, is there a reason not to use large pages directly for the mmap'd sets if you know you're going to have them hot at all times? (I assume they read the entire file on start?)
> Also, is there a reason not to use large pages directly for the mmap'd sets if you know you're going to have them hot at all times? (I assume they read the entire file on start?)
We could use large pages directly. But, as I mentioned in the article, the performance gains would be negligible compared to the gains that come from having things in memory in the first place. These are not very large memory systems, and the page table / TLB miss overhead doesn't seem to be biting us. We are just following the mantra 'premature optimization is the root of all evil' :)
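For anyone curious, a minimal sketch of what using large pages directly looks like on Linux (not what we actually run): you can request 2MB pages with MAP_HUGETLB, but only for anonymous memory or files on a hugetlbfs mount, which is itself one reason it's not a drop-in change for ordinary mmap'd data files. Huge pages must also be reserved first via /proc/sys/vm/nr_hugepages.

    /* Sketch: explicitly-requested huge pages for an anonymous
     * mapping. Fails unless huge pages have been reserved, e.g.
     * echo 512 > /proc/sys/vm/nr_hugepages. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t sz = 64UL << 20;  /* 64 MiB, a multiple of the 2MB huge page size */
        void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");
            return 1;
        }
        /* ... populate and use the region; each TLB entry now covers
         * 512x more memory than with 4KB pages ... */
        munmap(p, sz);
        return 0;
    }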
In my experience, most people don't know they have TLB problems because, effectively, it's always bad.
It's only when you start getting to the metal to see what your hardware is actually capable of that the TLB stands out as a glaring source of inefficiency.
Put another way: yeah, the TLB is making your app slow, but it's doing so always, so you don't notice. Instead, you mistakenly think your hardware is just slower than it really is.
I guess my Windows bias is showing through. Let's split it and call 2MB pages large, and 1GB huge. (Yeah, I know 1GB pages only have real hardware support in really recent processors.)
Huh, that's not even one I considered. I guessed 1/5, so the change would be 4x the final value. Either way it's wrong - "dropped by X%" should be calculated as (original - final) / original. Going from 100 errors to 20, say, is a (100 - 20) / 100 = 80% drop; you only get figures like 400% by dividing the change by the final value instead of the original.
In regard to conclusion 2, there is another approach here - when you're finished with an old segment, posix_fadvise(..., POSIX_FADV_DONTNEED) can be used to drop it from the page cache.
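A minimal sketch of that call, assuming an open fd for the segment and that any dirty pages have already been written back (the function name drop_segment_cache is made up):

    /* Sketch: hint the kernel that a byte range of this file is no
     * longer needed, so its clean page-cache pages can be reclaimed
     * immediately. posix_fadvise returns an errno value directly. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>

    int drop_segment_cache(int fd, off_t offset, off_t length)
    {
        int rc = posix_fadvise(fd, offset, length, POSIX_FADV_DONTNEED);
        if (rc != 0)
            fprintf(stderr, "posix_fadvise failed: %d\n", rc);
        return rc;
    }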
That would be true if we were using C++. Unfortunately, all our code is in Scala and we use the Java NIO libraries to memory-map our files. AFAIK, they don't give us the option of using these POSIX calls.
I was hit by transparent huge pages on RHEL 6.2 in my workload. If you find your ordinary processes randomly taking up huge amounts of CPU time -- system CPU time -- when doing apparently ordinary tasks, you might be affected too. That was a real pain to diagnose when you're used to trusting the kernel not to do anything that weird. Running "perf top" helped to narrow down what the system was REALLY doing.
I didn't have LI-size databases -- just a dozen Python processes, each allocating perhaps 300MB, all restarting at the same time were enough to trigger it, taking 10 minutes rather than 2 seconds to start up.
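If you suspect the same problem, the THP switch is just a sysfs file. A sketch of checking and turning it off (the path below is the upstream one; on RHEL 6 the backported file is /sys/kernel/mm/redhat_transparent_hugepage/enabled instead, and the write needs root):

    /* Sketch: read the current transparent-huge-pages mode, then
     * disable THP system-wide. The bracketed word in the file marks
     * the active mode, e.g. "[always] madvise never". */
    #include <stdio.h>

    #define THP "/sys/kernel/mm/transparent_hugepage/enabled"

    int main(void)
    {
        char line[128];
        FILE *f = fopen(THP, "r");
        if (!f) { perror(THP); return 1; }
        if (fgets(line, sizeof line, f))
            printf("current: %s", line);
        fclose(f);

        f = fopen(THP, "w");   /* needs root */
        if (!f) { perror("opening " THP " for writing"); return 1; }
        fputs("never", f);     /* turn THP off */
        fclose(f);
        return 0;
    }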
Thanks, the 400% number is wrong. It was a last-minute edit... I should learn not to do that. I have updated the post to say that the error rates have dropped by 1/4th.
We did our experiments directly on hardware. I don't think that AWS VMs simulate multiple physical sockets. If they don't, then this article will not apply to them.
P.S. I recently created a screencast about control groups (cgroups) for anyone interested at http://sysadmincasts.com/episodes/14-introduction-to-linux-c...
[1] http://www.oracle.com/technetwork/articles/servers-storage-a...
[2] https://www.kernel.org/doc/Documentation/cgroups/memory.txt
[3] https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt