Optimizing Linux Memory Management for Low-latency, High-throughput Databases (linkedin.com)
142 points by dhruvbird on Oct 8, 2013 | 27 comments



You can also optimize memory and CPU management through Linux control groups. Oracle published a pretty good description (see example 1: NUMA Pinning) of how to assign dedicated CPUs and memory to a process or group of processes [1], but you can also read about the supporting cpuset and memory cgroup subsystems too [2, 3]. A rough sketch of the cpuset approach follows the links below.

p.s. I recently created a screencast about control groups (cgroups) for anyone interested @ http://sysadmincasts.com/episodes/14-introduction-to-linux-c...

[1] http://www.oracle.com/technetwork/articles/servers-storage-a...

[2] https://www.kernel.org/doc/Documentation/cgroups/memory.txt

[3] https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt
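
Here is that sketch. It assumes the cpuset hierarchy is mounted at /sys/fs/cgroup/cpuset and that a "db" cpuset directory has already been created; the mount point, group name, CPU list, node number, and PID are all made-up examples:

    /* Sketch only: confine a process to NUMA node 0 via the cpuset cgroup. */
    #include <stdio.h>
    #include <stdlib.h>

    static void write_str(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); exit(1); }
        fprintf(f, "%s\n", value);
        fclose(f);
    }

    int main(void)
    {
        write_str("/sys/fs/cgroup/cpuset/db/cpuset.cpus", "0-7"); /* CPUs on node 0 */
        write_str("/sys/fs/cgroup/cpuset/db/cpuset.mems", "0");   /* allocate memory from node 0 only */
        write_str("/sys/fs/cgroup/cpuset/db/tasks", "12345");     /* move PID 12345 into the cpuset */
        return 0;
    }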


Very interesting note about the reclaiming. Yet another warning when transparently using a NUMA system.

NUMA can be a real pain. You can get a 40% hit on direct memory access, and far worse if you're modifying a cacheline in another processor. On one of our VoIP workloads, we noticed a major (250%+) increase in performance and CPU stability after splitting a very thread-intensive process into multiple processes, each set with affinity to a particular core.
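
A bare-bones sketch of that kind of split, assuming plain fork()-based workers (the worker count and core numbering are made-up examples):

    /* Sketch only: fork one worker per core and pin each to its own CPU. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        int workers = 4;                       /* e.g. one per core on a single socket */
        for (int cpu = 0; cpu < workers; cpu++) {
            if (fork() == 0) {                 /* child: pin itself, then do the work */
                cpu_set_t set;
                CPU_ZERO(&set);
                CPU_SET(cpu, &set);
                if (sched_setaffinity(0, sizeof(set), &set) != 0)
                    perror("sched_setaffinity");
                /* ... run the thread-intensive workload here ... */
                _exit(0);
            }
        }
        while (wait(NULL) > 0)                 /* parent: reap the workers */
            ;
        return 0;
    }

numactl --cpunodebind/--membind gets you much the same effect without touching the code.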

OSes try to help you, but it seems like they're primarily concerned with multiple processes, not huge processes like databases. Such processes should become NUMA aware and handle things themselves for best performance.

It might even make sense to ask if you can split the machine on NUMA boundaries and just act like they're separate systems. RAM's getting very cheap, and RAM/core is going up faster than CPU power is (it seems to me, anyways).

Also, is there a reason not to use large pages directly for the mmap'd sets if you know you're going to have them hot at all times? (I assume they read the entire file on start?)


Hi, post author here.

> Also, is there a reason not to use large pages directly for the mmap'd sets if you know you're going to have them hot at all times? (I assume they read the entire file on start?)

We could use large pages directly. But, as I mentioned in the article, the performance gains would be negligible compared to the gains that come from having things in memory in the first place. These are not very large memory systems and the page table / TLB miss overhead doesn't seem to be biting us. We are just following the mantra 'premature optimization is the root of all evil' :)
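
For reference, explicitly requesting huge pages "directly" could look roughly like the following (anonymous mapping for simplicity; file-backed mappings need hugetlbfs, and the size here is an arbitrary example). Huge pages must be reserved beforehand via /proc/sys/vm/nr_hugepages:

    /* Sketch only: back a mapping with explicit 2MB huge pages. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MAP_HUGETLB
    #define MAP_HUGETLB 0x40000                /* missing from some older headers */
    #endif

    int main(void)
    {
        size_t len = 64UL * 1024 * 1024;       /* 64MB, a multiple of the 2MB page size */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");       /* typically ENOMEM if none are reserved */
            return 1;
        }
        memset(p, 0, len);                     /* fault the huge pages in */
        munmap(p, len);
        return 0;
    }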


In my experience, most people don't know they have TLB problems because, effectively, it's always bad.

It's only when you start getting to the metal to see what your hardware is actually capable of that the TLB stands out as a glaring source of inefficiency.

Put another way: yeah, the TLB is making your app slow, but it's doing so always, so you don't notice. Instead, you mistakenly think your hardware is just slower than it really is.


One correction: in the Linux community, they are generally referred to as Huge Pages.


I guess my Windows bias is showing through. Let's split it and call 2MB pages large, and 1GB huge. (Yeah, I know 1GB pages only have real hardware support in really recent processors.)


touche, you get an upvote from me :)


"after rolling out our optimizations, we saw our error rates (ie. the proportion of slow or timed out queries) drop by up to 400%"

There is some good shared knowledge in the post (unlike this comment, to be fair), but what does drop by 400% mean?

If a rate drops by 100% it becomes zero. I get that.

If it increases by 400%, the outcome is slightly ambiguous (do we add 400%, ending at 500% of the original, or multiply, ending at 400% of the original?).

But a rate decreasing by 400% - am I the only person who finds that (not uncommon) expression hard to conceptualize?


I understood it to mean that it had 1/4 the error rates they were previously seeing.


Huh, that's not even one I considered. I guessed 1/5, so the change was 4x the final value. Either way it's wrong - "dropped by X%" should be calculated as (original - final) / original.


This is exactly right :)


Then it dropped by 75%.


Yes. It was a blunder. The post has been updated to reflect this.


Obviously, he gets -300% errors now.


That's pretty good.


In regard to conclusion 2, there is another approach here - when you're finished with an old segment, posix_fadvise(..., POSIX_FADV_DONTNEED) can be used to drop it from the page cache.
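
A minimal sketch of that call (the path is a made-up example; note that dirty pages are not dropped, so anything still unwritten has to be synced first):

    /* Sketch only: drop a finished segment file from the page cache. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/data/segments/old-segment.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        fsync(fd);                             /* dirty pages are not dropped */
        int err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);  /* len 0 = to end of file */
        if (err != 0)
            fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

        close(fd);
        return 0;
    }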


That would be true if we were using C++. Unfortunately, all our code is in Scala and we use Java NIO libraries to memory map our files. AFAIK, they don't give us the option of using these POSIX calls.


Cassandra binds to posix_fadvise to do exactly this when writing out new SSTables:

https://github.com/apache/cassandra/blob/trunk/src/java/org/...


Wow.. that's great to know. We will definitely investigate this approach. Thanks for sharing! :)


I was hit by the transparent huge pages on RHEL 6.2 in my workload. If you find your ordinary processes randomly taking up huge amounts of CPU time -- system CPU time -- when doing apparently ordinary tasks, you might be affected too. That was a real pain to diagnose when you're used to trusting the kernel not to do anything that weird. Running "perf top" helped to narrow down what the system was REALLY doing.

I didn't have LinkedIn-sized databases -- just a dozen Python processes, each allocating perhaps 300MB and all restarting at the same time, were enough to trigger it: startup took 10 minutes rather than 2 seconds.
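
For anyone hitting the same thing, the blunt workaround is to switch THP off (or restrict it to madvise-only) through sysfs. A sketch; note that stock kernels expose /sys/kernel/mm/transparent_hugepage/enabled, while RHEL 6 kernels use a redhat_transparent_hugepage directory instead:

    /* Sketch only: disable transparent huge pages system-wide (needs root). */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "w");
        if (!f) { perror("fopen"); return 1; }
        fputs("never\n", f);   /* or "madvise" to keep THP only for opted-in regions */
        fclose(f);
        return 0;
    }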


According to LWN, this is probably going to be automatic in the future:

http://lwn.net/Articles/568870/ (subscriber-only now, will be free in a week)


"we saw our error rates (ie. the proportion of slow or timed out queries) drop by up to 400%."

Should that be 80%?

Edited to add: Apparently it should be 75%, per comments elsewhere.


Thanks, the 400% number is wrong. It was a last minute edit.. I should learn not to do that. I have updated the post to say that the error rates have dropped by 1/4th.


Yikes. I meant that they dropped TO 1/4th the original.


Does the information in this article apply to VMs (specifically AWS) or is it only relevant when you're running directly on hardware?


We did our experiments directly on hardware. I don't think that AWS VMs simulate multiple physical sockets. If they don't, then this article will not apply to them.


> On small setting for Linux, one dramatic performance improvement for LinkedIn!

should be...

you know what it should be ;)



