Virtual memory overhead for trees (github.com/johnj)
59 points by zippie on Aug 4, 2012 | 19 comments



I was playing around with some stuff that required a 48GB hash table and, to the very best of my ability to understand this stuff, the run time was completely dominated by TLB misses. I say this because, based on my throughput, every lookup was taking the time of about 3 memory accesses on average; i.e. there were page table lookups for every single memory access I made. I don't know the tools that would let me actually monitor the true number of TLB misses.
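(Back-of-the-envelope: a 48GB table in 4KB pages is roughly 12 million pages, while TLBs in current CPUs hold on the order of a thousand 4KB entries, so a random probe essentially always misses and pays for part of the page-table walk on top of the data access itself.)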

Had I pursued it further, it seems that using a hugepages interface could have alleviated this, but hugepages are a royal pain in the ass to get going: they require kernel parameters, rebooting, special memory allocation routines, and praying that your memory doesn't get fragmented. Of course, I was doing this in C; if my application had been in any other language it might have been extremely difficult to get this to work.
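For the curious, here's roughly what that dance looks like in C. A minimal sketch, assuming huge pages were already reserved via /proc/sys/vm/nr_hugepages:

    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef MAP_HUGETLB
    #define MAP_HUGETLB 0x40000     /* missing from older libc headers */
    #endif

    int main(void)
    {
        size_t len = 2UL << 20;     /* one 2MB huge page */
        /* Fails with ENOMEM unless enough huge pages were reserved,
           e.g. echo 1024 > /proc/sys/vm/nr_hugepages (as root). */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        munmap(p, len);
        return 0;
    }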

My use case may have been unusual, but as we store more and more data in RAM it's going to become less so. When we care deeply about latency, virtual memory page size is going to be a big problem, and already it seems that there are few use cases where 4KB pages are large enough.


Intel's VTune will measure TLB misses, along with many, many other detailed CPU/cache/memory diagnostics, and is a great tool for identifying these types of issues.


Thanks for the pointer! It's a bit pricey, but if this was a serious project it would probably be worth it. The price motivated me to search once again, more thoroughly, and I found another section of the great LWN coverage on huge pages that mentions that oprofile can pull out the hairy bits of Intel's PMU:

http://lwn.net/Articles/379748/


You could try to turn on transparent hugepages:

echo always >/sys/kernel/mm/transparent_hugepage/defrag

echo always >/sys/kernel/mm/transparent_hugepage/enabled

For more details see: http://lwn.net/Articles/423584/
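If you don't want to flip the switch system-wide, "enabled" also accepts the value "madvise", and a program can then opt in per region. A minimal sketch, assuming a THP-capable kernel (2.6.38+):

    #include <sys/mman.h>

    #ifndef MADV_HUGEPAGE
    #define MADV_HUGEPAGE 14        /* missing from older libc headers */
    #endif

    int main(void)
    {
        size_t len = 1UL << 30;     /* 1GB region */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;
        /* Hint to the kernel: khugepaged may collapse this range
           into 2MB pages behind our back. */
        madvise(p, len, MADV_HUGEPAGE);
        /* ... use p ... */
        munmap(p, len);
        return 0;
    }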


The transparent huge page code still only gets you 2M/4M pages, which will still miss in the TLB most of the time when you handle multi-gigabyte in-memory structures. To avoid misses almost entirely, you really need to use the 1G pages. And no one is crazy enough to build support for making them transparent. :)
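For reference, the non-transparent 1G route looks something like the sketch below. It assumes a CPU with 1G page support (the pdpe1gb cpuid flag), boot parameters along the lines of "hugepagesz=1G hugepages=4", and a hugetlbfs mount at the hypothetical path /mnt/huge1g:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* /mnt/huge1g is a hypothetical mount point:
           mount -t hugetlbfs -o pagesize=1G none /mnt/huge1g */
        size_t len = 4UL << 30;     /* four 1GB pages */
        int fd = open("/mnt/huge1g/table", O_CREAT | O_RDWR, 0600);
        if (fd < 0)
            return 1;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return 1;
        /* ... place the big in-memory structure here ... */
        munmap(p, len);
        close(fd);
        return 0;
    }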


epistasis called manual management of huge pages a "royal pain in the ass", which is why I suggested trying transparent huge pages since they are easier to work with (i.e. no code change).

I would be interested in seeing some benchmarks that show the effect of using huge pages of different sizes with hash tables or other data structures.

Here is a memory bandwidth benchmark that uses huge pages (only the default size, which is 2MB on my system): http://blogs.utexas.edu/jdm4372/2010/11/11/optimizing-amd-op...

It only sums the values of 32 million doubles (about 256MB of RAM), so it is only useful as a memory read speed benchmark. But it does show a lot of different ways to optimize memory bandwidth (prefetch, large pages, SSE).


We are not really asking for them to be transparent; TFA is talking about low-level kernel fiddling to achieve what would take a couple of lines if a custom hugepage size were available?


I was specifically replying to neopallium, who suggested turning on transparent hugepages, which, however, would not really do any good at all in this situation. Using (non-transparent) 1G large pages is the correct approach here.


Thanks, I had not seen that article. LWN's coverage of huge pages is what made me purchase a subscription; it was absolutely fantastic documentation.

It does turn out that to fully remove TLB misses I'd need 1GB pages, which, as was pointed out, are rather scary to imagine being paged to disk, as it would take approximately 10s to page one out. As machines with 512GB or 1TB of RAM become more commonplace, I wonder if there could be a mechanism to better assist with their allocation. Right now, one has to do a lot of manual management and be extremely aware of any other processes that might start consuming memory.


In today's AMD and Intel architectures, all the nodes of the page table except the leaves (i.e. the last-level PTEs) are also cached in the processor, so when a TLB miss occurs, if there is contiguity in your access pattern you will take hits in the upper levels of the page-table walk, i.e. PML4, PDP, and so on.
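(Concretely: on x86-64 each PML4 entry covers 512GB of address space, each PDP entry 1GB, and each PD entry 2MB, so with some locality the upper-level entries stay in the paging-structure caches and a TLB miss mostly costs only the final PTE fetch.)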


Unfortunately, there was no contiguity in my access pattern, so it's the worst case all around. Doing random reads across 48GB of RAM really made me appreciate just how well caches work these days, and the penalties when your workload does not, or cannot, use the cache. It also made me start to think of DRAM the way I do a rotational hard disk: a long time to seek to a new location, then extremely fast streaming off of that location.


Isn't a B-tree more suited for this purpose than a hash table?


I don't think so, but I'm all ears for suggestions. I was playing around with string search where the haystack is large and fixed, and is searched for small needles whose total volume is thousands of times larger than the haystack. My goal is to optimize the number of lookups per second on approximately 2.5 billion keys, where each value is 4 bytes long; and for this particular application I can look up the keys in each hash bucket super quickly once I know the values.

The benefits of B-trees over hash tables don't matter in this application: I don't care about adjacent keys after a lookup, and the input order of my keys is not going to be sorted in any way, so finding one leaf isn't going to help me with subsequent lookups.

In this setting it seems that any B-tree access is going to require traversing several nodes beyond the first, and those internal nodes are extremely likely to be both cache-cold and TLB-cold. Whereas with a hash table I only have a single cold memory lookup to check the hash bucket, then a single cold memory lookup to validate that the key at the bucket matches; this second cold lookup is also likely to warm the cache for the subsequent matching work, and that small amount of matching work would likely be required by a B-tree as well. Any thoughts?
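To make the access pattern concrete, here is a minimal sketch of the scheme I'm describing; the layout and names are made up, but the two-cold-read probe sequence is the point:

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical layout: the bucket array holds offsets into a
       separate key/value store; 0 marks an empty bucket. */
    struct bucket { uint64_t off; };

    struct entry {              /* lives in the big in-memory store */
        uint32_t keylen;
        uint32_t value;         /* the 4-byte payload */
        char     key[];         /* the string itself */
    };

    static const char *store;   /* base of the key store, set up elsewhere */

    /* Cold read #1: the bucket. Cold read #2: the entry, which also
       warms the cache line for the subsequent matching work. */
    static int lookup(const struct bucket *tab, uint64_t nslots,
                      const char *key, uint32_t klen, uint32_t *out)
    {
        uint64_t h = 14695981039346656037ULL;   /* FNV-1a, for example */
        for (uint32_t i = 0; i < klen; i++)
            h = (h ^ (unsigned char)key[i]) * 1099511628211ULL;

        uint64_t off = tab[h % nslots].off;
        if (off == 0)
            return 0;
        const struct entry *e = (const struct entry *)(store + off);
        if (e->keylen != klen || memcmp(e->key, key, klen) != 0)
            return 0;                           /* collision handling elided */
        *out = e->value;
        return 1;
    }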


I see. Note that the number of lookups for the hash table only holds if there is no collision, but that may be a fair assumption. Given that the data are strings, a specialized data structure such as a trie or suffix array might work well, especially if it is possible to exploit redundancy in the data (i.e., repeated substrings). There are specialized data structures used in full-text search that would probably be useful. In any case, dealing with this amount of data might mean that using an off-the-shelf database is a good idea.


I was pondering, a while ago, an operating system that--as well as exposing a raw "allocate me a block of memory" function--exposed a managed, typed key-value representation of virtual memory (picture, say, a Redis kernel module), from which one could allocate hashes, trees, linked-lists, and so forth. Given a NUMA architecture, this K-V store could then just be clustered between each memory pool in the same system in exactly the same way (save optimizations) one would cluster it against remote systems.


Just some background: the solution/benchmark spawned from an index-lookup latency issue. In our search engine, we generate enormous B-tree indexes and store them in memory (rsync from the master, then mmap). After adding more logic, intersects, and unions, the search engine started to miss its SLA.

Eventually, we traced the problem back to the additional latency in the vmalloc code path. The get_free_page* API code path had much lower latency, and llds was born (llds uses k*alloc, which is a wrapper around GFP).
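For anyone curious what the difference looks like at the kernel level, here's a minimal sketch (not the actual llds source) contrasting the two allocation paths:

    #include <linux/module.h>
    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    static void *buf_v, *buf_k;

    static int __init alloc_demo_init(void)
    {
            /* vmalloc: virtually contiguous, but every 4K page gets
             * its own PTE in the vmalloc area, so large structures
             * generate heavy TLB traffic. */
            buf_v = vmalloc(4UL << 20);

            /* kzalloc: physically contiguous memory from the kernel's
             * direct mapping, which is typically backed by large
             * pages, so far fewer TLB misses. */
            buf_k = kzalloc(64UL << 10, GFP_KERNEL);

            if (!buf_v || !buf_k) {
                    vfree(buf_v);
                    kfree(buf_k);
                    return -ENOMEM;
            }
            return 0;
    }

    static void __exit alloc_demo_exit(void)
    {
            vfree(buf_v);
            kfree(buf_k);
    }

    module_init(alloc_demo_init);
    module_exit(alloc_demo_exit);
    MODULE_LICENSE("GPL");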

llds is also being used in low-energy compute environments (like SeaMicro machines), where every CPU cycle is expensive due to increased hardware latency.


I remember when there was a web server in the Linux kernel. However, it was considered a bug that you could not get equal performance from userspace, and eventually it was removed.

It should also be possible to fix this type of case. Making a kernel module is the easy solution, though, and it gives you a benchmark.


Reminds me of exokernels. Being able to freely roll or adapt your own virtual memory management system tuned to your application was one of the signature uses.


I am not able to fully understand what it is shooting for. The README says it avoids the VM layer (which seems impossible in a pure software solution), but the code suggests it's merely doing a kmem_cache_zalloc. Am I missing something?

It's true that VM is an overhead now; with infinite (or very large) memory, the concept of virtual memory is outdated. TLB miss rates are too high and huge pages just don't cut it. This has been repeated over and over, but we need to re-design the VM/hardware to support TLB-less access for a portion of memory the size of your primary application's working set.



