How Misaligning Data Can Increase Performance 12x (2013) (danluu.com)
98 points by mattgodbolt on Feb 17, 2015 | 16 comments



I've read this article before, and I have no idea why anyone would think that page alignment is a good idea, unless you're doing kernel-level stuff that absolutely requires it.

Optimization for performance has always been about cache lines (64 bytes), not pages (4 KiB). Of course you're going to get terrible results when you're wasting huge amounts of memory.
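For what it's worth, here's a minimal C sketch of the difference (assuming a 64-byte line and 4 KiB pages; both sizes are platform-dependent, and the struct names are just illustrative):

```c
#include <stdio.h>
#include <stdlib.h>

/* Cache-line-aligned pair of counters: the _Alignas on the first member
   rounds the whole struct up to a 64-byte boundary, so an array of these
   never false-shares a line between threads. Cost: a few dozen bytes of
   padding per object at most. */
struct counters {
    _Alignas(64) long hits;
    long misses;
};

int main(void) {
    /* Page alignment, by contrast, wastes up to ~4 KiB per object and
       forces every object to start at the same offset within a page,
       which is exactly the pattern the article shows hurting
       set-associative caches. */
    void *page_aligned;
    if (posix_memalign(&page_aligned, 4096, sizeof(struct counters)) != 0)
        return 1;

    printf("sizeof(struct counters) with line alignment: %zu\n",
           sizeof(struct counters));
    free(page_aligned);
    return 0;
}
```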


You're not wrong, but it is quite possible that you might accidentally end up with page-alignment without realising it, such as if your `malloc()` is backed by `mmap()`.
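A quick way to see it (a sketch that assumes glibc, where allocations above the mmap threshold, typically 128 KiB, are served directly by mmap; other allocators behave differently):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Large allocations tend to come straight from mmap, so each one
       lands at (nearly) the same offset within a page -- and therefore
       at the same cache-set index. Small allocations come from the heap
       and have more varied low-order address bits. */
    for (int i = 0; i < 4; i++) {
        void *big   = malloc(1 << 20);   /* 1 MiB: likely mmap-backed */
        void *small = malloc(64);        /* heap-backed               */
        printf("big: page offset %4ju   small: page offset %4ju\n",
               (uintmax_t)((uintptr_t)big % 4096),
               (uintmax_t)((uintptr_t)small % 4096));
        /* Buffers deliberately not freed, so each iteration gets a
           fresh mapping rather than reusing the previous one. */
    }
    return 0;
}
```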


This is easier than one would think. Say you have many large shared-memory regions (e.g. queues) with control structures at the start. Bam: all those control structures, and therefore your atomic accesses, now map to the same cache sets and fight over the same few ways.
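A sketch of how that ends up looking (queue_ctl is a made-up header; assumes 64-byte lines and a 64-set L1d, i.e. a typical 32 KiB 8-way cache):

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

/* Hypothetical control structure living at offset 0 of each region. */
struct queue_ctl {
    volatile long head;
    volatile long tail;
};

int main(void) {
    /* Every region comes back page-aligned, so every header has the same
       low-order address bits and hence the same L1 set index: all the hot
       atomics pile into the same few ways of one set. */
    for (int i = 0; i < 4; i++) {
        void *region = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED)
            return 1;
        struct queue_ctl *ctl = region;
        printf("region %d: ctl at %p, L1 set index %lu\n",
               i, (void *)ctl,
               (unsigned long)(((uintptr_t)ctl >> 6) & 63));
    }
    return 0;
}
```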


It's probably because if your data spans two pages, accessing it back and forth may require a lot of page swaps?


I think your premise is correct: if our data is all on one page, we only need to fetch one page, and if we write anything back, only one page needs to be updated. With two pages, the read/write work doubles.


And the prefetcher cannot fetch across page boundaries since that might fault.


This seems like a hardware flaw to me.

We can do OoO execution, so why can't we do OoO prefetching w.r.t. page faults? (I.e. try to fetch into the cache; if that would cause a page fault, don't. If something would cause the fetch to page fault between it being prefetched and it logically being fetched, invalidate the cache line.)


This was actually done in some early DEC Alphas, I believe.

The OoO engine was able to issue the address calculation for a load ahead of time (if the address register was ready), so by the time the real load instruction executed, the data was already in cache or at least closer to it.

It is really easy to do, actually. It can even simplify page-fault handling in the CPU.


But data is transferred between the CPU and RAM in units of cache lines, not pages.


This is correct: the posts above referring to "page swaps" are a little misleading. The only reason I can think of to keep things within one page is to reduce the load on the TLB, which caches logical-to-physical addresses and which has the granularity of a page.

That being said, if you're really looking to optimize for TLB misses, you should be using huge pages, assuming your OS and processor support them.
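On Linux that can be as simple as the following sketch (MADV_HUGEPAGE is only advisory and only helps if transparent huge pages are enabled; explicit MAP_HUGETLB mappings additionally need huge pages reserved by the admin):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 64UL << 20;   /* 64 MiB */

    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    /* Ask the kernel to back the range with 2 MiB pages: the same 64 MiB
       then needs 32 TLB entries instead of 16384 with 4 KiB pages. The
       kernel is free to ignore the hint. */
    if (madvise(buf, len, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");

    return 0;
}
```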


When you want to minimize transfers between disk and RAM and your OS does paging, it makes sense to put infrequently used data (like error messages) in a page together with other infrequently used data. It's essentially what you would do at the cache-line level (instead of the page level) to minimize transfers between main memory and cache.
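In code this is the classic hot/cold split (the struct and field names here are just illustrative):

```c
/* Before: every record drags ~200 bytes of rarely-read error text
   through the cache, and keeps whatever page it lives on warm. */
struct record_before {
    int  id;
    int  state;
    char last_error[200];   /* only read when something goes wrong */
};

/* After: the hot fields pack densely; the cold strings live together
   in their own pages, which can stay out of the cache (or even stay
   paged out) on the happy path. */
struct record_hot {
    int  id;
    int  state;
    int  error_idx;         /* index into the cold table below */
};

extern char error_texts[][200];   /* all the cold data, grouped together */
```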


Well, aligning to words or cache lines is good; aligning to pages, not so much. Of course this has been known for ages, but sometimes it is hard to avoid, e.g. inside the Linux kernel: http://yarchive.net/comp/cache_thrashing.html


I've always wondered: why do CPU caches only key off the low-order bits of the memory address? DDR controllers, for example, generally key off a linear hash of most of the address.

Is gate count really that tight in L1, that they can't throw a few XOR taps in front of the cache bus? Or is it simply to make cache collisions more predictable?
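For concreteness, this is the difference being asked about (a sketch assuming 64-byte lines and 64 sets; the XOR folding is just an illustrative hash, not what any particular DDR controller actually does):

```c
#include <stdint.h>

/* Conventional L1 indexing: the set is a plain bit-field of the address
   (bits 6..11 here), so it's known the instant the address is -- no
   extra gates in the load-to-use critical path. */
static inline unsigned l1_set(uintptr_t addr) {
    return (unsigned)((addr >> 6) & 63);
}

/* Hashed indexing in the spirit of DRAM controllers: folding higher
   bits in with XOR spreads out power-of-two strides, at the cost of a
   few gate delays before the set/bank can even be selected. */
static inline unsigned hashed_set(uintptr_t addr) {
    uintptr_t x = addr >> 6;
    return (unsigned)((x ^ (x >> 6) ^ (x >> 12)) & 63);
}
```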


Gate count is not the issue - the issue is L1 timing. For clocks at 2 GHz plus, there can be very few stages of logic in between flops. Just decoding an address to a one-hot wire (necessary to access the memory bank) takes up a good chunk of that time interval. Reading the bits out and resolving them back to full rail also takes time. Checking for a tag match, more time. Routing the result back to the register file, yet more time. If it's not the L1 but, say, the shared L3, the time of flight across the chip alone is multiple clock cycles, on top of all the aforementioned penalties plus the longer access latency of the much larger shared L3.

Relatedly, in the L3 there is a hash function used to distribute different addresses to different regions of the L3. The cost of doing this is less significant for two reasons: the L3 access latency is already much, much higher (as elaborated above), and the hash calculation can be done in parallel with other required logic (e.g. in parallel with the L2 access).


Thanks, that was a great answer.


The addressing is designed to take advantage of spatial locality - adjacent addresses will be adjacent in the cache too.



