I see many comments here acting as if this were about not aligning on word boundaries (e.g. using the packed pragma, Sandy Bridge having fast unaligned access, etc.). This has nothing to do with word alignment. As the article states at the beginning, this is about aligning to page boundaries, which is a very different level from word boundaries or structure packing. Let's not get these confused.
It's important to at least know about the n in n-way set associative cache (i.e. that it exists) and this article is a good reminder. Next time you see data accesses that ought to be fast (ought to be hot in cache) but seem not to be, this is another thing you can look for.
It's easy, from a software engineer perspective, to know that your CPU has cache, and just think it's like any other cache you might implement in software. But the implementation details - the hashing by masking the address, and only having a few slots available for things that hash to the same bucket - are actually important, as shown here.
One thing to keep in mind here is that this trick won't work, or won't work as well, on every architecture or every processor family within an architecture.
Some architectures do not support unaligned memory access and will raise an exception. If you use things like the packed attribute on your structs, your compiler will generate correct code, but that code will be slower. In almost all cases it will generate many more instructions, and because of the larger code size your instruction cache will be less effective, your decoder cache will be less effective, etc...
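For illustration, a made-up packed struct like this saves three bytes of padding but leaves the int misaligned, which is what forces the extra instructions on strict-alignment targets:

    /* hypothetical example: without packed, `b` would sit at offset 4 */
    struct __attribute__((packed)) wire_header {
        char tag;   /* offset 0 */
        int  b;     /* offset 1: misaligned, so on architectures without
                       unaligned loads the compiler reads it byte by byte */
    };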
The author has a more modern Intel processor. The x86 family has always supported unaligned access, though it was always slower in terms of cycles. More recent Intel processors have made this penalty much smaller. I believe this was driven by network applications, many of which focus on packing as many bytes down the channel as possible rather than on alignment requirements. http://www.agner.org/optimize/blog/read.php?i=142&v=t
You're talking about word-aligned boundaries, which is absolutely an issue with certain architectures (I remember dealing with it on older SPARC processors). The article is talking about L2 and L3 processor cache hash collisions, which can result in lost performance as the caches are overwritten.
This optimization doesn't necessarily preclude those architectures; it's saying that instead of allocating at addresses 512, 1024, etc., there might be a boost from allocating at off-page addresses.
I understand this only enough to understand that it may or may not be a surprising and paradoxical result.
As a non-systems C++ programmer, this seems to reinforce the usual lesson: don't optimize your structures initially for anything but readability; once you have your system running and can pinpoint the bottlenecks, then try things like aligning structures in various ways. But naturally, always have a realistic test suite to verify that you are actually improving things.
Interesting. One question: how would I take advantage of this? Are there special compiler flags or a special memory allocator that I would need? Do libraries like BLAS already account for this?
I'm not aware of any C compiler flags that would "fix potential cache-slot collisions" for you automatically, nor of any allocators that do. Intel processors have some profiling registers that can point you to this kind of problem, but typically you first have to know what you want to profile.
Compilers are actually built to align the structures you write, as there are a lot of processors that have significantly slower access to misaligned values. Allocators also have to return an aligned address from each malloc(). The newest Intel iX processors are actually an exception in being able to amortize misaligned accesses.
I've actually used hand-made unaligned "string" stores. If you allocate one bigger memory block and store the variable-size character sequences one after another, their starts won't be aligned unless you deliberately make them so. For doubles and other fixed-size values it's still better to keep them aligned.
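Roughly like this, as a minimal sketch (the fixed pool size and the names are just illustrative):

    #include <string.h>

    static char   pool[1 << 20];   /* one big allocation */
    static size_t used;

    const char *store(const char *s) {
        size_t n = strlen(s) + 1;
        if (used + n > sizeof pool) return NULL;
        char *p = memcpy(pool + used, s, n);  /* copied back to back, no
                                                 padding, so the starts land
                                                 at arbitrary offsets */
        used += n;
        return p;
    }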
Moreover, I wasn't able to extract much value from the article. Something constructed can be constructed to be slow or fast, fine. But I don't see anything that would inspire me to improve my real code. Maybe somebody who reads the article can produce such examples?
You can always write your own incremental allocator that provides word-aligned (or whatever alignment) blocks of memory for the individual structures, carved from a large block you obtain via a usual malloc() call, such that their page offsets are spread evenly (and that avoids other problems, such as spanning pages).
[EDIT] although it is tricky to do optimally, given that different processors have different cache-set characteristics (as the article shows). A rough sketch of the idea:
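    #include <stddef.h>
    #include <stdlib.h>

    #define LINE 64   /* cache line size, assumed */
    #define SETS 64   /* sets per way, assumed; varies by processor */

    static char  *arena;   /* one big block from malloc() */
    static size_t pos, skew;

    void spread_init(size_t bytes) { arena = malloc(bytes); }

    void *spread_alloc(size_t size) {
        pos  = (pos + LINE - 1) & ~(size_t)(LINE - 1); /* keep line alignment */
        pos += (skew++ % SETS) * LINE;  /* rotate the page offset so objects
                                           spread across cache sets */
        void *p = arena + pos;
        pos += size;
        return p;   /* sketch only: no bounds checking */
    }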
"Spreading" which will produce optimal cache use is something that depends on dynamic and not static properties of the program, so you'd have to "spread" differently depending on the use patterns. I can't imagine any universal solution.
And most programmers make much bigger omissions than the ones mentioned in this topic, like using the wrong algorithms, the wrong libraries, doing too many allocations, or structuring their data badly... So the effects discussed here are invisible unless you've already fixed the other issues.
True, but having every structure aligned to exactly the same offset into each page is extremely unlikely. Given a page size of 4KB, a whole bunch of 32-byte structures aligned that way would represent a huge waste of memory. No sane allocator will allocate things like this.
If your structures happen to be very close to the system's page size it could easily happen; then you'd need to avoid it yourself with your own incremental allocator (or other tricks).
Definitely agree with your last point; I've seen lots of code (in commercial applications) where people are optimising completely the wrong thing.
If you go through to the original discussion on comp.arch, there are some interesting ideas, including converting arrays of power-of-2 structures into structures of arrays if you're often walking them and not hitting all the fields every time.
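As a sketch of that idea (my own example, not from the comp.arch thread), when a scan only ever touches one field:

    #define N 4096

    struct item { long key; long payload[7]; };  /* 64 bytes: power of 2 */
    struct item items_aos[N];  /* scanning .key here drags in a whole
                                  64-byte line per element but uses only
                                  8 bytes of it */

    struct {                   /* same data, split field by field */
        long key[N];           /* now a scan reads a dense array */
        long payload[7][N];
    } items_soa;

    long find(long k) {
        for (long i = 0; i < N; i++)
            if (items_soa.key[i] == k) return i;
        return -1;
    }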
In C / C++ (on POD structs) you can use the packed attribute. Here's more documentation: http://gcc.gnu.org/onlinedocs/gcc/Type-Attributes.html I would still recommend laying out the struct by hand to get the right alignment for each variable.
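As a contrived example of hand layout, ordering fields largest-first often removes the padding without needing packed at all:

    struct bad  { char a; double b; char c; };  /* 24 bytes: 14 of padding */
    struct good { double b; char a; char c; };  /* 16 bytes:  6 of padding */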
Having said all that, before you do this: why not first measure where you're seeing high cache misses and optimize only those? On Linux the perf tool is a godsend; you can annotate down to the source code line.
BLAS doesn't lay out data for you; it has to work with the alignment of the buffers you pass as arguments. High-performance BLAS implementations do often allocate their own temporary workspace, and some account for issues like this (because BLAS access to large buffers is dense, the exact issue described here is rarely a problem, but 4k false aliasing frequently is, for example with huge power-of-two-sized matrices).
Great example. I looked briefly at the source, and wasn't sure whether "pointer_chase" was on or off in your graphs. Or maybe it didn't make a difference?
Page-aligned accesses like these also make compulsory misses worse, because prefetchers won’t prefetch beyond a page boundary. But if you have enough data that you’re aligning things to page boundaries, you probably can’t do much about that anyway.
To the contrary, I think this is one of the relatively rare cases where explicit prefetching can help. But maybe this helps only once your sets are too large for L3?
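Something along these lines, as a sketch (using GCC's __builtin_prefetch; the stride and size are placeholders):

    #include <stddef.h>

    #define STRIDE 4096        /* page-strided walk: the hardware
                                  prefetcher won't cross page boundaries */
    #define LEN    (1u << 24)

    long sum_pages(const char *buf) {
        long sum = 0;
        for (size_t i = 0; i < LEN; i += STRIDE) {
            __builtin_prefetch(buf + i + STRIDE, 0, 0);  /* read, low locality */
            sum += buf[i];
        }
        return sum;
    }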
> Well, I’ve now managed to blog about three of the areas where I have the biggest comparative advantage. Three or four more blog posts and I’ll be able to write myself straight out of my job.

I must be moving up in the world, because I was able to automate myself out of my first job with a couple shell scripts and a bit of Ruby. At least this requires some human intervention.
This was a great explanation of caching architecture, but I really want to hear the story of how the OP automated himself out of his first job with some shell scripts and Ruby.
Many modern processors hash some of the address bits before using them as a cache index to avoid these problems.
To avoid them in code, it's often sufficient to round to convenient decimal numbers when allocating arrays, instead of powers of 2. That is, allocate an array of 1000 instead of an array of 1024.
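For example, padding the leading dimension of a matrix (sizes are illustrative):

    float bad[1024][1024];        /* 4096-byte row stride: walking a column
                                     hits the same cache sets every row */
    float good[1024][1024 + 16];  /* 4160-byte row stride: columns spread
                                     across sets, at a small memory cost */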
Why don't caches store data based on a linear hash of the address, rather than simply the low-order bits? This would retain the property that an aligned block of data can be stored without collision, and would extend the benefit to page-strided data. Even a "hash" as simple as (low-order bits) XOR (middle bits) would provide this benefit.
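A sketch of what I mean (not any real cache's indexing; 64-byte lines and 64 sets assumed):

    #include <stdint.h>

    #define LINE_BITS 6
    #define SET_BITS  6
    #define SET_MASK  ((1u << SET_BITS) - 1)

    static unsigned hashed_set(uintptr_t addr) {
        uintptr_t line = addr >> LINE_BITS;
        /* a plain index would be `line & SET_MASK`; fold in the middle
           bits so page-strided addresses land in different sets */
        return (unsigned)((line ^ (line >> SET_BITS)) & SET_MASK);
    }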
It's not that you would want to. It's that certain data structures (particularly in kernel land) naturally form around power-of-2 sizing. Since the indexing in an n-way associative cache is done by masking away address bits, you can get into this nasty situation where multiple elements of your data structure end up mapping to the same set, which has only N slots (the N in an N-way associative cache). You don't want that if you are going to walk the elements in your structure.
Either avoid walking the elements, or do whatever it takes to ensure that you don't end up getting stuck behind the N.
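To make the collision concrete (assuming 64-byte lines, 64 sets, and 8 ways; real parts vary):

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BITS 6
    #define SET_MASK  63   /* 64 sets */

    int main(void) {
        /* elements spaced one 4 KiB page apart all mask to set 0, so in an
           8-way cache no more than 8 of them can be resident at once */
        for (uintptr_t addr = 0; addr < 16 * 4096; addr += 4096)
            printf("addr %6lu -> set %u\n", (unsigned long)addr,
                   (unsigned)((addr >> LINE_BITS) & SET_MASK));
        return 0;
    }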
Hardware is inescapable.
It's also worth noting that you can use this to your advantage, ensuring that accesses to certain elements do not push much else out of the associative cache.
It's commonly done in memory allocators at various levels. You inherently want to manage pages, and you'll carve out the start of the page itself to hold its own data structure.
This bit of code executes when a call to malloc() could not find any free memory and has to request more memory from the operating system. The function supermap() allocates a large chunk of memory from the kernel. We then write the meta-information that keeps track of that chunk directly into the chunk itself - that "meta-information" is the pageblock_t struct. In malloc(), we then use that chunk of memory to find a sub-chunk inside it to return to the user.
Note that internal to malloc(), we maintain lists of these pageblock_t structures, and they are all page-aligned. So, any time we want to search for free memory among the managed pages, we're going to access page-aligned structures.
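In sketch form (the field names are my guesses at what a pageblock_t might hold, and plain mmap() stands in for supermap()):

    #include <stddef.h>
    #include <sys/mman.h>

    typedef struct pageblock {
        struct pageblock *next;  /* list of managed chunks */
        size_t            free;  /* bytes still available */
        char             *bump;  /* next byte to hand out */
    } pageblock_t;

    static pageblock_t *new_pageblock(size_t size) {
        void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mem == MAP_FAILED) return NULL;
        pageblock_t *pb = mem;   /* the header is written to the chunk
                                    itself, right at the page start, so
                                    every header the allocator later walks
                                    is page-aligned */
        pb->next = NULL;
        pb->free = size - sizeof *pb;
        pb->bump = (char *)mem + sizeof *pb;
        return pb;
    }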
The OP is plotting the ratio of page-aligned to unaligned access time against problem size. So, on the y-axis, a value of 2 means the page-aligned access took twice as long as the unaligned access.
Took me a few moments to grok the charts too, since I haven't had my coffee yet.
Darn, really unexpected. Learn something new every day. Page-aligned access is common because I/O is usually aligned on page boundaries. CPU-intensive cases are different because of cache-line usage.