One thing to keep in mind here is that this trick won't work, or won't work as well, for every architecture or every processor family within an architecture.
Some architectures do not support unaligned memory access and will raise an exception. If you use something like the packed attribute on your structs, your compiler will generate correct code, but that code will be slower. In almost all cases it will generate many more instructions, and because of the larger code size your instruction cache will be less effective, your decoder cache will be less effective, etc...
The author has a more modern Intel processor. The x86 family has always supported unaligned access, although it was always slower in terms of cycles. More recent Intel processors have made this penalty much smaller. I believe this was driven by networking applications, many of which focus on packing as many bytes down the channel as possible and care less about alignment requirements. http://www.agner.org/optimize/blog/read.php?i=142&v=t
You're talking about word-aligned boundaries, which is absolutely an issue with certain architectures (I remember dealing with it on older SPARC processors). The article is talking about L2 and L3 processor cache hash collisions, which can result in lost performance as the caches are overwritten.
This optimization doesn't necessarily preclude those architectures; it's saying that instead of allocating at addresses like 512, 1024, etc., there might be a boost from allocating at off-page addresses.