The IBM 360 shipped with 32-bit addresses but only 24 bits decoded. "Hey, there's a whole byte up top that nobody's using today, let's put some stuff there!" When they wanted the address space IBM found themselves architecturally hamstrung, and the cost to dig out was significant.
The 128K Macintosh used a 68000; it had 32-bit addresses but only 24 bits were decoded. "Hey, there's a whole byte up top that nobody's using today, let's put some stuff there!" When Apple needed the address space they found themselves hamstrung by pieces of MacOS that did use those bits, and many applications that did, too. The cost to dig out was significant.
It is practically guaranteed that "Hey, there's 16 whole bits up there that nobody's using today" will wind up the same, because this industry just never learns.
You can do things with lower bits and usually get away with it; many systems put GC tags and type bits down there. But those upper address bits do not belong to you.
On x86_64 and ARM without those features enabled, the top bits of the pointer must be a sign extension of bit 47. This means that x86_64 by default gives you the top two bytes to play with, as long as you don't use the values 0xffff or 0x0000. Attempting to access a pointer whose top 16 bits aren't a sign extension of bit 47 will fault. You can still safely play this game as long as you fix up the pointer before dereferencing it.
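For anyone who wants to see what that fix-up looks like, here's a minimal sketch for today's 48-bit x86_64 (assuming the usual arithmetic right shift on signed types; the helper names are made up):

```cpp
#include <cassert>
#include <cstdint>

// Sketch only: stash a 16-bit tag in the top bits of a 48-bit-canonical
// x86_64 pointer. This breaks the day the hardware decodes more bits.
static inline uint64_t pack_tagged(void* p, uint16_t tag) {
    uint64_t raw = reinterpret_cast<uint64_t>(p);
    assert((raw >> 48) == 0 || (raw >> 48) == 0xffff);       // must be canonical
    return (raw & 0x0000ffffffffffffULL) | (uint64_t(tag) << 48);
}

static inline void* unpack_pointer(uint64_t v) {
    // Fix-up before dereferencing: sign-extend bit 47 back over bits 48-63.
    return reinterpret_cast<void*>(int64_t(v << 16) >> 16);
}

static inline uint16_t unpack_tag(uint64_t v) {
    return uint16_t(v >> 48);
}
```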
"Attempting to access a pointer whose top 16 bits aren't a sign extension of bit 48 will fault."
Currently. Coming soon to an Intel chip near you is 57-bit virtual addressing and 5-level page tables [1]. That would make for quite a bug: pointer fix-up code that doesn't restore bits 48-56 correctly would only crash on new Intel hardware, and probably only with a fairly full memory map.
x86 large pages are 2 MB (64-bit) or 4 MB (32-bit). The even larger variety is 1 GB. They save a lot of TLB misses, but can be painful to reliably allocate.
> Currently. Coming soon to an Intel chip near you is 57-bit virtual addressing and 5-level page tables [1]. That would make for quite a bug: pointer fix-up code that doesn't restore bits 48-56 correctly would only crash on new Intel hardware, and probably only with a fairly full memory map.
Yeah, I'd hope that code that uses such tricks would have an error check for pointers that use the bits it wants to use. Better a clean crash than corruption.
It'd be neat if a program could tell the OS what address range is acceptable. Linux has mmap(..., MAP_32BIT, ...) but obviously that's pretty limited. Maybe something like mmap(addr, ..., MAP_MAXADDR, ...) which would tell it that addr represents the maximum acceptable address to return. So if you intend to use the top 16 bits, you could tell it this mapping can't use those.
Edit: oh, actually, I see they do something kind of like this. https://lwn.net/Articles/717293/ "An application that needs [virtual address beyond 48-bits], and which does not play games with virtual addresses, can provide an address hint above the boundary in a call to mmap(), at which point the kernel will understand that mappings in the upper range are accessible." Still not quite as flexible as I was imagining but not bad.
Back around 2001 I was part of a webdev shop that had its own proprietary application server. It pre-parsed HTML files for <script language="perl"> or <script language="python"> and would run whatever you put there. We linked in slightly patched perl/python libraries so we didn't have to start a new interpreter every request. There were a couple in-house RPN languages too, and other comment-based markup for easy loops/interpolations.

One design goal was to let you round-trip between developers' editors and designers' Dreamweaver, so that even after adding code the designers could still see real HTML and usually even change it. I haven't seen any system so successful at that part since. (I love Haml, but design tools don't. :-)

Anyway, I got to help out with porting the app server from Solaris to Linux, and one of the tricks they were using was to stuff some extra info into unused areas of the Solaris file struct, basically to smuggle it around without needing to rewrite anything. As long as you could get to the struct, you could retrieve what you put there. It was a brilliant-but-horrifying trick, and of course it was a big challenge for the porting effort.
It was pretty cool. The company was founded by MIT folks, so I guess they weren't afraid of mixing C & scripting languages. :-) I was just a junior developer, so I barely understood what was going on. The company tried to build an open source version, but it suffered a lot from second-system effect IMO.
I don't often get to do work as cool as that, but I've really enjoyed pursuing more researchy things in my spare time. Right now I'm working on adding SQL:2011 temporal features to Postgres, so I guess I'm still getting my C fix from somewhere. :-)
I once shrugged this off as paranoid BS and got bitten by it in the very same project. It cost me some all-nighters to fix the results of my arrogance.
Honestly, you may as well use the bits if they're available, because as you say, so many programs already do that reclaiming them would break backwards compatibility anyway.
With x86_64, the top bits have to be all 0, or all 1. You can still reuse them as long as you clear them before actually using them as pointers. Of course that might break at a later date if memory spaces increase in size and those bits are used, but I'll deal with that when it happens.
You'll have to deal with APIs that do return full 64-bit pointers with real information in the top bits. You won't be able to play games with these bits. Stuff is going to break unless you are careful. If you are shipping a platform of some kind then you must also ensure that your customers cannot get into trouble (if you tell them "Hey, these upper bits are up for grabs" then you are doing them a disservice, in several ways).
"We know this is wrong, but we'll deal with it when it happens" is a demonstrably expensive road. I can't think of many clearer cases of refusal to learn from history in computing.
I think as with everything it’s about context. If you play games deep within some internal component that is very limited in scope, easy to rewrite, and it buys you performance that is truly worthwhile, that’s one thing. If you’re doing this with pointers beyond some region of local control, you’re asking for it.
Nothing on x86_64 is returning "full 64 bit pointers", unless they are also playing games, as current CPUs only support 48 bits of address space.
Also, your suggestions are a bit strict, in my opinion. Every VM or interpreter I know about uses this kind of trick, usually to squish things like integers. Not doing this requires adding an extra allocation for every integer.
> Nothing on x86_64 is returning "full 64 bit pointers"
Well, I'm going from the processor reference manuals from Intel and AMD. Existing implementations are 48 bits, and future implementations WILL have more. I trust these folks to make good on their promise. (Similarly, the 68000 had 24 bits, and when the 68020 came out it had full 32-bit addressing, and an MMU, and badly written software broke).
There are pretty safe ways to do tagging; common techniques usually distinguish limited-range integers and pointers with a low bit, and pointers often have extra bits since those are relatively safe to manipulate. I have not seen runtimes that use high-order bits for tags in quite some time, probably not since the late 80s; you haven't seen that software recently either, because it doesn't work anymore.
The x86 segmented architectures (which is all we had from Intel until the 386's flat address space came out) had several variations on pointers, basically a menu of terrible tradeoffs between fast, small code and large address spaces. Far pointers were formed from a segment and an offset, and the game was all about making these multipart pointers efficient for things like comparison. In the face of pointer normalization, you mucked with high bits at your peril.
The one thing I'm sure of in this industry is that software is always going to be getting bigger. :-)
The responsibility for this lies not with just the programmers, but the address decode logic. The hardware should have seg faulted on non-zero upper bits (which would have required the addition of a few OR gates, hardly a problem).
Apple’s market cap is $983B and IBM’s is $125B. So somehow they’ve survived this lowbrow hack.
Seriously, with 64-bit addresses available on the iPhone I’m typing this into, this is an excellent trick, especially for applications. Even ARM’s TBI leaves 56 bits, which is more address space than a data center’s DRAM.
Right, because the quality of every lesson one learns about programming practice should be evaluated by the market cap of the firm where it was learned.
Those examples and companies, IBM (the IBM 360) and Apple (the Macintosh), were picked by the parent post.
DEC was acquired by Compaq not HP. Compaq was later acquired by HP but before that happened Compaq sold the Alpha line to Intel because it would have competed with the Itanium. The Alpha was then licensed by Intel to the Chinese (according to Gustafson) and was used as the basis for the Sunway Taihu Light.
Meh, when you get to the point where you're exceeding your available address space, there's probably a million other refactorings you had to make to address all of the other architectural changes.
Or, you play by the rules and your old code continues to just work. Might not be optimal, but at least it will run. Your customers will thank you . . . well, we all know they won't, but they won't be speculating about your ancestry in public forums.
I don't remember how many Mac applications were '32-bit clean' and ran without modification when VM was turned on. I do know that some apps had to do significant rework, and that there was a whole generation of abandonware where developers didn't think it was worth the effort and just walked away from their customers.
Can't you just make sure that your allocator only uses the bottom x bits? (Of course you still need to be careful when interfacing with code that maps memory in some other fashion)
If your allocator calls mmap or an equivalent, and the OS gives you back a pointer that uses all 64 bits (or more than 48, anyway) then what do you do? I guess the allocator has to retry and pray, or maybe just give up and fail. In any event your clever code is not going to work well, since you have made a choice that is non-portable.
I imagine that the guy on the MacOS team who stuffed some desk accessory bits into the high byte of a pointer parameter thought he was being clever. Four or five years down the line he was paying for it, and it took years to fix his original moment of convenience. He could have added a second parameter on the call in question, or allocated another byte in a struct, and the original system would have been slightly bigger but he wouldn't have hatched a nightmare. Go read about '32-bit clean' Mac applications if you don't believe me. It sucked real hard.
A bunch of companies with a bunch of smart people have made the decision to use "unused" high bits in pointers and have later regretted it, to the tune of a ton of expensive remedial engineering. I see a lot of talk here along the lines of "b-b-b-but we know what we're doing" and "oh, 48 bits is enough, and if it's not then we'll deal with it later" and I'm thinking, "This is precisely how the software industry just never learns."
I figure that I have about 20 years left in my career writing software. I've seen address spaces grow from 16 bits (and we used to think "wow, that's BIG") to 64 bits [okay, 48 bits if you're still in denial :-) ] and wow, that's big. But I'll bet I'll see 64 bits generally considered kind of tight before I retire, and I'll bet there are at least half a dozen people on the planet who are up against a 64-bit wall today. (High probability at least a couple of those folks are on HN. Any hands?).
(The world numbers these from 0 to N, least significant to most significant. IBM numbers them from 0 to 31, with 0 being the most significant. I guess that made sense to accountants).
For everyone that enjoyed this, there's an entire free online MIT course called Performance Engineering of Software Systems[1] where you'll learn plenty more tricks and common pitfalls like these. You'll also learn how to use tools to debug the low level performance of your programs: looking at cache misses, cpu utilization, time spent per assembly operation and so on. It's pretty cool :)
Very good article; the facts looked correct and it had useful advice.
I'd add: keep things local. Don't access memory (or cache) outside the core (L1 & L2), NUMA region, or processor socket boundary unnecessarily.
Keep networking, GPU, etc. code in same NUMA region where the physical adapters are.
Use memory like tape: stream through it. Hardware prefetchers love that kind of access pattern.
Oh, and perhaps most importantly: use a profiler that can access CPU internal performance counters. Do this on different system types, from low power laptops to servers with 2 or more CPU sockets.
One annoying thing, though. Remember that the fastest thing in a microbenchmark might not be the fastest thing on a real system when different code modules fight for shared limited resources, like memory bandwidth, caches and inter-core communication links.
Sequential access patterns, forward or backward. Repeating, predictable strides are OK too, but remember that the minimum unit that can be read from memory is a cache line. So if you read one byte, you'll still read 64 bytes on modern x86.
DRAM latency for a random access read can get into the low hundreds of cycles on modern multi-socket devices. But the streaming bandwidth remains very high. Cache systems will routinely prefetch the next block ahead of an access if they detect that memory is being used sequentially, eliminating a huge chunk of that pipeline stall.
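A quick way to see this on your own machine is the classic row-major vs. column-major traversal; a rough, unscientific sketch (the 64-byte line size and the actual timings are assumptions about typical x86 hardware):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Summing a matrix row by row touches each 64-byte cache line once and lets
// the prefetcher run ahead; summing it column by column touches a different
// line on every access. Same work, very different memory behavior.
int main() {
    const size_t N = 4096;
    std::vector<float> m(N * N, 1.0f);

    auto time_sum = [&](bool row_major) {
        auto t0 = std::chrono::steady_clock::now();
        double sum = 0;
        for (size_t i = 0; i < N; ++i)
            for (size_t j = 0; j < N; ++j)
                sum += row_major ? m[i * N + j] : m[j * N + i];
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%s: %.1f ms (sum=%.0f)\n",
                    row_major ? "row-major" : "col-major",
                    std::chrono::duration<double, std::milli>(t1 - t0).count(), sum);
    };
    time_sum(true);
    time_sum(false);
}
```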
A discussion of systems programming optimization that doesn't start with single-writer ring buffers starts on the wrong foot.
Those other tricks are excellent, and I use all of them, in cases where they work at all. But, e.g., seeking a way not to need to take a lock at all should come before discovering a low-contention locking protocol.
Readers should note that packing spare bits into the bottom bits of suitably aligned pointers is more portable than using high bits. Any page-aligned pointer has at least 12 bits free at the bottom, and any malloc'd pointer has at least 2, more often 4.
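A hedged sketch of that low-bit variant, assuming the pointee is at least 8-byte aligned (true for malloc'd memory on typical 64-bit platforms); the helper names are invented for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Sketch: stash a small tag in the low bits of a pointer that is known to be
// at least 8-byte aligned, so bits 0-2 of the raw address are always zero.
template <typename T>
uintptr_t tag_ptr(T* p, unsigned tag) {
    uintptr_t raw = reinterpret_cast<uintptr_t>(p);
    assert((raw & 0x7) == 0 && tag < 8);            // relies on the alignment
    return raw | tag;
}

template <typename T>
T* untag_ptr(uintptr_t v) { return reinterpret_cast<T*>(v & ~uintptr_t(7)); }

inline unsigned low_tag_of(uintptr_t v) { return unsigned(v & 7); }
```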
Ring buffer counters into a power-of-2 sized buffer can be incremented without bound, enabling use of ordinary arithmetic on them, with the high bits masked off cheaply on each use. [But use 64 bits!]
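Something like this, as a sketch (single-threaded; producer/consumer synchronization is deliberately left out):

```cpp
#include <cstddef>
#include <cstdint>

// Free-running-counter idiom: head and tail are 64-bit counters that only
// ever increase; the buffer index is (counter & MASK). Size must be a
// power of two so the mask works.
template <typename T, size_t SizePow2>
struct RingBuffer {
    static_assert((SizePow2 & (SizePow2 - 1)) == 0, "size must be a power of two");
    static constexpr uint64_t MASK = SizePow2 - 1;

    T data[SizePow2];
    uint64_t head = 0;   // next slot to write
    uint64_t tail = 0;   // next slot to read

    bool push(const T& v) {
        if (head - tail == SizePow2) return false;   // full; ordinary subtraction works
        data[head & MASK] = v;
        ++head;
        return true;
    }
    bool pop(T& out) {
        if (head == tail) return false;              // empty
        out = data[tail & MASK];
        ++tail;
        return true;
    }
};
```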
Probably the most neglected primitive data structure is the bitmapped set. A `uint32_t` gives you a universe of 32 elements; a byte is enough for the days of the week. The popcount native instruction is very valuable here, usually expressed as `__builtin_popcount` in source code. C++98's `std::bitset` provides Standard, portable access to it, but C++20 offers `std::popcount` directly.
[I add here that storing things in high bits of addresses is very likely to fail on systems with ASLR, and that I have learned MSVC bitsets have a very slow popcount.]
std::bitset is one. But you can just use an unsigned int, or array of them. In C++ or C, set intersection is &, union is |, complement is ~. Cardinality is __builtin_popcount, or std::bitset<>::count(). Membership test for m is (s & (1<<m)). For a larger set, represented as an array of unsigned, it's (s[m>>5] & (1<<(m&0x1f))). std::bitset encapsulates all this, optimally. There are similarly efficient implementations for lots of other useful operations: consult HAKMEM.
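Wrapped up, the array-of-words version looks roughly like this (a sketch with an arbitrary 256-element universe; __builtin_popcount assumes GCC/Clang):

```cpp
#include <cstdint>

// Minimal sketch of a bitmapped set over a universe of 256 elements,
// stored as an array of 32-bit words.
struct WordSet {
    static constexpr unsigned N = 256;
    uint32_t w[N / 32] = {};

    void insert(unsigned m)         { w[m >> 5] |=  (1u << (m & 0x1f)); }
    void erase(unsigned m)          { w[m >> 5] &= ~(1u << (m & 0x1f)); }
    bool contains(unsigned m) const { return w[m >> 5] & (1u << (m & 0x1f)); }

    unsigned size() const {         // cardinality via popcount
        unsigned n = 0;
        for (uint32_t word : w) n += __builtin_popcount(word);
        return n;
    }
};
```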
Bitsets are useful in more places than you might guess, because they can be used to filter out, with typically a single instruction, a majority of uninteresting cases in searches, leaving fewer cases to be examined with more precise and expensive operations. You record interesting properties of elements of a collection upon insertion as bits in a word stored alongside (or, better, in an index); and then first check the bits during any search. It is easy to speed up an amazing variety of programs by an order of magnitude, sometimes much more, with a tiny change.
For example, you can store an int value for each of a (possibly very large) collection of lines of text, representing the set of less-common letters that appear in the line. Searching for lines that contain a string, you first compute the same set for the string and check each line's set against it. Any line whose set doesn't include all of the string's letters certainly doesn't contain the string.
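As a rough sketch of that filter (simplified to the 26 lowercase letters rather than just the less-common ones; the names are made up):

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// For each line we precompute a 32-bit set of which letters appear in it.
// A line can only contain the query if its set covers the query's set;
// the expensive substring search runs only on the survivors.
static uint32_t letter_set(std::string_view s) {
    uint32_t bits = 0;
    for (unsigned char c : s)
        if (c >= 'a' && c <= 'z') bits |= 1u << (c - 'a');
    return bits;
}

std::vector<size_t> find_lines(const std::vector<std::string>& lines,
                               const std::vector<uint32_t>& line_sets,  // precomputed
                               std::string_view query) {
    const uint32_t q = letter_set(query);
    std::vector<size_t> hits;
    for (size_t i = 0; i < lines.size(); ++i) {
        if ((line_sets[i] & q) != q) continue;                // cheap reject
        if (lines[i].find(query) != std::string::npos) hits.push_back(i);
    }
    return hits;
}
```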
Leibniz (inventor of calculus) used this method very heavily in his own work. Before computers--and even into the 1960s--it was the most important way of automating data processing. Back then, you used cards that had a row of either holes or notches along one edge, and selected matching cards from a stack by inserting a rod through the stack at a selected spot, and lifting.
>Leibniz (inventor of calculus) used this method very heavily in his own work. Before computers--and even into the 1960s--it was the most important way of automating data processing. Back then, you used cards that had a row of either holes or notches along one edge, and selected matching cards from a stack by inserting a rod through the stack at a selected spot, and lifting.
Hey, do you have a reference for that? I've been doing some research into Leibniz's calculators, and I've been finding few sources.
There is quite a lot about Leibniz's calculating machine designs on Wikipedia.
I think I found out about Leibniz's bitwise activities in Neal Stephenson's Quicksilver, but he invented the modern notions of both sets and digital logic, according to Wikipedia. He would have used the cards in cataloguing libraries.
Yup, I used bitsets to write a super duper fast Sudoku solver. For example, you can detect naked singles extremely fast by just AND-ing the bitsets in a given row, column, or house, taking the complement, and checking if a unique bit is set.
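For the curious, the core of that test is tiny; a sketch, assuming 9-bit masks of already-placed digits per row/column/house and GCC/Clang builtins:

```cpp
#include <cstdint>

// A cell's candidates are the complement of the union of digits already used
// in its row, column, and house; a single set bit means the cell is forced.
constexpr uint16_t ALL_DIGITS = 0x1ff;   // bits 0-8 stand for digits 1-9

inline int naked_single(uint16_t row_used, uint16_t col_used, uint16_t box_used) {
    uint16_t candidates = ALL_DIGITS & ~(row_used | col_used | box_used);
    if (__builtin_popcount(candidates) != 1) return 0;        // not forced
    return __builtin_ctz(candidates) + 1;                     // the forced digit
}
```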
But they used to be essential for automating chess. A 64-bit word represents a board, and a set of words represents the positions of each type of piece, one word for white pawns, etc. Another word shows places threatened by the white pawns, and ORing them shows all the threatened places.
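A small sketch of the pawn case, assuming the common mapping of bit 0 = a1, bit 63 = h8 (index = rank*8 + file):

```cpp
#include <cstdint>

// White pawns attack one rank up and one file to either side; the file masks
// stop the shift from wrapping around the board edges.
constexpr uint64_t FILE_A = 0x0101010101010101ULL;
constexpr uint64_t FILE_H = 0x8080808080808080ULL;

inline uint64_t white_pawn_attacks(uint64_t pawns) {
    uint64_t left  = (pawns << 7) & ~FILE_H;   // captures toward the a-file
    uint64_t right = (pawns << 9) & ~FILE_A;   // captures toward the h-file
    return left | right;
}
```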
Good article. Basics that everyone can benefit from knowing.
Just one nit/warning... breaking coarse locks into fine-grained locks can be taken too far. There is a point of diminishing returns where you end up spending increased time acquiring/releasing/waiting-for locks. At some point you want to clump together under a single lock resources that tend to often be used together, even if you often end up locking an extra resource or two unnecessarily. As always, benchmark workloads are your friends.
Ha Ha! Yes. The Gilectomy project collected some interesting data on that a couple of years ago, creating very fine-grained locks and measuring. Not unexpectedly, with a zillion locks performance suffered. The reason, though, had much to do with how much it thrashed the cache on a modern CPU: lots of cache invalidation traffic back and forth between cores. The CPython implementation is architected around the GIL, and that is a tough problem to crack.
My favorite too-many-locks story goes back quite a long time. A former coworker is an old-time Unix guru who was at Sequent back in the day when 8 CPUs was kind of a big deal. He spent about 9 months on a project putting lots of fine-grained locks into their Unix kernel. After shipping that, his next project was 6 months taking about 25% of them out :)
I didn't read it as a joke, you're just operating at a different abstraction level. The CPython interpreter famously uses a single global interpreter lock to protect the language internals and runtime, so it has trouble scaling beyond a single CPU on interpreter-heavy workloads. You're saying that threads in python scripts can be arbitrarily preempted, and so locking is required to protect them against each other, which is also true.
Theoretically not necessarily the referenced GIL, though. There are conceptually alternative models; see Microsoft doc[0] on apartment threading model, or Tcl[1].
In most cases your compiler should do the clever work of turning your division or modulo operation into cheaper bit twiddling... but only if you're operating on a reasonable constant. Powers of two are best, but you can do other constant divisions by stringing together 1-cycle-latency operations in ways that are still far faster than division.
Yes - as long as x is unsigned, then (x % 1024) will be compiled to the same thing as (x & 1023) with any reasonable compiler.
If your hashtable dynamically resizes though, (x % size) will use a full divide. You could keep the log2 of the size around instead, and rely on (x % (1U << size_shift)) being optimised (this works on gcc: https://godbolt.org/z/HbGnJH) but (x & (size - 1)) might be easier at that point.
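Roughly what that looks like in a resizable table (a sketch; whether the % form is actually recognized depends on your compiler, per the godbolt link above):

```cpp
#include <cstdint>

// Keep the table size as a shift so the "modulo" can stay a mask even though
// the size is not a compile-time constant.
struct Table {
    unsigned size_shift;                         // table holds 1u << size_shift buckets

    uint32_t bucket_mod(uint32_t hash) const {   // relies on the compiler spotting the pattern
        return hash % (1u << size_shift);
    }
    uint32_t bucket_mask(uint32_t hash) const {  // always compiles to an AND
        return hash & ((1u << size_shift) - 1);
    }
};
```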
Instead of repurposing top bits you can also repurpose the bits freed up by alignment. E.g. 32-bit integers are aligned to 4 bytes, so you can use the lower two bits of pointers to them instead.
As someone who's worked on old Macs and has also done lots of 32 -> 64-bit porting, this is the sort of trick that works wonderfully...until it doesn't. And then you've got a nightmare on your hands.
I'm not saying never do that (ok, maybe I am...) But definitely think long and hard about how long your code will be around before you do it.
> As someone who's worked on old Macs and has also done lots of 32 -> 64-bit porting, this is the sort of trick that works wonderfully...until it doesn't. And then you've got a nightmare on your hands.
That's why you hide the trick behind a zero-cost abstraction which checks at compile time whether the platform supports this.
Modern optimizing C++ compilers (especially with Link Time Optimization enabled) are pretty amazing and can very often actually achieve that abstraction collapsing.
But, of course, always measure.
While it's true that modern compilers are wondrous things, checking whether they're clever enough to optimize away a particular construct - and to do so correctly, and to continue doing so in the next release - still takes time. If the same optimization can be done at a higher level, such that it will apply for any correct (but not necessarily clever) compiler, that's preferable. In my experience that's practically all the time. The best compiler optimizations IMO are the ones that can't be easily done at the source level.
Yeah, but an even more important system programming trick is to measure what you're doing every time. Performance optimization at this level is never about just trusting tools. If you aren't willing to be reading generated machine code and checking perf counters, you aren't really going to get much benefit.
And if you're willing to check your optimizations by reading disassembly, tricks like stuffing tag bits into the bottom of aligned pointer values is pretty routine.
That does not hold for Intel x86 architecture chips, which are perfectly happy to have unaligned integers.
As for struct members, the alignment is (of course) "implementation defined", which is the fancy way of throwing up your hands and saying "whatever". (Since C++11 we actually have alignas(), which at least gives manual control.)
One can construct an aligned pointer on a chunk of allocated memory by asking for the needed quantity plus the size of the alignment and then "nudging" the pointer you get to the next aligned boundary (and you can do something with the "wasted" bytes before it, like storing the size of the object).
But that sort of trick is only worth it if you are writing a compiler and its runtime, an interpreter, a memory allocator (in particular GCs), or at the very least some sort of high-performance library (you would return your "featured pointer" as an opaque type to the user).
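For reference, a sketch of that over-allocate-and-nudge trick, here stashing the nudge offset (rather than the object size) in the wasted bytes; in modern code you'd usually just reach for C11 aligned_alloc or C++17 aligned new:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Over-allocate by 'align' bytes, round the pointer up to the next aligned
// boundary, and remember how far we moved in the byte just below it so the
// original pointer can be recovered for free().
void* aligned_malloc(size_t size, size_t align) {
    assert(align && (align & (align - 1)) == 0 && align <= 128);  // power of 2, fits in a byte
    unsigned char* raw = static_cast<unsigned char*>(std::malloc(size + align));
    if (!raw) return nullptr;
    uintptr_t nudged = (reinterpret_cast<uintptr_t>(raw) + align) & ~uintptr_t(align - 1);
    unsigned char* p = reinterpret_cast<unsigned char*>(nudged);
    p[-1] = static_cast<unsigned char>(p - raw);   // remember the offset (always >= 1)
    return p;
}

void aligned_free(void* p) {
    if (!p) return;
    unsigned char* q = static_cast<unsigned char*>(p);
    std::free(q - q[-1]);                          // undo the nudge
}
```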
> Part of why the change couldn’t be enabled by default is because various high performance programs, notably various JavaScript engines and LuaJIT, use this repurposing trick to pack some extra data into pointers.
Does any one know if this sentence can be backed up by a citation?
I know that the NaN-tagging trick assumes that pointers have 48 bits (small enough to fit inside a floating point mantissa), but was this ever a factor for deciding whether 5-level page tables should be added to the Linux kernel or not?
5-level page tables have actually been in the kernel for a couple of years now.
The issue listed was definitely a concern, but was worked around by having the kernel only allocate userspace linear addresses that aren't 48-bit-canonical in response to a mmap() call that supplies such an address as the hint argument. See the commit message in this commit for example: https://lore.kernel.org/patchwork/patch/796025/
So for a program to get an address above the 48-bit limit, its memory allocator has to specifically indicate to the kernel that it supports that.
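A sketch of what that opt-in looks like from user space (Linux-specific; the exact hint value is arbitrary, and on a kernel or CPU without 5-level paging you'll just get a normal low mapping back):

```cpp
#include <cstdio>
#include <sys/mman.h>

// A plain mmap() stays within the default 48-bit user-space range; passing a
// hint address above it tells the kernel this process can handle 57-bit
// linear addresses.
int main() {
    const size_t len = 1 << 20;

    void* low  = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    void* hint = reinterpret_cast<void*>(1ULL << 52);   // above the 48-bit range
    void* high = mmap(hint, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    std::printf("default mapping: %p\nhinted mapping:  %p\n", low, high);
}
```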
I guess partially because it can quickly become very dependent on the system you work on. But if you're specifically interested in caching issues, one good keyword to look for is "cache aware algorithms" - or "cache oblivious algorithms".
On the locking side, the counterpart would probably be "lock-free algorithms", though I still don't think their complexity means you shouldn't look at them in most cases :)
The false-sharing macro in the example expands to __attribute__((aligned(/* etc */))) or __declspec(align(/* etc */)). Is there a reason these are preferred over the alignas specifier introduced in C++11?
> Now, to support multiple processors on a single machine reading and writing from the same memory in a coherent way, only one processor on a machine can have exclusive access to a given cache line.
Does this also apply when multiple processors are only reading memory?
No. That's what MESI-based cache protocols are all about. Multiple cores/processors can have the same cache line in a shared/read-only state. Only writes require exclusive access.
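A hedged illustration of the false-sharing point, using plain alignas(64) as a stand-in for whatever the article's macro expands to (64 bytes is an assumption about the cache-line size):

```cpp
#include <atomic>
#include <thread>

// Two counters hammered from two threads. Packed, they share one cache line,
// which ping-pongs between cores in exclusive state; aligned to separate lines
// (alignas, or std::hardware_destructive_interference_size where available),
// the false sharing disappears.
struct Packed {
    std::atomic<long> a{0};
    std::atomic<long> b{0};               // same cache line as a
};
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};   // pushed onto its own line
};

template <typename Counters>
void hammer(Counters& c) {
    std::thread t1([&] { for (int i = 0; i < 10'000'000; ++i) c.a.fetch_add(1); });
    std::thread t2([&] { for (int i = 0; i < 10'000'000; ++i) c.b.fetch_add(1); });
    t1.join();
    t2.join();
}

int main() {
    Packed p;  hammer(p);   // time this...
    Padded q;  hammer(q);   // ...against this
}
```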
However, for signed division, the C semantics (round toward zero) are different from the semantics when you apply an arithmetic shift (round toward negative infinity).
If you are fine with the latter behavior explicit shifts remove several extraneous instructions dealing with the difference.
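Concretely (a tiny example; it assumes the usual arithmetic right shift for negative signed values):

```cpp
#include <cstdio>

// For negative values, / rounds toward zero while >> rounds toward
// negative infinity, so the two are not interchangeable.
int main() {
    int x = -7;
    std::printf("%d / 4  = %d\n", x, x / 4);    // -1 (toward zero)
    std::printf("%d >> 2 = %d\n", x, x >> 2);   // -2 (toward -infinity)
}
```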