Top byte ignore for fun and memory savings (linaro.org)
77 points by matt_d on Feb 8, 2023 | 49 comments



Intel has Linear Address Masking and AMD has Upper Address Ignore, which are similar but of course have different implementation details.

LWN wrote about the Linux support here: https://lwn.net/Articles/902094/

How many years until "256 TB ought to be enough for anyone" becomes a repeated quote, and we wish we had kept those extra bytes available for addressing?


Well, TBI only uses one of the two unused bytes, meaning it would still keep working until virtual addresses grow to 56 bits, or 65,536 TB of virtual memory. Once even that's not enough, this CPU architecture isn't going to be used outside of VMs anyway.


It's a rather contrived demo, seeing as LSB-tagging makes pointer and fixnum arithmetic work already, with fixnums requiring no heap allocation, and the implementation of the symbol table [0] is very much O(live values) when creating an object. (Do see "Hash-consing garbage collection" [1] for a non-strawman implementation of coalescing heap objects though.)
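
(For anyone unfamiliar, here's roughly what LSB fixnum tagging looks like; a minimal C sketch with made-up names, not taken from any particular runtime:)

    #include <stdint.h>

    /* Low bit = 1 marks an immediate integer; heap pointers are aligned,
       so their low bit is naturally 0.  No heap allocation for fixnums. */
    typedef intptr_t lisp_value;

    static lisp_value make_fixnum(intptr_t n) {
        return (lisp_value)(((uintptr_t)n << 1) | 1);
    }
    static int      is_fixnum(lisp_value v)    { return v & 1; }
    static intptr_t fixnum_value(lisp_value v) { return v >> 1; }

    /* Tagged addition needs no unboxing: (2a+1) + (2b+1) - 1 == 2(a+b) + 1 */
    static lisp_value fixnum_add(lisp_value a, lisp_value b) { return a + b - 1; }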

But putting metadata in the MSBs of pointers is a nice trick for concurrent garbage collection like Pauseless, ZGC, C4 and such [2]...though I'm not sure how much you win, as collections aren't _that_ frequent, so if you map the same physical memory multiple times (like ZGC, probably C4 too) your TLB should pick up whichever replica in virtual memory you're currently accessing quickly.

[0] https://gitlab.com/Linaro/tcwg/tbi_lisp/-/blob/main/SymbolTa...

[1] https://www.cs.princeton.edu/research/techreps/TR-412-93

[2] https://www.usenix.org/legacy/events/vee05/full_papers/p46-c... - the Azul hardware seemingly only stole one bit still.


It's a contrived demo because everything you could want for a Lisp-like object representation, using spare bits in a pointer, can be done without relying on those spare bits being ignored by the memory referencing hardware---only relying on those spare bits being available.

I recently converted a Lisp to NaN tagging. All it needs is that pointers are restricted to the 48 bit range, not that the top 16 bits are ignored.

All values which are pointers have zero in those top 16 bits; if any of those bits are nonzero, the value is something other than a pointer, and not equal to any pointer.

Other choices are possible within the scope of NaN tagging: it's possible to favor numbers so that values which are floating-point numbers do not require any treatment before use, but recovering pointers requires bit operations.

Either way, the run-time never relies on two different values both being pointers to the same object, let alone both being dereferenced to access that object.
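
As a concrete illustration of the second choice (favoring numbers), here is a minimal C sketch -- names made up, assuming 48-bit pointers and that hardware-produced NaNs never set all of the reserved tag bits:

    #include <stdint.h>
    #include <string.h>

    typedef uint64_t value_t;

    /* A quiet-NaN bit pattern reserved for boxed pointers; ordinary doubles
       (including ordinary NaNs) never match all of these bits. */
    #define PTR_TAG   0xfffc000000000000ULL
    #define PTR_MASK  0x0000ffffffffffffULL   /* low 48 bits hold the address */

    static value_t box_double(double d) {
        value_t v;
        memcpy(&v, &d, sizeof v);             /* doubles are stored as-is */
        return v;
    }

    static int is_pointer(value_t v) {
        return (v & PTR_TAG) == PTR_TAG;
    }

    static value_t box_pointer(void *p) {     /* assumes p fits in 48 bits */
        return PTR_TAG | ((uint64_t)(uintptr_t)p & PTR_MASK);
    }

    static void *unbox_pointer(value_t v) {   /* bit operations to recover it */
        return (void *)(uintptr_t)(v & PTR_MASK);
    }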

That could turn out to be inconvenient in some ways. While memory access is cheap due to not having to mask the special bits (the hardware doing that for "free"), all comparisons which check "are these two values the same object" are going to require masking, because your comparison instructions (xor, subtract, compare, ...) will not ignore any of the bits. Those word comparisons are pretty frequent operations in Lisp: e.g. whether two values are the same symbol. You more frequently compare a symbol than you peek into it to get its properties!

(On one platform, the top bits are not ignored: Android. Its libc uses tagging for added safety. E.g. malloc hands you a tagged pointer. Only the low 48 bits of it matter for addressing, but the tag has to match when you call free. This causes a problem, since I need those bits to be zero in order for a value to be of pointer type. Luckily, I allocate heaps in large blocks. On Android, each block remembers its tag, and then I clear the pointer's top 16 bits. If/when the garbage collector is able to return an entire heap block to the system, those top bits are restored in the pointer handed to free, so everything is cool. The overhead of this is absolutely negligible, therefore. One word of storage per tens of thousands of objects, and some rarely executed bit operations.)


> all comparisons which check "are these two values the same object" are going to require masking, because your comparison instructions (xor, subtract, compare, ...) will not ignore any of the bits.

Wouldn’t two pointers to the same object also have the same tag? I don’t understand why you would need to mask it.


The article shows a reference count field. Presumably, two pointers to the same object could differ in the reference count, which would require masking for them to compare equal.


The structure described is an element of a "symbol table" and not a Lisp pointer. Which makes it even weirder honestly.


"256 TB should be enough for.."

Instead of using true 64-bit addresses, AArch64 systems only use 48 bits. In a decade or two (or maybe even less time) it's reasonable to expect systems will exist with > 256 TB of memory in a single machine.

Thus, is it ill-advised to leverage such a hacky feature, particularly since the bits are guaranteed to rot?

From a power-efficiency perspective, the appeal of using only the minimal 6 of 8 "strictly required" bytes would be understandable... Oh wait, this design choice doesn't reduce power consumption? Hrm.

Would appreciate hearing from an expert about why this is a thing. Also, if the addresses are virtualized, why not take the memory savings that'd come with efficiently packing the addresses into only 48 bits? Too much overhead?


Yup, something like that happened with MacOS, which used 24-bit pointers a loooong time ago.

In general it's not a big issue. You only want to use it with proper CPU support anyways. It's just best to avoid using it for user code and leave it for use by the compiler and/or operating system.

For example, AArch64 supports Pointer Authentication[0]. This uses the additional bits to essentially "sign" a pointer, which could serve as a final defense against some attacks. It is essentially free, so why not use it for the dozen or so years those bits aren't needed yet?

[0]: https://www.qualcomm.com/content/dam/qcomm-martech/dm-assets...


The situation on classic MacOS was a bit different. The issue in that case was that the 68000 was a 32-bit architecture with a micro-architectural limitation of 24-bit addresses. Nobody thought deeply about the implications of that in terms of bincompat and ABI (which is not shocking; very few machines from that era were fully compatible with their predecessors). As such, the processor just did not wire up the top 8 address lines at all. A number of pieces of software (including the MacOS Toolbox) used those high bits for various things, and would read and write to addresses without masking the bits, which worked because the address lines were not hooked up.

That is the reason why almost every architecture after the 68000 specified the behavior of such unused bits (in general they must be sign-extended, IOW the top bits must be all 1s or all 0s). If that had been the case on the 68000, then when software attempted to access pointers with data in the high bits it would have caused an exception when the load/store was issued. It is worth noting this would not have prevented people from using the top bits; it would just have forced them to manually mask them in software (a technique that is used all over the place for things like tagged pointers, etc.). That still might have been an issue when the memory space was actually expanded to the point where the high bits were needed for real physical or virtual addresses.

TBI is actually about explicitly stating, architecturally, that these 8 bits are used for data, not addresses. That is different from the situation on the 68000 because it is an architectural guarantee that future processors will not use those bits for addresses, which means you won't have bincompat issues and you don't have to bother masking them before dereferencing them.


Those old 68k machines used a single shared address space for all processes. Now, with a per-process virtual address space, it makes a lot of sense not to require full-width pointers; ideally the process should declare how many pointer bits are actually needed and the hardware should mask the rest before the page table lookup.


> Would appreciate hearing from an expert about why this is a thing. Also, if the addresses are virtualized, why not take the memory savings that'd come with efficiently packing the addresses into only 48 bits?

Not an expert at all, but the place where I've really used this sort of method is when writing a GC and trying to figure out where to store a "pointer has been visited" marker in the pointer instead of in a different data structure.

This was also on ARM, but quite a long time ago, on a Sharp Zaurus. The word-aligned pointer model gave 2 bits at the end of the pointer to mark cycles in the GC pathway.

This was much simpler than a separate data structure to keep the mark phase of the traversal in memory & was an extension of the "1 bit reference counter" paper[1] with white, gray and black for the bits in there.

Of course, you need to go back and rewrite them out to 00, but with virtual memory in the mix you can play bigger tricks. This saves a lot of memory and hash lookups, and is a very low-overhead way of keeping 1:1 metadata storage for every pointer you have.
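
In C, the low-bit colour trick looks roughly like this (a sketch with illustrative names; the real thing also has to strip the bits before every dereference):

    #include <stdint.h>

    /* Word-aligned pointers leave the two low bits free: enough for a
       tri-colour mark (white/grey/black) stored right in the reference. */
    enum { WHITE = 0, GREY = 1, BLACK = 2, COLOUR_MASK = 3 };

    static void *strip(void *p) {             /* before dereferencing */
        return (void *)((uintptr_t)p & ~(uintptr_t)COLOUR_MASK);
    }
    static int colour(void *p) {
        return (int)((uintptr_t)p & COLOUR_MASK);
    }
    static void *paint(void *p, int c) {      /* rewrite the stored reference */
        return (void *)(((uintptr_t)p & ~(uintptr_t)COLOUR_MASK) | (uintptr_t)c);
    }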

You'll see the same trick in the JVM ZGC where higher bits are used to keep track of the metadata about the items being collected [2].

With a little bit of pointer arithmetic, you can also turn a pointer read-only, to prevent some process from modifying shared data outside of a lock without having to copy the data before unlocking. I had a double-mmap mode in PHP APC which segv'd when the core engine mutated a cached data structure outside of a lock [3].

[1] - https://link.springer.com/article/10.1007/BF01932156

[2] - http://hg.openjdk.java.net/zgc/zgc/file/59c07aef65ac/src/hot...

[3] - https://notmysock.org/blog/2009/Jan/08/


Years back I was building some concurrent data structures and I had cpu instructions to operate atomically on 64-bit quantities. So, with my data structures storing pointers, it was very useful to use the 3 low bits (0 in the pointer due to alignment) for 3 booleans, and I think sometimes a counter or something in the high bits, so my atomic operations could work on a few more variables.


Yes, this kind of thing is common in mutex and other concurrency primitive implementations. You can store the lock's owner (an aligned pointer) and some bits about if any reader is waiting for the lock; that kind of thing.
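
Sketched in C11 (hypothetical names; assumes the owner pointer is at least 4-byte aligned so the low bits are free):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define WAITERS_BIT ((uintptr_t)1)   /* low bit: "someone is waiting" */

    typedef struct { _Atomic uintptr_t word; } lock_t;

    /* One word-sized CAS covers both the owner pointer and the waiters flag. */
    static bool try_acquire(lock_t *l, void *owner) {
        uintptr_t expected = 0;          /* unlocked, no waiters */
        return atomic_compare_exchange_strong(&l->word, &expected,
                                              (uintptr_t)owner);
    }
    static void *owner_of(uintptr_t word) {
        return (void *)(word & ~WAITERS_BIT);
    }
    static bool has_waiters(uintptr_t word) {
        return (word & WAITERS_BIT) != 0;
    }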


ARM wants to use the extra bits for memory tagging, a feature that would allow detecting C memory bugs.


In a few years, 256 TiB will be barely enough to run VSCode, Slack, Teams, and a browser concurrently.

But yeah, the idea is that you can add tagging info to the high bits of pointers (to enforce data types, reference counting, etc.) and pay zero cost to indirect through such a tagged pointer. No need to mask off the tag bits, just use it like a regular pointer.
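
Something like this, as a sketch (tag values are made up; on AArch64 Linux, TBI is enabled for user-space data accesses, so the tagged pointer dereferences directly, whereas on a CPU/OS without TBI the same load would fault):

    #include <stdint.h>

    #define TAG_SHIFT 56   /* top byte: bits 56..63 */

    static void *tag_ptr(void *p, uint8_t tag) {
        return (void *)((uintptr_t)p | ((uintptr_t)tag << TAG_SHIFT));
    }
    static uint8_t tag_of(void *p) {
        return (uint8_t)((uintptr_t)p >> TAG_SHIFT);
    }

    /* With TBI, *(int *)tag_ptr(p, 0x2a) loads from the same address as
       *(int *)p -- no masking on the hot path. */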


Most modern processors read their native word size from memory internally. And that's it. So to read a 16-bit word that is split over two 64-bit words (e.g. at addresses 0x1007 and 0x1008) on a machine with a 64-bit data bus, two 64-bit reads are performed.

It's standard to store a byte or half-word in a struct as a full 32-bit or 64-bit field, for performance reasons.

Some RISC machines can't even do such sub-word accesses at all: the original DEC Alpha, for example, had no byte or half-word load/store instructions until the BWX extension was added.


Most modern processors do reads on the level of cache lines.


The JVM has a special mode called CompressedOops, used when you reserve a heap of 32 GB or below, in which all references are 32-bit instead of 64-bit.
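
Roughly, a compressed reference decodes like this (a sketch assuming zero-based compressed oops and 8-byte object alignment, which is why 32 bits can cover a 32 GB heap):

    #include <stdint.h>

    static inline void *decode_oop(uint32_t narrow, uintptr_t heap_base) {
        /* 32-bit reference scaled by the 8-byte alignment: 2^32 * 8 = 32 GB */
        return (void *)(heap_base + ((uintptr_t)narrow << 3));
    }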


> it's reasonable to expect systems will exist with > 256TB of memory in a single machine.

Note that this is a limit on address space, not on physical memory. You can need that much address space e.g. for mmapping large files, and disk arrays with files larger than 256 TB are realistic even today (that is just 16x 16 TB HDDs).


It is commonly used to mitigate concurrency issues such as ABA, which normally occur in lock-free data structure design: https://en.wikipedia.org/wiki/ABA_problem
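
For example, a lock-free stack head can carry a generation counter in the otherwise-unused top 16 bits, so a CAS notices if the same address was freed and reused in between (a sketch with illustrative names, assuming 48-bit user-space addresses):

    #include <stdatomic.h>
    #include <stdint.h>

    #define ADDR_MASK 0x0000ffffffffffffULL

    typedef _Atomic uint64_t tagged_head_t;

    static uint64_t make_head(void *node, uint16_t gen) {
        return ((uint64_t)gen << 48) | ((uint64_t)(uintptr_t)node & ADDR_MASK);
    }
    static void    *head_ptr(uint64_t h) { return (void *)(uintptr_t)(h & ADDR_MASK); }
    static uint16_t head_gen(uint64_t h) { return (uint16_t)(h >> 48); }

    /* pop() would CAS make_head(next, head_gen(old) + 1) over the old head,
       so an A-B-A sequence at the same address fails the compare. */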


Using shorter virtual addresses makes for a simpler page table and translation lookaside buffer (TLB) design.

Intel processors support up to 57-bit virtual addresses (128 PB), using a 5-level page table. The trade-off for the increased size is an increase in worst-case latency when translating a virtual address.


Only special applications will need that much in a single address space. OS's will likely have support for 48 bit processes vs full 64 bit, which isn't complicated; basically just restricting the virtual memory mapper below a certain address.


I'm particularly excited for top byte ignore, as it can be used for some pretty creative things.

In Vale, we're planning on storing a small "offset-to-generation" integer in the top bits. This will let us easily find the top of the allocation that contains the object we're pointing to. When the user doesn't want to use the opt-in region borrow checker, this mechanism helps provide an easy fallback.

The technique could theoretically also be used for RC'd languages to allow taking references to the interior of objects, which (I believe) would help those languages close the gap between them and high-performance languages like Rust, Zig, Vale, etc.

I hope it catches on with more CPUs.

Another reason I'm excited for memory tagging is that it can somewhat reliably help identify memory unsafety in development/testing. Perhaps a low-level language (Zig? Odin?) could even make it work correctly in the presence of tagless unions, to enable a more efficient alternative to tagged unions (like in Rust) for certain situations.


Does this mean vale can only support 256 memory generations? Or is there a way to handle more?

> The technique could theoretically also be used for RC'd languages to allow taking references to the interior of objects, which (I believe) would help those languages close the gap between them and high-performance languages like Rust, Zig, Vale

I don’t understand how it would make any difference. You would still need to update the reference count, which is what makes RC expensive. It would make accessing slightly faster, because you don’t need to mask the top bits, but that wouldn’t make up for the bigger cost.


Let's hope Apple has learned from the past and doesn't use this feature in its ARM machines - 24 bit addressing (the original 68000 only had 24 address lines but 32 bit registers) was abused by software developers who used the 8 MSBs of a pointer to store additional information. When switching to the 68020, which could address 4 GB, this caused a lot of trouble...

See https://apple.fandom.com/wiki/24-bit_addressing for more details.


Since people know about it, and turn it on only when they want to use it, it's not a problem.

Had the move to the m68020 and m68030 included mandatory memory management, then old tasks could've all been run in the first 16 megs and clean apps run anywhere else. It's a shame Apple didn't do this.

Since this is now a documented feature of the CPU, processes can have up to 65,536 terabytes of usable memory and still use this. If they need more than 65,536 terabytes of memory, they can simply not use this feature :)


Still, with VM enabled (the 68851 PMMU was only available some time after the 68020 came out and was an expensive addition; only the 68030 included an on-chip MMU, as you probably know - so Apple was unable to use it in the Mac II initially), all 256 possible aliases for software storing arbitrary data in the 8 MSBs would have to be mapped in the worst case. This would certainly be possible, but depending on the application it might imply quite some overhead due to the limited number of TLB entries available.

Of course, on 68k Macs the OS parts stored in ROM were also not 32-bit "clean". MODE32, a tool developed by Connectix, patched the ROMs (traditional MacOS stores entry points for system functions in RAM, which can be modified) to get rid of 24-bit addressing (https://en.wikipedia.org/wiki/MODE32).

What would be nice, of course, is to use the ARMv8-a top byte ignore feature to implement tagged data types for Lisp and Smalltalk implementations... [and I should read the whole article since this is a large part of it... oops]

"65,536 terabytes ought to be enough for anybody." :)


> old tasks could've all been run in the first 16 megs and clean apps run anywhere else. It's a shame Apple didn't do this.

It would not have been feasible; RAM was expensive back then. Most of Apple's 68020 and 68030 machines only shipped with 1MB of RAM; it was not until 1992 that they started shipping machines with 4MB.


You don't need that much memory to make use of memory management. In many cases, using memory management can help to save memory, but that's another discussion.

Obviously you don't need to use MODE32 if you don't use more than 8 megs. Same thing. Memory wouldn't be rearranged to have (stuff in the first 16 megs) and (everything else) until you have more memory than would fit in the first 16 megs.


Apple disables TBI and uses those bits for pointer tagging (edit: only for code; it’s enabled for data, which is likely where you’d want it more). This is pretty much irrelevant at this point anyway: applications today that use pointer tagging are generally written with portability in mind and have a way to adjust based on how many bits are available (if any). I don’t actually think Apple exposes a way of detecting whether it’s enabled, so it’s probably not worth using until they commit to it.


I'm also fairly confident that there will be a rebuild-the-world ABI break between now and the time Apple ships machines with > 256 TBs of memory.


was abused by software developers

The developers responsible for this 'abuse' were first and foremost Apple themselves - key services provided by the OS/ROMs used the high order bits for various things.


You don't want to store reference counts in pointers. The reference count of a heap-allocated object is a property of that object, and not of the pointer.

The pointer gets copied; the object does not. You're going to end up with copies of the pointer which have different reference counts, not knowing which one is the real one.

As a rule of thumb, no mutable property of the heap object should be forwarded into the pointer word used to reference it.

Not sure where the author is going with all that.

Edit: it may work if these pointers are handles in a table, which seems to be what the article is getting at. The actual value representation points or indexes into the table and those are passed around.

So the table is just a very compact way to store objects in a single pointer-sized word.


> However unlike Linux, I could find no official statement of support so do not assume this will always be the case.

TBI is used by the memory tagging extension (MTE; a primary ingredient of "HWASAN"). I'd be pretty surprised if Apple Silicon didn't support it officially. IIRC they're 8.6 and MTE is specified in 8.5? Note that it's not the same as pointer authentication, which Apple actively promotes and, as I recall, contributed to the community toolchain development for.


A bit of a dangerous feature to use. Used carelessly, it can lead to platform lock-in.


This is so low-level, I think you'd just consider the respective parts "platform-dependent code", and provide alternative implementations for other platforms (possibly including a very generic one).


I'm not sure this is true, since you could simulate it on other CPUs. I believe most modern OSs don't use the top byte, so you can store things in there (as long as you clear it before trying to dereference it).
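
i.e. a portable fallback just strips the byte on the way to every dereference (a trivial sketch):

    #include <stdint.h>

    #define TOP_BYTE_MASK 0x00ffffffffffffffULL

    static inline void *strip_top_byte(void *p) {
        return (void *)(uintptr_t)((uint64_t)(uintptr_t)p & TOP_BYTE_MASK);
    }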


While that may be true, my experiences from Motorola 68k times still make me shiver. The 68k used just the bottom 24 bits for an address. Later models used the full 32 bits, breaking anything that used the top 8 bits for something else.

The incompatibility it introduced was that bad, and hard to get rid of.

So pardon me for feeling uneasy.


Yes, but on the Mac (where these bits were used) the entire system participated, whereas this is per-process and needs to be explicitly enabled.


Almost the whole IBM 360 mainframe line used 24 bits for addresses, and you had the top byte for your own use. The 370 line had a mode for 31-bit (not a typo) addresses in user space.


NBA Jam Tournament Edition for the arcade had a CPU with bitwise addressing https://en.m.wikipedia.org/wiki/TMS34010 and the programmers used the lower 3 bits of pointers to store extra values. Source: a friend of mine ported the game to the PC by hand-transcoding the assembly.


Can I just use the first 48 bits of a 64 bit pointer forever for the address and then the rest for whatever else? Or is there a forward-compatibility risk?


According to Wikipedia, Intel's Ice Lake microarchitecture, released in September 2019, supports 5-level paging, which uses bits 0-56 for virtual addresses [0].

A quick search hasn't yielded any info on when AMD support for this may be coming, but I'd assume it (or something similar) is in the works.

[0]: https://en.wikipedia.org/wiki/Intel_5-level_paging


On ARM, there are a couple of security extensions that use the upper bits for memory tagging and/or a short pointer authentication code. The latter is already in use on iOS.

I would expect Intel or AMD to follow (if they haven't already).


Why only the top byte if there are two unused bytes before the remaining 48 bits of address?


Some Intel server CPUs support 57-bit addresses [1].

[1]: https://en.m.wikipedia.org/wiki/Intel_5-level_paging


Are there actual computers running these CPUs addressing that much real or virtual memory?

What’s the practical purpose of implementing that in Ice Lake?


Apparently AWS will rent you a machine with 24 TB, so that's within an order of magnitude of the 256 TB limit.



