Memory Tagging and how it improves C/C++ memory safety (arxiv.org)
119 points by ingve on Feb 27, 2018 | 36 comments



It seems pretty weird to me that a paper submitted in 2018 doesn't mention Intel's MPX (https://en.wikipedia.org/wiki/Intel_MPX), which is pretty much tackling exactly the same problem, in hardware (and requiring software cooperation).

I'm not yet taking a position on the relative merits of the bounds-register-based MPX approach versus the tagged-pointer one presented here, but the comparison should at least be made, if only to explain why MT would succeed where MPX has hardly taken the world by storm.

Also, since they appear to be positioning this as something that could be fast enough to run in production (with hardware assist), they should at least take a shot at analyzing the attack surface presented by the probabilistic protection: a figure like 99.6% chance of catching a rogue access makes sense for testing scenarios, but if the adversary is an attacker, can they manipulate those odds to their advantage? The exact tag allocation mechanism becomes pretty important in this case.


A long time ago I found some memory problems in a C++ codebase we had inherited that crashed from time to time. On Windows you can protect memory blocks with a function call, so I overloaded new and malloc and added protected blocks before and after the actual memory. Then we ran the code for a long time and it would fault right at the point of the memory overwrite instead of somewhere later. It made the code very slow, but we found problems that were incredibly hard to find any other way.
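
Roughly, the trick (a simplified sketch, not our actual code, and ignoring alignment) looked like this:

    // Allocate the requested size plus one extra page, and make that last
    // page inaccessible so any overrun faults at the offending instruction.
    #include <windows.h>
    #include <cstddef>

    void* guarded_alloc(std::size_t size) {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        const std::size_t page  = si.dwPageSize;
        const std::size_t total = ((size + page - 1) / page + 1) * page;  // data pages + one guard page

        char* base = static_cast<char*>(
            VirtualAlloc(nullptr, total, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
        if (!base) return nullptr;

        DWORD old;
        VirtualProtect(base + total - page, page, PAGE_NOACCESS, &old);  // the guard page

        // Place the block so it ends exactly at the guard page.
        return base + total - page - size;
    }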


On illumos you can use libumem to replace malloc() via LD_PRELOAD, and then you can instruct it to do the same thing via the UMEM_DEBUG environment variable. It's a fantastic technique for catching these issues without completely destroying application performance.

See also: https://illumos.org/man/umem_debug


These days the gcc and clang memory/address sanitizers can do this quite nicely, and they tend to have low overhead as well (in my limited experience so far).


ASan has lower overhead than Valgrind, and is great when you cross-compile something to an embedded/constrained system. I have however observed noticeable overhead on code doing many small allocations.

Which was nice to be able to observe, because the code didn't need to do many small allocations.


When running a test suite at work with ASan + UBSan turned on, the test goes from taking ~30 seconds to ~20 minutes. So there are certainly some pathological cases.


Try it with just ASan. Some of the UBsan checks can be quite expensive.

Also, check how much free RAM your computer has when running with ASan. It may be running out of RAM and, e.g., spending a lot of time looking for other memory to free, or even swapping.


The Windows system heaps include a similar feature called PageHeap. You can enable it for any process using Microsoft's pageheap.exe utility. It uses a lot of extra (virtual) memory and doesn't work well for applications that have their own memory sub-allocator like jemalloc or dlmalloc.

https://docs.microsoft.com/en-us/windows-hardware/drivers/de...


This approach was implemented in Bruce Perens' "Electric Fence" library [1987].

https://en.wikipedia.org/wiki/Electric_Fence

I think he might have developed that while at Pixar.


I had similar memory issues once and was able to trace it using valgrind, which afaik did something similar under the hood.


We did the same thing for some device driver code and also used a tool called "BoundsChecker", back in the 90s. I just checked and it still exists!


We also had BoundsChecker but it was my little library that found most bugs eventually :)


Another great resource on hardware extensions for security, including efficient tagged memory, is the Cheri work at Cambridge:

http://www.cl.cam.ac.uk/research/security/ctsrd/cheri/


I've seen quite a lot of RISC-V discussion lately. At first I was confused and asked myself why people are so interested in hardware that doesn't come out ranked top X in performance. Now I get it: it is close to impossible to evaluate new ideas/proposals like memory tagging on CPUs with closed designs and tight patent "protections".


This somewhat reminds me of the Microsoft debugger convention of initializing memory with various tags so that when you crash with a pointer bug, you can tell what kind of memory was trod upon. They even made the hex codes semi-mnemonic so it was easier to remember, e.g., 0xCD=clean, 0xDD=dead, 0xFD=fence, etc.

https://stackoverflow.com/a/370362/1424242

It's interesting that with the method in the paper, "Temporal and spatial bugs are detected probabilistically", and the probability depends on how many tag bits you choose, which trades address bits for more tags (and a higher chance of catching errors).

> Memory safety in C and C++ remains largely unresolved.

I read the paper's description, and still don't quite understand exactly what they mean by "memory safety". This is not referring to safer programming practices like using the STL with iterators, correct?


> I read the paper's description, and still don't quite understand exactly what they mean by "memory safety". This is not referring to safer programming practices like using the STL with iterators, correct?

Memory safety generally refers to treating each region of memory as its expected type, not writing or reading past the bounds of an allocated region, not reading uninitialized or freed memory, and so on.

It's a correctness problem that often ends up in the news since it's usually unchecked in C and C++, meaning that errors of that type don't necessarily result in predictable exceptions or process termination, but rather all sorts of exciting exploits.

Many higher-level languages have no similar memory safety issue (absent bugs in their implementations) since they exclude features that could result in the above problems, and include explicit runtime checks in places where the compiler or runtime cannot prove that an access is safe.
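
For concreteness, a toy example (illustrative only, not from the paper) hitting each of those bug classes; none of it is guaranteed to crash, which is exactly the problem:

    #include <cstring>
    #include <cstdlib>

    int main() {
        char* buf = static_cast<char*>(std::malloc(8));
        std::memcpy(buf, "0123456789", 10);   // spatial error: writes past the 8-byte block

        std::free(buf);
        buf[0] = 'x';                         // temporal error: use after free

        int uninit;
        return uninit;                        // read of uninitialized memory
    }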


> This somewhat reminds me of the Microsoft debugger convention of initializing memory with various tags so that when you crash with a pointer bug, you can tell what kind of memory was trod upon. They even made the hex codes semi-mnemonic so it was easier to remember, e.g., 0xCD=clean, 0xDD=dead, 0xFD=fence, etc.

The BSDs' malloc.conf framework does something similar, though they've managed to be incompatible with one another.

On FreeBSD, activating junking (default off) will fill allocated memory with 0xa5 and deallocated memory with 0x5a.

On OpenBSD, allocated memory uses 0xdb, deallocated memory uses 0xdf, junking is default-enabled for "small chunks" & the start of freed pages (which are also checked for write-after-free on a small delay).


.. and on glibc (the usual libc on Linux), the MALLOC_PERTURB_ environment variable can be used for similar stuff (http://man7.org/linux/man-pages/man3/mallopt.3.html ).


Wikipedia has a long list of magic debug numbers used by different tools and operating systems:

https://en.wikipedia.org/wiki/Magic_number_(programming)#Mag...


> I read the paper's description, and still don't quite understand exactly what they mean by "memory safety". This is not referring to safer programming practices like using the STL with iterators, correct?

I would assume it's referring to some kind of actual guarantee, which STL doesn't provide even in idiomatic use, due to things like iterator invalidation or subtly broken use of references.
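
For example (a contrived snippet, not from the paper), perfectly idiomatic-looking STL code can still be memory-unsafe:

    #include <vector>

    int main() {
        std::vector<int> v{1, 2, 3};
        int& first = v.front();        // reference into the vector's current buffer
        for (int i = 0; i < 100; ++i)
            v.push_back(i);            // may reallocate, leaving 'first' dangling
        return first;                  // potential use of a dangling reference, no diagnostic
    }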


Here's related work for those interested in the subject. Two old ones:

Burroughs Architecture https://www.smecc.org/The%20Architecture%20%20of%20the%20Bur...

Capability-based Computer Systems (book) https://homes.cs.washington.edu/~levy/capabook/

Some new ones using a range of techniques:

SAFE Architecture http://www.crash-safe.org/papers.html

CHERI Architecture http://www.cl.cam.ac.uk/research/security/ctsrd/cheri/

Note: CHERI runs a modified version of FreeBSD on an FPGA board. Quite practical if it can handle that!

Watchdog Architecture https://www.cs.rutgers.edu/~santosh.nagarakatte/softbound/

Hardware-assisted Secure Allocator http://jin.ece.ufl.edu/papers/HASP17.pdf


Memory tagging in hardware? Yay, they're implementing 1970s Lisp Machines again!


It would be a godsend for GC, much better than malloc/free safety checks. There are small boards with HW-assisted GC support, 2 bits per pointer, but no big ones as on a lispm.


Does anyone know how this relates to how the Burroughs B5500 and B6500 operated?


This is very similar to the Burroughs approach that later influenced Multics and the lispm architectures (and is available in some RISC-V versions). Notably, Intel used this in the unfortunate iAPX 432 and cleverly built some of this approach into their segmented memory model.

However, by the late 1970s it had been figured out that you could do some of these tricks without dedicated hardware support (mask-boxing of integers, for example: on processors that ignored unaligned access you could use the least significant bits for tagging; on ones with address spaces smaller than the word length you could often do the same with the upper bits). These bits could be used by GC and/or for fast type systems, and by the mid 1980s this had pretty much wiped out the advantage of dedicated hardware for Lisp implementations (the lispms started out much faster at tag manipulation, but the weight of work on the general-purpose machines caused the latter to get better (and cheaper) much faster).
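
The low-bit trick is simple enough to sketch (illustrative only, names made up):

    #include <cstdint>
    #include <cassert>

    // 8-byte-aligned pointers always have their low 3 bits clear, so those
    // bits can carry a type tag with no hardware support at all.
    enum Tag : std::uintptr_t { TAG_FIXNUM = 1, TAG_CONS = 2, TAG_MASK = 7 };

    inline void* tag_ptr(void* p, Tag t) {
        assert((reinterpret_cast<std::uintptr_t>(p) & TAG_MASK) == 0);
        return reinterpret_cast<void*>(reinterpret_cast<std::uintptr_t>(p) | t);
    }

    inline Tag get_tag(void* p) {
        return static_cast<Tag>(reinterpret_cast<std::uintptr_t>(p) & TAG_MASK);
    }

    inline void* untag_ptr(void* p) {
        return reinterpret_cast<void*>(reinterpret_cast<std::uintptr_t>(p) & ~std::uintptr_t(TAG_MASK));
    }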

The bigger advantage of hardware support IMHO is the security opportunity for capability management.


Am I correct in thinking that this is a similar idea to page faults but per allocation rather than per process?


After skimming the paper, I think it's actually closer to segmentation. They just split the address space into X number of segments. The allocator then ensures every allocation lies completely within one of these segments and attaches that segment's ID to the returned pointer. Fundamentally, paging and segmentation are fairly similar though, so you're not wrong, there is just no concept of a 'page-table' here.

That said, I feel like the paper is missing a few details on the implementation. They give the example of a segment size of 64 bytes, and a tag size of 4 bits. Obviously, that doesn't even cover a full 4K page of data. So I presume the segments must repeat, meaning after segment 15 there is another segment 0. This seems to fit into their probability description (An OOB access might hit the next segment with the same ID, but it's unlikely).

The approach still seems a bit limited though. I would assume objects larger than 64 bytes would have to be untagged, since normally you would access them via offsetting from the same base pointer, and that base pointer would always have the same tag. But also, a big problem is that most OOB accesses are likely to hit the region of memory just past the allocated region. So with larger segments you are likely to miss many of these OOB accesses since the tag is still correct.

All that said, I don't think this idea is mentioned in the paper, but you could also vary the size of every set of tags in the address space. So have, say, 20 sets of tags for 64-byte regions, then after that 20 sets of tags for 1024-byte regions, etc., so there is a mix of small and large regions. There's no guarantee you would have the right amount of different-sized regions, but you can always fall back to untagged memory if you need to, so it is not a disaster if the numbers are a little off (besides, of course, the fact that untagged memory loses the security advantages of the tagged memory).


> After skimming the paper, I think it's actually closer to segmentation. They just split the address space into X number of segments. The allocator then ensures every allocation lies completely within one of these segments and attaches that segment's ID to the returned pointer.

It's sort of the other way around. Replacing "tags" with "colors" since somehow it seems a bit more intuitive, the allocator itself just "colors" different allocations with a different color, rather than splitting up the address space beforehand. The returned pointers are also colored and a fault occurs if any pointer is used to access memory of a different color.

> Obviously, that doesn't even cover a full 4K page of data. So I presume the segments must repeat.

Definitely - every color will be used many times, so they only probabilistically catch any fault, based mostly on the total number of colors (as long as the colors were randomized from run to run it seems like any fault would be quickly caught with high probability). They mention in the paper improvements such as ensuring that adjacent regions always have different colors so that linear overruns are caught with 100% probability.

> I would assume objects larger than 64 bytes would have to be untagged, since normally you would access them via offsetting from the same base pointer, and that base pointer would always have the same tag.

Not at all. The 64 bytes is only the granularity of the colors (actually they suggest 16 bytes is a better option), but an arbitrarily large region can have the same color.

Most of the rest of your post follows I think from that misunderstanding: there is no need to change the color every 16 or 64 bytes (they call this quantity TG) - you color the whole region with the same color. The TG value is just the granularity at which it can change, so it's a tradeoff between overhead for small allocations (favoring smaller TG) and total tag storage cost (favoring larger TG).
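
As a rough software-only sketch of that coloring step (all names and parameters here are my assumptions, not the paper's implementation):

    #include <cstdint>
    #include <cstdlib>

    constexpr int kGranShift = 4;       // log2 of the 16-byte tag granularity (TG)
    constexpr int kTagShift  = 56;      // color carried in the otherwise-unused top pointer byte
    extern std::uint8_t* g_shadow;      // flat shadow region, one tag byte per granule (assumption)

    void* tagged_malloc(std::size_t size) {
        std::size_t rounded = (size + 15) & ~std::size_t(15);   // round up to the granularity
        void* p = std::aligned_alloc(16, rounded);
        if (!p) return nullptr;

        std::uint8_t color = std::rand() & 0xF;                 // pick a 4-bit color for this allocation
        std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
        for (std::uintptr_t a = addr; a < addr + rounded; a += 16)
            g_shadow[a >> kGranShift] = color;                  // color every granule of the block

        // Return a pointer carrying the same color in its top byte.
        return reinterpret_cast<void*>(addr | (std::uintptr_t(color) << kTagShift));
    }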


Reading it again, I guess I missed the "shadow memory" concept they talked about, but they only mention it 5 times and none of those mentions are in the actual explanation of how they do the memory tagging. They do mention it one more time in the AArch64 implementation though, so it does look like you're correct on this. I don't understand why they didn't make this detail much clearer, since, as you've pointed out, it changes a lot about how things work.

Now, if that is the case it complicates the setup a fair amount, since now you do need a global 'page-table'-like structure holding this information that the allocator configures. And from reading the paper, don't they intend for the kernel to do this checking as well on userspace pointers, indicating the kernel also needs access to this information? It's doable, and better overall, but more complicated.


Yes, the memory->tag mapping has to be stored somewhere, but that's true even for a segmented system, right? You need some place to store the segment offset and length: either some dedicated registers if you only allow a few of them, or some in-memory tables for more general schemes, right?

Yeah they don't really get into how the shadow memory is managed. If you only had limited or no hardware support, it seems like it would be tricky, especially when you are arbitrarily mapping various chunks of the address space. The ideal for performance is that there is some straightforward relationship between a pointed-to region and the associated tag bits, e.g., just shift the pointer down by lg(TG) bits and then index into a single contiguous region which stores the tag bits, pointed to by some global.
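
Something like this, say (a sketch with assumed names; the paper doesn't spell this out):

    #include <cstdint>
    #include <cstdlib>

    constexpr int kGranShift = 4;        // log2 of the 16-byte tag granularity
    constexpr int kTagShift  = 56;       // tag carried in the top pointer byte
    extern std::uint8_t* g_shadow;       // flat shadow region pointed to by a global (assumption)

    inline void check_access(const void* tagged_ptr) {
        std::uintptr_t p    = reinterpret_cast<std::uintptr_t>(tagged_ptr);
        std::uint8_t   tag  = static_cast<std::uint8_t>(p >> kTagShift) & 0xF;
        std::uintptr_t addr = p & ((std::uintptr_t(1) << kTagShift) - 1);  // strip the tag bits

        if (g_shadow[addr >> kGranShift] != tag)    // shift-and-index shadow lookup
            std::abort();                           // tag mismatch: report or fault
    }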

That avoids the need to have any kind of page-table like structure, but it would waste a lot of virtual address space if the range of the pointers was very large or very sparse.

If you had hardware support though, maybe this problem goes away: the hardware could store the bits in "holes" right alongside the actual memory, much as ECC works (they suggest SPARC does this), or it could include the information in the TLB alongside the regular mappings (I guess this would need kernel support too).

I didn't see much about kernel-side checks in the paper? Maybe I missed it. The kernel already checks that user pointers are valid in the context of the process that provided them, but of course that is a very coarse check (can the user process as a whole read/write here?), and could still violate semantic memory safety of the process depending on what was passed.


> Yes, the memory->tag mapping has to be stored somewhere, but that's true even for a segmented system, right? You need some place to store the segment offset and length: either some dedicated registers if you only allow a few of them, or some in-memory tables for more general schemes, right?

Not if your segments/tags are static and defined beforehand, which is what I thought they were going for (since that doesn't require any extra memory). But obviously that approach is pretty limited, which was my thinking.

> Yeah they don't really get into how the shadow memory is managed. If you only had limited or no hardware support, it seems like it would be tricky, especially when you are arbitrarily mapping various chunks of the address space. The ideal for performance is that there is some straightforward relationship between a pointed-to region and the associated tag bits, e.g., just shift the pointer down by lg(TG) bits and then index into a single contiguous region which stores the tag bits, pointed to by some global.

This is exactly what I'm thinking as well. If the platform has paging support, in theory this is cheaper than it seems because (besides using up tons of virtual memory) the array can be sparsely allocated, with only the regions actually in use taking up physical memory. But 64 bytes is still really coarse, so it might be necessary to use a multi-level approach to make the size a bit saner. But since it is just virtual memory, depending on the architecture it may not matter.

Storing it in the page table mappings themselves is an interesting idea. Of course, some architectures have more 'free' space in there than others. And at that point you're talking about likely 4K granularity, not 64 bytes. Still better than nothing of course, but it would definitely decrease the effectiveness for smaller objects.

> I didn't see much about kernel-side checks in the paper? Maybe I missed it. The kernel already checks that user pointers are valid in the context of the process that provided them, but of course that is a very coarse check (can the user process as a whole read/write here?), and could still violate semantic memory safety of the process depending on what was passed.

Someone else asked about this as well, it seems I was led astray. The paper has this line about ASAN:

    Does not currently detect when the kernel accesses invalid user-space memory
From reading that, I thought they were then going to use their memory tagging system to do such checking (And in theory you could do it as long as the kernel knew about the mappings). But they never actually bring this up again in the paper.


> This is exactly what I'm thinking as well. If the platform has paging support, in theory this is cheaper than it seems because (besides using up tons of virtual memory) the array can be sparsely allocated, with only the regions actually in use taking up physical memory.

Yes exactly. The worst case still seems like a doubling of (actual) memory use though: imagine a very sparsely accessed region of memory: even today each access would bring in a whole 4K page (on x86, for example), but now you'd bring in a whole 4K shadow page as well (yes, the access has to be really sparse to get this 1:1 ratio).

> And at that point you're talking about likely 4K granularity, not 64 bytes. Still better than nothing of course, but it would definitely decrease the effectiveness for smaller objects.

I think a 4K granularity is next to useless - you already get that from the hardware/OS paging support. If you want to detect problems at a 4K level, you only need to change your allocator to mmap in a page at a time, which some allocators already let you do with a tunable.


> Yes exactly. The worst case still seems like a doubling of (actual) memory use though: imagine a very sparsely accessed region of memory: even today each access would bring in a whole 4K page (on x86, for example), but now you'd bring in a whole 4K shadow page as well (yes, the access has to be really sparse to get this 1:1 ratio).

In theory you should be partially saved by the allocator, just because it is going to try to place as much as it can into every page it allocates, as well as allocate multiple pages at once (so they all end up in a row). An allocator that was aware of the 'shadow' pages should be able to allocate chunks of pages at once that correspond to a single 'shadow' page, ensuring you avoid creating tons of them. The higher alignment requirements mean more wasted memory though, meaning you need to allocate more pages and then need more 'shadow' pages overall.

The paper suggests anywhere between 14% and 18% RAM overhead (discounting Android apps, which had a range from 6% to 42%) for 64 byte granularity, and that it actually goes down if you choose a smaller granularity like 16, due to the lower alignment requirement.

And of course, you're also talking about doubling memory accesses as well, assuming this is done in software. Every access has to first check the 'shadow' page. Unfortunately both of their implementations are at least partially hardware-assisted, so I'm not sure you can apply those numbers to a pure software implementation, but they suggest around a 2x slowdown.

> I think a 4K granularity is next to useless - you already get that from the hardware/OS paging support. If you want to detect problems at a 4K level, you only need to change your allocator to mmap in a page at a time, which some allocators already let you do with a tunable.

I would agree. The tagging system can prevent a higher percentage of OOB accesses at the page level in theory, but considering the size of the address space vs the number of allocations (assuming a 64-bit/48-bit address space) I think the likelihood of an OOB access actually hitting another separate allocation (assuming mostly random placement) is already pretty low, so the tagging system would make it lower (you now have to hit a random allocation with the correct tag) but not significantly lower than before, and at the cost of performance.


I don't understand the implication of why the kernel needs to be aware of the checking. The tag bits are architecturally ignored.

You can remove the restriction on the number of colors by using valid upper virtual-memory bits to denote type (BIBOP), as long as you know that the kernel hasn't put anything else there. Maybe that's the kernel interaction you were thinking of?


I suppose that's what I get for skimming; this paper is a bit weird. I thought they were going to have the kernel check the tags of userspace pointers that are passed to it, because it includes this line regarding ASan:

    Does not currently detect when the kernel accesses invalid user-space memory
But after reading the paper again it looks like they never bring this up again so ¯\_(ツ)_/¯


Yeah they aren't clear about it, but I guess the idea is in a hardware supported model with memory tagging you'd presumably get kernel (and more generally "cross process") checking kind of for free since the checks would be pervasive and hardware enforced.

With the software only solution, it's essentially just a variation on ASAN: with the trick that you use some pointer bits to additionally detect some cases that ASAN can't (because ASAN doesn't care about the origin/formation of the pointer: only whether the access is allowed).



