Hacker News
You Can't Always Hash Pointers in C (nullprogram.com)
155 points by ingve on May 31, 2016 | 96 comments



> In other words, even in a conforming implementation, the same pointer might cast to two different integer values.

If you're porting to that type of system, you will be keenly aware of this. The problem can be solved.

On the Intel 8086, every address has 4096 ways of referring to it by some combination of segment:offset. A segment is 64 kilobytes wide, and the segment: part of the address is "paragraph" (16 byte block) aligned. 65536/16 = 4096. For instance address 0x1234 can be referenced as 0123:0004, or 0122:0014 and so on.

However, there is a unique underlying address.

A pointer value can be normalized (using nonportable code, of course) to use, say, the largest possible segment value and the smallest possible offset. The resulting value can then be hashed.
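That normalization can be sketched in portable C by simulating segment:offset pairs (real code would use compiler-specific far pointers; the struct and names here are purely illustrative):

```c
#include <stdint.h>

/* Normal form chosen here: the largest possible segment value,
   so the offset is always below 16. */
typedef struct { uint16_t seg, off; } farptr;

static uint32_t physical(farptr p) {
    /* 8086 address bus: (seg << 4) + off, wrapped to 20 bits */
    return (((uint32_t)p.seg << 4) + p.off) & 0xFFFFF;
}

static farptr normalize(farptr p) {
    uint32_t phys = physical(p);
    farptr n = { (uint16_t)(phys >> 4), (uint16_t)(phys & 0xF) };
    return n;
}
```

After normalization, two aliases of the same physical address compare (and hash) identically.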

Another way to deal with it on 8086 might be to use a "far" pointer, and calculate its offset from a "far" null pointer:

   (char far *) ptr - (char far *) 0;
This should produce the physical address as an integer.

So that is to say, the concept of hashing an address is sound; only using "pointer" to mean "address" is not maximally portable. An implementation of pointer hashing can be ported to such architectures by some nonportable code, if necessary.

If you care about such portability, wrap the "pointer to integer address" logic in a function and implement that function as necessary.


Some Motorola 68000 machines automatically masked out some of the most significant bits of any pointer. This means that two pointers to the same object can have different representations. That did happen in practice, as programmers used the highest bits to stuff metadata and relied on the automatic masking. That made things interesting when the platform wanted to increase the address space. And preventing this sort of backward compatibility issue is the reason why AMD64 will trap if the unused bits of the address space are not all 1 or 0.

Bottom line, the pointer to integer conversion should be specified by your platform ABI. Read it, then either refuse to support your program on platforms with insane ABIs or have a fallback path for them.


You can do this on most microprocessors just by leaving the address lines uninterpreted. On the 68000 the top 8 lines were flapping in the breeze, so you could use those bits for anything and still dereference the same object.

The first MacOS used those bits for some things; notably desk accessories. It took them years to kill that practice so that later virtual memory versions of the Mac would work. IBM made a similar mistake in OS/360 (24-bit address space, 32-bit pointer values... that was very bad).


Wasn't this the case with 68000 based designs in general? The CPU only had 24 address lines. I think the 68020 was the first with a full 32-bit address bus.

And you're right, people absolutely did use it.

On the Amiga I believe that Amiga BASIC (developed by Microsoft) was perhaps the most prominent example of Amiga applications that allegedly did this (only source I've found for this is a comment by a former Commodore employee - Dr Peter Kittel), amongst a massive amount of other problems. People did get it running on 68020+ systems, with a combination of patches and disabling fast ram, but it was pretty much forgotten from AmigaOS 2.x onwards anyway.


If your program targets a 68000, and uses tags in the upper bits of pointers, and that program also hashes pointers, then of course you're going to have to reconcile these two facts: your hashing function will have to strip out the pointer tags.

Or perhaps not; it depends on whether the tags are immutable attributes. For instance if they indicate type, and type is immutable, then they can perhaps just be mixed into the hash. Under this design, the only way two pointers to the same thing can have a different hash is if you have a use-after-free bug: you're hashing a pointer to something that was de-allocated and re-allocated as a new kind of object. Two objects which occupy the same address at different times don't have to have the same hash, even if it is a pointer hash.

Looked at in another way, those tag bits, if they are immutable, effectively extend the address space. An object with tag 0101 and another one with 0110, all other bits equal, can be considered to be in different spaces (with the constraint that since they collide to the same physical address, they cannot exist simultaneously).
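A minimal sketch of that design, assuming a hypothetical layout where the top 8 bits of a 32-bit word hold an immutable type tag:

```c
#include <stdint.h>

/* Since no two live objects share a (tag, address) pair, the whole
   tagged word can be the hash key; no masking needed. */
static uint32_t hash_tagged(uint32_t tagged_ptr) {
    uint32_t h = tagged_ptr;
    h ^= h >> 16;
    h *= 0x45d9f3bu;   /* odd constant: every step stays invertible */
    h ^= h >> 16;
    return h;
}
```

Because each step is invertible, equal tagged words always hash equal, and two words differing only in the tag hash differently.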


I'm not quite sure what you're replying to. I was not talking about hashing pointers at all. Bare, raw, pointers on the M68000 are 32 bits, but the CPU only has 24 address lines, and the top 8 bits are disregarded. But starting with the 68020, the CPU has 32 address lines and the top 8 bits are not disregarded.

The point is/was that relying on bits that are unused today is risky because it has historically tended to ensure your software breaks when the next CPU generation needs more of the possible address space.

Note especially that a lot of the apps that used the top 8 bits did not use them for type tags that might imply safe usage, but often used them to store unrelated data.

E.g. let's say you had a data structure with a number of pointers and a number of flags. You might very well decide to pack the structure so that the flags overlapped the top 8 bits of the pointers.


Yes, lots of people used this on Atari ST and Amiga (Tempus editor on Atari comes to mind, which used the highest byte of address registers as additional faux registers).

Same tricks are used even today. E.g., have a look at Facebook's DiscriminatedPtr from the folly library, which uses the top 16 bits of a 64-bit pointer (otherwise unused) for type tagging.
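Roughly, the technique looks like this in C (folly's DiscriminatedPtr is C++ and its real API differs; the names below are made up, and the scheme is only safe while bits 48..63 of user pointers are genuinely unused):

```c
#include <stdint.h>

#define LOW48 0x0000FFFFFFFFFFFFull

/* pack a pointer plus a 16-bit tag into one 64-bit word */
static uint64_t pack(void *p, uint16_t tag) {
    return ((uint64_t)(uintptr_t)p & LOW48) | ((uint64_t)tag << 48);
}

static void *unpack_ptr(uint64_t v) {
    return (void *)(uintptr_t)(v & LOW48);
}

static uint16_t unpack_tag(uint64_t v) {
    return (uint16_t)(v >> 48);
}
```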


There is a long tradition of embedding tags in pointers, for all kinds of reasons (type tags, pointer/int discriminants, ABA counters, etc.). That's perfectly fine and it is one of the reasons intptr_t exists.

The problem is when the architecture lets you get away with explicitly masking out the tag bits (i.e. a tagged pointer is a valid pointer) which makes it hard to evolve the architecture.


"Wasn't this the case with 68000 based designs in general?"

Possibly, I wasn't sure so I hedged :).

And yes, I heard that the issue was mostly encountered on the Amiga.


In early Mac OSs, the memory management designs used the top byte for flags as those units didn't have a hardware MMU.

System 7 was the first Mac OS version to support 32-bit addressing.


> pointer to integer conversion should be specified by your platform ABI

heh, I just published a post on this that I worked on over the weekend: https://nickdesaulniers.github.io/blog/2016/05/30/data-model...


> preventing this sort of backward compatibility issue is the reason why AMD64 will trap if the unused bits of the address space are not all 1 or 0.

This helps debugging, but it doesn't help future extensions of AMD64. Programs exist that store data in bits 48..63 of a pointer on AMD64, and expect to retrieve a pointer by doing "x << 16 >> 16" on the integer data. Increasing the address space is a pain no matter what.
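Concretely, the restore step such programs rely on is a sign-extension of bit 47 (the cast to a signed type and the right shift below are implementation-defined in C, but behave arithmetically on mainstream compilers, which is exactly what those programs bet on):

```c
#include <stdint.h>

/* Restore a canonical AMD64 address by copying bit 47 into bits 48..63,
   discarding whatever data was smuggled in the top 16 bits. */
static uint64_t restore_canonical(uint64_t tagged) {
    return (uint64_t)((int64_t)(tagged << 16) >> 16);
}
```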


To make such a program work it's sufficient to map it into the lowest quarter-petabyte of address space. I can imagine a flag for that in the executable if it ever becomes a problem.


> That did happen in practice as programmers used the highest bits to stuff metadata and relied on the automatic masking

The old Mac OS did this: https://en.wikipedia.org/wiki/Mac_OS_memory_management#32-bi...


I read about an early computer (JOHNNIAC) with the same issue. Its instructions had 12-bit addresses, but the original machine had only 10 address bits' worth of memory attached. Programmers often interleaved data through the upper two bits of the address, which wasn't a problem until the machine's memory capacity was quadrupled later in its life.


This ( using unused bits of addresses ) was a terrible idea at the time and it's a terrible idea still.


This reminds me of a story.

When Stepanov was implementing the STL, he needed a strict weak ordering among all pointers to implement things like std::map.

Instead of wasting time arguing within the committee on whether operator< should have the required semantics, he simply added std::less as a primitive and default comparison for std::map. It is specified to call operator< for every T, except that for pointers it relies on unspecified compiler support for doing the right thing.

On pretty much all implementations it simply calls operator< even for pointers.


Oh, so then in C++ you can check if pointer-to-T p is within T a[sz]; by doing (a == p || std::less<T*>()(a, p)) && std::less<T*>()(p, a + sz)


Possibly not. The standard guarantees that less yields a total order for pointers. But the ordering itself is unspecified, and not even guaranteed to be consistent with operator< for pointers to elements of the same array.


I thought it was UB to use < on pointers outside of the same block of memory. e.g. your function is fine if the answer is true, but UB if the answer is false.

I could be misremembering though, but I do know I'd always cast to uintptr_t for this.
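For reference, the uintptr_t version of the test looks something like this; it dodges the relational-operator UB, but the result is only meaningful on implementations with a flat address space:

```c
#include <stdint.h>
#include <stddef.h>

/* 1 if p falls within the n-element array starting at start,
   judged purely by integer address comparison */
static int in_array(const void *p, const void *start, size_t n, size_t elem) {
    uintptr_t up = (uintptr_t)p, lo = (uintptr_t)start;
    return up >= lo && up < lo + n * elem;
}
```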


Another platform that doesn't have straightforward pointers: x86. 16-bit mode. Because of the segmentation system, if you wanted a fully general pointer to anywhere in the address space ("far") it had to be split into two registers. Greater speed could be achieved with "near" pointers, but only within a 64k segment. You could also very easily have non-identical pointers that pointed to the same memory location.

Fragments of this mess survive in the Windows API, like Shelley's "vast and trunkless legs of stone": https://blogs.msdn.microsoft.com/oldnewthing/20031125-00/?p=...


I think the 8051 is more interesting - it's a Harvard architecture with at least 3 separate address spaces:

http://www.keil.com/support/man/docs/c51/c51_le_ptrs.htm

http://www.keil.com/support/man/docs/c51/c51_le_ptrconversio...

Just like x86, it's likely that everyone interacts daily with a system containing at least 1 8051 core.


As someone who did a lot of embedded development, I wouldn't exactly call it "more interesting". More like "that shit needs to die already".

The biggest problem with those is that modern compilers don't support it - well, modern open-source compilers. IAR will happily sell you a license of their IDE for 2.4k$/seat. There is sdcc, but it's no comparison to a modern gcc or the commercial offerings.

Unfortunately you're right though, 8051 processors are cheap and abundant. Chip makers use it as the go-to architecture for any simple embedded processing requirements. It's slowly changing though, but ARM licensing costs are still much higher. Let's hope that in a decade or so they'll use RISC-V instead.


RISC-V is indeed what I thought, but hasn't MIPS also been free (enough) and open for at least a decade? You'd think that on licencing alone that'd be enough for a replacement in new products.

Edit: I am mistaken; it seems that though MIPS is used in all sorts of things that I'd expect wanted the cheapest possible CPU core, it has only become free for academic use. I wonder how common that misconception is for those not active in the space?


Yes, MIPS is a cheap candidate for when you need a bit more computing power than just a microcontroller.

Home routers usually have MIPS CPUs and run Linux on top of it. This wouldn't be possible with the 8051.


Yes, there is nothing more fun than moving a large 8051 embedded project from Keil to Tasking.


So, does the boot sector load to 0000:7C00 or to 07C0:0000 or to 03A0:4200, or what? :)


http://wiki.osdev.org/MBR_%28x86%29

"This is usually 0x0000:0x7c00 (CS = 0, offset address 0x7c00). However, some BIOSes load to 0x7c0:0x0000 (CS = 0x07c0, offset address 0) -- which resolves to the same physical address, but can cause problems."


Yeah, I encountered the problem when writing a bootloader, that's why I brought it up! :)

Don't rely on the BIOS' values. You can for example use a far jump to reset CS:IP if you rely on them having specific values.

[Edit: Or, maybe better, just don't rely on them.]


If you know that the strings are stored in the same table, couldn't you subtract from each pointer a pointer to the beginning of the table and hash the resulting offsets?


That's probably true, however there's a small catch.

Looking at the standard (no difference between C99/C11), chapter 6.5.6 ("additive operators"), point 9 defines the behavior of pointer subtraction:

  When two pointers are subtracted, both shall point to elements of the same array object,
  or one past the last element of the array object; the result is the difference of the
  subscripts of the two array elements.
However reading further:

  The size of the result is implementation-defined,
  and its type (a signed integer type) is ptrdiff_t defined in the <stddef.h> header.
  If the result is not representable in an object of that type, the behavior is undefined.
Unlike the (u)intptr_t types, ptrdiff_t is not optional AFAIK. The way I understand it, you could still invoke UB if the implementation uses a very silly type for ptrdiff_t, though I didn't do further research on it.

Still, using pointer difference makes this a lot safer and more portable. Especially on platforms with unusual memory models, where a pointer might not be just a single integer (e.g. segmented memory on x86). Also the operation is Theta(1), so it doesn't impact performance much.
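A sketch of the offset-hashing idea, assuming every string lives inside one table array (the mixing constant is borrowed from common 64-bit hash finalizers):

```c
#include <stddef.h>
#include <stdint.h>

/* Hash a string's offset within its table rather than its pointer;
   s - table is well defined because both point into the same array. */
static uint64_t hash_offset(const char *table, const char *s) {
    ptrdiff_t off = s - table;
    uint64_t h = (uint64_t)off;
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdull;
    h ^= h >> 33;
    return h;
}
```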


Yes.


The post mentions only the C standard; I wonder what POSIX says about it.

And there are also the platform ABIs, which specify how a pointer is passed between functions. If you're in a platform in which the ABI doesn't allow arbitrarily setting unused bits, even a "security-conscious implementation" will have to leave them alone, since a pointer can be passed to a function written in a different language. This also means that the "map the same object at multiple virtual addresses" trick, in which the implementation flips bits within the pointer knowing there is always a valid alias mapping to it, is broken: suppose two pointers to within the same object are passed (perhaps at separate times) to code written in another language. From the point of view of the other language, they might not be pointing to the same object, even if they should (for instance an array and an element within it, the function might want to compute the offset).

I also wonder how well does the "map the same object at multiple virtual addresses" trick work in architectures with VIVT caches, where both the index and the tag are virtual addresses. In these architectures, if you write through one mapping and want to access the same memory through another mapping, you have to flush the cache, otherwise you will have hard-to-debug problems if both mappings are in the cache at the same time.


"I also wonder how well does the "map the same object at multiple virtual addresses" trick work in architectures with VIVT caches"

Not well at all, unless the OS goes out of the way to do cache coloring. There is a reason that VIVT are today considered suboptimal.


Another platform: IBM zSeries has 24-bit and 31-bit addressing modes, so that means an address stored in a 32-bit register has some of the highest bits ignored when dereferenced. Especially in the past, people stored useful information into those bits, so again, the two pointers are not necessarily equal even though they can point to the same address.


Just like the 48-bit addressing mode of the x86-64 chips most people reading this are using, along with the storage of useful information in bits 48-63 (JS engines!).


If they store something there they have to mask it out before dereferencing the pointer. x86-64 doesn't ignore the topmost bits, if you set them wrong the CPU will throw an exception.


A big problem for the Mill people has been getting LLVM to understand that pointers on the Mill do not behave like normal integers.


Video about using clang/LLVM for the Mill here: https://www.youtube.com/watch?v=QyzlEYspqTI


Interesting. What if I use memcpy and memcmp? As in,

    void foo(void *a, void *b) {
      char ca[sizeof(void*)];
      char cb[sizeof(void*)];
      memcpy(ca, &a, sizeof(void*));
      memcpy(cb, &b, sizeof(void*));
      assert((a == b) == (memcmp(ca, cb, sizeof(void*)) == 0));
    }


That doesn't help. Your assert can be false. Consider, for example, a 32-bit architecture which ignores the top 8 bits of the pointer (as was the case in the old 68000 Macs, as discussed elsewhere in these comments). Consider two pointers which differ only in a top bit, like 0x1000f000 and 0x0000f000. They will compare equal, but the underlying bytes extracted with your memmove call will differ.

And this isn't just a historical curiosity. ARM64 can optionally ignore the top eight bits of its 64-bit pointers, for example.


Very insightful. I never fully understood why pointers can't always be cast to integers (not even to uintptr_t, intptr_t). What's the guidance here to avoid the casting?


You'll need custom logic with in-depth knowledge of the mapping. In many cases the idiom mentioned by "kazinator" above will often work - but not always:

#define ptrToInt(x) ((char *)(x) - (char *)0)

... and even that may require local tweaking.


You also can't test if a pointer points to an element in an array without undefined behavior.

C++14 5.6/6, on subtracting pointers: "Unless both pointers point to elements of the same array object or one past the last element of the array object, the behavior is undefined."

http://stackoverflow.com/questions/31774683/is-pointer-compa...

I wish "undefined behavior" wasn't a thing in C/C++. It's not worth it. :(


I think you can do that test by comparing pointers, you just can't subtract them. (The SO answer you link to also points out that comparing and subtracting are different, but I'm not sure that what I said follows.)

In practice, almost all C and C++ programs depend on both implementation-defined and undefined behavior, and, modulo security implications, that's fine as long as you test the program properly after compiling with a new compiler.


I'd guess that PAE on x86 systems might really hurt with this. Basically any system where the memory model is one of segment + offset will run into trouble with pointer comparisons. Ever since the 8086 I've assumed that pointer math is tricky and non portable (see http://catb.org/jargon/html/V/vaxocentrism.html for the 80s version of the same problem)


PAE means you have longer physical addresses, but pointers are always virtual addresses. That is, the difference is only visible in the layout of the page tables.

Segment+offset is something else, but even on 32-bit x86 the segments were almost always set to base 0, and AMD64 dropped most of the segment mechanism. The exception on both is using segments to access thread-local data, but even then it's used just as an offset into the same flat address space.


PAE means pointers that need to hold at least 36 bits, but in practice I don't think anyone, OS developers included, would ever treat the whole PAE address space as one entity and represent addresses with a single pointer.


You can test equality with a given element. But if you want to test "does this pointer point to any element of that array" you can't. As this test would be:

bool b = ptr - array_start < array_end - array_start

But that first subtraction is undefined if ptr does not point to an element between start and end.

Not that this is exactly a common operation or anything. But the fact that a seemingly simple and obvious test is undefined behavior is crazy imo.


What about

    array_start <= ptr && ptr < array_end // or `<=` if the upper bound is inclusive
Edit: well, for one it may not be thread safe, since it's not atomic, but neither is your example AFAIK.


Evaluating that expression is undefined behavior if ptr is outside the +1 bounds of the array. Check out the enumerated defined behavior in the (old C99, working draft) standard - http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf

6.5.8 Relational operators

...and pointers to array elements with larger subscript values compare greater than pointers to elements of the same array with lower subscript values...

... If the expression P points to an element of an array object and the expression Q points to the last element of the same array object, the pointer expression Q+1 compares greater than P.

In all other cases, the behavior is undefined.


Whelp... Is there a non undefined way to do bounds checking in C?


Avoid maintaining pointers into arrays and instead maintain indices. If you maintain them as unsigned then you only need one range check (although you then have to avoid underflow if you want to stay in the realm of defined behaviour).
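A minimal sketch of the index-based approach, with hypothetical names:

```c
#include <stddef.h>

/* With an unsigned index, the single comparison i < LEN also rejects
   would-be negative values, which wrap to huge numbers well above LEN. */
enum { LEN = 8 };
static int arr[LEN];

static int *at(size_t i) {
    return i < LEN ? &arr[i] : NULL;
}
```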

One of the downsides to C is that K&R and most source code out there will encourage you to do pointer arithmetic freely, but the standard points out all kinds of surprises where it is undefined. So you get technically non-conformant code in widespread use.


Unsigned overflow/underflow is defined behavior, unlike signed overflow/underflow. Underflow will be caught by the upper range check.


I thought unsigned underflow is defined as wrapping?


I'm just a C newbie, so I can't give you a definitive answer. Here's two I came up with:

If you know the bounds of the array, then you could enumerate it and check if the pointers to any of its elements are equal to your pointer.

Alternatively you could store references to array elements as a structure of a start pointer + offset, and arrays as a pair of start and end pointers. Then checking if a reference points to part of the array would be a two-step operation - check that the starts match and that your offset is within the end - start bounds.


Your edit is pretty much irrelevant - it could be applied to literally any example code anywhere. Unless mentioned otherwise you should assume that all variables are thread-local.


It's been a long time since I wrote C for a living, but what would you expect sensible behaviour to be for the expression ptr - array_start if ptr isn't in the array?


Do the subtraction and divide by the size of the pointed-to type, like the expression says? Why would you expect the calculation to be any different if ptr was inside, one element past, two elements past, or 1000 elements away from the ends of the array?

I can understand it possibly being a problem if the two pointers being subtracted denote completely different address spaces (which is definitely possible on some architectures, particularly Harvard MCUs; the one that comes to mind immediately is the 8051 which has IRAM, XRAM, and PMEM), and even there you can "linearise" address spaces so one comes after another and subtraction still works despite possibly giving a meaningless result, but on any architecture where all pointers are in the same address space, the sensible behaviour is the most consistent and straightforward one.


The idea here is that if you have segment:offset pointers, like x86 real mode, then you (as a compiler writer) don't have to worry about the segment part at all - you can just subtract the offsets and shift, because pointers within the same object will have the same segment part.


Just nitpicking, but in real mode it is at least conceivable to implement arbitrary pointer arithmetic (at some significant performance hit) because segments are spaced evenly every 16 bytes.

The real killer is arbitrary segmentation, where software can say this segment starts here and that segment starts there, like, say, the x86 protected mode. Is the OS going to expose absolute segment addresses to the C language to allow some "defensive programming" like checking if a pointer really points to the array you think it does? Hell no, you write your damn code correctly ;)


You expect a pointer to represent a location in linear memory, so you expect ptr - array_start to be a negative number if ptr is before the array, and a number larger than the array size if it's after the array.


> that's fine as long as you test the program properly after compiling with a new compiler.

Which, as 40 years of experience prove, doesn't happen as much as it should.


> I wish "undefined behavior" wasn't a thing in C/C++. It's not worth it. :(

What do you suggest instead? Run-time checks?


Have a proper defined behavior like most sane languages.

The only reason UB exists is that when ANSI C came to be, the compiler writers and hardware vendors that provided C-like compilers didn't want to give up on their semantics.

So in order to please everyone, all the semantics that could not be defined on the standard became UB.


I don't think UB is bad. It's a reasonable thing to do when someone invokes an operation with bad input.

For instance, what should `sqrt(-1)` be? Or `head([])`?

You can define all behavior and say "bad input will throw an exception" or "bad input will abort the entire program" but who really cares what it does since it's a malformed program in the first place? Just say it's undefined and it's up to the programmer to make sure it never happens. Otherwise you're littering your code with countless checks for things that should never happen in a properly coded program.

Programmer error is distinct from operational error. There is no sane thing to do in a program error. Constantly checking for programmer error at runtime is wasteful.


That is untrue. UB exists because C is supposed to run on so many platforms there is no way to have a defined behaviour for all of them in that standard.


You are confusing undefined behavior with implementation defined behavior.


More importantly, C is not only supposed to run on many platforms (many programming languages do that), but also provide very low-level control on wide variety of platforms.


Which people keep forgetting is only available as implementation defined extensions, as such it is no longer ANSI C and any language compiler is free to have extensions.


This is why implementation defined behavior exists.

Undefined behavior though?


I was already coding before C had any meaning outside UNIX.


You can do it by converting pointers to integers. No UB, but of course it is implementation defined.


UB is unfortunately inherent in the mission statement of 'C'/C++.


Or rather I wish the standard did not bend over backwards to allow every kind of aggressive optimisation that some compiler writer claims will increase speed of unoptimized code by 0.5%.


This one is about portability, not optimizations. There are architectures where a pointer isn't a simple int, due to things like segmentation, tagging or even possibly something as stupid as lack of a sufficiently wide integer type.

Heck, there were even C implementations for 36-bit machines, which had 36-bit registers and could only load/store 36-bit-sized and 36-bit-aligned chunks of memory. They packed 4 chars into such word (leaving 4 bits unused) and I don't know how they implemented char* but it must have been something totally mad and not reasonably castable to int or even int*.


DEC's 36-bit PDP-10 and -20 had hardware support for pointers-to-byte: A pointer had lower 18 bits for the (word) address, while the upper 18 bits had room for the bit-offset within the addressed word, and the bit-length of the "byte" being used. So, if you were using 8-bit characters, a pointer to the beginning of a string would roughly look like (0,8,addr). A special "load and increment" instruction would fetch that first character being pointed to, and increment the pointer to (8,8,addr). So, c=*ptr++ in one instruction! After the fourth fetch, the hardware would know to increment from (24,8,addr) to (0,8,addr+1). And, best of all, if you were using plain old ASCII (and assembly rather than C), as was common in the day, you'd use 7-bit chars, packing 5 into each 36-bit word. Or six 6-bit chars if you were only using upper case, as in file names.

So, casting one of these pointers to int the most natural way (leaving the bits alone) would not give you an easily subtractable representation.


I don't know how any 36-bit machine implemented C's char pointers, and I can imagine all kinds of other grief stemming from the ones'-complement arithmetic on the 36-bitters with which I'm familiar, but I can tell you how we implemented char pointers on the 64-bit word-addressable Cray-1, -2, and their descendants. The 3-bit byte offset (big-endian, arbitrarily) of a linear byte address was kept in the most significant bits of a 64-bit word, and the 24-bit or 32-bit regular word address was in the least-significant bits. To do pointer arithmetic, code would have to do circular shifts (more accurately, a shift with fill from the same register) to get a linear address and to put it back into the canonical form that I described.

I don't recall whether a cast to int was raw or converted to linear.


> but it must have been something totally mad and not reasonably castable to int or even int*

As long as you can impose an ordering on the elements in memory and thus derive a number for each one, it's possible to define the straightforward representation of a pointer. 4 chars per word is an easy case since you can just let char* be a number with the lowest 2 bits indicating the index within a single word, and the rest of the bits are the word-address. int* can be the same number with the lowest 2 bits always 0. If word addresses are smaller than 36 bits, then values with the higher bits set are invalid, and if they're larger, then use multiple words to hold them.

Working with chars on such a machine is going to involve a lot of shifting and otherwise data movement, but there you go, a way to implement C pointers on such a machine.
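That scheme can be simulated in ordinary C; the macros below are a hypothetical illustration, with the low 2 bits as the byte index and the rest as the word address:

```c
#include <stdint.h>

/* char* = (word_address << 2) | byte_index, 4 chars per 36-bit word */
#define CHARPTR(word, idx)  (((uint32_t)(word) << 2) | ((uint32_t)(idx) & 3))
#define WORD_OF(cp)         ((cp) >> 2)
#define BYTE_OF(cp)         ((cp) & 3)
```

A nice property of this layout is that plain integer increment steps through the chars, carrying into the word address after the fourth one.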


> int* can be the same number with the lowest 2 bits always 0.

And now you have to right-shift int pointers on every dereference for some dubious benefit. It's easier and more efficient to make pointer arithmetic and casting painful.


No, you left-shift them on every cast.

And if someone type-puns using a union, they get a thing four times smaller than they might have expected. Still, it is totally well behaved in a hash.


Isn't that the point of C/C++? "Code should run as fast as the hardware allows" and "We allow you to do crazy things that probably won't be portable, if that's what you want".

I wonder, if somebody created a language with C like semantics, like pointer arithmetic, but with a layer in between to ensure consistency and portability.. Would people want to use that, or would the semantics be considered less interesting, when they don't map directly to hardware?


Corresponding directly to the hardware makes sense, sure. But what modern GCC (especially) does goes far beyond that. The intent of signed integer overflow being undefined is that it will be twos complement on a twos complement machine, not that I want half my code to not run if I had an overflow somewhere.


If that had been the intent, it would have been specified as implementation-defined, not undefined.

More likely, the intent was to also accommodate implementations that trapped on overflow.


Implementation-defined includes trapping on overflow.


"Evaluates to an implementation-defined value" would not include that. I'm not sure if any of the usual definitions in the standard includes trapping and doesn't include arbitrary behaviour.


Well, there's kind of one place: when a value is converted to a signed type and it's out of range for that type, the standard says "either the result is implementation-defined or an implementation-defined signal is raised."

However, integer overflow within an expression is trickier than this to define, because you want to preserve the compiler's ability to reorder sub-expressions that are mathematically equivalent when there's no overflow, without forcing the compiler to prove that there won't be overflow. The same is true of CSE and hoisting expressions out of loops.

For example, if you have a machine that traps on overflow, and thus under this hypothetical standard your implementation defined signed overflow to trap, you can't simplify (a + 2 - 1) to (a + 1), because the first might trap when the second does not, and your implementation has said that the trap will happen. It also means that you can't hoist an expression out of a loop if that expression might overflow, because the trap has to happen at the point where it would happen on the C virtual machine, not later after other side-effects have become visible.

This is why trapping implementations imply "undefined behaviour" - it's because defining to trap severely restricts the compiler's freedom under the as-if rule to reorder and simplify expressions.


Sure, but the standard need not say "Evaluates to an implementation-defined value". There are other wordings which use "implementation-defined" and would support trapping on overflow.


Can you give an example? I struggle to think of anywhere the standard uses a wording that would fit here.


Why would it have to be wording already in the standard? Does each phrasing in the standard have to appear more than once?


Adding a new phrase with a new meaning would add a substantial amount of complexity to what's already a very complex standard.


I disagree. There's nothing difficult to understand about new phrases with new meanings in general.

Do you have a specific phrase which you can show to be sufficiently difficult to understand by a group of people, and which you don't think could be rewritten to retain the same meaning but be easier to understand?

If you do, let me know, and I will try to simplify it for you and help you understand it.

If you don't, then there isn't much to say. Your current argument ("phrases could be written in a difficult to understand way") is too general to be useful.


I suppose it's less the phrase itself than the concept. What would be a way to say "integer overflow behaviour may reflect the underlying machine, but must be sensible", rigorously enough for the C standard? We've established that "undefined behaviour" is too broad, and "implementation-defined value" is too narrow.


Why "must be sensible"? Why not just say "signed integer overflow is implementation-defined" and in a footnote say "for example, it may wrap, trap, saturate, or do anything else documented by your implementation"?

I'm not seeing the big deal here.


> Why "must be sensible"? Why not just say "signed integer overflow is implementation-defined" and in a footnote say "for example, it may wrap, trap, saturate, or do anything else documented by your implementation"?

I don't think that actually defines anything - I don't think GCC would interpret that as anything other than undefined behaviour.


If that was the intent, then overflow should have been classified as enumerated implementation-defined behavior, just like the bit representation of an integer type which already deals with those variations.



