What’s the difference between an integer and a pointer? (regehr.org)
170 points by luu on Sept 23, 2018 | 82 comments



The Linux kernel uses a macro called RELOC_HIDE to stop gcc from seeing through the conversion from pointer to integer in these cases: https://elixir.bootlin.com/linux/v4.18/source/include/linux/...

Edit: a good explanation of that macro is at http://lists.linuxcoding.com/kernel/2006-q3/msg17979.html :

"The reason for it is that gcc assumes that if you add something on to the address of a symbol, the resulting address is still inside the bounds of the symbol, and do optimizations based on that. The RELOC_HIDE macro is designed to prevent gcc knowing that the resulting pointer is obtained by adding an offset to the address of a symbol. As far as gcc knows, the resulting pointer could point to anything."


> gcc assumes that if you add something on to the address of a symbol, the resulting address is still inside the bounds of the symbol

How can one C object span two symbols in the first place for this assumption to be invalid?


> How can one C object span two symbols in the first place for this assumption to be invalid?

Because Linux isn’t written in just C. It’s written in C, assembler, and linker script (.lds). In both assembly and linker script, one can lay out multiple symbols with a defined relationship. Unfortunately, C doesn’t have a way to say “give me a pointer to the object 10 bytes past the address p. Yes, I know it’s valid and that it is not the same object as p.”
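A small sketch of what that looks like from the C side (the symbol names match ones the kernel's linker script really defines, though the details here are simplified):

    #include <stddef.h>

    /* The linker script places __init_begin and __init_end around the
       .init sections, so the kernel "knows" they delimit one
       contiguous region. As far as the C standard is concerned these
       are two unrelated objects, and even subtracting them is
       undefined. */
    extern char __init_begin[], __init_end[];

    static size_t init_region_size(void)
    {
        return __init_end - __init_begin;
    }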


I believe for most, if not all, architectures the synchronization and memory barrier primitives cannot be written in standard C.


That has not been true since C11.
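C11's <stdatomic.h> covers the common cases portably; a minimal sketch of a release/acquire pair:

    #include <stdatomic.h>

    static atomic_int ready;
    static int data;

    void publish(void)
    {
        data = 42;
        /* Release: all prior writes become visible before the flag. */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consume(void)
    {
        /* Acquire: once we observe the flag, we also see data == 42. */
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;
        return data;
    }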


Hm? The point is that compilers are allowed to assume that a pointer to x can never be formed by offsetting a pointer to y (except if they are subobjects in the same array). But of course you’re not prevented from trying, either on purpose or accidentally; it just results in undefined behavior.


> it just results in undefined behavior.

Right - I'm saying that if they need this macro then their program is not a defined C program in the first place. They should fix that, rather than trying to use a macro to work around the problem. The macro may break in the future if GCC adds other optimisations.

They call it a 'miscompilation' - how can you 'miscompile' something which is undefined!


Being pragmatic and in the OS kernel business, they basically disagree with the standard here and say that tricks like that are necessary in kernel development (even if they might not be needed in userland). Those parts of the kernel intrinsically cannot be portable anyway. From that perspective GCC is “wrong” if it doesn’t at least provide an “escape hatch” to disable optimizations in cases where something that is UB per the standard is in fact perfectly well-defined on that particular architecture.


Insisting that the environment matches the tooling seems backwards. The tooling needs to match the needs of the environment.

That GCC may break your program in the future is one of the ways that people get locked into specific versions.


You're making it sound like the C language has changed - it hasn't. I think this operation has been undefined in C as long as Linux has existed. They should not have written that code in the first place - it was never correct.


There seem to be two possibilities here:

1) You have rightly pointed out that all the authors of the Linux kernel are total idiots who don't understand C.

2) It turns out writing a kernel isn't possible in pure C, and the kernel requires some carefully designed and considered methods of accessing all of raw memory, in various unusual ways.


> It turns out writing a kernel isn't possible in pure C

This one is the case. Writing a kernel is not possible in pure C. I'm asking - given this, why are they trying to do that? Write the parts which aren't expressible in C using assembly, where they can get the semantics they want.


The reason is that by restricting yourself to gcc and clang, and agreeing how to define some undefined parts of C with those compiler authors, the amount of assembly you need drops massively, which is particularly useful when you are trying to support as many architectures as Linux.

While technically this means your kernel is "gcc C" rather than "true C", this doesn't cause a practical problem. I know gcc, clang and Visual Studio's C++ standard libraries require a flat address space and 8-bit bytes, for example, but I've never seen anyone bothered by this.


> the amount of assembly you need drops massively

How about this as a compromise: use GCC, with all its flags, to generate the assembly, and just use that? Skip the messy transition step.


And what would this change, if authors are working off of the GCC-C files? If you're after a system by which you don't need GCC to run Linux, you may be interested in a concept known as "binaries", which sounds remarkably close to what you're trying to invent here.


They're not working off the C files. They're using the assembly files when performing the actual compilation going forward; using GCC is just a bootstrap to get the necessary assembly churned out quickly.


But... the C files have changed hundreds of times, for tens of architectures, over tens of years. Who is going to want to edit 20 (or more) assembly files when a change is made? How do you add a new architecture? Do you have to keep the C files up to date and somehow check that the different assembly files don't drift out of sync?

Instead we could just mark some files as "These C files have to be run with GCC versions X to Y", problem solved. Then why not just compile everything with those versions of GCC?


> Instead we could just mark some files as "These C files have to be run with GCC versions X to Y", problem solved.

Is it solved? You'd still have to check the generated assembly of the entire file after every change you make, to make sure it didn't have some knock-on effect that optimised something differently, even if you kept the version of the compiler the same.

And is GCC deterministic? I don't know - do you? Did you know it has a flag `-frandom-seed`?


The idea there is that gcc provides various intrinsic pseudo-functions and features so that you can write the kernel in a "whatever gcc accepts as GNU C" dialect. Last time I looked, the only Linux platform that had significant assembler objects in the kernel was IA-64. And this approach is very common in the Unix world, with memory-mapped HW registers represented by C structs and so on (and in fact the ability to do this is often cited as the reason for various C++ mis-features).

[Edit: I am almost sure that the only part of Linux on i386 and amd64 that is written in assembly is the bootloader stub at the beginning of the "Linux kernel image" file format. And it is somewhat funny and relevant that this part depends on abusing gas to generate 16-bit code]


> this approach is very common in Unix world, with memory-mapped HW registers represented by C structs and so on

The "correct" way to do this is to write the bare minimum assembly needed to present this view to a programmer in C (so there's no need to worry about undefined behavior), then write standards compliant code from there. Writing all this code in C is just papering over the fact that what you're doing is architecture-specific and should be treated as such.


That is exactly my approach to this issue (and at least gcc seems to be clever enough to mostly optimize the abstraction layer away). On the other hand, you will very often see the direct approach in production code: structs mapped directly onto hardware, questionable uses of volatile to make the thing actually work, and so on.


It's not that they don't understand C, it's that they willfully ignore the C standard and have enough clout that compilers capitulate, adding flags to support their slightly modified, not-quite-C dialect. And so yes, you cannot write a kernel in pure C.


Do you really not grok the difference between writing platform-independent code and writing the platform itself? You don't write for an abstract machine as specified by the standard; you're writing for a very concrete real-world machine in order to implement a part of that abstract machine. Indeed, that's exactly what C was created for in the first place: writing an OS in something other than assembly!


At some level you have to eventually decide, "if it works it works."

Obviously at the application level it's frowned upon, but in the kernel? You do what you have to with the tools you have.

Think about Java. Yeah, sure, keep telling us sun.misc.Unsafe isn't legit. It's going to get used until that capability exists elsewhere.


> Obviously at the application level it's frowned upon, but in the kernel?

I'd rather my kernel not have a security hole in it because a compiler decided to optimize out undefined code: https://lwn.net/Articles/342330/
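The bug in that article, simplified from memory (the types here are stand-ins):

    struct sock { int flags; };
    struct tun_struct { struct sock *sk; };

    int tun_chr_poll(struct tun_struct *tun)
    {
        struct sock *sk = tun->sk;   /* UB if tun == NULL ...          */
        if (!tun)                    /* ... so gcc may treat this test */
            return -1;               /* as dead code and delete it     */
        return sk->flags;
    }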


And as the ongoing Linux Kernel Self-Protection Project proves, those tools are not enough.

I appreciate using software that is actually written with security first, performance second in mind.


They maybe should not have written the code in C at all, since a language that you need to hire a lawyer for is not a viable option. But everyone hoped that, with a given team, they could create something that would both meet the requirements and be future-proof. While what they did was not really legal, it is not practical to make it stop working now. If they had done it the right way, Linux would not exist, see? Lawyers don’t build anything.


Maybe they shouldn't have, but there was no realistic alternative. At the time, C compilers weren't as "smart" and the sort of trickery they're doing was a perfectly reasonable thing to try in C.

Today, if you were starting a kernel project from scratch, I'd think Rust would deserve a good, hard look.


> They call it a 'miscompilation' - how can you 'miscompile' something which is undefined!

It simply implies that they consider themselves, rather than the C standard, as the authority on how C should behave.


If I understand correctly what this is about, one example is buffer overflows, where you write to the first symbol but end up in the second:

    char symbolA[10];
    char symbolB[10];

    symbolA[15] = 0;  // undefined behaviour; may land in symbolB if it happens to be placed right after symbolA


This isn't one C object spanning two symbols - it's two C objects, and reading one object from a pointer to another object.


What does this have to do with "one C object span[ing] two symbols"? I'm not sure I get what you're asking


The problem that they are solving with their macro is that

> gcc assumes that if you add something on to the address of a symbol, the resulting address is still inside the bounds of the symbol

I'm asking - how can you write a valid C program which adds something on to the address of a valid symbol and get something outside the bounds of the symbol?


Something that comes up moderately often in low-level stuff is:

You have a bunch of objects, defined in different files, and no code is aware of what the complete list of objects is. You want to iterate over an array of all these objects. How to make the array, with minimal overhead?

A common technique is for each file defining one of these objects to tell the linker to put it in a special section. The result of this is that there's then an "array" of objects which, according to the C language, could be anywhere in memory, but because of how the linker has been told to lay them out, we know that actually they're all consecutive. So any code which wants to iterate over them can do so easily.

Except... this isn't actually a valid C program, and is actually undefined behaviour, so the compiler might "optimise" things in a way which breaks it. Hence the RELOC_HIDE macro, which obscures things from the compiler and prevents it from applying "optimisations" we don't want.
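A sketch of the technique, using GCC's section attribute and the __start_/__stop_ symbols that GNU ld automatically provides for any section whose name is a valid C identifier (the names here are made up):

    #include <stdio.h>

    struct driver { const char *name; };

    /* Each translation unit drops its entry into the "drivers"
       section; the linker concatenates all of them. */
    static const struct driver uart_driver
        __attribute__((section("drivers"), used)) = { "uart" };

    extern const struct driver __start_drivers[];
    extern const struct driver __stop_drivers[];

    static void list_drivers(void)
    {
        /* Walking from one object to its neighbours like this is
           exactly the undefined behaviour described above. */
        for (const struct driver *d = __start_drivers; d < __stop_drivers; d++)
            puts(d->name);
    }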


Ah, now I understand.

I am guessing here, but my guess is that they're using C but actually want to do something very low level. As in they believe (correctly?) that they know the exact memory layout relative to something.


Correct. Think control/data registers in some dedicated chip.


    char x[N], *p = x + sizeof x;


You've defined a pointer p, but you aren't actually using it, so what's the point of that?

Reading or writing through the pointer p will not be a valid C program.


The question appeared to be simply how one might obtain such an address.


I miss the good old times when a pointer was just a number, and if you did `*addr` you plainly read from that memory address. I'm not sure if the standard was ever that way, but compilers sure used to behave like that.

Maybe we can one day get a `--std=simple-world` flag that forgoes a few optimizations and makes UB code do the "obvious" thing.


Then all software would get slower for no reason other than making compiler nerds happy. C++ in particular heavily depends on optimizations like SROA, which you just killed in most cases (e.g. whenever you called a method on an object, the compiler would be forbidden from promoting its contents to registers because of the "this" pointer). That would be doing a huge disservice to users, virtually all of whom don't know or care one bit about what machine instruction the "*" operator in C++ corresponds to.


Well, not all software, only where you'd decide to use that feature. In non performance-critical code, I'd disable certain optimizations anyway to make sure that sanity checks do not get optimized away. (The Linux kernel had a few high-profile cases of this problem.)

One way to implement my idea would be as pragmas that tell the compiler "don't optimize away this null check", "assume this variable can alias", "just emit the obvious ASM for memory access" or "put memory barriers around this code".

I don't see how this kind of "stupid memory access" would make e.g. SROA impossible. If I just access the memory around "this", sure, I would not see the layout I'd naively expect. But that's not what I want anyway. I want to see the memory as it is actually laid out.


There are ways of informing the compiler of your intentions; using idioms such as memcpy, for example. The compiler will peer through the function call and generate the code you want.
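For example, the classic way to inspect a float's bit pattern without a strict-aliasing violation (a sketch):

    #include <stdint.h>
    #include <string.h>

    /* Copy the bytes instead of casting &f to uint32_t *. Compilers
       recognize this pattern and emit a single register move; no call
       to memcpy actually appears in the generated code. */
    uint32_t float_bits(float f)
    {
        uint32_t u;
        memcpy(&u, &f, sizeof u);
        return u;
    }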


It never was guaranteed by any standard, and in fact numerous old machines had weird pointer formats: segmented addressing, strange widths (36 bits), and tagged pointers come to mind.

It's more like we had this window of time when pointers were not weird and got too used to it.


Sure, but back then the standard was not relevant. You had your x86 computer, your compiler (MSVC, DJGPP, CodeWarrior, Borland...) and maybe a "how to write C/C++" book. You did have weird things like segmented addressing, but for most purposes pointers were just memory addresses, and you could use casts to reinterpret data as you wished. In fact, using bit-cast magic and doing pointer arithmetic was the preferred way of doing things.

It's strange that we put so much effort into learning all this arcane knowledge and now we're supposed to unlearn it...


When I was writing portable UNIX software in the late 90's, early 2000's, the standard was quite relevant.


I can’t tell for sure, since I was too young back then and didn’t write anything complex enough, but I suspect that compilers were not evil genies back then. It was probably later that gcc first used UB as an excuse for evil optimizations.

I also don’t understand the reason behind dragging all that legacy into current standards. Do real applications of it make up even 0.1% of all usage? You cannot even buy a chip that implements segments, tagged pointers, etc. At least they could make a special mode where you can span across symbols, whatever that means, or see a pointer as the CPU sees it.

All this could be solved with simple ptr_untag(p), ptr_span(expr) and similar constructs, but instead we have to resort to outsmarting the specific compiler’s logic or acquainting ourselves with complicated type systems. That went insane. Personally I just want my bytes in a linear address space and a way to tell which bytes point to which and which do not. I liked asm and then C, they were two of my first three languages, but what C has become today is just a horrible mess.


> You cannot even buy a chip that implements segments

Not only can you buy a chip that implements segments, the computer you wrote that statement on probably has such a chip.


It is ARM-based, so I believe it doesn’t. If you’re talking about x86, then the segment registers are there, but have not actually been used (except for the fs/gs utility cases) for almost a couple of decades, afaik. The segmented memory model is simply slow, cumbersome and unnecessary in the presence of a decent PMMU.

Edit: though strictly speaking I was obviously wrong on that, clarifications are welcome.


While it may be “cumbersome” from the programmer’s perspective, it’s certainly a lot faster than an MMU.

There’s no universe where traversing a page table (actually a tree) in memory is faster than an offset and a bounds check.


What are you talking about? On x86, page table entries have been cached in the TLB since their introduction. No MMU at all (80286) means that you're subject to fragmentation, and swapping segments is as expensive as a syscall, since you have to look up the descriptor through the GDTR. x86 segments are scarce, expensive and cumbersome resources invented to cover a bus width mismatch. Today it is not even an issue. Even the TSS is shared one per CPU, that's how useless and slow its hardware part is.

We could imagine more segment registers and bigger descriptor tables, but that would be just a poor man's manual TLB (manual always failed in CPUs). If you're concerned with constant checks, you may order yourself a CPU with BOUND instruction support and put it everywhere with the same result. Oh, it is already in x86-64, nevermind.

>actually a tree

It is a 2-level "tree", afair: directory and table. Please do your homework and research why the segmented model was kicked out of the software arena by virtually everyone involved.


Spoken like a programmer that has never had to fix TLB misses as a bottleneck. It’s actually quite common as the TLB is rather small. Not even enough for 2MB worth of 4K pages.

You can use huge pages but it has all of the drawbacks of segments and none of the benefits.

VMWare used to have a super fast 32-bit hypervisor based on segments long before special instructions were added. This of course had to be reworked completely for X64.

Also Intel’s bound check instructions are still extremely rare and don’t work that well in practice. I’ve used them.


A recipe for fixing TLB misses (as for any cache misses) is simple: don’t thrash your cache. Ofc I didn’t; I cannot even imagine what one does to bottleneck at the TLB – LSD? It is one of those problems like “doc, if I turn my finger 180 degrees, it hurts”.

>VMWare 32-bit before special instructions

DOS also was pretty fast, but that didn’t make it a good multi-user protected-mode OS. All this early emulation and monkey-patching of guests cannot substitute for hardware VT in the wild.


But memory isn't a linear address space. If you're forced to treat memory as strictly linear then you're denied many optimizations.


> But memory isn't a linear address space

Would you mind expanding on this point a bit?


Modern memory is layered. You have registers, you have three layers of CPU cache, etc, etc.

At higher levels of abstraction we pretend like it's all flat but we do need to allow for optimizations to occur.


So what does that have to do with linear addressing and optimization? How does the presence of registers and caches depend on [non-]linearity?


Thanks for responding.

I would however argue that the existence of hierarchy in physical memory is orthogonal to the existence of a linear address space. Caches work really hard to maintain the ability of programs to use a single linear address space.


Ways in which memory is not linear: many asymmetric multiprocessing setups (eg gpu/cpu), distributed processing (eg openmp), multiprocess systems w/ several virtual spaces, harvard architecture systems, segmented architectures, disk vs ram. Many of these may be considered performance-motivated departures from a linear address space, especially if performance includes reliability in addition to execution speed.

(Otoh virtual memory can be used to create the appearance of single address space for any of the above, and sometimes has been - eg as/400)


I think it was mostly aliasing-related, and that’s why I mentioned “tell which bytes point at which”. Managing C today is like managing a girlfriend who doesn’t say anything straight and avoids direct conversations. I’m not against optimizations, I’m against the compiler pretending to be smarter than average me and spoiling our relationship when I only have basic demands.

If not that, I’m also curious.


Sorry, but it's the other way around. You are the one pretending to be smarter than the compiler by imagining objects are laid out sequentially in an infinite char[] when they are in fact just abstract objects laid out in whichever way the compiler thinks is best. Just keep it this way and it will be much easier to manage C; it doesn't need micromanagement, and neither do girlfriends.


I'm saying that I don't like this model, not that I want to see everything as char[] in it. You insist that I should just embrace the model, since char[] doesn't work in it. I know, ok? If that tautology were an argument, it completely missed the point.

Below the compiler, I'm pretty familiar with how things are laid out, since there are ABIs, packing/alignment rules, documented hardware and 15 years of dealing with them. Please don't conjure magic and wisdom out of a few conventions.


All the world's a VAX.


It's still that way with Object Pascal (Lazarus/FreePascal and Delphi). You can do all of the things mentioned in the blog post without the compiler screwing you by trying to interpret what you're doing with the pointer arithmetic/access. You still get really good performance, so I'm not sure why the compiler developers feel the need to go this deep into optimization territory on what amounts to "one step above assembly" code. If a developer is doing pointer arithmetic, then the compiler should step back and let them perform any necessary optimizations themselves because they're already working at that level.


Regehr and colleagues have been doing excellent work on undefined behavior for some time, which has been fantastic.

I do personally wish the C standard was changed a bit to have less undefined behavior, leaning towards getting programmer intent correct and not worrying (as much) about getting every possible optimization into the generated code.


> I do personally wish the C standard was changed a bit to have less undefined behavior

Agreed. There are some particularly egregious ones where the behavior should at the very least be implementation defined.


I sort of see the point of the article, but I think talking about differences between pointers and integers obscures the real issue here: the problem, as far as I can tell, is just that offsetting a pointer beyond one past the end of the object it points to is UB.

TFA argues that converting to integers first sidesteps the issue but I'm not convinced (although I'm not a C standard lawyer). Surely if you convert a pointer to an integer, add an offset that's greater than the size of the original object and then convert that to a pointer you trigger UB? How else could this be defined? You can't really expect the compiler to figure out what you're doing when you're casting "random" integers as pointers.

Also, the article mentions Rust in its introduction, but in Rust anything surrounding raw pointers (including pointer arithmetic) is unsafe, so I don't think there's any conflict whatsoever here.


Related discussion at internals.rust-lang.org:

https://internals.rust-lang.org/t/pointers-are-complicated-o...

"This summer, I will again work (amongst other things) on a “memory model” for Rust/MIR. However, before I can talk about the ideas I have for this year, I have to finally take the time and dispel the myth that “pointers are simple: they are just integers”. Both parts of this statement are false, at least in languages with unsafe features like Rust or C: Pointers are neither simple nor (just) integers.

I also want to define a piece of the memory model that has to be fixed before we can even talk about some of the more complex parts: Just what is the data that is stored in memory? It is organized in bytes, the minimal addressable unit and the smallest piece that can be accessed (at least on most platforms), but what are the possible values of a byte? Again, it turns out “it’s just an 8-bit integer” does not actually work as the answer."

Ralf Jung is a coauthor of the "What’s the difference between an integer and a pointer?" paper.


Okay, I can agree with that, but then there's no need to look for contrived examples; the C standard itself spells out a bunch of UB that arises from using pointers as "just integers". It's probably worth repeating, especially for people new to these languages, but it's not exactly breaking news. I guess I just expected something else when I started reading this article.


The underlying problem is that memory access is slow, allocating registers is hard (so we let the compiler do it) and compilers don't always know what you're doing with memory.

So, in order to generate faster programs, compilers are getting more and more aggressive about reusing data that has already been copied into registers, hiding under the shadow of 'undefined behavior': the standard doesn't indicate a defined way to mess with this data, so I'll just keep using the copy I have.

Writing through an untracked pointer probably convinced the compiler to dump its register caches.

C used to be thought of as kind of a portable assembly language, but when the compiler is doing all these optimizations, there's a lot of room for clashing.
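A small sketch of that register reuse (type-based alias analysis under the usual strict-aliasing rules):

    /* Because int and long are incompatible types, the compiler may
       assume *a and *b never alias, keep the stored 1 in a register,
       and compile this function as "return 1"; that holds even for a
       caller that passes aliasing pointers (which would itself be UB). */
    int f(int *a, long *b)
    {
        *a = 1;
        *b = 2;
        return *a;
    }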


> First, it is a big mistake to try to understand pointers in a programming language as if they follow the same rules as pointer-sized integer values.

I’m a bit confused by this statement. I don’t know the C & C++ standards, but isn’t pointer arithmetic defined and allowed within the bounds of the heap allocation of the base pointer? Like, on an array for example, aren’t pointers required to “follow the same rules as pointer-sized integer values”, as long as you’re not indexing outside your array allocation?

Wouldn’t the issue here be more clearly stated as “don’t use pointer arithmetic to index outside the bounds of the base pointer’s allocation” rather than “never use pointer arithmetic”?
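For concreteness, my understanding of where the line sits (a sketch):

    void bounds_demo(void)
    {
        int a[10];
        int *p = a + 10;   /* OK: one past the end may be formed, not dereferenced */
        int *q = a + 11;   /* UB: merely computing this pointer is undefined */
        int *r = a - 1;    /* UB: pointing before the beginning is never allowed */
        (void)p; (void)q; (void)r;
    }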


Yeah it's fine inside an array, but as soon as you move a pointer from one allocation to another, boom.

Dunno what it means for stack pointers though, probably the same except now you've made it so they can't be registers.


A pointer is an integer with a shiv


Conceptually, a pointer is a higher and more well-defined level of abstraction than a raw integer, in much the same way as an iterator is a higher and more well-defined level of abstraction than a raw pointer.

The fact that an iterator, a pointer and an integer all tend to be different ways of looking at the same integer value is more or less an implementation detail.


This pointer manipulation is fine. GCC simply optimizes the parameters for printf(): https://godbolt.org/z/AHY7Nd

*x will be 7 in the end.
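For anyone without the link handy, the program under discussion is approximately this (reconstructed from the outputs quoted below, so details may differ from the post):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    int main(void)
    {
        int *x = malloc(sizeof *x);
        int *y = malloc(sizeof *y);
        *x = 3;
        *y = 5;
        uintptr_t xi = (uintptr_t)x;
        uintptr_t yi = (uintptr_t)y;
        uintptr_t diff = xi - yi;
        printf("diff = %ld\n", (long)diff);
        int *p = (int *)(yi + diff);   /* numerically equal to x */
        *p = 7;                        /* gcc may assume this cannot touch *x */
        printf("x = %p, *x = %d\n", (void *)x, *x);
        printf("y = %p, *y = %d\n", (void *)y, *y);
        printf("p = %p, *p = %d\n", (void *)p, *p);
        return 0;
    }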


I can't replicate this example on my Mac. Whether I use `diff` or the hardcoded 16 (which is the diff for me), I get the same output for clang and gcc: 7 5 7


Agreed. I get the intended result of 7 5 7 on macOS 10.13.6 LLVM.

  Mac:tmp $ clang -O3 mem1d.c ; ./a.out
  diff = -96
  x = 0x7f9f4fc00300, *x = 7
  y = 0x7f9f4fc00360, *y = 5
  p = 0x7f9f4fc00300, *p = 7

  Mac:tmp $ diff mem1d.c mem2d.c 
  13c13
  <   int *p = (int *)(yi + diff);
  ---
  >   int *p = (int *)(yi - 96);

  Mac:tmp $ clang -O3 mem2d.c ; ./a.out
  diff = -96
  x = 0x7fc2b5c00300, *x = 7
  y = 0x7fc2b5c00360, *y = 5
  p = 0x7fc2b5c00300, *p = 7

  Mac:tmp$ clang --version
  Apple LLVM version 10.0.0 (clang-1000.11.45.2)
  Target: x86_64-apple-darwin17.7.0
  Thread model: posix


You only pasted a clang run here. It's gcc that's supposed to give 3 5 7.


I get a 16 diff too, but I'm seeing 3 5 7 on gcc on High Sierra with a 2013 Macbook Pro. I'm using gcc 8.2.0 installed via Homebrew. Did you run gcc-8 with the -O3 flag?


One question that is not answered by the post: if the gcc-compiled program does not overwrite *x, what does it overwrite instead?


> For example, they assume that a pointer derived from one heap allocation cannot alias a pointer derived from a different heap allocation.

Doesn't this depend on which allocations have been freed?


One is a number and the other is a dog often used when hunting



