You Can't Always Hash Pointers in C (2016) (nullprogram.com)
72 points by ScottWRobinson on Sept 19, 2018 | 61 comments



Slightly related (but only slightly): on the PS2 (games console, from Sony) you could access main memory with three different pointers: cacheable, uncached accelerated, and uncached (the same base address with different high bits set accordingly). Each mode had reasons to be used (for example, when you read/write to memory which is cacheable and then fire off a DMA transfer, the data being transferred may not, and often won't, match what the program has written, until a cache flush has been requested). As you can imagine, this regularly resulted in hard-to-track-down bugs.

Anyway, the bottom line of this is: I have seen various pieces of code on platforms where, when comparing memory pointers, you need to mask out several bits (unless of course you actually want to differentiate between the different access methods for the same piece of memory).
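
For anyone curious, a minimal sketch of what that masking can look like (the mask value and bit positions are made up for illustration, not the actual PS2 layout):

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical mask covering the high bits that select the access mode
       (cacheable / uncached / uncached accelerated); real values are platform-specific. */
    #define ACCESS_MODE_MASK ((uintptr_t)0xE0000000u)

    /* Compare two pointers to the same physical memory, ignoring access-mode bits. */
    static bool same_memory(const void *a, const void *b)
    {
        return ((uintptr_t)a & ~ACCESS_MODE_MASK) == ((uintptr_t)b & ~ACCESS_MODE_MASK);
    }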

It isn't quite what this article is talking about, but I thought it was a related piece of interesting trivia to those that don't already know ;)


Great example to add to the discussion. The post does mention that pointers may have some meta bits, which I fully accept, but without concrete examples I wouldn't have given it much consideration.

Reminds me of the recent underspecification that was closed concerning signed/unsigned behavior of some operation or other. Better for the spec, but I can't imagine it practically mattered.

If some unbelievable situation occurs on some unique platform/compiler, remember this post. In the meantime, do as we have been doing and keep using pointer values. Back before System 7's 32-bit-clean requirement, people put all sorts of stuff in pointers because we could. We know the costs, but also remember that it had value too.


> which was the same base address with different high bits set accordingly

Reminds me of memory mirroring on the NES (https://wiki.nesdev.com/w/index.php/Mirroring#Memory_Mirrori...). Except that, in this case, the mirrored memory has different access semantics :)


Smells like a MIPS processor?

I really enjoyed this stuff at university. I work with very high level tools these days, informing business decision makers, and I don't regret the move up the value stack.

In fact I have found that being a competent developer at low levels (embedded systems, operating systems) has helped solve some incredibly subtle issues with high-level tools in my career.


One interesting thing to note is that C++ has a specialization for

template<class T> struct hash<T>

which is required to have the correct semantics.

Given that in practice the underlying machine models of C and C++ are very similar, and that the most widely used C compilers (gcc, clang, msvc) are also C++ compilers, hashing C pointers is likely not an issue in practice.

However, if you want to hash pointers in a way that is blessed by the standard, you can probably do this

my_hash.cc

    #include <functional>

    extern "C" {
      size_t hash(char* c) {
        std::hash<char*> hasher;
        return hasher(c);
      }
    }

my_hash.h

    #include <stddef.h>

    size_t hash(char* c);
EDIT: Added code if you wanted to be pedantic about it.

Then you just link your C program to my_hash.o and include my_hash.h


C++11 only requires that "A Hash is a function object for which the output depends only on the input and has a very low probability of yielding the same output given different input values" and that "the probability of h(a)==h(b) for a!=b should approach 1.0/std::numeric_limits<std::size_t>::max()."

This allows a pointer to be larger than size_t, which might be the case on some system with tagged pointers.


Wait, so std::hash::operator() doesn't have to return a size_t, or are you just saying that pointers can have values "greater" than size_t and need to be "trimmed" down to size?


There are lots of places where size_t can't fit a full address. That is kind of the point of adding intptr_t (if it were covered by another typedef, why add it?). By the way, intptr_t is relatively recent (C99).


> That is kind of the point of adding intptr_t (if it were covered by another typedef, why add it?)

I thought it was a stylistic choice: size_t for lengths and sizes, and intptr_t for, well, an integer holding a pointer value.


size_t only has to be large enough to accommodate the size of the largest object possible in a given implementation. It has no relation to pointers at all.

Similarly, ptrdiff_t only has to be large enough to accommodate the difference between two pointers where such difference is legal ("when two pointers are subtracted, both shall point to elements of the same array object, or one past the last element of the array object").

Hence the need for intptr_t and uintptr_t, and why they're optional unlike size_t and ptrdiff_t.
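
A minimal illustration of the distinction (uintptr_t is optional in the standard, so this assumes the implementation provides it):

    #include <stdint.h>
    #include <stddef.h>

    int x = 42;

    void example(void)
    {
        /* uintptr_t (where provided) can round-trip a data pointer value;
           size_t only has to cover the size of the largest possible object. */
        uintptr_t p = (uintptr_t)(void *)&x;
        size_t    s = sizeof x;
        (void)p;
        (void)s;
    }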


Sorry, what system are you thinking about where size_t is something other than pointer sized? This is silly pedantry


It's not as unlikely as you think. Use your imagination.

I seem to recall some late 90s Unix where size_t was 32 bits and pointers were 64 bits. I think it was Tru64? Seems to be backed up here: http://www.cecalc.ula.ve/documentacion/tutoriales/COMPAQC/DO... "In Compaq C, size_t is unsigned int."

On 16-bit x86 it certainly is most natural to have size_t be 16 bits, but the full addressable memory is 1 MB.

Again, if there were no ambiguity here there would be no need to introduce a new typedef, but they did, in 1999.


I'll grant the Alpha point, which seems like a clear and unfortunate bug.

But the DOS one is wrong. "Addressable" memory with segmentation may be 20 bits, but the maximum size of an object that can be referenced through any pointer is 65536 bytes. It was possible to compile in a "huge mode" where all pointers were expressed as base/offset, but in fact there the size_t was a compiler-generated 32-bit type.


std::hash::operator() is indeed required by the standard to return a size_t, but the standard puts few constraints on a size_t. It's certainly reasonable for an architecture with boxed numbers to make a size_t too small to hold a bona fide pointer.

BTW the boxed environment could be a hardware platform (e.g. some of the RISC-V variants these days) or a software architecture (imagine a C++ variant that generated code to run inside such an environment even if the hardware it's hosted on has a flat address space).


This is the case with hashing algorithms in general. The important part is

"For two parameters k1 and k2 that are equal, std::hash<Key>()(k1) == std::hash<Key>()(k2)."

This is what allows hash maps to work.


I think you meant template<class T> struct hash< T* >



Generally, hashing pointers in C or C++ is a bad idea anyway.

Unless you have a particular love for debugging non-reproducible behavior, that is.


Clarification: I thought this was well known, but apparently it isn't, so I'll explain it.

`malloc` doesn't have to give you the same pointer the next time the program is run - and sometimes it doesn't.

In that case, the ordering of the hash table on iteration can change.

Particularly, release build `malloc` and debug `malloc` are likely to differ, leading to bugs in production that don't reproduce when debugging.
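
A toy demonstration of the effect, assuming the pointer hash is just the address bits reduced modulo the bucket count:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    #define NBUCKETS 8

    int main(void)
    {
        /* The bucket an allocation lands in depends entirely on the address
           malloc happened to return, which can differ between runs and between
           debug and release allocators. */
        for (int i = 0; i < 4; i++) {
            void *p = malloc(16);
            printf("key %d -> bucket %zu\n", i, (size_t)((uintptr_t)p % NBUCKETS));
        }
        return 0;
    }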


> In that case, the ordering of the hash table on iteration can change.

Order-dependent iteration of a hash table's items isn't exactly a central use case of the data structure and often isn't even guaranteed. I don't think that this example is really sufficient to suggest that hashing pointers is a bad idea (though it may be for other reasons).


In golang it is purposefully randomized on every execution to prevent you from relying on iteration order


That's actually better than the situation in C.

In C, debug `malloc` may very well behave the same run after run.

Then you move into production, and release `malloc` behaves differently to debug `malloc`.

Suddenly a latent bug that was there all along shows itself in release, but fails to reproduce in debug.

With the randomized behavior you describe in golang, the bug probably will show in debug and be caught.


Yea I wasn't sure enough to make a stronger claim, but I don't believe I've ever used ordered iteration of a hashtable. I feel like if I was tempted to do so, it would prompt me to rethink my design. Call it an "algorithm smell"


Who said you need or want run-to-run hash stability? One common technique for defeating hash collision attacks is mixing a secret random tweak, newly computed on each program run, into the hash function. (Python is one environment that uses this trick.) Under this scheme, hash values are unstable, but the world keeps working. Hash stability turns out not to be particularly important most of the time.

Generally speaking, if your program depends on the stability of hash-table enumeration order, you're doing something wrong in the first place.
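
A sketch of that kind of per-run tweak (the seed handling and mixing function here are illustrative, not what Python actually does):

    #include <stdint.h>

    static uint64_t hash_seed;  /* set once at startup from a random source */

    static uint64_t hash_pointer(const void *p)
    {
        /* Mix the per-run seed into the key so the table layout differs from run to run. */
        uint64_t x = (uint64_t)(uintptr_t)p ^ hash_seed;
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;  /* splitmix64-style finalizer */
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }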


>if your program depends on the stability of hash-table enumeration order, you're doing something wrong in the first place

The point is not all programs are correct.

If you're trying to reproduce a bug that occurs in the release version, but goes away in the debug version (because hash-table enumeration order changes) that makes life more difficult.


I don’t see anything wrong with using pointers as keys in a lookup table. There’s nothing non-reproducible about this.


The compiler/runtime can create two pointers with different integer values that point to the same memory address. The article explains a few reasons why that might happen. It's very uncommon, but it's valid in the standard.


>There’s nothing non-reproducible about this.

No that's false. See my clarification above.


I really don't care what the actual value of the pointers are, since I'm not going to use them for anything other than key lookup. The thing that breaks in your case is hash table iteration, but if I'm not doing that I still don't see any non-reproducible behavior other than slight timing differences.


If you read the article, there's an explicit mention of the possibility of tagged pointers for some implementations, where the low-order bits represent the address and the high-order bits are used for housekeeping and might change dynamically. Thus the pointer to the same object might naively cast to different integers at different times in program execution. That will lead to non-reproducible behavior.


Given how rarely this will actually happen, even this small thread is not worth it except for a few pedants. And on architectures where it will happen, and with compilers that dare to make it happen, it will break much more than just your clever code anyway. All these ancient standard details are there only for history. In reality, you cannot even change the direction of memcpy(3) without turning the world upside down.


Sure, but in practice everybody knows the actually used bits of every pointer per platform. On amd64, e.g., it's the 2 lowest and 16 highest bits which need to be masked off for comparison. It's simply platform- and VM-dependent.

I know very well where and how I'm able to compare ptrs in a hash table, and I'm doing it all the time. just complaining out loud is no solution.


There's nothing non-reproducible about it within a particular run. Of course it would be silly to write a pointer hash table to disk!


Agreed. Even in the author's use-case, just hashing the string itself would be much better.


What? No. Hashing a string involves extra pointer chasing and looping over a variable-length data structure, and that's on top of the collision-chain resolution the hash table itself has to do. In performance-sensitive code (like a dynamic language interpreter), it's far better to intern your strings and look them up by ID (i.e., pointer value) directly.
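
A rough sketch of the interning idea (fixed capacity, linear scan on insert, POSIX strdup, no error handling, just to show the shape of it):

    #include <string.h>
    #include <stdlib.h>

    #define MAX_INTERNED 1024

    static const char *interned[MAX_INTERNED];
    static size_t n_interned;

    /* Return a canonical pointer for the string: equal strings always come back
       as the same pointer, so callers can hash/compare the pointer directly. */
    const char *intern(const char *s)
    {
        for (size_t i = 0; i < n_interned; i++)
            if (strcmp(interned[i], s) == 0)
                return interned[i];
        return interned[n_interned++] = strdup(s);
    }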


With Microsoft going to "fat pointers" in their systems, some pointer conversions are not going to work.


I assume this is something else than the "far" pointers from the 16-bit days - what is the new application of fat C pointers at MS?


I think Animats is referring to checkedc [1] which I can assure you Microsoft/MSVC is not moving towards.

[1] https://www.microsoft.com/en-us/research/project/checked-c/


I don't like this genre of programming blog post. These things always amount to premature optimization for portability. They point out that some technique that's been used by millions of programs for decades might not work on every single architecture that's technically compatible with the C standard. So what? The techniques work, and because so many programs use these techniques, they'll keep working.

In practice, almost nobody targets esoteric environments. Avoiding useful and well-known tools just for the sake of an environment that will never see your program is just a needless self-imposed tax that's going to suck time from other aspects of development. In my programs, I'll keep hashing uintptr_t values.


> I don't like this genre of programming blog post. These things always amount to premature optimization for portability. They point out that some technique that's been used by millions of programs for decades might not work on every single architecture that's technically compatible with the C standard. So what? The techniques work, and because so many programs use these techniques, they'll keep working.

The article lays out its motivation pretty explicitly, and it runs entirely contrary to your point:

> My purpose is not to discourage you from casting pointers to integers and using the result. The vast majority of the time this works fine and as you would expect. I just think it’s an interesting part of the language, and C/C++ programmers should be aware of the potential trade-offs.

"Details of C implementations are sometimes interesting to think about" is very different from the strawman of "never do this thing because it's not universally portable".


While an architecture where two distinct pointers could point to the same location is esoteric today (minus `mmap()` tricks), go back 30 years and it was not only not esoteric, it was downright the most popular system in existence! [1]. Who's to say something just as esoteric won't become the mainstream again? (although I hope not---it was horrible).

[1] MS-DOS with its FAR pointers.


It's still non-esoteric today:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....

The ARM Cortex-M4 supports addressing individual bits of memory locations (often registers of hardware peripherals) through an additional alias region, where writing to or reading from one word of the alias accesses a single bit.

e.g., from the examples of said infocenter url:

    *(uint32_t*)0x20000000 |= (1<<7);
    *(uint32_t*)0x20000000 &= ~(1<<7);

should be equivalent to

    *(uint32_t*)0x2200001C = 1;
    *(uint32_t*)0x2200001C = 0;

Also, Analog Devices SHARC DSPs (which, I think, are still being sold with this architecture and still in use, even though I last used them about 10 years ago) alias their memory four times, depending on whether you want to access it as 16-bit, 32-bit, 48-bit or 64-bit data.


I think you're missing 'volatile's in your casts


The examples were only meant to show the bit/address mapping, not to be a complete example of how to access memory-mapped peripherals.


why would you need that in this case?


Probably so the compiler doesn't just optimize out the statements, seeing as they don't seem to have a visible effect on the program execution because they're never read again.


Because writing to one location causes the data to change in another (a system register), you don't want the compiler to assume that data in an I/O register can be cached in a CPU register.
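
A minimal sketch of the fix, reusing the bit-band alias address from the example upthread: with volatile, the compiler must perform every store instead of caching the location in a register or dropping the apparently dead writes.

    #include <stdint.h>

    /* volatile: every access to this MMIO location must actually happen */
    #define BITBAND_BIT7 (*(volatile uint32_t *)0x2200001Cu)

    void pulse(void)
    {
        BITBAND_BIT7 = 1;
        BITBAND_BIT7 = 0;
    }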


Harvard architectures might be another example that's not really strange. You can have two different pointers with the same value. However, why are we trying to compare data and instruction pointers? They are pointers to different kinds of things, one being instructions and the other data.


True. But C is pretty restrictive in what you can assume. POSIX is more lax (for the programmer, not for the compiler) because it mandates certain behavior---like an all-0 bit pattern for NULL pointers, and the ability to cast `void *` to a function pointer (and back again).

That "`NULL` need not be physical 0s" [1] is a real pain in ANSI C because `memset(astruct,0,sizeof(astruct))` won't NULL out pointers---you have to explicitly assign NULL to each pointer. It's one of the reasons I tend to stick to POSIX if I can.

[1] In source, a 0 in a pointer context is a NULL pointer, but the architecture could mandate "all 1s" as a NULL pointer.
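
To make that concrete, a strictly conforming initializer looks something like this (the explicit NULL store is redundant anywhere the null pointer really is all-zero bits, which is to say nearly everywhere):

    #include <string.h>
    #include <stddef.h>

    struct node { struct node *next; int value; };

    static void init_node(struct node *n)
    {
        /* memset produces all-zero bits, which ANSI C does not guarantee to be
           a null pointer representation, so assign NULL to pointer members too. */
        memset(n, 0, sizeof *n);
        n->next = NULL;
    }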


The zero-ness of NULL is a good example of the phenomenon I'm talking about. How much time have people wasted over the years tweaking and worsening code after somebody pointed out that some platform at one point might not have used the all-zero bit pattern for NULL? In how many cases did the program that was changed get anywhere close to such a strange, special-purpose, and likely obsolete computer? I'm guessing the answer is "vanishingly close to zero." Being exactly ANSI C correct (as opposed to relying on POSIX's stronger guarantees) comes with an engineering cost, and there's no sense paying that cost unnecessarily.


It's one thing to do something because you found that it works. It's another if you know what the rules are, and why breaking them will work in this case, and if you (or someone else) needs to port the code, you know the potential problem spots that need attention.


I agree, but I would not consider hashing pointers a common tool or technique.

I would say treating pointers as integers is a more common approach. Heck, the standard library has some support for that as well. So any sane compiler for even a wacky environment is going to try to make that transparent.


[flagged]


No, the bad practice is on the side of standardisation committees, and all users suffer from it. Recently, I got burned by the removal of the symbol CLK_TCK from a header (https://github.com/Orc/pdksh/issues/1). If you have access to a Red Hat 5 system, you can observe it with the command /bin/pdksh -c "time sleep 5". I think we should work to replace the C language because all its weaknesses have been worsened by standardisation committees.


I don't quite see how the removal of CLK_TCK has anything to do with "bad practice…on the side of standardization committees".


This change breaks working applications, is difficult to notice, and has no clear benefits. The same occurs when compilers exploit the standard's undefined behaviour for micro-optimizations.


This breaks the HN guidelines, which ask: "Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize."

If you'd review https://news.ycombinator.com/newsguidelines.html and follow the rules when posting here, we'd appreciate it.


Ouch. Thanks for the reminder; I've edited the response into something I hope better contributes to the discussion.

Though, I'm not quite sure I was using a straw man there, since my argument really hasn't changed. I think the issue was just that I was being overly snarky.


Why would one need to hash pointers? Seems like a bad idea, and data structures made of pointers, like trees, are pretty well behaved.


Say you have an array of pointers and you want to take its distinct members.

For each pointer, you could look up its hash in a table. If it's not there, you can be sure the pointer hasn't been seen before: append it to your result and put its hash in the table. If the hash is there, you learn nothing since it could be a collision, and you'll need to do something else (like iterate through your whole result so far). But if the original input has few distinct members, this shouldn't happen often.
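
A minimal sketch of that idea, with the table storing the pointers themselves rather than just their hashes (which sidesteps the collision ambiguity); the capacity and the crude hash are illustrative, and it assumes NULL is not among the inputs and n is well below CAP:

    #include <stdint.h>
    #include <stddef.h>

    #define CAP 256  /* power of two, must comfortably exceed the input size */

    size_t distinct(void **in, size_t n, void **out)
    {
        void *slots[CAP] = {0};
        size_t count = 0;
        for (size_t i = 0; i < n; i++) {
            size_t h = (size_t)((uintptr_t)in[i] >> 4) & (CAP - 1);  /* crude pointer hash */
            while (slots[h] != NULL && slots[h] != in[i])
                h = (h + 1) & (CAP - 1);                             /* linear probing */
            if (slots[h] == NULL) {                                  /* first occurrence */
                slots[h] = in[i];
                out[count++] = in[i];
            }
        }
        return count;
    }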


Yea, a list of pointers can be handy. However, if the code managing this list is receiving duplicate pointers, that sounds like another problem. If it's newly allocated memory, it should probably be submitted to this list on creation. If we have some sort of sub-allocator, that sounds like a problem with the sub-allocations. I think there are probably more common ways to avoid the need for hashing, without needing to do a linear search of your list.


Cloning trees, garbage collection, serialization of structures with references, graphs.





