You Can't Always Hash Pointers in C (2016) (nullprogram.com)
72 points by ScottWRobinson on Sept 19, 2018 | 61 comments



Slightly related (but only slightly): on the PS2 (games console, from Sony) you could access main memory with three different pointers: cacheable, uncached accelerated, and uncached (the same base address with different high bits set accordingly). Each mode had reasons to be used (for example, when you read/write to memory which is cacheable and then fire off a DMA transfer, the data being transferred may not, and often won't, match what the program has written, until a cache flush has been requested). As you can imagine, this regularly resulted in hard-to-track-down bugs.

Anyway, the bottom line of this is: I have seen various pieces of code on platforms where, when comparing memory pointers, you need to mask out several bits (unless of course you actually want to differentiate between the different access methods for the same piece of memory).
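
For anyone curious, a minimal sketch of what that masking can look like (the mask value and bit positions are made up for illustration, not the actual PS2 layout):

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical mask covering the high bits that select the access mode
       (cacheable / uncached / uncached accelerated); real values are platform-specific. */
    #define ACCESS_MODE_MASK ((uintptr_t)0xE0000000u)

    /* Compare two pointers to the same physical memory, ignoring access-mode bits. */
    static bool same_memory(const void *a, const void *b)
    {
        return ((uintptr_t)a & ~ACCESS_MODE_MASK) == ((uintptr_t)b & ~ACCESS_MODE_MASK);
    }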

It isn't quite what this article is talking about, but I thought it was a related piece of interesting trivia to those that don't already know ;)


Great example to add to the discussion. The post does mention that pointers may have some meta bits, which I fully accept, but without concrete examples I wouldn't have given it much consideration.

Reminds me of the recent underspecification that was closed concerning signed/unsigned behavior of some operation or other. Better for the spec, but I can't imagine it practically mattered.

If some unbelievable situation occurs on some unique platform/compiler, remember this post. In the meantime, do as we have been doing and keep using pointer values. Back before System 7's 32-bit-clean requirement, people put all sorts of stuff in pointers because we could. We know the costs, but also remember that it had value too.


> which was the same base address with different high bits set accordingly

Reminds me of memory mirroring on the NES (https://wiki.nesdev.com/w/index.php/Mirroring#Memory_Mirrori...). Except that, in this case, the mirrored memory has different access semantics :)


Smells like a MIPS processor?

I really enjoyed this stuff at university. I work with very high level tools these days, informing business decision makers, and I don't regret the move up the value stack.

In fact I have found that being a competent developer at low levels (embedded systems, operating systems) has helped solve some incredibly subtle issues with high-level tools in my career.


One interesting thing to note is that C++ has a specialization for

template<class T> struct hash<T>

which is required to have the correct semantics.

Given that in practice the underlying machine models of C and C++ are very similar, and that the most widely used C compilers (gcc, clang, msvc) are also C++ compilers, hashing C pointers is likely not an issue in practice.

However, if you want to hash pointers in a way that is blessed by the standard, you can probably do this

my_hash.cc

    #include <functional>

    extern "C" {
      size_t hash(char* c) {
        std::hash<char*> hasher;
        return hasher(c);
      }
    }

my_hash.h

    #include <stddef.h>

    size_t hash(char* c);
EDIT: Added code if you wanted to be pedantic about it.

Then you just link your C program to my_hash.o and include my_hash.h


C++11 only requires that "A Hash is a function object for which the output depends only on the input and has a very low probability of yielding the same output given different input values" and that "the probability of h(a)==h(b) for a!=b should approach 1.0/std::numeric_limits<std::size_t>::max()."

This allows a pointer to be larger than size_t, which might be the case on some system with tagged pointers.


Wait, so std::hash::operator() doesn't have to return a size_t, or are you just saying that pointers can have values "greater" than size_t and need to be "trimmed" down to size?


There are lots of places where size_t can't fit a full address. That is kind of the point of adding intptr_t (if it were covered by another typedef, why add it?). By the way, intptr_t is relatively recent (C99).


> That is kind of the point of adding intptr_t (if it were covered by another typedef, why add it?)

I thought it was a stylistic choice: size_t for lengths and sizes, and intptr_t for, well, an integer holding a pointer value.


size_t only has to be large enough to accommodate the size of the largest object possible in a given implementation. It has no relation to pointers at all.

Similarly, ptrdiff_t only has to be large enough to accommodate the difference between two pointers where such difference is legal ("when two pointers are subtracted, both shall point to elements of the same array object, or one past the last element of the array object").

Hence the need for intptr_t and uintptr_t, and why they're optional unlike size_t and ptrdiff_t.
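
A minimal illustration of the distinction (uintptr_t is optional in the standard, so this assumes the implementation provides it):

    #include <stdint.h>
    #include <stddef.h>

    int x = 42;

    void example(void)
    {
        /* uintptr_t (where provided) can round-trip a data pointer value;
           size_t only has to cover the size of the largest possible object. */
        uintptr_t p = (uintptr_t)(void *)&x;
        size_t    s = sizeof x;
        (void)p;
        (void)s;
    }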


Sorry, what system are you thinking about where size_t is something other than pointer sized? This is silly pedantry


It's not as unlikely as you think. Use your imagination.

I seem to recall some late 90s Unix where size_t was 32 bits and pointers were 64 bits. I think it was Tru64? Seems to be backed up here: http://www.cecalc.ula.ve/documentacion/tutoriales/COMPAQC/DO... "In Compaq C, size_t is unsigned int."

On 16-bit x86 it certainly is most natural to have size_t be 16 bits, but the full addressable memory is 1 MB.

Again, if there were no ambiguity here there would be no need to introduce a new typedef, but they did, in 1999.


I'll grant the Alpha point, which seems like a clear and unfortunate bug.

But the DOS one is wrong. "Addressable" memory with segmentation may be 20 bits, but the maximum size of an object that can be referenced through any pointer is 65536 bytes. It was possible to compile in a "huge mode" where all pointers were expressed as base/offset, but in fact there the size_t was a compiler-generated 32-bit type.


std::hash::operator() is indeed required by the standard to return a size_t, but the standard puts few constraints on a size_t. It's certainly reasonable for an architecture with boxed numbers to make a size_t too small to hold a bona fide pointer.

BTW the boxed environment could be a hardware platform (e.g. some of the RISC-V variants these days) or a software architecture (imagine a C++ variant that generated code to run inside such an environment even if the hardware it's hosted on has a flat address space).


This is the case with hashing algorithms in general. The important part is

"For two parameters k1 and k2 that are equal, std::hash<Key>()(k1) == std::hash<Key>()(k2)."

This is what allows hash maps to work.


I think you meant template<class T> struct hash< T* >



Generally, hashing pointers in C or C++ is a bad idea anyway.

Unless you have a particular love for debugging non-reproducible behavior, that is.


Clarification: I thought this was well known, but apparently it isn't, so I'll explain it.

`malloc` doesn't have to give you the same pointer the next time the program is run - and sometimes it doesn't.

In that case, the ordering of the hash table on iteration can change.

Particularly, release build `malloc` and debug `malloc` are likely to differ, leading to bugs in production that don't reproduce when debugging.
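
A toy demonstration of the effect, assuming the pointer hash is just the address bits reduced modulo the bucket count:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    #define NBUCKETS 8

    int main(void)
    {
        /* The bucket an allocation lands in depends entirely on the address
           malloc happened to return, which can differ between runs and between
           debug and release allocators. */
        for (int i = 0; i < 4; i++) {
            void *p = malloc(16);
            printf("key %d -> bucket %zu\n", i, (size_t)((uintptr_t)p % NBUCKETS));
        }
        return 0;
    }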


> In that case, the ordering of the hash table on iteration can change.

Order-dependent iteration of a hash table's items isn't exactly a central use case of the data structure and often isn't even guaranteed. I don't think that this example is really sufficient to suggest that hashing pointers is a bad idea (though it may be for other reasons).


In golang it is purposefully randomized on every execution to prevent you from relying on iteration order


That's actually better than the situation in C.

In C, debug `malloc` may very well behave the same run after run.

Then you move into production, and release `malloc` behaves differently to debug `malloc`.

Suddenly a latent bug that was there all along shows itself in release, but fails to reproduce in debug.

With the randomized behavior you describe in golang, the bug probably will show in debug and be caught.


Yea I wasn't sure enough to make a stronger claim, but I don't believe I've ever used ordered iteration of a hashtable. I feel like if I was tempted to do so, it would prompt me to rethink my design. Call it an "algorithm smell"


Who said you need or want run-to-run hash stability? One common technique for defeating hash collision attacks is mixing a secret random tweak, newly computed on each program run, into the hash function. (Python is one environment that uses this trick.) Under this scheme, hash values are unstable, but the world keeps working. Hash stability turns out not to be particularly important most of the time.

Generally speaking, if your program depends on the stability of hash-table enumeration order, you're doing something wrong in the first place.
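
A sketch of that kind of per-run tweak (the seed handling and mixing function here are illustrative, not what Python actually does):

    #include <stdint.h>

    static uint64_t hash_seed;  /* set once at startup from a random source */

    static uint64_t hash_pointer(const void *p)
    {
        /* Mix the per-run seed into the key so the table layout differs from run to run. */
        uint64_t x = (uint64_t)(uintptr_t)p ^ hash_seed;
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;  /* splitmix64-style finalizer */
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }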


>if your program depends on the stability of hash-table enumeration order, you're doing something wrong in the first place

The point is not all programs are correct.

If you're trying to reproduce a bug that occurs in the release version, but goes away in the debug version (because hash-table enumeration order changes) that makes life more difficult.


I don’t see anything wrong with using pointers as keys in a lookup table. There’s nothing non-reproducible about this.


The compiler/runtime can create two pointers with different integer values that point to the same memory address. The article explains a few reasons why that might happen. It's very uncommon, but it's valid in the standard.


>There’s nothing non-reproducible about this.

No that's false. See my clarification above.


I really don't care what the actual value of the pointers are, since I'm not going to use them for anything other than key lookup. The thing that breaks in your case is hash table iteration, but if I'm not doing that I still don't see any non-reproducible behavior other than slight timing differences.


If you read the article, there's an explicit mention of the possibility of tagged pointers for some implementations, where the low-order bits represent the address and the high-order bits are used for housekeeping and might change dynamically. Thus the pointer to the same object might naively cast to different integers at different times in program execution. That will lead to non-reproducible behavior.


Given how rarely this will actually happen, even this small thread is not worth it except for a few pedants. And on architectures where it will happen, and with compilers that dare to make it happen, it will break much more than just your clever code anyway. All these ancient standard details are there only for history. In reality, you cannot even change the direction of memcpy(3) without turning the world upside down.


Sure, but in practice everybody knows the actually used bits of every pointer per platform. On amd64, e.g., it's the 2 lowest and 16 highest bits which need to be masked off for comparison. It's simply platform- and VM-dependent.

I know very well where and how I'm able to compare ptrs in a hash table, and I'm doing it all the time. just complaining out loud is no solution.


There's nothing non-reproducible about it within a particular run. Of course it would be silly to write a pointer hash table to disk!


Agreed. Even in the author's use-case, just hashing the string itself would be much better.


What? No. Hashing a string involves extra pointer chasing and looping over a variable-length data structure, and that's on top of the collision-chain resolution the hash table itself has to do. In performance-sensitive code (like a dynamic language interpreter), it's far better to intern your strings and look them up by ID (i.e., pointer value) directly.
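
A rough sketch of the interning idea (fixed capacity, linear scan on insert, POSIX strdup, no error handling, just to show the shape of it):

    #include <string.h>
    #include <stdlib.h>

    #define MAX_INTERNED 1024

    static const char *interned[MAX_INTERNED];
    static size_t n_interned;

    /* Return a canonical pointer for the string: equal strings always come back
       as the same pointer, so callers can hash/compare the pointer directly. */
    const char *intern(const char *s)
    {
        for (size_t i = 0; i < n_interned; i++)
            if (strcmp(interned[i], s) == 0)
                return interned[i];
        return interned[n_interned++] = strdup(s);
    }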


With Microsoft going to "fat pointers" in their systems, some pointer conversions are not going to work.


I assume this is something else than the "far" pointers from the 16-bit days - what is the new application of fat C pointers at MS?


I think Animats is referring to checkedc [1] which I can assure you Microsoft/MSVC is not moving towards.

[1] https://www.microsoft.com/en-us/research/project/checked-c/


I don't like this genre of programming blog post. These things always amount to premature optimization for portability. They point out that some technique that's been used by millions of programs for decades might not work on every single architecture that's technically compatible with the C standard. So what? The techniques work, and because so many programs use these techniques, they'll keep working.

In practice, almost nobody targets esoteric environments. Avoiding useful and well-known tools just for the sake of an environment that will never see your program is just a needless self-imposed tax that's going to suck time from other aspects of development. In my programs, I'll keep hashing uintptr_t values.


> I don't like this genre of programming blog post. These things always amount to premature optimization for portability. They point out that some technique that's been used by millions of programs for decades might not work on every single architecture that's technically compatible with the C standard. So what? The techniques work, and because so many programs use these techniques, they'll keep working.

The article lays out its motivation pretty explicitly, and it runs entirely contrary to your point:

> My purpose is not to discourage you from casting pointers to integers and using the result. The vast majority of the time this works fine and as you would expect. I just think it’s an interesting part of the language, and C/C++ programmers should be aware of the potential trade-offs.

"Details of C implementations are sometimes interesting to think about" is very different from the strawman of "never do this thing because it's not universally portable".


While an architecture where two distinct pointers could point to the same location is esoteric today (minus `mmap()` tricks), go back 30 years and it was not only not esoteric, it was downright the most popular system in existence! [1]. Who's to say something just as esoteric won't become the mainstream again? (although I hope not---it was horrible).

[1] MS-DOS with its FAR pointers.


It's still non-esoteric today:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....

The ARM Cortex-M4 supports addressing individual bits of memory locations (often registers of hardware peripherals) through an additional alias region, where writing to or reading from one word of the alias accesses a single bit.

e.g., from the examples of said infocenter url:

    *(uint32_t*)0x20000000 |= (1<<7);
    *(uint32_t*)0x20000000 &= ~(1<<7);

should be equivalent to

    *(uint32_t*)0x2200001C = 1;
    *(uint32_t*)0x2200001C = 0;

Also, Analog Devices SHARC DSPs (which, I think, are still being sold with this architecture and still in use, even though I last used them about 10 years ago) alias their memory four times, depending on whether you want to access it as 16-bit, 32-bit, 48-bit or 64-bit data.


I think you're missing 'volatile's in your casts


The examples were only meant to show the bit/address mapping, not to be a complete example of how to access memory-mapped peripherals.


why would you need that in this case?


Probably so the compiler doesn't just optimize out the statements, seeing as they don't seem to have a visible effect on the program execution because they're never read again.


Because writing to one location causes the data to change in another (a system register), you don't want the compiler to assume that data in an I/O register can be cached in a CPU register.
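
A minimal sketch of the fix, reusing the bit-band alias address from the example upthread: with volatile, the compiler must perform every store instead of caching the location in a register or dropping the apparently dead writes.

    #include <stdint.h>

    /* volatile: every access to this MMIO location must actually happen */
    #define BITBAND_BIT7 (*(volatile uint32_t *)0x2200001Cu)

    void pulse(void)
    {
        BITBAND_BIT7 = 1;
        BITBAND_BIT7 = 0;
    }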


Harvard architectures might be another example that's not really strange. You can have two different pointers with the same value. However, why are we trying to compare data and instruction pointers? They are pointers to different kinds of things, one being instructions and the other data.


True. But C is pretty restrictive in what you can assume. POSIX is more lax (for the programmer, not for the compiler) because it mandates certain behavior---like an all-0 bit pattern for NULL pointers, and the ability to cast `void *` to a function pointer (and back again).

That "`NULL` need not be physical 0s" [1] is a real pain in ANSI C because `memset(astruct,0,sizeof(astruct))` won't NULL out pointers---you have to explicitly assign NULL to each pointer. It's one of the reasons I tend to stick to POSIX if I can.

[1] In source, a 0 in a pointer context is a NULL pointer, but the architecture could mandate "all 1s" as a NULL pointer.
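
To make that concrete, a strictly conforming initializer looks something like this (the explicit NULL store is redundant anywhere the null pointer really is all-zero bits, which is to say nearly everywhere):

    #include <string.h>
    #include <stddef.h>

    struct node { struct node *next; int value; };

    static void init_node(struct node *n)
    {
        /* memset produces all-zero bits, which ANSI C does not guarantee to be
           a null pointer representation, so assign NULL to pointer members too. */
        memset(n, 0, sizeof *n);
        n->next = NULL;
    }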


The zero-ness of NULL is a good example of the phenomenon I'm talking about. How much time have people wasted over the years tweaking and worsening code after somebody pointed out that some platform at one point might not have used the all-zero bit pattern for NULL? In how many cases did the program that was changed get anywhere close to such a strange, special-purpose, and likely obsolete computer? I'm guessing the answer is "vanishingly close to zero." Being exactly ANSI C correct (as opposed to relying on POSIX's stronger guarantees) comes with an engineering cost, and there's no sense paying that cost unnecessarily.


It's one thing to do something because you found that it works. It's another if you know what the rules are, and why breaking them will work in this case, and if you (or someone else) needs to port the code, you know the potential problem spots that need attention.


I agree, but I would not consider hashing pointers a common tool or technique.

I would say treating pointers as integers is a more common approach. Heck, the standard library has some support for that as well. So any sane compiler for even a wacky environment is going to try to make that transparent.


[flagged]


No, the bad practice is on the side of standardisation committees, and all users suffer from it. Recently, I got burned by the removal of the symbol CLK_TCK from a header (https://github.com/Orc/pdksh/issues/1). If you have access to a Red Hat 5 system, you can observe it with the command /bin/pdksh -c "time sleep 5". I think we should work to replace the C language because all its weaknesses have been worsened by standardisation committees.


I don't quite see how the removal of CLK_TCK has anything to do with "bad practice…on the side of standardization committees".


This change breaks working applications, is difficult to notice, and has no clear benefits. The same occurs when compilers exploit the standard's undefined behaviour for micro-optimizations.


This breaks the HN guidelines, which ask: "Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize."

If you'd review https://news.ycombinator.com/newsguidelines.html and follow the rules when posting here, we'd appreciate it.


Ouch. Thanks for the reminder; I've edited the response into something I hope better contributes to the discussion.

Though, I'm not quite sure I was using a straw man there, since my argument really hasn't changed. I think the issue was just that I was being overly snarky.


Why would one need to hash pointers? Seems like a bad idea, and data structures made of pointers, like trees, are pretty well behaved.


Say you have an array of pointers and you want to take its distinct members.

For each pointer, you could look up its hash in a table. If it's not there, you can be sure the pointer hasn't been seen before: append it to your result and put its hash in the table. If the hash is there, you learn nothing since it could be a collision, and you'll need to do something else (like iterate through your whole result so far). But if the original input has few distinct members, this shouldn't happen often.
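
A minimal sketch of that idea, with the table storing the pointers themselves rather than just their hashes (which sidesteps the collision ambiguity); the capacity and the crude hash are illustrative, and it assumes NULL is not among the inputs and n is well below CAP:

    #include <stdint.h>
    #include <stddef.h>

    #define CAP 256  /* power of two, must comfortably exceed the input size */

    size_t distinct(void **in, size_t n, void **out)
    {
        void *slots[CAP] = {0};
        size_t count = 0;
        for (size_t i = 0; i < n; i++) {
            size_t h = (size_t)((uintptr_t)in[i] >> 4) & (CAP - 1);  /* crude pointer hash */
            while (slots[h] != NULL && slots[h] != in[i])
                h = (h + 1) & (CAP - 1);                             /* linear probing */
            if (slots[h] == NULL) {                                  /* first occurrence */
                slots[h] = in[i];
                out[count++] = in[i];
            }
        }
        return count;
    }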


Yea, a list of pointers can be handy. However, if the code managing this list is receiving duplicate pointers, that sounds like another problem. If it's newly allocated memory, it should probably be submitted to this list on creation. If we have some sort of sub-allocator, that sounds like a problem with the sub-allocations. I think there are probably more common ways to avoid the need for hashing, without needing to do a linear search of your list.


Cloning trees, garbage collection, serialization of structures with references, graphs.





