Implementing Hash Tables in C (andreinc.net)
268 points by todsacerdoti on Oct 16, 2021 | 159 comments



Open addressing or 'open bucket' hash tables always seem to be a second consideration in textbooks and blogs. I find them generally faster, more memory-efficient, and easier to implement. Instead we're always taught the closed bucket approach, often using linked-list buckets of all things.

In certain scenarios open addressing has worse performance, but with the right optimisations it's generally only in contrived situations that it performs worse. In normal usage, open addressing has vastly superior locality, a smaller memory footprint, and less dereferencing overhead.

A lot of writing and textbooks describing open addressing imply that the order of walking through the table when collisions occur is really important (and it is). But the approach with the best performance in practice is to just do a dumb in-order increment through the table, because of caching... the penalty for that simple approach can be made up for with a good, pseudo-random hash.
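A minimal sketch of that dumb-increment lookup, assuming a power-of-two table where key 0 marks an empty slot (names are illustrative, not from the article):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint64_t key; uint64_t value; } entry;

    /* Linear probing lookup: the hash picks the home slot, then walk the
       table in order, wrapping with a power-of-two mask. Assumes the table
       always keeps at least one empty slot, so the loop terminates. */
    entry *lookup(entry *table, size_t mask, uint64_t hash, uint64_t key) {
        for (size_t i = hash & mask; ; i = (i + 1) & mask) {
            if (table[i].key == key) return &table[i];  /* hit */
            if (table[i].key == 0)   return NULL;       /* empty slot: miss */
        }
    }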


Lua uses "chained scatter" (linked list, but links point to other entries in the same table, to maintain locality): https://github.com/lua/lua/blob/master/ltable.c

This is a good visual depiction of chained scatter: https://book.huihoo.com/data-structures-and-algorithms-with-...

Inspired by Lua, I did the same for upb (https://github.com/protocolbuffers/upb). I recently benchmarked upb's table vs SwissTable for a string-keyed table and found upb was faster in both insert and lookup (in insert upb is beating SwissTable by 2x).

It's true that the links do create more memory overhead though.


To improve locality the next probe has to land on the same cache line, which effectively amounts to linear probing.


Textbooks, and especially undergraduate textbooks, tend to favor conceptual simplicity over performance and ease of implementation. They are supposed to teach you the key ideas behind the techniques rather than the complicated details of how things work in real life.

The fundamental idea of a hash table is that you turn the key into a hash value and then store the key in the bucket corresponding to the hash value. The simplest way of dealing with hash collisions is assuming that each bucket is a concrete data structure capable of storing multiple keys. With this conceptual model, you already understand the essence of how hash tables work and when it's the appropriate data structure to use.

Open addressing is a more advanced topic. Undergraduate data structure classes usually mention it, but many students don't have the prerequisites (such as probability and memory hierarchies) to understand it or to make informed choices about it. After all, basic data structures is an early class, typically taught after introductory programming but before the core CS classes that rely on it.


> But the approach with best performance in practice is to just do a dumb increment through the table in-order because of caching... the penalty for that simple approach can be made up for with a good, pseudo-random hash.

SwissTable, which is generally among the fastest hash tables around, doesn't do this. It probes in chunks of 16 entries, then starts skipping chunks (so first 0 chunks are skipped, then 1, then 2, etc.). The chunk-skipping behavior helps avoid O(n) behavior in pathological cases.
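The probe sequence looks roughly like this (a sketch, not Abseil's actual code; scan_group is a hypothetical stand-in for the 16-wide SIMD compare over one control-byte group):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical: compares one 16-slot group's control bytes against h2,
       returning true on a match or on an empty slot (ends the probe). */
    bool scan_group(const uint8_t *ctrl, size_t group, uint8_t h2);

    /* Probe groups at triangular-number offsets: 0, 1, 3, 6, 10, ...
       With a power-of-two group count, this visits every group exactly once
       before repeating. */
    size_t probe(const uint8_t *ctrl, size_t group_mask, uint64_t hash) {
        size_t group = (hash >> 7) & group_mask;  /* high bits pick the group */
        for (size_t stride = 1; ; stride++) {
            if (scan_group(ctrl, group, (uint8_t)(hash & 0x7f)))
                return group;
            group = (group + stride) & group_mask; /* skip 0, then 1, then 2... */
        }
    }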


The O(n) behavior is not even really that 'pathological'; lots of normal cases trigger it. If SwissTable operates as you say, it sounds prone to bad performance on lots of normal data with a dumb hash, e.g. as a "Set" of numbers where the hash is the identity function (a valid hash for a closed bucket hash table) and many of the numbers are next to each other.

Therefore, much like the situation I described in my parent post, SwissTable would benefit from being used with a decent pseudo-random hash. Which is no trouble, there are a number of good, fast ones.


Yes, Swisstables explicitly require you to use a decent hash. If you insist on using a bad hash, you get terrible performance, so don't. Google internally (I didn't check Abseil) has diagnostics that flag when the problem with your Swisstable is a bad hash, saving engineers from wasting their time "investigating" this non-problem.

Rust's HashMap (which is a Swisstable reimplementation) uses SipHash-1-3 by default, but you can drop in anything you want, with the knowledge that if your hash is bad your HashMap will have lousy performance.


Swiss tables are part of Abseil and Abseil provides its own hashing library. I haven't looked into the details of the implementation, but by default it's not using the identity function for ints. These are the first few hashes it generates for int inputs:

     0:   918782482972292171
     1: 12294850989704472498
     2:  5224175424757160222
     3: 16600243932558888071
     4:  9529568361303343081
     5:  2458892796473463123
     6: 13834961304296162488
     7:  6764285737197172256
     8: 18140354246076836247
     9: 11069678680190000369
    10:  3999003117507595883


Academic computer science tends to model a 'computer' as a simplified pointer-machine with O(1) memory access everywhere.

In reality computers actually have a memory hierarchy and no memory access is ever O(1). In RAM it only looks like constant access, because the memory address lookup is hardware-optimized.

Many programming languages are heavily inspired by this academic pointer chasing, and so their performance is pretty terrible, no matter how clever the type systems or compiler optimizations you add. It's all crap once you scale to a decent size of data and beyond one machine. The only things that perform and transparently scale beyond one CPU are 1. array languages and 2. SQL databases. Both of these actually understand things such as linear memory access, data parallelism, and the fact that CPUs have caches and that 'memory' is a hierarchy of cache levels, RAM, disk, etc.


Academic computer science uses many different models for analyzing computational complexity. Most of them are not taught at undergraduate level, because students generally expect to graduate without spending 10+ years at college. Computer science is a wide and diverse field, and it's impossible to teach everything every CS graduate should know in any reasonable time.

The random access model is usually taught first, because it works reasonably well in most situations. A slight variation of it, the cell-probe model, tends to be more useful for proving lower bounds. In that model, all computation is free and only memory accesses count.

A more practical variant of the cell-probe model is often called the external memory model. The computer is assumed to have M words of fast memory and an unlimited amount of slow memory that is accessed in blocks of B words. Computation is still free, and parameters M and B are both known to the algorithm. This model is useful in situations where accesses to one level of memory hierarchy dominate the running time.

The cache-oblivious model can be useful in situations where many levels of memory hierarchy matter. It can be understood as the external memory model with unknown values of M and B.

Another useful technique is parameterizing the complexity in terms of key operations. If operation X takes t_X time, we can then say that an algorithm takes O(|P| t_X log n) time with pattern P and problem size n.


> with O(1) memory access everywhere

Isn't that technically correct? The access time is bounded by a constant, hence the 1 -- unless there's a tape in your memory hierarchy. (That you always strive for a better constant is of course another matter.)


It can't be O(1), because just to locate a memory address you need at least a binary search. Since memory lookup is implemented in hardware, it is treated as constant access. Of course this is all obscured by the magnitudes of the memory hierarchy. For real systems such as databases, it is a better idea to treat random memory access as O(log n), which then justifies doing batch lookups and linear scans. All this O(1) RAM access makes people confused, and they never start thinking beyond the simple undergrad abstraction of a single CPU with a single layer of RAM.


So here is what I just learnt from https://programming.guide/hash-tables-open-addressing.html:

Open addressing = when there is a collision, i.e. when the memory slot is already taken, you just put your key in the next one (linear probing), at a squared increment (quadratic probing), or at an increment determined by another hash function (double hashing). However, all those techniques seem to suffer from clustering, with the more elaborate ones trading memory locality for less clustering. I guess it just depends on how long and at what scale you are going to need the hash table?
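A sketch of those three probe sequences, assuming a power-of-two table size (names are illustrative):

    #include <stddef.h>
    #include <stdint.h>

    /* The i-th slot inspected after a collision at `home`; mask is
       table_size - 1, with the table size a power of two. */
    size_t linear(size_t home, size_t i, size_t mask) {
        return (home + i) & mask;
    }

    /* Triangular increments (i*i + i)/2 visit every slot exactly once
       when the table size is a power of two. */
    size_t quadratic(size_t home, size_t i, size_t mask) {
        return (home + (i * i + i) / 2) & mask;
    }

    /* Double hashing: a second, independent hash picks the stride;
       forcing it odd keeps it coprime with the table size. */
    size_t double_hash(size_t home, size_t i, size_t h2, size_t mask) {
        return (home + i * (h2 | 1)) & mask;
    }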


The clustering can still happen with any walk. The dumb walk is the best, but it is the most prone to clustering if the hash is simplistic. But if the hash is pseudo-random, it's as likely to get clusters as any other walk, and has better locality.


A direct addressing scheme with fixed cache-line-sized and -aligned buckets [typically 8 machine words/entries] incurs precisely 1 cache line access. I think the strawman here is the "linked-list" bucket. If we level the playing field -- a fixed-size pre-allocation of space, which open addressing certainly requires -- then we can dispense with a dynamically sized bucket [and the linked list].

So then the issue is space efficiency of direct addressing schemes. Here n-choice hashing (2 is optimal, actually) provides excellent loading at the cost of precisely 2 [n] cache line accesses. Your linear probing for an empty slot is likely to incur that on average, if not worse.

Cuckoo hashing, imo, is a hybrid that wants the deterministic access times of direct addressing (2 locations on reads) at the cost of variable times on writes (which are no worse than pure open addressing).
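The read side of cuckoo hashing really is that simple (a sketch with two tables of bare keys; h1/h2 are hypothetical independent hash functions):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    uint64_t h1(uint64_t key);  /* hypothetical hash #1 */
    uint64_t h2(uint64_t key);  /* hypothetical hash #2 */

    /* Every key lives in exactly one of two slots (one per hash function),
       so a lookup touches at most two cache lines, deterministically. */
    bool cuckoo_contains(const uint64_t *t1, const uint64_t *t2,
                         size_t mask, uint64_t key) {
        return t1[h1(key) & mask] == key || t2[h2(key) & mask] == key;
    }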


Since Robin Hood hashing is a thing, you can even have assurances that any given "cluster" will still have good performance.

Robin Hood hashing is to "steal from the rich" (aka: if the resident is 2 away from its ideal spot and you're 5 away from yours, you take the slot and make the 2-away entry worse off).
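A sketch of Robin Hood insertion, assuming each slot caches its key's hash (so probe distances can be recomputed) and hash 0 marks an empty slot:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint64_t hash, key, value; } slot;

    /* Walk from the home slot; whenever the resident is "richer" (closer
       to its own home) than the element being inserted, swap them and
       carry the evicted entry onward. Assumes a free slot exists. */
    void rh_insert(slot *t, size_t mask, slot nw) {
        size_t dist = 0;  /* how far `nw` currently is from its home */
        for (size_t i = nw.hash & mask; ; i = (i + 1) & mask, dist++) {
            if (t[i].hash == 0) { t[i] = nw; return; }  /* empty: claim it */
            size_t rdist = (i - (t[i].hash & mask)) & mask; /* resident's distance */
            if (rdist < dist) {   /* resident is richer: steal the slot */
                slot tmp = t[i];
                t[i] = nw;
                nw = tmp;         /* keep inserting the evicted entry */
                dist = rdist;
            }
        }
    }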


Totally agree with the assessment on the open address/linear probe ones - easy to implement, compact, lower cache pressure.

As for why it's a second consideration - it's not fancy, too easy to implement, no novelty, etc.

The way I have dealt with the downside of linear-probe, open-address tables (lower bits not well spread out): smearing/rehashing the hashes of the keys (via murmur3). As a side benefit it effectively randomizes the iteration order.
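That smearing step is just MurmurHash3's 64-bit finalizer, which spreads a weak hash's entropy across all bits before the index mask is applied:

    #include <stdint.h>

    /* MurmurHash3 fmix64: xor-shift/multiply mixing rounds. */
    uint64_t fmix64(uint64_t h) {
        h ^= h >> 33;
        h *= 0xff51afd7ed558ccdULL;
        h ^= h >> 33;
        h *= 0xc4ceb9fe1a85ec53ULL;
        h ^= h >> 33;
        return h;
    }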


> Open addressing or 'open bucket' hash tables always seem to be a second consideration in textbooks or blogs.

I find that weird since my introductory textbook went straight for an open addressing, linear probing hash table -- probably because this approach is ancient. I'm not even sure a different approach was mentioned (but of course in that course, it was only one topic out of many to grasp).


Actually, in a closed bucket hash table, why would you use a linked list and not some kind of tree structure in the buckets?


It requires not only a hashing function but also a comparison function. Also it's not beneficial at small sizes, as 'compare' might fully dereference the keys (esp. if they are composite), and log2(N) is not that much lower than just N.

So dynamically switching between a linked list and an rb-tree (or a trie) is quite practical and done in the real world, e.g. Java's HashMap.


I’m missing something… doesn’t open addressing fail catastrophically the moment you try to insert more items than the table has buckets (with very bad degradation in performance as you approach this saturation point)?

Is the idea that you only use open addressing in cases where you know you’ll place much fewer items into the table than the number of bins?


When the load factor of an open-addressed hashtable gets too high, you allocate a larger set of buckets and rehash the table’s existing contents into them. You’ll have to do this with a chained hashtable anyway, as performance will start to approach O(n) once the table is overloaded.

This does mean that insertion time goes from a true O(1) to an amortized O(1), but that’s usually not an issue. (You can do the rehashing gradually if you really have to, but it’s probably not worth the complexity.)
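A grow-and-rehash sketch, assuming key 0 marks an empty slot and with insert_no_resize as a hypothetical helper that linear-probes for a free slot:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    typedef struct { uint64_t key, value; } entry;
    typedef struct { entry *slots; size_t capacity, count; } table_t;

    /* Hypothetical: probe `slots` for an empty slot and place e there. */
    void insert_no_resize(entry *slots, size_t mask, entry e);

    /* When count/capacity crosses 0.75, move every occupied slot into a
       table twice the size. */
    int maybe_grow(table_t *t) {
        if (t->count * 4 < t->capacity * 3) return 0;  /* load < 0.75 */
        size_t new_cap = t->capacity * 2;
        entry *new_slots = calloc(new_cap, sizeof *new_slots);
        if (!new_slots) return -1;
        for (size_t i = 0; i < t->capacity; i++)
            if (t->slots[i].key != 0)
                insert_no_resize(new_slots, new_cap - 1, t->slots[i]);
        free(t->slots);
        t->slots = new_slots;
        t->capacity = new_cap;
        return 0;
    }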


You can replace the linked list with a tree to get better read performance for collisions. The JDK does this.


You get better algorithmic performance, but whether it's a tree or a linked list, you get bad locality versus open bucket.


You can get good cache locality with a tree depending on the implementation.


Note that if you have untrusted input, you may want to use a defensive option for hashing involving a private key, such as SipHash[1]. Otherwise, an attacker who knows your hash functions can just pre-generate a large number of colliding elements and reduce your hash table to a linked list; given enough attacker-controlled elements, this can effectively amount to a DoS attack[2].

[1] https://github.com/veorq/SipHash

[2] https://www.aumasson.jp/siphash/siphashdos_29c3_slides.pdf


SipHash is considerably slower than an unkeyed hash function, isn’t it? One idea I had to avoid its overhead is to use double hashing, with a fast primary hash and SipHash as the secondary hash.

Naïvely this feels as though it should provide good performance in the common case and also good security - even if there are many primary hash collisions, they won’t also be secondary hash collisions. Does anyone know if there is an analysis of this approach anywhere?
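For concreteness, the scheme I have in mind looks something like this (a sketch; fast_hash and keyed_hash are hypothetical stand-ins, the latter for a keyed PRF such as SipHash with a per-table secret):

    #include <stddef.h>
    #include <stdint.h>

    uint64_t fast_hash(uint64_t key);                       /* hypothetical */
    uint64_t keyed_hash(uint64_t key, const uint8_t k[16]); /* hypothetical */

    /* The cheap hash picks the home slot; only on collision does the
       keyed hash determine the probe stride, so attacker-built primary
       collisions still scatter across the table. */
    size_t probe(size_t i, uint64_t key, const uint8_t secret[16],
                 size_t home, size_t mask) {
        if (i == 0) return home;                     /* common case: no collision */
        uint64_t step = keyed_hash(key, secret) | 1; /* odd stride covers the table */
        return (home + i * step) & mask;
    }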


Can't you simply salt the output of your simple hash function with a randomly pregenerated salt value (which can be unique for each hashtable; a simple integer to be XORed)? That way the attacker can't predict the collisions and you have the same performance as before.


A key result of the SipHash paper is that the SipHash authors broke common hash functions so completely that they can generate seed-independent multicollisions.

That means that certain inputs will collide regardless of what the seed is. Seeding the hash function is no defense against such an attack.


If you scramble the output of the hash function, it has to be in a way that does not commute with the mapping from hash to bucket index. Xor won't cut it for power-of-two-sized hash tables.

Should probably just use a keyed hash function like siphash - the security analysis is straightforward and these hash functions are quite well developed.


If you salt the output, wouldn't anything that collided before salting also collide after salting? Surely you would salt the input of the hash function? The effectiveness of that kind of depends on your hash function. SipHash tries to give you guarantees that your salt is effective and stays secret even if hashes (along with unhashed values) ever get exposed to the attacker


Another alternative is not using a hash table, and instead using a sorted binary tree or similar, which tends to be slower on average but has worst case O(log n) lookup, O(log n) insertion. Depends on your criteria.


That depends on the implementation. If it's a simple binary tree, the worst case is O(n), not O(log n), since if you insert data in sorted order everything ends up in one long chain. O(log n) is the average lookup time.

If you have a specialized tree structure like a red-black tree which will automatically balance, this will improve, but sorted input is still a viable attack method.

That said, a lot depends on the application and one should pick a hash function wisely. A binary tree–mapped hash table works best when you want to be able to iterate on keys in sort order, or find an entry that's just before or just after a non-existent key in sort order. It's not really a good defense against hash attacks. On the flip side, the default hasher in Rust is hardened against hash attacks, but it's overkill for most applications and can cause severe performance degradation (although at least the library documentation is clear on this and points at alternative hash implementations for such cases). Unfortunately, the Rust resizing algorithm is flawed and won't resize until the hash capacity is completely full,¹ making hash collisions much more likely. I had a doubling of performance in benchmarking some code that I was writing just by increasing the initial capacity of the map.

⸻⸻⸻

1. A common mistake I see Java developers make, when they know they'll be inserting n items into a HashSet or HashMap, is to initialize the structure with a capacity of n. This guarantees that Java will end up re-allocating the index, since the default load factor in Java is 0.55, meaning that once you have 0.55×capacity elements in your collection it will resize.


> inserting n items into a HashSet or HashMap is to initialize the structure with a capacity of n.

That's actually a bug in HashMap (either the implementation or the spec). If you ask a structure to reserve space for n elements, that means it should reserve enough space to actually store n elements (so n/0.55 in this case). If the method actually does mean to set the internal, implementation-detail capacity rather than the real capacity, then the specification is buggy, because that's largely useless unless you're relying on implementation details (as demonstrated).


Here's what Oracle says for JDK8:

"When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets."

and it explicitly tells you that HashMap will default to 0.75 load factor, therefore the documentation is telling you that HashMap(400) is not a suitable container for 400 items but only up to 300 items.

Now, it's true that e.g. Rust's HashMap::with_capacity(400) is actually a container which promises it's suitable for 400 items‡ and that's more ergonomic, but I don't think we can call Java's choice here a mistake, it's just harder to use correctly.

‡"The hash map will be able to hold at least capacity elements without reallocating".


Based on some benchmarking details¹ for my application, it appears that Rust is willing to completely fill its hashmap. It may be that my use case is somewhat unusual in that I'm expecting most requests to not return a value,² but I found that just specifying a larger initial capacity dramatically sped up the application (I got a bigger speed boost from that than I did from switching from SipHash to FNV).

1. https://www.finl.xyz/2021/09/30/building-a-trie-in-rust-opti...

2. And since this won't be the only char-indexed map in the application, I will have to revisit this.


The HashMap is "completely filled" in the sense that you get to put N items in a HashMap with capacity N, because that's specifically how Rust's HashMap chooses to define capacity as I explained above, so it's a tautological explanation.

The "real" capacity and load factor of the underlying data structure is not presented to you. Swisstables (the current implementation) can't be filled entirely, they need an empty slot to function correctly. There is some trickery to ensure that very small Swisstables get to have an "empty" slot that doesn't really exist, and thus doesn't need RAM. Since you can't ever write to it the non-existence doesn't matter. A Rust HashMap::with_capacity(0) is guaranteed not to allocate any actual heap storage, just like String::new() don't need heap to store the empty string inside it.


>Rust is willing to completely fill its hashmap.

Which is the correct decision for smaller sizes. I use a dynamic fill factor, with 1 being the default at <= 16 elements.


It depends on how you're using the hash lookup though. In my case, most lookups would be expected to not return a value, so completely filling the small hashmap will cause retrievals to be slower since it has to keep going until it determines that there's no match.


Yes, of course. There are many use cases. Yet, in practical terms, most (90%) of lookups do return a value. For 'small' N, you can consider everything O(1). If you do know the use case - like most lookups failing - you can pick a proper impl. and tune it.

Then again, very few cycles are spent on small hash table lookups in general, yet a lot (like, really a lot) of memory is wasted on small hash tables.

If you care about performance: measure, measure, measure... and know what you measure.


>A common mistake I see Java developers make is, when they know they'll be inserting n items into a HashSet or HashMap is to initialize the structure with a capacity of n.

The common mistake is using the c-tor allocating N size. The c-tor does way more harm than good. Most HashMaps in Java have few elements with tons of empty array space, and many of them are pre-allocated to quite large sizes too. (At the very least, nowadays the empty c-tor doesn't immediately allocate the Node[], so empty HashMaps are OK.)

One of the strengths of HashMap is that resize/rehash is extremely fast/efficient and it doesn't dereference the keys (i.e. no hashCode/equals called). It doesn't reallocate Nodes, etc.


I said 'sorted' but I meant balanced as well. And even if you can still 'attack' it, there is a limit to how slow it can get. Worst case O(log n) is not bad. As for security, I can't speak to its efficacy, but I know it's appropriate for real-time applications.


Samsung has an internal programming test for its employees, and it prevents the use of any standard libraries. So we have to implement hash tables ourselves every time. It is good that I now find it relatively easy to build hash tables myself, but I think this limitation is too restrictive, because being good at implementing hash tables hasn't been useful in my daily programming tasks.


What is the test required for?


It is not exactly mandatory, but the company nags the employees who haven't passed the exam yet. It is also good for your performance rating, because managers are advised to consider the result of the exam when doing performance reviews.


For what it's worth, their recruitment test for fresh graduates in India is similar in that it doesn't allow usage of any STL. At least, that was the case 3 or so years ago.


It must be the same test. There are three levels in the test and new employees are given the first level.

Just out of curiosity, did you take the exam? Was the process satisfactory? Did you have a feeling that the test correctly represented participants' skills?


> In modern times, modern compilers can perform all kinds of optimizations, including this one. So it’s up to you to decide if making things harder to read is worth it.

It is still worthwhile to manually perform these arithmetic optimizations, because you may want to build with optimizations off, and switching to actual multiplication in hot code paths can cause performance regressions.

An unrelated area where I wouldn't do this sort of thing is bit twiddling tricks to implement branchless code. That generally defeats modern compilers' abilities to generate branchless code with conditional moves and ends up being slower.


> switching to actual multiplication in hot code paths can cause performance regressions

Why would you turn off optimisations if you were worried about performance optimisations?


Sometimes the compilers will do insane stuff that causes bugs if you enable too much optimization. Usually because of strict aliasing violations causing undefined behavior.


...don't violate strict aliasing then?


Aliasing is ubiquitous in systems programming. Reinterpreting arbitrary types as arrays of bytes, for example. Honestly, it's kind of amazing that C even has this rule. Seems to be an attempt to be competitive with Fortran.


It's definitely rife with footguns but you can very much write aliasing-safe systems code (or even kernel code, for that matter). Or at the very least you should strive to contain the alias-infringing code in well delimited sections of the source code that are built with special flags for instance.

If optimizations break code due to aliasing violations I would really recommend fixing the code, not turning the optimizations off!

Also if anything C's aliasing is less strict than Fortran by default, hence the later introduction of "restrict" to allow further optimizations.


> you can very much write aliasing-safe systems code (or even kernel code, for that matter)

Yeah, sometimes it's possible. Usually by making a mess of everything with unions. If I remember correctly, type punning with unions is still illegal C code but in practice every compiler understands the idiom.

The simple and intuitive solution to many problems is to cast the data to the new pointer and work directly with it. This should always produce correct code no matter what. People think like this and they write code with these assumptions in mind. In practice, nobody really cares too much what the C standard says. What matters is whether the compilers produce the desired code.
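The union idiom in question looks like this (legal in C99 and later, and understood by every mainstream compiler):

    #include <stdint.h>

    /* Type punning through a union: write one member, read another.
       The bytes are reinterpreted rather than converted. */
    uint32_t float_bits(float f) {
        union { float f; uint32_t u; } pun;
        pun.f = f;
        return pun.u;
    }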


Punning via unions is legal in C. Casting via a pointer is not and may produce incorrect code.


This is why the aliasing rules in C explicitly allow aliasing any pointer type with a char*.


Yeah, but char sucks. It's not a synonym for octet. Not guaranteed to be 8 bits. Its signedness is even ambiguous unless unsigned is specified.

The correct type is uint8_t/u8. Sadly, using that to alias other types is undefined. I've also seen hashing algorithms which used uint16_t/u16, with similar aliasing optimization bugs. Sometimes you want to reinterpret things as a struct, too.

This is the reason why Linux compiles with strict aliasing disabled.


Aliasing with unsigned char* is also allowed. (Oddly, signed char is not, even if char is signed.)

uint8_t is not guaranteed to be unsigned char but in practice almost always is. GCC did originally have separate 8 bit types when stdint.h was introduced but quickly changed to a typedef for char-based types precisely to allow using it for aliasing.

Yes, technically char may not be 8 bits but in practice that is very rare (and you can statically assert it).

Overall IMO the best solution is to always use uint8_t and turn off strict-aliasing optimisations on those rare weird platforms where it's not an alias for unsigned char for whatever reason.


I don't understand, if you're trying to work around aliasing restrictions why would you use `uint8_t*` in the first place?

By definition `sizeof(char) == 1`, so that's almost always what you want when messing with types in C anyway. What you want is bytes, not octets.


chars can be two octets on some DSP platforms that lack byte addressability.


sizeof(char) must always be 1, regardless of how many bits or octets that represents. On such a platform, uint8_t does not exist.


> uint8_t is not guaranteed to be unsigned char but in practice almost always is.

Does this imply any unsigned char typedef is able to alias anything? Or is uint8_t a compiler special case?


`char` is not guaranteed to be 8 bits and in some more exotic environments (DSPs for instance) may not be.

IIRC POSIX guarantees that char is 8 bits though (but I still think that the sign is implementation-dependent).

But as I said in a parent comment, I don't understand why it's even relevant. If you want to alias any type then use `char *` and not anything else. I don't understand why one would prefer using stdint for that.


Because sometimes people need to reinterpret data as an array of 8/16/32/64 bit elements. Sometimes people also need to reinterpret things as a structure.

This is independent of how many bytes the underlying platform can address. If we have 8 bit processing code but the platform can only address 16 bits at a time, it should be up to the compiler to generate code that works. Compilers already do stuff like that in other circumstances.


Those people need to be copying; otherwise the reinterpretation might not work. The char data might not be correctly aligned, for example. Recently went through a big nightmare where a C++ codebase that had accreted on x86 was ported to another platform where alignment actually matters, and there were all manner of rare low-level malfunctions stemming from the idea that you can just wantonly cast a char* to structured data.
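The copying version is short and optimizes well (a sketch; compilers turn the memcpy into a single load where that's safe):

    #include <stdint.h>
    #include <string.h>

    /* memcpy has no alignment or aliasing requirements, so this is the
       portable way to reinterpret bytes as a wider integer. */
    uint32_t read_u32(const char *p) {
        uint32_t v;
        memcpy(&v, p, sizeof v);
        return v;
    }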


The strict aliasing rules look through typedefs (and const/volatile-qualifiers). It's possible to use char, signed char, and unsigned char, or any typedef thereof, or any typedef of any typedef, etc., to access any memory whatsoever.


Use -fno-strict-aliasing if you do have such code.


I do. The -fno-strict-aliasing and -fwrapv are standard flags for me. They essentially fix C the language by making it behave reasonably.


But -fwrapv has been broken for decades and will not be fixed.


How is it broken?


Eg https://marc.info/?l=gnulib-bug&m=124788622422245&w=2

it disables signed integer overflow checks


To simplify use of a debugger. It would be nice if the performance were at least bearable when you're doing this.


When I was an undergraduate my university used a test of ‘can you implement a hash table in C’ as their basic ‘can you really program’ test to stream incoming students.


I think I'd have trouble with the hashing function, unless you're using a basic modulo.


IMO this is the least interesting part of a question like "can you write a hash table", unless you're in an area where writing hash functions is a critical skill. You have to deal with collisions either way... The ideal solution, for me testing someone's software engineering skills, would be one that uses a crappy hash function but can trivially swap it out for something better. (Maybe this is different from what a school is testing for, though?)


I agree. If I gave a test like that and someone gave me an answer where the hash function was pretty much just a dummy I'd be fine with that as long as they were clearly aware it was just a dummy and hopefully could explain roughly the characteristics of a good hash function.


Back then that was likely true. Nowadays, building linked lists from scratch is on the same level as writing a basic loop.

High-level programming languages have come a long way in the past 20 years. So many things we no longer need to think about in normal day-to-day programming.


> High level programming languages have come a long way in the past 20 years

Have they? How? It seems to me that we're not very far away from where we were. Yeah we're running on better tin, but the programming languages are pretty much the same - at least the mainstream ones are.

> So many things we no longer need to think about in normal day to day programming

I'm always surprised when I read comments like this. It's not necessarily wrong, but all programmers writing anything non-trivial will bump into limitations of the various types of data structures eventually. Not understanding the trade-offs and not being able to make informed decisions about different approaches will be a problem. We may not have to write them ourselves, but we should certainly understand how they're implemented, in my opinion.

I may be a special case in that I started programming on machines with tiny amounts of memory and with slow CPUs, and have spent time writing core data structures when in the games industry in the 90s, and now in my large web-application life have a large open-source library [1] which is all core data structures; but I honestly couldn't imagine being a successful developer without a key understanding of these things.

[1] https://github.com/louthy/language-ext


I think perhaps the explanation for why modern applications seem as slow as, or in many cases slower than, their Windows 3.11 equivalents loading stuff off a floppy disk may be this general disregard for choosing good data structures.

We have hardware that's thousands, if not a million, times faster, yet it hardly seems to matter when the code is a thousand, if not a million, times slower.


I do a lot of low-level stuff by choice. Early in my career, I fell into web development because I got a job offer from a friend, and I found myself writing for loops and various SQL queries. On a good day, I got to write a regex. Hell, these days, you hardly need for loops in javascript because they're hidden by frameworks. It was boring as all get-out, so I got out of the industry, went to university and learned a whole lot about programming. I try not to be elitist about it; I think it's important to acknowledge that a ton of people in the industry have a radically different perspective on the nature of programming.


> Not understanding the trade-offs and not being able to make informed decisions about different approaches will be a problem. We may not have to write them ourselves, but we should certainly understand how they're implemented in my opinion.

I agree. My argument is that they are no longer something everyone needs to know how to implement to be considered able to code at all.


Dunno, I've had use from knowing how to implement hash tables several times just in the last few months. Extremely useful stuff if you are working with memory mapped data that is larger than system memory.


Of course it’s useful. My argument is that it’s no longer necessary for most programmers. Almost every language now comes with a native good enough hash table implementation.


It's necessary when you need to debug something; regardless of who wrote the implementation, now it's your responsibility.


If they want to be allowed to write Engineer so-and-so, yes it is necessary.


That's laughable. Software "engineers" pretending that "software engineer" is a protected title in any way or form (including soft and metaphorical community-based protection).

We'll get to call software engineers actual engineers when they can lose their (currently nonexistent) license over malpractice, not a second before that happens.

The tech industry cheapens the term "engineer". The discipline isn't nearly mature enough to be called that.


It certainly is in many countries, where one isn't legally allowed to call themselves as such after a 6 weeks bootcamp.


Do you have any cites for that?

I’m not aware of any that even attempt to hold software engineers to any sort of Professional Engineer standard, or that have a mechanism by which one could actually comply if they did.

I do know of some areas where a PE is required for some software engineers, like some nuke or aerospace shops, but they’re very very few in number - and the PE isn’t about the software part of it.


In Germany you can only call yourself an engineer if you actually have a technical degree (details and exact implications vary by state). "Software Engineer" might or might not fall under that regulation, though it is probably the reason why most job ads in German will call for "Software developers" not "engineers".


Sure enough, thanks! Looked it up and it seems like Engineer is the protected title, which plausibly would apply (Ingenieur formally).


Portugal, Germany and Switzerland for example.


I wish more people saw through the corruption of terms like software engineer.


Almost as laughable as people who fawn over professional titles.


I mean, you can call a homeopathic doctor (not sure if the expression is actually valid in English... it is in my native language) an actual doctor, and it's going to be approximately as ridiculous as calling a software engineer an actual engineer.

'Tis a technical blog; being pedantic about the meanings of expressions is all the rage around these parts :)


Well, as long as we're being "technical", you seem to have completely missed my point. Whether you are called a "technologist" or a "code monkey" or a "janitor" or an "engineer" is less important than the work you do. And being obsessed with pretentious titles is much more of a sickness of doctors, engineers and other professionals, and I would venture to guess less of a sickness of homeopaths, aromatherapists and other weirdoes.


I'm not sure what you're saying? That someone's not a real engineer unless they can implement a hash table? Firstly, I don't think programmers are classified as engineers by law in any country. Secondly, I don't think I've even once thought about implementing any high school/college data structures at work, because all of them have libraries with much better implementations.


Any software engineering degree in countries with professional titles has algorithms and data structures as a compulsory lecture, and exam.


What countries are these you keep referring to?


I don't know what he was referring to specifically, but this would be the first time that I hear of a country that actually has a protected title of _Software_ Engineer.

Plenty protect "Engineer" meaning "Software Engineer" is actually a title that you _can not_ hold.

As for the "mandatory courses on algorithms" and such if you go to university, definitely Germany. If you go to university and study computer science, the first few courses which you will have to pass before ever getting to choose your own courses are going to be about modelling, data structures and algorithms. You will learn the theory of hashing, you will learn how to model problems you will learn various sorting algorithms etc. in the lecture. Labs will make you implement various of these things on an actual computer. An exam will probably ask you to write one or two of these algorithms in pseudo code. Been there, done that. And of course comparing all manner of algorithms in Big-O notation etc. Also "reducing" one algorithm onto another (dunno if that's the proper term in English). But basically taking an algorithm that you know the run time of and showing that a different algorithm you have has the same characteristics and thus Big-O complexity. You will also learn about P/NP and will implement bin packing and such.

Unrelated to hash tables, but hashing always reminds me of the substring search algorithms that we learned about. The naive way is to just do a "text" search, advancing one character at a time and comparing the whole thing to your substring. I don't remember what the algorithm is called or who invented it, but we learned about an optimization for substring search, which hashes with a particular hash function; when advancing a character in the string being searched, you reverse the part of the function for the outgoing character and only add the result of the function for the next character. Thus you save having to deal with the characters that haven't changed at all.

Just like hash tables or linked lists, I've never needed to code any of these ever again and I just use them, but it was definitely worth learning about all this stuff.
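(For reference, what's being described sounds like the Rabin–Karp rolling hash. A sketch of the per-character update, assuming a polynomial hash with an arbitrary base B and wrap-around arithmetic mod 2^64:)

    #include <stdint.h>

    #define B 257ULL  /* arbitrary base for the polynomial hash */

    /* Slide the window one character: drop the oldest character's term,
       then shift in the new one. pow_b_len1 is B^(window_len - 1),
       precomputed once. */
    uint64_t roll(uint64_t h, uint64_t pow_b_len1, char out, char in) {
        h -= (uint64_t)(unsigned char)out * pow_b_len1;
        return h * B + (unsigned char)in;
    }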


Portugal, Germany and Switzerland for example.

The respective order validates that Software/Informatics Engineering degrees are actually engineering.

Additionally, for certain legal activities like signing contracts with liability, putting the Eng. so-and-so is only legally binding if the title was validated via the traditional admission exam.

Also, you can be sued if you lied about having a title, although since our prime minister got away with it, and the legal system takes ages, probably no one would bother unless some big loss comes to be.

And if you so wish, can also get the ring.


Just like you had to learn how to code with absolute memory addressing because only a silly non-engineer would waste computer resources on a compiler, right?


That's why it's called a computer science degree, not a gluing-libraries-together degree.


Yes, programming in Assembly is usually also a possible assignment during a 5 year degree.


So if one is pretentious.


Not needing to think about stuff anymore implies the ability to do so when needed. Somehow I doubt people who have always relied on language features and libraries would be able to work effectively without them should they ever need to.


You can also build a hash table from scratch in for example Ruby. The language doesn’t matter and C isn’t the key bit of the problem.


I wish this article was more opinionated.

One of my pet peeves about hash tables is that most of the resources about it describe how there are dozens of different ways to do it (different hash functions, different collision management, etc), but they often don't tell you which one you should choose.


As a nascent dev, the best advice I've gotten so far was "write what is necessary for you". There are obvious "bad" practices that need to be eliminated early on to make life easier, but for something like this I think the wisdom is more to just work with it yourself and find what does or does not work, and why.

My biggest "leaps" have been to spend time on code and get the general system I need in place and working, then consult with a more senior colleague on it. Typically, once I've gotten my work to a point where the intended system is clear enough and I have functional code, that's where a more senior review can first take place and:

1. The senior can immediately (hopefully) understand what our goal is here

2. The senior likely has some ideas on better ways to approach it than I did in the first iteration

3. If it does get to matters of opinion, at least I'm lucky enough that my senior is very unemotional about such things and explains why they like their way, but usually can understand why I did something the way I did

And after that, I iterate and I feel like I get the idea way better.

Trying to think about the "right way" from the beginning traps me a lot personally and almost scares me that I'm going to do something wrong. I also don't really feel I "get" why something is the right way until I've done it the wrong way or at least explored the wrong way a bit first.

But again, I'm pretty nascent so you might be talking on a different level. But I feel this experience applies to basically everything with code, and just getting your hands into bad implementations and being willing to rewrite is the best way to find that "right" way for you.


I don't think there's consensus and very few people have a total grasp of the subject.

Look at Java and Python, for example. The people working on those languages are quite smart, and they decided to take different routes. Who was right?


> I wish this article was more opinionated.

There's the Attractive Chaos blog for you: https://attractivechaos.wordpress.com/


Nice article. Since the beginning of the 1990s I have enjoyed using the hash tables implemented by John Ousterhout for Tcl: https://core.tcl-lang.org/tcl/file?name=generic/tclHash.c&ci...


Similarly, Robert Sedgewick covered this in his 1990s book "Algorithms in C" quite nicely, and a bit more succinctly, IMO.

https://www.amazon.com/Algorithms-Computer-Science-Robert-Se...


> The % operation is quite efficient (although it’s usually two times more expensive than multiplication);

I thought it was much more than two times slower? Division and modulus are the slowest integer operations.


Agner Fog’s manual lists a latency/reciprocal-throughput of 15/10 for 64-bit divide on Ice Lake & Tiger Lake, vs 3/1 for 64-bit multiply. So division is 5-10x slower than multiplication.

https://www.agner.org/optimize/#manuals


It makes a significant difference in profiling, versus e.g. a bitwise mask to find the modulus for a power-of-two size.
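For comparison (a sketch; the mask variant requires the size to be a power of two):

    #include <stddef.h>
    #include <stdint.h>

    /* Integer modulo: tens of cycles of latency on current x86. */
    size_t index_mod(uint64_t hash, size_t size)  { return hash % size; }

    /* Bitwise mask: a single-cycle AND, valid only when size == 2^k. */
    size_t index_mask(uint64_t hash, size_t size) { return hash & (size - 1); }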


I believe the author gets the value of phi wrong: it is roughly 1.618, not 0.618.


Several ways of defining the Golden Ratio allow either, or both. One convention is that upper-case Phi or Φ is 1.618[...], while lower-case phi or φ is 0.618[...].

The values are related by φ = 1/Φ, as well as Φ = φ + 1. Both of these relations can be derived from the fact that they are the ratios of a Golden Rectangle: if a square has side length one, then adding a rectangle of width φ to its side gives a rectangle of length Φ, and the rectangles are similar.
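In hash table terms, either convention gives you the same multiplier: Knuth's multiplicative ("Fibonacci") hashing uses 2^64/Φ = φ·2^64, which in 64-bit arithmetic is the constant below (a sketch, assuming 1 <= bits <= 63):

    #include <stdint.h>

    /* 0x9E3779B97F4A7C15 is 2^64 divided by the golden ratio. Multiplying
       by it scatters consecutive keys; the top `bits` bits index the table. */
    uint64_t fib_hash(uint64_t key, unsigned bits) {
        return (key * 0x9E3779B97F4A7C15ULL) >> (64 - bits);
    }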


It's its brother (more accurately, its inverse), usually written as phi with a bar on top. They both respect the golden ratio.


He probably confused it with 1/phi and didn’t check his equation.


TIL 1/phi=phi-1


That's essentially the definition. φ is the solution to

    φ^2 - φ - 1 = 0
which can be rearranged, by dividing by φ, to

    φ - 1 - 1/φ = 0
or:

    1/φ = φ - 1


Or

φ^-1=φ-1


Yes, φ^-1 is another way to write 1/φ.


Yeah but the 【 aesthetics】


Happy phi day ^_^


CCAN's htable uses open addressing, and steals bits from the pointers for maximal density (extra hash bits go there to avoid derefs on misses).

The result is that performance tends to be great up to about 95% full. The resize operation is a dumb double-and-rehash, but for most programs, the lower cache impact is a win.
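The bit-stealing trick looks roughly like this (a sketch, not CCAN's actual code; it assumes 16-byte-aligned allocations, so the low 4 pointer bits are free):

    #include <stdint.h>

    #define TAG_MASK 0xFULL  /* low bits freed up by alignment */

    /* Stash a few hash bits in the pointer's low bits. */
    static inline uintptr_t tag_ptr(void *p, uint64_t hash) {
        return (uintptr_t)p | (hash & TAG_MASK);
    }

    /* Recover the real pointer. */
    static inline void *untag_ptr(uintptr_t slot) {
        return (void *)(slot & ~TAG_MASK);
    }

    /* Cheap pre-filter: on a lookup, compare the stashed hash bits
       before dereferencing, so most misses never touch the entry. */
    static inline int tag_matches(uintptr_t slot, uint64_t hash) {
        return (slot & TAG_MASK) == (hash & TAG_MASK);
    }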


Unless I'm missing something, the definition of H_division is misleading. The codomain (image) is really the set {0, ..., M - 1}, which _is_ surely a subset of [0, M), although definitely very different in size. Usually it's written as Z/qZ where q = M.

Neat article otherwise. It reminded me of mappings to real numbers which I always found interesting during my undergrad.


I had hoped for a little more insight and information density, but I did learn about Knuth's hash x(x+3) % P which was news to me.

However, these are all techniques that I might have used in undergrad; today I would hands down use Cuckoo hashing, unless the application domain was insertion-dominated.


He seems to overlook a fundamental optimization for open hash tables: ordering the elements in the linked list associated with each bucket. Right now, to find that an element is not there, he reads the entire linked list. By ordering the elements in the list upon insertion, he can greatly improve on that.


Wouldn't ordering basically improve lookup performance at the expense of insertion performance? If you don't order the elements, you can just smash the new entry at the end of the list and be done with it.

Of course that might still be a reasonable trade-off to make. Is there experimental (or experiential) support for getting significant improvements from ordering?


Ordered insertion can happen in the "middle" of the list. Unordered always has to reach the end (to ensure no duplicate), so it's worse for insertion as well.
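A sketch of sorted bucket insertion with that early exit (ordering by raw key; ordering by hash value works the same way):

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct node { uint64_t key; struct node *next; } node;

    /* Insert into a key-ordered chain. Both duplicate detection and
       misses stop as soon as we pass the key's position, instead of
       walking the whole chain. Returns 1 if inserted, 0 if duplicate,
       -1 on allocation failure. */
    int insert_sorted(node **bucket, uint64_t key) {
        node **p = bucket;
        while (*p && (*p)->key < key)
            p = &(*p)->next;                   /* advance while smaller */
        if (*p && (*p)->key == key) return 0;  /* duplicate found early */
        node *n = malloc(sizeof *n);
        if (!n) return -1;
        n->key = key;
        n->next = *p;                          /* splice in before larger keys */
        *p = n;
        return 1;
    }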


Good point. For some reason I didn't think about duplicate checking even though it's pretty obvious.


Many (most?) hash tables are write-once, read-many (caches, databases, etc. among others) so optimizing for reading makes a lot of sense.


This is possible, but typically this is not done for generic hash tables because it requires keys to be ordered (i.e. there must be a function that can tell whether one key is less than, greater than, or equal to another) while the current implementation only requires testing for equality.


> it requires keys to be ordered (i.e. there must be a function that can tell whether one key is less than, greater than, or equal to another)

It doesn't, you can order by hash value. However, it's a separate question on whether the hash value is stored with each item or not.


I believe the Compressed Hash-Array Mapped Prefix-tree (CHAMP) [1] is considered the current state of the art in terms of performance, memory efficiency, and memory locality.

I have an implementation here [2] that is used for the HashMap and HashSet types in language-ext. The 'guts' of the implementation is here [3]

[1] https://michael.steindorfer.name/publications/phd-thesis-eff...

[2] https://github.com/louthy/language-ext/blob/main/LanguageExt...

[3] https://github.com/louthy/language-ext/blob/main/LanguageExt...


Is that for immutable/persistent hashmaps, or also for mutable hashmaps?


It can be used for both. With my implementation it's mutable when initially populated (by passing an enumerable of key/value pairs), and then becomes immutable once constructed.


Hashmaps that have string-like keys suitable for indexing a trie are a subset of all hashmaps.


In every hash table implementation I have done, I use an RB tree for each bucket. This way, if a bucket is loaded with hundreds or thousands of elements, you aren’t penalized too badly.


JavaScript is a language where you need to implement your own hash table, since the built-in object only understands string keys, and the built-in Map class only has value semantics when the keys are primitive types (strings, numbers, etc.); it compares objects by object identity.


> Cryptographic hash functions are a special family of hash functions. For security considerations, they exhibit an extra set of properties. The functions used in hash table implementations are significantly less pretentious.


Wow, learned more from this than from Uni.


Spare yourself this light article on bad hash tables. The best one in C is currently github.com/LIMachi/swiss-table, but it is missing the security discussion.


Great article --- very little having to do with C though.

It might be unpopular to say this but: can we please stop teaching people to write new programs in C? The language is inherently unsafe, and we're collectively losing billions of dollars per year to preventable memory safety errors.

Why not use Rust or C# or something to demonstrate the principles of hash tables? Why start with the programming equivalent of a 1950s death-trap automobile with a spike in the center of the steering wheel?


Driving without power steering has made my arms much stronger, and a giant spike capable of ramming through my lungs at collision has made me a much safer driver. I can also repair a 1950s mobile, because its mechanical workings are much simpler than something modern.


Genuinely unable to tell if this is continuing the joke or an entirely serious but misguided comment.


When I need to borrow your car, I get stabbed in the lungs, though.


No, we can’t. Because C is a necessary evil. If you stop teaching people to program in C, who is going to maintain virtually all the software this world runs on?


Presumably the same way we find all of the people who maintain all of the critical software written in Fortran or COBOL or Ada or other languages that aren't really taught in schools anymore.

If a new hire is minimally competent at programming, I can bring them up to speed in whatever language it is we are using. Hell, even when people have notionally been taught a language in school, there's a decent chance I would still have to train them in that language as a new hire, because what they learned in school may be insufficient to cover the complexities that get involved in practice.

And to OP's point: C itself is a pretty poor language at covering what it actually covers. It has some missing features that just seem random (e.g., there are no 8- or 16-bit arithmetic operators in C). It has gratuitous undefined behavior (signed integer overflow and strict aliasing). As a language design, it's atrociously outdated (#include is a very poor substitute for modules, and generic programming might as well not exist in C). There are questionable features--null-terminated strings don't look like such a good idea in retrospect. Even if you want to avail yourself of its strengths--access to low-level hardware features--C's gratuitous undefined behavior (particularly strict aliasing) undercut it there, and languages like Rust and Zig give you the same power with none of C's warts.


> C itself is a pretty poor language at covering what it actually covers.

C's job is to be the first thing ported to a new architecture, allow you to bootstrap a system, and build on top of it (last 2 steps optional if you're working on some sort of embedded system).

And, frankly, it's reasonably good at that. Writing a simple C compiler is trivial compared to something like rust.


You don't have to write a Rust compiler. You have to write an LLVM codegen backend. Is it that hard to do?


Depends on how exotic of an architecture we're talking about.


Out in embedded land, where you will have C, assembler, and a tiny library if you are lucky, the luxuries of Rust just aren't widely available yet. It's also a very valuable language to learn, since the big 3 operating systems are made of it. You can't wish that away.


Out of curiosity, what kind of embedded device has only C and asm support? Usually the compiler used for these is gcc, which for a really long time has been written in C++. It shouldn't be hard to use even Go on embedded now. Also, Rust is getting rustc_codegen_gcc[1], so you will be able to use it everywhere.

[1] https://github.com/rust-lang/rust/tree/master/compiler/rustc...


Sometimes the compiler is gcc, but the libraries are... some weird hybrid. Texas Instruments' RTOS used their own compiler and a strange mashup of C/C++ (xdctools, they call it). I don't understand it; it kind of looks like somebody had the original cfront described to them over the telephone. The latest versions are undoing some of that, and they even have the option of using a clang compiler, but on their older chips you are stuck with the original weirdness, which basically means you are restricted to C and assembly in a practical sense.


PIC16?


You’re implementing hash tables on your 0.05$, 44.5 byte RAM, uphill-both-ways embedded devices?


Oh hell no. There are some useful uses of hashing in them though, such as verifying the contents of flash memory before loading it. I was recommended Murmur[0] and use it as a kind of "is what is contained in the flash memory consistent with what was written" check for an app that has to keep some state between reboots. You could use it in an embedded device for a hash table, but it's probably faster to make a b-tree or just brute-force an array.

[0] https://en.wikipedia.org/wiki/MurmurHash



