Without appropriate memory barriers, you can end up with an inconsistent view of the memory due to out-of-order execution and other weird stuff the CPU does behind the scenes.
There are significant footguns around low-level concurrency primitives.
The real problem is resource management. If you have a threaded GC then you can just read, and the act of reading keeps the value you read alive. If you don't have a GC, however, you probably use reference counts, and now you may need to compose an atomic read of a pointer with an atomic reference-count increment; by the time you do the second step, the object you're referencing may already have been destructed.
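A minimal sketch of that race, with illustrative names and C11 atomics:

    #include <stdatomic.h>

    typedef struct {
        atomic_int refcount;
        int payload;
    } Obj;

    _Atomic(Obj *) shared;

    Obj *unsafe_acquire(void) {
        Obj *p = atomic_load(&shared);     /* step 1: read the pointer      */
        /* another thread may swap `shared` and drop the last reference
           right here, freeing *p before we manage to pin it...            */
        atomic_fetch_add(&p->refcount, 1); /* step 2: use-after-free hazard */
        return p;
    }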
> As long as there are no writes during those reads... everything should be stable, no?
Yes, but how do you ensure there are no writes during those reads?
You have to protect the reads against concurrent writes.
The simplest way is to use a mutex, but that doesn't support concurrent reads.
The next way, which is fairly common, is to use a read-write lock. That allows reads that are concurrent with each other, but only one write at a time, and no concurrency between reads and writes.
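With POSIX threads that looks roughly like:

    #include <pthread.h>

    pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

    /* Any number of readers may hold the lock at once... */
    pthread_rwlock_rdlock(&lock);
    /* ... read the shared structure ... */
    pthread_rwlock_unlock(&lock);

    /* ...but a writer excludes everyone, readers included. */
    pthread_rwlock_wrlock(&lock);
    /* ... modify the shared structure ... */
    pthread_rwlock_unlock(&lock);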
A standard read-write lock is not particularly fast for concurrent reads. The lock-for-read operation is required so that reads prevent a concurrent write from starting, and wait for a write already started to finish. That's not fast on a multi-core system because it forces cache line bouncing between cores.
That is, unless particularly fancy read-write locks optimised for read-mostly workloads are used, such as an rwlock-per-core, and those are slow for writes. They are somewhat fast for reads on architectures with fast atomic operations, but not as fast as possible.
Read-write locks also come in different flavours, depending on whether you want new reads to be blocked and queued when there's a write blocked waiting for current reads to finish. Fairness is an issue. This can get complicated, and bugs in libc rwlocks are not unheard of because of the complication.
A seqlock can be used, which has fast reads when there are no writes. They are fast on a multi-core system because there's no cache line bouncing between cores when there are only reads. But if there is a high rate of writes in one thread, it can block all reads continuously by causing them to livelock in loops; this is called spinning. More commonly, the writes slow down seqlocked reads by a large factor in some scenarios without blocking them completely, wasting a lot of CPU time out of proportion to the amount of blocking you would expect to be necessary.
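The read side, sketched with C11 atomics (strictly, the protected data should also be read with relaxed atomics to avoid a formal data race, and the write side needs matching care):

    #include <stdatomic.h>

    atomic_uint seq;   /* odd while a write is in progress   */
    int data;          /* the protected value (illustrative) */

    int seqlock_read(void) {
        unsigned s1, s2;
        int v;
        do {
            s1 = atomic_load_explicit(&seq, memory_order_acquire);
            v  = data;  /* may observe a torn value; the loop retries it */
            atomic_thread_fence(memory_order_acquire);
            s2 = atomic_load_explicit(&seq, memory_order_relaxed);
        } while ((s1 & 1) || s1 != s2);  /* retry if a write overlapped */
        return v;
    }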
Rather like an over-contended spinlock. Spinlocks, which come in mutex and read-write flavours, should rarely be used in threaded code outside a kernel, because they spin as described above, out of proportion to the amount of blocking that's really required, and the effect is much worse in pre-empted userspace threads than in a non-preemptible kernel.
It's possible to reduce seqlock and spinlock CPU spinning by transitioning to a different type of lock after some number of spins. Sometimes a dynamically estimated number of spins. This makes them behave better in userspace threaded code outside a kernel. But now your lock is rather complicated, and still not consistently fast at reads.
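A sketch of that transition, here just spinning on a pthread mutex before blocking (the spin count is arbitrary):

    #include <pthread.h>

    void spin_then_lock(pthread_mutex_t *m) {
        /* Spin a bounded number of times hoping the holder finishes soon... */
        for (int i = 0; i < 100; i++) {
            if (pthread_mutex_trylock(m) == 0)
                return;  /* got it without sleeping */
        }
        pthread_mutex_lock(m);  /* ...then fall back to a blocking wait */
    }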
An approach which works really well is RCU. Concurrent reads can be very simple and never spin because there's no loop. There's no multi-core cache line bouncing. Writes are more complicated, and the reads have to adhere to certain patterns because the kind of concurrency allowed between reads and writes is different in RCU than with locks. It works best if your program has some kind of top-level event loop that is returned to often, to provide the "quiescent states" RCU requires. But there are other ways to do it, if there's no top level, they just require reads to do a bit more work than almost nothing.
Even RCU requires a little something in reads though, to get correct data: a data-dependency memory barrier. These barrier operations require zero instructions on nearly all CPUs because of how memory systems are designed, but are famously not free on the DEC Alpha, which shows that it's not a "no operation", it just happens to be a side effect that is usually baked in. Even with zero instructions on nearly all CPUs, they limit which code optimisations the compiler is allowed to do, so they have a non-zero average overhead, but it is very small in practice.
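For a feel of the read side, here it is with a liburcu-style API (illustrative; rcu_dereference() is where that data-dependency barrier lives):

    #include <urcu.h>  /* userspace RCU; flavour-specific headers also exist */

    struct table;
    struct table *tbl;  /* shared; writers publish with rcu_assign_pointer() */

    void reader(void) {
        rcu_read_lock();                         /* nearly free on most flavours */
        struct table *t = rcu_dereference(tbl);  /* data-dependency barrier here */
        /* ... read *t; don't hold t past rcu_read_unlock() ... */
        rcu_read_unlock();
    }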
The OP wrote this as a learning exercise. With the goal of learning in mind, I would suggest that it's probably best for the OP to go and read as much code as they possibly can, no matter the license!
Perhaps, but lawyers can ruin your life. If you ask me, it's a travesty that the law is so complex nobody in this thread can agree if this is good advice or paranoia run out of control. That's the world lawyers have created for us, and like it or not, we live in their world.
"Software licenses" aren't their own subject. They are instruments of copyright law. Without copyright, there is nothing to be licensed. And clean room implementations are the best way to ensure that no copyright violation has occurred.
Looking at GPL code doesn't poison your mind forever. Of course, you cannot directly copy GPL code and call it MIT. And you arguably cannot have GPL code open in one window and write your version in a separate window. Copyright protects a specific set of symbols in some order, not an underlying idea or algorithm. You can look at a GPL hashtable one day and write an MIT one the next, no problem.
However, it is my opinion that having a project posted on HN, being directed to look at the "internals" of another software project (such project being licensed under the GPL), and subsequently modifying your own project with what you've learned, is legally risky.
Specifically, if I were corporate counsel at a company looking to use MIT-licensed code in a product of ours, and our due diligence uncovered that just such a thing had happened, I would advise against using that code. The risk—that is, likelihood multiplied by the magnitude of the severity of the consequences—of being compelled to license our software under the GPL would be far too high.
As a result, I stand by my assessment that it is probably best—albeit not mandatory—for the OP author not to take a look at how GPL'd code accomplishes what OP author is trying to accomplish.
> Specifically, if I were corporate counsel at a company looking to use MIT-licensed code in a product of ours, and our due diligence uncovered that just such a thing had happened, I would advise against using that code.
I get that corporate counsel is extremely conservative (do you practice in this area?) and often insensitive to the costs of following their advice (as opposed to the costs of not following it) and you may well be right that this is what they would advise if asked explicitly. But I don't think the end result is good advice for an engineer.
> The risk—that is, likelihood multiplied by the magnitude of the severity of the consequences—of being compelled to license our software under the GPL would be far too high.
I think you're overestimating both likelihood and severity. Likelihood -- I mean, your internal hash table is never going to see GPL enforcement action. Severity -- the least expensive path to remediation is unlikely to be GPL'ing your software. You could replace the component, for example.
A human being's right to enlighten themselves trumps any other human being's right to enrich themselves.
You're calling for a reduction in agency over a potential for legal action - self-censorship, essentially - in a way, forming a pre-judiciary "pre-crime" conclusion of guilt.
Code is language. Restrictions on its use are human rights violations, no matter what legal-ese can be trotted out to FUD the arena.
> Looking at GPL code doesn't poison your mind forever.
If you purposely go through the source code of a project to get inspiration and afterwards you are so inspired that you end up implementing the same thing in your code, you're literally reusing the code. If the project that inspired you so much does not grant you the right to reuse the code in the way you are reusing it, you're violating the licensing terms.
"Clean room implementation" does not mean "I may or may not have copy/pasted the code". It literally means replicating a design without infringing copyright related to the work you're replicating. Directly accessing the code and lift the good parts in different degrees of verbatim-ness is the textbook definition of copyright violation.
Software licenses are the greatest legal psych out of my lifetime. US law permits anyone who has a legally acquired copy of a piece of software to copy it further[1], such
that such a new copy or adaptation is created as an essential step in the utilization of the computer program in conjunction with a machine and that it is used in no other manner
As anyone can see, this completely obviates any need to be licensed to use software once the seller has lawfully sold you a copy. It's yours and you can execute it as you see fit. Interestingly, this doesn't really affect the GPL because the GPL doesn't attempt to kick in unless you further redistribute. And that, of course, is not protected by section 117 and does require a license.
Of course we live in a regime where the process is the punishment, so if you're high profile you'll probably get hit with a ruinous lawsuit anyhow. C'est la vie.
I also think a linked list is not a great data structure in practice. Having the buckets be dynamic arrays (the amortized-O(1)-append kind) would be kinder on CPU caches. But if you are making the string part of the entry, that would also require padding the string length to word boundaries so that the struct stays aligned.
Note also that if you kept the length of the string somewhere (which you would need to for my above paragraph's suggestion), you could avoid strcmp-ing hash collisions where the strings have different lengths. strcmp() is going to add up to something expensive in this code.
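E.g., assuming the entry carries its key length:

    /* Most collisions die on the integer compare, never reaching memcmp. */
    if (e->key_len == len && memcmp(e->key, key, len) == 0) {
        /* genuine match */
    }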
This code also never checks the return value of malloc. Color me unimpressed with this code. (Cue people arguing that checking for malloc failure is a bad idea ... Linux oom killer and high level languages have you spoiled, there i said it! /s)
Also should check for arithmetic overflow, or at least static_assert that sizeof(Entry) is "too small" to matter. (Hopefully the process of deciding what "too small" should be results in just adding the overflow check.)
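Something like this, with a hypothetical Entry that stores its key inline:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    size_t len = strlen(key);
    if (len > SIZE_MAX - sizeof(Entry) - 1)
        return NULL;  /* allocation size would overflow */
    Entry *e = malloc(sizeof(Entry) + len + 1);
    if (e == NULL)
        return NULL;  /* and yes, check malloc too */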
Not really. We defined Entry, so we know it's only a few bytes. And we know size_t is large enough to iterate over all of main memory. So the only way that statement overflows is if the key is literally larger than main memory. Not a concern.
I agree with your reasoning, but it's not "literally", it's main memory minus (sizeof(struct Entry) - 1). But it's impossible to find a C string that big. Any point in memory is pretty clearly going to hit a 0 or a page fault before then.
>But, if you are making the string part of the entry, that would also require padding the string length to word boundaries so that the struct gets aligned.
It also makes deletion a far more complicated situation. You'd essentially be implementing your own allocator then.
That's true. Maybe you could keep it simpler by making small strings inline with the struct, while larger strings go on the heap. This gives you fixed-size records, so removal could be O(1) by swapping with the last one and shortening.
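Roughly like this (the cutoff is arbitrary):

    /* Fixed-size record: short keys inline, long keys on the heap. */
    typedef struct {
        size_t len;
        union {
            char small[16];  /* keys shorter than 16 bytes live inline */
            char *large;     /* longer keys get their own allocation   */
        } key;
    } Entry;

    #define ENTRY_KEY(e) ((e)->len < 16 ? (e)->key.small : (e)->key.large)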
Well, it's C, it's bound to be broken in some way. Just put it in some sufficiently high-value target, and a few months to years later the CVE will explain how it was broken exactly.
When writing C code, which for me is rare these days, the language doesn't give me enough guardrails in the type system (e.g. unique_ptr) or at compile time, so I prefer to write straightforward and dumb code. And that means code that does a lot of malloc()s with no surprises. If I have a char* that cannot be free()d, such as one pointing into the middle of a longer string, I don't even let it escape the scope of a single function. Leave the memory management trickeries to more modern languages like C++, or, better yet, Rust.
I think parent was criticizing the multiple use of mallocs when one would be cleaner and easier. This wouldn't really make a difference to the library users, besides perhaps more heap fragmentation in the old version.
I was surprised recently, when looking at different hash tables implemented in C, to discover that the standard library includes its own hash table; it is even part of POSIX. There is a reason you have never heard of it, or if you have, have never used it. In true POSIX fashion it is close to useless. You can add entries (until the fixed capacity chosen at creation is exhausted), but there is no delete and no resize, and the table lives in static storage internal to the library, so you can only use a single table at a time in a program. It doggedly commits every C stdlib sin possible.
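For the curious, it's hcreate()/hsearch() from <search.h>:

    #include <search.h>
    #include <stdio.h>

    int main(void) {
        hcreate(32);  /* capacity fixed here, forever */
        hsearch((ENTRY){ .key = "answer", .data = (void *)42L }, ENTER);
        ENTRY *found = hsearch((ENTRY){ .key = "answer" }, FIND);
        if (found)
            printf("%ld\n", (long)found->data);
        hdestroy();   /* and there is only the one table */
        return 0;
    }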
I think GP is talking about, e.g., glibc, or any libc you might find on a POSIX system, rather than the abstract C standard library. I agree that as a point of clarity, POSIX isn't standard C and the crappy hashtable described is in POSIX, not standard C.
Come on, this is a straw man argument. That function was built for a specific purpose, and the fact you had to hunt for it means it's not meant for general-purpose use.
Besides which, this statement is just inflammatory:
"In true POSIX fashion they are close to useless."
The straw man does not support the conclusion... POSIX is full of utility.
Nice article! I think eventually you need to move to macros to support multiple key/value types in C. Just leaving some macro implementations for reference:
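The flavour is something like this (purely illustrative, not taken from any particular library):

    /* Stamp out a typed entry and table per key/value pair. */
    #define DEFINE_MAP(name, K, V)                                 \
        typedef struct { K key; V value; int used; } name##_entry; \
        typedef struct { name##_entry *slots; size_t cap; } name;

    DEFINE_MAP(str_int_map, const char *, int)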
The trick is that you make sure that your table is large enough to not have a lot of collisions, then if you have a collision instead of storing exactly in the bucket given by the hash, you store it in the next available bucket. When you look for a key, you get the hash, then the index from the hash, and you start searching at this point. If you reach an empty value, then there is no value. If you find the value then you return that. So everything is stored in a single array, no need for extra allocations and it is better for cache locality.
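In code, the lookup is just (illustrative types; deletion needs tombstones and is left out):

    /* Open addressing with linear probing; cap is the array size. */
    Entry *find(Entry *slots, size_t cap, const char *key) {
        size_t i = hash(key) % cap;
        while (slots[i].key != NULL) {      /* an empty slot ends the probe */
            if (strcmp(slots[i].key, key) == 0)
                return &slots[i];           /* found it */
            i = (i + 1) % cap;              /* else try the next bucket */
        }
        return NULL;                        /* not present */
    }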
This video also talks about how to optimize hash tables and looks into several implementations: https://www.youtube.com/watch?v=DMQ_HcNSOAI&t=1690s&ab_chann... . It covers techniques used by advanced hash tables; there was a lot I didn't know.
This is how it was taught to me in a CS undergrad class last week.
What we didn't look at though was: what happens when the hash table is full? Does everything get rehashed into a bigger table? What are the performance implications of that? I wish it would have gone just a little further.
Even before it's full, an open addressing hash table (also called closed hashing - confusing names!) becomes very slow as it gets close to full, because most of the table needs to be scanned on inserts, and on queries for the recently inserted values.
When it's full, insertion requires rehashing into a bigger table. (Or you can use an overflow list when the table is full or nearly full, if you don't want to rehash and prefer to treat this case as rare and ok to be slow.)
So for a general purpose open addressing / closed hashing hash table where you don't know the number of items in advance, rehashing to a bigger table is done at some threshold before it's actually full.
As long as you do rehashing using a multiplier, for example doubling the size at 50% full (as opposed to adding a fixed amount to the size), it stays fast on average, but the rehashing operations are of course individually slow. They just don't happen often, so the average stays at O(1) time. This is called amortised O(1) time complexity. The ideal threshold for resizing depends on the probing type.
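In code the trigger is just something like this (threshold and helper are illustrative):

    /* Grow geometrically: doubling keeps inserts amortised O(1). */
    if ((tbl->count + 1) * 2 > tbl->cap)  /* resize at 50% load */
        rehash(tbl, tbl->cap * 2);        /* reinsert everything into the bigger array */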
Linked-list hash tables (also called closed addressing or open hashing - I think linked list is the clearer name) don't suffer from catastrophic slowdown as they fill in the same way as the open addressing / closed hashing type, nor from abrupt failure when full, because they're never full. These can still be resized, and this is required if O(1) time is required for arbitrary numbers of items, but the resizing threshold is not as critical. If not resized, they gradually degrade to O(N) time when filled too much. Many applications safely use fixed-size linked-list hash tables because of this.
Array-of-arrays hash tables behave similarly to linked list hash tables.
Thank you for your comment, it is really informative and I appreciate it a lot. We looked at the different approaches and I asked my teacher whether there's an in-between method, that is, open-address hashing combined with linked-list hashing where, e.g., after an arbitrary number of elements in the linked list, the table is scanned to find the next possible place. I'm sure many things have been tried before.
What I don't get is, when is the rehashing done? After a specific insert operation? So then that insert operation will be much slower than all others? I understand that it doesn't matter if the runtime has the same upper bound but it's just not what I instinctively would have thought.
> Does everything get rehashed into a bigger table?
Usually yes, though it's also possible to perform gradual resizing. That's less efficient overall, so it's only used when needed (e.g. real-time systems which can't afford a full resize).
> What are the Performance implications of that?
Same as a vector. That’s why hashmaps generally provide an amortized insertion time.
> What we didn't look at though was, what happens when the hash table is full?
Of note: a hash table never really gets full. Hash tables have a maximum load factor (which mostly depends on the collision resolution algorithm, as that informs how performance degrades as collisions increase), and the table will resize when it exceeds that load factor, to maintain a "healthy" rate of collisions.
Just realized the reason I like this is the documentation isn't just an API reference, but a progressive refinement of the problem domain to solution domain mapping. You start with a simple statement and gradually introduce concepts, problems and solutions.
These requirements rule out multiple mallocs: I do one single malloc and expect that to be enough for the entire hashmap.
I'm not sure if these constraints might also force a fixed number of buckets and a fixed capacity.
The clonable property requires that interior mutability be limited, because memcpy'd pointers would cause structure sharing, which is what I'm trying to avoid; hence a deep clone.
Rationale: these are the use cases I have for such a data structure. The first is cheap copy-on-write. I also have a left-right concurrency control hashmap, but I feel the properties of this data structure are even better. The second is sharding in a multithreaded design: do a cheap memcpy for each thread, shard the processing, and merge at the end, without any synchronization cost while the threads are working. I know that deep cloning is slow in Java, and presumably in C if you do it with loops rather than memcpy. Another use case is efficient serialization for the network.
In other words, a protobuf but easily deep clonable without loops.
If you pre-allocate a large enough buffer, you can keep a pointer to mark the boundary between the part where you have already stored data and the part still empty. Then re-implement malloc() to just move that pointer. Of course you must check for running out of allocated memory, etc. This simple approach assumes that you seldom delete entries, because it never reclaims the room occupied by deleted entries.
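A sketch of that bump allocator:

    /* Bump allocator over one pre-allocated buffer; free() is a no-op. */
    typedef struct {
        char  *buf;
        size_t used, cap;
    } Arena;

    void *arena_alloc(Arena *a, size_t n) {
        n = (n + 7) & ~(size_t)7;  /* keep allocations 8-byte aligned */
        if (a->cap - a->used < n)
            return NULL;           /* out of room: nothing is ever reclaimed */
        void *p = a->buf + a->used;
        a->used += n;              /* "freeing" never moves this back */
        return p;
    }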
Thanks for sharing! I do like the OP's article, but the more the merrier! My "How to implement a hash table in C" article [1] is for whatever reason one of my most popular articles. If only C had a slightly better standard library. :-)
In my article I use "open addressing" [2] instead of linked lists for collisions, as 1) it tends to be simpler, as you only have one data structure to manage (array, rather than array and linked list), and 2) it tends to be faster, as linked lists are slow on modern CPUs because they require jumping around in memory.
Hey, thanks for writing it up! Like I said, I was looking for exactly this a few months ago, and between the explanation and the code it was a big help.
Yeah I think open addressing with linear probing is definitely the paradigm I'd choose if I wanted to build a "simple" hash table for educational purposes.
It works well enough and is conceptually probably the easiest approach to grok, with relatively few pitfalls.
What's performance like when each bucket is an array instead of a linked list?
Another question-- at the moment where one would normally decide to resize the table for performance reasons (i.e., buckets are getting too full), how would the array-bucket style perform compared to the linked-list bucket style?
> A lot of mallocs. Just malloc a pool and implement an allocation strategy on top of it. Malloc more as necessary (say 2^n).
Good mallocs do that themselves already. It's built in, they use a pool. Really good mallocs use a fast, cache-friendly pool, typically per CPU-core or per-thread.
By using a pool on top of a good malloc, you are implementing a second pool on top of the first pool. For something like a general-purpose hash table, layering pools like that often provides no advantage and some disadvantages, including to performance. It's often, though not always, better to use a really good malloc directly.
> You mention map types in other languages. They're not always implemented as a hashmap. A tree is commonly used to implement maps too.
Notably the C++ `std::map` is a tree. A later standard added `std::unordered_map` which is a hash table. The effect is many C++ programs use trees for their maps, and some C++ devs don't realise they are using a tree with O(log N) time operations instead of a hash table with O(1) time operations.
> Embedded environments generally avoid `malloc` and the heap entirely. You have a simple allocator on top of a fixed allotment of memory instead.
I think that's a strange way to look at it, because `malloc` is a simple allocator on top of a fixed allotment of memory when you have a fixed allotment of memory to work with. You can write whatever `malloc` and `free` functions you like to suit the situation. You don't have to use the one in libc, and small embedded environments often don't have one. E.g. I've seen extremely memory-constrained bootloaders and DSPs where there is still a function called `malloc`, but it is a very simple allocator for the situation.
In high reliability systems you might need to control the pattern of allocations in the progam so that memory use stays bounded. But in a general purpose hash table, the way to bound memory use is to control the pattern of hash table usage, i.e. the callers.
No bucket lists are necessary. Guaranteed O(1) read access (not amortized, but always). Can be parameterized wrt. access time vs. memory efficiency. Can also be precompiled for offline data, e.g., Unicode codepoint tables (though tries are also really good for this).
I 'invented' a hash table like this many years ago. I'm sure I wasn't anywhere near the first, either, but at the time I thought it was pretty hot shit using a linked list as the thing to deal with collisions.
It's funny how many people come up with the same ideas.
Well, these things are what everyone has to do in college anyway, because it's helpful training. And if you didn't get to do that in your algo courses, or you didn't do college or haven't done it yet, then articles like this are super useful.
Having Swiss Tables (a more or less state of the art hash table for modern computers) is attractive but I think it'd be appropriate to be scared of unmaintained C labelled as a "proof of concept".
The CTL "unordered set" is roughly the same bad design as its namesake in C++, presumably on purpose. We really shouldn't be teaching new users this bad structure, nor the "security policy" mitigations it provides. And by the way it's not a "distributed" denial of service when somebody plugs 128 colliding values into your API, just a normal trivial DOS.
If the author sees this, you might want to take a look at it.
[0] https://github.com/qemu/qemu/blob/master/util/qht.c