Hashmaps in Factor are faster than in Zig (factorcode.org)
180 points by kencausey on Nov 11, 2023 | 61 comments



I don't see a compelling reason to use tombstones in linear probing except in a concurrent context (where you can't move entries around). The tombstone-free deletion algorithm is quite simple: https://github.com/senderista/hashtable-benchmarks/blob/mast.... No rehashing is necessary.
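
For reference, a minimal sketch of that kind of back-shift deletion in C (illustrative names only, not the linked repo's code; assumes a power-of-two table of nonzero keys, with 0 marking an empty slot and keys that are already hash codes, so the home bucket is trivial to compute):

    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    /* Home bucket of a key; trivial here because the key is its own hash code. */
    static size_t home(uint64_t key, size_t cap) { return (size_t)(key & (cap - 1)); }

    /* Tombstone-free deletion for linear probing: after removing the entry at i,
       pull later entries in the same run back so that every surviving entry is
       still reachable from its home bucket. No tombstones, no global rehash. */
    static void erase_at(uint64_t *slots, size_t cap, size_t i) {
        size_t j = i;
        for (;;) {
            j = (j + 1) & (cap - 1);
            if (slots[j] == 0)                     /* end of the run: stop */
                break;
            size_t h = home(slots[j], cap);
            /* The entry at j must stay put if its home bucket h lies cyclically
               in (i, j]; otherwise it can safely be moved back into the hole. */
            bool must_stay = (i <= j) ? (i < h && h <= j) : (i < h || h <= j);
            if (!must_stay) {
                slots[i] = slots[j];               /* back-shift into the hole */
                i = j;                             /* j becomes the new hole */
            }
        }
        slots[i] = 0;                              /* clear the final hole */
    }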


findPreferredBucket rehashes. The problem with this deletion algorithm is that every element that is a candidate for back-shifting must be rehashed to ensure that we don't shift it back beyond its home bucket (unlike in Robin Hood probing, where probe lengths already give us this information). Hence, it is pretty unsuitable when the hash function is expensive, unless we're storing hash codes or keeping the load factor very low. This image - https://ibb.co/xSytxHF - shows a benchmark of a few hash tables with string keys and a maximum load factor of 0.95. Guess which one is using linear probing with back-shift deletion.


Recalculating the hash codes is indeed potentially expensive. In my example the table stores the hash codes themselves instead of the keys (because the hash function is invertible), so the hash function never needs to be recalculated (just the preferred bucket based on the hash, which is extremely cheap). With a different algorithm (Robin Hood or bidirectional linear probing), the load factor can be kept well over 90% with good performance, as the benchmarks in the same repo demonstrate.
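
For anyone wondering what an invertible hash looks like in practice, here is one common construction (a sketch only; the constant is borrowed from MurmurHash3's finalizer and the exact mix is illustrative, not necessarily what the linked table uses). Each step is a bijection on 64-bit values, so the whole hash can be undone:

    #include <stdint.h>

    /* Multiplicative inverse of an odd constant modulo 2^64, via Newton's
       iteration (each step doubles the number of correct low bits). */
    static uint64_t mul_inverse(uint64_t a) {
        uint64_t x = a;                 /* correct to 3 bits, since a*a == 1 (mod 8) */
        for (int i = 0; i < 5; ++i)
            x *= 2 - a * x;
        return x;
    }

    /* An invertible 64-bit hash built from bijective steps:
       xor-shift and multiplication by an odd constant. */
    static uint64_t hash64(uint64_t x) {
        x ^= x >> 31;
        x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33;
        return x;
    }

    static uint64_t unhash64(uint64_t y) {
        y ^= y >> 33;                               /* self-inverse for shifts >= 32 */
        y *= mul_inverse(0xff51afd7ed558ccdULL);
        y ^= (y >> 31) ^ (y >> 62);                 /* inverse of x ^= x >> 31 */
        return y;
    }

Because hash64 is a bijection, storing the hash code loses no information: the table never needs to rehash a key, and the original key can be recovered from the stored code when needed.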


In my example the table stores the hash codes themselves instead of the keys (because the hash function is invertible)

Oh, I see, right. If determining the home bucket is trivial, then the back-shifting method is great. The issue is just that it’s not as much of a general-purpose solution as it may initially seem. In the bad cases, it's really bad.

With a different algorithm (Robin Hood or bidirectional linear probing), the load factor can be kept well over 90% with good performance, as the benchmarks in the same repo demonstrate.

I’ve seen the 90% claim made several times in the literature on Robin Hood hash tables. In my experience, the claim is a bit exaggerated, although I suppose it depends on what our idea of “good performance” is. See these benchmarks, which again go up to a maximum load factor of 0.95 (although Boost and Absl forcibly grow/rehash at 0.85-0.9):

https://strong-starlight-4ea0ed.netlify.app/

Tsl, Martinus, and CC are all Robin Hood tables (https://github.com/Tessil/robin-map, https://github.com/martinus/robin-hood-hashing, and https://github.com/JacksonAllan/CC, respectively). Absl (from Google) and Boost are well-known C++ SIMD-based hash tables. Khash (https://github.com/attractivechaos/klib/blob/master/khash.h) is, I think, an ordinary open-addressing table using quadratic probing. Fastmap is a new, yet-to-be-published design that is fundamentally similar to bytell (https://www.youtube.com/watch?v=M2fKMP47slQ) but also incorporates some aspects of the aforementioned SIMD maps (in particular, it stores a 4-bit fragment of the hash code to avoid most false key comparisons).

As you can see, all the Robin Hood maps spike upwards dramatically as the load factor gets high, becoming as much as 5-6 times slower at 0.95 vs 0.5 in one of the benchmarks (uint64_t key, 256-bit struct value: Total time to erase 1000 existing elements with N elements in map). Only the SIMD maps (with Boost being clearly the better performer of the two) and Fastmap appear mostly immune to load factor in all benchmarks, although the SIMD maps do - I believe - use tombstones for deletion in at least some cases.

I’ve only read briefly about bi-directional linear probing – never experimented with it.


Thanks for the benchmark references.

I would be curious what you think of my benchmarks of RH and BLP: https://github.com/senderista/hashtable-benchmarks/wiki/64-b.... BLP performance seems much more stable than RH in general. I haven't directly compared BLP to SIMD designs (since the impl is in Java), but would like to do so at some point. At any rate, I haven't yet found any reason to use RH rather than BLP (note that a BLP impl could store preferred bucket index, truncated hash code, distance from preferred bucket, etc., just like RH).


As I’m sure you know, benchmarking hash tables is pretty difficult because there are many variables that affect their performance and it’s hard to cover all use cases. Let me start by explaining the benchmarks I linked you to earlier. Eventually they will be open-sourced and published as part of a comprehensive review of C (and some C++) hash tables, but that’s potentially months away now that there is a war in Gaza keeping me very busy in my day job.

The "uint32_t key, uint32_t value" benchmarks test how the hash tables perform when the hash function and key comparison function are inexpensive, traversing buckets is inexpensive (i.e. does not cause many cache misses), and moving elements is cheap. These benchmarks disadvantage tables that store metadata in a separate array (which here are Absl, Boost, Martinus, and Fastmap) because doing so necessarily causes at least one extra cache miss per lookup.

The "uint64_t key, 256-bit struct" value benchmarks test how the tables perform when the hash function and key comparison function are inexpensive, traversing buckets is expensive (a cache miss per bucket), and moving elements is expensive. These benchmarks disadvantage tables that don’t store metadata in a separate array (or do but access the buckets array with every probe anyway to check the key) and that move elements around a lot (e.g. Robin Hood).

The "16-char NULL-terminated string key, uint64_t value" benchmarks test how tables perform when the hash function and key comparison function are expensive. These benchmarks disadvantage hash tables that lack a (metadata) mechanism to avoid most key comparisons or do a lot of rehashing (this is where the performance of the liner-probing/back-shift-deletion tables goes nuclear if they’re not storing hash codes or home bucket indices).

As I mentioned earlier, in these benchmarks the max load factor is set to 95% (although the SIMD tables rehash early even after we modify the fixed max load factors hard-coded into the libraries). Measurements are taken at intervals of 50k. Each data point is the average of five runs. We can make the lines smoother, and therefore make the benchmarks more readable, by upping the runs to ten or more (although adding more runs hides the variability of the maps that suffer from this problem – e.g. notice how Khash’s plots are usually much more squiggly than those of the other tables?). This approach allows us to see the performance of each map across the whole spectrum of load factors from about 0.48 (the troughs) to 0.95 (the peaks).

Of course, these benchmarks don’t cover all cases and combinations – just a hopefully representative sample. They also don’t show memory usage (in general, the SIMD maps have one or one-and-a-fraction of a byte of overhead per bucket, Fastmap has two bytes, and the Robin Hood maps should have anywhere from two to eight bytes). They also don’t show the effect of tombstones when we do lots of deletions (e.g. the “erasure” benchmarks don’t show why the tombstoneless Fastmap may be superior to Boost in this regard).

Now, on to your benchmarks, which I had a quick look at earlier and again just now:

The first thing that jumps out at me is that it looks like you’re only testing longs as keys. In other words, you appear to only be covering the first scenario - perhaps the ideal scenario for an open-addressing table - that I described above: the hash function and key comparison function are inexpensive, traversing buckets is inexpensive, and moving elements is cheap. Your results may well vary when you change any of these variables.

The second thing I’m concerned about is the way you handle load factor and measurement intervals. Take your “Average time for successful lookup (90% load factor)” benchmark, for example. Are you just setting the max load factor to 90% and then measuring at 10k, 100k, 1m, 10m, and 100m elements? If so, then you’re not measuring the tables at a 90% load factor – you’re measuring them at whatever their load factors happen to be at those intervals. This might explain, for example, why the measurement of the Robin Hood map at 1m is so radically different from its measurement at 10m in that benchmark, or why your plots look like bell curves in your “Average time for successful lookup (99% load factor)” benchmark. If I’m right, then I think you need to either use a much smaller measurement interval so that you actually capture the performance near the target load or benchmark the tables not at element-count intervals but when they approach the target load factor. If you do the former (as I do), then there might not be any need for the separate lower-max-load-factor benchmarks because the highest-max-load-factor benchmarks will – as I mentioned earlier – also show the performance at lower load factors.

Of course, it’s nice that your horizontal scale is exponential, not linear like mine. For my benchmarks, I’m basically assuming that the trends we can clearly see for the 0-20m range will continue at higher element counts (I can’t see why they wouldn’t, and the difference made by slightly more cache misses as the element count grows should become less and less apparent at higher counts).

Anyway, that's my 2c - hope it's helpful. Let me know if you have any suggestions regarding my benchmarks. I followed you on GitHub so that I can share some more comprehensive (i.e. many more C and C++ hash tables) benchmarks with you once I have time to run the full suite again (it takes a few hours, and I only have one computer).


Thanks for the details on your benchmarks. I would like sometime to extend BLP to a more generic setting; as I said I think any trick used with RH would also work with BLP. I just used an integer set because that's all I needed for my use case and it was easy to implement several different approaches for benchmarking. As you note, it favors use cases where the hash function is cheap (or invertible) and elements are cheap to move around.

About your question on load factors: no, the benchmarks are measuring exactly what they claim to be. The hash table constructor divides max data size by load factor to get the table size (https://github.com/senderista/hashtable-benchmarks/blob/mast...), and the benchmark code instantiates each hash table for exactly the measured data set size and load factor (https://github.com/senderista/hashtable-benchmarks/blob/mast...).

I can't explain the peaks around 1M in many of the plots; I didn't investigate them at the time and I don't have time now. It could be a JVM artifact, but I did try to use JMH "best practices", and there's no dynamic memory allocation or GC happening during the benchmark at all. It would be interesting to port these tables to Rust and repeat the measurements with Criterion. For more informative graphs I might try a log-linear approach: divide the intervals between the logarithmically spaced data sizes into a fixed number of subintervals (say 4).


I'll try to download and play around with your benchmarks when I have a chance. After reading your explanation of how you create the tables at the desired load factor, some of those plots definitely look rather odd to me. What I'd expect to see across all your benchmarks is a bunch of upward curves tapering off at the top (or perhaps just straight upward lines, given that your horizontal scale is exponential). Basically, the performance of a given table at the same load factor should be fundamentally similar irrespective of how many elements are in the table, except that the higher the count gets, the less frequently the table will benefit from incidental cache hits when consecutive lookups coincidentally hit the same part of the buckets array. You can see this in my benchmarks (except for the cumulative "Total time to insert N nonexisting elements" benchmarks and the iteration benchmarks, which are a whole other can of worms). Notice how my peaks grow higher with the element count but taper off on the right-hand side? In contrast, your plots' data points (which I think should be analogous to the peaks in my graphs) seem to jump around, with the high-element-count points often appearing lower than the low-element-count ones. This seems very unexpected.

Here's one idea: It's been many years since I touched Java. How does Java's garbage collector work? Does it kick in intermittently? Could the garbage collector be muddling your measurements, and if so, can it be disabled? Edit: Sorry, I just reread your comment and saw that you already addressed garbage collection.


Despite my disclaimer about GC (and my effort to use JMH properly), I find it difficult to trust microbenchmarks on the JVM. I don't know when I'll have time for this, but "someday" I'd like to port this whole codebase to Rust/Criterion (which should be straightforward because the algorithms/data structures are "trivial"), and see if the more surprising artifacts persist. I do find the overall differentiation between RH and BLP surprising; I expected them to have pretty similar performance profiles.

In any case, I would definitely appreciate someone else rerunning the benchmarks on a different platform/JVM!


IMO a pretty good approach is tombstones where you delete trailing tombstones. That way the tombstones can't overly clog the dictionary, but you never have to move live objects.
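
One way to read that suggestion (a rough sketch; the slot-state names are made up): leave a tombstone only when the probe run continues past the deleted slot, and when a deletion lands at the end of a run, sweep backwards and reclaim any tombstones that are now trailing:

    #include <stddef.h>

    enum { SLOT_EMPTY, SLOT_TOMB, SLOT_LIVE };     /* per-bucket state */

    /* Delete the entry at slot i without ever moving live entries. If the run
       continues past i, a tombstone is required; otherwise the slot can become
       empty again, and any tombstones now trailing at the end of the run can
       be reclaimed too. cap is a power of two. */
    static void erase_at(unsigned char *state, size_t cap, size_t i) {
        size_t next = (i + 1) & (cap - 1);
        if (state[next] != SLOT_EMPTY) {
            state[i] = SLOT_TOMB;                  /* run continues: keep a tombstone */
            return;
        }
        state[i] = SLOT_EMPTY;                     /* end of run: no tombstone needed */
        size_t prev = (i + cap - 1) & (cap - 1);
        while (state[prev] == SLOT_TOMB) {         /* sweep trailing tombstones */
            state[prev] = SLOT_EMPTY;
            prev = (prev + cap - 1) & (cap - 1);
        }
    }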


Here is a simple heuristic for avoiding most tombstones when the load factor isn't too high:

https://arxiv.org/pdf/1808.04602.pdf


This is a really cool approach! But if it's so simple, why doesn't every hashmap use it? It seems like there are some trade-offs here that I must be missing.


If you do any kind of probing other than linear probing (e.g. quadratic probing) this approach doesn't work anymore, because your collisions are no longer densely grouped together.


The main tradeoff is concurrency: it's difficult to safely read a hash table while its entries are being concurrently relocated. Another tradeoff is (probably) higher average latency (but worst-case latency is much better since there's no global rehashing required).


May I ask why you decided to store the hash instead of the element?


Two reasons:

1. I never need to recalculate the hash in order to find the preferred bucket.

2. The BLP invariant is that entries are strictly ordered by hash code (Robin Hood can actually be reformulated in terms of the same invariant, but it's less efficient because entries with the same preferred bucket can only reside at or after that bucket rather than before, at, or after it). This allows you to view a hash table as just a sorted array with gaps, and hash table lookup as interpolation search.
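
To make the "sorted array with gaps" view concrete, here is a rough lookup sketch (an illustration under simplifying assumptions, not the linked implementation): the preferred bucket is taken from the top bits of the hash code, so bucket order and code order agree, and a probe only ever needs to walk in one direction:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    /* Lookup in a table that stores full hash codes (0 = empty) in increasing
       order, each placed near its preferred bucket (the code's top bits). The
       table is then just a sorted array with gaps, and a probe walks in the
       direction the ordering dictates, like a step of interpolation search.
       Assumes 1 <= log2_cap <= 63; wraparound/boundary handling is omitted. */
    static bool contains(const uint64_t *slots, int log2_cap, uint64_t code) {
        size_t cap = (size_t)1 << log2_cap;
        size_t i = (size_t)(code >> (64 - log2_cap));   /* preferred bucket */
        if (slots[i] == code) return true;
        if (slots[i] == 0) return false;
        if (slots[i] < code) {                          /* if present, code lies to the right */
            while (++i < cap && slots[i] != 0 && slots[i] <= code)
                if (slots[i] == code) return true;
        } else {                                        /* otherwise it lies to the left */
            while (i-- > 0 && slots[i] != 0 && slots[i] >= code)
                if (slots[i] == code) return true;
        }
        return false;
    }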


I assume you do a lot of deletions for point 1 to be significant. On the other hand, you restrict yourself to a specific class of hashes: reversible and bijective (and never producing a zero for a valid input, although I think this is just how this particular table is built and can be easily changed). This is a severe limitation if you ever want to store any data other than integers, e.g. strings.

I'm not criticizing. I've never implemented a hash table myself and want to learn. I'm just trying to understand what's the trade off you considered that I don't see.


Using a reversible hash function to store encoded keys is certainly a domain-specific optimization, but integer keys are so common that I don't understand why I don't see this optimization more often. For non-integer keys, you can just store truncated hash codes to eliminate hash function calls in most cases (note that the hash value is needed both to calculate the preferred bucket and to determine the order of entries in that bucket). The smaller the table, the smaller the truncated hash codes can be. When you resize the table, you can increase the truncated hash code size if necessary (one bit for each doubling, which you can amortize by adding a byte every 8 doublings). This of course requires rehashing all keys, but conventional hash tables already rehash all keys on resizing.
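
As a rough illustration of the truncated-hash-code idea for non-integer keys (names and layout are made up, not taken from any of the tables discussed above): keep a small fragment of each key's hash in a side array, and only touch or compare the key itself when the fragment matches:

    #include <stdint.h>
    #include <string.h>
    #include <stddef.h>
    #include <stdbool.h>

    static uint64_t hash_str(const char *s) {          /* FNV-1a, just for illustration */
        uint64_t h = 0xcbf29ce484222325ULL;
        for (; *s; ++s) { h ^= (unsigned char)*s; h *= 0x100000001b3ULL; }
        return h;
    }

    struct bucket { const char *key; uint64_t value; };

    struct table {
        struct bucket *buckets;
        uint8_t       *frag;     /* one truncated hash code per bucket, 0 = empty */
        size_t         cap;      /* power of two; table is never completely full */
    };

    /* Linear-probing lookup that rejects most non-matching buckets on the
       one-byte fragment alone, so full string comparisons (and rehashing of
       stored keys) are rarely needed. */
    static bool lookup(const struct table *t, const char *key, uint64_t *out) {
        uint64_t h = hash_str(key);
        uint8_t f = (uint8_t)h | 1;                     /* crude way to reserve 0 for "empty" */
        for (size_t i = h & (t->cap - 1); ; i = (i + 1) & (t->cap - 1)) {
            if (t->frag[i] == 0) return false;          /* empty slot: key is absent */
            if (t->frag[i] == f && strcmp(t->buckets[i].key, key) == 0) {
                *out = t->buckets[i].value;
                return true;
            }
        }
    }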


Hash maps are such a fundamentally important data structure that it comes as a surprise that the Zig implementation is so broken. Good to see it’s getting fixed, but surprising that this wasn’t detected before.


My impression from the article is that Zig provides several different hashtables and not all of them are broken in this way.

This reminds me of Aria's comment in her Rust tutorial https://rust-unofficial.github.io/too-many-lists/ about failing to kill LinkedList. One philosophy (and the one Rust chose) for a stdlib is that things should only live there when they're so commonly needed that essentially everybody needs them, either directly or as shared vocabulary. By that standard HashTable, which is needed by so much otherwise unrelated software, qualifies; BloomFilter, while it's really useful for some people, not so much. Aria cleaned out Rust's set of standard library containers before Rust 1.0, trying to keep only those most people would need. LinkedList isn't a good general-purpose data structure, but it was too popular and Aria was not able to remove it.

Having multiple hash tables feels like a win (they're optimized for different purposes) but may cost too much in terms of the necessary testing to ensure they all hit the quality you want.


Linked lists are necessary when you have structures that can't be moved. Very important in a highly concurrent or lockfree environment.


How could you use the Rust standard library's linked list for that?


A deque will usually work just as well in that environment and it will always have better general performance characteristics. There are very few legitimate uses of linked lists.


A deque is a doubly-linked list under the hood.


Array based deques are very common (and usually much faster). Doubly linked lists are absolutely awful for memory locality.
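
For the record, a toy array-backed deque (ring buffer) showing there is no linked list under the hood; fixed capacity and no overflow checks, purely illustrative:

    #include <stdint.h>
    #include <stddef.h>

    /* A ring buffer over a contiguous array: O(1) push/pop at both ends, no
       per-node allocation, cache-friendly iteration. Capacity fixed at 16 to
       keep the sketch short. */
    struct deque { uint64_t buf[16]; size_t head, len; };

    static void push_back(struct deque *d, uint64_t v) {
        d->buf[(d->head + d->len) & 15] = v;
        d->len++;
    }

    static void push_front(struct deque *d, uint64_t v) {
        d->head = (d->head + 15) & 15;      /* step head back, wrapping */
        d->buf[d->head] = v;
        d->len++;
    }

    static uint64_t pop_front(struct deque *d) {
        uint64_t v = d->buf[d->head];
        d->head = (d->head + 1) & 15;
        d->len--;
        return v;
    }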


Can't you have a list of pointers instead?


Resizing/deleting items in a list is, in most implementations, either not lock-free or not thread-safe.


Yeah, but what if you just use smaller lists and point to the next/previous list?

Hold on...


Linked lists aren't either so....?


I personally have two habits that made me not notice:

1. I generally prefer ArrayHashMap

2. I tend to not delete things from hash maps


Not that many hashmaps see hundreds of millions of entries in their lifetime. Given that Zig isn’t widely used, it’s probable that no one has really stumbled upon this behaviour in a non-benchmark setting.


Actually, according to the issue, it seems that TigerBeetle (one of the bigger Zig projects out there) noticed this [1]. It's also on their issue tracker [2].

[1] https://github.com/ziglang/zig/issues/17851

[2] https://github.com/tigerbeetle/tigerbeetle/issues/1191


When you use a language that's in alpha- (maybe beta- now?) stage, this kind of thing should be expected. Even with the latest version of Zig, perfectly correct programs can segfault due to miscompilation, so performance issues are not even the biggest worry you should have.


One thing I really don't understand is how Bun already reached stability with its 1.0 release (https://bun.sh/blog/bun-v1.0) while being written in Zig, which still hasn't reached its 1.0 release.


If you test your software built on Zig close to 100% then you can be pretty confident you didn't run into any of the Zig compiler bugs. Bun has certainly run into them but they can almost always be worked around... if you work around all problems and manage to catch everything then yeah, you can have an application that's production worthy while being built on a compiler that's not quite there yet.

The problem is, of course, you almost certainly didn't test anything near 100% and probably not even everything users may do with your software, so it's a risky bet.


Versioning isn’t an exact science, and people use whatever they see fit, both for technical but also marketing purposes.


Bun is compiled to a binary, so not sure it matters how stable the underlying language is if Bun itself has a stable API?


That’s the point OP is making: even if the program is bug-free, given the compiler’s current state there’s a chance the compiled binary has bugs (due to miscompilation).

Now, that can happen with any compiler, but using one that's still in alpha significantly increases the risk.


Does it help at all that Zig is LLVM-based?


Not particularly. And Zig is moving away from LLVM anyway.


If we're talking SemVer, then 1.0 only means a commitment to a public interface. It doesn't mean the code has no more bugs and is ready for use in a spaceship.

Conversely, taking on a dependency prior to its 1.0 just means you're on the hook for dealing with API changes as they come up. But as long as the API you expose doesn't change, it's fine to be on 1.0 yourself.


Are there any links / examples for the miscompilations?



Thanks, seems like most are from LLVM or packed structs.


It's surprisingly common for fledgling programming languages… cough https://accidentallyquadratic.tumblr.com/post/153545455987/r...

(I know https://news.ycombinator.com/item?id=38138223, but at least I'm not the first one this time.)


Zig is rather new, so it doesn't surprise me.


So, in other words, a bug in a Zig library causes linear execution time on something that isn't supposed to be linear.


Factor looks cool, but what's it really about? Can someone who loves the language explain why?


I'm a bit biased as I've been working on and with Factor for 15 years and I am also the author of the linked article and "Re: Factor" blog.

I find Factor to have a compelling sweet spot of concise syntax, dynamic features, and (relatively) high performance. But in particular, I think we've done a good job with a few things:

1) Making the language very "clickable", so you can introspect and dig down into all function and type definitions.

2) Providing a lot of batteries included in the standard library.

3) Making it super easy to accept contributions: there's a part of the distribution that is "extra" and has a low barrier to entry, and we commit to keeping that code working and promoting things to the main standard library when they are useful/documented/tested enough.


I always suspected that Factor appeared as an attempt to relive the Forth magic, but at a higher level, more applicable outside of the embedded domain.



It's as flexible as Common Lisp or Smalltalk.


Not a good showcase of Factor code. All variable names are one letter, seemingly just consecutive letters of the alphabet, instead of names that would indicate what the code does.


I agree, I've written much better Factor code. This was mainly to get a matching test case working simply...


The title is a bit clickbait. This is about hashmaps specifically. Read, or at least scan through, the article: the author will submit a fix for the Zig implementation, resulting in Zig's hashmap being 50% faster than the Factor implementation.


I've edited the title to make that clearer. Thanks!


Why hashmaps? Python has named tuples.

Oh - this puts a lot of keys into one hashmap, not a ton of objects with the same keys. :)


Python also has dicts.


Of course.

I was just saying that a lot of the time hashmaps don't get very many entries.

There is serialization and deserialization, where Cap'n Proto, Flatbuffers, and Protocol Buffers have savings over JSON by not repeating the key names across a list of maps.


I fail to see how this is relevant to the article. A HashMap/dict can just as well store a mapping from an integer to an integer, or from a token to an integer, or whatever, and it has nothing to do with serialization, JSON inefficiency, or lists of maps.


Use the structure that best fits the problem. The saving to which you refer applies only to serialisation, not to the use of the structure by an algorithm.



