
Thanks for the benchmark references.

I would be curious what you think of my benchmarks of RH and BLP: https://github.com/senderista/hashtable-benchmarks/wiki/64-b.... BLP performance seems much more stable than RH in general. I haven't directly compared BLP to SIMD designs (since the impl is in Java), but would like to do so at some point. At any rate, I haven't yet found any reason to use RH rather than BLP (note that a BLP impl could store preferred bucket index, truncated hash code, distance from preferred bucket, etc., just like RH).




As I’m sure you know, benchmarking hash tables is pretty difficult because there’s many variables that affect their performance and it’s hard to cover all use cases. Let me start by explaining the benchmarks I linked you to earlier. Eventually they will be open-sourced and published as part of a comprehensive review of C (and some C++) hash tables, but that’s potentially months away now that there is a war in Gaza keeping me very busy in my day job.

The "uint32_t key, uint32_t value" benchmarks test how the hash tables perform when the hash function and key comparison function are inexpensive, traversing buckets is inexpensive (i.e. does not cause many cache misses), and moving elements is cheap. These benchmarks disadvantage tables that store metadata in a separate array (which here are Absl, Boost, Martinus, and Fastmap) because doing so necessarily causes at least one extra cache miss per lookup.

The "uint64_t key, 256-bit struct" value benchmarks test how the tables perform when the hash function and key comparison function are inexpensive, traversing buckets is expensive (a cache miss per bucket), and moving elements is expensive. These benchmarks disadvantage tables that don’t store metadata in a separate array (or do but access the buckets array with every probe anyway to check the key) and that move elements around a lot (e.g. Robin Hood).

The "16-char NULL-terminated string key, uint64_t value" benchmarks test how tables perform when the hash function and key comparison function are expensive. These benchmarks disadvantage hash tables that lack a (metadata) mechanism to avoid most key comparisons or do a lot of rehashing (this is where the performance of the liner-probing/back-shift-deletion tables goes nuclear if they’re not storing hash codes or home bucket indices).

As I mentioned earlier, in these benchmarks the max load factor is set to 95% (although the SIMD tables rehash early even after we modify the fixed max load factors hard-coded into the libraries). Measurements are taken at intervals of 50k. Each data point is the average of five runs. We can make the lines smoother, and therefore make the benchmarks more readable, by upping the runs to ten or more (although adding more runs hides the variability of the maps that suffer from this problem – e.g. notice how Khash’s plots are usually much more squiggly than those of the other tables?). This approach allows us to see the performance of each map across the whole spectrum of load factors from about 0.48 (the troughs) to 0.95 (the peaks).
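
To make the trough/peak arithmetic concrete, here is a small standalone sketch (it assumes power-of-two capacities that grow by doubling, which the individual libraries only roughly follow) of the load factor you land on at each 50k measurement point under a 0.95 cap:

    // Illustration only: assumes power-of-two capacities that double whenever the
    // 0.95 max load factor would be exceeded; the real libraries differ in details.
    public class LoadFactorSweep {
        public static void main(String[] args) {
            final double MAX_LOAD = 0.95;
            long capacity = 1 << 16;                                     // assumed starting capacity
            for (long n = 50_000; n <= 2_000_000; n += 50_000) {
                while ((double) n / capacity > MAX_LOAD) capacity *= 2;  // grow by doubling
                System.out.printf("elements=%,d  capacity=%,d  load=%.3f%n",
                                  n, capacity, (double) n / capacity);
            }
        }
    }

The printed load factors swing between roughly 0.48 (just after a table grows) and 0.95 (just before it grows), which is the range the plots cover.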

Of course, these benchmarks don’t cover all cases and combinations – just a hopefully representative sample. They also don’t show memory usage (in general, the SIMD maps have one or one-and-a-fraction of a byte of overhead per bucket, Fastmap has two bytes, and the Robin Hood maps should have anywhere from two to eight bytes). They also don’t show the effect of tombstones when we do lots of deletions (e.g. the “erasure” benchmarks don’t show why the tombstoneless Fastmap may be superior to Boost in this regard).

Now, on to your benchmarks, which I had a quick look at earlier and again just now:

The first thing that jumps out at me is that it looks like you’re only testing longs as keys. In other words, you appear to only be covering the first scenario - perhaps the ideal scenario for an open-addressing table - that I described above: the hash function and key comparison function are inexpensive, traversing buckets is inexpensive, and moving elements is cheap. Your results may well vary when you change any of these variables.

The second thing I’m concerned about is the way you handle load factor and measurement intervals. Take your “Average time for successful lookup (90% load factor)” benchmark, for example. Are you just setting the max load factor to 90% and then measuring at 10k, 100k, 1m, 10m, and 100m elements? If so, then you’re not measuring the tables at a 90% load factor – you’re measuring them at whatever their load factors happen to be at those intervals. This might explain, for example, why the measurement of the Robin Hood map at 1m is so radically different from its measurement at 10m in that benchmark, or why your plots look like bell curves in your “Average time for successful lookup (99% load factor)” benchmark. If I’m right, then I think you need to either use a much smaller measurement interval so that you actually capture the performance near the target load or benchmark the tables not at element-count intervals but when they approach the target load factor. If you do the former (as I do), then there might not be any need for the separate lower-max-load-factor benchmarks because the highest-max-load-factor benchmarks will – as I mentioned earlier – also show the performance at lower load factors.
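
To show why this would matter, here is a quick sketch of the load factors you would actually be measuring at those element counts if (and this is only my guess about your setup) the tables simply doubled their capacity whenever a 0.90 cap would be exceeded:

    // Hypothetical: assumes capacity doubles under a 0.90 max load factor.
    // This is a guess about the setup, not a description of the actual benchmark code.
    public class IncidentalLoadFactors {
        public static void main(String[] args) {
            final double MAX_LOAD = 0.90;
            long[] counts = {10_000, 100_000, 1_000_000, 10_000_000, 100_000_000};
            for (long n : counts) {
                long capacity = 1;
                while ((double) n / capacity > MAX_LOAD) capacity *= 2;
                System.out.printf("elements=%,d  capacity=%,d  load=%.3f%n",
                                  n, capacity, (double) n / capacity);
            }
        }
    }

Under that assumption, the 1m point would be measured at a load factor of about 0.48 while the 10m point would be measured at about 0.60, which would explain the jump between them.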

Of course, it’s nice that your horizontal scale is exponential, not linear like mine. For my benchmarks, I’m basically assuming that the trends we can clearly see for the 0-20m range will continue at higher element counts (I can’t see why they wouldn’t, and the difference made by slightly more cache misses as the element count grows should become less and less apparent at higher counts).

Anyway, that's my 2c - hope it's helpful. Let me know if you have any suggestions regarding my benchmarks. I followed you on GitHub so that I can share some more comprehensive (i.e. many more C and C++ hash tables) benchmarks with you once I have time to run the full suite again (it takes a few hours, and I only have one computer).


Thanks for the details on your benchmarks. I'd like to extend BLP to a more generic setting at some point; as I said, I think any trick used with RH would also work with BLP. I just used an integer set because that's all I needed for my use case and it was easy to implement several different approaches for benchmarking. As you note, it favors use cases where the hash function is cheap (or invertible) and elements are cheap to move around.
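
To spell out the "invertible" part, here is a toy illustration (my own example, not the hash the repo actually uses): a bijective mix such as multiplication by an odd 64-bit constant can be undone exactly, so an integer set can store the mixed value in place of the key and never needs to keep a separate hash code around.

    // Toy illustration of an invertible (bijective) integer mix; not the hash
    // used in the benchmark repo.
    public class InvertibleMix {
        static final long K = 0x9E3779B97F4A7C15L;   // odd multiplier, hence invertible mod 2^64

        // Modular inverse of an odd 64-bit constant via Newton's method; each
        // iteration doubles the number of correct low bits (3 -> 6 -> ... -> 96).
        static long inverse(long a) {
            long x = a;
            for (int i = 0; i < 5; i++) x *= 2 - a * x;
            return x;
        }

        static long mix(long key)     { return key * K; }
        static long unmix(long mixed) { return mixed * inverse(K); }

        public static void main(String[] args) {
            long key = 123_456_789L;
            System.out.println(unmix(mix(key)) == key);  // prints true
        }
    }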

About your question on load factors: no, the benchmarks are measuring exactly what they claim to be. The hash table constructor divides max data size by load factor to get the table size (https://github.com/senderista/hashtable-benchmarks/blob/mast...), and the benchmark code instantiates each hash table for exactly the measured data set size and load factor (https://github.com/senderista/hashtable-benchmarks/blob/mast...).
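
In other words, the sizing is just this arithmetic (a paraphrase of the idea; any rounding the real constructor does is omitted here):

    // Paraphrase of the sizing scheme described above; the real constructor is in
    // the linked repo, and its rounding details are omitted here.
    public class TableSizing {
        public static void main(String[] args) {
            long maxEntries = 1_000_000;                        // example data set size
            double loadFactor = 0.90;                           // target load factor
            long tableSize = (long) (maxEntries / loadFactor);  // table sized up front
            System.out.printf("entries=%,d  target load=%.2f  table size=%,d  actual load=%.3f%n",
                              maxEntries, loadFactor, tableSize,
                              (double) maxEntries / tableSize);
        }
    }

So inserting the measured data set lands the table at (essentially) the stated load factor.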

I can't explain the peaks around 1M in many of the plots; I didn't investigate them at the time and I don't have time now. It could be a JVM artifact, but I did try to use JMH "best practices", and there's no dynamic memory allocation or GC happening during the benchmark at all. It would be interesting to port these tables to Rust and repeat the measurements with Criterion. For more informative graphs I might try a log-linear approach: divide the intervals between the logarithmically spaced data sizes into a fixed number of subintervals (say 4).
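
Something along these lines (just a sketch of the idea; I use an even split of each interval here, though a geometric split would be the other obvious choice):

    // Sketch of the log-linear sizing idea: split each interval between the
    // existing logarithmically spaced sizes into a fixed number of subintervals.
    public class LogLinearSizes {
        public static void main(String[] args) {
            final int SUBINTERVALS = 4;                  // "say 4"
            final long LO = 10_000, HI = 100_000_000;
            for (long decade = LO; decade < HI; decade *= 10) {
                long step = (decade * 10 - decade) / SUBINTERVALS;  // even split of the interval
                for (int i = 0; i < SUBINTERVALS; i++) {
                    System.out.printf("%,d%n", decade + i * step);
                }
            }
            System.out.printf("%,d%n", HI);              // final endpoint
        }
    }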


I'll try to download and play around with your benchmarks when I have a chance. After reading your explanation of how you create the tables at the desired load factor, some of those plots definitely look rather odd to me. What I'd expect to see across all your benchmarks is a bunch of upward curves tapering off at the top (or perhaps just straight upward lines, given that your horizontal scale is exponential). Basically, the performance of a given table at the same load factor should be fundamentally similar irrespective of how many elements are in the table, except that the higher the count gets, the less frequently the table will benefit from incidental cache hits when consecutive lookups coincidentally hit the same part of the buckets array. You can see this in my benchmarks (except for the cumulative "Total time to insert N nonexisting elements" benchmarks and the iteration benchmarks, which are a whole other can of worms). Notice how my peaks grow higher with the element count but taper off on the right-hand side? In contrast, your plots' data points (which I think should be analogous to the peaks in my graphs) seem to jump around, with the high-element-count points often appearing lower than the low-element-count ones. This seems very unexpected.

Here's one idea: It's been many years since I touched Java. How does Java's garbage collector work? Does it kick in intermittently? Could the garbage collector be muddling your measurements, and if so, can it be disabled? Edit: Sorry, I just reread your comment and saw that you already addressed garbage collection.


Despite my disclaimer about GC (and my effort to use JMH properly), I find it difficult to trust microbenchmarks on the JVM. I don't know when I'll have time for this, but "someday" I'd like to port this whole codebase to Rust/Criterion (which should be straightforward because the algorithms/data structures are "trivial"), and see if the more surprising artifacts persist. I do find the overall differentiation between RH and BLP surprising; I expected them to have pretty similar performance profiles.

In any case, I would definitely appreciate someone else rerunning the benchmarks on a different platform/JVM!



