A new fast hash table in response to Google’s new fast hash table (probablydance.com)
390 points by chewxy on May 29, 2018 | 115 comments



Does this implementation stop on rehashing? We should stop seriously considering hash tables that stop for rehashing, because the latency means they can be used only for a subset of problems. This is one of the main reasons I'm now in love with radix trees and am trying to get rid of hash tables at every level in my code.


It's a bit funny to write off a hash table that is by design competing with another hash table by saying "it's not a different data structure, so I'm not interested."

Moreover, this kind of optimization only makes sense if (1) latency, not throughput, is your bottleneck, or (2) you're really concerned about parallel accesses.

A reasonable default in general is to use a hash table + mutex until that is your bottleneck. (I've yet to encounter a scenario where switching from a hash map would provide significant gains in performance-relevant code)

Disclosure: I work at Google and use the fancy hash map discussed by Matt. I didn't work on it, though.


That's not a very charitable interpretation of the GP. He's not criticizing hash tables because they aren't radix trees. He's criticizing this and other hash table implementations because they have undesirable performance characteristics with respect to his needs, and noting an alternative data structure which he has found to be more suitable. I think that's fair game.

Note that there exist hash table implementations which support incremental resizing.

I do agree, though, that it's a leap to go from "hash tables with this property are inappropriate under these constraints" to "hash tables with this property should not be considered further", precisely because there is such a broad spectrum of data structure requirements across the industry.


> Note that there exist hash table implementations which support incremental resizing.

Rather, there exist hash table algorithms that support incremental resizing.

http://www.cs.cmu.edu/afs/cs.cmu.edu/user/christos/www/cours...
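For a flavor of how that works, here is a minimal sketch of my own (not taken from the linked notes; it assumes std::hash<K> exists for the key type): keep the old and new bucket arrays alive at the same time and migrate a bounded number of buckets on every operation, so no single insert pays for a full rehash.

    #include <functional>
    #include <list>
    #include <optional>
    #include <utility>
    #include <vector>

    template <typename K, typename V>
    class IncrementalMap {
        using Bucket = std::list<std::pair<K, V>>;
        std::vector<Bucket> old_;                           // table being drained
        std::vector<Bucket> cur_ = std::vector<Bucket>(8);  // table being filled
        size_t size_ = 0;
        size_t migrate_pos_ = 0;                            // next old_ bucket to move

        size_t slot(const K& k, size_t n) const { return std::hash<K>{}(k) % n; }

        // Move at most `budget` old buckets into the new table per call.
        void step_migration(size_t budget = 2) {
            while (budget-- > 0 && migrate_pos_ < old_.size()) {
                for (auto& kv : old_[migrate_pos_])
                    cur_[slot(kv.first, cur_.size())].push_back(std::move(kv));
                old_[migrate_pos_++].clear();
            }
            if (!old_.empty() && migrate_pos_ == old_.size()) {
                old_.clear();          // migration finished
                migrate_pos_ = 0;
            }
        }

    public:
        void insert(K key, V value) {
            step_migration();
            if (old_.empty() && size_ >= cur_.size()) {  // start a new gradual rehash
                old_ = std::move(cur_);
                cur_.assign(old_.size() * 2, Bucket{});
            }
            cur_[slot(key, cur_.size())].emplace_back(std::move(key), std::move(value));
            ++size_;
        }

        std::optional<V> find(const K& key) {
            step_migration();
            for (auto* table : {&cur_, &old_}) {         // element may still be in old_
                if (table->empty()) continue;
                for (const auto& kv : (*table)[slot(key, table->size())])
                    if (kv.first == key) return kv.second;
            }
            return std::nullopt;
        }
    };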


Additionally, a hash table could be implemented in such a way that it could be updated during resizing. As long as you set up locks in the right places, you could spend the extra memory to do an out-of-place rehash and allow both concurrent writers and your rehashing thread to write to the new table. It's doable, if difficult and error-prone.


Radix trees are a very nice data structure. There is one drawback I never see in comparisons: they may require a non-constant amount of space for an insertion operation, i.e., they may introduce more nodes than the one being inserted, because nodes may need to be split. This means you cannot push the space allocation onto the caller of 'insert' (and the deallocation onto the caller of 'remove'), e.g. by (statically) preallocating a node cell in each item that can potentially be part of a radix tree.

This is important in environments where you don't have, or don't want to use, malloc/free, e.g. in embedded systems, which is exactly where properties like worst-case time behaviour matter and where, therefore, radix trees seem like an option: they still might not be.

Well, but hash tables are not an option either in these situations, so good old binary search trees win (red-black, avl, 2-4, whatever).


I've found myself reaching for radix trees more and more for the same reason you listed: competitive and predictable performance.

That said, I can see two counter-arguments:

1. Rehashing can be avoided if the table size is known in advance.

2. Rehashing is immaterial if the data structure is write-once, read-many.

In those cases, we might reasonably expect a hash table to perform better.
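On point 1, a minimal illustration with std::unordered_map (the same idea applies to most hash maps that expose a reserve-style call):

    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::string, int> counts;
        // Reserve buckets for the expected number of elements up front;
        // insertions up to that count will then never trigger a rehash.
        counts.reserve(1'000'000);
        counts["example"] = 1;
    }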


If the use case is write-once, read-many, you can simply build a perfect hash table. Using techniques like CHD the cost is IME usually in the range of 1x (no cost) to 3x, depending on how compact you want the table, how large it is, and how much time you want to spend up front.

Examples from my own small library (http://25thandclement.com/~william/projects/phf.html), using 1 million random integers:

  80% load -> 0.47s
  85% load -> 0.56s
  90% load -> 0.74s
  95% load -> 1.10s
  98% load -> 1.35s
  99% load -> 1.68s
If I hash the items as C++ string objects rather than uint32 scalars then the time is doubled, presumably because of the dynamic allocation. Which goes to show that the "perfect" in perfect hash table doesn't need to be the dominant cost.
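For a feel of why lookups stay cheap, here is a rough sketch of the lookup side of a hash-and-displace (CHD-style) table; this is my own illustration, not the phf API, and the mixing function is an arbitrary stand-in:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct PerfectHashIndex {
        std::vector<uint32_t> displacement;  // one seed per intermediate bucket, chosen at
                                             // build time so the final indices are collision-free
        size_t table_size;

        static uint64_t mix(uint64_t key, uint64_t seed) {
            uint64_t h = (key ^ seed) * 0x9e3779b97f4a7c15ull;
            return h ^ (h >> 32);
        }

        size_t index_of(uint64_t key) const {
            size_t bucket = mix(key, 0) % displacement.size();
            return mix(key, displacement[bucket]) % table_size;
        }
    };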

That said, 90% of the time I just use a simple red-black tree--a so-called intrusive implementation. It makes memory management much easier in C, and rare is the situation where a hash table (let alone a perfect hash table) can make a real difference relative to focusing on all the other inefficiencies.


> 1. Rehashing can be avoided if the table size is known in advance

With an open addressing implementation (Google's dense_hash_map, for example) this is only true if you rarely or never need to remove items from the map.


It does rehash when full. Presumably it would rehash less often than most because it remains stable at very high load factor.

https://en.wikipedia.org/wiki/Radix_tree looks like a funky structure for _very_ specific workloads.


Are radix trees always better? I doubt it.


Not always, but take a look at this recent comparison between modern radix trees and hash maps [1].

[1] https://bigdata.uni-saarland.de/publications/ARCD15.pdf


Not always better, but for most cases where people typically use a hash table it frequently is, assuming the radix tree design and implementation is excellent. There are still a few cases where well-designed hash tables are a better choice e.g. fixed size cache maps with unique keys. But to agree with the OP, there are very few hash tables in my software these days.

The downside of radix trees is that the design space is pretty exotic, especially for high-performance implementations. It is much easier to write a good hash table implementation than a good radix tree implementation.


Depending on implementation and usage, radix trees can get extremely segmented, slowing down significantly over time.


Would you mind elaborating? I'm not quite sure what you mean by segmented here.


I'm not thecompilr, but if I understand them correctly:

In the worst case, the cache behaviour of a radix tree could be awful, if each node lives in a different cache-line, no?

I imagine this problem can be ameliorated with a custom allocator, though.

Been a while since I've studied hash-maps in depth, but with a good hash function, that wouldn't be a big worry, would it?


Gotcha. Yes, while there are implementation techniques that can help (e.g. arena allocation, prefix compression, adaptive node sizes), the data structure is fundamentally a tree, and so one can expect a higher number of random memory accesses (when compared with suitably optimized hash tables, like those discussed in the article).

In some scenarios, radix trees can overcome that with better space efficiency (i.e. implicit key storage), but I think it's fair to say that there are many scenarios (if not most!) where hash tables outperform radix trees.

Of course, that all hinges on how you measure performance. As Salvatore pointed out, if you're looking to minimize variance, hash tables might not be suitable due to their growth characteristics. Radix trees also give you efficient sorting and range scans out of the box, and are a viable immutable persistent data structure as well. If none of those are meaningful to your use cases, then hash tables could certainly be a better fit.


Arena-allocation? As in (roughly speaking) a Stack<Node> data structure whose lifetime is bound to the lifetime of the tree, right?

How could we support deletion of nodes?


This is getting deep into some wizardry, but if you used a custom pool-like allocator you could group contents close together in cache, and, if necessary/appropriate, even rearrange the elements in the container to be more cache-friendly for your purposes.


You could, but the premise of using a tree was to avoid unpredictable rehashing latency; if you start compacting the tree every now and then, you basically pay the same price.


That approach tastes rather like a copying+compacting garbage-collector.


Right. When you ask "how could we support deletion of nodes", am I right to assume you mean "how could we reclaim memory of deleted nodes"?

Conventionally, a free list would be used, though admittedly that sacrifices locality which was one of the original motivating factors. To the extent that deletion is a significant component of the workload, then yeah, this doesn't help.


Yes, exactly. It would of course be trivial to unlink a node without reclaiming it, but then you're leaking.

Radix trees never rotate, right? Can that be leveraged to help with locality?


Thank you. Exactly. Even worse, each node could live in a different page. Hash maps can span multiple pages, but the lookup should succeed in one or two tries with a good hash function.


Try these string keys:

    { i * "a" : i = 1 to n }
(also great for breaking string sorts)
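A quick way to generate them (my sketch); key i is just 'a' repeated i times, so every key is a prefix of the next:

    #include <cstddef>
    #include <string>
    #include <vector>

    std::vector<std::string> nested_prefix_keys(std::size_t n) {
        std::vector<std::string> keys;
        keys.reserve(n);
        for (std::size_t i = 1; i <= n; ++i)
            keys.emplace_back(i, 'a');  // "a", "aa", "aaa", ...
        return keys;
    }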


I am essentially a hash table fetishist, and I'm a little bummed out that this doesn't go into more detail about how the damn thing works. How do you implement a chaining hash table that uses a single flat array?


Like a FAT format but with relative indices. It should be pretty trivial pointer arithmetic.

Edit: bytell_hash_map.cpp:464 bytell_hash_map.cpp:58

Apparently there's one extra layer of indirection via a lookup table to get the final offset.

So basically the metadata byte is a bitfield union of a status flag plus either a status code (not sure it's actually required) or an offset enum. The offset enum then indexes into a lookup table, the result of which is the relative offset from the current index.
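A conceptual sketch of that layout as I read it (field names, bit split, and jump distances are mine, not the actual bytell_hash_map constants):

    #include <cstddef>
    #include <cstdint>

    // One metadata byte per slot: the top bit records whether the slot holds the
    // head element for its own bucket, and the low bits select an entry in a small
    // table of precomputed relative jump distances.
    struct SlotMetadata {
        std::uint8_t bits;
        bool is_bucket_head() const { return (bits & 0x80) != 0; }
        std::uint8_t jump_index() const { return bits & 0x7f; }
    };

    // Hypothetical jump table: chain traversal is then plain index arithmetic over
    // one flat array, e.g. next = current + kJumpDistances[meta.jump_index()].
    constexpr std::ptrdiff_t kJumpDistances[] = {0, 1, 2, 3, 4, 5, 8, 13, 21, 34};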


He mentions he'll update the blog post with a link to the talk he gave once it's online


You can look at the source in his github repository: https://github.com/skarupke/flat_hash_map


I looked through the code and o_O there is a lot of C++ boilerplate. The core part of it is not really documented well, and there is this mysterious jump bits array... if I had more time, I could figure it out. But that's the point of a writeup: explaining things with a diagram, etc.


> I looked through the code and o_O there is a lot of C++ boilerplate.

This boilerplate allows me to choose between std::unordered_map and flat_hash_map throughout my code just by toggling a compile-time switch. It ensures that the data structure has the same API as all other C++ containers.


I understand what it is for. YMMV but for me C++ is pretty much a constant mental stack overflow when reading the guts like this. It seems like C++ libraries are getting more and more configurable in a way that hurts readability and understandability and library designers don't seem to recognize there is a fundamental tradeoff here.


I actually have separate implementations of an algorithm with and without templates; the template one is for benchmarking with different combinations of options; the non-template one is for readability, and it's only about 60 lines.


> It seems like C++ libraries are getting more and more configurable in a way that hurts readability and understandability and library designers don't seem to recognize there is a fundamental tradeoff here.

well, I mean... if it was not configurable, it would not really be useful, except for toy examples, and for those the default std::unordered_map is more than sufficient. To be useful, a hash table implementation at least needs to let you configure the hash function and the comparison operator.


To be honest, when it comes to C++ boilerplate, I've seen much, much worse. At a first glance the code looks somewhat clean.


That's not the same as having a high-level description of the concepts used. I can still dive into the details later.


You do hash % bucketCount to find your bucket/slot. Each array element is also a linked list node (with a next pointer), so if the slot is populated you follow the linked list to its end, place the new element in an empty slot, and attach it to the end of the chain.

I figure it works best if you over allocate slots, so that on average you only have to do very few hops on each linked list walk (or no walk at all).

Aside: This is the method employed by .NET's Dictionary<K,V> class.

Further - it's generally faster to use structure of arrays rather than array of structures, thus in the above scheme the dictionary items/payloads and the linked list pointers could be in separate arrays.
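A rough structure-of-arrays sketch of that layout (indices instead of pointers; this is my own illustration, not the actual .NET source, and it omits removal and growth):

    #include <cstddef>
    #include <functional>
    #include <optional>
    #include <string>
    #include <vector>

    struct FlatChainedMap {
        static constexpr int kEmpty = -1;
        std::vector<int> bucket_heads;      // bucket -> index of first entry, or kEmpty
        std::vector<std::string> keys;      // parallel arrays: one entry per element
        std::vector<int> values;
        std::vector<int> next;              // index of the next entry in the same bucket

        explicit FlatChainedMap(std::size_t bucket_count)
            : bucket_heads(bucket_count, kEmpty) {}

        void insert(std::string key, int value) {
            auto b = std::hash<std::string>{}(key) % bucket_heads.size();
            keys.push_back(std::move(key));
            values.push_back(value);
            next.push_back(bucket_heads[b]);  // prepend the new entry to the bucket's chain
            bucket_heads[b] = static_cast<int>(keys.size() - 1);
        }

        std::optional<int> find(const std::string& key) const {
            auto b = std::hash<std::string>{}(key) % bucket_heads.size();
            for (int i = bucket_heads[b]; i != kEmpty; i = next[i])
                if (keys[i] == key) return values[i];
            return std::nullopt;
        }
    };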


Sort of off topic: In theory hash tables are supposed to be O(1) (under the RAM model). But looking at these graphs, they seem to grow linearly after they stop fitting in the cache, which is especially apparent on the unsuccessful lookup graph. Since it's on a log-scale graph, O(log(n)) would probably fit better.

Just thought it's an interesting divergence between what's taught in school and in practice. I wonder if there's something that's more predictive than big Oh for these types of analysis.


The biggest divergence between what's taught in school and what happens in practice is that the Big O limits you saw in school use a different abstract machine relative to what we use today, in practice. Specifically, I'm talking about the assumption that random memory access is O(1), where in reality modern machines' cache hierarchy makes it behave more like O(log n) or O(sqrt n) or something to that effect.


Is that really true, though? It's certainly the case that a cache miss takes longer than a cache hit, but the time for a cache miss doesn't depend on how much memory you have. Or does it? Like, a cache miss isn't going to be slower if you have 64 gigabytes of memory compared to having 1 gigabyte of memory, it's just a cache miss. That makes random memory access O(1) in the sense of complexity analysis.


There is typically L1 cache, L2 cache, L3 cache, RAM, and swap. These range from KB to MB, and then to GB, and each level in that list is larger and slower than the one before it.

Most of the relevant memory here is L2 and L3. As you have more items in memory, the statistical likelihood that the page you're requesting will be in an efficient cache level decreases. Eventually you have so many items that you're statistically only getting the performance of RAM (let's ignore the swap case)... that's the point where you'll get closer to a flat `O(1)`... but it will be a lot slower than the `O(1)` with more cache hits.


All true; to add to this, there are all sorts of other caveats on modern processors. E.g. it may be slower to access any of L1-L3 or RAM depending on which processor core you're on.

This is because even though your program can transparently access all that memory, the underlying machine is implemented in terms of memory areas and it can be costly to "jump across" a memory boundary[1][2][3].

1. https://linux.die.net/man/8/numactl

2. https://linux.die.net/man/1/taskset

3. http://iopscience.iop.org/article/10.1088/1742-6596/664/9/09...


More RAM doesn't mean more cache misses, but more elements in your data set does — "n" isn't the amount of memory in your system, it has the same meaning here as it does everywhere else in complexity analysis.

Small datasets fit in L3, smaller datasets fit in L2, and if they're smaller still they'll fit in L1, all of which are progressively faster. If your data set is bigger than L3, you need to hit RAM, and bigger still you'll need to hit the SSD or HDD — going progressively slower.


I mean, yeah, obviously, I'm not disputing that, but the point is that it's still "constant time", it doesn't scale with the amount of memory.

My point wasn't that it doesn't get slower if you have larger datasets: of course there's a change when you go from cache-hits to cache-misses. My point was that trying to say "it has higher complexity than O(1)" is an inaccurate way of describing that slow-down. "O(1) random memory access" doesn't mean "fast memory access", it means "memory access where the time is bounded by a constant".


That's because people use big-oh notation when they really mean something closer to big-theta and are looking for a tighter average-case bound (e.g. any graph where you benchmark 'average'-case data).

E.g. for almost any commonly used algo you could say it is O(e^n) and you would be technically correct as far as what big-oh demands.

O(1) is indeed an upper bound on memory access, but in the presence of cache it is not the tightest possible bound, hence one reason for divergence against numerical estimates given by rough big-oh bounds.


> O(1) is indeed an upper bound on memory access, but in the presence of cache it is not the tightest possible bound, hence one reason for divergence against numerical estimates given by rough big-oh bounds.

Quite the opposite — O(1) is in fact too tight.


We're really talking about a system with finite memory, and taking O(1) to be equal to log(n) when n is such that the data entirely fills the memory. Under these conditions, O(log(n)) gives a tighter and more accurate bound, but at the expense of not being strictly correct (i.e., it is not truly an asymptotic bound.)


It all depends on what you want your constant to be.

If your constant is "the worst-case time it takes to swap in the memory from my disk", and your dataset is guaranteed to fit in local disk then memory access is in fact O(1). With a really bad constant. But in practice people care about how much better than that you can do...

Now technically the "guaranteed to fit in local disk" assumption means you're really not considering asymptotic behavior. But then the question is what one really means by asymptotic behavior in this case. What do you consider your "memory access" latency to be once your dataset is too big for the sum total of all existing storage media?


Yes. Here is a four-part series exploring RAM models: http://www.ilikebigbits.com/blog/2014/4/21/the-myth-of-ram-p...


You should see it from the other side: a cache miss takes longer the bigger the data you have to fit in your cache (as you go through bigger and slower cache levels).

So, as your N (length of data) changes, access time changes too.


You just need to use a machine model that takes into account the cost of address translation and the TLB. Check out https://arxiv.org/abs/1212.0703 for some introduction to the topic.


You should think of memory as fixed-length contiguous blocks of different sizes.

Inside the processor there are cache lines (usually 64 bytes). They are blocks of memory that are tagged by the CPU at once and moved together.

In the typical n-way set-associative cache architecture, main RAM is divided into blocks, and each set in the cache can hold up to n lines from the same memory block. An 8-way cache would divide 1 GB of RAM into 1,024 blocks of 1 MB each. If you work with more than 512 bytes (= 8 x 64) at a time within one such block, there will be cache misses. In other words, CPU caches have a limited number of cache lines dedicated to each large contiguous block of RAM (unless they are fully associative caches).

From the CPU to DRAM there are typically 64-byte and 4096-byte regions with range-crossing penalties. I think the 64-byte crossing penalty is typically less than 10 cycles, and the 4096-byte crossing penalty is several tens of cycles (this is on top of the penalty of accessing DRAM).


Interestingly, in Java 8 they changed HashMap buckets to hold a binary tree instead of a linked list, changing worst-case lookup from O(n) to O(log n).

https://stackoverflow.com/questions/24673567/change-to-hashm...


To be more accurate, binary trees are constructed only after a bucket ends up containing more than a handful of elements, which should only happen rarely unless a bad hash function is chosen.


That's defending against hash collisions which isn't what the grandparent is noticing (the slowing down of memory accesses due to not fitting into cache).


I think they implemented that change to deal with a denial of service attack in http servers. A malicious client could send a large list of parameters with an identical String::hashCode() value, resulting in a single large bucket. Other workarounds included a size limit on the parameter list or a hash function with some parameters randomized at program start.


FYI, this only happens if the keys implement Comparable; if they do not, the buckets will not be converted to binary trees.


Big O is about doing the big picture high level analysis without wasting time on every little detail like the complexity of addition and multiplication. High performance implementation on the other hand is all about twiddling the constant factors in your algorithm(s) for your specific hardware architecture/topology.

See:

http://kaldewey.com/pubs/FAST__SIGMOD10.pdf


The parent's claim implies that on a machine with an abstract cache, the complexity of hash maps could be proven to be something else. You may be able to recover the right behavior by modeling a machine with infinite memory, divided up into an infinite tower of caches where each level is P times bigger and Q times slower.


You definitely have to take the time complexity of addition/multiplication into account when doing big O analysis. But in most cases they are O(1). It is the constant factors you don’t worry about.


Yeah, but the same could be said about memory accesses. You can always pessimize and assume every access misses the LLC, at which point it becomes a constant factor.


Yes, you also have to take memory access into account if you are doing a proper big O analysis, but not for the reasons you implied. I don't think you are thinking about time complexity correctly. The question is not how long a memory access takes, but how it scales as the number of bits to be fetched increases. The naive answer for a von Neumann arch is O(N) where N is the number of bits, although the answers here regarding the speed-of-light bounds are interesting.

For addition and multiplication the current best time complexity is some long term with logs and other terms I can’t remember, but it is not constant.

However, in most cases such as sorting involving memory accesses or mult/addition we are using a fixed bit size, so we can correctly think of them as being O(1).


> divergence between what's taught in school and in practice

I remember this sort of thing being mentioned but not in as much detail as I would have gone into (though I'd self-taught a lot of this before Uni, so for me it was more an exercise in unlearning mistakes and formalising & rounding off knowledge than it was about taking in new concepts, perhaps for others cramming the extra detail in would have been counter productive). Algorithm courses would say things like "of course caching methods should be taking into consideration, you'll cover this in architecture modules" and architecture modules would say "you'll cover the effect on some algorithms in much more detail in your software engineering modules"!

All the JS sorting demos that were an active fad a short while ago were simplified in the same way: they assumed that either a comparison or a swap was always the most expensive operation and the other can be discounted, or that all swaps/comparisons are equal due to uniform memory speed (no accounting for caching), and no consideration was made for what was being sorted (inserts are much more expensive in an array than a linked list for instance). I keep meaning to find the time to put a less superficial demo together that allows such things to be modelled at a basic level, perhaps even trying to illustrate the effect of a more distributed architecture (where there is more than one processing unit involved and everything isn't in local fast memory).

Of course, for the most part these simplifications are perfectly valid, but they should always carry a caveat noting what simplifying assumptions are being applied so people learning know that the issues could be more complicated so some critical thinking should be applied in any real world situation.


> Just thought it's an interesting divergence between what's taught in school and in practice.

Not really, no. This isn't a matter of complexity theory being less useful than promised, it's just a matter of understanding what it tells you.

In complexity theoretic terms, platform-specific micro-optimisations are constant factors. They can have a real impact, but they're only worth worrying about after you've got good complexity.

Bubble-sort in fine-tuned SIMD assembly code will out-perform bubble-sort in Python, but will be far slower than a Python quicksort (except for small inputs).

Notice that we're discussing which hash-map to use, as opposed to which data-structure to use in the first place. Their basic time-complexities are the same, so we have the luxury of worrying about the constant factors.

Linear-scan dictionaries can't compete here, no matter how well micro-optimised. (Again, except for very small inputs.) Complexity theory tells us why.

Additionally, complexity theory is useful when reasoning about large problems (high values of n), but it doesn't tell you how to do the platform-specific micro-optimisations that are important when attacking small problems. Implementation details will dominate there, rather than the number of algorithmic operations.

Anyone who teaches complexity theory without explaining these points, has done their students a disservice.

> I wonder if there's something that's more predictive than big Oh for these types of analysis.

When it comes to real-world performance, there's no substitute for just measuring real-world performance. Detailed analysis of things like cache behaviour, and behaviour under highly concurrent access, could be neat though.

Edit For completeness, I should mention that in extreme cases, there are indeed algorithms with superior complexity which are in practice completely useless, as the break-even point is so awfully high that they could never be useful. Matrix multiplication, for instance, where the algorithms with the best time-complexity are quite literally never useful. https://en.wikipedia.org/w/index.php?title=Computational_com...

Perhaps that contradicts my whole point :-P

The other limitation of complexity theory is that in practice we often care about average-case performance, whereas theorists tend to care more about worst-case performance.


The increasing latency [1] of access to larger amounts of storage isn't a constant factor, though, even asymptotically. There was a fun post on HN a while ago arguing that random access latency to N bytes is asymptotically O(N^0.5), from both empirical data about real systems and theoretically from the Bekenstein bound and the speed of light. And even if you don't buy that it clearly can't be less than O(N^1/3). Don't go preaching this in job interviews, though :-)

The truth is that complexity theory is always done relative to some abstract machine model, and as with all models it's up to you to determine whether the model is a good fit for your practical reality. "To the extent that mathematics reflects reality it is not certain, and to the extent that it is certain it does not reflect reality." There exist abstract models, like the cache-oblivious model, that do a somewhat better job of grappling with this issue but don't require you to just throw up your hands and rely solely on benchmarks of real hardware. There is probably room for new theory in this area.

[1] not throughput, which isn't as hard to scale up


The article in question: https://news.ycombinator.com/item?id=12383012

It's fun to think about, but the presentation played fast and loose with a few concepts. Sparked some lively discussion because of that. I really like the idea of needing to count random memory accesses as costing more than sequential accesses when analyzing algorithms in a big-O-type notation.


back when HDDs were a thing, and sequential accesses were much, much faster than random accesses.

So people developed algorithms that maximised locality of reference, like B-trees. An interesting line of work in this vein is that of cache-oblivious algorithms :)


> even if you don't buy that it clearly can't be less than O(N^1/3)

Maybe I'm missing something, but why's that?

To the rest of your comment: all good points.


Well, if with your best technology you can pack B bits of information in 1 m^3 of space, and you need to store N bits of information, you are going to need (N/B) m^3 of space to store it. And then the speed of light limits you to around (N/B)^(1/3)/c seconds of latency to get to the farthest bits.

(This ultimately violates the Bekenstein bound - as your server cube grows it will eventually collapse into a black hole - but this is not really a concern with forseeable technology! More practically, though, Earth's gravity makes it hard to build very high in the air, which limits you to O(N^1/2), and things like cooling and maintenance access probably impose a limit somewhere in between.)
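Written out (same argument as the paragraph above, with B the achievable bit density and c the speed of light):

    \text{latency}(N) \;\gtrsim\; \frac{1}{c} \left(\frac{N}{B}\right)^{1/3}
    \quad\Longrightarrow\quad
    \text{latency}(N) = \Omega\!\left(N^{1/3}\right) \text{ for fixed } B \text{ and } c.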


Makes sense, thanks.


Imagine your bits stored in a sphere around your CPU. Take into account the time light needs to reach the bit you want to touch.


The performance gap between cache-hot lookups and lookups that miss can be 10x. The increase is more a reflection of that than of doing more operations.

The RAM model doesn't concern itself with memory hierarchies or slow ops (i.e. divisions are assumed to be as fast as additions).

For example, if you were benchmarking cuckoo hashing lookups, the results would look similar even though the number of operations is deterministically O(1) rather than expected O(1).


It's not a log-scale graph for the things you talk about; the y-axes are linear scale. So for an O(1) algorithm, time has an upper bound and plateaus once caches stop being helpful and every operation has to go all the way to RAM.


I really liked this series: http://www.ilikebigbits.com/blog/2014/4/21/the-myth-of-ram-p...

While, as others have said, big O and actual execution are different things, in practice the article makes a lot of sense when evaluating execution.


If we were talking about the worst lookups, then even in the RAM model they would become slower as the map grows.

To be more specific, if you place n balls into n bins, on average each bin will contain 1 ball, but the largest one, with high probability, will contain O(log(n)/log(log(n))) of them.


can any hash table whose hash function operates over the entire string ever show O(1) time?


The O(1) amortized access of a hash table is for the number of entries, not key hashing or comparison. Doubling the number of entries does not double the retrieval time.

As a hash table gets larger, the cost of key hashing (for the same domain of keys) does not increase. Of course hashing and comparison performance can still be important, it's just not what is generally being analyzed with basic complexity theory.


I guess I get the idea that the O(1) is for the number of entries, but in my experience 99% of prod time is spent hashing keys on amortized-O(1) systems, so focusing on the basic complexity theory would seem to leave a lot of performance on the table and also contributes to confusion.


> can any hash table whose hash function operates over the entire string ever show O(1) time?

If we limit the length of the hash key, the time taken by the hash function is also limited and turns into a constant. Big-O is all about asymptotic behavior.

OTOH noticing how the hash function's run time depends on the size of the key can be separately useful. I suspect all such functions are linear in practice, though.


I'm talking about functions where the entire string is used; that's necessary on the internet when dealing with billions of strings that are similar to each other.


One thing to think about is that accessing a byte of memory in a system with n bytes is pretty much guaranteed to be at best a log(n) operation. After all, you need a log(n)-sized address to index the memory, and you need to read every single one of the bits in that address to do the indexing.

In practice on any particular machine all the addresses are the same length (and we have finite memory), so it doesn't quite explain the effect you are observing in the article. But it does hint that it makes sense.


I thought a hash table has O(n) access.


Average O(1), worst case O(n)

So worse than a B-tree or a Red-black tree for the worst case.


"Average O(1)" is an oxymoron. Big Oh is about the worst case. The only thing that makes sense is "Amortized O(1)", but I believe its possible to create access cases where hash tables are slower.


Only some primitive hash table algorithms have O(n) worst case access. No one implements those algorithms anymore though.


Big-O specifically refers to the worst case; in practice, though, people use it to describe either average-case or worst-case scaling.


I personally prefer Judy arrays: http://judy.sourceforge.net/


From http://judy.sourceforge.net/doc/10minutes.htm :

> From a speed point of view Judy is chiefly a 256-ary digital tree or trie (per D. Knuth Volume 3 definitions). A degree of 256-ary is a somewhat "magic" N-ary for a variety of reasons -- but mostly because a byte (the least addressable memory unit) is 8 bits. Also a higher degree means reduced cache-line fills per access. You see the theme here -- avoid cache-line fills like the plague.

Sounds like a neat data-structure made with cache behaviour in mind. I'm not getting much out of the creator's attempts to explain it, but there's a Wikipedia article: https://en.wikipedia.org/wiki/Judy_array


Where's the source of this newer Google table? I can't find it, nor any announcement of it.


It is not open-sourced yet, but a presentation was given at CppCon 2017.


Here: https://m.youtube.com/watch?v=ncHmEUmJZf4

Recommended watch. Matt is a fun presenter.


But ... how does it appear in the benchmarks of this article?


The author claims to have written an implementation of it themselves.

From the article:

> This last one is my implementation of Google’s new hash table. Google hasn’t open-sourced their hash table yet, so I had to implement their hash table myself. I’m 95% sure that I got their performance right.


I wonder how it compares to Google's "Learned Index Structures" https://arxiv.org/abs/1712.01208


I don’t think there’s much to write home about RE: Learned Index Structures since classical structures can soundly outperform at least the trumpeted Google “success story” [1]. It’s hype, not substance.

[1]: https://dawn.cs.stanford.edu/2018/01/11/index-baselines/


Completely disagree. It is NOT simply about the results but about the direction and a new approach.

Ultimately it also comes down to the power required to get some task done.

Also how can a paper submitted to NIPS be hype?


As an idea and an application of machine learning, it’s important and worth exploring. Claiming it’s better is false. That’s a distinction worth making.


Completely different - that paper applies to comparison based indexing.


Great link. I will be curious to see if the Google, Jeff Dean, approach works in real life situations.

What is interesting is you can use the TPUs and parallelize the approach.

Ultimately it is about using less power to get some work done.


Would be interesting to have comparisons with Rust's std hashmap and indexmap (formerly ordermap) as well.


Props, this is an amazing contribution. So much respect for someone who can really dig in, figure this stuff out, and make a better data-structure for nothing more than the love of doing so. Everyone making their living as a programmer/tech person owes it to people like this who make the basic tools.


His ska-sort (a really fast in-place radix sort generalisable to almost all data) is also very interesting; I'm surprised there aren't more people working on the data structures he builds to make little improvements.

I may actually have figured out a way to (theoretically) make it faster, but I don't grok C++ templates so can't write a pull-request (the general idea is: instead of a standard full swap, use two temporaries and ping-pong between them while swapping elements to do half-swaps. This reduces required assignments to move an element to its final position from three to two).
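Roughly what I mean, sketched outside the template machinery (my own pseudocode-ish C++, not ska_sort's actual code; it assumes the element type is movable and default-constructible, and destination_of is a hypothetical callback returning an element's final index):

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Place one permutation cycle starting at 'start'. Each element is assigned
    // exactly twice (array -> temporary, temporary -> final slot) instead of the
    // three assignments a swap-based loop spends per element.
    template <typename T, typename DestFn>
    void place_cycle(std::vector<T>& data, std::size_t start, DestFn destination_of) {
        T temp[2];
        temp[0] = std::move(data[start]);
        int in_flight = 0;                                 // which temporary holds the element being placed
        std::size_t dest = destination_of(temp[in_flight]);
        while (dest != start) {
            temp[1 - in_flight] = std::move(data[dest]);   // displaced element: its first move
            data[dest] = std::move(temp[in_flight]);       // in-flight element lands: its second move
            in_flight = 1 - in_flight;                     // ping-pong between the two temporaries
            dest = destination_of(temp[in_flight]);
        }
        data[start] = std::move(temp[in_flight]);          // close the cycle
    }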


To be fair, the max load factor for this hash map is set to 0.5, which is quite wasteful in terms of memory. If you allow for 50% load in dense_hash_map (and most hash maps, for that matter), they become much faster themselves.


The hash map the author wrote uses 0.95; it's the Google ones that have a 0.5 load factor.


Is this true? From the article: "ska::flat_hash_map has a max_load_factor of 0.5". ska::flat_hash_map is the author's implementation (but maybe not the most recent iteration?)


Right, ska::bytell_hash_map is the most recent iteration and ska::flat_hash_map was the previous one.


I like this work. We should be seeing more exploratory posts like this one.

I find it interesting that both Google's and skarupke@'s hash maps have exactly the same name: flat_hash_map. Maybe Boost-inspired?


This post has some overconfident and under-researched claims:

> This came about because last year I wrote the fastest hash table (I still make that claim)

I'm extremely dubious of claims of the "fastest $DATA_STRUCTURE" without qualifiers about workload, hardware, and time/space trade-off. In particular, the last graph of this post shows some workloads (400k int keys, int[16] values, unsuccessful lookups) where the author's implementation of Swiss Table (as Google called their table) is 3x-10x slower than "bytell_hash_map", the author's own design.

> google_flat16_hash_map. This last one is my implementation of Google’s new hash table. Google hasn’t open-sourced their hash table yet, so I had to implement their hash table myself. I’m 95% sure that I got their performance right.

The talk announcing Swiss Table clearly didn't have enough detail to fully replicate their results. This 95% claim is rather astonishing in light of that.


Hmm, I'm seeing the opposite. The last graph shows Google's hash table outperforming in all but a few places, including at 400k. Perhaps you mixed the axes up; lower is better here.

I also just want to comment that skepticism to high claims is good but your comment seems a bit too harsh. Briefly looking at his post where he goes into his “fastest hash table” it’s quite long, very detailed and has similar graphs. That doesn’t seem under-researched to me.


> Hmm I’m seeing the opposite. The last graph shows Google’s hash table outperforming in all but a few places, including at 400k.

You're exactly right, General Pizza. I was trying to say that the Google table was faster, even though the author claimed that his or her table was "fastest" without qualification.

Thanks for pointing that out; I'll leave my post as it is so yours continues to make sense in context.


meta: classic HN, the top post is a Negative Nancy regardless of how impressive the new thing is. Is there a name for this phenomenon yet?


I know the type of post you talk about - dismissive easy critiques like "correlation does not.." - but GP's remark is not quite like that:

- it states the article has some overconfident claims.

- it gives detailed arguments for why

> Is there a name for this phenomena yet?

"bachelors' wives and maidens' children are well taught"?


"Critical thinking"?


No, and there should not be. There is unjustified critique and there is justified critique.



A healthy dose of skepticism



