
You are definitely right for most hash tables. Determining the right bucket is O(1), but there will be collisions. Open hashing solves it by creating a linked list within each bucket, which you then have to traverse. Closed hashing solves it by probing: you may have to test several buckets before finding the right entry. In both cases, you get O(log(n)) as the average lookup performance.
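Roughly, the two lookup strategies look like this (a simplified Python sketch; the function and variable names are mine, not from any particular implementation):

    # Open hashing (separate chaining): each bucket holds a list of (key, value)
    # pairs that a lookup has to walk.
    def chained_lookup(buckets, key):
        bucket = buckets[hash(key) % len(buckets)]   # O(1) to pick the bucket
        for k, v in bucket:                          # cost depends on chain length
            if k == key:
                return v
        raise KeyError(key)

    # Closed hashing (here: linear probing): step through slots until the key
    # or an empty slot is found. Assumes the table is never completely full.
    def probed_lookup(slots, key):
        i = hash(key) % len(slots)
        while slots[i] is not None:                  # probe count depends on clustering
            if slots[i][0] == key:
                return slots[i][1]
            i = (i + 1) % len(slots)
        raise KeyError(key)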

You actually get O(1) with perfect hash functions, but that requires choosing the hash function based on the data set. This is normally only feasible for precalculated hash tables, not for tables that are filled dynamically.
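A toy way to build one for a fixed key set is to brute-force a salt until no two keys collide (illustrative sketch only; real perfect-hash constructions are far more clever, and Python's string hashing is randomized per process, so this only "precalculates" within one run):

    # Brute-force a "perfect" hash for a fixed key set: find a salt for which
    # every key lands in a distinct slot, so lookups never collide.
    def build_perfect_hash(keys):
        n = len(keys)
        for salt in range(1, 100000):
            if len({hash((salt, k)) % n for k in keys}) == n:
                return lambda k, s=salt: hash((s, k)) % n
        raise ValueError("no collision-free salt found in range")

    slot_of = build_perfect_hash(["foo", "bar", "baz"])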




The average lookup performance on a hashtable is definitely not O(logN) -- it's amortized O(1). Just Google for "hash table performance" and observe that all the results are saying O(1), not O(logN). I'd like to request some reputable sources from you showing O(logN).

The modal number of buckets you have to look into (or the modal size of the linked list) is 1. If it's more than 1, then your hash table is too full, and you need to enlarge it. This is admittedly an expensive operation, but it still amortizes out to O(1). And if you know in advance how many elements will be in a hashtable (this is common), then you never even need to resize.
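Something like this (a minimal sketch of the usual grow-on-load-factor policy, using chaining and skipping duplicate-key handling; the names are made up):

    class Table:
        def __init__(self, capacity=8, max_load=0.75):
            self.buckets = [[] for _ in range(capacity)]
            self.size = 0
            self.max_load = max_load

        def insert(self, key, value):
            if self.size / len(self.buckets) >= self.max_load:
                self._grow(2 * len(self.buckets))    # rare and O(n), amortizes to O(1)
            self.buckets[hash(key) % len(self.buckets)].append((key, value))
            self.size += 1

        def _grow(self, new_capacity):
            old = [kv for bucket in self.buckets for kv in bucket]
            self.buckets = [[] for _ in range(new_capacity)]
            for k, v in old:                         # rehash everything into the new array
                self.buckets[hash(k) % new_capacity].append((k, v))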


It's important to distinguish between graphs of performance as measured on real hardware and theoretical results from computational complexity theory. It might be clearer to say "near-constant time over the region of interest" if you're talking about the former because "O(1)" has a precise meaning and refers only to what happens in the limit. Real programs usually only work for "small" n: they stop working long before the size of the input reaches 2^(2^(2^1000)), for example. But whether the algorithm is O(1) or not depends only on what happens when the input is bigger than that.

Another way of putting it: if you modify your practical program so that it runs on a theoretical machine that handles any size of input then you will probably find that another term appears in the execution time, one with a "log" in it, and that term eventually dominates when n gets big enough.

Yes, I am being a bit pedantic. But "O(1)" is mathematical notation and you have to be pedantic to do maths properly.


It should be obvious that hash table lookups are above O(1) as long as collisions exist. Tweaking the load factor can reduce the probability of collisions, but that's merely a constant factor on the lookup complexity. As to whether O(log(n)) is a good approximation for the average, we can certainly discuss that. It depends a lot on the algorithm, but https://probablydance.com/2017/02/26/i-wrote-the-fastest-has... shows (first graph) that the reality is typically even above logarithmic for some reason.


> It should be obvious that hash table lookups are above O(1) as long as collisions exist.

This doesn't follow at all. If the average number of lookups is 2 instead of 1, then it's O(2), which is O(1) because it's just a constant factor. For lookups to be worse than O(1), the number of collisions per lookup would need to grow with the total number of elements being stored, which does not happen, because the hashtable itself is also grown to keep the average number of collisions constant and low.

A hashtable that is 80% full has the same small number of collisions regardless of whether the hashtable has a thousand elements in it or a billion. Meanwhile, a BST would have to make 10 lookups in the first case and 30 in the second case -- that's logarithmic for you.
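A quick back-of-the-envelope check (assuming uniform hashing; the (1 + 1/(1-a))/2 figure is the textbook expected probe count for a successful linear-probing lookup at load factor a):

    import math

    a = 0.8                                  # load factor: 80% full
    probes = (1 + 1 / (1 - a)) / 2           # ~3 probes, regardless of table size
    bst_small = math.log2(1_000)             # ~10 comparisons for a balanced BST
    bst_large = math.log2(1_000_000_000)     # ~30 comparisons
    print(probes, bst_small, bst_large)      # 3.0  9.97  29.9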

Your link shows lots of real performance figures which are dominated by CPU cache misses and other non-intrinsic factors. That's what's responsible for the reduction in performance on larger data sets, not anything intrinsic to hashtables themselves. If you look at the aforementioned first graph, the best hashmap takes at worst around 10ns per lookup up through ~200K elements, at which point it blows out the fastest CPU cache, and then tops out at around 25ns for the largest size tested (~700M?). Factoring out the cache issue, it definitely does not look super-logarithmic to me. It might not even be logarithmic; I'd need to see the raw data to know for sure (it's hard to tell by eyeballing the graph).



