If I understand correctly, you still need guarding locks for this if reading and writing at the same time, or if there are multiple writers. Isn't this the same as for the native map?
It's not clear from the code; the author should probably clarify what "Goroutine safety" means - safe for reads only, or for reads and writes? You're correct about the approach for native maps.
EDIT: The package description punctuation is hard to grasp, but upon a second read, "Goroutine Safety." might be intended as a headline for the next line, which starts with "No." Read that way, your understanding sounds correct.
Depends on the implementation. Some algorithms[1] have a load factor no more than 0.80 before insert/lookup performance starts to degrade significantly. Others, like the algorithm used in Rust[2], can achieve load factors in excess of 0.98 without breaking a sweat.
That's why I said "good hash table". Even a simple open-addressing hash table with quadratic probing can easily go to load factor 0.8 with reasonable performance. With cuckoo-hashing you can get much higher and still guarantee worst-case constant-time lookups and amortized constant-time insertions.
The page doesn't say words like "faster than a hash table".
A hash table has O(1) time complexity on get/set operations. This tree also has constant-bounded time complexity, and should be a bit slower than a hash table.
Setting aside the redundancy from Go's auto-expanding slices, this tree is more space-efficient than a hash table. If you implement it in C without the auto-expanding feature, the space utilization can be even better.
What do you mean by constant-level time complexity? That the number of levels is constant? That's just because you bound the total number of keys to 2^32. With that constraint, anything is constant-time, even linear search.
This is basically a trie built on the Chinese-remainder decompositions of the keys. I'm not an expert in Go, but I'm sure you can use fixed-size arrays, and thus implement any open-addressing hash table with no space overhead.
Hmm.. the product 2 * 3 * ... * 29 can be made larger, if we want a larger tree.
Comment 1: O(2 * 3 * ... * 29) is indeed constant, but we don't like that constant. The put/get/delete operations on this tree are all constant-bounded, at most Sum(lg2~lg29) checks.
Comment 2: Yes, we can use fixed-size arrays and build an open-addressing hash table with no space overhead (load factor 1), which gives 100% table space utilization, but then any insertion would cause the table to resize, which is an O(N) copy (where N is the table size). Insertions into a tree node, however, can be done locally, without moving the whole tree.
Take 10 consecutive prime numbers:
2, 3, 5, 7, 11, 13, 17, 19, 23, 29
And they can distinguish all uint32 numbers:
2*3*5*7*11*13*17*19*23*29 > ^uint32(0)
Huh? How do I represent 31 (or any prime greater than 29)?
You don't represent primes, the primes are used to split the keys up and decide where in the tree it goes.
e.g. to represent 31, you take 31%2=1, so you use child 1 of the root node. If the root node already has child 1, you take 31%3=1 and use child 1 of that node. And so on.
Oh, I see. It's like a B-tree with a larger node size at each level, yes? But then what's the advantage of basing the node size on primes? Wouldn't you get the same advantages with simpler code by using successive powers of two at each level instead of successive primes?
The idea is that you will find a place in the tree for your given number at a relatively shallow depth, because every number in the ring Z/(p1×p2×…×pn)Z has a unique list of remainders in the separate rings Z/piZ. As long as the product of the primes is larger than the size of your key space, then you won’t get any collisions.
I’m not sure using consecutive primes starting from 2 is actually the best choice though. You could use any primes you like (or for that matter any collection of factors which are all coprime), so long as their product is more than the largest possible key. The choice of factors should probably be tweaked based on benchmarks in real-world use. For instance, the factors could be 25=5², 29, 31, 32=2⁵, 33=3×11, 37.
[Excuse the lack of subscripts in the notation here; no universally installed fonts contain a subscript n, as far as I know. Likewise I’d prefer to use ℤ rather than Z, but I’m afraid it might show up as little boxes for some readers.]
When you use primes for your modulo, you get unique representation. This is a result of the Chinese remainder theorem. If my key is k, then calculating
k mod 2
k mod 3
k mod 5
...
k mod 29
uniquely determines k mod 2* 3* ...* 29. This way there are no collisions. Any set of pairwise relatively prime moduli would work, but using primes is simple and it's easy to extend (just add the next prime to the list of moduli).
This would then dictate the location in the hash tree (depending on how deep it is) of any entry that hashes to 31, or to those values modulo those primes.
Assuming that the underlying implementations of the Item interface will use some form of hash algorithm to determine the return value of the Key method, how does this deal with hash collisions of two distinct items?
For this case, collisions can be resolved via the traditional separate-chaining strategy, by attaching a list to the Item.
This htree was originally implemented as the indexing container to store key-value position information for the bitcask storage engine. String keys are not stored in the htree, to save memory; collisions between two string keys are distinguished by checking the disk.
The Item interface requires a uint32 typed key, meaning that if you need a hash function to derive the key, you are responsible for resolving the collisions :D
MapDB is great, thank you for creating and supporting it!
But I think the most important feature of it is that it implements common Java interfaces from java.util.concurrent like ConcurrentMap. Drop-in in a best sense of the word.
I'm not very familiar with the Go ecosystem, but my impression is that it lacks in this department (of common interfaces).
In that case you could just use a binary tree taking from lsb to msb, and then expand down a level only on collisions.
The reason for primes is that it is more likely to distinguish between numbers earlier in the tree so that it doesn't get as deep as fast. Where a binary tree couldn't tell 0 and 256 apart until the 8th level, this could tell them apart at the second level: 0 mod 3 = 0, 256 mod 3 = 1
Is it possible to implement this as an immutable data structure? Or is the duplication necessary for each insertion making it too inefficient to be practical?
Higher branching factor gets you a better constant factor because the depth of the tree is the base-(branching factor) logarithm of the number of nodes. The number of pointers you need to follow to reach a leaf is equal to the depth of the tree, and pointer-dereferences are likely to cause a cache miss, which is a relatively slow operation.
You're correct that asymptotically, it's the same thing.
(I'm assuming you're a 5 year old who's at least been through a course on algorithms and data-structures):
It's vaguely stated in the abstract:
"Hash-Tree is a key-value multi-tree with fast indexing performance and high space utilization."
The use cases of hash trees are somewhat similar to hash tables (a map where you can quickly look up a key and get its value), but you'd use a hash tree when you're willing to sacrifice a bit of performance for better memory usage.
"Although hashtable is very fast with O(1) time complexity, but there is always about ~25% table entries are unused, because the hash-table load factor is .75. And this htree is suitable for memory-bounded cases."
> "Although hashtable is very fast with O(1) time complexity, but there is always about ~25% table entries are unused, because the hash-table load factor is .75. And this htree is suitable for memory-bounded cases."
Do you know how htree achieves this space efficiency? I ask because it looks to me like every node has an array of children, like a standard tree. Doesn't the cost of storing one additional pointer per key add up to at least 25% space wasted?
>Do you know how htree achieves this space efficiency?
This is a new data structure to me too but if I understand it correctly, one aspect of it, if you do it right, is you don't have to store the tree at all. You just store the data in an array, and use the procedure to look up the item in the array. So the tree is virtual. Anyone who gets it, am I right about this?
You would potentially need a sentinel bit or something at each element of the array to tell you whether there is a value there. Or if all your values were non-zero (say you are storing pointers to other structures) then that is already taken care of.
In Go, slices auto-expand, so the children array may allocate more space than its length, depending on the Go runtime's growth strategy.
> "Child nodes are stored in an array orderly, and checked by binary-search but not indexed by remainders, array indexing will result redundancy entries, this is for less memory usage, with a bit performance loss. "
I just cannot find a way (such as a linked list?) to use the exact space while keeping at-least binary-search time complexity.
Every non-open-addressed (i.e. chaining) hash table can use load factors up to 100%, compared to 50-75%.
Robin hood is a special case.
Concurrent leapfrog hashing would save the go lock and performance problems better. No need for trees when you can use a proper concurrent in-memory hash table.
And you can use string keys.
So your mother sends you down to the basement to grab some onions. You go down and find them fast, because they're stored in a well-structured way.
It's cool because this is a blueprint for a well organized basement.
> Goroutine Safety.
Could somebody elaborate on these two points? Why is it a "safe container" and how is it more "goroutine safe" than a map?