An efficient way to make hash maps with insertion ordering (toit.io)
118 points by eatonphil on July 1, 2021 | 117 comments



The surprising/interesting thing about this article is the notion that insertion order is the order that you want to iterate through a hash map.

In my experience, I'm either (A) using them as associative memory (I want the value for a key) and I don't ever iterate, or (B) I don't care about order when I iterate (I'm summing values, say), or (C) I want the keys in some sorted order that probably has nothing whatsoever to do with insertion order (counting keywords, then showing them in alphabetical order).

So I'm curious why the insertion order is preferred, in your experience.


The most important thing is that there is some defined order. This makes it easier to debug, and reproducibility is just a very nice property. See https://en.wikipedia.org/wiki/Reproducible_builds. Simple stuff like making golden output files for your tests becomes easier.

Certainly, having an ordering on the keys themselves (eg. alphabetical order) is an alternative to insertion order. I'm fine with that. But eg. the built-in maps in Go, Perl or JS don't support that.

So often you end up getting the keys, and then immediately sorting them before iterating. That's taking something that should be O(n) and making it O(n log n). I remember doing this all the time in Perl.

And often there's no order, so you have to create one for your key type. That's extra work, and it's just easier to use a data structure that doesn't require it.

I would be interested to see performance comparisons for maps where the keys are ordered without having to sort on each iteration. Often they have tree structures internally, which means cache-unfriendly pointer chasing, but probably you can remove most of this (and the log-n access time) with wider trees like B-trees.


For perl you can get this, but you have to explicitly ask for it outside the perl interpreter. This is done by setting the hash seed and telling it not to add turbulence during runtime.

export PERL_HASH_SEED=0; export PERL_PERTURB_KEYS=0;

The article doesn't mention why python and perl made the choice to add more randomness, but it's because there were attacks out there where people could determine which bucket keys would be put in ahead of time and calculate a set of keys to cause pathological behavior in anything that made a hash from user input; this was commonly url parameters or POST requests. This meant that a simple web request with the right information in it could cause your application to continually reallocate memory and expand the hash size to accommodate the inputs. Usually this just burned cpu time and made other requests time out, but it could also result in excessive memory usage and the OOM killer coming into play.

The above environment variables work to set a known seed and then not mix any new randomness in. This doesn't get you an insertion order, but it will make the hash itself as reproducible as possible (if you don't put the keys in in the same order, your results will still vary). I used this in a now defunct (because of second/third/fourth-system syndrome) fuzzer I had going to detect problems in the prerelease versions of perl.


I don't think compact dictionaries are necessarily vulnerable to this issue. You could randomize the hash seed without affecting the iteration order.


It definitely depends on the implementation, and I imagine you're right that you could make it secure pretty easily without affecting it. But switching from a non-compact, non-guaranteed-order implementation would have resulted in internal API changes in perl that would have been far more difficult to make. This is because of some past design decisions that mean perl has to be rather conservative about internal interpreter changes.

Perl uses a system called XS to write C code that interacts with the perl interpreter. XS doesn't have a defined API other than "what the perl internals do"; this is because it's not really a C API but in fact a way to extend the interpreter itself (you can replace parts of the parser and do some other shenanigans), and it exposes basically all of the fields, structures, and functions that are used inside the interpreter to all C extensions to the language. There are some nice modern alternatives to doing things this way, but the end result is that unless we want to break just about every library on CPAN that uses any native code, some care has to be applied to not change too much of the API at any given time. That's also why there's Devel::PPPort, which tries to wrap both new and old APIs/structures in a way that provides some amount of compatibility between perl versions where there is otherwise no compatibility.


> The most important thing is that there is some defined order. This makes it easier to debug, and reproducibility is just a very nice property. See https://en.wikipedia.org/wiki/Reproducible_builds. Simple stuff like making golden output files for your tests becomes easier.

Relying on this type of behavior when it isn't required is somewhat of an anti-pattern, and can often lead to mindlessly copying the results of an implementation in tests.

A quick aside: any sort of generic ordered container that supports O(n) iteration must necessarily require O(n log n) work to construct in the average case. So the asymptotic complexity is necessarily the same as sorting the keys of a hash map with undefined iteration order to artificially create a defined order.

It isn't too hard to simulate the behavior of insertion order iteration by maintaining an array alongside the hash map. And surprisingly enough, space and runtime complexity is practically identical to containers that implement insertion order iteration ;-)
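For example, a minimal Python sketch of that side-array idea (deletion omitted; Python's own dict has been insertion-ordered since 3.7, so this is purely illustrative):

    class InsertionOrderedMap:
        def __init__(self):
            self._map = {}    # key -> value (pretend this dict is unordered)
            self._order = []  # keys in the order they were first inserted

        def __setitem__(self, key, value):
            if key not in self._map:
                self._order.append(key)
            self._map[key] = value

        def __getitem__(self, key):
            return self._map[key]  # O(1) expected lookup, as usual

        def __iter__(self):
            return iter(self._order)  # O(n) iteration in insertion order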


> any sort of generic ordered container that supports O(n) iteration must necessarily require O(n log n) work to construct in the average case

This doesn't apply if the order you want is insertion order.

These hash maps are constructed in O(n) on average, just like any other hash maps.

Java's LinkedHashMap is also constructed in O(n).


"ordered container" meaning that the container is ordered based on the objects it contains.


A result of insertion order is that when I print {b:2,a:1} I get "{b:2,a:1}". This can be nice for being able to store something like an HTML element's attributes as a dictionary while still serializing it back to the order it was declared with

Sorted order also requires sortable keys, which isn't something hash maps otherwise need: they only require hashing & equality

A stable ordering is nice; without the ability to sort, insertion order meets this ask. https://mail.python.org/pipermail/python-dev/2012-December/1... shows how in Python this useful property also comes as a side effect of an optimization


> A result of insertion order is that when I print {b:2,a:1} I get "{b:2,a:1}".

This implies that {b:2,a:1} is a different data structure than {a:1,b:2}, which disagrees with my intuitions about how associative arrays ought to work.

If I wanted a data structure that cares about order, I could use [(b,2),(a,1)] and [(a,1),(b,2)]


This is for, like, debug purposes. Makes logs and things like JSON requests easier to read because the output matches the source. Logically though, they are still unordered for things like equivalence checks, so those two examples you had there still compare equal.

Since order is unspecified, but when serializing the hash table you have to pick one, you might as well go for insertion order and make it easier for the programmer to read. Especially since this kind of hash table gets that “for free”


For debug purposes you throw away security and define a fixed seed, which implies the ordering.

Equivalence checks on unordered containers are different, and any framework which cannot check equivalence of unordered containers is worthless. Forcing ordering on inherently unordered containers does have its price; it's certainly not free. You can pay that price at serialization, or for testing, but it makes not much sense elsewhere.

Unordered maps don't hate you; you are just not equipped to deal with them. Insertion-ordered maps were a thing in the 80s with lisp property lists, which did hold your environment of symbols properly. But since then almost everyone has switched over to proper hashmaps; just some have not yet: Emacs, python, PHP, ...

You cannot get rid of old sins easily


CPython has had insertion-ordered maps since 3.6, after a memory usage optimization, & didn't before. 3.7 made it part of the spec.


Compact dicts, which preserve insertion order, are now the default implementation of dicts in Python (since 3.6, with the ordering guaranteed by the spec in 3.7). They gave a talk about it in https://www.youtube.com/watch?v=npw4s1QTmPg


There are also some uses of dictionaries where the ordering matters (e.g. some MongoDB commands) and you can't use JSON or BSON and you have to use goddamn SON.


One case where this is useful is where you want determinism. You don't need a particular order, but you do need it to be consistent.

This comes up a lot in compilers where you don't want the output code to vary on each run.

This is specifically a big problem with Go, because maps use random iteration order so you don't depend on it. I feel that was a mistake. I would rather they took a page from Python and JavaScript and used insertion order. The reason they did not was to avoid painting themselves into a corner with respect to algorithms and options in the future - which is reasonable.

Rust uses a randomly seeded hash algorithm on purpose to defend against hash DOS attacks.

Hashing doesn't necessarily mean indeterminate outside of Go and Rust, but it can be tricky to guarantee because it depends on the algorithm and the memory allocator (which is often multithreaded and can introduce non-deterministic behavior).


Maybe it wasn't your intent (I also could be reading this wrong), but I feel like throwing Rust under the bus here is unfair.

Rust has a ton of different collections types you can use. You can also replace the hashing and seeding algorithms. It's the most flexible of the lot out of necessity. Systems programming requires fine control over data structures.


> Maybe it wasn't your intent (I also could be reading this wrong), but I feel like throwing Rust under the bus here is unfair.

I don't see how that was throwing rust under the bus. I didn't even state my opinion, just the fact that it makes this performance for safety tradeoff (an unusual one since most hash tables do not attempt to protect against hash DOS attacks.)

You can replace the hashing algorithm. The rust compiler does this.

But most people don't fiddle with the default, which makes it a questionable decision in my books.


Throwing Rust under the bus...?

I don't think that anyone really could do that. I am chuckling as I am writing this.


> The surprising/interesting thing about this article is the notion that insertion order is the order that you want to iterate through a hash map.

I agree; this is what java calls a "LinkedHashMap" and what python calls an "OrderedDict" - there are times when it's quite useful, but often it's simply not necessary, uses extra memory, and shouldn't be the default when you just need a hashmap.


Or you could look at it as "relaxed ordering is a performance optimization" and shouldn't be taken prematurely.

In Java, there is no default; you ask for a HashMap or a LinkedHashMap explicitly. In dynamic languages like Python or JS I prefer default insertion ordered maps because if I wanted tightly optimized code, I wouldn't use those languages in the first place.


The nice thing is the compact dictionary doesn't use extra memory, so why not make it the default?


It does, though. See the diagram in the article. There's two allocated arrays there: one for the real hash table (on top) and one for the "backing array" which then stores the actual items. The "extra" bit is the indexes in the top allocation: those indexes would be omitted in a normal (unsorted) hash map. (That is, a normal hash map allocates a single array, which is the storage for both the hash table and the items it contains.¹)

(There's also some extra time requirements, too, to chase that additional pointer from the top table to the bottom one.)

¹N.b. in some languages those items will be references, but that doesn't matter for the comparison here.


Note that the bottom array can be densely packed, whereas the top one must have gaps to avoid excessive probing. If you only had one array you would have to put the gaps in that array, and it is at least 2 words per entry (much worse if you are inlining objects). So it's not so clear. If you want to save space you can also omit the hash bits stored in the index and reduce the width of the index entries to 16 or 8 bits on small maps.

In practice the compact dict does very well on space. Much better than Java's unordered bucket-based HashMap for example.

It's true that there's an extra pointer chase.


Ah, that's an excellent point, and it does muddle the water about which would win out.

Deletes could still leave holes in the storage array, but I think I agree (and your article also states it) that the common case is likely fill table with data, and then perform lookups.

(And I had thought bucket-based hashmaps had very much fallen out of favor to the point where they really weren't the choice of standard libraries anymore, precisely because of the additional allocations & pointer chasing they required? I'm not a Java person.)


C++ unordered_map unfortunately forces a bucket-based implementation. I don't know about Java.


The existence of the Map.Entry objects in Java implies a bucket based approach, but I don't think that's how it works these days, except perhaps for LinkedHashMap.

This stack overflow answer implies the Map.Entry objects are reused during iteration which means they don't need to exist the rest of the time and a flat KVKVKV backing array could be used. https://stackoverflow.com/questions/5455824/doesnt-that-iter...

Alternatively, new Entry objects are created for each step of the iteration, then heavy inlining could enable the JIT to eliminate that allocation again.


It may (perhaps, I can't tell) not use "extra memory" compared to the data structure they had before, but it definitely is using memory it wouldn't need if it didn't preserve order.

Otherwise it would be more or less a violation of the pigeonhole principle. The insertion order is information, if you don't need to store that anywhere but you promise to give it back anyway that's a hole in your information theory.


With that reasoning a sorted list would need more memory than an unsorted list.


How so? Sorting changes the order of the list, to the extent it's adding information (the sort order) it's also destroying the same amount of information (the previous order).

It's easy enough (including by accident) to deliver insertion ordered iterator behaviour for small hash maps that have never had any items removed, but after that things get sticky very quickly.


You were stating that "[...] it definitely is using memory it wouldn't need if it didn't preserve order.".

There is no reason for that. A sorted list doesn't need more memory than an unsorted list. And an insertion-sorted map doesn't need to use more memory than a map that doesn't keep the order.

Independently, some properties we deem valuable are often side-products of the way the data-structure works. For example, an efficient binary search tree will naturally tend to be balanced. In the case of efficient hash maps, it is now common to add the key/value pair at the end of an internal vector. Maintaining the insertion order this way doesn't add any cost.


It actually does have a cost. In fact I'm pretty sure it has two separate types of cost, so let's talk about both of them.

Firstly, and this is orthogonal to my original point, whenever people say "Cost" they actually silently mean "... that I care about" and so we're involuntarily obliged to agree to stop caring about anything else. Because of Hyrum's law even if you don't promise ordering (as Python's "Compact Dict" developer proposes at one point) your users will rely upon it and you're stuck with whatever you implemented. This is a real cost for a language or standard library. Look at the sad state of C++ standard collections.

But my main point is that the requirement for ordering forces you to dispose of otherwise permissible optimisations which could destroy the order. You actually ran head first into this in the article, when repeatedly removing and adding elements blows up by clogging your data structure with tombstones.

Swiss Tables wouldn't clog up with tombstones. Since they don't care about ordering they only need tombstones at all in uncommon special cases (specifically, if the entire rest of the group is full or tombstones), in most cases Swiss Tables get to just set removed (key, value) pairs as empty. But this optimisation isn't open to you.


Sticking tombstones in the backing array doesn't actually cost you anything though. The backing array still needs to be the same size. Like maybe you end up having to re-hash more often?

>Because of Hyrum's law even if you don't promise ordering (as Python's "Compact Dict" developer proposes at one point) your users will rely upon it and you're stuck with whatever you implemented. This is a real cost for a language or standard library. Look at the sad state of C++ standard collections.

Yes, but this cost goes away if you commit to it, and the alternative (like what go does) of forcing the order to change all the time also comes with a complexity cost vs. doing "nothing" and having the order be arbitrary and sometimes, but not actually, reliable.


I don’t think this holds for persistent data structures that use HAMT. Those have great performance along with the benefits of immutability, and I’m perfectly okay with deferring order to a pretty-printer.

It wouldn’t handle insertion order, so probably best to do alphabetical by default then. Chrome dev tools does this for js objects (which are map-like) already, and I’ve never had complaints about it.


note: it has been the default in python for some time, then made official (instead of an implementation detail) in some 3.x release. 3.7 maybe?


Turns out it's been the default in Ruby for a while as well (insertion ordering since 1.9, and a compact open-addressing implementation since 2.4 or so).



In Kotlin, calling mapOf() gives you a LinkedHashMap. Which is a weird choice IMHO.


When you want to output values in the same order as the user gave them to you, e.g. when you read then write JSON or TOML files and those files can be written by a real human. Changing the order feels arbitrary.

When you have a language that supports keyword arguments (kwargs), like Ruby or, more recently, Python.


I've relied on the order of keys for URL routing in JavaScript / TypeScript. Each route of the web application is unique, and the order matters - it's surprisingly easy to end up with otherwise ambiguous routes like /products/:ean and /products/all in real world applications. One benefit of using an object and not just an array is that I can validate at compile time with TypeScript that each route has exactly one handler.

An insertion-ordered hashmap works really well for this without requiring any extra syntax or data beyond object literals.
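To illustrate with made-up routes and handlers, here is roughly the same first-match-wins idea in Python, leaning on the dict's insertion order (matches is a deliberately naive helper):

    def matches(pattern, path):
        ps, xs = pattern.strip("/").split("/"), path.strip("/").split("/")
        return len(ps) == len(xs) and all(
            p.startswith(":") or p == x for p, x in zip(ps, xs))

    # /products/all is registered first, so it wins over the catch-all.
    routes = {
        "/products/all":  lambda: "all products",
        "/products/:ean": lambda: "one product",
    }

    def dispatch(path):
        for pattern, handler in routes.items():  # insertion order
            if matches(pattern, path):
                return handler()
        raise LookupError(path)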


Imagine you want to write a JSON parser and prettifier. Then writing the keys out in the same order as they were inserted would be very important. Yes, it's true that key order doesn't matter to the computers that consume JSON, but humans consume JSON too and key order often matters to them.


You may be able to tell from this reply that I've never built a JSON parser and prettifier :-) but I'd guess that in this case, I'd never have any reason to ask for a bit of the JSON "out of order", so I'd just use arrays...?


It's also important if there are duplicate keys


While we can sort the keys before iterating a plain hashmap, getting insertion order is particularly cumbersome because we need to keep some extra information about the original ordering of the elements. So I guess I'm kinda glad that this is the default implementation, and getting other sort orders is easy with just having access to the keys.


>So I'm curious why the insertion order is preferred, in your experience.

It's a little bit like asking what's the use of a linked list. A map which can iterate through keys in insertion order is just a data structure.

A recent use-case for me was using timestamps as keys that map to captured state at that time.


> It's a little bit like asking what's the use of a linked list.

Well, no, it's more like asking why you'd use data structure X instead of Y.

I was asking why you'd want that property as part of your data structure, and pointing out that (AFAIR) I've never needed it.

For your use case, I guess you wanted to be able to retrieve the data for specific time stamps? Otherwise I'd think you'd just put it into an array.


> A recent use-case for me was using timestamps as keys that map to captured state at that time.

That sounds like sorted key order would suffice just fine.


When working with data that has been scanned/parsed from a file, it is often very useful to me that I retrieve the data in the order it was found in the file. This gives humans some kind of continuity between the input and the output of the program.


It's useful for when you are in a job interview and are asked to build an LRU cache. :-)


I think LRU does NOT work with this trick, because there's no easy way to "move-to-front" in O(1) time.

The article mentions Java's LinkedHashMap, which actually can do LRU. But as the name implies, they rely on a linked list, where move-to-front is natural.

But this superpower does not come for free: a Java-style linked hash map will use more memory than the technique described in the article.


I think you can move to front by removing the entry and then adding it again. This won't be O(1) currently, but with the tombstone trick from my blog post I think it would be.


Ooh nice! I didn't think of that! Since appending to an array is cheap, you can just move-to-end instead of move-to-front, and then iterate in "reverse" order, so from the outside everything seems to Just Work.
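For what it's worth, a small Python sketch of that remove-and-re-add ("move to end") idea, with invented names, leaning on the built-in dict's insertion ordering:

    class LRUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.items = {}  # iterates in insertion order (Python 3.7+)

        def get(self, key):
            if key not in self.items:
                return None
            # "Move to end" by removing and re-inserting.
            self.items[key] = self.items.pop(key)
            return self.items[key]

        def put(self, key, value):
            self.items.pop(key, None)
            self.items[key] = value
            if len(self.items) > self.capacity:
                # The least recently used key is now the first in order.
                del self.items[next(iter(self.items))]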


One thing of note: if you want sorted order, there's no reason to be using a hash table.

It would require either worse than O(n) amortized iteration, or worse than O(1) expected insertion (A log n has to go somewhere to satisfy the theoretical lower bound for sorting).

If you want sorted, you're going to use a tree based structure.

I'm also not sure how you'd even naturally get sorted order into a hash table. I'm sure it's possible, but on the face of it it just seems bizarre.



People want a table, like in an rdbms.

A hashmap is what they get. Its properties (only useful as a direct index) are too narrow for the kind of use it gets in langs like python, where a dictionary is used everywhere (like a vector).

Arguing that a hash map must only be used to get a direct index is like saying "a vector can only be used to get a positional index" and then wondering why people iterate it.

In other words: collections want to be iterated.


Other than having a deterministic order without the need to specify additional ordering constraints, it is useful when you are populating the map while linearly processing some data.


Cache with FIFO eviction and O(1) lookup


"The other feature you quickly find yourself needing is the ability to use arbitrary objects as keys in a hash map. In languages that don’t support that (Perl, Python, Go, JavaScript until recently)"

I don't think that's right. Python and Go do effectively support that. Each has their own take on the problem of being unable to use mutable keys, but I index by some sort of object all the time. Makes iteration over hashes substantially more useful when you get objects for both hashes and keys.

Back in the Python 2.0 era, before it grew all the other features that were advantages over Perl, before Python even had generators/iterators, this was one of the big reasons I preferred Python over Perl. Only being able to index by strings wasn't necessarily a huge stopper in theory, but in practice it was very dangerous, because you really need to encode your keys. It was extremely common to see

    print($hash{$keyPart1 . "|" . $keyPart2});
when what you need is

    print($hash{pipeEncode($keyPart1) . "|" . pipeEncode($keyPart2)});
to be safe, where pipeEncode can be as simple as a routine to backslash-encode pipe and backslash characters (although no simpler than that; you must do both of them to be safe, that's the minimum). This could result in non-trivial bugs and even security issues when you start encoding things like $object . "|" . $permission and hackers start sticking pipes in the names of objects.

But between the inconvenience of that if you know what you're doing, and the pile of code you'll encounter from developers who don't know why that's important (and may even argue in code review), it was definitely the sort of thing that grinds you down over time.

In Python it was safe to

    print(hash[(keyPart1, keyPart2)])
and it was consistent and safe. Or use an object, etc.


It looks like your "pipeEncode" is generating a string that reflects your desired key equality function. I mention this approach in the blog post, but I consider it a poor substitute for being able to use arbitrary objects and specify the equality function on the map or set.

The issue of mutable keys is a slightly different one. If you mutate any of the properties of your object that are used by the map you are going to have a bad time, so don't do that. And I guess if your maps are sufficiently simple (eg JS's object-identity maps) then the user can't make that mistake, but at what cost?

If they are generating strings as keys and they mutate the object after creating the string then this will also break so they haven't even really avoided the problem.


"I consider it a poor substitute for being able to use arbitrary objects and specify the equality function on the map or set"

My point is that it's an even worse substitute than most programmers realize, because to use it properly you have to understand how to encode parameters. The thing that people usually use, string concatenation with some delimiter, is fundamentally flawed.
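A short Python demonstration of the flaw, with a hypothetical pipe_encode fixing it:

    # Two distinct composite keys collapse to the same string:
    assert "a|b" + "|" + "c" == "a" + "|" + "b|c"  # both are "a|b|c"

    def pipe_encode(s):
        # Escape the escape character first, then the delimiter.
        return s.replace("\\", "\\\\").replace("|", "\\|")

    # With encoding, the two keys stay distinct:
    assert pipe_encode("a|b") + "|" + pipe_encode("c") \
        != pipe_encode("a") + "|" + pipe_encode("b|c")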

(My favorite... and, alas, I've inherited a system that uses this, though fortunately it hasn't surprised me yet... is using "underscore" as a delimiter, for values whose names routinely include underscores! Fortunately, nothing ever tries to extract the original values from the key programmatically, and it's not really in a place hackers are going to attack. But still... yeesh.)

The one exception I've sometimes made is that if you happen to be in an environment where you know a certain value will never be used, you can use that as the delimiter; I've used ASCII NUL for that a few times. But you have to be sure that it's not just a "weird" value that "nobody would ever use", but something truly excluded by the context, something that regardless of what is input by someone somewhere is completely impossible to ever get to your code. Generally, the characters you can type on a keyboard are not a good idea.


Go does not support using some types as map keys, including slices, channels, functions, and other maps. Slices are the most annoying, and I have seen plenty of code that uses fmt.Sprint(s) as a workaround. Fortunately the compiler now recognizes when you convert a []byte to a string for use as a map key, and will not allocate a new string.


In Go it depends if the type is "comparable"; i.e. if "==" works.

You can use arrays in Go; e.g. map[[2]int]string, and you can also use channels; although I'm not sure what the rules for comparing channels are exactly off-hand (I'm struggling to come up with a scenario when this would be useful off-hand actually).

The big problem with slices and maps is that they can be modified. That is, what happens if you modify a slice after you used it as a map key? In slices this is worse than with maps because the backing array can change if you run out of cap space. And also, do you compare by value or identity? And again, what happens if either changes?

I'm not sure if it's possible to come up with a set of rules that wouldn't take people by surprise in at least some cases.


Python doesn't either. I corrected my post above to switch to indexing by a tuple, because what I originally had was wrong:

    >>> d = {}
    >>> d[[1,2]] = 10
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unhashable type: 'list'
This is why I qualified my statement with "Each has their own take on the problem of being unable to use mutable keys,".

Go can be consistent in a simple way because of its type system, it can see if any part of a key has something in it that can't be hashed: https://play.golang.org/p/rf8IqPb76Em

Python, as befits Python, has default behavior for instances that I believe is "is" equivalency, but you can override that with various double-underscore methods to do whatever.


>Python, as befits Python, has default behavior for instances that I believe is "is" equivalency

Python dicts consider two keys the same if they have the same hash value and are "=="-equal. So __eq__ and __hash__ are the dunder methods to finagle. Python's sets are the same way.

A useful example is with the pathlib library.

    from pathlib import Path
    p = Path('a.txt')
    q = Path('a.txt').absolute()
    p is q # False
    {p, q} # Only one element
"q" carries some different info than "p", but it refers to the same file location. So not considering them distinct values is a good decision for this package.


Note that this is an oversimplification, as

    p = Path('a.txt')
    q = Path('a.txt')

    p is q  # False
since p and q are different objects, and "is" checks whether the objects are the same (this can be interpreted approximately as "have the same memory address"), it'll almost never be true. And in the cases where it is (2 is 2), you shouldn't rely on it, as most of them are optimizations.


p and q are the same file now, but after you change the current directory they are not. They should never be equal.


This is a weird article. It creates the idea that hash maps are uncommon? Surprising? I am not really sure. Then it makes the claim that you would expect them to be ordered by insertion order, and not just arbitrarily, which is the common usage of hash maps. Ordered hash maps exist, and isn't an insertion-order hash map basically just a list? But hash maps are so common and used absolutely everywhere that it's almost like the article is aiming at people who have used them but never any other data structures.


Python switching all dicts to being ordered is my favorite new feature from the last five years.

It's useful all over the damn place.


Hash maps are a fairly recent addition to the standard data structures "library", because they are relatively performance-intensive. They are of course very useful and therefore commonplace now, but I think that only came about after the rise of dynamic languages which have built-in hash maps (Javascript objects, PHP "associative arrays" etc.). So nowadays any new programming language which wants to attract web developers (e.g. Go) needs a built-in hash map.


Ehh, hash tables were invented in the 50s and were and are used wherever they are useful. I’m pretty sure a running joke is that every decently sized C program contains a half dozen hash table implementations. They’re not recent.


Hash maps have been commonly used in C since just about forever.

"isn't a insertion order hash map, basically just a list?"

No. They are different in the computational complexity of random access operations.


But the standard C++ std::unordered_map only came with C++11


When I was first taught Java in college I didn't find out about hash maps for a while because it was considered too advanced for beginners (ohh scary templates!)


This is quite interesting. I wonder why this was considered too advanced. At my uni we teach it from the very beginning.


Having hashmaps be ordered is not always desirable. It's not necessarily meaningful for every application[1]. Also, it can have overhead.

If the problem is that the developer doesn't get what they expect, often the solution is to require the developer to be explicit about what they ask for.

I think that could work well here. Instead of having map.keys(), the API could have map.orderedKeys() and map.unorderedKeys(). (Or make it okeys() and ukeys() to be more concise.)

Yes, it's slightly more typing, but to me that's a small price to pay to ensure you get what you want.

---

[1] For example, maybe the order that's meaningful to me has something to do with the values rather than the keys. If the map represents git commits, maybe the order that's meaningful to me is a topological sort, and the map isn't going to provide that.


This would be genuinely weird. The in-memory representation will make one or the other approach more efficient. If you want ordered, pick an ordered implementation (eg Java's LinkedHashMap). If you don't care, pick an implementation that doesn't care (eg HashMap). If you want some special ordering, pick something that can use a Comparator (eg TreeMap) or just sort the data when fetching.

A Frankenstein LinkedRandomTreeMap would pay the overhead of all approaches for an extremely exotic use case.


I suppose these ideas can be combined. An implementation that doesn't support ordering could be of a type that only has an unorderedKeys() and lacks orderedKeys(). Which is more explicit than having a keys() function that behaves differently than other types' keys() functions.

Also, I guess you could define unorderedKeys() to mean "I'm not picky about the order" (rather than "it must not be ordered"). Then all types could have an unorderedKeys() function.


If unorderedKeys() isn't picky about the order, then there's no reason to have a separate method. Just keys() will do, and different implementations can have different behavior.

Also note that relaxing the ordering requirement when obtaining keys does not necessarily improve performance; if you've already done the work of maintaining an ordered list (ie LinkedHashMap), it's actually faster to iterate in order.

Furthermore, what exactly does orderedKeys() mean anyway? Insertion order? Does reinsertion change order? If some sort of natural ordering (eg, topological) then how do you configure that?

I think the "state of the art" is basically correct here: An abstract Map interface that provides basic operations, and a variety of implementations all with different behaviors and performance profiles. Sorry.


There are so many application-specific assumptions in this post that it isn't funny. Ex: use of hash tables with insertion ordering to handle an insert-once task queue (preventing "double tasking" if a task is ever somehow inserted twice?)

Solvable with a linked-list or circular-buffer queue + hash table combined together (hash table to prevent double insertions; linked list / circular buffer queue for the FIFO functionality).
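A quick Python sketch of that combination (names invented):

    from collections import deque

    class UniqueTaskQueue:
        def __init__(self):
            self.queue = deque()  # FIFO order
            self.seen = set()     # hash set blocks double insertion

        def push(self, task):
            if task not in self.seen:
                self.seen.add(task)
                self.queue.append(task)

        def pop(self):
            task = self.queue.popleft()
            self.seen.discard(task)  # allow re-insertion once handled
            return task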

If O(log(n)) insert/removes are sufficient and you have priorities associated with each element, a Heap data-structure works (all "repeats" will be next to each other, because max-heaps always keep the maximum element on the top)

The goal isn't supposed to be to make a data-structure that magically works in all situations. The goal is to define standard library "components" that can be combined together in ways that matches the performance demands of a particular application.

---------------

And this is all still single-threaded land. If you want a highly performant multithreaded data-structure, you're probably going to have to write your own from scratch (instead of using building blocks from the standard library).

There are just too many performance assumptions that go into a multithreaded data structure. Is it read-mostly / write-rarely? Read/write balanced? Single-producer / multi-consumer? Multi-producer / single-consumer? Multi-producer / multi-consumer? Is the data structure able to "count the threads" that may access it and use thread barriers (thread pools may practically limit a data structure to say 64 threads, which means a 64-thread barrier / semaphore can be a highly performant solution to synchronization)? Or is it a fully dynamic design where an unbounded number of threads may exist (an unbounded number of threads means the semaphore method won't work)?

--------

I mean... most people don't care about performance. So a standard library data-structure with a mutex is just fine for well over 95% of applications. But if you care about performance, then suddenly things get complicated.


In my opinion, a hash map that needs to chase an extra pointer to do a lookup and uses insertion order to provide deterministic iteration order seems like a hash map that hates me. Using keys that aren't strings is also table stakes for hash tables these days.

If your desire is only deterministic iteration, just use a hash function that gives that to you. Many libraries randomize this in tests so you don't rely on it, but you can always set a custom hash function. This will give you a much faster happy path than something like this structure.


> In my opinion, a hash map that needs to chase an extra pointer to do a lookup and uses insertion order to provide deterministic iteration order seems like a hash map that hates me.

In most cases it's the other way around: Python switched to the compact hash map for performance reasons; the natural ordering was a side-effect of it (a toy sketch follows the list):

* The compact hashmap scheme has much better memory behaviour (as befits its name) because the storage buffer is contiguous instead of being sparse, so less space is wasted on empty cells; this is especially important in a language like Python where everything is a reference and each entry is 24 bytes

* It can also be further optimised by making the indices in the array adaptive (Pypy does that though I don't think CPython does): if you have less than 255 entries (which is rather common especially in languages where most everything is a dict — again python) you can use an array of u8s instead of usize.

* Finally the primary purpose of the compact hashmap for Python was iteration speed: in a normal hashmap unless you have a very high load factor the map is full of holes iteration needs to skip over, and furthermore the hole locations are basically random so the branch is completely impredictable, even with tombstoning the compact dict has a much better predictability.
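To make those points concrete, here is a toy Python rendition of the compact layout: a sparse index table over a dense, insertion-ordered entries array (resizing and deletion omitted, capacity fixed at 8 for the sketch):

    class CompactDict:
        EMPTY = -1

        def __init__(self):
            self.indices = [CompactDict.EMPTY] * 8  # sparse, small ints only
            self.entries = []  # dense (hash, key, value) in insertion order

        def _slot(self, key):
            mask = len(self.indices) - 1
            i = hash(key) & mask
            while True:
                idx = self.indices[i]
                if idx == CompactDict.EMPTY or self.entries[idx][1] == key:
                    return i
                i = (i + 1) & mask  # linear probing

        def __setitem__(self, key, value):
            i = self._slot(key)
            if self.indices[i] == CompactDict.EMPTY:
                self.indices[i] = len(self.entries)  # append keeps order
                self.entries.append((hash(key), key, value))
            else:
                h, k, _ = self.entries[self.indices[i]]
                self.entries[self.indices[i]] = (h, k, value)

        def __getitem__(self, key):
            idx = self.indices[self._slot(key)]
            if idx == CompactDict.EMPTY:
                raise KeyError(key)
            return self.entries[idx][2]

        def __iter__(self):
            # Dense and hole-free, so iteration is branch-predictable.
            return (k for _, k, _ in self.entries)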


I suppose that most people don't ever erase anything from a Python dictionary? Erasing items would lower the effective load factor of the array (or be O(n) so you can do a big memcpy). It's interesting that this highlights how Python users use dictionaries very differently from how I use hash tables (as a C and C++ programmer).

In the last few years, people have been designing hash tables to run at much higher load factors, but if the decision was made before ~2018, I can see why the iteration speed argument would result in a structure like this.


> I suppose that most people don't ever erase anything from a Python dictionary?

It's the tombstoning mentioned in both my comment and the essay (tombstoning is basically the amortisation of deletion); however, you need a lot of deletion to fuck up your branching as much as a middling-load-factor hashmap does.

> Erasing items would lower the effective load factor of the array (or be O(n) so you can do a big memcpy).

It will, however:

* maps which get items deleted from them are a minority, and even more so those which both get items deleted and get iterated

* once enough items have been deleted, the buffer can be shrunk in order to remove the tombstones (converting it back to a very dense array).


> If your desire is only deterministic iteration, just use a hash function that gives that to you.

For one thing, even with a deterministic hash function, the keys still have to be hashable in some deterministic way - no using pointer values as hash codes. Making objects properly hashable is always possible to do, but in many languages it requires additional work, so people often don’t bother, at the cost of nondeterminism. Insertion-order maps make determinism ‘just work’.

Even assuming you do have a deterministically hashable objects and a deterministic hash function, then sure, the map is deterministic in the sense that starting from an empty map and performing the exact same series of operations gives you the same iteration order. If you run the same program on the same inputs, you should get the same output. And often that is all you need.

But sometimes you want one part of the program to be deterministic even if another part is changing, and nicely-behaved hash maps can make that easier. For example: both insertion-order maps and sorted maps, but not standard hash maps even with a deterministic hash function, share the property that adding a new element won’t change the relative order of two existing elements. If you’re mutating one part of a map and not another part, yet the latter part is changing order anyway, in some sense that part of the map is behaving nondeterministically.

Similarly, they both share the property that a map->list->map round trip, or map->string->map round trip, results in an indistinguishable map. So it’s easier to guarantee that a program behaves deterministically even if you sometimes need to (nondeterministically) pause it, serialize its state to disk, and reload it later.


Thank you. I think that the idea that a hash map that doesn't provide insertion order "hates me" is a bit overblown. Not that this isn't occasionally what you want, but that feels like an exotic new data structure, not that the original hash map is somehow wrong and user-hostile.


Is there a proper way to use custom objects as keys in hashmaps in JS (well, ideally TypeScript compliant)? Coming out of the C++ world where this is trivial and something I use quite often this seems pretty infuriating. Maybe a general data structure package that's worth pulling in?

EDIT: After doing a bit of research I found "tstl", looks interesting, their hashmap appears to support custom hashers and equality comparisons:

https://tstl.dev/api/classes/std.hashmap.html

I'm curious about experiences with it if anyone's had the chance?


Create a wrapper class that takes a stringifying function that converts the key objects into a primitive and use a Map<string, [K, V]> as the backing store. fast-json-stable-stringify can be used as the default stringifier, and if the objects are large or have a natural key a custom stringifier can also be supplied. Since we are using TS we can have our custom class implement Map<K, V>.


That's very similar to the solution I'm using and that I hate - it involves unnecessary generation and storage of strings and is far less ideal than something like the implementation in tstl I linked above where you can simply provide a custom hash function and equality comparison.


If the problem is just that you want to use non-primitive keys, that's what Maps are for.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...


I'm aware of them, but the problem with that is you can only compare an object by instance if I recall correctly. There's no way to override the hashing and equality check that I could find, meaning even for simple objects you're basically forced to serialize your keys in some way still, which is disgustingly inefficient and slow for no good reason.


So much this. I remember being so excited that Map and Set were added to the language, only to find that they were worse than useless for objects. At least Set can be handy in a pinch, but I have never seen an actual usecase for Map that isn’t solved better with plain old JS objects. Curious if anyone else has?


There definitely are valid use cases for a Map, and even more now that we have WeakMaps (e.g. associating data with DOM nodes). Or mapping functions to values, e.g. something like this:

  const returnCache = new Map<() => any, any>()

  const memoized = <T extends () => any>(f: T): ReturnType<T> => {
    if (returnCache.has(f)) return returnCache.get(f)
    const v = f()
    returnCache.set(f, v)
    return v
  }


> I have never seen an actual usecase for Map that isn’t solved better with plain old JS objects. Curious if anyone else has?

With a Map, you can easily and efficiently associate data with DOM nodes without changing them.


A quick look at the source implies that the iteration order is not predictable.


Curious to read this and not see a reference to Ruby.

In Ruby hashes are ordered by insertion and can have any object as keys, and the hash key value can be defined on a class-by-class basis (by overriding the default `hash` method on the class).

The ordering is one of those features that I need very infrequently, but when I do it significantly simplifies that bit of code.

I've always found the usefulness and flexibility of Ruby's hashes to be a double-edged sword though: they are extremely useful, but this leads people to use them in situations where defining a proper class might be better for long-term maintenance.


I've used this feature of Ruby Hashes to create something that behaves like an LRU cache :)


Interesting, I'm trying to work out how that would work. Would you mind giving a quick overview?


Rust has HashMap and BTreeMap. You use the latter when you care about ordering. There are also a couple of options for deques. A good overview here explaining which one you'd choose depending on what you want to do: https://doc.rust-lang.org/std/collections/index.html

Use the right data structure for the job. Maps are an interface, not an implementation.


The indexmap crate provides a hashmap that iterates in insertion order.


Rust has hash maps that hate you.

The default hash map visits the keys in arbitrary order (https://doc.rust-lang.org/std/collections/struct.HashMap.htm...).

You could use a BTreeMap, but that's not insertion order and requires the user to come up with some sort of ordering.


They don't hate you. That's how it's supposed to work. Otherwise you get issues like the one discovered in the article.

Would you say Rust has Vecs that hate you because prepend isn't O(1)? Or does JavaScript have Maps that hate you because they aren't sorted?

There's no perfect container. It's all trade-offs.


The article also explains how the issue discovered can be fixed.

And JS maps are insertion ordered.


The article doesn't expand much on the performance cost. Another comment mentions Swiss tables, which are what rust HashMap is based on these days, and it's unlikely that python's maps are as optimised. Not to mention the unavoidable memory overhead.


Which unavoidable memory overhead?

The hash map as described in the article doesn't seem to waste any memory.


The indices. The article stores Hash+Index+Key+Value, compared to storing Hash+Key+Value.


It's not that simple.

The index table needs to be sparse (for the hashing to work), and thus has quite a few entries unused. It also wants to be small in memory to improve cache locality.

Therefore you want to have the entries in the index-table as small as possible.

As example, let's say you have n entries in your map which is 2/3 full. Let's assume the key and the value each take 1 word.

In the index+key+value case you would thus use: n * 1.5 words for the indexes, and n * 2 words for the keys/values. -> 3.5 words per stored key/value.

In the key+value case you would need (2 * n) * 1.5 words. -> 3 words per stored key/value.

However, that assumes that the index-table uses a full word per entry. That's usually not the case, as you can use smaller types while the map is small enough. On 64-bit machines, the index-table will rarely use more than 32-bit arrays. In that case you would only need 2.75 words per stored key/value.

That's not really fair, though, as the value/key table has some unused space, so that it doesn't reallocate/copy every time a new key/value is added. Implementations can play with how much extra space is allocated for those unused slots.

Also, the "hash" you mentioned can be combined with the redirecting index. For Toit (the map of the blog post), the index table uses 32 bits. In these 32 bits it stores both (part of) the hash code, as well as the redirecting index.

It might be the case that the hash+key+value approach uses less memory, but I would not be confident to say that it's always the case.
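A sketch of that packing idea (the 20/12 bit split here is invented; Toit's actual layout may differ):

    INDEX_BITS = 20                           # low bits: entry index
    HASH_MASK = (1 << (32 - INDEX_BITS)) - 1  # high bits: hash fragment

    def pack(hash_code, index):
        return ((hash_code & HASH_MASK) << INDEX_BITS) | index

    def unpack(slot):
        return slot >> INDEX_BITS, slot & ((1 << INDEX_BITS) - 1)

    # The stored hash fragment lets a probe reject most non-matching
    # slots without dereferencing into the dense key/value array.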


I wish JavaScript had linked hash maps as default so that JSON would not require arrays to keep the order of objects!

Hierarchies would not have to look like this: http://root.rupy.se/meta/user/task/563541707719927980


Javascript already does retain insertion order (except for special casing of things like integer keys)

JSON objects, however, have unordered keys.

JSON != Javascript.


This reminds me of one bug I introduced a long time ago:

This was a node backend, and there was some code that was recreating an object by reinserting all the key/value pairs in a precise order, so that that's how it would serialize in the JSON, and then the UI depended on that order. But to me, it was just recreating the same object, so I deleted that code, causing the bug.

In retrospect, I should have at least asked the committer what they were trying to do.

In my defense, this was before javascript guaranteed the insertion order would be the order kept.


First off, WTF is up with disabling copying? That is seriously annoying. Now on to what I came here to say..

"When there are enough of these tombstones, the hash map gets rebuilt, but that’s OK. Due to the magic of amortized constant time, this doesn’t slow you down on average."

If amortized time is all you care about, that's fine. But there are some applications where worst case time is more important, and unfortunately I've seen some hash map implementations that implicitly assume that the worst case never matters.


In some browsers you can get around this by selecting the text, keeping the mouse button down, hitting the keyboard shortcut for 'copy', then letting go. The selection will still be emptied, but the text is already in your clipboard.


I benchmarked dozens of hashmaps when building my own programming language in Pascal.

Such a 2-array insertion-order map turned out to be one of the fastest maps.

I was a little bit surprised


Java is treated as the "prior art" like it is some sort of ancient script? That is amusing.

Java was the first mainstream GC language with sufficient man-hours of engineering behind it, and it was an opportunity, after almost two decades of C/C++ libraries, to start anew with the standard library. But its standard library always seemed simply evolutionary/"best practice" (not my favorite phrase, sorry) compared to what came before it.

Is the hashmap impl in the JDK actually novel for its time?



As I understood it, there was an implementation of "compact dictionaries" in Java, but it was not the implementation of any of the collections that were built into the language. It is "prior art" in that it was done before compact dictionaries were reinvented for Python.


IIRC it was for the collections library of the E language, whose first implementation was in Java (erights.org). The E designers had a self-imposed constraint that everything computational had to be deterministic. The standard way of doing iterable hashtables lacked this property.

I tried to check my memory by following the link to the Raymond Hettinger talk, but he doesn't say just what prior work he was talking about, only that it was in Java.


"An Efficient Representation for Sparse Sets" by Preston Briggs & Linda Torczon dates from Rice in 1993 covers the case where the index can be a direct map (for small key universes - like, say, the range of `char` or `short`, where one can be sure of no collisions/need to handle them).

The idea is simple enough that it could have been a subsection in some seminal early database work, but I also lack a reference for that.

Storing some (or all) of the hash code in the hash index part as this article mentions is a good idea I had independently. This is especially true if you do linear probing on the hash table part and care about worst case latencies. (You will basically never have > 2 full cache misses if the CPU BP/prefetcher can recognize a linear scan.)


It wouldn't surprise me. My memory is of two things:

- I was into E in the 90s and this idea was new to me then.

- I vaguely remember reading someone online mentioning that E was the known prior art to Hettinger's reinvention of this. But this dim memory is not reliable, so I wanted to check it.

And yeah, storing the hashcode is a technique I saw in the wild earlier than I saw the insertion-ordered hashtable.



