Diving into the world of hash tables (zeroequalsfalse.press)
160 points by majikarp on June 26, 2017 | 100 comments



If you care about performance, it's important to know the underlying representation of a dictionary data structure.

For instance, std::unordered_map in C++ is based around buckets implemented as linked lists of key/value nodes. This means iterating through the keys or values is very slow (lots of cache misses!), but insertion is fast. Retrieving a key is O(log n), but a very slow O(log n) because of the cache misses.

Another implementation is to keep a sorted vector of keys and a corresponding vector of values. Loki::AssocVector and boost::container::flat_map do this, for instance. Now insertion is slow, but iteration and retrieval are very fast (a fast O(log n) by binary search, with few cache misses). It's also memory efficient, since there are no pointers between elements.

If you have one big dictionary you would use throughout an application, and know the data up front, a good strategy is to reserve the necessary memory in an array, fill it with keys/values, sort it once over the keys and coindex the value array. Now you have a memory efficient and extremely fast dictionary data structure.
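
A rough sketch of that strategy in Python (the names here are purely illustrative), using bisect for the binary search over the sorted, coindexed arrays:

    import bisect

    class SortedCoindexedMap:
        def __init__(self, pairs):
            pairs = sorted(pairs)                   # sort once over the keys
            self.keys = [k for k, _ in pairs]       # sorted key array
            self.values = [v for _, v in pairs]     # coindexed value array

        def get(self, key):
            i = bisect.bisect_left(self.keys, key)  # O(log n) binary search
            if i < len(self.keys) and self.keys[i] == key:
                return self.values[i]
            raise KeyError(key)

    m = SortedCoindexedMap([("b", 2), ("a", 1), ("c", 3)])
    print(m.get("a"))  # 1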

One other strategy any intermediate coder can implement is to have two unsorted coindexed arrays. You don't even need a hash function for this. Now iteration through the table and insertion into it are extremely fast, and it is memory efficient, but finding a key is a (fast) O(n) scan. So this is good for smaller tables. In C++ you could implement it as a std::pair<vector<key>, vector<value>>. If you need a quick small map in a function, this is often the fastest data structure you can implement without too many headaches.
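
And a tiny Python version of the unsorted coindexed-arrays idea (again just a sketch), where lookup is a plain linear scan:

    keys, values = [], []             # two coindexed, unsorted lists

    def put(k, v):
        keys.append(k)                # insertion is just an append
        values.append(v)

    def get(k):
        return values[keys.index(k)]  # linear O(n) scan, fine for small tables

    put("x", 10)
    put("y", 20)
    print(get("y"))  # 20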


Chandler Carruth talks about unordered_map performance[1] in his CppCon 2014 talk "Efficiency with Algorithms, Performance with Data Structures". He endorses most of the other alternatives you mentioned.

[1] https://youtu.be/fHNmRkzxHWs?t=46m39s


> sorted vector of keys . . . reserve necessary memory and sort it . . . unsorted coindexed arrays . . .

Most of the things you mentioned are not hash tables, but members of a parent concept, dictionaries. Hash tables all by definition involve some sort of hashing of the key. The two main categories of hash table are chained hash tables (std::unordered_map does this, at least in the implementations I'm aware of) and open addressed hash tables, which use probing instead of secondary data structures to resolve conflicts.
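
To make the open-addressing side concrete, here is a minimal linear-probing sketch in Python (illustrative only: fixed size, no resizing or deletion, and it assumes the table never fills up):

    SIZE = 8
    slots = [None] * SIZE             # each slot holds a (key, value) pair or None

    def insert(key, value):
        i = hash(key) % SIZE
        while slots[i] is not None and slots[i][0] != key:
            i = (i + 1) % SIZE        # collision: probe the next slot
        slots[i] = (key, value)

    def lookup(key):
        i = hash(key) % SIZE
        while slots[i] is not None:
            if slots[i][0] == key:
                return slots[i][1]
            i = (i + 1) % SIZE
        raise KeyError(key)

    insert("a", 1)
    insert("b", 2)
    print(lookup("b"))  # 2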


You can implement the sorted vector of keys with a hash function instead of a coindexed vector of values. It still keeps most of the properties we like, especially if doing the coindex-sort operation is too expensive for some reason.


> You can implement the sorted vector of keys with a hash function instead of a coindexed vector of values

So you're saying you're going to hash the keys, then sort them according to the hash, with tie breaking on the key itself? I'm not aware of any sorted table that does this, but I'm sure some exist. I suppose you'd get something of a win if N was large, and the keys had long common prefixes, and you didn't care about the ordering property.

But in that case you'd probably use an actual hash table, not the algorithm you just described. Unless there's something I'm missing.


Sorry, I misspoke. Lookup is always O(1) in a hash table. But this could be a "weak dict", that is, you don't actually store the keys; as long as you can reference a key through its hash, you can look it up.

In the proper "data structure", we usually store the key and the value -- iteration through keys and/or values and/or pairs is probably a supported operation.

Finding a key in the structure (either with binary tree search, or binary search on a sorted array, or linear lookup on an array) varies. So does iteration and most other operations.


It's an interesting fiction that we tell everyone hash tables are O(1), when in reality hash operations are typically O(n) in the key size, which is O(log n) in the size of the value space; it's just much cheaper to have your own O(log n) operations be reading sequential bits rather than following pointers, reading whole keys, etc.

Probably not something people usually run into, but it does show that constants matter.


> a good strategy is to reserve the necessary memory in an array, fill it with keys/values, sort it once over the keys and coindex the value array

This explanation confuses me. Is there one or two arrays? What does "coindex" mean?


Equal indices refer to the same entity in all places. I.e., two arrays, one for the keys, one for the values. The value for the key at index n in the key array is found at the same index n in the values array.


You mean insertion into a bucket when there's a collision? And your n is the number of entries in the list? But n should always be very small relative to the total number of items in the hash map such that overall insertion and lookups are O(1).


I mean insertion into the dictionary in general.


I think you might be a little confused. Even in hash tables with chaining, one does not tend to spend much time traversing linked lists, because the typical bucket will have only one member. This depends on the load factor of the table, but most practical implementations will eventually rebucket and rehash if their load factor grows too high. "Getting the key loaded" -- i.e. finding the memory location that contains the key -- is O(1) on average in all practical hash tables. It does not typically require any traversal at all.

You keep on talking about ordered tables like red black trees, as in this comment, which is another sign that makes me wonder if you might be confused.


The early versions of String.hashCode in Java only looked at the first 16 characters of the string - when used for things like URLs this led to rather poor performance!

https://en.wikipedia.org/wiki/Java_hashCode()#The_java.lang....


Even if the typical bucket has 100 members, as long as this number is constant and does not go up with the size of the hash table, the performance is still O(1). And the same applies for cache misses. All these things don't really matter except if you are using hash tables with not very many elements in them.


> Even if the typical bucket has 100 members, as long as this number is constant and does not go up with the size of the hash table, the performance is still O(1).

If you mean in a chained hash table, where you chase up to 100 pointers to get the value, the performance is atrocious.

Friendly reminder: traversal of a linked list and a contiguous array are both O(n). In the real world one is two orders of magnitude faster than the other.

> All these things don't really matter except if you are using hash tables with not very many elements in them.

The lookup of a value given a key is probably the least affected operation. If all you care about is a "weak dictionary" (you only really need to store values and lookup from keys), all of this is mostly jabber. If you store the keys to iterate through, or access for some reason, all those things start to matter a whole lot.


> For instance, std::unordered_map in C++ is based around buckets implemented as linked lists of key/value nodes. This means iterating through the keys or values is very slow (lots of cache misses!), but insertion is fast.

Has this changed recently? Last time I used unordered_map, insertion was also slow because it had to allocate a linked list entry for every item in the hash table.


No, it's part of the standard. The STL guys can't implement it differently.

Here are a few alternatives:

MCT closed_hash_map

Sparsehash's sparse_hash_map or dense_hash_map

Loki::AssocVector

boost::container::flat_map


This is one of those data structures that everyone should try building at least once, kind of like linked lists, etc. Semi-unrelated, I built a couple of toy implementations a few years ago using the same basic ideas:

* This one is my original, written in something halfway resembling Scheme: https://github.com/arlaneenalra/Bootstrap-Scheme/blob/master...

* And this implementation is part of a half-finished bytecode Scheme that I haven't touched in a few years. Another project I need to get back to. Interface: https://github.com/arlaneenalra/insomniac/blob/master/src/in... Internals: https://github.com/arlaneenalra/insomniac/tree/master/src/li...

They were kind of fun to build and I'd recommend giving it a try, especially if you have some skill but don't think you have the chops. Getting a working toy isn't really all that hard once you understand the principles.
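
In that spirit, here is a small toy chained hash table in Python (just a sketch, nothing production-grade) that rehashes once its load factor gets too high:

    class ToyHashMap:
        def __init__(self, nbuckets=8):
            self.buckets = [[] for _ in range(nbuckets)]
            self.count = 0

        def _bucket(self, key):
            return self.buckets[hash(key) % len(self.buckets)]

        def put(self, key, value):
            bucket = self._bucket(key)
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)            # overwrite existing key
                    return
            bucket.append((key, value))
            self.count += 1
            if self.count > 0.75 * len(self.buckets):   # load factor check
                self._rehash()

        def get(self, key):
            for k, v in self._bucket(key):              # chain is typically tiny
                if k == key:
                    return v
            raise KeyError(key)

        def _rehash(self):
            pairs = [p for bucket in self.buckets for p in bucket]
            self.buckets = [[] for _ in range(2 * len(self.buckets))]
            self.count = 0
            for k, v in pairs:
                self.put(k, v)

    m = ToyHashMap()
    m.put("hello", 1)
    print(m.get("hello"))  # 1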


It is really a basic data structure in most modern languages, either built into the language or part of the standard library.

I could see it if this were published in the late '90s, but today, I don't know.

And just use the built-in ones; don't invent your own unless there is a very good reason for it.


> And just use the built-in ones; don't invent your own unless there is a very good reason for it.

Uhm, hash tables are one of these things where plenty can be gained by tailoring the structure to the application. It's not a one-size-fits-all, and it's not nearly as easy as it would seem to make a good general-purpose HT.


Built-in implementations usually are kind of slow, waste memory, have broken iterators [1], even in modern languages. If either of those things are important - it's better to do some research and maybe even invent your own.

[1] when you can't both iterate and insert items consistently in the same loop


"Usually", really? Which languages are you thinking of, and have you benchmarked them?


It's often true due to a few factors. One is that they are safe, whereas a specific use case may not need the same level of safety. They generally optimize for the general case as well, but in your own code you can optimize for the very specific use case you have. One I ran into many years ago was implementing my own singly linked list by adding the next pointer to another piece of data. In this one specific case it was worth removing another layer of indirection. I was still young though, so there were probably even better ways of handling it.

I have only encountered these scenarios a small handful of times, but I'm not a very low level developer.


Hash Table has basically been elevated to a general idea lately. Pretty much anyone that is looking for an association between keys and values is using what they will call a hash table. To that end, few people actually know anything about how they are implemented. (And apologies if this comes across as negative, I do not mean it as a value judgement.)

This was different back when you would pick something like an alist to store associations. There the implementation stared you in the face. Same for the times you had to implement your own hash table. I don't exactly yearn for those days. Though, I am curious if alists actually win in speed for a large number of smaller hash tables.


Depends on implementations of the alist and of the hash table. It is even fully possible that the small hash table will use an alist for the cases with few (and small) elements.


Fair. I was making a huge assumption that the alist would be implemented as you statically see it in code.

My point was supposed to be that an alist really has an obvious implementation, whereas a hashtable actually does not. My main objection being that there is a ton of glossing over what goes into an actual hashtable. While I would expect someone to be able to do a basic alist implementation, I have grown away from expecting folks to do a basic hash table.


"this is a data structure that is at the guts of basically every python object. Classes are dicts. Modules are dicts. Packages are probably dicts. Don't worry though, it's underrated!"

@_@


Python has its own very cleverly optimized dict implementation:

http://pybites.blogspot.com/2008/10/pure-python-dictionary-i...


hettinger's updates at pycon this year: it's better! https://www.youtube.com/watch?v=npw4s1QTmPg


Python (and Ruby and some others) form the category of hash table interpreters—since most things are hash tables.


I like both kinds of data structures- maps and lists.

(Just kidding, list is but a degenerate case of map)


Quick question: are there special hash functions that are optimized for use in hash tables? Or do typical hash table implementations in e.g. Python just use standard hash functions like MD5?

Edit: It's 100% clear now. Thanks for the great answers everyone!


Typically the hash functions that you are familiar with due to cryptographic or data consistency use (SHA family, MD family, etc.) do not make for good hash table choices because they produce hashes that are much larger than needed and are slow to compute, so that they have better cryptographic properties (extremely low collision, no information leakage about inputs, difficulty of guessing inputs). When picking a hash function for a hash table, you want a function that makes a hash just big enough and with low enough collisions while still being fast and easily dealing with variable length keys. This could be something as simple as byte-wise XOR or addition with some shifting as you iterate the key, followed by a mod or even a bitwise AND mask to pick an index.
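
As one concrete example of that style (a sketch, not the only choice), an FNV-1a-like hash in Python, masked into a power-of-two bucket count with a bitwise AND:

    def fnv1a_32(data: bytes) -> int:
        # Simple non-cryptographic hash: XOR in each byte, then multiply.
        h = 0x811c9dc5                          # FNV offset basis
        for byte in data:
            h ^= byte
            h = (h * 0x01000193) & 0xffffffff   # FNV prime, keep 32 bits
        return h

    NBUCKETS = 64                               # power of two
    index = fnv1a_32(b"some key") & (NBUCKETS - 1)   # AND mask instead of mod
    print(index)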


However, collision resistance must be still quite good for use in a general-purpose hash table or a HT that is possibly exposed to attackers, otherwise denial-of-service attacks become very easy.

Many "modern" implementations (Python, Ruby, Perl, Rust, Redis, ...) use SipHash with a random seed for this very reason.


In order to use a standard hash function, you would first need to serialize the object. I am not aware of any programming language that does this.

Instead, the approach taken by (at least) Java and Python is to define a "hash" function on objects that classes can override. The standard way of implementing such a function is to combine the hashes of the object's fields.

Python advises doing this by wrapping them in a tuple and returning hash((self.a, self.b, ...)). [0]

Java takes a similar approach, but does not make an explicit recommendation on how to implement hashCode(). In my experience, most programmers just XOR the hashes of the fields, which (depending on the object) could be very sub-optimal, but is often good enough. Based on the doc [2], the typical implementation for Object.hashCode is to just take the memory address of the object.

[0] https://docs.python.org/3/reference/datamodel.html#object.__...

[1] Pythons tuple hash function may be found here: https://hg.python.org/cpython/file/dcced3bd22fe/Objects/tupl...

[2] https://docs.oracle.com/javase/7/docs/api/java/lang/Object.h...
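
A brief Python illustration of the tuple-hash advice (hypothetical class with fields a and b):

    class Point:
        def __init__(self, a, b):
            self.a = a
            self.b = b

        def __eq__(self, other):
            return isinstance(other, Point) and (self.a, self.b) == (other.a, other.b)

        def __hash__(self):
            # Combine the field hashes by hashing a tuple of the fields.
            return hash((self.a, self.b))

    d = {Point(1, 2): "found"}
    print(d[Point(1, 2)])  # found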


Python has migrated from hg to git. The current version of tupleobject.c is here:

https://github.com/python/cpython/blob/master/Objects/tupleo...


Good question! Yes. Here’s an example: http://cyan4973.github.io/xxHash/


There's also FarmHash, whose 32-bit version is 2x as fast as xxHash on little endian machines (at least according to this benchmark suite https://github.com/rurban/smhasher).


Contemporary versions of Python (≥ 3.4) use SipHash for hashing strings:

https://lwn.net/Articles/574761/

OTOH, the function for hashing integers is extremely simple:

  >>> hash(42)
  42


I recall reading at some point about Go having the option to use hardware AES instructions for fast high quality hashes, which is cool.


"They didn't teach me about these in my coding bootcamp"


"An underrated data structure"

Uh......


This article explains "2-Level Hashing" which is apparently "dynamic perfect hashing." While interesting, I'm wondering how much it's actually used? Or is it just of theoretical interest?

[1] https://en.wikipedia.org/wiki/Dynamic_perfect_hashing


My data structures professor made a habit of proclaiming every few classes: "Give me a hash table, and I could rule the world"


I wonder how hash tables are "underrated". They're probably _the_ data structure, if there is one.


I think 90% of technical interview questions end up with a hash table in the answer. Underrated wouldn't be the word I'd use.


No they're not. And if they are, they are more advanced than this brief overview.


This word "extremely" has me skeptical from sentence one. Maybe I'm jaded.


wtf i love hash tables now! ;)


I think when the author says "underrated", what they mean is "I didn't realize how important this is". Hash tables are used everywhere, by everyone, for a lot of things.

Maybe they don't give it enough time in school for people to realize it is the king of practical software development.


The author of the article does not say underrated at all from what I can see; just something the OP included in his title. Your point still stands, though.


Yeah, that's a weird title! I assumed it must be the actual title of the post, but no.


Nearly every expression in high-level languages relies on multiple hash lookups. This is part of the reason these languages are regarded as "slow." I suppose you could use a tree in its place and get reasonable performance. However, the hash table's nature allows you to pretty much throw more memory at the problem in exchange for speed (though this is hardly unique to this particular data structure).

For instance, `a = foo.bar.baz` in Python involves 3 hash gets (local scope, foo's scope, then bar's scope) and a single set operation (local scope). This is part of the reason Python programs can be optimized by assigning a deep attribute lookup to a local variable outside of a loop: it yields improved performance relative to doing the deep attribute lookup inside the loop.

  a = foo.bar.baz
  for _ in range(20):
      print(a)
vs

  for _ in range(20):
      print(foo.bar.baz)


High-level is not a synonym for interpreted, dynamic.


>"For instance `a = foo.bar.baz` in Python involves 3 hash gets (local scope, foo scope, then bar scope), and a single set operation (local scope)"

Can you explain where exactly and why a set operation is performed? Thanks.


When the name `a` is assigned the value of `baz`, Python is setting a key name `a` in the local dictionary (hash table) `locals()`.

Basically `a = 1` is syntactic sugar for `locals()['a'] = 1`

  >>> a
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  NameError: name 'a' is not defined
  >>> locals()['a'] = 113
  >>> a
  113
  >>>
One interesting side-effect of this is you can assign names that are not valid Python syntax explicitly.

For example:

  >>> locals()['foo-bar'] = 1
  >>> locals()['foo-bar']
  1
  >>> foo-bar
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  NameError: name 'foo' is not defined
  >>> 
The name `foo-bar` can't be literally referenced, because the interpreter attempts to interpret it as the subtraction operation `foo - bar`.


This only works in global scope. It won't work in a function:

    def f():
        a = 5
        locals()['a'] = 6
        print(a) # 5
Inside a function, accesses/writes of locals use an internal array of local variables, to skip dicts. Same with constants. See:

https://github.com/python/cpython/blob/master/Python/ceval.c...


locals() is presented to the user as a dictionary, but is that the way cpython actually works with it internally? I've run into weird GC issues that imply it's not a normal dictionary and it's just presented to the user as one for programmatic access.


No, that's not how it works internally. The Python compiler resolves all local variables at compile-time. You can see this for yourself by defining a simple function f, importing dis, and calling dis.dis(f).

Global variables use a dictionary, however. The disassembly actually looks similar for both.
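
A quick way to see the difference (exact opcode names vary a bit between CPython versions):

    import dis

    x = 1              # global: looked up in a dict at runtime

    def f():
        y = 2          # local: resolved to an array slot at compile time
        return x + y

    dis.dis(f)
    # The output shows LOAD_GLOBAL for x (a lookup by name)
    # but LOAD_FAST for y (a direct index into the locals array).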


That's a good question, I'm pretty sure it's not a normal dictionary. However, I'd have to go through the CPython source to confirm. Maybe someone who's more familiar with CPython's implementation will chime in.


It's easier if you look at the bytecode produced for the expression, which is:

    1           0 LOAD_NAME                0 (foo)
                2 LOAD_ATTR                1 (bar)
                4 LOAD_ATTR                2 (baz)
                6 STORE_NAME               3 (a)
                8 LOAD_CONST               0 (None)
               10 RETURN_VALUE
The CPython implementation is a stack-based virtual machine. The first instruction here, LOAD_NAME, pushes the value bound to the name 'foo' in the current scope to the top of the stack. That's at least one hash table lookup, since the scope is a hash table mapping names to values. Then LOAD_ATTR replaces the top-of-stack value with the value of its 'bar' attribute. That's another hash table lookup, since attributes are a hash table mapping names to values. Then another LOAD_ATTR to get to 'baz', that's another hash table lookup. Then STORE_NAME binds the value at the top of the stack to the name given as an argument ('a'). That's an insert or update of a hash table.

So, the expression 'a = foo.bar.baz' involves at least three hash table lookups and one insert or update.


Since Python is interpreted, it keeps a dictionary/hash table of the local and global variables. These can be accessed by the functions `locals()` and `globals()`.

`a = ... ` can be thought of as `locals()["a"] = ...`

(although you should not actually modify the local variables this way)


The way you phrase that makes it seem as if the language being slow is somehow the fault of hash tables when it's the use of dynamic linking.


That's because it is the hash tables.

You can have dynamic binding (assuming that's what you meant, dynamic linking is something else) with way fewer hash lookups.


Really. Hash tables are one of my most used data structures.


In fact, what performance oriented devs need to be taught about these days is how to not use hash tables, or, more generally, dictionaries. They are a super-easy data structure to reach for, but often it is possible to arrange your data such that their use is unnecessary.

One example that was brought up elsewhere in the thread is the way python looks up variables in successive scope dictionaries. This is obviously terrible for performance, and that's a big part of why Python is slow.

But how are other languages fast? Doesn't every language have to resolve variable names to memory locations? Well, yes, but fast languages do this mostly at compile time. How? Well, I'm not an expert in this, but at a high level, each variable is assigned a location in memory or registers, and then future references to that variable are rewritten to refer to the memory location by register name or memory address. This takes the whole "looking up a name" issue out of the path of work that has to get done at runtime. And you've switched from looking up things in tables to operating directly on registers and memory locations.

BTW, this has nothing to do with high-versus-low level. It's more about how mutable the process of name resolution is after program compilation. One could theoretically write an assembly language where memory locations are names, not numbers. If reflective features like runtime scope editing are available, this would be a very low-level language that still requires some kind of dictionary lookup at runtime.


> In fact, what performance oriented devs need to be taught about these days is how to not use hash tables, or, more generally, dictionaries. They are a super-easy data structure to reach for, but often it is possible to arrange your data such that their use is unnecessary.

A lot of devs are completely unaware of what's happening at the layer of abstraction below them and this is one of the ways that comes out. The number of elements it takes before hash tables are faster than iterating through an array can be surprisingly large and yet it's one of the first "optimizations" that get made.

Some other related problems are not knowing how expensive network calls are, not knowing what the ORM is doing, caching in memory even though the database already is and more. They just don't think about performance issues in code they don't write.


99.999% of projects are not going to have any meaningful hot path in variable resolution.

If your program is sensitive to that, even a simple GC pause is going to destroy your performance and you need to be out of managed memory languages at that point.

There are a lot of reasons python can be slow, but this is far from one of them.


> 99.999% of projects are not going to have any meaningful hot path in variable resolution.

It won't show up in the hot path because the performance cost is so pervasive that profiling tools will ignore it; it's everywhere and you can't escape it without compiler optimizations. This cost will be in both the hot and cold paths. This and data locality are the two biggest performance issues in dynamic languages, and a lot of effort goes into compiling them out.

Here is a good article on how V8 tries to deal with it: https://www.html5rocks.com/en/tutorials/speed/v8/

For statically compiled languages it can show up, but often you'll have to write a dictionary-free version to see it. Some profiling I've done with C# at various times over the years shows that a dictionary is slower than a list until you have more than 50 to 100 items. The caveat is that I normally keep the dictionary version because it's more semantically correct.


Eh, one of the standard performance tips for hot loops is to store everything relevant, including functions, in local variables because the interpreter looks in the local dictionary first. The last time I optimized Python code that made a significant difference.


To be clear, I don't think this is the only thing that makes python slow. It's probably one of the top five or ten things preventing it from being fast, though.


> Well, I'm not an expert in this, but at a high level, each variable is assigned a location in memory or registers, and then future references to that variable are rewritten to refer to the memory location by register name or memory address. This takes the whole "looking up a name" issue out of the path of work that has to get done at runtime.

Python does this for local variable names in functions. Because of this, moving a tight loop from global scope to a function can make it a couple percent faster.


Are you serious that dictionaries are a problem for performance oriented developers?

Performance oriented devs should be concerned with bottlenecks, not incredibly minute details. There's almost no situation I can think of where smallish dictionaries are much better or worse than any other data structure when it comes to performance.

Of course, if you're writing a compiler then it can be a serious difference. Most developers don't write compilers though.


Yes, if you write performance sensitive code you have to be very careful with dictionaries. See for example this nice talk by Chandler Carruth https://www.youtube.com/watch?v=fHNmRkzxHWs


Well of course. :) If you don't write perf sensitive code, don't worry about perf. But if you do, in many cases avoiding hash tables can become important.


Agreed. I mean, for example, doesn't the entire Lua language revolve around hash tables?


Same with PHP.


OK, we changed the title to the first sentence of the article.


Mods, can we get a title change? As I write this every comment is taking issue with the "underrated", which isn't claimed in the article.


Though, in this case, changing it from the original does seem warranted. (This was just another bad choice...)


What's wrong with leaving it unchanged? The blog post title seems fine.


You can mail the mods and ask them, they usually fix these fairly quickly.


It's actually sets that are underrated. All languages should come with a reference implementation.

Hashtables aka Maps aka Dictionaries aka Associative arrays are just fine.


Though once you have hashes/maps/dicts, sets are within reach by making sets from hashes where keys are the elements and values are just true or 1 or something like that.

But I think you probably meant having and using set operations effectively in day to day tasks, as in "make 2 sets and do a set difference operation" instead of "do a for loop on the first hash, check if each key is in the second, then put the results in an accumulator".

Another thing is to think about sets of sets. Can that be useful sometimes? Implementing that is slightly trickier. You'd need to be able to get a hash of a set. Python has frozenset https://docs.python.org/3/library/stdtypes.html#frozenset. I've used those on occasion.
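
A quick illustration of frozenset making a set of sets (and a set used as a dict key) possible:

    >>> s = {frozenset({1, 2}), frozenset({2, 3})}
    >>> frozenset({1, 2}) in s
    True
    >>> d = {frozenset({1, 2}): "pair"}   # usable as a dict key too
    >>> d[frozenset({2, 1})]
    'pair'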

Then of course there is Erlang sofs (sets of sets) module. Stumbled on it by accident. Oh my, it comes complete with an introduction to set theory and relational algebra:

http://erlang.org/doc/man/sofs.html

It just struck me as so out of place with the rest of the standard library modules. Would like to know its history


Yes, while a set can be derived or mimicked if you have a hashtable for example, I'm specifically referring to all the common set operations you'd typically want to have.

Using a map/hashtable as a set is just the tip of the iceberg.

In many cases you don't need a key and a value and you can get rid of a lot of loopy and complicated code if you had a simple set to utilize with its useful operations.


>"Though once you have hash/maps/dicts set are with reach by making sets from hashes were keys are the element and values are just true or 1 or something like that."

Can you elaborate on how you can derive a set from hashes by using values of True or 1? Might you have a link? Thanks.


> Can you elaborate on how you can derive a set from hashes by using values of True or 1? Might you have a link? Thanks.

Sure. Np!

I mean that a simplified set is just a hash where the elements of the set are the keys of the hash table and the values can be anything. I used 1 or True as an example.

As in adding an element would be:

   my_dict[element] = 1
Then membership check is:

   if element in my_dict
Then removal is deleting:

   del my_dict[element]
and so on.

In other words, the reason sets are sometimes not explicitly there is because they are easy to implement on top of existing data structures.

Basic operations like union, difference, intersection between two sets can be done with a few simple for loops.

But like I mentioned in another comment, there is one interesting aspect to sets (and hashes): the elements now have to be hashable. That kind of depends on how mutability and identity work in the particular language.
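
A sketch of those few simple for loops, with plain dicts standing in for sets:

    a = {"x": 1, "y": 1}
    b = {"y": 1, "z": 1}

    union = dict(a)
    for k in b:
        union[k] = 1                  # keys: x, y, z

    intersection = {}
    for k in a:
        if k in b:
            intersection[k] = 1       # keys: y

    difference = {}
    for k in a:
        if k not in b:
            difference[k] = 1         # keys: x

    print(sorted(union), sorted(intersection), sorted(difference))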


Thanks for the clear explanation, this makes total sense. Cheers.


Ruby's standard library does this. Here's the code: https://github.com/ruby/ruby/blob/trunk/lib/set.rb


Python essentially does this as well.

The idea is that an item is in the set if it's in the hash table. You can add, remove, or test for membership in constant average time.

Set operations still need to be built on top of this.


I would argue that Hashtables are a specific implementation of the abstract datatype "map" or "set".


Sets are just like hashes where the value is always "true" for each key.


While that's true you could implement one that way, it's very nice to have the set operations implemented. Intersection, union, subtraction, etc. And having a uniform set type makes API signatures more consistent.

Plus you may be able to optimize Set<T> to use less space than Map<T, bool>.


> While that's true you could implement one that way, it's very nice to have the set operations implemented

Unrelated, but does anyone know why the new JavaScript set implementation is so limited? Why didn't they bother doing this right?


I think I remember reading a claim that they pushed for a small API surface in order to make sure it got through. Now that it's part of the language, anyone else can work on getting those extra features in.

You can implement most basic functionality easily enough, MDN even has an example [0]. Although I agree that it should really be part of the language.

[0] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...


Because it is JavaScript, and there's some kind of unspoken rule about not doing things properly and instead releasing broken things.


This exactly, I hear over and over: just use a hashtable...totally misses the point.


Does it really matter what the 'value' is? A set is surely implementable as a supertype of a hash, where the value is totally arbitrary; it only matters that the entry exists.

With:

    {'A': true, 'B': false}
you seem to be suggesting `B` is 'not in the set'. What's `C`?


Not in the set. You're describing the same thing as the parent comment, but you're saying "arbitrary value" which they substituted for "True"


They're not really the same thing then. In OJFord's implementation, there's a one-to-one correspondence between states: the element is a member of the set iff it exists as a key in the hash table.

In the true/false implementation, two distinct hash table states (false or key not set) map to one set state (not a member). The program must check whether the key is set, and then check its value.



