The Universal Data Structure (elbenshira.com)
189 points by elbenshira on June 30, 2015 | 99 comments



"Hashes are always O(1) reads, inserts and writes."

Maybe, once you've found a location to read, insert, or write to. The author neglects the runtime cost required for the hash algorithm itself, which may not be trivial; computing the hash of a string key is typically an O(n) operation.

Furthermore, unless a suitable table size is selected, integer keys (should one use the map like an array) will eventually hash to the same bucket, requiring even more time to iterate through the colliding entries until the desired value is found.
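To make that cost concrete, here's a rough sketch of a djb2-style string hash (just an illustration, not anything from the article). It has to touch every character of the key, so the hashing step alone is O(k) in the key length before a bucket is ever probed:

    // Illustration only: a djb2-style string hash.
    // The loop visits every character, so this step alone is O(k) in the key length.
    function hashString(key, tableSize) {
      var h = 5381;
      for (var i = 0; i < key.length; i++) {
        h = ((h * 33) ^ key.charCodeAt(i)) >>> 0;   // keep it an unsigned 32-bit value
      }
      return h % tableSize;   // bucket index
    }

    hashString('hello', 64);  // some bucket in [0, 63]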

"I don’t know why you would ever use [a linked list] over an array (or a hash)..."

Here's why: because arrays take up space whether you use it or not. Linked lists don't suffer from this problem, but at the cost of 1-2 pointers per item. Has the author seriously never managed memory before? Please tell me this article is a joke.


It's even tagged "bad-theory." I think it's pretty clearly a joke! A really good one!


"And we can’t forget our favorite JavaScript interview question of all time: If you only had twenty-four hours to implement arrays in JavaScript, how would you do it?"


Aren't they just objects that have integer properties? That's why you can

    for (k in ['a', 'b']) {
      console.log(k);
    }
and get back 0, 1? However:

    var a = {0:'a', 1:'b'};
    for (k in a) {
      console.log(k);
    }
also outputs 0, 1.

Edit: turns out, you can even have doubles, too:

    var weird = {3.14:'hello', 6.28:'world'};
    // for loop above emits: 3.14, 6.28
    console.log(weird[3.14]); // emits 'hello'


> Edit: turns out, you can even have doubles, too:

That's because Javascript doesn't actually have integers, it just has "Number":

> The Number type has exactly 18437736874454810627 (that is, 2^64 − 2^53 + 3) values, representing the double-precision 64-bit format IEEE 754 values as specified in the IEEE Standard for Binary Floating-Point Arithmetic

http://www.ecma-international.org/ecma-262/5.1/#sec-8.5


JavaScript objects can only have string keys (this may have changed in ES2015 with Symbols). If you pass a non-string value in an object literal or inside a []-accessor, that value is converted to a string.
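You can see the conversion directly (just a quick sketch):

    var weird = {3.14: 'hello'};
    console.log(Object.keys(weird));              // ['3.14']: the key is stored as a string
    console.log(typeof Object.keys(weird)[0]);    // 'string'
    console.log(weird[3.14] === weird['3.14']);   // true: both accesses stringify to '3.14'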


I'm not really versed on JavaScript. I presume this is a joke about the lack of a proper array structure in JavaScript? How would one answer this question anyway?


This is probably intended to be a joke about how Javascript was originally developed over about 10 days, which has been the root of many of JS's problems.

http://www.computer.org/csdl/mags/co/2012/02/mco2012020007.p...


But this heritage produced, years later, a great presentation about the evolution of JavaScript... that was best viewed without JavaScript.


While it's probably a joke, JavaScript's Array is fine (although it's more like a 'vector'). JavaScript VMs implement it as a real array (in fact, plain objects are often optimized into arrays too).


I think I came around to that realization when he starts comparing Postgres to HTML tables.


sounds reasonable. normalization vs denormalization

ok, the last paragraphs kind of give it away... but there are good points in terms of keeping the data flat


>Maybe, once you've found a location to read, insert, or write to. The author neglects the runtime cost required for the hash algorithm itself, which may not be trivial; computing the hash of a string key is typically an O(n) operation.

Yeah, I know, it always came off to me as BS to repeat that hashes are O(1) lookup, like we're making a special exception. Anywhere else, you can look up the algorithm and derive its big-O without having to memorize, but not this one.

Pretty clearly, as the hash table grows linearly, the keyspace has to grow linearly and the key size logarithmically. Since you get the keys from the output of a (nice, well-mixed) hash function, then the computation for where to insert has to increase logarithmically -- a longer hash output requires more steps.

You only get O(1) by selectively ignoring one kind of asymptotic behavior (that of the computations needed for the hash).

Reddit discussion: http://www.reddit.com/r/compsci/comments/2z74z8/why_is_hasht...


This is mostly an issue with terms. Most (all?) of the operations we call constant-time, say, comparison, are technically logarithmic in the number of bits on real computers.

Since that's usually not relevant to big-O analysis, we can sidestep the issue by specifying what we're counting: rather than say that mergesort runs in O(n (log n) log(log n)) time, we say it takes O(n log n) comparisons.

Whether you think of that as a theoretical computer with a constant-time comparison instruction or as a real computer with a commonsense (2^64) upper bound, the upshot is that we don't care. As long as what we're studying is sorting, not comparison, it will fall out of every algorithm, so the difference isn't important even in theory.

It's still an important thing to notice, though-- there are plenty of cases where you definitely do care about counting bits.

In more detail: http://cs.stackexchange.com/questions/1643/how-can-we-assume...


But those situations aren't the same -- comparison sorts take O(n log n) because it is expressed in terms of n, the number of elements, which is orthogonal to element size. Even if you did account for the max element value V, it would be constant with respect to that value. The full bound would then be n log(V) log n -- regardless of your choice of V, you don't affect the scaling with respect to n.

But for hashtables, what exactly are we studying that doesn't care about hash time? The whole reason that a hashtable is supposed to save time is that you locate the desired key's value by computing the location from the key itself. To whatever extent that computation requires more steps, that cannot itself be abstracted away -- if only because a limitation on key size is a limitation on table size.


That example may have misfired-- perhaps consider that sorting an array of n elements requires addressing n elements, which requires arithmetic on words of log n bits, which are therefore log time operations. Limiting the size of each element won't help your complexity.

But we really don't care about that very much. In this case.

It's kind of described in that SO post I linked-- you can assume a constant-time hashing operation, but if that offends your conscience, assume an "integer RAM" model where arithmetic operations up to a certain word size w are constant time. Then you can observe that any reasonable w is inconsequential to your problem domain, and go back to treating hashing as a constant-time operation :)

The idea is that the fact that w increases at least as the log of n is shared by all computations modeled in integer RAM, so it's not a useful contribution to analysis at that scale. It's in the model, it's just not generally useful to the model. If you ever find yourself asking, is the bit width relevant? You can do some math and get a straight answer.

Of course real world algorithms often hash strings which complicates the analysis, but that's orthogonal, like my misstep earlier in implying the size of an element rather than the size of n. The mathematical sense in which hashing time must increase with the log of n is a fact of all algorithms that have ns.

I think your surprise at this just stems from wanting a rigorous definition for something that's more of a modeling tool, with different models to choose from. In real life, there's the time on the wall since you clicked "Run", and everything else is theorygramming.


You can't have it both ways: on the one hand, say that big-O is about arbitrarily large unrealizable values, but on the other hand make a bunch of exceptions for where things don't matter in practice, lest you be theory-gramming.

If you go the route that requires you to handle arbitrarily large tables, then computing a hash with a long enough value necessarily requires more steps, even if you assume all operations on length w integers are constant time -- because eventually your keys have to be larger than w, after which computing the hash increases in number of steps with the key size.

This is still different from the sorting problem. You can have a consistent computation model in which memory seek times are constant (even if unrealizable). That seems fundamentally different from a model where a hash computation is constant steps but also has arbitrarily long output.

The issue isn't whether the scaling of the hash computation matters in practice, but whether the big-O is properly O(1). And even if you did just care about "in practice" limitations, you're stuck with the problem of the hash computation taking longer than any realizable trie lookup would (as another poster mentioned is common), meaning that the O(1) claim misrepresents performance relative to another data structure.


Edit: Ugh, the old version of this post turned out really lectury, and I'm still working on my own intuitive understanding here, so I'm just gonna start over.

Basically-- you're not wrong. It's a quirk of analysis that a hash table is allowed to take the modulo (for example) of an arbitrarily large number, producing an arbitrarily large number, in "constant time", while a trie is not allowed to descend to an arbitrary depth in constant time.

But saying that such a thing is possible is a simplifying assumption that we make, in general, for non-numeric algorithms: arithmetic is O(1).

Of course, real programs cannot do this for arbitrarily large numbers. So if you are actually concerned with arbitrarily large numbers, you must do something else. Thus, other sorts of analysis which would give you a different number for hash table complexity. Yay!

And if you're having difficulty using the reported big-O value under the canonical model to compare algorithms which have very similar amortized performance in the real world, your problem is just that big-O is not for that thing you're doing.


Thanks for taking the time to revise and give a thorough and fair reply.

To give background, my main concern is with the exceptionalism: the defining (good) thing about STEM is that there's actual logic to the discipline, where if I forget an answer, I can re-derive it. It's not just a game of memorization and regurgitation.

But O(1) for hash lookup feels like the old soft-sciency "hey, you just have to memorize this". I don't have one consistent model that lets me derive bounds, and which produces O(1) for hash lookup and O(log n) for trie lookup. The other convention of constant seek times is unrealistic [2], but it's at least consistent. This isn't.

Rather, it's supposed to be a big-O table. If these were presented as "in-practice" bounds, with its own consistent logic about which resources are scarce relative to which, that would be a different story. Then I would have a consistent set of conventions that produce O(1); it wouldn't come off as "oh, we assume the rules don't apply here".

>But saying that such a thing is possible is a simplifying assumption that we make, in general, for non-numeric algorithms: arithmetic is O(1).

Where? Every treatment I've seen takes arithmetic as linear in the number of bits. [1] This is still consistent with O(n log n) for sorts because you can assume a bound on element size, and still only have constant factor overhead.

>Of course, real programs cannot do this for arbitrarily large numbers. So if you are actually concerned with arbitrarily large numbers, you must do something else. Thus, other sorts of analysis which would give you a different number for hash table complexity. Yay!

But (as above), the problem is that it's not even clear what sort of problem class we were working in that lets you assume this particular thing can be bounded but others can't.

[1] https://en.wikipedia.org/wiki/Computational_complexity_of_ma...

[2] Pretty sure that in the physical universe, memory seek requires n^(1/3) because you have to cram it into a volume whose size increases with the cube of longest distance between addresses, and travel time is linear in distance.


> Where? Every treatment I've seen takes arithmetic as linear in the number of bits. [1] This is still consistent with O(n log n) for sorts because you can assume a bound on element size, and still only have constant factor overhead.

This is probably the only point of disagreement here. If I understand your critique of the O(1) result, it's that addressing n elements takes arithmetic (hashing) on numbers of log n bits.

ISTM that, assuming an integer RAM model, that same critique must apply to any input. A mere pointer into an array of n elements has log n bits, so just incrementing and testing the pointer can't be O(1), and even iterating that array can't be O(n).

It's still hard for me to see any reason why the one or two arithmetic operations in a for loop can be O(1), while the larger but still constant number used by a hash function can't.


>This is probably the only point of disagreement here. If I understand your critique of the O(1) result, it's that addressing n elements takes arithmetic (hashing) on numbers of log n bits.

It's not the addressing I object to -- this whole domain consistently assumes O(1) seek times. It's the computation of the hash itself. A hash function that, by supposition, minimizes collisions is going to have to avalanche the whole thing such that it has to perform a complex calculation -- not mere lookup -- that incorporates multiple combinations of the elements. The growth in cost is not from doing seeks over bigger ranges but from computation. You can no more assume away that growth than you can assume that md5 is just as expensive as SHA256.

>ISTM that, assuming an integer RAM model, that same critique must apply to any input. A mere pointer into an array of n elements has log n bits, so just incrementing and testing the pointer can't be O(1), and even iterating that array can't be O(n).

I never objected to this before but I've always disagreed: iterating needn't be log(n) because any such procedure can be optimized away to get the next element via either a) a seek, or b) incrementing a value, which is constant amortized (because bit rollover is exponentially rare).


Since we call cases where something looks polynomial on the surface but is actually exponential in the input size "pseudo-polynomial," does it make sense to call the cases where something more or less takes constant time "pseudo-logarithmic?"


Well, there are terms linearithmic and quasilinear that already get some usage.

> https://en.wikipedia.org/wiki/Time_complexity#Quasilinear_ti...


Or maybe, since O(1) is in O(log n), you can cover your bases by just calling everything logarithmic regardless :)


How about we refer to this as O(k) instead of O(1)?


O(n) in the size of the string not the hash map.

Also, if the key space is known beforehand, it is possible to build a perfect minimal hash table with about 1.3n total slots (n is the size of the keyspace)


> Linked lists don't suffer from this problem, but at the cost of 1-2 pointers per item.

Another cost is the cache misses. If you are doing operations in a sequential fashion, using an array would have a significant advantage if the number of elements is high. That's another tradeoff to take into consideration.


It can be avoided by choosing a good allocation strategy for the linked list: allocating nodes in an arena can eliminate cache misses.
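Roughly what I mean, as a sketch (names made up): pack the nodes into contiguous storage and link them by index instead of by heap pointer, so sequential nodes tend to sit in the same cache lines.

    // Sketch: a singly linked list whose nodes all live in one contiguous arena.
    // Values and next-"pointers" are parallel typed arrays; -1 marks the end of the list.
    function ListArena(capacity) {
      this.values = new Float64Array(capacity);
      this.next = new Int32Array(capacity);
      this.used = 0;    // next free slot; no growth handling in this sketch
      this.head = -1;
    }

    ListArena.prototype.pushFront = function (value) {
      var i = this.used++;          // "allocating" a node is just bumping an index
      this.values[i] = value;
      this.next[i] = this.head;
      this.head = i;
    };

    ListArena.prototype.forEach = function (fn) {
      for (var i = this.head; i !== -1; i = this.next[i]) {
        fn(this.values[i]);
      }
    };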


You'll still have more than if you had used an array, due to the low information density (2 extra pointers / node).


computing the hash of a string key is typically an O(n)

This is a good point, though to be fair, this aspect of it is usually also ignored when analyzing the main alternative data structures, which are balanced binary trees. At least every successful lookup in such a tree with strings as keys also has to read the entire string.

This means that hashes probably still have an advantage over balanced binary trees. Tries, on the other hand, could really beat hashes when the keys are strings. In practice, they might have a cache and branching disadvantage, though.


Actually, modern tries can have good cache behavior, especially at larger sizes. Hash maps inherently store data haphazardly inside themselves while tries keep it in order with similar keys near each other in memory.

At the very least, this makes it easier for your algorithms to be more cache friendly, because tries' cache behavior is predictable in a way that hash maps' isn't.

This is even more relevant for branch prediction: hash functions (by design) are harder to predict, giving the lookup algorithm of a trie an advantage.

A cache aware design like "adaptive radix trees" delivers really good performance in practice, comparable or better than hash maps while also supporting fast iteration and certain queries (like iterating over all keys with a given prefix).

Take a look at the adaptive radix tree paper[1]: the idea is very accessible, and the resulting performance very impressive. They also generalize the system to work with different types of keys through a systematic method for turning values into binary strings.

[1]: http://www3.informatik.tu-muenchen.de/~leis/papers/ART.pdf


I don't know why the complexity estimates I find for Hashes are always so bad. They never account for growth (or depending on the implementation shrinking on delete), never account for the hash function, etc.

By the time you hash a key, you could have likely already inserted it into a trie.

Lookups on hashes are also not O(1) for similar reasons. You have to hash the search string, then compare it against the key stored at whatever location it hashed to (usually a string comparison, which isn't O(1)), and, depending on the collision strategy, do more things if it doesn't match but isn't an empty value.
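As a rough sketch of those steps (reusing the hashString sketch from earlier in the thread): the hash and the string comparison are each O(k) in the key length, and the probe loop is where the collision strategy kicks in.

    // Sketch of an open-addressing (linear probing) lookup.
    // table.keys / table.values are parallel arrays; undefined marks an empty slot.
    // Assumes the table is never completely full, so probing terminates.
    function lookup(table, key) {
      var size = table.keys.length;
      var i = hashString(key, size);          // O(k): hash the whole key
      while (table.keys[i] !== undefined) {   // probe until we hit an empty slot
        if (table.keys[i] === key) {          // O(k): string comparison
          return table.values[i];
        }
        i = (i + 1) % size;                   // collision: try the next slot
      }
      return undefined;                       // key not present
    }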


When you say O(something), the something has a unit.

Hash table lookups and insertions take time O(1), when talking about number of items already in the hash table - and that 'when talking about' is implicit and doesn't have to be said as anyone talking about the complexity of a hash table knows that, or would state otherwise as it would be an exception.

Talking about the length of strings used as keys in the hash table is therefore nonsense when we have already agreed that we're talking about the number of elements in the hash table, as length of the strings isn't a parameter in that function.

And they never account for growth - yeah, it's amortised isn't it? That's what we wanted to do when we do an O().

I mean what you are proposing would be an interesting study - the complexity of hash tables parameterised for all sorts of different properties such as the size increase function or the load factor or whatever, but it's not what anyone means when they talk about complexity of a hash table unless they state otherwise.

So nobody's getting it wrong or making a mistake. They're using some assumptions that everyone's already agreed on and know about. If you want to talk about the complexity of something else you need to clarify.


I have two kinds of nits with this logic, but I could totally be wrong, and you should feel absolutely free to correct me.

I'm fairly positive that a unit of measure should _never_ be variable; otherwise it's fairly pointless. And if you don't think hashing a 1TB string takes significantly longer than a 100-byte string ...

Furthermore, big O notation is supposed to be a wide upper bound, but I don't think that works well if it's not actually an upper bound. If you were to tell your bosses something took constant time, and it took an hour for string A (1TB) and 100ms for string B (10B), I'm pretty sure your opinion wouldn't mean much after that.

The hashmap should absolutely be considered in terms of the hash function, because it's the longest-running part of the algorithm. To do otherwise is disingenuous.

Using that logic, I could call any iterative algorithm O(1) since the top level function only gets called once.


I think the problem you would have with your boss would be that your boss asked you 'how long will this program run', and if you told them O(1), the question you are really answering is 'what is the time complexity of this algorithm, parameterised by the number of entries, as the number of entries tends towards infinity'.

If your boss really wanted O(), then they wouldn't care that hashing one key takes a day and another a second, because they're thinking in terms of a hash with infinite entries, so the difference between a day and a second to hash is irrelevant.


If you released software that was exponential, but your QA department only ever tested small inputs, I think it would be negligent to withhold from your boss and/or clients the rate at which run-time could grow.

And I think it would take a horrible manager not to care about the difference between a day and a second.


Yeah I agree but my point is none of what you are talking about has anything to do with O(), which doesn't attempt or claim to do what you want to do.

It's like someone gave you a hammer and you're saying it's broken because it doesn't cut wood very well. It's not designed for that.


To implement the hash table you wouldn't have to hash the whole string... of course, this will depend on the data that you are trying to store. Assuming that the data is random, 100 bytes vs. 100 terabytes, you only need to figure out which bucket the data is saved in. You could still base it on this concept if slightly modified.


Yes. A valid hash function certainly can be defined as F: N -> N, F(n) = n % 100

But that certainly cannot be considered a reasonable hash function. A string is basically an array of bytes (or code-points, in case of UTF-8).

To have any decent property (like, producing different outputs for miniscule changes in the input), you have to touch every element in the array.

For custom objects, yes, you don't have to hash every property, but for strings, yeah, the hash function will almost always depend on the length of the string.


Part of the problem when estimating hash complexity is that what's usually considered is something that's basically memory + offset. Basically a few mov's and an add.

However, this is absolutely dominated by the complexity of the hashing function, growth (which can be amortized, but is not O(1)), deletion (also not O(1)), and comparison functions (which are usually O(mn) or O(n), or something similar, depending on the implementation).

We end up measuring the things that are the most minimal in hashing and pretending like the expensive operations, which aren't O(1), don't exist. This is wrong and dishonest when considering algorithms. We know for example that string comparison is not O(1), and it's usually a part of most hash table algorithms, and yet it magically disappears when analyzing hash table complexity and everything is somehow supposed to be considered as a mem+offset which is stupid.

Hash-tables also usually have exponential memory growth complexity which nobody ever pays attention to. Of course there are versions which don't do this and/or have fixed memory sizes, but the random kind you find in most languages grow O(n^2) or similar. And this is also ignored and we pretend it doesn't happen and resizing magically becomes part of the "1" in O(1)....even though copying arrays isn't free and as resizes happen arrays get bigger and big-O should get worse.

Hash-table operations can be pretty expensive, and faulty big-O analysis doesn't help. So sure, for a static, fixed-size hash table with no growth, instantaneous hash functions that use temporal oracles, and no collisions to deal with, O(1) is correct. But these hash functions pretty much don't exist in nature, and most programmers won't be working with them. The standard libraries for most languages sure as hell don't implement such ideal hash tables.

O(n) is at least an honest approximation. I honestly have never seen a well considered analysis of hash function complexity but I know it grows super-linearly from just using them a lot. This kind of fanciful analysis doesn't help anybody.

So yes, O(something) where something has a unit. But the unit sure as heck isn't whatever "1" is supposed to represent. It's probably closer to string length or array length (or some combination of the two), but a single "add" it is most definitely not.


All of what you're saying is true, about the speed of real algorithms running on real hardware with real problem sizes.

But if you're talking about big-O, you are explicitly not talking about that. You're talking about how the speed of the algorithm hypothetically scales as some parameter tends to infinity.

To wit, O(n) doesn't mean "this algorithm takes kn time to run for a given n", it means "this algorithm's runtime for all n > c is bounded above by kn for some c and k".

Sound like an analytic club that's rarely accurate to real-world performance? Yup, that's big-O :)


Right, so I'm asserting that the parameter usually called for in big-O analysis of hash table operations is the wrong one since it measures the lookup complexity and not the hashing complexity, which is usually O(n). And this results in meaningless complexity analysis which gives you things like "Insert is O(1) on Average but O(n) Worst Case or Θ(n)".

This analysis is using the hash table length as the parameter under consideration, but that's silly, because most of the complexity of hash-tables is in the hashing which (depending on the hash function) usually has a known complexity of O(n). Where n is the length of the input string.

This is far more important in hash table complexity analysis than table size because this part of the operation dominates runtime.

You can also do multi-parameter big-O analysis. O(mn) is a thing, for example, with two differently sized parameters, both of which contribute to the complexity of the algorithm and can't be dismissed.

So charitably, if you need to provide a big-O for say, hash table inserts, it's reasonable to say O(mn) where m is the length of the table and n is the length of the input string, but it's not necessary since table length usually has little contribution to the complexity of the algorithm...hence why people keep saying inserts are O(1) on average, because table length isn't much of a determinant in hash table operation complexity. Just like big-O hides constants, we can hide parameters that are dominated by another. O(1) is doing this backwards.

My guess is that O(1) is some weird cargo-culted remnant from some early work done on hash tables as a generalization of arrays, likely from when hash functions were just modulus operations on integer keys or some other simple, fixed-length hashing method that was easy to ignore (like the universal hash in Cormen's "Introduction to Algorithms"). But modern hash tables can't make these assumptions, and I, and quite a few other folks, think that this needs to be rethought.

Some examples (I don't agree with all of these, but I think it makes the point):

https://stackoverflow.com/questions/2771368/can-hash-tables-...

http://lemire.me/blog/archives/2009/08/18/do-hash-tables-wor...

> Sound like an analytic club that's rarely accurate to real-world performance? Yup, that's big-O

If your complexity analysis isn't measurable by real-world performance, it's likely that you aren't analyzing the correct parameters.


> And this results in meaningless complexity analysis which gives you things like "Insert is O(1) on Average but O(n) Worst Case or Θ(n)".

I dunno, that just sounds like a time complexity to me. Quick, what's the time complexity of quicksort?

> This analysis is using the hash table length as the parameter under consideration, but that's silly

I think you're just talking about something else, then? Sure, the analysis generally assumes a finite key size, and looks at performance as the table size increases. That's just pragmatism; people generally have a bounded key size and an unbounded amount of data.

If your complaint is that treating the key length as finite results in a complexity of O(1), then... that's the point. Treating the key length as finite results in a complexity of O(1).

Table size isn't much of a determinant. It isn't any of a determinant, on average. Only the key length matters. That conclusion is the whole point of this analysis.

> it's reasonable to say O(mn)

I'm confused if this is what you meant to write-- this is not an accurate complexity for a hash table insert because, as you have pointed out, a hash table insert doesn't depend on the table size. There should be no factor of m. Edit: Er, technically O(n) is in O(mn)? Is that your point? But O(mn) doesn't simplify to O(n) unless m is constant, which I don't think you're saying.

With respect to key size n and table size m, the average complexity should be O(n). If we let key size be finite relative to the size of the table, that gives us O(1). But if you don't like that, you can let key size grow to infinity and you're right back to O(n).

None of this is going to tell you if it's the right data structure to use, though.

> If your complexity analysis isn't measurable by real-world performance, it's likely that you aren't analyzing the correct parameters.

No, in this case you are looking at the correct parameters but using the wrong model. At least, I think; it's still not clear what you're trying to get done here where big-O is letting you down.


On a hash with a million elements, the O(n) hashing of a 10-char key is negligible.

Also, you have to make apples to apples comparisons. In this case, you're comparing the time to search against the number of elements, and that's it. If you want the time to hash AND search - as a function of n - then your analysis holds, but you can't then compare that against other datastructures that do not have an equivalent hash step.


> computing the hash of a string key is typically an O(n) operation.

When strings are immutable, the hash values can be cached so the computation is amortized (this is how it works in Java)
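Roughly what that looks like, as a JS sketch in the spirit of Java's String.hashCode caching (names made up): compute the hash lazily on first use and reuse it afterwards, so repeated lookups with the same key object pay the O(k) cost only once.

    // Sketch: wrap an immutable string and memoize its hash,
    // roughly how java.lang.String caches hashCode().
    function HashedKey(str) {
      this.str = str;
      this.hash = null;   // not computed yet
    }

    HashedKey.prototype.hashCode = function () {
      if (this.hash === null) {
        var h = 0;
        for (var i = 0; i < this.str.length; i++) {
          h = (h * 31 + this.str.charCodeAt(i)) | 0;   // Java-style 31*h + c, kept to 32 bits
        }
        this.hash = h;
      }
      return this.hash;   // O(k) the first time, O(1) afterwards
    };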


Additionally, it's provable that in the context of a physical computer, no data structure actually has O(1) read / write / anything performance over an arbitrary amount of data.

Consider that the universe enforces both a maximum information density per cubic meter, and a maximum speed at accessing a given physical location.


> computing the hash of a string key is typically an O(n) operation.

The n in operation time on data structures usually refers to the number of members in the data structure. String length would be a different variable.

For example worst case for finding a string member in a linked list would be O(n * k) where n is the elements in the linked list and k is the string length. I.e. the worst case here assumes lots of members with shared prefixes.

So the O(1) in hash tables actually is O(1 * k).

This distinction often is important because the length of strings and the number of members are independent variables.

And for many use-cases (e.g. member names in classes or text-protocol keywords) k is essentially fixed, e.g. a maximum member name length of 255 characters or a similar limitation. And for big O notation that means O(1 * 1), since it's a constant.


> Linked lists don't suffer from this problem, but at the cost of 1-2 pointers per item.

And decreased memory locality (more cache misses) and increased number of allocations.


Worst case performance of a Map is O(n).


This is a perfect example of the kind of humor that belongs on HN. Actually had me nodding along in parts, then screwing up my face at others. By the time I was sure it was satire, I was committed enough to see it through to the end.

Best of all, I expect sincere discussion of the merits of the argument in this thread. Well executed, and heh heh.


Is this blog post a dig on Rich Hickey's talk 'Simple Made Easy'?


Just speaking for myself, it seemed like a pretty good-natured mix of thoughtful musing on the nature of universal computation, and gentle mockery of the idea that universal computation means there's one right answer for anything. I'd be surprised if it's meant to skewer any one viewpoint in particular.


This is an acceptable description.


Like a well trained Pavlov dog[1], reading the title brought "Lua table"[2] to my mind, the most flexible data structure I have worked with, by far.

[1] https://en.wikipedia.org/wiki/Classical_conditioning

[2] http://www.lua.org/pil/2.5.html


> reading the title brought "Lua table"[2] to my mind, the most flexible data structure I have worked with, by far.

Which is not necessarily a good thing. PHP's array and JS's Object are essentially the same thing.


Lua tables accept arbitrary objects as keys, and make a difference between `foo[1]` and `foo["1"]`.

Plus all the metatables goodies: weak key and/or value references, prototype inheritance, ...


They even mark the difference to the degree that foo[1] is stored in a special "array part" of the table, while foo["1"] ends up in the hash part. Predictably, the array part uses less memory.


> Lua tables accept arbitrary objects as keys, and make a difference between `foo[1]` and `foo["1"]`.

That sounds... completely normal? Python dictionaries will do that too. So will Java HashMaps.


That comment was responding to a comment about PHP and JS, both of which do not make such a distinction.


There is another subtlety: values associated with integer keys are stored in an array. dicts and HashMaps don't do this.


And not even all integer keys! Lua will efficiently handle sparse arrays using tables.


Having read over the entire thing, I have only one issue with it: The Universal Hash structure is only universal if the language implementing it permits you to create cyclic structures, so you can manipulate the "pointers" or relevant language concept to create the cyclic structures. There are a handful of languages that don't permit that, such as Erlang, and some languages like Haskell that permit you to use "tying the knot" to create cyclic structures [1], but ones that can sometimes be difficult to manipulate after the fact.

In those languages, you'll probably need to use a data structure well known for its simplicity, efficiency, and broad cross-language-platform compatibility: The Resource Description Framework's graph model. Graphs are also a well-known candidate for "universal data structure" [2]. Also, it's semantic, which is a clear advantage over the entirely not semantic UniversalHash, because semantic is better than not semantic when it comes to universal semantics of semanticness.

Semantic.

Otherwise, the article is pretty solid and all serious developers should definitely consider implementing it forthwith, as they will find it a very, very educational experience. I know I've seen people implement this model in Java before, and I can certainly vouch for the fact that it was an education for all involved.

I'm going to say "semantic" one more time, because it's been my observation that the more you say it, the smarter you sound, so: semantic. Oh, yeah, that's the stuff. Mmmmmmmmm.

[1]: https://wiki.haskell.org/Tying_the_Knot

[2]: Seriously, if you really do want to discuss the "universal data structure", perhaps because you want to analyze data structures very generically for some mathematical reason, graphs really are a good candidate. Not necessarily RDF graphs, which are both bizarrely complicated while lacking simple features; the three blessed collection types are a weird combination of features.


Do you really need to create "true" (i.e. language level?) cyclic structures? Shouldn't you be able to simulate cyclic structures at the cost of requiring more space (and probably time) to compute the simulation?


That would get into the stuff at the bottom, when you simulate other data structures within your data structure. A non-cyclic data structure can simulate cyclicness with IDs on nodes and things that store links... it's generally how you do it in Haskell, in fact, since while saying "it can't" do true graphs is perhaps a smidge overstrong it is certainly not practical to try to modify knot-tied structures. (I've seen the question "How would I do a graph?" repeatedly on /r/haskell, and "use IDs in a map" is generally what comes back.) But you're putting a layer on top of your store.
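A tiny sketch of the IDs-in-a-map approach (names invented for the example): nodes live in a plain map keyed by ID, and edges are just arrays of IDs, so cycles are expressed without any language-level cyclic references.

    // Sketch: a cyclic graph simulated with IDs in a map; no language-level cycles needed.
    var graph = {
      a: { value: 1, out: ['b'] },
      b: { value: 2, out: ['c'] },
      c: { value: 3, out: ['a'] }   // the edge c -> a closes the cycle
    };

    // Follow edges by ID lookup instead of pointer chasing.
    function successors(g, id) {
      return g[id].out.map(function (next) { return g[next]; });
    }

    successors(graph, 'c')[0].value;   // 1: we're back at node 'a'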

(By the by... you know you can trust me, because... semantic. Semantic.)


"...We propose that many, and maybe even all, interesting organizations of information and behaviour might be built from a single primitive operation: n-way associative lookup. ..."

http://www.vpri.org/pdf/tr2011003_abmdb.pdf


While this may be a mockery, the title probably alludes to the Universal Design Pattern[1], which is not so easily dismissable an idea.

[1] http://steve-yegge.blogspot.com/2008/10/universal-design-pat...


This had me worried:

"God-given axioms, which is academic lingo for a truth we accept purely by our God-given logic, like: if something is not true then it is false, something cannot exist in an empty set, something logical is logical."

But then I saw this and started laughing:

"Remember when everyone used tables to lay out their HTML? Well, that proved to be a horrible way to do things, because tables are inherently inflexible. It’s a strictly geometric constraint. Now we all use divs and CSS, because we get a much-more flexible CSS engine to define our layout. Postgres and her SQL friends are all table based, just like the <table> of 1999. Don’t be 1999."


Why did the axioms bit have you worried? OP is pretty much exactly right, except for the part where we accept it by our "logic". Axioms are simply true, end of story; logic does not apply to axioms themselves, only how they can be used in relation to other axioms.


Axioms are simply something that you accept as true to build a model. There is no 'simply true, end of story' or 'proven by logic' to axioms.

Many axioms may be picked because they seem 'obviously true' (or, more likely, because they are useful), but that doesn't make their truth simple or make them the result of logic. (For an example, take a look at the existence of infinity).

Additionally, the 'axioms' he lists are all what I would generally consider tautologies. (Although you might argue that the first one is actually an axiom of bivalent logic systems).


Satire aside, I think a very short addition to the last line holds a lot of truth:

"A hash is simple. A hash is fast. A hash is all you need to start with".

I can think of plenty of good reasons to stop using a map/hash/associative array in code, but I can't think of very many good reasons not to start coding with associative arrays as your default data structure. If there's a performance/memory problem, fix it later. I've seen a lot more code suffer from premature optimization than I've seen suffer from using data structures that were a little too inefficient.


When in doubt, use brute force. -- Ken Thompson

Using hashes as a first choice data structure is not necessarily a bad idea.[1] Until profiling a working implementation demonstrates otherwise, other data structures may be premature optimization.

[1] Clearly an improvement over the Lisper's association lists.


It's not an improvement over small association lists. At least in Common Lisp hashes are pretty heavy and assoc lists will outperform them if there are less than ~100 elements.


Ehh, relational databases already proved that "sets" are the true universal data structures. Anything can be built upon them, including (hash)maps.


A relation is equivalent to a map from its key to its non-key attributes. (A hash-map is just an implementation detail in how a map/relation is implemented.)

So, really, that's not a different universal data structure.


Well, and sets can be built by maps, so they are clearly of the same constructive strength in that regard.


Any info about this, with examples?


Poe's law is strong in this one. It had me going.


I wish they had used the word "map" in place of "hash" throughout this entire article. The use of hashes is an implementation detail and wholly irrelevant.


> The use of hashes is an implementation detail and wholly irrelevant.

Not in this case. An ordered map implemented as a tree has the same interface but with very different operation complexities


My comment was meant tongue in cheek. Maps are _the_ universal data structure: they can be used to map any input to any output. The rest is just an implementation detail.

One might even call them "functions", but that ruins the joke.


I missed an opportunity here.


>"I hope you’re convinced. A hash is simple. A hash is fast. A hash is all you need."

Hash tables are cool but they are far from being the only thing you need.

The problem with this kind of subtle satire without a disclaimer at the end is that some people will fall for this and blindly follow it (collisions weren't mentioned even once, nor were cache misses).


I don't think the satire was particularly subtle!


Given the number of comments with "Please tell me this is a satire" in this thread, I would say it wasn't particularly explicit.


Then they obviously didn't read to the end:

"Unlike most academic work that has little to no practical implications, I think blindly following the stuff here will prove to be incalculably beneficial for you."


Jonathan Blow made the point recently that academic programming language writers always make the same mistake of trying to take an idea and fully radicalize it.

When you go from "Objects are super easy and useful in this language" to "Everything Is An Object" you basically doom yourself to using objects to implement a bunch of stuff that doesn't really make sense as objects and could be implemented much easier as another data structure.

Big-brained academics love the challenge of "ooh, can I make everything an object?" because they are always free to decrease the scope of their research a little to compensate for the implementation taking a long time. And the more phenomena you can contort into agreement with your thesis, the more scholarpoints you get.

Blow advocates "data-driven programming" which, as a rule of thumb, I translate in my head as "don't move anything around you don't have to."

For example, rather than just copying a giant array of JSON objects over the wire when you only need some image URLs, each with an array of timestamped strings, you write the code that serializes exactly that data. And if you do that a few times, write the tooling you need to make that kind of thing easy.
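Something like this, purely as a sketch (field names invented for the example):

    // Sketch: serialize only the fields the other side actually needs,
    // instead of shipping the whole object graph.
    function serializeForClient(posts) {
      return JSON.stringify(posts.map(function (post) {
        return {
          imageUrl: post.imageUrl,                      // just the URL...
          captions: post.captions.map(function (c) {    // ...and the timestamped strings
            return { at: c.timestamp, text: c.text };
          })
        };
      }));
    }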

The pitch is that it's not more work. And I'm kind of convinced. It just gets rid of so much pollution when you are debugging.

Your first cut of things is often a little weird: "do I need a generator generator here?" but typically you realize that a simpler solution works just as well in the refactor.

When you hack in a "wrong but easier to sketch out" solution into your code as the first try, it often just lives like that forever. Correct, confusing code often collapses into correct, simple code. Simple, functional-but-wrong code just seems less inclined to self improvement.

And I am continually surprised by how many problems, when simplified down as much as possible, are best described with the basics: functions, structs, arrays. You need fancy stuff sometimes for sure, but most of our human problems are trivial enough that these blunt tools suffice. I just often won't be able to see it until I've renamed all the variables three times.

What's interesting is I've been doing JavaScript programming this way, and Jonathan Blow is... shall I say... not a fan of JS. But I think the concepts translate pretty well! It's just instead of targeting raw memory, you target the DOM/js runtime which is actually a pretty powerful piece of metal if you have the patience to learn about the runtime and keep thinking about your data every step of the way.


It took me until:

> Hashes are always O(1) reads, inserts and writes.

To realise that this was a joke.


Good tongue-in-cheek. Technically, however, the hash was outdone by the Lisp CONS. CONSes could represent every conceivable data structure in 1959 and they were eaten raw by the CPUs of the time. But then, there were not so many people available for the following-blindly part.


That's one of the reasons I love awk (actually, gawk): this is the only data structure it has.


It is* also true for PHP - almost any data structure internally is just a linked hash map.

*or was, I am not sure about current state


Basic had that. It was called arrays - yay for A$()


This... this is just very, very dedicated satire, right?

Let's replace main memory with hash maps and then implement existing data structures on that.

Yes. This is satire.


Also note that the post is tagged as 'bad-theory'


"Unlike most academic work that has little to no practical implications, I think blindly following the stuff here will prove to be incalculably beneficial for you."


One commenter on the page said: "In retrospect this post is obviously satire."

I would agree - didn't realize it at first, but it does sound a bit too OTT on second thoughts...



... is entirely different? I'm not sure what you're trying to say, since all you did was post a link.


While this is satire, it brings to mind some bit of industry history. My first reaction was, "you want to define a type class called Associative because you're talking about an interface," and that got me thinking about OOP vs. Haskell's type classes (a superior approach) (...and then I realized that the OP was a satire.)

The major historical selling point of object-oriented programming (OOP) to the Forces of Evil-- not all business people are "Forces of Evil"; really, there are some great business people out there, and so I'm referring specifically to cost-cutting mini-Eichmanns who've attempted to commoditize, humiliate, and infantilize us with "user scrum stories" and a culture of mediocrity-- was that OOP (after diverging far from Alan Kay's original vision) would allow the Forces of Evil to replace high-cost experts with teams of mediocre programmers, and thereby ruin the balance of power. Culturally, it worked (the culture of mediocrity is well-established in the software industry); economically, it failed (because large teams of mediocrities are actually fucking expensive, because a "10x" engineer only costs about 1.5-2.5x an average one).

The sales pitch for OOP to the Forces of Evil was that OOP would make it possible to hire a couple of low-paid body-shop programmers too stupid to recognize the OP as either (a) satire or (b), missing the joke but still correct, just wrong. Smart wizards in the open-source world and at companies like Google would do the actual engineering that made efficient hash-maps possible, and CommodityScrumDrones would staple shit together using topologies thereof, without really understanding any of the technologies they were gluing together, and probably ignorant of why these things are sometimes called "hashes" in the first place.

The problem is that when CommodityScrumDrones grow up and become middle managers and get to start making technical choices, they often make bad ones. They reject PostgreSQL as too hard or too old and use some NoSQL JSON storage engine that was a "Show HN" project 17 days ago.

Even though the CommodityScrumProgrammer phenomenon has been a massive failure in economic terms-- the Forces of Evil have won on ego terms by utterly dominating their targets, but they've lost money-- it has been a cultural success that has inflicted mediocrity, anti-intellectualism, and subordinacy to "The Business" on the software industry. And now we have people calling important technical shots who have literally no idea why the OP is either satire or wrong.


Why was this guy down voted? Everything he said is correct, if a bit hyperbolic at points.

This is relevant: http://www.smashcompany.com/technology/object-oriented-progr...



