The reason the Trie doesn't improve things (and actually does worse) is probably that the strings are too short. At an average of 5-6 characters per string, each one is comparable in size to a single pointer.
A way to make the plain string search faster would be to use binary search. It is possible to do this on a sorted, space-separated string by jumping to an arbitrary position (which may be in the middle of a word) and backing up to the rightmost space preceding that position. It is easy to see that the algorithm is correct, and if the words have bounded length the runtime is still O(log n).
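For illustration, here's a minimal sketch of that idea in JavaScript, assuming the dictionary is one sorted, space-separated string (the function and variable names are mine, not from the post):

    // Binary search over a sorted, space-separated dictionary string,
    // e.g. "apple banana cherry". `lo` always sits at the start of a word.
    function hasWord(dict, word) {
        var lo = 0, hi = dict.length - 1;
        while (lo <= hi) {
            var mid = (lo + hi) >> 1;
            // Back up to the rightmost space preceding `mid` to find
            // the start of the word that `mid` landed inside.
            var start = dict.lastIndexOf(" ", mid) + 1;
            var end = dict.indexOf(" ", start);
            if (end === -1) end = dict.length;
            var candidate = dict.substring(start, end);
            if (candidate === word) return true;
            if (candidate < word) lo = end + 1;   // continue after this word
            else hi = start - 2;                  // continue before it
        }
        return false;
    }

Each probe discards everything on one side of the word it lands in, so with bounded word lengths the whole lookup stays O(log n).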
BTW, by using a fair number of succinct data structures and bit tricks it is possible to do way better than this: a succinct (binary) representation of the trie can use around 3 bits per node, and the labels can be compressed to 5-6 bits per label. However, I would not implement that in JS :)
"A way to make the plain string search faster would be to use binary search."
Isn't that premature optimization, though? As I noted in my post, lookups take less than a millisecond even in the worst case. Presumably someone could do that if they wanted even faster lookup performance, although I don't need it in this case.
I definitely agree that a binary representation would probably get us pretty close to an ideal form - although I'm (as you are) skeptical that it could be done in a performant way in JavaScript.
Yes, of course it is; I should have started the sentence with "if performance becomes a concern". Since in the post you say "if you were building a game that required multitudes of dictionary lookups you might want to go with a faster hash or Trie technique", I just wanted to point out that binary search would probably work fine as well, without the huge memory increase.
Ah yes - you're very right! I'm tempted to try fiddling around with a simple binary search across the string just to see what sort of performance characteristics it generates. I imagine that the result would be comparable to doing a Trie lookup (if a bit slower). Although, as mentioned before, we'll still have the file size overhead of the uncompressed string.
Are you just testing if words are members of a dictionary? If so, have you considered using a bloom filter?
There is a slight chance of false positives, depending on how you choose your parameters, but it's a very compact way to store a set of strings.
To make the initial parsing fast but retain easy lookup, I would store the bit vector in an ASCII string, 6 bits per character, adding 32 to the raw value to get the character code.
The Wikipedia article can guide you on the parameters needed to get an acceptable error rate and size.
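A rough sketch of what a lookup against that packed representation might look like in JavaScript (the bit layout and helper names here are my assumptions, not a spec):

    // Read bit `i` from a Bloom filter packed into an ASCII string,
    // 6 bits per character, each character code being rawValue + 32.
    function getBit(packed, i) {
        var raw = packed.charCodeAt(Math.floor(i / 6)) - 32; // undo the +32
        return (raw >> (i % 6)) & 1;
    }

    // Membership test: any unset bit means "definitely absent";
    // all bits set means "probably present" (false positives possible).
    function mightContain(packed, word, hashes) {
        var bits = packed.length * 6;
        for (var i = 0; i < hashes.length; i++) {
            if (!getBit(packed, hashes[i](word) % bits)) return false;
        }
        return true;
    }

Here `hashes` is assumed to be an array of string-hash functions, chosen per the parameters in the article.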
I looked into bloom filters after someone suggested them on my last post. I guess the possibility of false positives makes me a bit jumpy. I realize the error rate can be reduced with a larger data structure, but that prospect doesn't make me terribly excited (pushing the error rate as close to zero as possible would seem to require a data structure large enough that it no longer provides much benefit over gzip compression). Although, please feel free to correct me if I'm wrong - if this does provide a tangible benefit with virtually no errors then I'd be very open to using this technique.
The way you'd want to use bloom filters when you want zero chance of false positives is to filter out impossible options before falling back to slower methods that don't give false positives. If the slower method takes 10 ms, the bloom filter takes 1 ms, and you have a 1% false positive rate, you're sitting at about 1.1 ms in the average case for misses (words actually in the set still pay the full cost), instead of 10 ms.
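In code, the pattern looks something like this (a sketch; `mightContain` and `slowExactLookup` are placeholder names for the filter test and the exact fallback):

    // Two-stage lookup: the Bloom filter rejects most misses in ~1 ms;
    // only "probably present" answers pay for the ~10 ms exact check.
    function isWord(word) {
        if (!mightContain(packedFilter, word, hashes)) {
            return false;             // filter says no: guaranteed correct
        }
        return slowExactLookup(word); // filter says maybe: verify exactly
    }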
Rather than the fallback method being the list of all good words, how feasible would it be to use a bloom filter and then have the fallback method be an indexOf() check against a list of all known false positives?
Would there be some more enlightened way of computing such a list than just crunching through every combination of letters up to length n? That is to say, could one devise a bloom filter with hash functions such that given the configured filter, a list of all possible matching elements could be produced?
As far as I can see, finding all false positives for a bloom filter is effectively impossible, since a false positive is essentially just a hash collision.
Okay, so... it's definitely possible, just really expensive. Depending on the lengths of words you cover, you have to crunch through a few trillion combinations of letters, looking for the false positives manually, like computing a rainbow table, but without punctuation, mixed case, etc.
What I'm wondering, though, is if one could select a hashing function where the collisions were easily discoverable. For a "hash" function that's just m mod 8, the presence of a 1 bit means you had to have had an input that was 1, 9, 17, 25, 33, etc., up to the maximum word length. This may be a problem that's just as hard or harder than the rainbow table approach, but I'd be interested to explore its feasibility.
A Bloom filter with an error rate of 1/10^8 requires about 40 bits per item stored. That's about 5 bytes per element, which is comparable in magnitude to just storing the dictionary as a single string. It does also require quite a lot of independent hashes, but it's still going to be cheaper to do the lookups than a linear search through a string.
(On the other hand, it might not compare so favourably with "binary search until we've got it down to a smallish substring, then simple linear search".)
The number of bits you need in a bloom filter varies as a function of the number of items you have (in addition to the desired error rate). Assuming 100 items, a 1e-8 error rate requires only about 480 bytes (= ~5 bytes per item).
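Those numbers fall out of the standard sizing formula, m = -n ln(p) / (ln 2)^2 bits with k = (m / n) ln 2 hash functions; a quick check in JavaScript:

    // Optimal Bloom filter size for n items at false-positive rate p.
    function bloomSize(n, p) {
        var m = Math.ceil(-n * Math.log(p) / (Math.LN2 * Math.LN2));
        var k = Math.round((m / n) * Math.LN2);
        return { bits: m, bytes: Math.ceil(m / 8), hashes: k };
    }

    bloomSize(100, 1e-8); // { bits: 3835, bytes: 480, hashes: 27 }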
You also don't need to do a lot of independent hashes. You can just take a linear combination of the results of two hashes and get basically no performance degradation (see: http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf).
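A sketch of that trick (Kirsch-Mitzenmacher double hashing): derive all k bit positions from two base hashes as g_i(x) = h1(x) + i * h2(x). Here `h1` and `h2` stand in for any two decent 32-bit string hashes:

    // Derive k Bloom filter indices from two base hashes instead of
    // computing k independent hashes; the false-positive rate is
    // asymptotically unchanged (per the paper linked above).
    function bloomIndices(word, k, bits) {
        var a = h1(word), b = h2(word), indices = [];
        for (var i = 0; i < k; i++) {
            indices.push(((a + i * b) >>> 0) % bits);
        }
        return indices;
    }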
Also, you probably don't actually need a 1e-8 false positive rate, since you're going to fall back to a canonical (but slow) lookup table for positives in most use cases. Bloom filters are most useful when your real dictionary is slow and you're expecting a lot of 'no match' results.
> Also, you probably don't actually need a 1e-8 false positive rate
Absolutely not. I just wanted to demonstrate that even if you do want a very low false-positive rate from your filter you don't need a very large data structure, contrary to jeresig's fears. Bloom filters are very, very space-efficient.
Yeah, I think they just fascinate me, and I'm still looking for a good use for them in real life.
I haven't had a lot of cases where I was willing to tolerate errors, but I figured in a word game nobody would notice. :)
One other thing to look into if you're using eval() to parse your JSON data: a couple of years ago, the Flickr developer blog observed that building their data structures with string.split() was much faster than parsing the JSON equivalent with eval(). YMMV on newer browsers.
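A minimal illustration of the two approaches (`rawData` and `rawJson` are placeholder names; the Flickr post has the real measurements):

    // Ship the dictionary as one delimited string and split it:
    var words = rawData.split(" ");            // "aah aahed aahing ..."

    // versus eval()'ing the equivalent JSON array literal:
    // var words = eval("(" + rawJson + ")");  // '["aah","aahed",...]'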
> Presumably someone could do that if they wanted even faster lookup performance, although I don't need it in this case.
I actually wanted to do that. I've been following your dictionary posts because I'm doing something similar for an iPad PhoneGap app. I created a dictionary with 600k words & n-grams ( http://ktype.net/wiki/research:articles:progress_20110209 ) and want to make really smart predictions (not just basic spell-check/auto-correct). My raw dictionary is 6MB (too large for localStorage) and I load it from a string into a 1D JS array but it randomly crashes the iPad because of memory issues.
I was going to change my code to use a space-saving trie next week and lo and behold you posted your two posts and saved me a lot of ground work (so thanks for that!). But unlike you, I need to loop through large parts of the dictionary in order to make good predictions. E.g. 'mxm' should suggest 'maximum' and 'gd mrong' should suggest 'good morning'. While one could come up with really complex ways of minimizing the number of word lookups, it is far easier to just cycle through all dictionary words that begin with 'm' and filter for specific letters.
I'm still uncertain what I'll do. Maybe iOS Sqlite DB or iOS CoreData or JS optimized-trie. But once again, thanks for these posts. They've directly helped me.