JavaScript Trie Performance Analysis (ejohn.org)
119 points by sant0sk1 on March 17, 2011 | 28 comments



The reason the Trie doesn't improve things (and actually does worse) is probably that the strings are too short. At an average of 5-6 characters per string, each one is comparable in size to a single pointer.

A way to make the plain string search faster would be to use binary search. It is possible to do it on a space-separated string by jumping to an arbitrary position (which could fall in the middle of a word) and backing up to the rightmost space preceding that position. It is easy to see that the algorithm is correct, and if the words have bounded length the runtime is still O(log n).
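A minimal sketch of that idea in JavaScript, assuming the dictionary is one sorted, space-separated string (names are illustrative):

    // Binary search over a sorted, space-separated dictionary string.
    // Jump to the middle, back up to the start of the word under that
    // position, compare, then recurse into the left or right half.
    function containsWord(dict, word) {
      var lo = 0, hi = dict.length;
      while (lo < hi) {
        var mid = (lo + hi) >> 1;
        var start = dict.lastIndexOf(" ", mid) + 1;   // rightmost space before mid
        var end = dict.indexOf(" ", start);
        if (end === -1) end = dict.length;
        var candidate = dict.slice(start, end);
        if (candidate === word) return true;
        if (candidate < word) lo = end + 1;
        else hi = start - 1;
      }
      return false;
    }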

BTW, with a fair amount of succinct data structures and bit tricks it is possible to do way better than this: a succinct (binary) representation of the trie can use around 3 bits per node, and the labels can be compressed to 5-6 bits each. However, I would not implement that in JS :)


"A way to make the plain string search faster would be to use binary search."

Isn't that premature optimization, though? As I noted in my post - lookups, even in the worst case, take less than a millisecond. Presumably someone could do that if they wanted even faster lookup performance, although I don't need it in this case.

I definitely agree that a binary representation would probably get us pretty close to an ideal form - although I'm (as you are) skeptical that it could be done in a performant way in JavaScript.


> Isn't that premature optimization, though?

Yes, of course it is; I should have started the sentence with "if performance becomes a concern". Since in the post you say "if you were building a game that required multitudes of dictionary lookups you might want to go with a faster hash or Trie technique" I just wanted to point out that binary search would probably work fine as well, without the huge memory increase.


Ah yes - you're very right! I'm tempted to try fiddling around with a simple binary search across the string just to see what sort of performance characteristics it generates. I imagine that the result would be comparable to doing a Trie lookup (if a bit slower). Although, as mentioned before, we'll still have the file size overhead of the uncompressed string.


Are you just testing if words are members of a dictionary? If so, have you considered using a bloom filter?

There is a slight chance of false positives, depending on how you choose your parameters, but it's a very compact way to store a set of strings.

To make the initial parsing fast but retain easy lookup, I would store the bit vector in an ASCII string, 6 bits per character, adding 32 to the raw value to get the character code.

The Wikipedia article can guide you on the parameters needed to get an acceptable error rate and size.
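Packing and reading the bits might look something like this (a rough sketch; function names are just illustrative):

    // Pack a bit array into an ASCII string, 6 bits per character,
    // adding 32 to each value so every character is printable.
    function packBits(bits) {
      var out = "";
      for (var i = 0; i < bits.length; i += 6) {
        var value = 0;
        for (var j = 0; j < 6; j++) {
          if (bits[i + j]) value |= 1 << j;
        }
        out += String.fromCharCode(value + 32);
      }
      return out;
    }

    // Read bit i back out of the packed string.
    function getBit(packed, i) {
      var value = packed.charCodeAt(Math.floor(i / 6)) - 32;
      return (value >> (i % 6)) & 1;
    }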


I looked into bloom filters after someone commented as such on my last post. I guess the possibility of false positives makes me a bit jumpy. I realize that it can be reduced with larger data structures but that prospect doesn't make me terribly excited (and having an as-close-to-zero-as-possible error rate would seem to create a significantly large data structure in the end - at least one that doesn't seem to provide much benefit over gzip compression). Although, please feel free to correct me if I'm wrong - if this does provide a tangible benefit with virtually no errors then I'd be very open to using this technique.


The way you'd want to use bloom filters when you want zero chance of false positives is to filter out impossible options before falling back to slower methods that don't give false positives. If the slower method takes 10 ms, and the bloom filter takes 1 ms, and you have a 1% false positive rate, you're sitting at about 1.1 ms in the average case, instead of 10 ms.
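In code the pattern is just this (a sketch; bloomMightContain and exactLookup are hypothetical stand-ins for whatever filter and slow exact check you use):

    // Bloom filter as a pre-filter: a "no" from the filter is definite,
    // a "maybe" falls through to the slow exact lookup.
    function isWord(word) {
      if (!bloomMightContain(word)) return false; // fast path, never a false negative
      return exactLookup(word);                   // slow path, only for possible hits
    }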


Rather than the fallback method being the list of all good words, how feasible would it be to use a bloom filter and then have the fallback method be an indexOf list of all known false positives?

Would there be some more enlightened way of computing such a list than just crunching through every combination of letters up to length n? That is to say, could one devise a bloom filter with hash functions such that given the configured filter, a list of all possible matching elements could be produced?


As far as I can see, finding all false positives for a bloom filter is effectively impossible, since a false positive is essentially just a hash collision.


Okay, so... it's definitely possible, just really expensive. Depending on the lengths of words you cover, you have to crunch through a few trillion combinations of letters, looking for the false positives manually, like computing a rainbow table, but without punctuation, mixed case, etc.

What I'm wondering, though, is if one could select a hashing function where the collisions were easily discoverable. For a "hash" function that's just m mod 8, the presence of a 1 bit means that you had to have had an input that was either 1, 9, 17, 25, 33, etc., up to the maximum word length. This may be a problem that's just as hard or harder than the rainbow table approach, but I'd be interested to explore its feasibility.
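For a toy modulus "hash" like that, the preimages of any set bit are trivial to enumerate (illustrative sketch):

    // All inputs in 0..maxValue that would have set this bit under h(m) = m % 8.
    function preimages(bit, maxValue) {
      var out = [];
      for (var m = bit; m <= maxValue; m += 8) out.push(m);
      return out;
    }
    // preimages(1, 40) -> [1, 9, 17, 25, 33]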


A Bloom filter with an error rate of 1/10^8 requires about 40 bits per item stored. That's about 5 bytes per element, which is comparable in magnitude to just storing the dictionary as a single string. It does also require quite a lot of independent hashes, but it's still going to be cheaper to do the lookups than a linear search through a string.

(On the other hand, it might not compare so favourably with "binary search until we've got it down to a smallish substring, then simple linear search".)


The number of bits you need in a bloom filter varies as a function of the number of items you have (in addition to the desired error rate). Assuming 100 items, a 1e-8 error rate only requires about 480 bytes (= ~5 bytes per item).
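That figure just falls out of the standard sizing formulas (a quick sketch, assuming the usual optimal-k Bloom filter math):

    // m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions
    function bloomParams(n, p) {
      var m = Math.ceil(-n * Math.log(p) / (Math.LN2 * Math.LN2));
      var k = Math.round((m / n) * Math.LN2);
      return { bits: m, bytes: Math.ceil(m / 8), hashes: k };
    }
    // bloomParams(100, 1e-8) -> { bits: 3835, bytes: 480, hashes: 27 }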

You also don't need to do a lot of independent hashes. You can just take a linear combination of the results of two hashes and get basically no performance degradation (see: http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf).
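Roughly like this (a sketch, with two throwaway string hashes standing in for real ones):

    // Two simple string hashes (illustrative only).
    function h1(s) {
      var h = 5381;
      for (var i = 0; i < s.length; i++) h = (h * 33 + s.charCodeAt(i)) >>> 0;
      return h;
    }
    function h2(s) {
      var h = 0;
      for (var i = 0; i < s.length; i++) h = (h * 31 + s.charCodeAt(i)) >>> 0;
      return h;
    }

    // Kirsch-Mitzenmacher: simulate k hashes as g_i(x) = h1(x) + i * h2(x) (mod m).
    function bitPositions(word, k, m) {
      var a = h1(word), b = h2(word), positions = [];
      for (var i = 0; i < k; i++) positions.push((a + i * b) % m);
      return positions;
    }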

Also, you probably don't actually need a 1e-8 false positive rate, since you're going to fall back to a canonical (but slow) lookup table for positives in most use cases. Bloom filters are most useful when your real dictionary is slow and you're expecting a lot of 'no match' results.


> Also, you probably don't actually need a 1e-8 false positive rate

Absolutely not. I just wanted to demonstrate that even if you do want a very low false-positive rate from your filter you don't need a very large data structure, contrary to jeresig's fears. Bloom filters are very, very space-efficient.


Yeah, I think they just fascinate me, and I'm still looking for a good use for them in real life.

I haven't had a lot of cases where I was willing to tolerate errors, but I figured in a word game nobody would notice. :)

One other thing to look into if you're using eval() to parse your JSON data: a couple of years ago, the Flickr developer blog observed that building their data structures with string.split() was much faster than parsing the JSON equivalent with eval(). YMMV on newer browsers.

http://code.flickr.com/blog/2009/03/18/building-fast-client-...
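The idea, illustrated with made-up data (not Flickr's actual code):

    // Same word list two ways: eval() on a JSON-style array literal vs.
    // shipping one delimited string and calling split() on it.
    var viaEval  = eval('(["apple","banana","cherry"])');
    var viaSplit = "apple banana cherry".split(" ");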


> Presumably someone could do that if they wanted even faster lookup performance, although I don't need it in this case.

I actually wanted to do that. I've been following your dictionary posts because I'm doing something similar for an iPad PhoneGap app. I created a dictionary with 600k words & n-grams ( http://ktype.net/wiki/research:articles:progress_20110209 ) and want to make really smart predictions (not just basic spell-check/auto-correct). My raw dictionary is 6MB (too large for localStorage) and I load it from a string into a 1D JS array but it randomly crashes the iPad because of memory issues.

I was going to change my code to use a space-saving trie next week and lo and behold you posted your two posts and saved me a lot of ground work (so thanks for that!). But unlike you, I need to loop through large parts of the dictionary in order to make good predictions. E.g. 'mxm' should suggest 'maximum' and 'gd mrong' should suggest 'good morning'. While one could come up with really complex ways of minimizing the number of word lookups, it is far easier to just cycle through all dictionary words that begin with 'm' and filter for specific letters.
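For that filtering step, one simple (purely illustrative) approach is to turn the abbreviation into a subsequence regex:

    // "mxm" -> /^m.*x.*m/ : the letters must appear in order within the word.
    function matchesAbbreviation(abbrev, candidates) {
      var pattern = new RegExp("^" + abbrev.split("").join(".*"));
      return candidates.filter(function (word) { return pattern.test(word); });
    }
    // matchesAbbreviation("mxm", ["maximum", "museum", "maxim"]) -> ["maximum", "maxim"]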

I'm still uncertain what I'll do. Maybe an iOS SQLite DB, iOS Core Data, or an optimized JS trie. But once again, thanks for these posts. They've directly helped me.


It sounds like you need to build on the strengths of the JS implementations you care about, and encode your data structure in a string or something. Have a jump table in the beginning with the first characters, and then likewise for other characters, with some other convention to denote leaves. You won't be able to add to the trie, and lookup won't be as fast, but it would still be O(log N) and parsing would be dead simple.
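A very rough sketch of that layout (the structure, offsets, and names are made up for illustration):

    // One big string of words grouped by first letter, plus a jump table
    // of [start, end) offsets so a lookup only scans a small segment.
    var data  = "ant ape axe bat bee bug cat cow";
    var index = { a: [0, 12], b: [12, 24], c: [24, 31] };

    function lookup(word) {
      var range = index[word.charAt(0)];
      if (!range) return false;
      var segment = data.slice(range[0], range[1]);
      // within the much smaller segment a linear (or binary) search is cheap
      return (" " + segment + " ").indexOf(" " + word + " ") !== -1;
    }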


Here's an example of decoding and searching a DAWG in real time in JavaScript.

It encodes a dictionary of 106,495 words (an old Scrabble dictionary) into 427,982 base64 characters.

http://www.davidst.com/jsanagram/


Note: This was an experiment to see if javascript could be fast enough to do this sort of thing. It was never finished. The count of anagrams should be updated and the crossword finder should be hooked up. Feel free to take it and hack on it if you like.


I wrote an implementation of a Ternary Search Trie[1] in Python and found that for insert/update/delete I could get performance within the same order of magnitude to one order of magnitude slower than Python's built-in dict.

Considering that Tries offer a great base to build a flexible string matching algorithm off of, I consider the prototype a success.

The author's approach to optimizing the JavaScript trie is pretty good, but there are still improvements that could be made. I would be curious to see if the ternary approach with interior node collapsing (like a Patricia trie) would yield the performance the author is looking for.

Also, as @ot noted, at 5-6 characters per string the words are too short for a Trie's strengths to really shine. Tries work better as the average string length increases.

[1] https://github.com/timtadh/tst
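For anyone unfamiliar with the structure, lookup in a ternary search trie is roughly this (a JS sketch, not the linked Python library):

    // Each node: { ch: character, lo: left child, eq: middle child,
    //              hi: right child, isWord: true if a word ends here }
    function tstContains(node, word, i) {
      i = i || 0;
      if (!node || !word) return false;
      var c = word.charAt(i);
      if (c < node.ch) return tstContains(node.lo, word, i);
      if (c > node.ch) return tstContains(node.hi, word, i);
      if (i === word.length - 1) return !!node.isWord;
      return tstContains(node.eq, word, i + 1);
    }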


Great follow-up, and I appreciate the deep analysis with graphs. I am curious how you measured the Mobile Safari load times. Is there a tool (Xcode?) to do this?


It was rather unglamorous - just loading the JavaScript string via Ajax and evaluating. I ran it on my iPhone 4 running iOS 4.3. Simple test is here: https://github.com/jeresig/trie-js/blob/master/load.html


Thanks, I assumed you would have some sophisticated tool that you use for benchmarks against jQuery Mobile, etc.


This gave me fond memories of my second year university project where we had to implement a rudimentary spell checker and I wrote a Trie in C.



(Saw that you also replied on my blog post, dumping my reply to you here, as well.)

IndexedDB may help a bit in this case. It'll cut down on memory usage (at the cost of additional reads from disk, which will likely increase battery usage on mobile devices) and *may* decrease search time. Although easily the biggest benefit of my current technique is that it's capable of easily using localStorage, and that has fantastic cross-browser support (in contrast with IndexedDB, which doesn't).


Any recommended articles on Javascript benchmarking, client and/or server?


I love the idea of sharing the common suffixes.


[deleted]


…because although lookup performance is maximized and the amount of data sent over the wire is minimized, the processor time required to parse the data structure and the memory required to store it were both well out-of-bounds for the target device.



