It just so happens that I've been working on trying to build the fastest possible trie implementation in JavaScript to speed up `lz-string` and potentially `js-search`. The whole process I went through is too much to write down here, but you can basically read it here:

https://github.com/pieroxy/lz-string/pull/98

https://github.com/bvaughn/js-search/issues/42
The short version is: replace those single-char strings with the result of `charCodeAt()`, and then use `charCodeAt()` during search too. The reason is that integer key lookup is much faster than string key lookup:

https://stackoverflow.com/questions/10639488/faster-to-acces...
It's important that all of your keys are integers though, so the trick is to actually store/look up `charCodeAt()+1`, and then store the indices at key 0.
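A minimal sketch of that layout (function names are mine, not from the linked PR):

```javascript
// Trie children are keyed by charCodeAt(c) + 1, so every key
// stays a positive integer; key 0 is reserved for match indices.
function trieInsert(root, word, index) {
  let node = root;
  for (let i = 0; i < word.length; i++) {
    const key = word.charCodeAt(i) + 1; // +1 keeps key 0 free
    node = node[key] || (node[key] = {});
  }
  (node[0] || (node[0] = [])).push(index); // indices live at key 0
}

// Exact-match lookup: walk the same integer keys, then read key 0.
function trieLookup(root, word) {
  let node = root;
  for (let i = 0; i < word.length; i++) {
    node = node[word.charCodeAt(i) + 1];
    if (!node) return [];
  }
  return node[0] || [];
}

const root = {};
trieInsert(root, "hey", 0);
trieInsert(root, "hello", 1);
```

With this layout, `trieLookup(root, "hey")` returns `[0]`, while a prefix like `"he"` returns `[]` because no index was stored there.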
If you feel like taking a risk, you can make a slightly more complicated data structure where you replace the object with an array and do a linear search:

https://jsfiddle.net/vanderZwan/ccfxv4sg/1/
For most data this is the fastest, because the array will only be a handful of characters (except at the root). However, in certain special cases (specifically, if you have a lot of unique characters at the root) it becomes prohibitively slow.
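A rough sketch of that array variant, assuming a node layout of parallel `codes`/`kids` arrays (names are mine, not from the fiddle):

```javascript
// Each node holds its child char codes in a flat array and does
// a linear scan instead of an object/hash lookup. Fast when the
// arrays are short, slow at a root with many distinct characters.
function makeNode() {
  return { codes: [], kids: [], indices: [] };
}

function insert(root, word, index) {
  let node = root;
  for (let i = 0; i < word.length; i++) {
    const code = word.charCodeAt(i);
    let pos = node.codes.indexOf(code); // linear search
    if (pos === -1) {
      pos = node.codes.push(code) - 1;  // append new branch
      node.kids.push(makeNode());
    }
    node = node.kids[pos];
  }
  node.indices.push(index);
}

function lookup(root, word) {
  let node = root;
  for (let i = 0; i < word.length; i++) {
    const pos = node.codes.indexOf(word.charCodeAt(i));
    if (pos === -1) return [];
    node = node.kids[pos];
  }
  return node.indices;
}

const root = makeNode();
insert(root, "hey", 0);
insert(root, "hello", 1);
```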
That is what the spec says, but one way to interpret that is to say that they only have to behave as if they were.
Hashing an integer or double can be much simpler than hashing a string, and like the other comment said: for small enough keys you might even get away with using an array.
edit: to give another example, the spec also says all numbers are doubles, and yet all modern JS engines use optimisations in places where they can guarantee a value is a 32-bit integer; in some cases adding `|0` (a forced truncation via bitwise OR with zero) as a compiler hint can speed up code.
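For instance (a toy example of mine, not taken from any particular engine's docs):

```javascript
// `| 0` is a bitwise OR with zero: it truncates the value to a
// 32-bit integer. Engines can take it as a hint that the result
// never needs to be represented as a double.
function midpoint(a, b) {
  return ((a + b) / 2) | 0; // integer division: 3.5 becomes 3
}
```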
edit2: this is also part of why you want your objects to be as type-stable as possible (it also makes code less bug-prone, because it's easier to reason about). This includes the shape of objects. See these slides from 2011:
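Roughly what shape stability means in practice (my own minimal illustration, not from the slides):

```javascript
// Every Point created this way has the same shape: the same
// properties, added in the same order. The engine can give them
// all one hidden class and compile fast property access.
function Point(x, y) {
  this.x = x;
  this.y = y;
}

const p = new Point(1, 2);
const q = new Point(3, 4);

// Mutating the shape later, e.g. `q.z = 5;`, would give q a
// different hidden class than p and make call sites that see
// both objects polymorphic (and slower).
```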
Not if they can reasonably be an array. I believe that the integer keys of an object are usually stored in an array in V8 unless they grow really large, or something.
What's the memory consumption like? A node for each letter seems like it would result in substantial memory consumption. I was thinking an object pool might help here, but I'm not sure if it's possible to use one given the nesting of the nodes.
Also, if the library removes stop words, does that mean it's limited in which languages it supports? I'm assuming only English stop words are filtered.
Not being critical here, just genuinely curious and don't have time to look through the code ATM
Lunr.js [1] used to use a trie data structure and memory usage was a concern. The problem is that although the data structure compresses prefixes, it does nothing for suffixes. Depending on the corpus this can lead to a lot of duplication.
Switching to a trie-like structure that compresses prefixes and suffixes can lead to significant savings. Building the structure can be a bit more burdensome, so there is a trade off there. There is a paper describing the approach [2] and, if you're interested, my JavaScript implementation [3][4].
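To make the suffix-duplication point concrete, here's a quick node count on a plain trie (my own sketch): "walking" and "talking" share the suffix "alking", which a plain trie stores twice, while a structure that also compresses suffixes would store it once.

```javascript
// Build a plain (uncompressed) trie and count its nodes.
function countTrieNodes(words) {
  const root = {};
  let count = 1; // the root itself
  for (const word of words) {
    let node = root;
    for (const ch of word) {
      if (!node[ch]) {
        node[ch] = {};
        count++; // a fresh node for every unshared character
      }
      node = node[ch];
    }
  }
  return count;
}

// 15 nodes: only the root is shared; "alking" is stored twice.
countTrieNodes(["walking", "talking"]);
```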
Actually, it is supposed to be [1, 1]. The value of each element is its score, and its index corresponds to the index in the data.
The article doesn't exactly simulate what Wade returns, but uses a simpler version. I added a note about it now.
[1, 1]
The item at index 0 is 1. This means that the item in index 0 of the data ("Hey") has a score of 1. The item at index 1 is 1, meaning that the item at index 1 of the data ("Hello") has a score of 1.
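In other words, using the simplified result format from the article (the data and scores below are just the example values from this thread, not Wade's actual return value):

```javascript
// scores[i] is the score of data[i], so [1, 1] means both
// items matched with a score of 1.
const data = ["Hey", "Hello"];
const scores = [1, 1];

const results = scores
  .map((score, index) => ({ item: data[index], score }))
  .filter(r => r.score > 0); // keep only items that matched
// → [{ item: "Hey", score: 1 }, { item: "Hello", score: 1 }]
```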
Edit: Your example with Wade isn't working because it ignores the first item in the query ("he") as it is a stop word. Searching for "h" will return both.
I'm still curious about using it. The GitHub example [1] works as described, but I can't get a result from the 'he' example. Is it saying no match was found by returning an empty array?
EDIT: Okay, saw your edit. Might be nice to use an example that's not a stopword. ;) Nice work on the library and the blog post.
Cool, glad you understood. The library returns an array of results, and will return an empty array if there are none. Try searching for "h" or "hey" for an example.
Inserting data actually wouldn't take much work, and would be pretty fast. Updating something will take more time, as it needs to update the whole structure.
I totally agree, search is an extremely interesting topic and I learned a lot writing Wade. I'd definitely recommend that people learn more about how this kind of stuff works.
Very interesting write-up - I use DefiantJS http://defiantjs.com regularly and it seems to work similarly, but instead it pre-processes the data into XML and then uses (fast) in-browser XPath to do searches.
Thanks. DefiantJS uses a different way of searching because it needs to be able to search JSON. Wade is focused on a simpler use case of searching within an array of strings.
Yeah, it's really interesting stuff! I explored a variety of different algorithms when first writing Wade, including Knuth-Morris-Pratt and Boyer-Moore. In the end I ended up using a trie as it has great performance at the cost of a little preprocessing.