Hacker News new | past | comments | ask | show | jobs | submit login
Inside Wade, a 1kb search library (kabir.ml)
88 points by kbr on July 19, 2017 | hide | past | favorite | 29 comments



It just so happens that I've been working on trying to build the fastest possible trie implementation in JavaScript to speed up `lz-string` and potentially `js-search`. The whole process I went through is too much to write down here, but you can basically read it here:

https://github.com/pieroxy/lz-string/pull/98

https://github.com/bvaughn/js-search/issues/42

The short version is: replace those single-char strings with the result of `charCodeAt()`, and then use `charCodeAt()` during search too. The reason is that integer key lookup is much faster than string key lookup:

https://stackoverflow.com/questions/10639488/faster-to-acces...

It's important that all of your keys are integers though, so the trick is to actually store/look up `charCodeAt()+1`, and then store the indices at key 0.

If you feel like taking a risk, you can make a slightly more complicated data structure where you replace the object with an array and do a linear search:

https://jsfiddle.net/vanderZwan/ccfxv4sg/1/

For most data this is the fastest, because the array will only be a handful of characters (except at the root). However, in certain special cases (specifically, if you have a lot of unique characters at the root) it becomes prohibitively slow.


You're right, there are still a ton of optimizations I can do. I'll check out all of these links, thanks for the great tips!


Using `charCodeAt()` for numeric keys is an interesting optimisation, I'm definitely going to try that out with Lunr.js.

I'm also interested in _why_ that is quicker though, aren't all object keys converted to strings anyway?


That is what the spec says, but one way to interpret that is to say that they only have to behave as if they were.

Hashing an integer or double can be much simpler than hashing a string, and like the other comment said: for small enough keys you might even get away with using an array.

edit: to give another example, the spec also says all number are doubles, and yet all JS modern engines use optimisations in places where it can guarantee a value is a 32-bit integer; in some cases adding `|0` (a forced truncation by bitmask-or zero) as a compiler hint can speed up code.

edit2: this is also part of why you want your objects to be as type-stable as possible (it also makes code less bug-prone that way because easier to reason about). This includes the shape of objects. See these slides from 2011:

https://www.slideshare.net/newmovie/know-yourengines-velocit...


Not if they can reasonably be an array. I believe that the integer keys of an object are usually stored in an array in V8 unless they grow really large, or something.


What's the memory consumption like? A node for each letter seems like it would result in substantial memory consumption. I was thinking an object pool might help here, but I'm not sure if its possible to use it given the nesting of the nodes

Also, if the library removes stop words, does that mean its limited in which languages it supports? I'm assuming only English stop words are filtered

Not being critical here, just genuinely curious and don't have time to look through the code ATM


Lunr.js [1] used to use a trie data structure and memory usage was a concern. The problem is that although the data structure compresses prefixes, it does nothing for suffixes. Depending on the corpus this can lead to a lot of duplication.

Switching to a trie-like structure that compresses prefixes and suffixes can lead to significant savings. Building the structure can be a bit more burdensome, so there is a trade off there. There is a paper describing the approach [2] and, if you're interested, my JavaScript implementation [3][4].

[1] https://lunrjs.com

[2] http://www.aclweb.org/anthology/J00-1002.pdf

[3] https://github.com/olivernn/lunr.js/blob/master/lib/token_se...

[4] https://github.com/olivernn/lunr.js/blob/master/lib/token_se...


Cheers! Thanks for sharing. EDIT: I like the site design too!


Yeah, the memory is the only downside of Wade. I'm still looking into ways to optimize the trie used while still keeping the size around 1-2kb.

The stop words that are being removed can be configured via `Wade.config.stopWords`, which is an array of stop words that Wade will remove.


I appreciate your efforts at keeping the library footprint small. It seems like less and less people are concerned with this nowadays

Good thinking on making them configurable!


> In the end, we will be left with a results array like:

  [1, 1]
Is that supposed to be [0, 1]? If not, how does that correlate to ["Hey", "Hello"]?


Actually, it is supposed to be [1, 1]. The actual value of the element corresponds to the score it has, and the index of it refers to the index in the data.

I'll add a note in the article.


I'm sorry, I'm still a bit confused. If it "refers to the index in the data", how is it referring to both "Hey" and "Hello"?

Running the following doesn't seem to work for me:

  var Wade = require('wade');
  var search = Wade(['Hey', 'Hello']);
  search('he'); // returns []


The article doesn't exactly simulate what Wade returns, but uses a simpler version. I added a note about it now.

    [1, 1]
The item at index 0 is 1. This means that the item in index 0 of the data ("Hey") has a score of 1. The item at index 1 is 1, meaning that the item at index 1 of the data ("Hello") has a score of 1.

Edit: Your example with Wade isn't working because it ignores the first item in the query ("he") as it is a stop word. Searching for "h" will return both.


Okay, I gotcha with the [1, 1].

I'm still curious about using it. The GitHub example [1] works as described, but I can't get a result from the 'he' example. Is it saying no match was found by returning an empty array?

EDIT: Okay, saw your edit. Might be nice to use an example that's not a stopword. ;) Nice work on the library and the blog post.

[1] https://github.com/KingPixil/wade#usage


Cool, glad you understood. The library returns an array of results, and will return an empty array if there are none. Try searching for "h" or "hey" for an example.


This is nice. How much work would it be to make this incremental, i.e. update the data structure whenever new data is added or data is changed?

PS: Am I the only one who thinks it is quite strange that "search", being one of the pillars of CS, has so few open-source libraries dedicated to it?


It wouldn't take much work to insert data actually, in fact it would be done pretty fast. Updating something will take more time, as it needs to update the whole structure.

I totally agree, search is an extremely interesting topic and I learned a lot writing Wade. I'd definitely recommend people to learn more about how this kind of stuff works.


Very interesting write-up - I use DefiantJS http://defiantjs.com regularly and it seems to work similarly but instead it pre-processes the data into XML and then uses (fast) in-browser xpath to do searched


Thanks. DefiantJS uses a different way of searching because it needs to be able to search JSON. Wade is focused on a simpler use case of searching within an array of strings.


Really nice explanation on how it works, I think a lot of people could benefit by understanding how a search engine works!


Yeah, it's really interesting stuff! I explored a variety of different algorithms when first writing Wade, including Knuth-Morris-Pratt and Boyer-Moore. In the end I ended up using a trie as it has great performance at the cost of a little preprocessing.


Hey Kabir! Congrats on making the frontpage :) I loved reading this post a couple days back.


Thanks! Glad you liked it.


it'll be really helpful if you explicitly list its supported language.


Well, for future reference, the library itself is for Javascript. The post is more focused on how it works, rather than what language it uses.


i was asking the language it supports, i.e. English, Spanish, Chinese, etc


My bad, it supports English, but the only feature dependent on this is the part that removes stop words.

You can easily configure the stop words that Wade removes by editing `Wade.config.stopWords`.


It will likely have trouble with languages that don't include spaces around words. This will likely work for most latin based languages though.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: