Hacker News new | past | comments | ask | show | jobs | submit login
Introduction to Tries (drmcawesome.com)
72 points by mriley on Aug 23, 2010 | hide | past | favorite | 14 comments



I've recently been working on directed acyclic word graphs. They are a small modification to tries and can be built with (I think) the same time complexity. For each node, count the distance from EOW when building, maintain a list of nodes with the same EOW distance and merge the ones that are equal strings. They use much less space by sharing suffixes as well as prefixes. http://en.wikipedia.org/wiki/Directed_acyclic_word_graph


for an excellent example of a trie implementation, the haskell bytestring trie datastructure is really nice to look at. Its docs can be found at http://hackage.haskell.org/package/bytestring-trie and the very readable sourcecode at http://hackage.haskell.org/packages/archive/bytestring-trie/...

edit: note that its not strictly a trie, but can be used as one, in general its more like it does a trie-like key value lookup.

edit: it is also worth noting that its pretty fast, counting reading data in from the file system and other overhead, loading 1.5 million key - value pairs of realworld production data into the trie, and then making my first query, in under 30 seconds. And thats without even doing the sort of preprocessing that can make it much much faster! (namely, lexigraphically sorting the strings I'm using as keys before doing the k-v insertions)


personally I prefer this intro: http://marknelson.us/1996/08/01/suffix-trees/

or this one which is slightly harder: http://linux.thai.net/~thep/datrie/datrie.html


"First, lookup time is O(1) in the size of the trie."

The article skims over the tricky part - the dependency on the size of the alphabet. Which you can no longer treat as insignificant in the days of unicode.


Good luck building a Unicode trie -- the branching factor would be too high, never mind lookup time. Instead, you'd make the trie of an encoding, probably UTF-8 (off the top of my head) that would enable you to keep the branching factor at 256, which is already rather large but doable (You can switch to Judy arrays if the wasted space bothers you.)

Does anyone know, is there a Unicode encoding that enables you to map arbitrary ranges (so I can, for example, use the greek alphabet only at 1 byte per character or less)? I suppose UTF-8 is already hard enough to decode.


Ternary search tries attempt to solve this problem (among others):

http://en.wikipedia.org/wiki/Ternary_search_tree

http://www.strchr.com/ternary_dags


Or you use a sparse data structure instead of an array in each node, for example a binary tree. This gives you ternary trees.


> Instead, you'd make the trie of an encoding, probably UTF-8

There are some new implications you need to account for if you do this, as you no longer have one node per letter, but a path per letter. E.g. subtrees will no longer correctly represent substrings, making autocompletion slightly trickier.


Well, the version of this I did in C for the shop I was at about 10 years ago had a "memcmp"-like interface. Just pass in a pointer to the start of the bytes (int, asciiz, unicode if you will) and the length.

I also dynamically allocated the key/pointer arrays within each node so that while it was sparsely populated, it was only big enough to hold the largest byte defined in that node (e.g. - 'A' = (char) 65, so byte positions 0 to 65 would be present, but not 66 to 255 until needed.

I was nice to see the impression of my coworkers when a bunch of qsort() / bsearch() code was replaced with this. We could afford the memory, and the speedup was fairly impressive.


This is interesting. I wonder if you could use the a similar algorithm for mapping sentences. That is, construct a ternary DAG and keep inserting words (rather than letters) into the trie.


Yes, of course. You can index tries with pretty much anything, though elements with constant-time comparison (individual bytes/chars, pointers, hashes, etc.) will have less overhead.


How is "trie" pronounced? Like "try"?


According to Wikipedia:

Following the etymology, the inventor, Edward Fredkin, pronounces it /ˈtriː/ "tree".However, it is pronounced /ˈtraɪ/ "try" by other authors.

http://en.wikipedia.org/wiki/Trie


Is it me or there's no RSS in the site?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: