Introduction to Tries

naz · on Aug 23, 2010

I've recently been working on directed acyclic word graphs. They are a small modification to tries and can be built with (I think) the same time complexity. For each node, count the distance from EOW when building, maintain a list of nodes with the same EOW distance and merge the ones that are equal strings. They use much less space by sharing suffixes as well as prefixes. http://en.wikipedia.org/wiki/Directed_acyclic_word_graph

carterschonwald · on Aug 23, 2010

for an excellent example of a trie implementation, the haskell bytestring trie datastructure is really nice to look at. Its docs can be found at http://hackage.haskell.org/package/bytestring-trie and the very readable sourcecode at http://hackage.haskell.org/packages/archive/bytestring-trie/...

edit: note that its not strictly a trie, but can be used as one, in general its more like it does a trie-like key value lookup.

edit: it is also worth noting that its pretty fast, counting reading data in from the file system and other overhead, loading 1.5 million key - value pairs of realworld production data into the trie, and then making my first query, in under 30 seconds. And thats without even doing the sort of preprocessing that can make it much much faster! (namely, lexigraphically sorting the strings I'm using as keys before doing the k-v insertions)

dustrider · on Aug 23, 2010

personally I prefer this intro: http://marknelson.us/1996/08/01/suffix-trees/

or this one which is slightly harder: http://linux.thai.net/~thep/datrie/datrie.html

praptak · on Aug 23, 2010

"First, lookup time is O(1) in the size of the trie."

The article skims over the tricky part - the dependency on the size of the alphabet. Which you can no longer treat as insignificant in the days of unicode.

Robin_Message · on Aug 23, 2010

Good luck building a Unicode trie -- the branching factor would be too high, never mind lookup time. Instead, you'd make the trie of an encoding, probably UTF-8 (off the top of my head) that would enable you to keep the branching factor at 256, which is already rather large but doable (You can switch to Judy arrays if the wasted space bothers you.)

Does anyone know, is there a Unicode encoding that enables you to map arbitrary ranges (so I can, for example, use the greek alphabet only at 1 byte per character or less)? I suppose UTF-8 is already hard enough to decode.

shadytrees · on Aug 23, 2010

Ternary search tries attempt to solve this problem (among others):

http://en.wikipedia.org/wiki/Ternary_search_tree

http://www.strchr.com/ternary_dags

jules · on Aug 23, 2010

Or you use a sparse data structure instead of an array in each node, for example a binary tree. This gives you ternary trees.

Deestan · on Aug 23, 2010

> Instead, you'd make the trie of an encoding, probably UTF-8

There are some new implications you need to account for if you do this, as you no longer have one node per letter, but a path per letter. E.g. subtrees will no longer correctly represent substrings, making autocompletion slightly trickier.

Roboprog · on Aug 23, 2010

Well, the version of this I did in C for the shop I was at about 10 years ago had a "memcmp"-like interface. Just pass in a pointer to the start of the bytes (int, asciiz, unicode if you will) and the length.

I also dynamically allocated the key/pointer arrays within each node so that while it was sparsely populated, it was only big enough to hold the largest byte defined in that node (e.g. - 'A' = (char) 65, so byte positions 0 to 65 would be present, but not 66 to 255 until needed.

I was nice to see the impression of my coworkers when a bunch of qsort() / bsearch() code was replaced with this. We could afford the memory, and the speedup was fairly impressive.

VolatileVoid · on Aug 23, 2010

This is interesting. I wonder if you could use the a similar algorithm for mapping sentences. That is, construct a ternary DAG and keep inserting words (rather than letters) into the trie.

silentbicycle · on Aug 23, 2010

Yes, of course. You can index tries with pretty much anything, though elements with constant-time comparison (individual bytes/chars, pointers, hashes, etc.) will have less overhead.

bruceboughton · on Aug 23, 2010

How is "trie" pronounced? Like "try"?

michael_dorfman · on Aug 23, 2010

According to Wikipedia:

Following the etymology, the inventor, Edward Fredkin, pronounces it /ˈtriː/ "tree".However, it is pronounced /ˈtraɪ/ "try" by other authors.

http://en.wikipedia.org/wiki/Trie

lt · on Aug 23, 2010

Is it me or there's no RSS in the site?