
He talks a bit about how to pick the right number of successive letters to use as hash keys, which is where you can get a handle on alphabet sizes. I would guess (maybe it actually says) that the Wikipedia dump in the benchmark was UTF-8 or ASCII and, either way, treated as an alphabet of 8-bit characters. The DNA case is kind of interesting: 2 bits per character at minimum, but more likely 3 or 4 in a typical genome record, which also carries ambiguity codes like N alongside A/C/G/T.
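To make the hash-key idea concrete, here's a minimal sketch that folds q successive DNA bases into one integer key. The 2-bit A/C/G/T encoding and q = 8 are illustrative assumptions, not what the benchmark used; with an alphabet of 4 and q = 8 the key fits in 16 bits, so a direct-addressed table of 2^16 entries covers every possible q-gram:

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch: fold q successive DNA bases (2 bits each) into one
     * integer hash key. The A/C/G/T -> 0..3 encoding is assumed for
     * illustration; real genome records with ambiguity codes (N etc.)
     * need more bits per character. */
    static uint32_t dna_qgram_key(const char *s, int q)
    {
        uint32_t key = 0;
        for (int i = 0; i < q; i++) {
            uint32_t code;
            switch (s[i]) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            case 'T': code = 3; break;
            default:  code = 0; break;   /* ambiguity codes folded to 0 here */
            }
            key = (key << 2) | code;     /* 2 bits per base */
        }
        return key;
    }

    int main(void)
    {
        printf("%u\n", (unsigned)dna_qgram_key("GATTACAG", 8));
        return 0;
    }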



An illustration from the book I cited above showing the importance of the alphabet size (y-axis) and the pattern length (x-axis):

http://i.imgur.com/KGOZW.jpg

In the experiment he used patterns of different lengths on the same text collection. As the graph shows, which algorithm performs best depends on the combination of alphabet size and pattern length.

He describes the text collection as a "text corpus taken from wikipedia text dump", so I'm guessing the alphabet size is around 90?
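Counting distinct byte values would settle the guess. A quick sketch (it reads whatever file you point it at, so multi-byte UTF-8 sequences are counted byte by byte, the same way a matcher treating the text as 8-bit characters would see them):

    #include <stdio.h>

    /* Sketch: estimate the effective alphabet size of a text by
     * counting distinct byte values. On a mostly-English Wikipedia
     * dump this should land near the ~90 printable characters guessed
     * above, plus whatever multi-byte UTF-8 sequences contribute. */
    static int alphabet_size(FILE *f)
    {
        int seen[256] = {0}, count = 0, c;
        while ((c = fgetc(f)) != EOF)
            if (!seen[c]) { seen[c] = 1; count++; }
        return count;
    }

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        FILE *f = fopen(argv[1], "rb");
        if (!f) return 1;
        printf("distinct byte values: %d\n", alphabet_size(f));
        fclose(f);
        return 0;
    }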

It's also probably not a good thing that all the strings he searches for are prefixes of the same pattern: their match statistics are correlated, so the timings at different lengths aren't independent data points.

References:

Shift-Or: http://www-igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION...

BNDM: http://www-igm.univ-mlv.fr/~lecroq/string/bndm.html#SECTION0...

BOM: http://www-igm.univ-mlv.fr/~lecroq/string/bom.html#SECTION00...
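For concreteness, a minimal Shift-Or sketch following the classic Baeza-Yates & Gonnet formulation (not the benchmark's code). It shows why bit-parallel methods care about pattern length: the entire search state lives in one machine word, so patterns are limited to the word size.

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    /* Shift-Or: bit-parallel exact matching for patterns up to 64
     * characters. mask[c] has bit i cleared iff pat[i] == c; the state
     * word tracks all partial matches at once. */
    static void shift_or(const char *pat, const char *txt)
    {
        size_t m = strlen(pat), n = strlen(txt);
        if (m == 0 || m > 64) return;

        uint64_t mask[256];
        for (int c = 0; c < 256; c++) mask[c] = ~0ULL;       /* all bits set */
        for (size_t i = 0; i < m; i++)
            mask[(unsigned char)pat[i]] &= ~(1ULL << i);     /* clear bit i */

        uint64_t state = ~0ULL;
        for (size_t j = 0; j < n; j++) {
            state = (state << 1) | mask[(unsigned char)txt[j]];
            if ((state & (1ULL << (m - 1))) == 0)            /* bit m-1 clear */
                printf("match ending at position %zu\n", j);
        }
    }

    int main(void)
    {
        shift_or("GCAGAGAG", "GCATCGCAGAGAGTATACAGTACG");
        return 0;
    }

The mask table is the only place the alphabet enters; with a 4-letter DNA alphabet it could shrink to 4 entries, which is part of why alphabet size shows up so clearly on the y-axis of that graph.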



