That works if each phrase is one word. Many (most?) of the terms in the vocabula...

jodrellblank · on Jan 11, 2012

It would have to be altered; let's see:

Create a hashtable where the key is the first word and the value is (rest of sentence, link). In the case of collisions, the value is a list of (rest of sentence, link) pairs.

Checking the document would still go word by word: if the lookup finds nothing, fine; if it finds one result, compare the rest of the phrase with the following document words; if it finds a list, do that for each entry.
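A minimal sketch of that lookup in Python, assuming phrases and document are already split on whitespace and matching is exact (no case folding or punctuation handling). The names build_index and find_phrases are made up for illustration:

```python
from collections import defaultdict

def build_index(phrases):
    """Map first word -> list of (rest_of_phrase_words, link)."""
    index = defaultdict(list)
    for phrase, link in phrases.items():
        words = phrase.split()
        if not words:
            continue  # ignore empty phrases
        index[words[0]].append((words[1:], link))
    return index

def find_phrases(doc_words, index):
    """Return (position, phrase, link) for every phrase found in the document."""
    hits = []
    for i, word in enumerate(doc_words):
        # Only phrases starting with this word are considered at all.
        for rest, link in index.get(word, ()):
            # Compare the rest of the phrase with the following document words.
            if doc_words[i + 1 : i + 1 + len(rest)] == rest:
                hits.append((i, " ".join([word, *rest]), link))
    return hits
```

Single-word phrases fall out naturally: their rest is an empty list, which always matches the empty slice.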

Runtime varies a lot with the distribution of input phrases - if there are 50,000 beginning with "phospholipid" that's not good.

I don't know how to estimate the runtime well. Best case is no matches or only single-word phrases, and then as above. Worst case is every document word matches a long list of phrase endings but none of them complete. Even then, each search only covers a fraction of the phrases, in the next len(longest-phrase) chars of the document.

O(doc_words * num_of_phrases_by_prefix * avg_phrase_length)

I'm uncomfortable here, wishing I had more algorithm/data knowledge to draw on, and maybe I wouldn't have gone down this route had I paid attention to the multi-word phrases from the start. Not every problem needs a hashtable. Maybe phrase-trees...

pjscott · on Jan 11, 2012

In the case of collisions, you need to search for a bunch of possible rest-of-sentence phrases in the text. This is exactly the problem that your algorithm tries to solve! If you apply it recursively, you get, essentially, trie matching with words instead of characters.

http://en.wikipedia.org/wiki/Trie
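A rough sketch of what a trie over words could look like in Python, using nested dicts. This is only one way to do it, with the same assumptions as before (whitespace-split, exact matching); build_word_trie, match_trie and the _END marker are hypothetical names:

```python
_END = object()  # marker key: "a phrase ends at this node"

def build_word_trie(phrases):
    """Build a nested-dict trie keyed by words instead of characters."""
    root = {}
    for phrase, link in phrases.items():
        words = phrase.split()
        if not words:
            continue  # ignore empty phrases
        node = root
        for word in words:
            node = node.setdefault(word, {})
        node[_END] = link
    return root

def match_trie(doc_words, trie):
    """Return (start, end, link) for every phrase found; end is exclusive."""
    hits = []
    for start in range(len(doc_words)):
        node = trie
        pos = start
        # Walk down the trie for as long as the document keeps matching.
        while pos < len(doc_words) and doc_words[pos] in node:
            node = node[doc_words[pos]]
            pos += 1
            if _END in node:
                hits.append((start, pos, node[_END]))
    return hits
```

Phrases sharing a prefix share a path, so 50,000 phrases beginning "phospholipid" cost one dict lookup per following word rather than 50,000 comparisons.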

jodrellblank · on Jan 11, 2012

There is some bunch searching, but most of the document doesn't get searched, and the bits which get searched aren't searched for most of the keyphrases.

Instead of:

    for phrase in keyphrases:
        if phrase in document:
            pass

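This:

    for i, word in enumerate(doc_words):
        # only the keyphrases that start with this word
        for rest, link in keyphrases_starting.get(word, []):
            # compare against just the next few words of the document
            if doc_words[i + 1 : i + 1 + len(rest)] == rest:
                pass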

Instead of searching for 100,000 keyphrases in the whole document, this searches for 10 rest-of-sentences in the 5 words following a mention of 'phospholipid', 18 rest-of-sentences in the 3 words following 'proteomerase', and nothing in the text about the lab temperature. It never touches most of the keyphrases at all for the document.