Here's a quick one:

1) tokenize each text into its own bag (set) of words.

2) Compute the Jaccard index[1] using the two sets.
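
A minimal sketch of this in Python (the regex tokenizer is just a stand-in; use whatever tokenization scheme you prefer):

    import re

    def tokenize(text):
        # Lowercase and split on non-letter characters; crude but serviceable.
        return set(re.findall(r"[a-z']+", text.lower()))

    def jaccard(a, b):
        # |A & B| / |A | B|: 1.0 = identical vocabularies, 0.0 = disjoint.
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    doc1 = tokenize("The quick brown fox jumps over the lazy dog.")
    doc2 = tokenize("The quick brown cat sleeps beside the lazy dog.")
    print(jaccard(doc1, doc2))  # 5 shared / 11 total -> ~0.45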

Here's another:

1) tokenize each text into a multiset (bag) of words, keeping track of token frequency

2) keeping the token frequencies, order each multiset into a list (using a consistent word order across both documents)

3) map the lists of words onto an n-dimensional space (where n is, say, the number of distinct words across the two documents) as vectors

4) compute the cosine similarity [2]
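
A sketch of this one, assuming raw token counts as the coordinates (TF-IDF weighting is a common refinement):

    import math
    import re
    from collections import Counter

    def cosine_similarity(text_a, text_b):
        tok = lambda t: re.findall(r"[a-z']+", t.lower())
        freq_a, freq_b = Counter(tok(text_a)), Counter(tok(text_b))
        # The combined vocabulary defines the n dimensions.
        vocab = set(freq_a) | set(freq_b)
        dot = sum(freq_a[w] * freq_b[w] for w in vocab)
        norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
        norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    print(cosine_similarity("the cat sat on the mat", "the cat lay on the rug"))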

Here's another:

1) tokenize the texts into two bags of words

2) compute the set difference going both ways.

3) check whether either difference contains discriminator tokens that rule the text out as being from that author

4) (optional) extend to 2-grams, 3-grams, ..., n-grams
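
A sketch of steps 1-3; the discriminator list here is invented for illustration (in practice you'd derive one from a reference corpus):

    import re

    # Hypothetical example: tokens this writer is known never to use.
    RULED_OUT_BY = {"utilize", "whilst", "colour"}

    def tokenize(text):
        return set(re.findall(r"[a-z']+", text.lower()))

    def check(candidate, reference):
        cand, ref = tokenize(candidate), tokenize(reference)
        only_in_cand = cand - ref  # the set difference, one way
        only_in_ref = ref - cand   # ...and the other
        # Discriminator tokens appearing only in the candidate are
        # evidence against the claimed author.
        return bool(only_in_cand & RULED_OUT_BY)

    print(check("We must utilize the apparatus", "We must use the machine"))  # True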

Here's another (a variant of the one above):

1) compute 1-grams, 2-grams, 3-grams, ..., n-grams from one of the texts

2) insert the n-grams into a set

3) compute the same for the second document and test for set membership

4) compute the number of total n-grams from your second document

5) compute (not-in-set / total-n-grams) * 100 to yield a "uniqueness" percentage

6) determine if the second document is "unique" enough
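
One way to wire those steps together (again, the tokenizer is a placeholder):

    import re

    def ngrams(tokens, nmax=3):
        # Every 1-gram through nmax-gram, as tuples.
        return {tuple(tokens[i:i + n])
                for n in range(1, nmax + 1)
                for i in range(len(tokens) - n + 1)}

    def uniqueness(first_text, second_text, nmax=3):
        tok = lambda t: re.findall(r"[a-z']+", t.lower())
        seen = ngrams(tok(first_text), nmax)        # steps 1-2
        candidate = ngrams(tok(second_text), nmax)  # step 3
        total = len(candidate)                      # step 4
        not_in_set = sum(1 for g in candidate if g not in seen)
        return not_in_set / total * 100 if total else 0.0  # step 5

    print(uniqueness("the quick brown fox jumps",
                     "the quick brown dog barks"))  # 50.0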

And another:

1) assume you have a sample corpus from a writer and want to know whether a new text belongs in that corpus

2) follow the method above, but run steps 1 and 2 over the entire reference corpus (see the sketch below)
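
Concretely (the corpus strings here are placeholders):

    import re

    def tok(t):
        return re.findall(r"[a-z']+", t.lower())

    def ngrams(tokens, nmax=3):
        return {tuple(tokens[i:i + n])
                for n in range(1, nmax + 1)
                for i in range(len(tokens) - n + 1)}

    # Hypothetical reference corpus: texts known to be by the writer.
    corpus = ["first known essay ...", "second known essay ..."]

    # Steps 1-2 run over the whole corpus: pool every document's
    # n-grams into one reference set.
    reference = set()
    for text in corpus:
        reference |= ngrams(tok(text))

    disputed = ngrams(tok("some disputed text ..."))
    unseen = sum(1 for g in disputed if g not in reference)
    print(unseen / len(disputed) * 100 if disputed else 0.0)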

And another:

1) produce an ontology of discriminator terms and categories unique to the writer

2) use a named entity recognition (NER) tool of some kind to find those terms in each document

3) use the set of found terms as an alternative to a bag of words for the Jaccard or Vector models above
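
A sketch using spaCy as one possible NER tool; the ontology and example sentences are invented for illustration:

    # pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")

    # Hypothetical discriminator ontology for the writer.
    WRITER_TERMS = {"yoknapatawpha county", "jefferson", "snopes"}

    def found_terms(text):
        # Keep only recognized entities that appear in the ontology.
        return {ent.text.lower() for ent in nlp(text).ents} & WRITER_TERMS

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    # The found-term sets stand in for the bags of words above.
    doc_a = "Snopes moved to Jefferson that spring."
    doc_b = "He left Jefferson for the county seat."
    print(jaccard(found_terms(doc_a), found_terms(doc_b)))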

You may need to play with stopword removal, tokenization schemes, and n-gram windows (for example, omitting 1-grams shifts the analysis toward phrase usage rather than vocabulary usage).
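
For instance, a tokenizer with an (abbreviated, made-up) stopword list and an n-gram window that starts at 2:

    import re

    STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}  # trim to taste

    def tokenize(text):
        tokens = re.findall(r"[a-z']+", text.lower())
        return [t for t in tokens if t not in STOPWORDS]

    def ngrams(tokens, nmin=2, nmax=3):
        # nmin=2 omits 1-grams, shifting the signal from vocabulary
        # choice toward phrase usage.
        return {tuple(tokens[i:i + n])
                for n in range(nmin, nmax + 1)
                for i in range(len(tokens) - n + 1)}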

1 - https://en.wikipedia.org/wiki/Jaccard_index

2 - https://en.wikipedia.org/wiki/Cosine_similarity

This is a Python implementation of the second option, but instead of word frequencies it uses word presence (a binary vector in the end):

https://github.com/atrilla/vsmpy
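
Not that library's code, but the binary-vector idea reduces to something like this: with 0/1 presence vectors, cosine similarity simplifies to |A & B| / sqrt(|A| * |B|).

    import math
    import re

    def words(text):
        return set(re.findall(r"[a-z']+", text.lower()))

    def binary_cosine(text_a, text_b):
        a, b = words(text_a), words(text_b)
        # Equivalent to the dot product of two 0/1 presence vectors
        # divided by the product of their norms.
        return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0

    print(binary_cosine("the quick brown fox", "the quick red fox"))  # 0.75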
