Hmm, maybe I've been overcomplicating the problem in my mind. You've given me some good ideas.
Bigrams, as your own example shows, are too simple: in both examples, "car" will get related to "a", instead of "getting" and "driving".
Maybe I could run a dependency parser over all the sentences, build dependency bigrams (head/dependent pairs) from the result, and score each sentence by bigram frequency/inverse frequency and by sentence length (shorter sentences are better).
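
Here is a rough sketch of what I have in mind, just to make the idea concrete. Everything in it is my own assumption: the parses in `SENTENCES` are hand-written (head index, dependent index) arcs rather than real parser output, the function names are made up, and the scoring formula (average smoothed inverse document frequency of a sentence's dependency bigrams, damped by the log of its length) is only one possible reading of "frequency/inverse_freq plus length".

```python
import math
from collections import Counter

# Hypothetical, hand-written parses: (tokens, arcs), where each arc is
# (head_index, dependent_index). A real dependency parser would supply these.
SENTENCES = [
    (["I", "am", "getting", "a", "car"],
     [(2, 0), (2, 1), (2, 4), (4, 3)]),
    (["I", "am", "driving", "a", "car"],
     [(2, 0), (2, 1), (2, 4), (4, 3)]),
    (["She", "is", "driving", "a", "very", "old", "car", "today"],
     [(2, 0), (2, 1), (2, 6), (6, 3), (6, 5), (5, 4), (2, 7)]),
]


def dependency_bigrams(tokens, arcs):
    """Return (head, dependent) word pairs instead of adjacent-word pairs."""
    return [(tokens[h].lower(), tokens[d].lower()) for h, d in arcs]


def score_sentences(sentences):
    """Score each sentence; higher means rarer bigrams and a shorter sentence.

    Assumed formula: mean smoothed IDF of the sentence's dependency bigrams,
    divided by log(1 + sentence length) so that short sentences are favoured.
    """
    per_sentence = [dependency_bigrams(t, a) for t, a in sentences]

    # Document frequency: in how many sentences does each bigram occur?
    doc_freq = Counter()
    for bigrams in per_sentence:
        doc_freq.update(set(bigrams))

    n = len(sentences)
    scores = []
    for (tokens, _), bigrams in zip(sentences, per_sentence):
        if not bigrams:
            scores.append(0.0)
            continue
        idf = [math.log((1 + n) / (1 + doc_freq[b])) + 1 for b in bigrams]
        mean_idf = sum(idf) / len(idf)
        scores.append(mean_idf / math.log(1 + len(tokens)))
    return scores


if __name__ == "__main__":
    for (tokens, arcs), score in zip(SENTENCES, score_sentences(SENTENCES)):
        print(" ".join(tokens), dependency_bigrams(tokens, arcs), round(score, 3))
```

With arcs like these, "car" is paired with "getting" or "driving" (its head) rather than only with the adjacent "a", which is the relation that plain bigrams miss.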