
But but...Why?

I would've just written:

    def find_bigrams(input_list):
        # Pair each word with the word before it; the first word is
        # paired with the '-EOL-' sentinel used as padding.
        ngrams = []
        last_word = '-EOL-'
        for word in input_list:
            ngrams.append((last_word, word))
            last_word = word
        return ngrams

I have trouble seeing the requirement to generalize to arbitrary n as important... If the data is big enough that you want n >= 4, it's probably also big enough that you'll end up writing this in another language anyway. And n is unlikely ever to be larger than 5.



You don't want to restructure your code to change the degree of n. n is a hyperparameter; you expect it to change as you figure out what gives you the best results. You also typically want 1-grams, 2-grams, 3-grams, and so on, not just a single degree, and it's silly to have to call a different function for each of those.
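
For what it's worth, here's a minimal sketch of a version parameterized by n (the function name, keyword arguments, and padding sentinel are just illustrative, extending the bigram snippet above):

    def find_ngrams(input_list, n=2, pad='-EOL-'):
        # Pad the front with n-1 sentinels, then slide a window of
        # length n over the tokens; n=2 reproduces find_bigrams above.
        padded = [pad] * (n - 1) + list(input_list)
        return [tuple(padded[i:i + n]) for i in range(len(input_list))]

    # find_ngrams(['a', 'b', 'c'], n=3)
    # -> [('-EOL-', '-EOL-', 'a'), ('-EOL-', 'a', 'b'), ('a', 'b', 'c')]
One call with a different n then covers unigrams, bigrams, trigrams, and so on.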

And n-grams of quite large degree are not uncommon in hardcore natural language processing or bioinformatics, both fields where Python (usually wrapping NumPy and SciPy) is heavily used.

For instance, Chinese text isn't tokenized into words (the characters run together with no delimiters), which means you usually end up doing something like taking character n-grams (of potentially large degree), doing a lot of lookups into a dictionary and a language model, and seeing whether you can get everything to "fit" so that all characters are accounted for and the resulting sentence makes sense.
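
To make the character case concrete, a rough illustration (not any particular segmenter's code): enumerate all character spans up to some maximum length, and those spans are what you'd then look up in the dictionary and score with the language model:

    def char_ngrams(text, max_n=4):
        # All contiguous character substrings of length 1..max_n.
        grams = []
        for n in range(1, max_n + 1):
            grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
        return grams

    # char_ngrams('我爱北京', max_n=2)
    # -> ['我', '爱', '北', '京', '我爱', '爱北', '北京']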


I didn't think of character n-grams; that's a case where, yeah, you do want larger n. Same with bioinformatics.

But as far as word n-grams go, I've been doing NLP research for over ten years, and you almost never want 4-grams or 5-grams, let alone n-grams of greater length. The data is simply too sparse to be useful. So it's really a matter of generating bigrams and generating trigrams, and I think it's reasonable to have separate functions for those.



