The other side of the coin is that the long tail is long:
> the frequency of any word in a corpus is inversely proportional to its rank in the frequency table. For large corpora, about 40% to 60% of the words are hapax legomena [words appearing exactly once], and another 10% to 15% are dis legomena [words appearing exactly twice]. Thus, in the Brown Corpus of American English, about half of the 50,000 words are hapax legomena within that corpus.

https://en.wikipedia.org/wiki/Hapax_legomenon
So the chance of seeing a hapax in any given sentence is really high.
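If anyone wants to check those proportions on their own text, here is a minimal sketch. The function name `legomena_share`, the crude regex tokenizer, and the toy sentence are all my own assumptions, not anything from the Wikipedia article. Note that the shares are measured over distinct words, which is how I read the quoted figures, not over running words.

```python
import re
from collections import Counter


def legomena_share(text):
    """Return (hapax_share, dis_share) as fractions of the distinct words in text."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    if not counts:
        return 0.0, 0.0
    # by_freq[n] = number of distinct words that occur exactly n times.
    by_freq = Counter(counts.values())
    n_types = len(counts)
    return by_freq[1] / n_types, by_freq[2] / n_types


# Toy input, far too small to say anything about real corpora.
sample = "the cat sat on the mat and the dog sat on the log"
hapax, dis = legomena_share(sample)
print(f"hapax: {hapax:.1%} of distinct words, dis: {dis:.1%}")
```

On a real corpus the numbers will depend a lot on how you tokenize and whether you lemmatize, so treat this as a rough way to poke at the claim, not a measurement method.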
What's worse, the rarer words in a sentence are often the important content words. When I try to decipher sentences in some language I don't know much of, what I end up understanding is often something like "And then he XXXXX-ed the YYYYY just like that hahaha" (I understood 8/10 words! And I even know that word 4 is a past tense verb! But not at all the meaning).
(Not that you shouldn't study the most frequent words first; that's still a good rule.)
> (I understood 8/10 words! And I even know that word 4 is a past tense verb! But not at all the meaning).
That's an important point. Knowing 80% of the words in a text doesn't mean you understand 80% of its meaning; you probably wouldn't get high marks on a basic comprehension test.
I've seen some studies that indicate you need to know at least (around) 95% of the words in a text in order to understand "enough" of it. (I don't have the links right now but could look them up at home if you're interested.)
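To make the "8/10 words" number concrete, here is a rough sketch of how that kind of coverage figure is computed: the share of running words in a text that appear in a known-word list. The helper name `known_share`, the tokenizer, and the two filler words standing in for XXXXX and YYYYY are made up for illustration.

```python
import re


def known_share(text, known_words):
    """Fraction of the running words in text that appear in known_words."""
    # Same crude tokenizer assumption: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in known_words) / len(tokens)


# "defenestrated" and "chandelier" are invented stand-ins for the unknown words.
sentence = "And then he defenestrated the chandelier just like that hahaha"
known = {"and", "then", "he", "the", "just", "like", "that", "hahaha"}
share = known_share(sentence, known)
print(f"known words: {share:.0%}")
```

The number says nothing about which words are missing, which is exactly the problem: the two unknown tokens here carry nearly all of the meaning.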
Right, though I think the long tail is beyond the line between language and culture. There comes a point where additional words are not a matter of understanding utterances, but of following culture.
Effectively none of the English-speaking world would bother to say "Sochi" without the Olympics, but they knew the English language and had enough culture to understand from context that it is a place in Russia which hosted the 2014 Winter Olympics.
If you know enough of the language that you can ask "what's that?" at a non-disruptive rate in conversation (or look it up quickly in a dictionary or encyclopedia), I think that counts as good enough.