
The blessing of language is that word frequency follows a power law. The first couple hundred words cover much of the language anyway.



This is Zipf's law: https://en.wikipedia.org/wiki/Zipf%27s_law

The other side of the coin is that the long tail is long:

the frequency of any word in a corpus is inversely proportional to its rank in the frequency table. For large corpora, about 40% to 60% of the words are hapax legomena [words appearing exactly once], and another 10% to 15% are dis legomena [words appearing exactly twice]. Thus, in the Brown Corpus of American English, about half of the 50,000 words are hapax legomena within that corpus. https://en.wikipedia.org/wiki/Hapax_legomenon

So the chance of seeing a hapax in any given sentence is really high.
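
For anyone who wants to put a number on that, here is a rough sketch in Python. It counts hapaxes in a text and then estimates the chance that a sentence contains at least one. The tokenizer, the function names, the independence assumption, and the figure of roughly 1,000,000 running words for the Brown Corpus are all my own assumptions, not something from the quote above. Note that "half of the 50,000 words" refers to distinct words, so the share of running words that are hapaxes is much smaller.

```python
import re
from collections import Counter

def hapax_stats(text):
    """Return (hapax share of distinct words, hapax share of running words)."""
    tokens = re.findall(r"[a-z']+", text.lower())  # crude tokenizer (assumption)
    counts = Counter(tokens)
    hapaxes = [w for w, c in counts.items() if c == 1]
    return len(hapaxes) / len(counts), len(hapaxes) / len(tokens)

def p_sentence_has_hapax(token_share, sentence_len):
    """Chance that at least one of sentence_len words is a hapax,
    treating word positions as independent (a simplification)."""
    return 1 - (1 - token_share) ** sentence_len

# Toy text: 13 running words, 8 distinct words, 5 of them seen once.
type_share, token_share = hapax_stats(
    "the cat sat on the mat and the dog sat on the log"
)
print(type_share, token_share)

# Brown Corpus guess: ~25,000 hapaxes (from the quote) out of
# ~1,000,000 running words (my assumption), 20-word sentence.
brown_p = p_sentence_has_hapax(25_000 / 1_000_000, 20)
print(round(brown_p, 2))
```

Under those assumptions the estimate for a 20-word sentence comes out at roughly 40%, so hitting a once-in-the-corpus word is common, though the exact figure depends heavily on corpus size and sentence length.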

What's worse, the rarer words in a sentence are often the important content words. When I try to decipher sentences in some language I don't know much of, what I end up understanding is often something like "And then he XXXXX-ed the YYYYY just like that hahaha" (I understood 8/10 words! And I even know that word 4 is a past tense verb! But not at all the meaning).

(Not that you shouldn't study the most frequent first, that's still a good rule.)


> (I understood 8/10 words! And I even know that word 4 is a past tense verb! But not at all the meaning).

That's an important point. Knowing 80% of the words in a text doesn't mean you understand 80% of its meaning; that is, you probably wouldn't get high marks on a basic comprehension test.

I've seen some studies indicating that you need to know at least around 95% of the words in a text in order to understand "enough" of it. (I don't have the links right now but could look them up at home if you're interested.)
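
To get a feel for how far apart "the first couple hundred words" and 95% coverage are, here is a back-of-the-envelope sketch in Python. It assumes an idealized Zipf distribution (frequency proportional to 1/rank) over a 50,000-word vocabulary, borrowing that size from the Brown Corpus figure quoted earlier in the thread. Real corpora only follow this approximately, and the function names are made up for illustration.

```python
from itertools import accumulate

def zipf_coverage(vocab_size):
    """Cumulative share of running words covered by the top-k ranks,
    assuming frequency is proportional to 1/rank (idealized Zipf)."""
    weights = [1.0 / r for r in range(1, vocab_size + 1)]
    total = sum(weights)
    return [c / total for c in accumulate(weights)]

def words_needed(coverage, target):
    """Smallest number of top-ranked words whose coverage reaches target."""
    for rank, c in enumerate(coverage, start=1):
        if c >= target:
            return rank
    return len(coverage)

cov = zipf_coverage(50_000)  # vocabulary size is an assumption
print(f"top 200 words cover {cov[199]:.0%} of running words")
print(f"words needed for 95% coverage: {words_needed(cov, 0.95)}")
```

In this toy model the top 200 words cover about half of all running words, while 95% coverage takes tens of thousands of words. The exact numbers should not be taken literally, but the gap between the two is the point.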


Right, though I think the long tail is beyond the line between language and culture. There comes a point where additional words are not a matter of understanding utterances, but of following culture.

Hardly anyone in the English-speaking world would have had reason to say "Sochi" if not for the Olympics, but they knew English and had enough cultural context to work out that it is a place in Russia that hosted the 2014 Winter Olympics.

If you know enough of the language that you can ask "what's that?" at a non-disruptive rate in conversation (or look it up quickly in a dictionary or encyclopedia), I think that counts as good enough.



