Google 5gram corpus has unreasonable 5grams (nlpers.blogspot.com)
25 points by numeromancer on June 16, 2010 | 9 comments



It's funny that the comments on this post have been spammed with the sort of duplicative trash he may be seeing in the corpus. I never imagined that spam comments could actually be relevant.


The trolls/dwarfs one is almost certainly due to the works of Terry Pratchett.


Most of the weird ones are explained in the comments; you just have to dig through some spam.

* The trolls/dwarfs one was part of a blurb for a Terry Pratchett novel, repeated across bookstores and reviews.

* "the poet wicked the woman" came from a list of plays, with "Wicked" (the one about the witch from Oz) in the middle.

* "The prince compiled the Mishna" is apparently something from Jewish lore.

I'd guess there are similar explanations for the others, like the one for "the matrix reloaded the matrix" which he calls out, just not as obvious.

One of the commenters points out that any string of 5 words is already an outlier, so you're naturally going to get noise like this mixed in.

If I google the first one, it seems to be some standard boilerplate text used in collecting data about chemical exposure. The second is a list of UK reality TV series: "Popstars: The Rivals, Shattered, The Farm". So they are widely used on the web, just not in normal speech.
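To put rough numbers on the "any string of 5 words is already an outlier" point, here is a back-of-the-envelope sketch. The vocabulary size and corpus size are hypothetical round figures picked for illustration, not the real statistics of the Google corpus.

```python
# Hypothetical round numbers, chosen only for illustration.
vocab_size = 100_000        # assumed number of distinct words
corpus_tokens = 10**12      # assumed corpus size in word tokens

# Every ordered choice of 5 words is a possible 5-gram.
possible_5grams = vocab_size ** 5

# A corpus of N tokens contains at most N - 4 five-word windows.
max_observed_5grams = corpus_tokens - 4

# Upper bound on the share of possible 5-grams the corpus could ever contain.
coverage = max_observed_5grams / possible_5grams

print(f"possible 5-grams:       {possible_5grams:.1e}")
print(f"5-gram windows at most: {max_observed_5grams:.1e}")
print(f"fraction observable:    {coverage:.1e}")
```

Under those assumptions the corpus can cover only a vanishing fraction of the possible 5-grams, so whichever ones clear a frequency cutoff are likely to include text that was copied many times (blurbs, lists, boilerplate) rather than typical speech.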


Dwarf Fortress could account for some part of that, too, even though trolls don't show up until later in the game. Actually, there are probably quite a few games and such where that sort of thing could happen. But I'm guessing that most of the weird data came from spammers, even if I think that some of those are real.

It's too bad they can't get n-grams only from published books. That would get rid of most of the spam, provided they filter out that book spammer who reprocesses Wikipedia into gibberish-filled books.


I am glad to find out there are some other issues with it. My main disappointment with it, however, remains its license. It would have been really cool if Google had released it under a CC or MIT license, but instead it's restricted to academic use only. Better to spend time with other corpora.


What other n-gram corpora are there? I've been looking, and all I can seem to find is some web spammer trying to sell me one.

I don't particularly care about the license, but I'd rather not have to build my own spider to crawl and generate one for me.



It would be better still if they also released the corpus over BitTorrent instead of requiring people to spend $100+ on the CDs. Let everyone play with it, not just full-time researchers.


Quite possibly the crawler collecting the corpus hit a crawler trap (intentional or unintentional) -- or perhaps web-based game output (which when visited by a crawler became a de facto trap) -- which multiplied the implausible phrases.
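A toy sketch of how that multiplication would show up in the counts, and how counting each distinct page only once damps it. The page text and the helper names (`five_grams`, `count_raw`, `count_deduplicated`) are made up for illustration; nothing here is claimed to be how the actual corpus was built.

```python
from collections import Counter

def five_grams(text):
    """Return all 5-word windows of a text as tuples of lowercased words."""
    words = text.lower().split()
    return [tuple(words[i:i + 5]) for i in range(len(words) - 4)]

def count_raw(pages):
    """Count 5-grams over every page, duplicates included."""
    counts = Counter()
    for page in pages:
        counts.update(five_grams(page))
    return counts

def count_deduplicated(pages):
    """Count 5-grams, skipping pages whose normalized text was already seen."""
    counts = Counter()
    seen = set()
    for page in pages:
        key = " ".join(page.lower().split())  # normalize case and whitespace
        if key in seen:
            continue
        seen.add(key)
        counts.update(five_grams(page))
    return counts

# Hypothetical crawl: a trap serves the same generated page under 50 URLs.
trap_page = "the troll killed the dwarf in the cave"
pages = [trap_page] * 50 + ["the cat sat on the mat today"]

phrase = ("the", "troll", "killed", "the", "dwarf")
raw = count_raw(pages)
dedup = count_deduplicated(pages)
print(raw[phrase], dedup[phrase])
```

This only catches exact duplicates after whitespace and case normalization. Game output or trap pages that vary slightly from one visit to the next would need fuzzier near-duplicate matching.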



