
In other words:

Just because 50% of the internet is porn doesn't mean you should use a 50% porn corpus to train your Markov chains.




An important idea. Obviously you should use a corpus made from books (preferably of a relevant genre and/or time period), not from the whole Internet.


If you build that corpus from your own OCRed text, and the OCR makes a consistent misreading, then that misreading will be amplified as the correct answer.
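
To make the amplification concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the sample sentence, the assumed misreading of "rn" as "m", and the helper names bigram_counts and pick. It only shows that a model trained on consistently wrong text will prefer the wrong word.

```python
from collections import Counter

def bigram_counts(text):
    """Count adjacent word pairs, the statistics a first-order Markov chain is built from."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

def pick(prev, candidates, counts):
    """Return the candidate seen most often after `prev` in the training corpus."""
    return max(candidates, key=lambda w: counts[(prev, w)])

# Hypothetical proofread text.
clean = "the modern world needs a modern library and a modern press"
# The same text as a hypothetical OCR engine might emit it,
# misreading "rn" as "m" every single time.
ocr = clean.replace("modern", "modem")

candidates = ["modern", "modem"]
clean_choice = pick("a", candidates, bigram_counts(clean))
ocr_choice = pick("a", candidates, bigram_counts(ocr))
print(clean_choice, ocr_choice)
```

Trained on the proofread text, the model picks "modern" after "a"; trained on the OCR output, it picks "modem", because that is the only form it has ever seen in that position.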


And this would be a linguistic "prion": a systemic defect that replicates itself, yet does not rise to the level of complexity we require of a life-form.


Obviously you should build it from known-valid texts that have already been proofread.



