In other words: Just because 50% of the internet is porn, you shouldn't use a 5...

qwerty456127 · on Nov 4, 2018

An important idea. Obviously you should use a corpus made from books (preferably of a relevant genre and/or time period), not from the whole Internet.

robin_reala · on Nov 4, 2018

If you build that corpus from your OCRed text, and that text contains a consistent misinterpretation, then the misinterpretation will be amplified as the correct answer.

DenisM · on Nov 4, 2018

And this would be a linguistic "prion": a systemic defect that replicates itself, yet does not rise to the level of complexity we require of a life-form.

qwerty456127 · on Nov 4, 2018

Obviously you are to build it from known-valid texts that have already been proofread.
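
To make the disagreement above concrete, here is a minimal sketch of a frequency-based corrector, assuming a toy corpus and a hypothetical OCR engine that consistently misreads "the" as "tbe". The function names (`build_vocab`, `edits1`, `correct`) and the sample sentences are illustrative assumptions, not taken from any tool discussed in the thread, and only single-letter substitutions are considered for brevity.

```python
from collections import Counter


def build_vocab(texts):
    """Count word frequencies across a list of documents."""
    return Counter(word for text in texts for word in text.lower().split())


def edits1(word):
    """All strings reachable by substituting one letter (includes the word itself)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    return {word[:i] + c + word[i + 1:] for i in range(len(word)) for c in letters}


def correct(word, vocab):
    """Return the most frequent known candidate, or the word unchanged if none is known."""
    candidates = [w for w in edits1(word) | {word} if w in vocab]
    return max(candidates, key=lambda w: vocab[w]) if candidates else word


# Hypothetical corpus built from the OCR output itself: "tbe" outnumbers "the".
ocr_vocab = build_vocab(["tbe cat sat on tbe mat", "tbe dog and the cat"])

# Hypothetical corpus built from proofread text: only the valid spelling appears.
clean_vocab = build_vocab(["the cat sat on the mat", "the dog and the cat"])

print(correct("the", ocr_vocab))    # tbe  (the valid word is "corrected" to the misreading)
print(correct("tbe", clean_vocab))  # the  (the misreading is repaired)
```

With the OCR-derived vocabulary, the consistent misreading is the most frequent candidate, so it wins even against the correct spelling. With the proofread vocabulary, the same corrector repairs the error.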