If you build that corpus from OCRed text that contains a consistent misinterpretation, then that misinterpretation will be amplified as the correct answer.
And this would be a linguistic "prion" - a systemic defect that replicates itself, yet does not rise to the level of complexity we require of a life-form.
Just because 50% of the internet is porn doesn't mean you should use a 50% porn corpus to train your Markov chains.
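
To make the amplification point concrete, here is a rough sketch, not anyone's actual pipeline. It assumes a naive frequency-based corrector that picks whichever candidate word it saw most often in the corpus. The corpus lines, the "modern" vs. "modem" misread, and the helper names `build_vocab` and `best_candidate` are all made up for illustration.

```python
from collections import Counter


def build_vocab(corpus_lines):
    """Count how often each lowercase word appears in the corpus."""
    counts = Counter()
    for line in corpus_lines:
        counts.update(line.lower().split())
    return counts


def best_candidate(candidates, vocab):
    """Pick the candidate seen most often in the corpus (naive corrector)."""
    return max(candidates, key=lambda word: vocab[word])


# Hypothetical OCR output: "modern" is consistently misread as "modem".
ocr_corpus = [
    "the modem world",
    "a modem approach",
    "modem history",
    "the modern era",  # the one line the OCR got right
]

vocab = build_vocab(ocr_corpus)

# The corrector "learns" the OCR error as the right answer.
choice = best_candidate(["modern", "modem"], vocab)
print(choice)  # modem
```

The corrector is doing exactly what it was told; the corpus is what's wrong. Once its output gets fed back into the next corpus, the misread only gets more entrenched.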