Hacker News

I used to have a job that involved parsing large textual datasets. It was fascinating to me how far you could reconstruct the history of a dataset just by looking at its encoding errors, and practically no dataset I saw came without them. Sometimes I could be certain of several specific import/export steps, each introducing a new layer of encoding errors on top of the previous one. Other times I could correlate timestamps and see when specific data entry bugs were introduced and when they were fixed.
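As a hypothetical illustration of how those layers stack (this isn't the commenter's data, just the classic UTF-8/Latin-1 round-trip): each faulty import/export step mangles the text in a mechanical, recognizable way, which is also what makes the damage reversible.

```python
# Each bad step decodes UTF-8 bytes as Latin-1, producing a new layer of mojibake.
original = "Müller"

step1 = original.encode("utf-8").decode("latin-1")  # "MÃ¼ller"  (one bad import)
step2 = step1.encode("utf-8").decode("latin-1")     # a second layer on top

# Because the damage is mechanical, it can be unwound in reverse order:
recovered = step2.encode("latin-1").decode("utf-8")  # back to step1
recovered = recovered.encode("latin-1").decode("utf-8")
print(recovered)  # Müller
```

Seeing one layer of "Ã¼" versus two layers of garbage is exactly the kind of evidence that lets you count how many bad conversion steps a dataset went through.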

Strictly speaking, once you lose the information about a string's encoding, you can't say anything definitive about it. But given some heuristics, some contextual knowledge (like how the author of the post guesses that "M<fc>ller" means "Müller"), and a large enough amount of data, you can pretty much always work back through and correct the errors. Well, as long as someone didn't replace all 8-bit characters with question marks, but that was very rare in my experience.
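A minimal sketch of that kind of heuristic guess (the fallback choice of Latin-1 is an assumption, not a general rule): the byte 0xFC is not valid UTF-8 on its own, but under Latin-1 it decodes to "ü", which is why "M<fc>ller" plausibly reads as "Müller".

```python
raw = b"M\xfcller"  # the bytes behind "M<fc>ller"

try:
    text = raw.decode("utf-8")
except UnicodeDecodeError:
    # Latin-1 maps every byte to a code point, so this decode always
    # succeeds -- whether it is *correct* is the heuristic, contextual call.
    text = raw.decode("latin-1")

print(text)  # Müller
```

Note that the Latin-1 fallback never raises an error, which is precisely why it needs corroborating context (names, language, timestamps) before you trust it.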



