An OCR cliche: Into his/her anus (2009) (wraabe.wordpress.com)
343 points by userbinator on Nov 4, 2018 | 90 comments



The Wesleyan-Methodist Magazine – Page 433: "carried this child in his anus to Derry"

I guess so the child could smell the Derry air.


I want to believe that the secret purpose of Wesleyan-Methodist Magazine was to set up the reading machines of a future civilization for this pun.


Too fine a pun to let pass without remark.


A pun so funny that you didn't get downvoted or get a user/mod commenting about how this isn't reddit.

GG


This reminds me of an eBook of Neuromancer that I read; it was occasionally missing the letter f. For the most part I just added it back mentally without really thinking about it, but then sometimes I hit a passage like this: "He turned, pulled his jacket on, and licked the cobra to full extension." That one took a moment.


Probably the original text was using ligatures for fi and fl and they got lost in conversion.

https://en.wikipedia.org/wiki/Typographic_ligature#Stylistic...


Yup. I have to manually detect and correct for all the possible ligatures in all of Unicode in my text-to-speech pre-processor scripts. I hate them.


If you have a Unicode library available, you might try asking it to convert the text to NFKD or NFKC normalization form. This will take apart ligatures (the former will also take apart accented characters).
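
For example, a minimal sketch using Python's standard unicodedata module (the sample string is made up):

    import unicodedata

    text = "e\ufb03cient"  # "efficient" written with the ffi ligature U+FB03
    # NFKC expands compatibility characters such as ligatures but keeps
    # accented characters composed; NFKD would decompose the accents too.
    print(unicodedata.normalize("NFKC", text))  # efficient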


"this gives us efficient space-time trade-offs" :-(


Those are HTML entities. Most modern programming languages come with tools to decode this, e.g. in Python:

    import urllib.parse

    text = urllib.parse.unquote(text)


urllib.parse.unquote() is unrelated to HTML. It undoes URL-encoding:

https://docs.python.org/3/library/urllib.parse.html#urllib.p...

In Python ≥ 3.4, you can use html.unescape() to decode HTML entities:

https://docs.python.org/3/library/html.html#html.unescape
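
For example, a quick sketch:

    import html

    # html.unescape decodes named and numeric HTML entities
    print(html.unescape("fish &amp; chips &eacute;"))  # fish & chips é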


You are 100% correct. I mixed the two encodings up. Thanks.


I wonder if at some point your e-book went through macOS's Preview program.

At work I sometimes have to copy blocks of text from a PDF into another document. If I do it with Preview, I lose the fi and fl ligatures. It only happens with PDFs created in-house, so I guess it's some kind of stylistic thing that comes from the guy who lays out the PDFs.

I eventually learned to use Adobe's own Acrobat, instead, and it works fine.


Please send to bugreport.apple.com


I'd argue this is a feature and not a bug. When copy-pasting text from PDFs, I'd love to not have to deal with Unicode and ligatures. There's another comment upthread here where someone's complaining about having to deal with Unicode.

If Preview can do this automatically, please don't change that feature.


I think the GP commenter meant that the ligatures are converted lossily into an arbitrary substitute character (e.g. fl -> l), rather than that they’re taken apart losslessly.


For clarity, I was describing how in Preview fl -> NULL.

Preview for some reason just drops it entirely.

In Acrobat, fl -> f and l adjacent.


Not gonna lie, I still can't parse what this is supposed to be after a number of readings. What's the actual sentence? I'm so curious :P


> licked the cobra to full extension

should be

> flicked the cobra to full extension

The cobra is a weapon in the Neuromancer universe, something like an extendable knife/club.


Another example:

https://books.google.com/ngrams/graph?content=fuck&year_star...

If you don't know what's going on it looks like the word "fuck" was more common in the 17th century than today. But actually it's the word "suck" written with a long s ("ſuck"), which you can see is easily OCRed incorrectly.


After you guys fix this with Markov chains or whatever, I look forward to reading: the proctologist was thorough but found no sign of blockage in her arms.


That would be a clbuttic mistake, indeed.


Reminds me of "Don't kick a man when he's clown"; Google finds 2 PDFs with this, due to bad OCR:

https://www.google.com/search?q="Don't+kick+a+man+when+he's+...

Credit: https://twitter.com/ObeyComputer/status/1050131788830560258


At the family print shop my mother experienced some typing glitches: she would sometimes type "simulate" instead of "stimulate" and "muck" instead of "mark". This led to two disasters, which required us to stop the presses (the pressman was a great proofreader!):

In a political brochure: "...will introduce bills to simulate progress..." In a funeral booklet: "...he left his muck upon us all."

She asked me to go into the computer's master dictionary and patch it to disable the words 'simulate' and 'muck' so it would bring these mistakes to her attention.


I'm pretty sure that for many political brochures these words could be used interchangeably.


OCR-ed texts should really be proofread before being published. Pirates usually do this; it's funny that Google doesn't. Also, Markov chains can help by highlighting unusual word combinations. I doubt anal children occur often in correct texts.
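
A minimal sketch of the Markov-chain idea, assuming a bigram-count table built from proofread text (the counts here are invented):

    # invented bigram counts from a hypothetical proofread corpus
    counts = {("in", "his"): 88000, ("his", "arms"): 5021, ("his", "anus"): 3}

    def flag_unusual(words, counts, threshold=10):
        """Yield adjacent word pairs rare enough to deserve a human look."""
        for pair in zip(words, words[1:]):
            if counts.get(pair, 0) < threshold:
                yield pair

    print(list(flag_unusual(["in", "his", "anus"], counts)))
    # [('his', 'anus')]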


Proofing transcriptions is difficult. You can go a long way with computer models, but eventually you’ll get a feeling about a word not quite fitting and need to check back to the source text. I guess you could run a high quality / more iterations OCR at this point, but more likely you’re relying on human proofing, and it doesn’t scale if you’re relying on humans. There’s a reason the Standard Ebooks project I contribute to has <250 books compared to Gutenberg’s 50k+: quality takes effort.


> Pirates usually do this; it's funny that Google doesn't.

Pirates use cheap labor to solve the problem.

Google's approach (and Facebook and Twitter and...) is to see every problem as solvable through an algorithm.

If this approach worked, we wouldn't have so many errors in published OCR'ed documents. Or social media tearing the world apart, for that matter.


Surely pirates use OC-Ahrrrrrrr?


> Pirates use cheap labor to solve the problem.

Really? They usually do it themselves for free AFAIK.


That's pretty cheap.


Google scanned 25 million books, and would show the user the original scan anyway. The text was only used for the search index.


Google also provides epubs, which show the OCRed text.


Maybe the problem is said Markov chains? It's possible the phrase "in her anus" occurs more frequently on the internet than "in her arms".


In other words:

Just because 50% of the internet is porn, you shouldn't use a 50% porn corpus to train your Markov chains.


An important idea. Obviously you should use a corpus made from books (preferably of a relevant genre and/or time period), not from the whole Internet.


If you build that corpus from your OCRed text, and that text has a consistent misinterpretation, then the misinterpretation will be amplified as the correct answer.


And this would be a linguistic "prion" - a systemic defect that replicates itself, yet does not rise up to the level of complexity we require of a life-form.


Obviously you should build it from known-valid texts that have already been proofread.


Readers read books. There should be a way to report errors like that and get them corrected. Also a way to make simple edits in our own ebooks in the reader.


This should be the case on the web, too. If I put together a blog post and someone wants to correct my spelling, it should be easy for them to suggest a correction (and to see it when they read, without needing my approval).

Maybe you'd have to worry about graffiti or spam, though. A git PR model would be fine for low-traffic situations though, and maybe there's something similar that scales to higher traffic.


On many Russian websites (including some pirate online libraries where you can read/download books for free) one can find a message saying "if you find something misspelled, please select it and press Ctrl+Enter to report it". I have never used this facility, as I'm hardly literate enough in Russian and have never found anything that looks like a mistake on such sites (perhaps because others have already reported all the mistakes), but it feels odd that I've never seen anything like that on non-Russian websites.


I think there was once an encyclopedia-inspired website that worked on that model. One could even add new content, in addition to correcting typos and misinformation.


The wiki model (without tweaks) isn't appropriate for newspapers, blogs, discussion forums etc though. Maybe no system would be ideal for all of them.

I think what I want is something that allows for technical improvement while maintaining authorial intent and "ownership" (in a conceptual if not legal sense) without optimising for consensus-gathering.

It's also not everywhere. Maybe what I'm after is a browser extension or something...


I thought they were using captcha to perform that function.


What if the captchas are mostly shown to users trying to get into porn sites? Their minds would be primed for the wrong answer.


I just saw your reply to one of my older comments. I ordered part 00HN577 from eBay, and swapped the trackpads. Very easy, all that was required was a small screwdriver.


Oh, thank you. I'm going to try it!


That is why the new Captchas ask the user to identify traffic lights and storefronts, as opposed to fire hydrants and fence posts!


Captchas don't provide context.


Apparently, "feces" where "faces" should be is a thing as well.

"feces sticking out of large pipes, looking hungrily at the camera"

https://archive.org/stream/The-Colonel-Who-Would-Not-Repent/...


I like how if you read it out of context, the first part of that sentence seems perfectly fine in something like a text about sewage; and then the second part catches you by surprise.


I have a work injury, acquired from 30+ years of coding :)

As a youngster, I read a lot and at great speed, but after I started coding my reading speed dropped dramatically. Attention to detail while writing or reading code seems to have re-wired my brain for accuracy instead of speed ;)

What jumps out at me in the article is not the misinterpretation of arms, which results in funny but somewhat "working" language, but rather "[..] hitting lier feet against stones", where the 'h' is interpreted as 'li'. That brought me to a full stop.

Also makes me think of "kerning" vs. "keming" :)


> Also makes me think of "kerning" vs. "keming" :)

The game Path of Exile once had a line in the patch notes which simply said "Fixed keming." Made my day.


A designer at my former workplace had a full-zip hoodie. To one side of the zipper: "Ker", to the other: "ning".


Maven documentation porn.xml


I am reminded of the anecdote about one of the first mass printings of the Bible in London in the 1600s having a grave misprint: "Thou shalt commit adultery".

Proofreading is as important as ever.


And now I'm thinking of Rimmer's parents in Red Dwarf. Devout seventh day advent hoppists, due to a missing letter in "...and the greatest of these is hope."



The Kindle version of "A Game Of Thrones" (first book in the series) has "Dome" everywhere instead of "Dorne" (the name of the kingdom in the south). Apparently it was OCR'ed from the printed book.


How is that acceptable a month past its release? Is nobody correcting it?


Welcome to the 21st century, where quality, accuracy, and precision are sacrificed at the altar of "scale".


This is one of those cases where applying some kind of additional dictionary algorithms to "autocorrect" OCR (which may have been what actually happened) won't help, since the wrong reading is an actual dictionary word while the correct one is a name.


Funny

The arms/anus confusion should be fixable with a language model on top of the letter-predicting network.
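
A minimal sketch of that kind of rescoring, assuming the OCR engine can emit alternative readings with scores (all numbers here are invented):

    import math

    # invented OCR alternatives with recognition scores
    candidates = {"in her arms": 0.40, "in her anus": 0.45}

    # invented language-model probabilities for each phrase
    lm = {"in her arms": 1e-4, "in her anus": 1e-8}

    def rescore(candidates, lm, weight=1.0):
        # combine log-scores; the language model pulls the decision
        # toward readings that are plausible as language
        return max(candidates,
                   key=lambda s: math.log(candidates[s])
                                 + weight * math.log(lm[s]))

    print(rescore(candidates, lm))  # in her arms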


Without a very long-range model I don't think that would help. "in his/her anus" and "in his/her arms" can both be correct in the right circumstances; it takes quite a bit of surrounding context to tell which one is more likely. (While doing some research in Google Books I even found a couple that looked like OCR errors until I read beyond the search snippet.)


How does OCR software integrate letter and language models? Do they first make a best guess at the letters and then try to correct it with the language model? https://en.wikipedia.org/wiki/Optical_character_recognition#... gives me that impression, but I'm not sure.

Brains are said to have a lot of feedback from higher levels of sensory processing to lower. Maybe you don't need as good a language model if its evidence is integrated more tightly with the rest.
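
One common form of tighter integration is a beam search that applies the language model while decoding, rather than correcting afterwards. A toy sketch (the per-position letter probabilities and character-bigram scores are both invented):

    # per-position letter probabilities from a hypothetical OCR network
    letter_probs = [{"a": 0.5, "o": 0.5}, {"r": 0.6, "n": 0.4},
                    {"m": 0.7, "u": 0.3}, {"s": 1.0}]

    # invented character-bigram language-model scores
    bigram = {("a", "r"): 0.5, ("r", "m"): 0.6, ("m", "s"): 0.7,
              ("a", "n"): 0.2, ("n", "u"): 0.1, ("u", "s"): 0.4}

    def beam_decode(letter_probs, bigram, width=2):
        beams = [("", 1.0)]  # (prefix, score)
        for probs in letter_probs:
            extended = []
            for prefix, score in beams:
                for ch, p in probs.items():
                    lm = bigram.get((prefix[-1], ch), 0.05) if prefix else 1.0
                    extended.append((prefix + ch, score * p * lm))
            beams = sorted(extended, key=lambda b: -b[1])[:width]
        return beams[0][0]

    print(beam_decode(letter_probs, bigram))  # arms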


But that would completely ruin the article, if it were OCR'ed.


Only without human oversight. Human proofreading might not be economically feasible, but maybe detecting likely OCR errors and making humans decide whether it's an actual error or not would be feasible.


After I stopped laughing I went back and checked the Standard Ebooks corpus to see if any instances of this mistake had slipped through; luckily it seems that in the intervening 9 years someone at Gutenberg and/or archive.org has corrected this particular issue in the source transcriptions.


Gutenberg is designed to avoid this sort of thing, although some errors slip through: originally they didn't use OCR, and now they use Distributed Proofreaders.

Archive.org is kind of a mess, though.


Yeah, I usually submit about 10-15 corrections to Gutenberg per book I proof; generally they’re in good shape. The bigger problem with Gutenberg is that older transcriptions omit all accents, which is a huge problem for whole series of books. I’ve been trying to produce Maurice Leblanc’s series of Arsène Lupin stories for Standard Ebooks, and Gutenberg generally spells the titular protagonist’s name wrong.


You'd think that the OCR process would somehow call attention to words that have a high probability of being wrong, and especially of being wrong in a problematic way. You don't want to require humans to read and sign off on everything, but it shouldn't be that hard to build something that makes it very quick for a human to see the scanned image and compare it to the transcription, simply on the basis of the word "anus" being in there.
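
A sketch of that kind of triage step (the word list, confidences, and output format are all invented):

    # invented per-word OCR output: (word, recognition confidence)
    ocr_words = [("carried", 0.99), ("in", 0.98),
                 ("his", 0.97), ("anus", 0.61)]

    REVIEW_LIST = {"anus", "feces"}  # words that always deserve a second look

    def needs_human_review(word, conf, threshold=0.9):
        return word in REVIEW_LIST or conf < threshold

    for word, conf in ocr_words:
        if needs_human_review(word, conf):
            print("show the scan snippet to a human:", word)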


I was reading “Creative Selection” by Ken Kocienda last week. It goes behind the scenes of his designing the iPhone keyboard early in its development (good read).

In any case, he mentioned there is a hate-word dictionary, specifically so that autocorrect never suggests such words even if they seem to be a close match. You basically have to type those words perfectly.
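
A minimal sketch of that kind of suggestion filter (the word list and function names are invented; this is not Apple's actual implementation):

    NEVER_SUGGEST = {"badword"}  # stand-in for the hate-word dictionary

    def filter_suggestions(typed, suggestions):
        # a blocked word survives only if the user typed it exactly
        return [s for s in suggestions
                if s not in NEVER_SUGGEST or s == typed]

    print(filter_suggestions("badwore", ["badword", "badwore"]))  # ['badwore']
    print(filter_suggestions("badword", ["badword"]))  # ['badword']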

In another related bug, Xerox document centres, which weren’t even technically doing OCR, were changing numbers from one thing to another in scanned IMAGES due to high-compression settings substituting glyphs - much more dangerous! https://www.theregister.co.uk/2013/08/06/xerox_copier_flaw_m...


What is the article saying about 'pertistent' vs. 'persistent'? Is that a word? What is its meaning?


The book it’s from, Uncle Tom’s Cabin, is full of creative orthography designed to reproduce the English pronunciation of slaves in the pre-Civil War South.


It's not a word. Either it's a pun on 'pert' (impossible to tell without context) or it was a typo in the original (seems more likely).


> Either it's a pun on 'pert' (impossible to tell without context) or it was a typo in the original (seems more likely).

Haven't read the original, but it was probably meant to add colour to the way the character speaks. Either to make fun of the character for not being able to pronounce words correctly, or to make them more pitiable, or just as a matter-of-fact detail.


Likely because of the long s in historical texts - it's often misrecognized as an f or t


No, the book says "pertistent", and the transcriber mistakenly wrote "persistent". Neither of those is a recognition error. Rather, the error is that the transcriber did recognize what was meant, and ignored what was actually written.


Indeed.

https://books.google.com/books?id=7JfT9yq0zAQC&pg=PA78&lpg=P...

> If another copy from the same edition has the error corrected, such cues may help to identify early and late printings and contribute to a more comprehensive account of the book’s printing history.

In other words, when transcribing books you want to preserve misspellings that occur in the source text.

It’s actually quite interesting, because it means that automatic spellchecking of OCRed text, while helping to improve the quality of the transcript, could also introduce unwanted corrections. But doing what the OP did and comparing their transcripts with those of Google Books was clever.


Tangent: scribal errors are often classified as to whether they are committed by scribes who do understand the language they're copying or scribes who don't. (Some errors can be committed by either kind of scribe, but will still tend to lean one way or the other.) Copying "persistent" where the text has "pertistent" is a good example of a kind of error that only a scribe who understands the text will make. (Though this particular case might not even be considered a scribal error.)


I was confused by your reference to understanding the text, so I looked it up. I think a scribe who really understood the text would recognise the intentional misspelling. The transcription error reflects a lack of understanding.

But the error does fall into the deliberate rather than unwitting category described here: https://sites.ualberta.ca/~sreimer/ms-course/course/scbl-err...


What intentional misspelling? "Pertistent" in the 1879 printing is a printer's error, and it's very clear if you look at the passage that it can't be intentional, because the same character uses the word 5 times in quick succession.

The transcriber's error is unwitting; he specifically comments on the fact that he didn't want to make it.


Reminds me of how, when xkcd looked at which days of the month were most common, the 1st, 10th, 11th, 21st and 31st were more or less common than they should have been due to OCR errors: https://drhagen.com/blog/the-missing-11th-of-the-month/


This online version of Miles Davis' autobiography features a character called "dark Terry", i.e. Clark Terry.

http://yanko.lib.ru/books/bio/miles.htm



Given the examples in the piece involving children, I wonder if there is any danger of this resulting in a problem where a site gets accused of child pornography or gets blocked because the text sounds so wildly inappropriate.


I don't know, children do have anuses, and they are known for their curiosity. I'm sure many a parent has had to dig Lego bricks out of various orifices.


Children have a few things whose depiction or discussion would get a site blocked from most schools, even when stopping well short of pornographic depictions.


"You can lead a horse to water but you can't stop it from sticking Lego up its bum."

Probably my favourite quote from the Inbetweeners.



