This reminds me of an eBook of Neuromancer that I read; it was occasionally missing the letter f.
For the most part I just added it back mentally without really thinking about it, but then sometimes I hit a passage like this:
"He turned, pulled his jacket on, and licked the cobra to full extension."
That one took a moment.
Yup. I have to manually detect and correct all the possible Unicode ligatures in my text-to-speech pre-processor scripts. I hate them.
If you have a Unicode library available, you might try asking it to convert the text to NFKD or NFKC normalization form. Either will take apart ligatures (NFKD will also take apart accented characters).
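For example, in Python (a minimal sketch; the sample string is made up, and any language with a Unicode library should expose the same normalization forms):

    import unicodedata

    text = "he \ufb02icked the cobra"  # contains the "fl" ligature U+FB02

    # NFKC applies compatibility decomposition, then recomposes:
    # the single ligature codepoint becomes the two letters "f" + "l".
    print(unicodedata.normalize("NFKC", text))  # -> "he flicked the cobra"

    # NFKD additionally leaves accented characters decomposed into
    # base letter + combining mark, e.g. "é" -> "e" + U+0301.
    print([hex(ord(c)) for c in unicodedata.normalize("NFKD", "é")])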
I wonder if at some point your e-book went through macOS's Preview program.
At work I sometimes have to copy blocks of text from a PDF into another document. If I do it with Preview, I lose the fi and fl ligatures. It only happens with PDFs created in-house, so I guess it's some kind of stylistic thing that comes from the guy who lays out the PDFs.
I eventually learned to use Adobe's own Acrobat instead, and it works fine.
I'd argue this is a feature and not a bug. When copy-pasting text from PDFs, I'd love to not have to deal with Unicode and ligatures. There's another comment upthread here where someone's complaining about having to deal with Unicode.
If Preview can do this automatically, please don't change that feature.
I think the GP commenter meant that the ligatures are converted lossily into an arbitrary substitute character (e.g. fl -> l), rather than being taken apart losslessly.
If you don't know what's going on, it looks like the word "fuck" was more common in the 17th century than today. But actually it's the word "suck" written with a long s ("ſuck"), which you can see is easily OCRed incorrectly.
After you guys fix this with Markov chains or whatever, I look forward to reading: "the proctologist was thorough but found no sign of blockage in her arms."
At the family print shop my mother experienced some typing glitches: she would sometimes type "simulate" instead of "stimulate" and "muck" instead of "mark". This led to two disasters, which required us to stop the presses (the pressman was a great proofreader!):
In a political brochure: "...will introduce bills to simulate progress..."
In a funeral booklet: "...he left his muck upon us all."
She asked me to go into the computer's master dictionary and patch it to disable the words 'simulate' and 'muck' so it would bring these mistakes to her attention.
OCRed texts should really be proofread before being published. Pirates usually do this; it's funny that Google doesn't. Markov chains could also help by highlighting unusual word combinations: I doubt "anal children" occurs often in correct texts.
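Here's a minimal sketch of that kind of Markov-chain flagging in Python; the corpus filenames are hypothetical, and the reference text is assumed to be clean, already-proofread prose:

    import re
    from collections import Counter

    def tokenize(path):
        return re.findall(r"[a-z']+", open(path, encoding="utf-8").read().lower())

    def bigrams(words):
        return zip(words, words[1:])

    # Count bigrams in a trusted, proofread reference corpus.
    ref_counts = Counter(bigrams(tokenize("reference_corpus.txt")))

    # Flag bigrams in the OCR output that never occur in the reference:
    # these are candidates for human review ("anal children" would trip this).
    for pair in bigrams(tokenize("ocr_output.txt")):
        if ref_counts[pair] == 0:
            print("suspicious:", " ".join(pair))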
Proofing transcriptions is difficult. You can go a long way with computer models, but eventually you'll get a feeling about a word not quite fitting and need to check back against the source text. I guess you could run a higher-quality / more-iterations OCR pass at that point, but more likely you're relying on human proofing, and that doesn't scale. There's a reason the Standard Ebooks project I contribute to has <250 books compared to Gutenberg's 50k+: quality takes effort.
If you build that corpus from your OCRed text, and the text has a consistent misinterpretation, then that misinterpretation will be amplified as the correct answer.
And this would be a linguistic "prion": a systemic defect that replicates itself, yet does not rise to the level of complexity we require of a life-form.
Readers read books. There should be a way to signal errors like that and get them corrected, and also a way to make simple edits to our own ebooks in the reader.
This should be the case on the web, too. If I put together a blog post and someone wants to correct my spelling, it should be easy for them to suggest a correction (and to see it when they read, even before approval).
Maybe you'd have to worry about graffiti or spam, though. A git PR model would be fine for low-traffic situations, and maybe there's something similar that scales to higher traffic.
On many Russian websites (including some pirate online libraries where you can read or download books for free) one can find a message saying "if you find something misspelled, please select it and press Ctrl+Enter to report it". I've never used this facility, as I'm hardly literate enough in Russian and have never found anything that looks like a mistake on such sites (perhaps because others have already reported all the mistakes), but it feels odd that I've never seen anything like that on non-Russian websites.
I think there was once an encyclopedia-inspired website that worked on that model. One could even add new content, in addition to correcting typos and misinformation.
The wiki model (without tweaks) isn't appropriate for newspapers, blogs, discussion forums, etc., though. Maybe no system would be ideal for all of them.
I think what I want is something that allows for technical improvement while maintaining authorial intent and "ownership" (in a conceptual if not legal sense) without optimising for consensus-gathering.
It's also not everywhere. Maybe what I'm after is a browser extension or something...
I just saw your reply to one of my older comments. I ordered part 00HN577 from eBay, and swapped the trackpads. Very easy, all that was required was a small screwdriver.
I like how, if you read it out of context, the first part of that sentence seems perfectly fine in something like a text about sewage, and then the second part catches you by surprise.
I have a work injury, acquired from 30+ years of coding :)
As a youngster I read a lot and at great speed, but after I started coding my reading speed dropped dramatically. Attention to detail while writing or reading code seems to have re-wired my brain for accuracy instead of speed ;)
What jumps out at me in the article is not the misinterpretation of "arms", which results in funny but somewhat "working" language, but rather "[..] hitting lier feet against stones", where the 'h' is interpreted as 'li'. That brought me to a full stop.
I am reminded of the anecdote of one of the first mass printings of the Bible in London in the 1600s having a grave misprint: "Thou shalt commit adultery".
And now I'm thinking of Rimmer's parents in Red Dwarf: devout Seventh Day Advent Hoppists, due to a missing letter in "...and the greatest of these is hope."
The Kindle version of "A Game Of Thrones" (first book in the series) has "Dome" everywhere instead of "Dorne" (the name of the kingdom in the south). Apparently it was OCR'ed from the printed book.
This is one of those cases where applying some kind of additional dictionary-based "autocorrect" to the OCR output (which may be what actually happened) won't help, since the wrong reading is an actual dictionary word while the correct one is a proper name.
Without a very long-range model I don't think that would help. "in his/her anus" and "in his/her arms" can both be correct in the right circumstances; it takes quite a bit of surrounding context to tell which one is more likely. (While doing some research in Google Books I even found a couple that looked like OCR errors until I read beyond the search snippet.)
How does OCR software integrate letter and language models? Do they first make a best guess at the letters and then try to correct it with the language model? https://en.wikipedia.org/wiki/Optical_character_recognition#... gives me that impression, but I'm not sure.
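One plausible two-stage scheme (a sketch only, with made-up numbers; not what any particular engine is confirmed to do) is noisy-channel scoring: the letter recognizer proposes candidate words with confidences, and a language model re-ranks them:

    import math

    # Hypothetical candidates from the letter-level recognizer for one
    # word image, with recognition scores approximating P(image | word).
    candidates = [("anus", 0.6), ("arms", 0.4)]

    # Hypothetical unigram language model P(word), estimated from a
    # large clean corpus.
    lm = {"arms": 1e-4, "anus": 1e-7}

    # Noisy-channel decoding: choose the word maximizing
    # log P(image | word) + log P(word).
    best = max(candidates, key=lambda wc: math.log(wc[1]) + math.log(lm[wc[0]]))
    print(best[0])  # -> "arms": the language model outvotes the recognizer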
Brains are said to have a lot of feedback from higher levels of sensory processing to lower. Maybe you don't need as good a language model if its evidence is integrated more tightly with the rest.
Only without human oversight. Full human proofreading might not be economically feasible, but detecting likely OCR errors and having humans decide whether each one is an actual error might be.
After I stopped laughing, I went back and checked the Standard Ebooks corpus to see if any instances of this mistake had slipped through; luckily, it seems that in the intervening 9 years someone at Gutenberg and/or archive.org has corrected this particular issue in the source transcriptions.
Gutenberg is designed to avoid this sort of thing, although some errors slip through: originally they didn't use OCR, and now they use Distributed Proofreaders.
Yeah, I usually submit about 10-15 corrections to Gutenberg per book I proof; generally they're in good shape. The bigger problem with Gutenberg is that older transcriptions omit all accents, which is a huge problem for whole series of books. I've been trying to produce Maurice Leblanc's series of Arsène Lupin stories for Standard Ebooks, and Gutenberg generally spells the titular protagonist's name wrong.
You'd think the OCR process would somehow call attention to words that have a high probability of being wrong, and especially of being wrong in a problematic way. You don't want to require humans to read and sign off on everything, but it shouldn't be that hard to build something that lets a human quickly compare the scanned image against the transcription, triggered simply by the word "anus" appearing in the output.
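A minimal sketch of that kind of triage, assuming the OCR engine reports per-word confidence (the data, threshold, and watchlist here are all invented):

    # Hypothetical OCR output: (word, confidence, page) triples.
    ocr_words = [("found", 0.98, 12), ("in", 0.99, 12),
                 ("her", 0.97, 12), ("anus", 0.71, 12)]

    # Valid dictionary words that are embarrassing if they're wrong.
    WATCHLIST = {"anus", "simulate", "muck"}
    CONFIDENCE_THRESHOLD = 0.85

    # Queue anything low-confidence or on the watchlist so a human can
    # glance at the scanned image next to the transcription.
    review_queue = [(word, page) for word, conf, page in ocr_words
                    if conf < CONFIDENCE_THRESHOLD or word in WATCHLIST]
    print(review_queue)  # -> [('anus', 12)]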
I was reading "Creative Selection" by Ken Kocienda last week. It goes behind the scenes of his design of the iPhone keyboard early in its development (good read).
In any case, he mentions there is a hate-word dictionary specifically so that the autocorrect never suggests such words, even if they seem to be a close match. You basically have to type those words perfectly.
In another related bug, Xerox document centres (which weren't even technically doing OCR) were changing numbers from one thing to another in scanned IMAGES, due to high-compression settings substituting one digit for another: much more dangerous! https://www.theregister.co.uk/2013/08/06/xerox_copier_flaw_m...
The book it’s from, Uncle Tom’s Cabin, is full of creative orthography designed to reproduce the English pronunciation of slaves in the pre-Civil War South.
> Either it's a pun on 'pert' (impossible to tell without context) or it was a typo in the original (seems more likely).
I haven't read the original, but it was probably meant to add character to the way the character speaks: either to make fun of the character for not being able to pronounce words correctly, or to make them more pitiable, or just as a matter-of-fact detail.
No, the book says "pertistent", and the transcriber mistakenly wrote "persistent". Neither of those is a recognition error. Rather, the error is that the transcriber did recognize what was meant, and ignored what was actually written.
> If another copy from the same edition has the error corrected, such cues may help to identify early and late printings and contribute to a more comprehensive account of the book’s printing history.
In other words, when transcribing books you want to preserve misspellings that occur in the source text.
It's actually quite interesting, because it means that automatic spellchecking of OCRed text, while helping to improve the quality of the transcript, could also introduce unwanted corrections. But doing what the OP did and comparing their transcripts with those of Google Books was clever.
Tangent: scribal errors are often classified as to whether they are committed by scribes who do understand the language they're copying or scribes who don't. (Some errors can be committed by either kind of scribe, but will still tend to lean one way or the other.) Copying "persistent" where the text has "pertistent" is a good example of a kind of error that only a scribe who understands the text will make. (Though this particular case might not even be considered a scribal error.)
I was confused by your reference to understanding the text, so I looked it up. I think a scribe who really understood the text would recognise the intentional misspelling. The transcription error reflects a lack of understanding.
What intentional misspelling? "Pertistent" in the 1879 printing is a printer's error, and it's very clear if you look at the passage that it can't be intentional, because the same character uses the word 5 times in quick succession.
The transcriber's error is unwitting; he specifically comments on the fact that he didn't want to make it.
Reminds me of when xkcd looked at which days of the month were most common: the 1st, 10th, 11th, 21st and 31st were more or less common than they should have been, due to OCR errors: https://drhagen.com/blog/the-missing-11th-of-the-month/
Given the examples in the piece involving children, I wonder if there is any danger of a site getting accused of hosting child pornography, or getting blocked, because the text sounds so wildly inappropriate.
I don't know, children do have anuses, and they are known for their curiosity. I'm sure many a parent has had to dig Lego bricks out of various orifices.
Children have a few body parts whose depiction or discussion would get a site blocked from most schools, even when it stops well short of pornography.
I guess so the child could smell the Derry air.