reCAPTCHA supposedly uses scans of old NYT articles like this as the source images for some of its challenges. The idea is to harness crowd intelligence to digitize the archives. One half of the challenge is a word whose reading is already known; the other is one that still needs human intelligence to interpret.
So it's a little sad, then, that when you click through to the actual article, all you get to read is a blown-up image.
I wonder about that, since I invariably get one legible word in reCAPTCHA and one junk word. Not just illegible or non-OCR-able, but actually nonsense strings of letters, e.g. "umower", "dealiff", "etstcom". My theory is that the source OCR is incorrectly segmenting the text, so some words get split into multiple fragments. And reCAPTCHA is useless for fixing that.
The way it works is that the one legible word is the control string and the other is the challenge string -- in cases like this it's obvious which is which, but not always. The challenge string could be composed of characters from different scans, each of which had failed recognition by the OCR software.
The control word is there to prove that you're a human, and the challenge word is there for you to contribute a small amount of work. In this case that work could benefit multiple scans at once.
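The control/challenge scheme described above can be sketched roughly as follows. This is a hypothetical illustration, not reCAPTCHA's actual implementation: the function names, the vote store, and the consensus threshold are all assumptions.

```python
# Hypothetical sketch of the control/challenge mechanism (not real reCAPTCHA code).
# The control word has a known answer and verifies humanity; the challenge word's
# answer is unknown, so human guesses are collected as votes toward a consensus.

def verify_and_collect(user_answers, known_control_answer, challenge_votes):
    """Return True if the user passes the control word.

    user_answers: (control_guess, challenge_guess) typed by the user.
    challenge_votes: dict mapping a guessed reading -> number of votes.
    """
    control_guess, challenge_guess = user_answers
    if control_guess.lower() != known_control_answer.lower():
        return False  # failed the known word: likely a bot (or a typo)
    # The user proved humanity, so trust their challenge guess as one vote.
    key = challenge_guess.lower()
    challenge_votes[key] = challenge_votes.get(key, 0) + 1
    return True

def consensus(challenge_votes, threshold=3):
    """Accept a reading once enough independent humans agree (threshold assumed)."""
    if not challenge_votes:
        return None
    word, count = max(challenge_votes.items(), key=lambda kv: kv[1])
    return word if count >= threshold else None
```

A bad OCR segmentation, as described above, would still converge on a consensus reading of the garbled fragment, which is exactly why the human effort is wasted in that case: the "word" being confirmed was never a real word.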
The archive is full-text-searchable, though, so they must have something internally. Not sure if they just don't want to expose it, or if it's OCR with too many errors to be presentable to the public.