reCAPTCHA supposedly uses scans of old NYT articles like this as the source images for some of its challenges. The idea is to harness crowd intelligence to digitize the archives. One half of the challenge is a word whose reading is already known; the other is one that still needs human intelligence to interpret.
So it's a little sad, then, that when you click through to the actual article, all you get to read is a blown-up image.
I wonder about that, since I invariably get one legible word in reCAPTCHA and one junk word. Not just illegible or non-OCR-able, but actually nonsense strings of letters, e.g. "umower", "dealiff", "etstcom". My theory is that the source OCR is incorrectly segmenting the text, so some words get split into multiple fragments. And reCAPTCHA is useless for fixing that.
The way it works is that the one legible word is the control string and the other is the challenge string -- in cases like this it's obvious which is which, but not always. The challenge string could be composed of characters from different scans, each of which had failed recognition by the OCR software.
The control word is there to prove that you're a human, and the challenge word is there for you to contribute a small amount of work. In this case that work could benefit multiple scans at once.
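The control/challenge scheme described above can be sketched roughly as follows. This is a hypothetical illustration, not reCAPTCHA's actual implementation: the function names, the vote store, and the consensus threshold are all assumptions.

```python
# Hypothetical sketch of the control/challenge mechanism (not real reCAPTCHA code).
# The control word has a known answer and verifies humanity; the challenge word's
# answer is unknown, so human guesses are collected as votes toward a consensus.

def verify_and_collect(user_answers, known_control_answer, challenge_votes):
    """Return True if the user passes the control word.

    user_answers: (control_guess, challenge_guess) typed by the user.
    challenge_votes: dict mapping a guessed reading -> number of votes.
    """
    control_guess, challenge_guess = user_answers
    if control_guess.lower() != known_control_answer.lower():
        return False  # failed the known word: likely a bot (or a typo)
    # The user proved humanity, so trust their challenge guess as one vote.
    key = challenge_guess.lower()
    challenge_votes[key] = challenge_votes.get(key, 0) + 1
    return True

def consensus(challenge_votes, threshold=3):
    """Accept a reading once enough independent humans agree (threshold assumed)."""
    if not challenge_votes:
        return None
    word, count = max(challenge_votes.items(), key=lambda kv: kv[1])
    return word if count >= threshold else None
```

A bad OCR segmentation, as described above, would still converge on a consensus reading of the garbled fragment, which is exactly why the human effort is wasted in that case: the "word" being confirmed was never a real word.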
The archive is full-text-searchable, though, so they must have something internally. Not sure if they just don't want to expose it, or if it's OCR with too many errors to be presentable to the public.