A long time ago I decided that if I were ever to have a captcha system, it wouldn't use the distorted-text style, which is getting so randomized these days that it's hard for a human to read.
Frankly, I think knowledge or logic based captchas are the way to go.
"Todd is three times as old as Jane is. When Jane was ten years younger, Todd was five times as old as Jane is. How old is Todd?" for example.
Sure, takes longer to think about, but if someone can write a script to start parsing logic puzzles like that, quickly, and use it to defeat website signup authentication methods?
I'd take you on that bet. Write the system and I'll write the solver. I drink Guinness :)
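For what it's worth, a solver for that one template is not much code. Here is a minimal sketch, assuming the puzzle means "Todd is 3x Jane's age now, and ten years ago Todd was 5x Jane's age then" (the wording is ambiguous, so that reading is a guess). The regex, the number-word table, and the name solve_age_puzzle are made up for illustration and only cover this exact phrasing:

```python
import re
from fractions import Fraction

# Hypothetical number-word table; covers only small numbers for this sketch.
WORDS = {"two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
         "seven": 7, "eight": 8, "nine": 9, "ten": 10}

# Matches one fixed phrasing only; a real generator would vary its wording.
PATTERN = re.compile(
    r"(\w+) is (\w+) times as old as (\w+) is\. "
    r"When \3 was (\w+) years younger, \1 was (\w+) times as old as \3 (?:is|was)\. "
    r"How old is \1\?"
)

def solve_age_puzzle(text):
    """Return the older person's age, or None if the text does not fit the template."""
    match = PATTERN.search(text)
    if match is None:
        return None
    try:
        a = WORDS[match.group(2)]  # ratio now
        d = WORDS[match.group(4)]  # years ago
        b = WORDS[match.group(5)]  # ratio back then
    except KeyError:
        return None
    if a == b:
        return None  # no unique solution under the assumed reading
    # Assumed reading: T = a*J now, and T - d = b*(J - d) back then.
    # Substituting T: a*J - d = b*J - b*d, so J = d*(b - 1) / (b - a).
    younger = Fraction(d * (b - 1), b - a)
    return a * younger
```

The hard part in practice is not the algebra but covering every phrasing the generator can produce, which is exactly where a template-based system is weakest.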
A better system would be to ask the user to identify the gender of a person from an image, or to tell whether a picture is of a dog or a cat, and so on. These tasks are trivial for a person and very, very hard for a program.
My introduction to AI class required us to write a classifier for people in the news (20 different images of 10 different people). We were given the locations of the people's faces, and we were able to get 60% accuracy using SVMs on the 32x32 block of pixels (nose, eyes, mouth region). This was the "baseline" system. Some systems were getting 85% or better. I must admit that this was a restricted dataset, but the faces were not all looking straight ahead the way eigenfaces are, and I'm sure that with enough data and enough features that sort of CAPTCHA could be defeated a large percentage of the time.
You are confusing two different tasks. Identifying a person out of a small comparison group is relatively easy: just deconstruct the face and compare certain facial features. There has been a ton of research on the subject, and there are even some working commercial products (my school's AI lab uses one we built as a lock).
Identifying gender is significantly harder.
If you want something a lot harder, have them click on the picture of the more attractive person, using data from a hotornot-type site (just make sure the data isn't public). Good luck solving that with Support Vector Machines. If you want to generate more data, just build a reCAPTCHA-type system.
I don't see why it'd be harder with good features, and after looking at the article again, 35% accuracy was considered a success. Obviously, I'm not as qualified as you in this sense, but it seems logical based on results I've seen (again, admittedly not the same quality as you've probably seen).
I checked on the "obviously, I'm not as qualified as you in this sense" by looking at the user info, and I remembered that "Ideas to monetize new artifical intelligence" thread... so Marcus, sorry if it's offtopic, but how did you solve that problem? Are you doing captchas maybe? :)
In a way I understand that trying to apply the algorithm manually for each client is wasteful; negotiating with each client is tiring, especially when it's with a 7B company.
I'm thinking about using the idea I got of building a web-service around it, and letting people find their own uses for it.
I've always thought that for blog comments, the captcha should actually take the form of a few short SAT-style questions which would test reading comprehension. E.g.
"Which one of the following is not one of the points the author is trying to make?"
"Which of the following best describes the author's opinion of Java?"
"What did the author cite as his source for his labor statistics?"
The catch is that problems that are easy to generate with a computer are usually easy to solve with one. Even if you rotated through 100 different variations, solving 10 of those would give the spammer a 10% success rate. If it is just for one small site, no one is going to break it, but there are easier ways to prevent spam on small sites (comment spam bots don't understand JavaScript, for instance, which is something I take advantage of on my blog).
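The comment doesn't say how the JavaScript trick is done, so the following is only one plausible variant, not the poster's actual setup: the page's inline script copies a signed token into a hidden form field, and the server rejects any comment whose field is empty, stale, or forged. The names SECRET, issue_token, and looks_human are hypothetical, and the one-hour lifetime is an arbitrary choice.

```python
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical server-side secret, never sent to the client

def issue_token(issued_at, secret=SECRET):
    """Build the token that the page's script copies into a hidden form field."""
    signature = hmac.new(secret, str(issued_at).encode(), hashlib.sha256).hexdigest()
    return f"{issued_at}:{signature}"

def looks_human(field_value, now, max_age=3600, secret=SECRET):
    """Accept only a fresh, correctly signed token; a bot without JS sends none."""
    timestamp, _, signature = field_value.partition(":")
    if not (timestamp.isascii() and timestamp.isdigit() and signature):
        return False
    issued_at = int(timestamp)
    if not 0 <= now - issued_at <= max_age:
        return False
    expected = issue_token(issued_at, secret)
    # Constant-time comparison of the whole token, not just the signature.
    return hmac.compare_digest(expected.encode(), field_value.encode())
```

This only filters bots that never execute scripts; anything driving a real browser passes it, which is fine for a small site and useless for a big one.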
A system like that might be good for keeping commenter quality high though :).
Consider an image-based, interactive captcha using drag and drop.
Show a map with a bee on it: "Please guide this bee to the flower, passing the pond, the bear, and the scarecrow. Now bring the bear to the beehive." You would generate a hash string based on the sequence of visits. This may involve multiple back-and-forths, which is probably bad, but it seems fairly brainless and robust, for as little thought as I have given it.
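To make the "hash string based on the sequence of visits" part concrete, here is a rough sketch under assumptions of my own: the server picks a per-challenge nonce, keys an HMAC with it, and compares the hash of the expected visit order with the hash of what the drag-and-drop widget reports. The landmark labels, the separator, and the function names are hypothetical.

```python
import hashlib
import hmac

SEPARATOR = "\x1f"  # assumed never to appear inside a landmark label

def sequence_hash(nonce, visits):
    """Hash an ordered list of visit labels, keyed by a per-challenge nonce."""
    message = SEPARATOR.join(visits).encode()
    return hmac.new(nonce, message, hashlib.sha256).hexdigest()

def path_is_correct(nonce, expected_visits, submitted_visits):
    """True only if the submitted visits match the expected ones in order."""
    expected = sequence_hash(nonce, expected_visits)
    submitted = sequence_hash(nonce, submitted_visits)
    return hmac.compare_digest(expected, submitted)

# Hypothetical encoding of the bee example: "<actor>:<landmark>" per visit.
EXPECTED = ["bee:pond", "bee:bear", "bee:scarecrow", "bee:flower", "bear:beehive"]
```

Note that the hash by itself adds no security if a bot can read the landmark positions off the page; the strength of the scheme would have to come from the image recognition step, not from the hashing.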
Frankly, I think knowledge or logic based captchas are the way to go.
"Todd is three times as old as Jane is. When Jane was ten years younger, Todd was five times as old as Jane is. How old is Todd?" for example.
Sure, takes longer to think about, but if someone can write a script to start parsing logic puzzles like that, quickly, and use it to defeat website signup authentication methods?
I'll buy that person a pint.