TextCaptcha: Simple Textual Captcha Challenges (textcaptcha.com)
65 points by miki123211 on Jan 2, 2020 | 63 comments



My only objection to text captchas, apart from how easily bots can solve them, is that they require you to read and understand the question and know the answer, all in the language the captcha is served in.

This might sound like a nonexistent issue, but let's take the example of a country like India, with its huge number of internet users. Many people there use the Internet without being fully literate in English or even their mother tongue. Use cases like checking exam results, train tickets, etc. require captchas to be filled in, and the normal deformed-letter ones do well. I doubt text captchas will be accessible to these users.

But then again, user base matters a lot too and that should be taken into account always!


I was looking through the examples, and for a large percent of the users I deal with in the US, these questions would be too difficult.

I had a user struggle to find the exclamation symbol on the keyboard.


Off topic: If you are in Germany, the ReCaptcha is displayed in German, even if your browser's preferred language is "en". However, ReCaptcha is now ingrained in muscle memory for people, so they automatically click traffic light/stairs/bus/cars/cycles/palms/hills/zebra crossing, whichever is shown multiple times without reading instructions. For textual ones, people will adapt soon by pattern matching, whatever seems out of normal is the answer :)


There certainly is some benefit of this, but the barrier for bots is low. All it needs is a custom bot that can be written in minutes.

That's because it seems to use a fixed set of question types, all of which a simple script can solve.

Example:

    What is Mark's name?
Writing a script that extracts the name from this string is trivial, and the other question types are similarly easy.
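
To make the point concrete, here is a minimal sketch of such a script in Python. The regular expression is an assumption about how this one question type is phrased; it is not taken from the TextCaptcha service, and other question types would each need their own pattern.

```python
import re
from typing import Optional

# Assumed phrasing for one question type: "What is <Name>'s name?"
NAME_QUESTION = re.compile(r"^What is (\w+)'s name\?$")


def solve_name_question(question: str) -> Optional[str]:
    """Return the lower-cased name if the question matches the pattern, else None."""
    match = NAME_QUESTION.match(question.strip())
    return match.group(1).lower() if match else None


print(solve_name_question("What is Mark's name?"))  # mark
```

The answer is lower-cased because the service is described as checking lower-cased answers.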

The next step would be to allow the site owner to put in domain-specific questions. For example, a chemistry website might ask "What is the formula of $molecule?" and then provide ['Water'=>'H2O','Salt'=>'NaCl'...] as questions. Then the site would be protected from a general textcaptcha solver, and bots would have to be customized for each domain.
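
A rough sketch of that idea in Python, assuming the site owner supplies a small question bank. The names `FORMULAS`, `make_challenge` and `check_answer` are hypothetical, and only the two entries from the comment are included.

```python
import random
from typing import Tuple

# Hypothetical site-specific question bank; entries taken from the comment above.
FORMULAS = {"Water": "H2O", "Salt": "NaCl"}


def make_challenge(rng: random.Random) -> Tuple[str, str]:
    """Pick a molecule and return (question, expected answer)."""
    molecule = rng.choice(sorted(FORMULAS))
    return f"What is the formula of {molecule}?", FORMULAS[molecule]


def check_answer(expected: str, response: str) -> bool:
    """Compare case-insensitively, ignoring surrounding whitespace."""
    return response.strip().lower() == expected.lower()


question, expected = make_challenge(random.Random())
print(question)
```

With a bank this small a site-specific bot would still be easy to write, so the protection comes from the bot author having to bother at all.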


I like the idea of domain specific questions, but I also think that questions are the biggest added value of this API. If you have your own list, then you can implement your own "captcha" pretty easily.


The value here is not that they are hard to solve; with enough effort you could probably write something that solves these too. The question is whether you are willing to go through that effort for something that only a few sites use.

This is in contrast to ReCaptcha, which is used by millions of sites, so using crowdsourcing and systems to make them easy to use is worth it there.

For a long time I used the fixed captcha, "What is 1+1?", but written in my local language, and managed to filter out all the spam...


Custom captchas are not very hard to implement, but for some reason very few sites actually use them. Is it that we don't believe in our ability to write a few questions with honeypots, or are we just overconfident that the Google service is the only one people could use?


I like the idea because, according to ReCaptcha, I am officially a bot! And it works! Whenever I see one of those, I just close the tab and go away.


I found the Adafruit "what size is this resistor" captcha pretty clever: https://blog.adafruit.com/2010/04/21/resisty-resistor-captch...

Not suggesting it's broadly useful, but cool nonetheless.


I'm personally a fan of Lichess's "mate in 1" captchas.


I'm crap at chess, but there's this: http://tetration.xyz/ChessboardFenTensorflowJs/


> The answers are provided as MD5 checksums of the lower-cased answers which allows you to compare a users response with the answers without explicitally [sic] knowing the answer yourself.

Why bother? All the hashes I tried were able to be instantly cracked online by CrackStation.

If the hashes are kept server-side, then why not use plaintext?


Not only that, it gives the false impression that people can do it all client-side... If you're sending the answers over the wire, they can simply be sent back to you without any real computational effort. No clever text processing required.

Keeping it in plaintext would have made it more obvious how dumb doing this is. Heck, two separate files would have been even better. Also, you can do something like Damerau-Levenshtein [1] on plaintext, so you wouldn't have to be a stickler for exact matches.

[1] https://en.m.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_...
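
As an illustration of the fuzzy-matching suggestion, here is a small Python sketch. It implements the optimal string alignment variant (sometimes called restricted Damerau-Levenshtein), which is simpler than the full algorithm; the `max_distance` threshold of 1 is an arbitrary assumption.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def is_close_enough(response: str, answer: str, max_distance: int = 1) -> bool:
    """Accept responses within max_distance edits of the plaintext answer."""
    return osa_distance(response.strip().lower(), answer.lower()) <= max_distance


print(is_close_enough("Firday", "friday"))  # True
```

This only works when the server holds the plaintext answer; a hash of the answer cannot be compared approximately.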


Are new questions being continually added to the database of challenges? If not, one can quite simply request many challenges, have a human solve them, and remember the answers.

If you generate the questions based on fixed rules, it is only a matter of time before an algorithm or a human figures out all the necessary patterns.

I personally feel like these text puzzles aren't as energy-draining to solve as reCAPTCHA, so a human could reasonably solve many hundreds in an hour of free time or so. That doesn't even include having a TTS engine dictate the question to Google Assistant.


You leave this implication that modern "hard-to-crack" captchas are somehow different, when in reality reCaptcha and co. are cracked to death and you can literally buy 5000 captcha solves for less than 5 USD.

Captchas in general are a bit of an illusion really. The best Captcha you can have is a home-made one as the attacker has to go through actual manual effort rather than enabling --solve-recaptcha flag on the bot script.


I feel like there are orders of magnitude differences in cost and effort between “completely free because you downloaded a library” and “$5 per 5k and there’s a few HTTP requests in there too”


Is there though? Whom are you protecting yourself from? People who write curl scripts or serious attackers? Surely if someone wants to commit to an attack, $5 is a completely negligible amount of money, right?

All you "protect" yourself from is casual script users and script kiddies, which really can be solved by IP rate limiting. If someone has access to thousands of IPs they can probably afford to drop $5 to solve the captchas too, right?


Having written code to bypass captchas, up to and including Google ReCaptcha: yes, there is a difference. A large difference.

$5 per request is not a negligible amount of money. In practice it doesn't cost anywhere near that amount to call a MechanicalTurk API which will solve ReCaptcha for you. But it's still significant for any nontrivial number of requests, such as in the use case of scraping.

You should adjust your priors here. You're focused on the narrow case where a win condition is achieved by spending n dollars to solve a single instance of ReCaptcha. People who use ReCaptcha are (in my professional experience) overwhelmingly more focused on requiring ReCaptcha to be solved for every individual request of a given type.

I have been in the position you speak of, where I had a revolving set of IP addresses, requesting servers and user agents, and $5 per request would have immediately shut my operation down. As it was, the actual ~$0.15 per request to solve ReCaptcha was sufficiently significant that I couldn't curate enough data for what I needed, despite having all the other resources you mention.


Seems like you really hadn't done your research. Deathbycaptcha has been around for probably a decade now and their rates were always something like $0.1 per captcha. That's not it though, captchas take profiles, so if you correctly configure your profiles then it becomes even less than $0.1 per solve as a profile might only get 1 captcha per 10 requests.

It's dead easy to get around captchas unless you're just a casual scripter that wants to `wget` that one article - then fuck that guy, right?


The cost mentioned upthread was $5 for 5000 requests, or 0.1 cent per request. Would that have allowed you to collect the data you wanted?


Possibly, I no longer do that work. But that was never the price I saw for the service. Cheapest I ever saw was still 10 times that, and requests frequently had to be resent due to spotty completion.


I think people who use CAPTCHAs know that they are protecting their site from script kiddies. Real attackers are going to find actual vulnerabilities to achieve much bigger damage than a script kiddie abusing the "comment" feature to write spam content, or otherwise making a large number of submissions for the webmaster to sift through.


CAPTCHA is typically one of several defences, and you underestimate the cost they impose on attackers. The main problem for an attacker is not really the dollar cost of buying a CAPTCHA solver; the real inconvenience is the time it takes to solve one. Attackers go from less than a second per request to 30-60 seconds per request, a significant slowdown.


> Is there though?

If there wasn’t, people wouldn’t use captchas


Captchas are not "a bit of an illusion." In the past I ran very large, bespoke data mining campaigns, and ReCaptcha was continually a source of annoyance. My usual response to seeing ReCaptcha was to work on a different project or find a different dataset, because spending even a few cents per request to solve the captcha would make collecting the dataset unviable unless I was very certain it would pay off.


A neat way to improve on this would be to serve a series of 3-5 challenges, especially math and logic puzzles that build on each other. That makes it a multi-dimensional problem that is difficult to cache or precompute.

The answer that gets hashed is just a string that concatenates the individual answers, and the challenges are always mixed in a different order. One possible example:

1. Does Red combine with Yellow to make Green or Orange?
2. If the answer to the last question was reversed, what letter would it start with?
3. If you added that letter to the end of these words, which word would be most edible? Mac, Nam, Pi, Snak

answers: orange,e,pi

Of course, you'd want to design the interface to support multiple choice selection.

I would also recommend against using MD5, since, even if the hash weren't known to the end user, which should be sufficient even in this case, the attack MD5 is most known for is the fact that it's trivial to generate text that could match any given hash, regardless of what was used to originally make it. It seems like a potential attack vector somehow, depending on the case, and it's not terribly harder to just use one of the many tried-and-true, known-not-broken cryptographic hashing algorithms. SHA-256 would be adequate.

I'm not 100% certain of all the logic behind this, and there are always cases I'm not rigorous enough to consider, but I'd be interested in seeing how others might improve this approach in similar ways.
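
The chained-challenge idea above could be sketched roughly like this in Python, using SHA-256 as suggested. The comma separator and the function names are assumptions made for illustration; the example answers are the ones given in the comment.

```python
import hashlib
import hmac
from typing import List


def chain_digest(answers: List[str]) -> str:
    """Hash the normalized, comma-joined answers of a challenge chain."""
    normalized = ",".join(a.strip().lower() for a in answers)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def check_chain(responses: List[str], expected_digest: str) -> bool:
    """True only if every answer in the chain is correct and in order."""
    return hmac.compare_digest(chain_digest(responses), expected_digest)


expected = chain_digest(["orange", "e", "pi"])
print(check_chain(["Orange", "E", "Pi"], expected))  # True
```

Because a single digest covers the whole chain, one wrong step fails the check without revealing which step was wrong.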


> it's trivial to generate text that could match any given hash

Source on this?

Wikipedia states what I have heard before which is that MD5 collision attacks are pretty trivial now, but carrying out a preimage attack as you describe remains theoretical at this time.

https://en.wikipedia.org/wiki/MD5


There is a way to defeat this that is much simpler: in the examples on the page, 5 out of 7 have the answer in the question. Just compute the MD5 sum of every word/combination of words in the question and you would find the answer to many of the questions. This, together with a targeted dictionary, would probably give you a very high success rate for little cost. MD5/SHA-family hashes are inexpensive to compute; you can do billions of them in a second. If you can't find the answer, just request a new challenge until you find one you can answer.
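
A minimal Python sketch of this attack, under a simplifying assumption: it only tries contiguous runs of one or two words from the question rather than every possible combination. The sample question is made up for illustration and is not taken from the service.

```python
import hashlib
import re
from typing import Iterable, Optional


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def guess_from_question(
    question: str, answer_hashes: Iterable[str], max_words: int = 2
) -> Optional[str]:
    """Hash each run of 1..max_words question words and look for a match."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    targets = set(answer_hashes)
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            if md5_hex(candidate) in targets:
                return candidate
    return None


# Hypothetical challenge whose answer appears in the question text.
question = "Which of sock, library, cake or red is a colour?"
print(guess_from_question(question, [md5_hex("red")]))  # red
```

When this returns `None`, the attacker would fall back to a dictionary or simply request another challenge, as described above.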


> it's trivial to generate text that could match any given hash

Actually I don't know why a hash is used at all. According to the example, answers are stored in a session. CRC32 would do the job as well. Or no hashing at all. You would need a better hash only when the user downloads it. I can imagine a different flow where you would need a better hash: i.e. you have some secret token, hash it together with the captcha answer, send the question with the resulting hash to the browser, and the user sends back the answer together with the hash he got. In such a flow there would be no need to store values in a session.
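
One way to sketch that session-less flow in Python is with an HMAC keyed by a server-side secret. This swaps the "hash the token together with the answer" step for the standard `hmac` construction; the `SECRET` value and function names are placeholders, not part of any real TextCaptcha API.

```python
import hashlib
import hmac

# Placeholder secret; a real deployment would load a random value from config.
SECRET = b"replace-with-a-random-server-side-secret"


def sign_answer(answer: str, secret: bytes = SECRET) -> str:
    """Token sent to the browser alongside the question."""
    normalized = answer.strip().lower().encode("utf-8")
    return hmac.new(secret, normalized, hashlib.sha256).hexdigest()


def verify(response: str, token: str, secret: bytes = SECRET) -> bool:
    """Recompute the token from the user's response and compare in constant time."""
    return hmac.compare_digest(sign_answer(response, secret), token)


token = sign_answer("Friday")
print(verify("friday", token))  # True
```

Without the secret, the client cannot brute-force the token offline. The sketch does not prevent replay, though: a solved question and token pair stays valid unless a nonce or expiry time is also signed.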


Some interesting "behind the stage" insights about this post.

I posted this a couple of months ago. Just before Christmas, I received an email from the mods asking me to repost, as they thought the story was interesting. Initially after reposting, it didn't get much traction, but when I look at it now, it actually has upvotes. However, the number of upvotes it has is much bigger than the amount of karma I received. I wonder how correlated these facts are.


As recognized, captchas are not effective in keeping determined people out. Nowadays the method to put some price tag for access seems to be to use SMS verification.

However not everybody is very determined. There are for example bots that are using the comment forms to send spam. The expected payout for each posted message must be really, really low.

This could be useful for example in the comment form plugins created for various content management systems.


TextCaptcha can be a viable alternative to reCAPTCHA depending on the type of service you're protecting, and it doesn't come with the privacy drawbacks of Google's catch-all privacy policy. The personal data collected by the reCAPTCHA service is not clearly distinguished in their privacy policy [1], and there are no tools to exercise control over the data gathered by reCAPTCHA.

W3C has put together a comprehensive document about captcha types and their application: https://www.w3.org/TR/turingtest/

[1] https://policies.google.com/privacy?hl=en


This seems like exactly the sort of questions a big language model a la GPT-2 would be able to answer.

I'm curious how TextCaptcha would fare in terms of complexity compared to the other language understanding benchmarks.


Honestly, someone willing to spend effort & time & resources on a big language model will have enough resources to find other ways to access whatever they are hoping for, similar to how they can buy ReCaptcha solving services. So, in a sense, if that is your threat model, you need multiple traps; neither of these will suffice alone. :)


This will work well as long as it's a niche captcha and creating an automated solver isn't worth the effort. Hardcoding something in JavaScript that will be injected into all form submits is even simpler and doesn't require anything from your users. It will work as long as bots don't execute JS, or until someone is willing to spend some time on modifying the bot.


How many of these logic questions are there? After refreshing http://api.textcaptcha.com/myemail@example.com.json about a hundred times, I'd estimate there are about 50 questions. Someone could write a script that solves each particular question manually.


I created something similar for my personal website. I don't get spam comments, but I think that's only because I'm too small to be worth the effort for spammers. It would be pretty easy to pattern-match the few different question types and find the right answer using a dictionary.


I just ran some statistics, and "four" is the correct answer to 8% of the questions.


> The answers are the MD5 hashes of correct lower cased answers: you should be able to check responses from real users you challenge with the question against these checksums.

That's a somewhat concerning choice of hash function. MD5's collision resistance is broken. It has been well known to be broken for more than a decade.

There are references to PHP on the page. 10 years ago [0] this message appeared in PHP's manual:

> The well known hash functions MD5 and SHA1 should be avoided in new applications. Collission attacks against MD5 are well documented in the cryptographics literature and have already been demonstrated in practice. Therefore, MD5 is no longer secure for certain applications.

[0] https://www.php.net/manual/en/function.hash.php#94104


Collision resistance is irrelevant if there's already a hash value specified. What matters here is being able to find preimages for a given output, and even if MD5 had a practical preimage attack, it would be too expensive to use just for cracking captchas. :-)

PHP is a red herring; this would apply for any language.


> PHP is a red herring; this would apply for any language.

I used the warning from the PHP manual, because the site creator should be familiar with it.

You'll find similar warnings against MD5 everywhere.


The problem is rainbow tables, not collisions.

Search Google for f6f7fec07f372b7bd5eb196bbca0f3f4, and you’ll see the answer is Friday. That stateless PHP example is rendered useless by this.
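
For illustration, a precomputed reverse lookup can be sketched in a few lines of Python. Strictly speaking this is a plain lookup table rather than a rainbow table (which trades storage for computation via hash chains), and the tiny `WORDLIST` is an assumption standing in for a real dictionary.

```python
import hashlib
from typing import Optional

# Tiny stand-in dictionary; a real attacker would use a much larger word list.
WORDLIST = [
    "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday",
    "one", "two", "three", "four", "five", "red", "green", "blue",
]

# Precompute MD5 hash -> plaintext once, then every lookup is a dict access.
TABLE = {hashlib.md5(w.encode("utf-8")).hexdigest(): w for w in WORDLIST}


def reverse_lookup(md5_hash: str) -> Optional[str]:
    """Return the plaintext for a known hash, or None if it is not in the table."""
    return TABLE.get(md5_hash.lower())


sample = hashlib.md5(b"friday").hexdigest()
print(reverse_lookup(sample))  # friday
```

Since captcha answers are short, common words, even a modest dictionary would likely cover most of them.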


That's only relevant if you're using it to make signatures, thus you can make two inputs that hash to the same digest, and signing that digest creates a signature that's valid for both inputs.

In this case, the only threat model might be brute-forcing the answer, but that applies to SHA-2 as well, since both are designed to be fast so that you can hash gigabytes in reasonable time. For that, something memory-hard such as Argon2 should be used.


> That's only relevant if you're using it to make signatures, thus you can make two inputs that hash to the same digest, and signing that digest creates a signature that's valid for both inputs.

It looks to me like they're just checking hashes, so if you have a collision, you can game it:

> The answers are the MD5 hashes of correct lower cased answers: you should be able to check responses from real users you challenge with the question against these checksums.


Is that a problem for this use case?


They're matching checksums to prove it's a human. You can check these checksums, and find or create input that will match it nearly effortlessly.

So yes, I'd say that's a problem for this use case.


I don't really see how. What does generating collisions get you? Doesn't this just require preimage resistance?


I like the use of MD5 hashes to make it stateless. You could also test on the client side before sending the form; it's so annoying for the full form to be rejected because I couldn't guess the captcha.


This is also an attack vector that the article mentions. If you have logic on the client-side that can tell if the captcha is correct it lets an attacker brute-force it. (Basically as far as I see you can't send the hashes to the client without opening up this vector.)


True, I thought about scrypt and salts to slow down a brute force but for a 5 letter captcha I guess the search space would be too small no matter what you did.
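
A quick back-of-the-envelope sketch in Python of why the search space is the limiting factor. It assumes answers made only of lowercase ASCII letters and uses MD5 for the demonstration; a slow function such as scrypt would raise the cost per guess but not the number of candidates.

```python
import hashlib
from itertools import product
from string import ascii_lowercase
from typing import Optional


def search_space(max_len: int, alphabet_size: int = 26) -> int:
    """Number of candidate strings of length 1..max_len."""
    return sum(alphabet_size ** n for n in range(1, max_len + 1))


def brute_force(target_hash: str, max_len: int = 3) -> Optional[str]:
    """Try every lowercase string up to max_len against an MD5 hash."""
    for length in range(1, max_len + 1):
        for letters in product(ascii_lowercase, repeat=length):
            candidate = "".join(letters)
            if hashlib.md5(candidate.encode("utf-8")).hexdigest() == target_hash:
                return candidate
    return None


print(search_space(5))  # 12356630
print(brute_force(hashlib.md5(b"pi").hexdigest()))  # pi
```

About 12 million candidates is small enough that even a deliberately slow hash only delays the attack rather than preventing it.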


How about keeping the same idea of a different question every time but mixing in some images? Like:

"How much is one + (image depicting a 6)?


It wouldn't be a text captcha, then.

It would certainly lose utility for visually-impaired users.

OCR is pretty well advanced by now; the image captchas that used to rely on it now use extremely distorted images.


Try solving this with OCR next:

> How much is one + https://i.imgur.com/VbyiURo.png


Apparently I'm a robot. I can't add one + owl + owl and type in an answer I expect to have recognized.


Isn't there any kind of problem that can be generated automatically but can't be solved reliably by machines?


Winograd Schema Challenges would be a step further for these text captchas. They ask you to resolve an ambiguous pronoun in a statement.

Examples:

- The man couldn't lift his son because he was so weak. Who was weak?

- The firemen arrived after the police because they were coming from so far away. Who came from far away?

- The drain is clogged with hair. It has to be [cleaned/removed]. What has to be [cleaned/removed]?

They are crafted in such a way that there are two variants of the challenge every time. Ex: The man couldn't lift his son because he was so heavy. Who was heavy?

I don't know if they are hard to generate automatically. They do require common sense to solve and it's still an open problem.

https://en.wikipedia.org/wiki/Winograd_Schema_Challenge


Yes, that was the idea of reCAPTCHA — give tasks that are hard for computers to do to (several) humans. It started with recognizing words from books that were scanned but where OCR failed.


interesting idea. the fact that I have never encountered one of these in use suggests that they don't work well enough.


or maybe it suggests that google has a monopoly


Aside: Not exactly this, but Amazon, Google (signup page) and a few other big infra providers still use squiggly textual captchas. ReCaptcha is used by small players (no offense intended), as UX is often not the main priority. You can try to spam the Amazon/Google et al. signup pages and you'll be greeted by those squiggly captchas of olden days. :)


This is true for any new product


I've seen stuff like this in use for domain specific knowledge for forums for years - either there is a near infinite question pool or you're reduced down to something that's broken easily given a site-specific bot.


Over 10 years ago, off the shelf forum spam software, Xrumer, would build a database of these questions for you when it detected the field on a /register page. Since basically every website has a small, finite pool of questions (usually just one), you'd just sit down, answer them all, and then click "resume".



