The No CAPTCHA problem (homakov.blogspot.com)
215 points by homakov on Dec 4, 2014 | 96 comments



This is why I still recommend using other form spam prevention techniques before sacrificing usability for a CAPTCHA. One of the most effective combinations for 80%+ of the sites I've ever dealt with is having a honeypot field in the form, plus some amount of time required to pass before the form can be submitted successfully. There are other ways to mitigate bots as well, but these two alone have sufficed for quite some time for anything but dedicated human-based attacks.

Granted, I think the checkbox CAPTCHA is much better than the UX disaster that is the 'type some hard-to-read letters' CAPTCHA, but it's still adding a burden on the user, rather than a burden on the bot.

(Source: I maintain the Drupal Honeypot module[1], and have used it in a ton of different situations where CAPTCHA/reCAPTCHAs would normally be recommended).

[1] https://www.drupal.org/project/honeypot
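The honeypot-plus-time-gate combination described above can be sketched roughly as follows. This is a minimal illustration, not the Drupal module's actual code: the field name `homepage_url`, the 5-second threshold, and the `looks_like_bot` helper are all assumptions made for the example.

```python
import time

HONEYPOT_FIELD = "homepage_url"  # hypothetical decoy field, hidden from humans via CSS
MIN_SECONDS = 5                  # hypothetical minimum time a human needs to fill the form


def looks_like_bot(form, rendered_at, now=None):
    """Return True if the submission trips the honeypot or the time gate."""
    now = time.time() if now is None else now
    # Humans never see the decoy field, so any value in it is suspicious.
    if form.get(HONEYPOT_FIELD, "").strip():
        return True
    # Submissions that arrive faster than a person could type are suspicious.
    return (now - rendered_at) < MIN_SECONDS
```

In a real deployment the render timestamp would need to be stored server-side or signed, since a bot could otherwise forge it. Varying the field name and threshold per site is what makes a generic bot's job harder.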


The weird thing about this entire No Captcha solution, in my opinion, is that it assumes that a captcha is the most efficient method for defeating spam.

In most blackhat circles, captchas are an afterthought. You figure out everything else (IPs, original content), then plug in a service like deathbycaptcha that solves the captcha for... looks like $1.39 per 1000 (thanks to ultramancool for the correction). (http://deathbycaptcha.com). What nocaptcha does is only show that captcha (which is already defeated), to a certain subset of the users who haven't been deemed trustworthy. So, the big bot builders will take a day or two and beat the system, and we're right back to where we started.

Honeypots, however, are brutal - especially if you throw a couple in there. When building a bot you build it for efficiency. If your site does anything abnormal (whether it's 'what's n+n?' or 'what popular figure comes through your chimney in December?') a bot is hopeless.

That being said, however, a bot is only hopeless so long as a solution isn't implemented widely enough to be worth breaking through for spammers. If, for example, Wordpress came up with 1000 questions like that, someone somewhere would come up with and sell 1000 solutions.

In some sense it may be the case that Google is one of the worst companies to create a simple anti-spam API. I'm sure there's something they could do that would be more effective than this, but this won't really move the needle.


Just a small correction, it's $1.39 per 1000. Some competing services are as low as $1 per 1000.

> If your site does anything abnormal (whether it's 'what's n+n?' or 'what popular figure comes through your chimney in December?') a bot is hopeless.

Check out https://github.com/kbhomes/TextCaptchaBreaker for a great example of how trivial these are to break. And free too. Not to mention you could convert them to an image and feed them to a site like deathbycaptcha, antigate, etc. I've tried feeding some fun stuff like this through these services, you get interesting results and will likely have a high failure rate, but you'll get enough right to pass around 50-70% of the time.

Honeypot fields are pointless as a good bot just rips the whole form and fills in what it wants, if needed, executes JS too.

Of course, I'm assuming a determined attacker going after your individual site, not a bot just spamming random web forms. So it really depends on your threat model.


> If your site does anything abnormal (whether it's 'what's n+n?' or 'what popular figure comes through your chimney in December?') a bot is hopeless.

I run a forum using Invision Power Boards, which has a built in question-and-answer verification during registration. Soon after I set it up I watched a bot in the server logs completing the registration in only a couple seconds.

I suspect that because IPB is a big enough target, they farm out the questions just like they farm out CAPTCHAs, and build a database of questions and answers. You'd need to include some randomness in the questions to throw them off.


With both of your examples (and many others I've come across) those question type captchas can be done with a quick ping to Google and a sanity check on the answer

"what popular figure comes through your chimney in December" -> "Santa Claus - Wikipedia, the free encyclopedia"

"what's 1+1" -> "2"

They only really work if maybe the question is in the market of the site you're registering for "What's <some popular guy on site>'s last name" etc


Well the Santa Claus question will defeat bots for now at least... http://imgur.com/qX8pWrQ


That didn't take very long.


dammit you beat me to it.


And that's why Google can make you answer a captcha if it thinks you're a bot doing searches.


~$1.30 for 100 CAPTCHAs, $50 for 10000 Google searches per day, along with other search engines filling the same role.

Probably the most damning thing is that they are questions and answers from a database. Unless you bother to make your DB unique (so no using openly available Q/A databases), everyone is going to have access to the right answers.

Also, similar to paying people to answer CAPTCHAs, you can pay people to answer questions.


Perform the search from a different IP appearing as a different browser! This captcha arms race is already making me knackered


A custom spam prevention system, such as a honeypot or the simple "What's n+m?" field, works at 100% until your site is valuable enough for the attacker to spend a couple of cycles to circumvent the honeypot - then it drops to 0%. Publishing your solution as a module just serves to increase the value of circumventing it.

The value of "real" CAPTCHAs is that they retain their deterrence no matter how much effort the attacker invests.


The logic is sound, but doesn't play out in the real world that often. The reality is, the overwhelming majority of form spam comes from scripts and bots that are not that well-targeted, and in most cases, people who would normally install reCAPTCHA have no need for it.

Additionally, even though the module has been published for quite a while, and honeypot/time-gate techniques are fairly common, most bots can't account for the small adjustments that are made from site to site using adjustable settings for the module (like field name, time defaults, etc.).

The truth is: once your site or app is targeted by a human who is determined to spam you, the stakes are raised to the point where neither CAPTCHAs nor standard honeypots will deter everything. You will have to do at least some ongoing work to find a way to defeat the spammers.


>The logic is sound, but doesn't play out in the real world that often.

It does play out in the real world a whole lot if you have the misfortune to be in charge of certain kinds of websites.

> once your site or app is targeted by a human who is determined to spam you, the stakes are raised to the point where neither CAPTCHAs nor standard honeypots will deter everything.

reCAPTCHA stops the bulk of it. Yes, people can still use CAPTCHA farms, but at the very least they increase the attacker's cost and will generally reduce their spam rate by a lot. After 4chan rolled out reCAPTCHA, 5+ years of spam problems vanished in an instant.


> It does play out in the real world a whole lot if you have the misfortune to be in charge of certain kinds of websites.

Very true; for some classes of sites, this is definitely the case. I was speaking more in a general sense, as I see many developers and project teams install some sort of CAPTCHA on every site as a default. In most circumstances, I think people should go for the simpler solution first, then be ready to drop in something like reCAPTCHA (or some other active spam deterrent) only when spam becomes a real problem.


I'd agree with that. It's good to have the code production-ready at a moment's notice though, because the reaction time can sometimes mean the difference between having to clean up 100 spam submissions and thousands of spam submissions.


It takes an order of magnitude more effort to code around my custom-built anti-spam solution than it took for me to create it.

This doesn't always mean it's worth it, but it's a good leverage for beginners.


We use the Drupal Honeypot module on PortableApps.com and it works well. Mainly to block spammers from flooding comments and forum posts with bots. As one of the world's largest Drupal sites, we're under constant attack. We still use captchas for user registration, though, as we have a ton of accounts created by humans and then posted to by bots, which seems to be the m.o. of some of the larger spammers now.


The problem is, when you have a high-value payoff for breaking in, manual methods and people running manual scams become reasonable. I work for a classic car website, and the majority of our issues come from real people running real scams.

We have a lot of detection and followup in place, with a manual review process. The next generation will add the need for the user being logged in (though a click-through social network login will make it easier for most users)... At least in that case it's easier to tell when someone has a twitter account with no follows/followers who's never tweeted. Nothing is perfect though.

It really is an arms race, my point is the rules are very different depending on your market.


I created all sorts of honeypots (time based, hidden fields) and they quickly became worthless. As soon as the Russian bot networks target you, honeypots won't do anything to stop them.

Captcha has been the only thing that actually works. I will never use honeypots ever again. The headaches caused by those Friday night attacks aren't worth it.


impressed that works so well!

Another alternative that comes to mind (albeit much more complicated) is link encryption ala SpikeStrip: http://www.cs.ucsb.edu/~ravenben/publications/pdf/spikestrip...

This would mainly only work against scrapers, tho, and not so much for account creation.

My argument is that Google could have done better to offer a completely different solution (e.g. some sort of proxy service) than to add a (apparently fallible) whitelist to recaptcha.


I do not see the problem with hiring a clickfarm for $1 an hour to click on cat pics.

Take reputation, IP, and cookie: all three must be in order to pass. Say we want to spam 1000 forms today.

Scenario 1: The clickfarm itself fills in the CAPTCHA. Result: Their IPs will soon be blacklisted, and the reputation of a third-world account will be inherently low.

Scenario 2: We let the clickfarm send the answer to our own bot, which selects the right pictures. Result: Google will see a single IP and cookie trying out thousands of CAPTCHAs a day, and ban you.

Scenario 3: We let the clickfarm send the answer to our own bot, and this bot uses a list of proxies that haven't yet been banned. Result: Google will see a single account cookie trying out thousands of CAPTCHAs a day from different IPs, and ban you.

Can anyone come up with a scenario which involves reputation, IP and cookie that does not end up with Google detecting and banning your efforts? Cookie swapping?


Here's a scenario: a dissident living in a third world country with pervasive surveillance. He accesses the net using TOR, and disables cookies.

Now his IP is blacklisted, because there are lots of people using the same exit node; his reputation is low for the same reason, and the cookie is rejected. There's a good chance that this one person will be blocked, even though he didn't do anything wrong.

For a simpler case, private browser sessions over a VPN would suffer from the same issue.


I would argue that the problem of spam and hackers is a greater burden on society as a whole than someone in Iran not being able to get past a captcha.


I see where you are coming from, especially considering that spam makes up a significant share of all internet traffic. However, I'd think it wiser for one spammer to go free than for one person to be denied access to legitimate content.

I'm often being denied access to free content because I'm accessing from the "wrong" countries, and that's infuriating. If I start being locked out of free content due to my privacy measures, I'm probably going to start setting buildings on fire.


Do you think content creators should not be able to control access based on their own criteria? Are you somehow "owed" access rights to free content?


I'm not probably_wrong, so I can't speak for him/her.

But while I believe content creators should be able to control access, I think it's ridiculous to ban certain countries from access to certain content -- I don't quite see the point.

And also, being "owed" free content != being able to access free content you otherwise would be able to access were it not for a service like TOR or a VPN (e.g. escaping the Chinese firewall using a VPN service whose IP is banned from a website versus wanting to watch a movie but living in Germany instead of the U.S.)


Depends on what variables you're plugging into your moral calculus. Spam typically doesn't actually cause bodily harm, political repression, etc.


Spear phishing and hacking accounts do indeed cost some people their life savings and cause elderly people with semi-comfortable retirements to go into poverty. I think hackers do indeed cause society harm. It's not just some email in a folder you never check.


Yeah, seeing a few spam links on a website is much more burdensome than free speech!</sarcasm>


That's not what I was talking about at all. And tell me how being forced to solve a captcha prevents free speech.


The problem that the GP was talking about wasn't being forced to solve a captcha, but of getting blacklisted before even being given a chance to solve it.


Scenario 1 - we don't need to create a clickfarm for this; maybe we can clickjack random users online. Literally every porn site will be happy to make some money with it.

Scenario 2 - of course we won't just use 1 IP; that would make us look like the bad guy.

Scenario 3 - which account cookie are you talking about? The clickfarm has thousands of its own trustworthy cookies, but our bot doesn't send any cookies; it only solves challenges (neutral guy).

As soon as you have valid g-recaptcha response you don't need to persist any cookies - use it outright.


+1 also consider s/click farm/malware worm bot/g


Yeah malware bot will generate lots of free g-recaptcha-responses! Good idea


Scenario 4: you use botnets. 1 IP, 1 account = a few attempts, and you scale by renting a bigger botnet.


I have a hunch that it is Google's attempt to be on every form and know more about a Google user and their accounts on other websites. At least what websites they signed up for. I can stop Analytics but this is now out of my control. This is what a website owner required me to do to access their website.


Oh, it's already too late. Google already has enough data about you. I can imagine the future - people train bots like kids, make them visit different websites, google things and pretend to be humans. Your search history will be like your credit score.


I wouldn't be surprised if approval for a Visa depended on your search history.



It's a story about American airports; it is to be expected. You should probably give your nose hairs a good trim, or someone might think you're going to pluck the longest one out and strangle the country to death.


What is the No CAPTCHA problem? What's being described here are problems that apply to all CAPTCHAs. Whatever 'human' detection system you put in place, humans can always be hired to solve them. The point of No CAPTCHA is not to fix these problems, it's to make it easier for 90% of people who don't care too much about cookie privacy etc. (or most likely have no idea it's even a thing).


The problem itself is described in the end: it's about using clickjacking to get a valid token on behalf of "good guys". And this problem has nothing to do with existing systems.

Google could have made it so much easier and more secure: a POST request to google.com/verify_me would carry an Origin header to prevent CSRF (only wordpress.com scripts would be able to get a token). Also, there would be no need to make a click. No CAPTCHA looks fancy but the real No CAPTCHA should always have visibility:none!
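The Origin-header idea can be sketched as a server-side check along these lines. The `google.com/verify_me` endpoint is the commenter's hypothetical, and the allow-list and `origin_allowed` helper below are likewise assumptions for illustration. The check relies on browsers setting `Origin` themselves on cross-origin POSTs, so page scripts cannot forge it.

```python
from urllib.parse import urlsplit

ALLOWED_ORIGINS = {"https://wordpress.com"}  # hypothetical allow-list registered per site key


def origin_allowed(headers, allowed=ALLOWED_ORIGINS):
    """Accept a request only when its Origin header names a registered site."""
    origin = headers.get("Origin")
    if not origin:
        return False  # no Origin header: refuse rather than guess
    parts = urlsplit(origin)
    normalized = f"{parts.scheme}://{parts.netloc}".lower()
    return normalized in allowed
```

A request triggered from a clickjacking page on another domain would arrive with that page's origin and be refused under this scheme.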


"No CAPTCHA looks fancy but the real No CAPTCHA should always have visibility:none!"

I agree, but I suppose they want something that's a Placeholder, if the user needs to type a captcha


Why? If no need to type any captcha - do the verification in the background, don't show me anything until you think I'm a bot


Because of page layout. Having a fixed size element is better than having something (that is not yours) that might be there or not.


There's still no need for a click.


IMHO the need for a click is just for lazy loading, thus reducing server demand


Couldn't they just trigger that on form submission, then? "Please wait while we confirm you are human" is better than clicking and then waiting, and then submitting upon completion.


How many photos are in the universe of possible photos? How long would it take for outsourcing the process to tag all photos so a script could then do the matching?

Is the whole point of this to encourage hackers to get working on this AI challenge of identifying similar photos?

Either they need to hire a lot of people to sit around making these sets or they have an automated way of creating these sets which can be reversed. It would seem to be an arms race where Google is paying people, but attackers can have people break the sets at a cost less than creating them (it takes less time to match them up than to find good photos, clean them up, tag them, etc.).

An attacker would also just target the database where this is all stored. With the text recaptcha, it would seem that they have all of these photos and scanned books and you have 8+ character strings of [a-zA-Z0-9], random guessing would not be good enough, so the attacker needed to solve the OCR problem.

However, given the option to select x of 9 images, if you assume that the extremes are less likely of 1/9, 2/9, 8/9, 9/9- then I can hope to get lucky picking 4 or 5 each time, the order does not matter. If you distribute the attack to get around rate limits, etc. - perhaps just picking the first through fifth images gives you a sufficiently high success rate.
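As a rough check on the "pick the same few images every time" idea, the odds can be computed under one simplifying assumption: that for a given answer size k, every k-subset of the 9 images is equally likely to be the correct set. Real challenges may not be distributed this way, so treat the numbers as a sketch.

```python
from math import comb

GRID = 9  # images shown per challenge, per the comment above


def blind_guess_odds(k, grid=GRID):
    """Chance that one fixed pick of k images is exactly the correct set,
    assuming every k-subset is equally likely to be the answer."""
    return 1 / comb(grid, k)


# Print answer size, number of possible sets, and the blind-guess success rate.
for k in range(1, GRID + 1):
    print(k, comb(GRID, k), f"{blind_guess_odds(k):.4f}")
```

Under that assumption the middle sizes are the hardest to guess, not the easiest: there are 126 possible sets of 4 or 5 images, so a fixed pick of five succeeds less than 1% of the time even when the answer really has five images. The strategy only pays off if the image placement is far from uniform or retries are nearly free.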


I think a good chunk of the images are captured by way of Google's Streetview vehicles [1]. I'm seeing blurry images of house and apartment numbers all the time. So I'd imagine there are always new images popping up that Google can feed into the recaptcha system that haven't been seen before.

[1] http://www.google.com/recaptcha/intro/#creation-of-value


Correct, I am referencing the new nocaptcha system. Those images would get stale as opposed to those in the traditional scanned book, street signs, house numbers in the recaptcha.


http://xkcd.com/1425/

Probably sums it up best.


I see this from totally a user experience side. NoCAPTCHA isn't about defeating spam (you're right, the spammers are going to solve it/hire someone to finish the job etc) - it's about making a better experience for the humans, while slowing down the spammers a bit. (contrary to current recaptcha system that slows down spammers a bit or not at all, and makes life mostly more crappy for humans)


People love not to think... Google is a business, and the primary objective of any business is to make money (the vision/mission and the rest are for people who love a free lunch). Why CAPTCHA? To provide a service in trade for "free" human recognition capabilities.

Q: But Google is now better at recognizing those numbers... A: Right... that's why they now request the next "way too expensive to implement" "free" service from you: your recognition and association capabilities.


>>People love not to think

O_O

judging by the comment you wrote right after that, I would assume you are one of the people who likes not to think.

They are making people click checkboxes and deviating from the old model of recognition. Your comment makes no sense.


Yeah every once in a while... a little bit of heuristics, a little bit of laziness.


Interesting perspective on the changes! Our lead designer actually had similar concerns (can read them here: https://www.funcaptcha.co/2014/12/04/killing-the-captcha-wit...). You both look to be drawing the same conclusions. What are your thoughts on the metaphorical 'black box' being implemented into the new reCAPTCHA?


The picture recognition test is particularly annoying. Even in their example, of "match this" (cat), are we to assume we're matching all cats, or just cats of that color?

If they have to make very careful sets of photos to avoid confusion, then the sets of photos will be small enough to build lookup libraries for bots.


I'm reading this page: http://homakov.blogspot.com/2013/05/the-recaptcha-problem.ht...

Why don't they just invalidate the current challenge when a new one is requested? :S


There's no session ID for the current user. They can try to use the IP as an identifier. Admins can send remoteip to Google to prevent spoofing, but that parameter is optional and I suppose they don't rely on it.


... Okay, why not establish a session then?


That would require an extra round trip... The problem is that you get challenges on the client side and solve them on the server side. It's the website that would have to go get a challenge for you, put it in your session cookie, and make sure you don't go and get another one, which complicates things a lot.
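The session-bound approach discussed in this sub-thread could look something like the sketch below, where requesting a new challenge invalidates the previous one. The `ChallengeStore` class and its in-memory dictionary are hypothetical; nothing here reflects how reCAPTCHA is actually implemented.

```python
import secrets


class ChallengeStore:
    """In-memory map of session ID -> the one challenge currently valid for it."""

    def __init__(self):
        self._current = {}

    def issue(self, session_id):
        # Issuing a new challenge overwrites, and so invalidates, the old one.
        challenge_id = secrets.token_urlsafe(16)
        self._current[session_id] = challenge_id
        return challenge_id

    def redeem(self, session_id, challenge_id):
        # A challenge can be redeemed once, and only if it is the latest issued.
        expected = self._current.get(session_id)
        if expected is None or expected != challenge_id:
            return False
        del self._current[session_id]
        return True
```

The cost is exactly the one raised above: the website has to mediate every challenge request so that the session ID is known, which adds a round trip.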


Trigger warning: passing the CAPTCHA on homakov's demo page (https://homakov.github.io/nocaptcha.html) registers an account (blog?) at wordpress.org.


Another shameless plug - https://hashcash.io/ :)


In your demo I'd be more careful with user input

>$url = 'https://hashcash.io/api/checkwork/' . $_REQUEST['hashcashid'] . '?apikey=[YOUR-PRIVATE-KEY]';

hashcashid can change URL completely to something like ../../newpath?newparams#
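One way to guard against that kind of path manipulation, sketched here in Python rather than the demo's PHP, is to validate the ID against an expected alphabet and percent-encode it before building the URL. The allowed character set and the `build_check_url` helper are assumptions; the endpoint is the one quoted above.

```python
import re
from urllib.parse import quote

API_BASE = "https://hashcash.io/api/checkwork/"  # endpoint quoted in the comment above
ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,128}")  # assumed ID alphabet, not documented here


def build_check_url(hashcash_id, api_key):
    """Build the verification URL, refusing IDs that could alter the path."""
    if not ID_PATTERN.fullmatch(hashcash_id):
        raise ValueError("unexpected characters in hashcash id")
    # quote() with safe="" also escapes "/", "?" and "#" as a second line of defence.
    return f"{API_BASE}{quote(hashcash_id, safe='')}?apikey={quote(api_key, safe='')}"
```

With this check, an input such as `../../newpath?newparams#` is rejected outright instead of being spliced into the request path.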


It is always a battle between making it simple to understand and following best practices... In this particular case I chose simple to understand :)


The goal isn't to make things harder for bots, it's to make things easier for users.


they made it easier for users and for bots :)


It's no easier for bots. They still have to answer the old OCR challenge or a computer vision problem.


Using clickjacking we can get lots of valid tokens, no need to solve challenges.


You don't think Google will figure something out when a bunch of tokens from different IP addresses are all being used by one IP?


It can be helpful. There's an (optional!) remoteip parameter the server can use to send Google the IP address of the current user. As in the WordPress demo, sometimes we can send requests with the browser.
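For reference, a site's server-side verification request with the optional remoteip field might be assembled as below. The endpoint and the `secret`, `response`, and `remoteip` field names are given as best recalled from Google's reCAPTCHA documentation and should be checked against it; the `build_verify_body` helper is hypothetical.

```python
from urllib.parse import urlencode

VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"


def build_verify_body(secret, token, remote_ip=None):
    """Form-encode the fields for a siteverify POST; remoteip is sent only if known."""
    fields = {"secret": secret, "response": token}
    if remote_ip:
        fields["remoteip"] = remote_ip
    return urlencode(fields)
```

Because remoteip is optional, a site that omits it gives Google no way to compare the IP that earned the token with the IP that spends it, which is the gap the clickjacking approach depends on.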


And additionally it’s easy to just create empty Google accounts and then use them with the bots. Just create a few dozen accounts, use them with a few hundred bots, and you easily get full verification.


It is naive to think that attackers have only one IP at their disposal.


I think there are big issues on the horizon here. It's going to get increasingly difficult to find simple problems that humans can solve and bots can't. I'm not sure there is a fundamental answer.


Google New reCaptcha using PHP - Are you a Robot?

http://www.9lessons.info/2014/12/google-new-recaptcha-using-...


That's quite funny


No CAPTCHA reCAPTCHA is not all Google is claiming it is. It only works if you're logged in to the site, so what's the point?

See http://ur1.ca/iza9d


Random blog article destroys entire Google team of high paid professional engineers specifically employed to solve this problem and they did it just using incognito mode.

Upvote FTW.


Lol no, I simply found the original article too promising, with minimal technical details, so I decided to dig. And a weakness is a weakness, not a vulnerability. Something to think about.


Is this sarcasm? The purpose of No CAPTCHA was to alleviate pain for users; as this blog says, nothing else has changed. In fact, Google said it would fall back to a normal CAPTCHA when necessary. We already knew that.


Seriously guys? This made to the top of the front page? First of all, to all people saying "HUR DUR GOOGLE WANTS YOUR BROWSING DATA", well they already fucking have/had it for a looong time.

Secondly, If you tell me that one dude [author] ruled the one+ year work of the engineering team at google as a flaw and simplified it as [So what Google is trying to sell us as a comprehensive bot detecting algorithm is simply a whitelist based on your previous online behavior, CAPTCHAs you solved.] and that you believe it, I would question your intelligence.

This is supposed to be tech savvy community at least to some degree, what the fuck.

Now, in the google's blogpost it reads [Advanced Risk Analysis backend for reCAPTCHA that actively considers a user’s entire engagement with the CAPTCHA—before, during, and after—to determine whether that user is a human.]

[However, CAPTCHAs aren't going away just yet. In cases when the risk analysis engine can't confidently predict whether a user is a human or an abusive agent, it will prompt a CAPTCHA to elicit more cues, increasing the number of security checkpoints to confirm the user is valid.]

So my guess would be they analyze users behaviour on the page where captcha is located, things like mouse movements, time it takes to type out the words, spelling mistakes corrected and whatever else humans do differently than bots - and only then combine that with your historical cookies. Maybe it is much more complicated than that, I, as well as you, don't know the details.

Do you really think that they would go ahead and implement such a system without rigorous testing of effectiveness? I am sure that they tested it extensively with users, AND with bots, and decided that it is better than the current system, and ONLY then deployed it. Rant off.


>So my guess would be they analyze users behaviour on the page where captcha is located, things like mouse movements

If they can track mouse movements why in incognito mode I'm not a human for them anymore? I was expecting the same, but from what I see it's just a whitelist. And that's OK. The problem, which you probably didn't care to read, is that it's vulnerable to simple clickjacking, which opens another weakness: I can use your click on my page to get your reCAPTCHA token and feed it to my spam bot.

I'm actually happy with No CAPTCHA, because it's making progress. But it's not good enough (see the rest of comments, it could be a background AJAX request instead).


>>which you probably didn't care to read

I did read it. My point is, you, or I, or anyone for that matter does not know the inner details of how it works.

>>If they can track mouse movements why in incognito mode i'm not a human for them anymore?

Maybe having a clean cookie history is not good enough during the risk assessment.

Look, my entire point is, google is not a joke company. I am certain that they tested it for effectiveness before deploying.


> I did read it.

So what do you think about clickjacking issue? I made an assumption about their algo and maybe I'm wrong and they do track your mouse, but there's exploitable weakness. My post is 1) your algo seems simple 2) here's a bug in it.


The curious thing is, I could not replicate the clickjacking issue. Every time I make a click on the original WordPress registration page, I am verified as a human immediately.

If I do the click on your github page, I get a challenge. My clicks were never accepted as human on your github page. My clicks were always accepted as human on wordpress page.


No incognito tab? Maybe they fixed it


Yes, they fixed it, but I don't know how. There's likely a way to bypass it.


> one dude

Since you obviously don't know who Homakov is I can't take your post very seriously.

Homakov has exposed several serious security flaws at Facebook and Google before. I'm pretty sure Google is actively trying to headhunt him since he is one of the best in the web security field.


He's probably best known to HN for his GitHub exploit with Rails in 2012. I wrote a profile of him earlier this year (http://jobtipsforgeeks.com/2014/03/27/homakov/) which talks about his background a bit more.


> Do you really think that they would go ahead and implement such a system without rigorous testing of effectiveness? I am sure that they tested it extensively with users, AND with bots, and decided that it is better than the current system, and ONLY then deployed it.

I think the gap between the marketing material for nocaptcha (a simplified website, a youtube video with animations) and the seemingly lacking actual implementation is why this blog post was relevant for me.

Like other tech people around here, I was hyped up by the "smarts" of a system that uses cursor detection etc. to silently validate that I am a human. This blog post seems to indicate that the validation is a much simpler issue of previously passed tests and the amount of data that Google has associated with the user.


That's exactly why I wrote this post. I wish Google would prove me wrong and show us how they use cool tech to detect bots instead of user.isGoogleUser? and user.acceptedCaptchas > 5


>>>So what Google is trying to sell us as a comprehensive bot detecting algorithm is simply a whitelist based on your previous online behavior, CAPTCHAs you solved.

That is a bold statement, something presented as a fact, not a hypothesis.


Half of the post is about how the new technique is vulnerable to clickjacking.


Google's blog post says that 98-something percent of the old text could be deciphered by AI. My point is, regardless of the vulnerabilities of the new system, I am certain that it is more effective than the old alternative. They would have tested it.



