I’m not a human: Breaking the Google reCAPTCHA [pdf] (blackhat.com)
166 points by louis-paul on April 8, 2016 | 68 comments



http://arxiv.org/abs/1602.02697

I don't work on reCAPTCHA but I imagine they could easily + significantly beef up image captcha by adding adversarial image examples that trip up would-be automators.

Note that adversarial examples generalize across models. Meaning a good adversarial example will work against SVMs, convnets, etc just the same.

If you're interested in the basics of how it works, Julia Evans has a great post: jvns.ca/blog/2015/12/24/how-to-trick-a-neural-network-into-thinking-a-panda-is-a-vulture

The key is that deep neural nets are actually very (piecewise) linear and that makes them susceptible to adversarial examples.
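
To make the linearity point concrete, here is a minimal sketch, assuming a toy logistic-regression classifier rather than a real deep net. The weights, input, and epsilon are made up for illustration; the perturbation is the fast-gradient-sign idea applied to a linear model, where a tiny nudge to each of many features adds up to a large change in the score.

```python
import math


def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def predict(weights, bias, x):
    """Logistic-regression probability that x belongs to class 1."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def fgsm_perturb(weights, x, epsilon, true_label):
    """One fast-gradient-sign step against a linear model.

    For the logistic loss, the gradient with respect to x is (p - y) * w,
    so its sign is -sign(w) when y == 1 and sign(w) when y == 0.
    """
    direction = -1 if true_label == 1 else 1
    return [xi + direction * epsilon * sign(w) for w, xi in zip(weights, x)]


# Hypothetical 100-feature model and input (illustrative values only).
weights = [0.5 if i % 2 == 0 else -0.5 for i in range(100)]
bias = 0.0
x = [0.04 * sign(w) for w in weights]

x_adv = fgsm_perturb(weights, x, epsilon=0.1, true_label=1)
p_clean = predict(weights, bias, x)
p_adv = predict(weights, bias, x_adv)

print(f"clean input:     P(class 1) = {p_clean:.3f}")
print(f"perturbed input: P(class 1) = {p_adv:.3f}")
```

No single feature moves by more than epsilon, yet the prediction flips, because the per-feature changes all push the linear score in the same direction.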


Or use those images: http://imgur.com/a/K4RWn :)


Using Google's own reverse image search to break an image-based captcha. Simple, but brutal.


Thought the same thing, that's ironic /and/ smart


Perhaps in response Google could identify an image it just served from reCAPTCHA being Google searched, and throw the Google image search into a captcha jail? ... if they have a large enough pool of images to be able to make that correlation, and if Google search and reCAPTCHA are capable of sharing that level of intimate detail ... probably a lot of problems with that, though.


"We ran our captcha-breaking system against 2,235 captchas, and obtained a 70.78% accuracy"

That's more impressive than it sounds. I'm pretty sure 70.78% is more accurate than I am with reCAPTCHA manually. A lot of the captchas presented are very fuzzy, or have ambiguous questions, etc.


>I'm pretty sure 70.78% is more accurate than I am with reCAPTCHA manually.

Exactly. Many reCAPTCHAs are beyond simple recognition and make me start guessing. I expect we'll see a new type of reCAPTCHA: you're a human if you make a mistake and a robot if the correct answer is typed in :) Similar to those 1x1 images that are not visible to humans, yet visible to robots.


Indeed. It's good that these guys are white-hats because they could have made a killing selling to spammers (for as long as they could go undetected, which could've been a while).


Going rate for humans breaking captchas is like $1/1000 captchas solved. A fully automated service could make some money, sure, but not exactly a killing.


From paper, "Assuming a selling price of $2 per 1,000 solved captchas, our token harvesting attack could accrue $104 - $110 daily, per host (i.e., IP address). By leveraging proxy services and running multiple attacks in parallel, this amount could be significantly higher for a single machine."
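
As a quick sanity check on the quoted figures, here is a small sketch of the arithmetic. Only the $2 per 1,000 price and the $104 - $110 daily range come from the quote; the function name and the per-second breakdown are my own additions.

```python
PRICE_PER_1000_USD = 2.0  # selling price assumed in the quoted passage


def tokens_per_day(daily_revenue_usd, price_per_1000_usd=PRICE_PER_1000_USD):
    """Number of solved captchas (tokens) needed to earn the given daily revenue."""
    return daily_revenue_usd / price_per_1000_usd * 1000


low = tokens_per_day(104)
high = tokens_per_day(110)
seconds_per_day = 24 * 60 * 60

print(f"tokens per day per host: {low:.0f} to {high:.0f}")
print(f"tokens per second per host: {low / seconds_per_day:.2f} to {high / seconds_per_day:.2f}")
```

So the quoted revenue works out to roughly 52,000 to 55,000 tokens per day from a single IP address, a bit more than one token every two seconds.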


Yeah; that's the problem...you can make, with tons of effort, a fairly decent system to tell robots and humans apart, but it's much much more difficult to tell humans trying to do the thing directly from those solving the challenge remotely. It's an arms race of economics; the challenge has to be difficult enough that it slows down humans in sweatshops to the point where it makes the whole enterprise not worthwhile for the abusers while not pissing off your actual users. It also has to resist automated malicious use. Quite a tall order.

The best I've seen are the "which of these photos show mountains"-type challenges. I'd imagine that solving 5 rounds of those would take too long to make it worthwhile for spammers, but I'd also imagine lots of legitimate users getting irked at going through that to fill out a form.


"This work was supported by the NSF under grant CNS-13-18415"

I think that means you might be able to get source code via FOIA or similar :)


Am I the only one who thinks Google reCAPTCHA is just a tool that Google uses to train machine learning algorithms? First it was used to help Google learn how to read, now it's learning to detect objects, landscapes, ...


I'm pretty sure this is the advertised purpose of it. It stops bots AND helps machine learning.


*AND helps a company make profit by training their proprietary models.

If they were open, it would be a lot better (considering all the training is done by volunteers, too)


uhh. Everything they do is arguably for profit. That's why Google is an LLC. What do you expect them to do, put an "* this is done to earn us money" disclaimer on everything?


This is a little pedantic and doesn't change your point, but Google is not an LLC. It's a C Corporation.


I like to misidentify things to give them bad data.


Well, I do the same. I won’t destroy any of their dataset, or even have a measurable impact – but in turn, I won’t have a measurable impact in positive direction either.

If they want to make a profit from me, they should pay me, or they can license correct CAPTCHA results from me under GPL.


This argument doesn't really make sense to me. What browser, OS and device are you using to post this comment? Someone made profit when you bought the device, bought the OS and/or browser.

A healthy system needs some kind of motivation. In economies, that is profits/money. What's wrong with that? (I know I am being simplistic here but...)


> A healthy system needs some kind of motivation. In economies, that is profits/money. What's wrong with that? (I know I am being simplistic here but...)

Simple.

I buy a thing, I get something – in that moment the contract is over.

VW doesn’t come to me every 3 days with "You bought a car, to continue using it, take this and drive to Hanover and deliver it there".

When I bought my computer, or its parts, I bought them, I put them together, and that’s it. The manufacturer has never asked me to do work for them, or pay for them again.

I use Arch Linux and Firefox – projects done by volunteers, and they profit by having a better product for themselves and others.

Google profits from ads, and from selling data. That’s a tradeoff, and a reason for me to try to use as few Google products as possible.

But when government organizations use ReCaptcha, and I have to work without pay for Google, then I have no choice, and it is not something I agree with.


The devices that I buy generally don't ask me to do work for them. If they do, for example by spying on my behavior, then I'm not so happy. On the other hand, I may accept some spying/doing work for companies if I get something in return - which happens for some Google products, like search.

When doing a captcha, I don't really get anything. It's something I have to do because the website I'm using has the problem that they can't find a better way to identify bots. So I do a captcha, fine. But if there are benefits for whatever company offers them, i.e. I'm doing work for them without getting anything in return, or without everybody getting anything in return (the GPL option), then I'm again not so happy.


Napster, torrents, and the free internet warped an entire generation's perspective on things. This sort of "everything should be free" attitude isn't going away any time soon.


The advertised purpose was to digitize the world's books and help libraries, digital humanities and the world.

This ended, and all trace of public good was erased (check archive.org if you don't believe me) when Google execs mandated that only things that made money can be supported.


You do realize that it was the US court system that put the library-of-the-world version of Google Books on ice, correct? Larry & Sergey wanted a hippy "share all the books to everyone" view of this. The court said no, do what the book publishers say.


While you're mentioning the Internet Archive, you might also recall that our advertised purpose is to digitize all the world's knowledge :-)

I just had a chat with our book guy today about OCR corrections using captchas, one of our volunteers suggested it. We expect to scan 500,000-700,000 books this year.


The little captcha box does some processing before deciding what to show you. If you look suspicious, it can show you a more complex challenge. If you look like a normal browser, it might show you a house number to read (to improve its Maps product maybe). If you already have a cookie set because you already proved you're human, maybe it's just that check-box that says "I'm a human".
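
The flow described above could be sketched roughly like this. Everything here is hypothetical: the field names, the ordering of the checks, and the challenge labels are mine, chosen only to mirror the three cases in the comment. It is not Google's actual logic.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical summary of what the widget knows about a visitor."""
    has_trusted_cookie: bool  # visitor already proved they are human
    looks_suspicious: bool    # e.g. odd IP reputation or browser fingerprint


def choose_challenge(req: Request) -> str:
    """Pick a challenge type following the three cases in the comment."""
    if req.looks_suspicious:
        return "complex_challenge"
    if req.has_trusted_cookie:
        return "checkbox_only"
    return "easy_challenge"  # e.g. a house number to read


print(choose_challenge(Request(has_trusted_cookie=True, looks_suspicious=False)))
print(choose_challenge(Request(has_trusted_cookie=False, looks_suspicious=True)))
```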


Unless you're able to explicitly state how this is done and/or how to trick it into thinking you're "high risk", this seems like speculation; yes, I'm aware Google has said this, but I've never seen any proof of it. I've found bugs in the system in the past that were easy to fix and told Google, but the bugs were never fixed.


Try doing a Google search from Tor. The captchas that Tor users get are borderline impossible because of the suspicious IP and/or browser.


Not impossible, very often I get the "Body of water" one, but I do seem to get a challenge every 10 minutes or so, in each new tab.


I've experienced this first-hand. I used to always get the checkbox, but then one day I had to download a large series of related files from a website that used reCAPTCHA. I did all the clicking manually, but in a very repetitive and bot-like fashion, by opening a series of 10 new tabs at a time and then performing the same series of clicks on each tab to get to the download link. After a few minutes, I stopped getting checkboxes and started getting increasingly more difficult CAPTCHA challenges.


It's not speculation. It's section 2 of the paper, titled "Analyzing Risk Analysis System."


I can't find the original source, but the system for deciding whether to just show a check box or not was super complicated. Back around a year and a half ago when it was first released, some people on 4chan's technology board (/g/) were frustrated because they'd repeatedly be marked as suspicious (and never got the check box only) when posting on 4chan (which requires a captcha to be solved for each post). One user in particular was reverse engineering it and published a ton of super interesting stuff on GitHub (the levels of obfuscation were insane), but later got a job offer from Google (allegedly) in exchange for deleting it.


No, that's definitely what it is. It just also happens to be useful for blocking bots.


I agree with you


> Over the period of the following 6 months, text captchas appeared to be gradually “phased out”, with the image captcha now being the default type returned, as these captchas are harder for humans to solve despite being solvable by bots [3, 4].

This is so true. When I first got an image captcha, it took me 3 minutes to learn how to solve it. I had JavaScript disabled so I did not realize I had to do 3 pages to complete the challenge (after two, I reloaded the page, since I thought I had messed up, and got a new route thanks to Tor). Also, I'm not sure what to click when there is a partial match; is that still a match? And don't even get me started on solving signs (in my first challenge I had to tell apart different kinds of signs that were in the image). Is a no trespassing sign still a traffic sign? What if it is a sign but I don't know which kind of sign (for lack of domain knowledge, especially when they're foreign signs)? Another problem is that they require you to understand the language the challenge is in. You can still solve a text captcha if you don't speak any English.

I wish there was a way to force text captchas


Quite interesting how the authors even mention that this strategy is very economically viable.

From the paper: 'Assuming a selling price of $2 per 1,000 solved captchas, our token harvesting attack could accrue $104 - $110 daily, per host (i.e., IP address). By leveraging proxy services and running multiple attacks in parallel, this amount could be significantly higher for a single machine.'


Makes me wonder if they got paid for reporting it to Google.


You could also just pay a service that uses human workers in third world countries. It's a little over a tenth of a cent per captcha.


The author did compare their performance with captcha-solving services. His accuracy is comparable to the service with no extra cost to the attacker.

From the paper: "We compare our performance to that of Decaptcher, the (self-reported) oldest captcha-solving service. We selected Decaptcher for two reasons. First, it supports the image reCaptcha, charging $2 per 1000 solved captchas. [...] Interestingly, some of our summitted challenges rejected due to the service being overloaded, and had to be resubmitted at a later time, and received a time-out error as the solvers did not provide an answer in the time window allocated by the service. 258 challenges (36.85%) were an exact match. When taking into account the flexibility, 321 (44.3%) of the captchas were solved. The average solving time for the challenges that received a solution was 22.5 seconds. While the accuracy may increase over time as the human solvers become more accustomed to the image reCaptcha, it is evident that our system is a cost-effective alternative. Nonetheless, our completely offline captcha-breaking system is comparable to a professional solving service in both accuracy and attack duration, with the added benefit of not incurring any cost on the attacker."


Not so economical if you want to brute force a login page.


Or you could set up a pr0n site that shows the material only after the user has completed the captcha. This trick has been done before.


I've never seen a documented case of such tricks actually being used, and I've seen calculations that suggest the cost/benefit outcome is no better than just paying the poor to do the menial work.

The last such analysis I paid attention to was some years ago so the situation may have changed, but I suspect the hassles of running a porn site and CAPTCHA proxy still aren't worth it:

* obtaining content sufficient to attract interest

* paying for bandwidth & other resources

* writing the authentication system

* then maintaining it (every time the CAPTCHA service(s) change their process you potentially need to make and test changes to your code)

* and you need to work around rate limits (depending on the CAPTCHA design it may not be possible to make the relevant requests client-side, so if the service has rate limits you'll have to route through something that sufficiently randomises your source address)

* providing support

* dealing with bad press


The closest thing to a documented case I've seen is a report of people gaming the Time Person of the Year poll so that the top person was moot, and the first letters of the candidates spelt out "Marblecake. Also the game.":

By understanding how reCAPTCHA worked – the team was able to double their productivity (since they usually only had to enter one word instead of two). To further optimize their voting they created a poll front-end that allowed you to enter votes quickly while giving you an update of the poll status (and since it is a 4chan kind of crowd, they also provided the option to stream some porn just to keep you company while you are subverting one of the largest media companies in the world.

https://musicmachinery.com/2009/04/27/moot-wins-time-inc-los...

However, this is slightly different as people were deliberately solving CAPTCHAs (and watching porn) rather than wanting to watch porn and also incidentally solving CAPTCHAs that they had no direct interest in.


And also, if you got to the point where that porn site was then active and usable enough for the captcha cracking service... it would probably be more profitable just to monetise the porn.


I guess we're all waiting for a browser extension to fill in captchas for us; they are sometimes hard to get right, so it would be a great usability improvement.


I have a bad feeling CloudFlare is going to break the internet for Tor users with impossible-to-solve CAPTCHAs again after this...


Not that it can become much worse than it already is, though. The current reCAPTCHAS are already "fuck this shit"-inducing on Tor.


It very much can. Only a few months ago they were simply impossible. I tried to solve 30 of them while accessing a site and was still unable to. The current set are merely extremely annoying and time-consuming, but they can consistently be solved (at least by English speakers).


They can be consistently solved, the only problem is the endless repetition and "multiple answers needed" nonsense, which only seems to apply to Tor users.


I may be being overly-cautious, but does anyone have a summary of this that isn't a PDF from blackhat.com?

edit: Thanks for both responses!


TL;DR from https://www.reddit.com/r/netsec/comments/4dvifg/im_not_a_hum... (though you may be overly-cautious as blackhat.com seems to be the webpage of a computer security conference [1]):

"""

Live attack To obtain an exact measurement of our attack’s accuracy, we run our automated captcha-breaker against reCaptcha. We employ the Clarifai service as it shows the best result amount other services.

Labelled dataset. We created a labelled dataset to exploit the image repetition. We manually labelled 3,000 images collected from challenges, and assigned each image a tag describing the content. We selected the appropriate tags from our hint list. We used pHash for the comparison, as it is very efficient, and allows our system to compare all the images from a challenge to our dataset in 3.3 seconds. We ran our captcha-breaking system against 2,235 captchas, and obtained a 70.78% accuracy. The higher accuracy compared to the simulated experiments is, at least partially, attributed to the image repetition; the history module located 1,515 sample images and 385 candidate images in our labelled dataset.

Average run time. Our attack is very efficient, with an average duration of 19.2 seconds per challenge. The most time consuming phase is running GRIS, consuming phase, as it searches for all the images in Google and processes the results, including the extraction of links that point to higher resolution versions of the images.

"""

[1] https://en.wikipedia.org/wiki/Black_Hat_Briefings
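
For anyone curious what the hash-based lookup against a labelled dataset might look like, here is a minimal sketch. Note the swap: pHash proper is DCT-based, while this uses a simpler average hash so it fits in a few lines of plain Python. The tiny 4x4 "images", the tags, and the distance threshold are all made up for illustration.

```python
def average_hash(pixels):
    """Hash a 2D list of grayscale values: one bit per pixel, set if above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def lookup_tag(pixels, labelled, max_distance=2):
    """Return the tag of the closest labelled hash, or None if nothing is close enough."""
    if not labelled:
        return None
    h = average_hash(pixels)
    best_hash, best_tag = min(labelled, key=lambda item: hamming(h, item[0]))
    return best_tag if hamming(h, best_hash) <= max_distance else None


# Hypothetical labelled dataset of two previously seen images.
img_a = [[10, 10, 200, 200]] * 4
img_b = [[200] * 4, [200] * 4, [10] * 4, [10] * 4]
labelled = [(average_hash(img_a), "mountain"), (average_hash(img_b), "pasta")]

# A slightly re-encoded copy of img_a, and an unrelated checkerboard image.
noisy_a = [[12, 9, 198, 205]] * 4
unseen = [[200, 10, 200, 10], [10, 200, 10, 200]] * 2

print(lookup_tag(noisy_a, labelled))
print(lookup_tag(unseen, labelled))
```

The repeated image still matches its stored tag despite small pixel changes, while the unseen image falls outside the threshold and would have to go to the slower image-annotation step instead.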


Here is the opening paragraph from the pdf:

Since their inception, captchas have been widely used for preventing fraudsters from performing illicit actions. Nevertheless, economic incentives have resulted in an arms race, where fraudsters develop automated solvers and, in turn, captcha services tweak their design to break the solvers. Recent work, however, presented a generic attack that can be applied to any text-based captcha scheme. Fittingly, Google recently unveiled the latest version of reCaptcha. The goal of their new system is twofold; to minimize the effort for legitimate users, while requiring tasks that are more challenging to computers than text recognition. ReCaptcha is driven by an “advanced risk analysis system” that evaluates requests and selects the difficulty of the captcha that will be returned. Users may be required to click in a checkbox, or solve a challenge by identifying images with similar content. In this paper, we conduct a comprehensive study of reCaptcha, and explore how the risk analysis process is influenced by each aspect of the request. Through extensive experimentation, we identify flaws that allow adversaries to effortlessly influence the risk analysis, bypass restrictions, and deploy large-scale attacks. Subsequently, we design a novel low-cost attack that leverages deep learning technologies for the semantic annotation of images. Our system is extremely effective, automatically solving 70.78% of the image reCaptcha challenges, while requiring only 19 seconds per challenge. We also apply our attack to the Facebook image captcha and achieve an accuracy of 83.5%. Based on our experimental findings, we propose a series of safeguards and modifications for impacting the scalability and accuracy of our attacks. Overall, while our study focuses on reCaptcha, our findings have wide implications; as the semantic information conveyed via images is increasingly within the realm of automated reasoning, the future of captchas relies on the exploration of novel directions.


I must admit that I have not been able to pass any of the text reCAPTCHA challenges lately when I have wanted to access the more sensitive content. Either I am getting them wrong or my submits get ignored.


One of the more interesting references linked here was https://github.com/neuroradiology/InsideReCaptcha — apparently the reCAPTCHA JavaScript snippet uses an encrypted bytecode format running on a custom-built VM, which one intrepid person has decompiled.


Anyone else annoyed with the grammar errors on the paper?


"We automately searches Google for certain terms, followes links from the results, watches video on Youtube, searches on Google Maps, visites popular websites that contains Google plus plugins and widgets."

It's like reading a security paper authored by Gollum.


I worked on a little piece on recaptcha a while ago[0].

I'm glad to see research around how this captcha method is actually as breakable as any, but the real problem continues to be that there is no separation between the data Google collects for advertising, and the data Google collects from the captcha service.

Their privacy policy says they can link the captcha to your other online identity and use it to target you. It's especially cringeworthy to see lots of sites of dubious legality implement reCAPTCHA (like thepiratebay).

[0]: http://www.businessinsider.com/google-no-captcha-adtruth-pri...?


I don't get why, when and how Google uses reCAPTCHA within their own tools. E.g. within the Webmaster Tools, I can submit up to 500 URLs for manual fetch/render and subsequent index submission. So the rate limit is already there and reasonable.

However, after 4 submitted URLs, I get a reCAPTCHA. From then on, for every URL, I have to complete it with additional visual quizzes.


Another way they get triggered is when people use browser/desktop based rank checkers.

There are also plugins some SEOs use to pull lots of requests. These tools are quite old now and not very useful, but people still use them.


Google recently killed off its page rank toolbar - which most people in SEO used to measure their success with the sites they worked on.

Google has confirmed it is removing Toolbar PageRank: http://searchengineland.com/google-has-confirmed-they-are-re...


Thank you. I indeed had a Chrome extension doing this. Removed it and will check!


> I don't get why, when and how Google uses reCAPTCHA within their own tools.

This may or may not help with your quest.

Google triggers reCAPTCHA when I use their search through one of my DigitalOcean servers as a VPN, in incognito mode; thus my IP address belongs to a datacentre, there isn't a cookie header and the user-agent is Linux.


I am on OVH, and never get a reCAPTCHA.

(I used to use my OVH VPS for some time as VPN, so it has gained quite a reputation for a "normal" browsing history)


Thank you - I am using it absolutely genuinely. The account used is in fact linked to an AdWords account which has spending in the five digits...


I had the same thing the other day. I figured maybe it was because I was in incognito mode (because logging into half of Google's properties with uBlock and Privacy Badger enabled doesn't work). sigh


I have used services like 2captcha.com, to get solving costs down to $0.5-$1 per 1000 solved captchas.

Google's reCAPTCHA is hardly an effective solution... Also, if Google wanted to, they could just automatically verify people without you clicking that checkbox. Because at the end of the day they already know if they are going to auto-verify you, or make you pass a test.


Someone should really make a web browser plugin that automatically solves CloudFlare, Google and Facebook captchas for people using ::gasp:: VPNs. I would happily pay $1 for every 1000 captchas served at me to be solved for me, even if it took the page a bit longer to load for me still. As long as the latency wasn't too much higher than it'd take me to solve it myself.

Just like with DRM, captchas only seem to punish legitimate users and perhaps very small-time bad actors.



