
Many a time I receive multiple challenges on a site despite having selected all images perfectly, and can't help but wonder, "Hey, are they getting me to do more work than necessary because they're running behind on their labelling backlog?". There's definitely a conflict of incentives in this case. If you're a website owner, you're better off choosing a different service which doesn't have adverse incentives, otherwise it can affect your site experience. And please don't put captcha on GET requests. Use a CDN if you're unable to handle bot load. And don't even get me started on CDNs that throw captcha.



I've found it isn't about "perfection". It is about selecting the same tiles as an "average" person would. I might stare hard at an image, think that one of the tiles contains a tiny fragment of a traffic light, and select it. That isn't what most other people have already done, so the captcha thinks I'm a bot and gives me tougher and tougher challenges. Ever since I stopped pixel-peeping and started quickly selecting the tiles that obviously had a bus in them, the percentage of time that I've gotten by first try has gone way up.


I wonder if this means self-driving vehicles' detection of important traffic features will be at the level of an irritated and disinterested web user who is trying to just do the minimum work to please an algorithm.


But it will be cheap! - The sound of 1000 business C-levels as your head gets removed by going under a truck.


They probably have some statistics in the background that tell them some form of trust level.

You also need to assume that there are potentially control pictures in it as well.

I think this is a very liable approach.


I think you meant "viable", not "liable". Given the discussion, your typo is ironically amusing, though.


Yeah, it's not like they will label something a train just because a single person says so. But if you have 10k responses with 95% confidence saying it's a train, it's very likely to be the case.


But GP is describing exactly the opposite: there might be a train that's not immediately visible at a glance, leading most people to not label it.


For unambiguous images almost all humans will label them the same way. For ambiguous ones humans will differ. Presumably they'll accumulate stats on each image and will be able to detect cases like this.
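As a rough sketch of how accumulated per-image stats could flag ambiguous images like that (the 0.8 agreement threshold here is a made-up illustration):

```python
from collections import Counter

def is_ambiguous(selections, threshold=0.8):
    """Flag an image as ambiguous when the most common tile
    selection covers too small a share of all responses.

    selections: list of sets of selected tile indices, one per user.
    """
    counts = Counter(frozenset(s) for s in selections)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(selections) < threshold
```

Unanimous images pass cleanly; a near 50/50 split gets flagged for review instead of being trusted.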


unless a properly obfuscated botnet has seeded the data set with "everything is a train" responses, to the tune of >>10k responses with 95% confidence saying it's a train<<


That's about as much attention as the average driver pays anyway. I drive a motorcycle, I KNOW that.


I was always tempted to knock on people's driver-side windows when I saw them looking at their phone. Never did - figured they'd probably startle, with a non-zero chance they'd accidentally fling the car into me.


I yell at them. Loudly! Loud enough that people a block away turn to look.

But then, when you're staring at your phone while driving your car out of a parking lot and across the sidewalk where you only miss hitting me (before driving into oncoming traffic!) because I stopped, well you deserve that minor inconvenience of being embarrassed.


That's an interesting idea - maybe it would be smart to have captchas do a "Point out all the motorcycles" or "Click all the pedestrians" setup.


I love a rant on this one...

Sometimes the bus/boat/truck has motorbikes, sometimes bicycles. Is that a petrol-powered bicycle, or a motorbike, to the [USAmerican?] person who wrote the rules!? Are all large yellow vehicles buses in the USA, or do you have minibuses? Oh wait, are minibuses buses?

I've worked out fire engines are trucks for captchas, not sure about Transit-type vehicles, lorries are trucks apparently but goods trucks on railways are not trucks!

Is a traffic light only the lens/LED array, or the black light-holder too? Do pedestrian lights count as traffic lights? Are those weird lights hanging in the middle of junctions 'traffic lights'?

Wish they'd just tell you what counts.

I have noticed that the times I realise after clicking that I missed a square, they tend to go through, whilst many times I get repeated captchas when I know I got it right. Success, as a user, seems impossible to predict.


> lorries are trucks apparently but goods trucks on railways are not trucks!

In the US, lorry == truck. Never heard the term goods truck before today, but I think it's what we call a boxcar.


A "truck" on a railway car is apparently a "bogie" in the UK.

A bogie in India seems to be a railway car in the US.

Then there are intermodal containers, smaller than boxcars, I think they are hauled by truck (lorry) after being unloaded from a flatcar.


> flatcar

AKA "flatbed" or "flatbed trailer" :-)


Pretty much a regular driver then!


Heck, better than a regular driver. At least what we're looking at is outside!


I'd rather that than a computer hemming and hawing over whether a single pixel is an oncoming truck and not turning just in case.


A computer "hemming and hawing", as in that one accident where it couldn't decide if it was a bicycle or a person, has nothing to do with the training. It's what the developers decided to do with input that had a low confidence score. There will ALWAYS be low-confidence ratings on real-world data regardless of how good your training is.

Instead of saying "oh crap there's SOMETHING there we should stop" they said "huh, no let's loop on testing it until we figure it out or run it over....whichever comes first."


Also if the car wants to stop too much because of low confidence, just turn the brakes off.


Even if you're right, I doubt it's going to be much worse than the level of the average irritated driver.


I kinda go the other way -- could a FFT heuristic mistake this feature for a crosswalk? Then I'll select it, whether or not it's actually crosswalk. Most of the time, this works. It's a stick in the eye of our prenatal robot overlord.


Replying to an_ko's sibling comment: just like the data behind YouTube music recommendations, populated by data carefully analysed from legions of bored toddler clicks vs. Spotify's obsessive teenager music curation.


Same experience. Once I observed that most (about 9/10) times there are only 3 tiles to select, I stopped looking for a 4th and selected only the 3 most obvious.


I've noticed similar. Often with stop lights, where a tiny sliver of one does not neatly fit in the frame, spilling over ever so slightly to the next square which has no stop light otherwise. There's a none too subtle irony in that one is being punished for accuracy when the context is ultimately public safety.


This. Captcha wants me to choose crosswalks and you can see there’s that sliver of a crosswalk in a few pixels off in another tile. You’re not wrong! But you’re not right. Regression to the mean.

A hexagon would be better as a frame instead of a square.


So take the image, shift it by a fraction of a tile, and rerun.


Google is _the worst_ for that. At least hCaptcha is a bit less culturally specific.

Every time Google blocks me for refusing to label a motorbike as a "bicycle" I get utterly pissed off. And likewise with the traffic lights against the Californian skies. Are the traffic lights the actual lights themselves, or the boom holding them up?

I'm not a human very often, according to Google. hCaptcha tends to let me in...


That same case happened to me! Another example is Parking Meters, they don't exist in my country and I'd never seen them before.


I recently failed a Google bicycle captcha.. "you can't fool me, that's a motorcycle not a bicycle!" I thought.. and then had to complete 2 more challenges. Including a crosswalk one where one of the images was just asphalt and painted lines (no context/edges, so it could be a parking lot, an airstrip, a highway, an intersection, or a crosswalk).


They also claim that Tigers are not "Cats".

"Please select all the Cats" then shows picture of a tiger among the common House Cats.


> Google is _the worst_ for that. At least hCaptcha is a bit less culturally specific.

One example I keep ranting about: I think countries outside the US have different terms for "crosswalks".

I personally know them as "zebra crossings", and it took a while for the reCaptcha request to click in my mind.


Pedestrian crossings here. I'd also not heard the term "crosswalk" until reCaptcha. Yet another bullshit Americanisation infecting the world's culture.


> I'm not a human very often, according to Google

If Google says you're a robot, it must be true! You should behave accordingly.

There's actually a comic strip in the newspaper going through this storyline right now. Brewster Rockit: Space Guy! was told by a CAPTCHA that he's a robot, so he's going through life that way. The other robots do not seem to be happy to have him as part of their culture.

http://www.brewsterrockit.com


I disagree; Google captchas in my experience are much quicker than hCaptcha.

When Cloudflare switched to hCaptcha I definitely noticed it.


Sorry, but Google captcha is specifically designed to annoy real people in some cases. They literally implemented slow fade-in / fade-out for images. This does absolutely nothing against actual bots, but is annoying as hell for a real person.

Literally any other captcha is better than this.


I thought the fade thing was specifically to trip up bots. Like bots know what the picture is long before it is shown to the user, so if the bot clicks on it then the CAPTCHA knows something is up.


Which is solved the first time it's encountered, e.g.

  if opacity != 100%: do_not_click_yet = true

So this is totally useless for preventing bots from solving it.
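A minimal Python sketch of that kind of bypass, hypothetical and library-agnostic: `get_opacity` and `click` stand in for whatever browser-automation hooks the bot actually uses.

```python
import time

def click_after_fade(get_opacity, click, poll=0.05, timeout=10.0):
    """Poll the tile's opacity and only click once the fade-in
    has finished. Returns True if clicked, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_opacity() >= 1.0:
            click()
            return True
        time.sleep(poll)
    return False
```

A real bot would wire `get_opacity` to the element's computed CSS `opacity`, but the waiting logic itself is this trivial.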


Surely then they look at timing, people will click anywhere from, let's say 50% opacity, but bots always wait?


Easy for a bot to fake that with a random number generator. If nothing else, bot authors can collect their own statistics. I understand the bots have an army of people in the background for images they don't understand yet; just collect timing data from that set and have your random number generator emulate it. (I'm guessing a bell curve.)
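A sketch of what emulating that timing might look like. The mean and standard deviation here are made-up placeholders; a real bot would fit them to collected human solve times.

```python
import random

HUMAN_MEAN_S = 2.8    # assumed average delay before a human clicks
HUMAN_STDDEV_S = 0.9  # assumed spread of the bell curve

def fake_human_delay():
    """Draw a click delay that looks like the human bell curve."""
    delay = random.gauss(HUMAN_MEAN_S, HUMAN_STDDEV_S)
    # Humans never click instantly or wait forever; clamp the tails.
    return min(max(delay, 0.5), 8.0)
```

The bot would then sleep for `fake_human_delay()` seconds before each click, making the timing statistics look human.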


Actually, having an army of people is exactly how complex reCaptchas are bypassed. It costs less than $1 per 1000 captchas:

https://anti-captcha.com/

There are of course options with image recognition, but they're less reliable.


Honestly, that service looks like something I'm almost tempted to pay for myself. $1 per 1000 recaptchas is a lot cheaper than how I value my time, at the very least. It's not like google couldn't pay people to do these ML training datasets; I resent giving them free labour.


Unfortunately, I doubt that a "recaptcha solver" can be built as a browser extension. Most advanced bots for parsing automation are built on proprietary platforms like ZennoPoster, and they basically heavily modify Firefox / Chromium.

Also, the latency of a human-recognition service is quite high, so while you wouldn't need to solve it yourself, you'd need to wait a number of seconds anyway.


I wonder if we could get Firefox to automatically solve them. That is, in the main version, so nobody has to see those stupid things again.


A few years ago, back when I created such bots, I had a similar idea.

For certain this can't be mainlined. And if we talk about extensions, then at least in the past, extension code didn't have enough capabilities to automatically bypass reCaptchas.

This would require fake mouse pointer control, and that's obviously not one of the features the extension API exposes.


Do you seriously think that the people who program the bots are incapable of taking it into account?

A bypass for this fading was obviously implemented the day after it first appeared on reCaptcha.


Isn't it more for rate limiting?


You don't need fade-in / fade-out effects for rate limiting. Bots obviously get to see images instantly once they're returned by the server, as they don't need to wait for the fade-in to complete: the bot's API is injected into browser internals instead.

If rate limiting is needed, there's always the Cloudflare way, where you literally show the user "wait" and refresh the page a bit later. That's annoying, but nowhere near as much as reCaptcha's fading is.


People tend to have very different experiences with Google captchas based on how normal they are. If you block everything, try to anonymize your browsing as much as possible, and otherwise do everything you can to look like a bot, you're going to get a very difficult captcha compared to somebody with all their browser settings on default.


Yup. This reminds me of the ‘introduction’ of an old hacker simulation game from 2004 that was quite prescient.

“ In the year 2012, the corporations of the world paved over the Internet, designing their own network system. Keeping the same name, they developed a system where every piece of information was audited and paid for before it was passed on to the world at large. Those who still followed the ideology of an open and uncontrolled Internet gathered what resources they could and formed the SwitchNet. Build mostly out of discarded technologies and backdoors in the current Internet, it allowed some manner of uncontrolled communication around the world. The "Hacker Outpost" is in need of new recruits to perform missions in information gathering against the corporations, which will allow them to increase the presence of the SwitchNet in the world.”

And the slightly different press release one: “ In 2012, a new Internet was introduced--one that prohibited users from posting anything on personal home pages, prohibited them from using software of their choice, and from having an e-mail address. Having no place to stay, hackers created the SwitchNet, an underground network operating on the old wires and infrastructure of the original Internet”


Yeah, I do my routine browsing in private mode, no 3rd-party cookies, no history, and an ad blocker. I get captchas everywhere.


Google's own captchas are Satan. Squiggly lines all over the place. Why they don't use the normally accepted captcha is beyond me.


> I'm not a human very often, according to Google.

"On the Internet, nobody knows you're a dog." (Well, except Google, it seems.)


[flagged]


Don't forget that Recaptcha magically works better in Chrome, and even bettery-better if you're logged into a Google account. In FF (with tracking protection) you can expect to see the enforced wait and "Please try again". Honestly, it's awful. Half the time I have to second-guess what the average American would think which of the (noise-added, corrupted) images matches the description.


The reCaptcha check boxes straight up fail for me now in FF on Linux, have done for a few months.


Your comment suggests you would exchange money for an assassination. I know you are joking (right?), but this is not something that you should joke about.


Oh no, I’d absolutely chip in. These people certainly deserve it.


Shouldn’t do this no matter what ofc. But why the devs? Sure the devs are well paid and privileged. That’s mostly relative to others in society. They are still more cogs than anything.


Easier target, a couple of google devs hanging on a public square would do much to disincentivize others from working on similar products in the future.

At least executives can dream of hiding behind private security, for mere developers earning 300k/yr the situation isn’t so rosy.


It wouldn’t really disincentivize that much. You aren’t grasping the status quo and power dynamics of all of this. Double that $300K and even I would strongly consider risking my life for that amount of money for some time. That money is peanuts in the scheme of things but is enormous to many, many people.

Seriously. If Google doubled that money and the only way for me to be safe would be to stay in some glorified prison while working, I’d probably do it for a few years.

All you’re doing is pointing out how bad the power dynamics are and attacking the weakest and least powerful parts of said dynamics. When there are so many simple ways to get around this sort of scheming. Mine is one example. When you’re talking about cogs. It is extremely easy to replace them. It is sad you want to attack the working class (in this situation it’s privileged workers making a lot but in terms of the system, they fit into this).


I've noticed that some sites deliberately do this or have lousy code that fails to properly acknowledge captcha completions.

Take archive.is/archive.fo/archive.today, for example. If you're using Cloudflare DNS (1.1.1.1) or iCloud Private Relay, and you visit https://archive.is/, you'll get what looks like a Cloudflare screening page. It's not, though: that page is part of archive.is and is served to Cloudflare DNS users (which includes iCloud Private Relay users)--the use of reCAPTCHA in place of hCaptcha is a giveaway. You can complete the captcha as many times as you like, but you'll never get in.

And how many times have we completed a captcha on a form only to have it throw another captcha in our face without so much as an error message? Sometimes it's just lousy code.


I remember reading the CF founder explaining how archive.is didn't want Cloudflare DNS users to resolve to archive, so they respect that.

https://news.ycombinator.com/item?id=28495204


There's also a mode where it thinks you are a bot/sucker and gives you unlimited images until you give up. That's always fun.


>where it thinks you are a bot/sucker and gives you unlimited images until you give up.

I frequently get these from Cloudflare when using Tor Browser. Google is basically unusable with Tor Browser.


If anyone wants to see that, try launching the browser via Selenium. I used to do that to partially automate some activities, such as downloading bank statements. I'd have my Selenium-using script open a browser and go to the bank, then wait for me to log in and get to the account page.

I'd login, dismiss any popup or interstitial promotions the bank decided to give me, get to the account page, and tell my script to continue.

My script would then use Selenium to click the download button, click the "custom date range" radio button on download popup, fill in the range fields to cover the last 60 days, pick OFX for the download format, and start the download, prompting me to let it know when the download is finished.

When the download finished, I could then go to one of my other accounts at that bank, tell the script I'm there, and that one gets downloaded, and so on.

My bank isn't giving CAPTCHAs so that would still work if I were to get around to updating my script to deal with some redesigns they did of their pages which broke finding the relevant elements on the page.

But I've found that if I do visit a site that uses hCaptcha while using the Selenium-launched browser, it seems to get stuck. Click to tell it I'm not a bot. Then get an image test. Answer that correctly and get another image test. Answer that correctly. Then it goes back to the "click if you are not a bot" thing, and repeats--two more image tests and back to the beginning.

Here's a program if anyone wants to try this and has the Selenium Webdriver package for Python3 installed. This will open a browser and take you to fanfiction.net. Trying to actually read any story will bring up the CAPTCHA.

  #!/usr/bin/env python3
  from selenium.webdriver import Chrome

  # Open a Selenium-driven Chrome, visit the site, and hold the
  # browser open until you're done poking around.
  driver = Chrome()
  driver.get("https://www.fanfiction.net")
  input("press enter when done")
  driver.close()
  driver.quit()
I'm not sure if the looping is a Cloudflare thing or a fanfiction.net thing, because the latter is the only site I use that has Cloudflare's CAPTCHA.

It used to be that if you added

  from selenium.webdriver import ChromeOptions
and changed opening the driver to

    options = ChromeOptions()
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    options.add_argument("--disable-blink-features=AutomationControlled")
    driver = Chrome(options=options)
you could get past the CAPTCHA, but that stopped working a while ago.

There's this project to provide a Selenium Chrome driver that is supposed to not trigger anti-bot detectors [1], but it still hit the CAPTCHA loop when I tried it.

[1] https://github.com/ultrafunkamsterdam/undetected-chromedrive...


fanfiction.net has also simply broken the Calibre FanFicFare integration thanks to their CloudFlare shenanigans.

The workaround is to simply visit all chapters separately and then point Calibre at the Google Chrome cache folder.

So nice going there, fanfiction.net. Instead of offering a 1-click .epub download like AO3 (which is completely CDN-able with a very long TTL), they now have to serve 50 individual requests. Great engineering work there.

(Obviously they do this to serve ads on every request)


I would really like to see a fan fiction site that combined AO3's ease of downloading with Fanfiction.net's organization.


AO3 is OSS and vastly understaffed. Having worked on some of their tickets, IMO they could use 20 contributors working part-time for a year or two to stabilize it until the idea of useful new features becomes viable.

I strongly encourage anyone with Rails experience to contribute [0]. There is a giant test suite which definitely helps with stability. The ticket time-to-resolve is simply quite slow due to the above-mentioned understaffing, so don't be discouraged!

[0] https://github.com/otwcode/otwarchive/


Yeah there’s lots of detect and anti detect stuff going back and forth. It’s pretty silly and frustrating for situations like yours. Doing things for yourself to speed up mundane life things.

There’s so many anti-detect libraries on GitHub these days. Wonder how many work well.


I believe this is done to get answers for unsolved captchas. For example, I have a million photos of streets filled with cars, buses, motorcycles, streetlights, and crosswalks I want to add to my captcha database. I don't want to categorize them all myself, and I want the answers to be what the average person will identify, not what I or a machine will identify.

So, I send everyone two captchas. One has a known answer and is required to be correct to access the service. The second captcha answer isn't yet known, so it doesn't matter what the user selects. However, when they get the known answer right, we log their answer for the unknown captcha. Once we get a large enough sample, we then have our top answers for the unknown captcha and can start using it for verification.
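The aggregation step described above might look something like this sketch; the sample-size and agreement thresholds are made-up illustrations.

```python
from collections import Counter

MIN_SAMPLES = 1000    # assumed sample size before trusting an answer
MIN_AGREEMENT = 0.95  # assumed fraction of users who must agree

def consensus_answer(logged_answers):
    """Given the tile selections logged from users who passed the
    known captcha, return the agreed tile set for the unknown one,
    or None if there's no consensus yet."""
    if len(logged_answers) < MIN_SAMPLES:
        return None
    counts = Counter(frozenset(a) for a in logged_answers)
    best, n = counts.most_common(1)[0]
    if n / len(logged_answers) >= MIN_AGREEMENT:
        return set(best)
    return None
```

Once `consensus_answer` returns a non-None set, the image can graduate from "unknown" to "known" and start being used for actual verification.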


I always assumed that's how it works, so I would do the first correctly and random clicks for the second. That was because I was uninterested, though I doubt it is still that simple.

I wonder, what are the minimum number of labels per image to ensure clean data?


I have found many times that if I select an incorrect tile and then unselect it before submitting, I am not presented with multiple challenges. My guess is a bot would not exhibit this behavior.

Try it out next time.


Not anymore.


Usually in those cases, even if you make mistakes they get accepted. The fewer annotations/votes those images have so far, the less severe the penalization for wrong markings. I have observed that sites that newly introduce such captchas basically accept it if I just click 1/3rd of the right answers. Don't click the wrong answers, as they are fully/partially introduced on purpose; it's just that you don't have to click all the right answers.


Whoever made this new captcha I'm starting to see everywhere:

https://imgur.com/a/hoyjctl

Thank you! It's so much easier than being a labeling bot for self-driving cars.


That looks pretty easy for machines. I wouldn’t be surprised if CLIP could solve that out of the box. (Then again, I guess the same applies to “select all the traffic lights”)


Yeah, you won't be loving this one where they make you do 10 in a row, and if you get one wrong you start again with 11 this time. Also it'll fail you at random.



