Google’s new reCAPTCHA has a dark side

dessant · on June 27, 2019

Google has been doing the same with reCAPTCHA v2 [1]. They are aware of the legal risk of outright blocking users from accessing services, so reCAPTCHA v3 contains no user facing UI, Google merely makes a suggestion in the form of a user score, so the responsibility to delay or block access and the legal liability that comes with it falls on websites.

reCAPTCHA v2 is superseded by v3 because it presents a broader opportunity for Google to collect data, and do so with reduced legal risk.

Since reCAPTCHA v3 scripts must be loaded on every page of a site, you must send Google your browsing history and detailed data about how you interact with sites in order to access basic services on the internet, such as paying your bills, or accessing healthcare services.

It's needless to say that the kind of data that is collected by reCAPTCHA v3 is extremely sensitive. Those requests contain data about your motor skills, health issues, and your interests and desires based on how you interact with content. Everything about you that can be inferred or extracted from a website visit is collected and sent to Google.

If you'll refuse to transmit personal data to Google, websites will hinder or block your access.

[1] https://github.com/w3c/apa/issues/25

benreesman · on June 27, 2019

Your comment adds a lot to the conversation, so I don’t want to be more contrary than necessary.

It’s nonetheless a shame that it’s so universally misunderstood how ad-supported megacorps make their money that even highly sophisticated users of the web still talk about the value of personal data (source: I ran Facebook’s ads backend for years).

Much like the highest information-gain feature for the future price of a security is it’s most recent price: ad historical CTR and user historical CTR (called “clickiness” in the business) are basically the whole show when predicting user cross ad CTR. The big shops like to ham up their data advantage with one hand (to advertisers) while washing the other hand of it (to regulators).

As with so many things Hanlon’s Razor cuts deeply here: if your browsing history can juice CTR prediction then I’ve never seen it. I have seen careers premised on that idea, but I’ve never seen it work.

__jal · on June 27, 2019

> It’s nonetheless a shame that it’s so universally misunderstood how ad-supported megacorps make their money that even highly sophisticated users of the web still talk about the value of personal data (source: I ran Facebook’s ads backend for years).

That may be the case for some people, but that is not my complaint, nor that of many folks I know.

I simply don't care how FB, Google and other surveillance outfits make money. I don't care about marketers' careers or their CTRs. I don't even care about putting a dollar value on my LTV to them.

I care about denying them visibility into my datastream. It is zero-sum. They have no right to it, and I have every right to try to limit their visibility.

Why? None of your business. Seriously - nobody is owed an explanation for not wanting robots watching.

But I will answer anyway. It is because of future risks. These professional panty sniffers already have the raw material for many thousands of lawsuits, divorces and less legal outcomes in their databases. Who knows what particular bits of information will leak in 10 years, or when FB goes bankrupt? I have no desire to be part of what I suspect will become a massive clusterfuck within our lifetimes.

If you're correct that this data has so little value, then it is more likely it will leak. FB and Google are the equivalent of Superfund sites waiting to happen, and storing that data should be considered criminal.

ajxs · on June 27, 2019

If I could upvote this comment twice, I would. This succinctly summarises my views on the subject. We shouldn't have to justify _why_ we don't want our private information harvested by these companies. I would still feel remarkably uneasy even _if_ Facebook and Google were demonstrably benevolent citizens of the online world, but we've seen time and time again how invasive and malicious they can be. The fact that both of these companies have political ambition makes the entire situation much scarier. Count me out.

stickfigure · on June 27, 2019

They have no right to it, and I have every right to try to limit their visibility.

That's entirely fair! But also: You have no right to use my website, and I have every right to limit your access.

Recaptcha is simply part of this negotiation.

ajxs · on June 27, 2019

Is that so? What about the webmaster who simply wants to combat bots using his page, is the extent of data gathering on Google's behalf just part of the deal? What if selling user data is against the webmaster's ethics? "Don't use it I guess" Sure, except that no one in the exchange was told the extent to which this data is used, or what for. Users of Google's Captcha aren't told about this exchange. I disagree entirely that it's a matter of voluntarily opting in and out of Google's domain. Their business model depends on becoming inescapable, and they're not being honest about how their services collect our data.

tg180 · on June 28, 2019

> Their business model depends on becoming inescapable

What will happen with v3 if I block gstatic.com? Will I be given the highest threat score?

Izkata · on June 28, 2019

Wait, now you have me wondering: If this is just javascript from another domain, what's preventing bots from proxying requests, intercepting this one, and replacing it with a dummy function that returns a "no threat" score?

tg180 · on June 28, 2019

https://developers.google.com/recaptcha/docs/verify

Izkata · on June 28, 2019

So... the answer is, nothing prevents it?

tg180 · on June 28, 2019

When the API reports a failed verification, the webmaster knows that the response has been tampered with?

__jal · on June 27, 2019

> You have no right to use my website

Of course.

> Recaptcha is simply part of this negotiation.

It is only a negotiation if I know it is there.

jjeaff · on June 27, 2019

I'm sure it will be mentioned in the 40 page privacy and cookie policy that pops up on every website asking you to agree before continuing.

nord73 · on June 28, 2019

Which is not compliant with GDPR.

_130o · on June 28, 2019

And Im sure it wont be explicitly stated, but simply rolled up under some paragraph as a blanket statement, something to the effect of:

"We track all of your activities and provide third parties the ability to do so as well - to provide a better user experience - and we may or may not sell or distribute the collected data at our own discretion and continued use of this site grants us permission in perpetuity. Further, should you decide to sue us, you agree to binding arbitration at a venue chosen by us, conducted by an arbiter of our choosing, in which case you promise to lose regardless of outcome. If you disagree, please leave the site now but just know, by being here and reading this, you have already granted us this power and we've mostly already collected what we needed from you. Thank you. Stop wasting our bandwidth now. Fuck off!"

KirinDave · on June 28, 2019

> It is only a negotiation if I know it is there.

You're not going to get through a site with a properly implemented captcha just by blocking it.

zxczxc111 · on June 28, 2019

The point is if you see a captcha you can actively decide not to use it. If the website has no such box then you can't decide not to use the captcha

KirinDave · on June 28, 2019

That's not a sinister "land grab" by Google, that's a fundamental aspect of the web, predating the advent of JavaScript. You reveal your identity quite thoroughly to the individual hosting services.

And it's difficult to imagine legislating that away, as it's sort of fundamental to all network computing.

worik · on June 28, 2019

Privacy Badger will know it is there

benreesman · on June 27, 2019

You’re commenting on HN, you know it’s there.

smt88 · on June 28, 2019

I'm commenting on HN. I've been developing for the web for 24 years. I don't know how and when my data is collected or shared most of the time.

The idea that FB and Google are openly making a trade with users is ludicrous. I'm horrified that you either sincerely believe that there's a fair negotiation happening or that you don't care (given your employment history).

zcid · on June 27, 2019

And what about the other 99.99999% of people that use the web? Do they also understand what is going on behind the scenes?

DEADBEEFC0FFEE · on June 28, 2019

I'm here, and struggle to follow many of the threads on HN. As a father, I don't really see how I can effectively prepare my kids for a surveillance internet.

snlnspc · on June 27, 2019

I didn't, but assume this is the case with everything. I mostly care about giving my data away for free (cut me in please), but none of my non-HN commenting roommates knew. Is their privacy less important than mine?

saagarjha · on June 27, 2019

I do, and I can make an informed choice. Unless your website has a very eclectic audience, I’m not the only one using your services.

vlozko · on June 28, 2019

Except it happens on government-owned sites from the local to the national level where I have EVERY right to visit, especially as my tax dollars are paying for it and it’s for services that are available to the general public.

qtplatypus · on June 28, 2019

That's not totally true. If you provide access to your web site to people then in many places in the world you can't limit that access in a way that discriminates against protected classes.

In the US for example you can't set up your web site in a way that accessing it discriminates against people with disabilities.

p49k · on June 27, 2019

That's what's great about GDPR. It makes privacy a fundamental right that can't be bargained away, much like you can't sign a contract binding you to slavery and you can't accept a bonus from your employer in exchange from losing your mandated breaks.

kevingadd · on June 28, 2019

You don't have a legal right to limit access for the disabled, which is what services like reCAPTCHA3 are doing.

alasdair_ · on June 28, 2019

The bigger issue, for me, is for things like Facebook gathering data on third-party sites that I had no idea* were feeding the information back to Facebook.

An even bigger issue, for me, is having my face added to their facial recognition algorithms, despite never once tagging myself in a photo. Is there a way to opt out of this?

pluma · on June 28, 2019

Recaptcha v3 is an invasion of privacy and a blatant violation of the GDPR.

It's as much of a negotiation as offering someone to pay through perpetually indentured servitude is: it's illegal and immoral.

pushpop · on June 28, 2019

This. Absolutely this!

Do I think Facebook/Google/etc are abusing my data right now? Probably not.

But do I think that large scale collection of my data could be abused in the future? Most definitely. If the Cambridge Analytica scandal has taught us anything, it’s that having access to this data is rife for abuse and often it might happen in unexpected ways.

And do I owe an explanation for wanting some basic privacy? Absolutely not. If a random stranger stopped me in the street and asked me lots of personal questions there wouldn’t be an expectation that I have to respond. Yet the likes of Facebook and Google seem he’ll bent on turning the discussion around when it’s data collected online.

dvdkhlng · on June 28, 2019

> FB and Google are the equivalent of Superfund sites waiting to happen, and storing that data should be considered criminal.

I would think, that in the EU, under GDPR, collecting, transmitting and storing that data is in fact criminal, or at least subject to heavy fines. And under GDPR it won't help to just note the data collection in the TOS or ask the user for permission (under threat to not allow access to the service). So I really wonder how google plans to run this in europe.

webmobdev · on June 29, 2019

Exactly! And this is my problem with Apple too - sure , they do some things right and are more conscious of "user privacy" than others, but at the same time they have also started abusing this to further spy on their users.

What use is "we are transparent with our users about the data we collect" when the user does not want you to collect the data in the first place? And they give you no option to opt out of such data collection? (And for what - just so that they can create a better ad network that can better exploit us with our own data?)

(And don't get me started on Safari spying and all their "anonymous" cookie collection crap without giving the user any choice in the matter, essentially forcing everyone of their users to opt-in to be profiled through their browsing history).

Mirioron · on June 28, 2019

And you never know what those corporations could do with the data politically. Recently there has been a lot of talk about how these types of companies seem to favor certain politics. Who's to say that they won't use this data in the future for influence?

benreesman · on June 27, 2019

You could either stop using these services or (as I suspect) you find them too valuable to dismiss entirely quarantine them to a VPN/incognito interaction in less time than it took to type that comment.

I don’t want to single you out personally but there’s a broad trend on HN of bitter-sounding commentary on the surveillance powers of these companies by people who can easily defeat any tracking that it’s economical for them to even attempt let alone execute that reeks of sour grapes that a mediocre employee at one of these places makes 3-20x what anyone makes (as a rank and file employee) anywhere else.

Again, you’re not likely part of that group, but seriously who hangs out on HN and can’t configure a VPN?

kbenson · on June 27, 2019

> You could either stop using these services or

How do you stop using a service when you have little or no indication that it does something like this before hand, and afterwards the privacy is already gone?

If I use a site and view my profile page and the url contains aa account id or username and some google or facebook analytics is loaded, or a like button is sitting somewhere, how am I to know that before the page is loaded? What if I'm visiting the site for the first time after it's been added?

It doesn't even matter if I have an account on Google or Facebook, they'll create profiles for me aggregating my data anyway.

> quarantine them to a VPN/incognito interaction

Which does very little. I spent a few hours this morning trying to get a system non-unique on panopticlick, but the canvas and WebGL hashing is enough to dwarf all the other metrics. There are extensions to help with that, but for the purpose I was attempting, were sub-optimal (and the one that seemed to do time-based salting of the hashes wasn't working right).

So, I don't have any confidence that a VPN and incognito really does much at all.

KirinDave · on June 28, 2019

> How do you stop using a service when you have little or no indication that it does something like this before hand, and afterwards the privacy is already gone?

It is small comfort for the average user, but the way you do it is use noscript. It makes the web awful, sure, but it won't happen to you.

> It doesn't even matter if I have an account on Google or Facebook, they'll create profiles for me aggregating my data anyway.

I sort of wonder what you envision this actually meaning. If I spam your website and you add a DoS filter for my IP, should I complain you made a profile of me? If when a user tries to log in I check the referrer to see if it contains a proper URL, have I violated your privacy?

kbenson · on June 28, 2019

> I sort of wonder what you envision this actually meaning.

I mean it to respond to the common response people sometimes give in conversations like these, which is "that's why I don't use Facebook" or "that's why I stopped using Google services". For this conversation, whether you use Facebook or not is irrelevant, they still gather your information, and in the same way myriad other advertisers (or however they bill themselves) do through online tracking. Google and Facebook are large, and have a portion that's easily visible, but they are not the whole problem by a long shot.

> If when a user tries to log in I check the referrer to see if it contains a proper URL, have I violated your privacy?

No. Noting which door a customer came into your store seems fine to me. That by default customers come in wearing the logo of the last store they visited is weird, but entirely something they can control. Having people shadowing all your customers while in the store looking and listening for tidbits they can report back on to get more info about those people is pretty creepy. As you suggest, the way to get around most of that is to dress blandly and say nothing.

Here's the thing, we're a market economy. There's a transaction going on, where we're trading away something (our information and privacy) to a company for some product, or possibly the right to view a product we might consider buying. How many people are actually aware of this transaction? If they aren't aware of the transaction, there's a name for that when it's a regular good, and it's theft (or fraud). The difference here is that most of our government systems don't apply any rights of ownership to this information, so our regular rules don't apply. I admit, they may not make sense to apply entirely, but at the same time, it's obvious that something is lost in the transaction, whether the person losing it realizes it at the time, or views it as important enough to make a big deal about when they notice.

KirinDave · on June 28, 2019

> Google and Facebook are large, and have a portion that's easily visible, but they are not the whole problem by a long shot.

I meant more like in a literal sense, but okay. Point taken.

> No. Noting which door a customer came into your store seems fine to me. That by default customers come in wearing the logo of the last store they visited is weird, but entirely something they can control. Having people shadowing all your customers while in the store looking and listening for tidbits they can report back on to get more info about those people is pretty creepy. As you suggest, the way to get around most of that is to dress blandly and say nothing

These human metaphors are powerful, but don't map at all to basic analytics concepts. There is no person watching you. There is no intelligence judging you. There are a series of conditions in a deterministic system provoked by your actions. If we could have done this before now, we would have because it's a whole hell of a lot more ethical.

> Here's the thing, we're a market economy.

I dunno where you are but I'm in the US which is most definitely not "a market economy" without a whole hell of a lot of qualifiers.

> There's a transaction going on, where we're trading away something (our information and privacy) to a company for some product, or possibly the right to view a product we might consider buying. How many people are actually aware of this transaction?

Roughly as many, I imagine, as folks who realized the shopkeeper could see them enter and leave. Most folks know local proprietors can and will kick you out and put up a photo if you act up.

> The difference here is that most of our government systems don't apply any rights of ownership to this information, so our regular rules don't apply.

This is just flatly false. I don't know what you're thinking writing this, but it's clearly neglecting copyright and patents. For what it's worth, I think the later is a bad system an the former is in desperate need of reform to sharply limit it.

> it's obvious that something is lost in the transaction, whether the person losing it realizes it at the time, or views it as important enough to make a big deal about when they notice.

I am trying to read your comment in the spirit it was intended rather than the literal delivery, so please forgive me if there is a subtle impedance mismatch here but...

Welcome to the future, I guess? The top 50% earners of the world has access to computers that would have once bankrupted a nation to produce, and the options are still surprisingly good for the next quartile. With that power, it means that the people around you are going to start noticing things and making decisions about them with the information they can now process.

Ideally, this will be a distributed thing, but right now due to the nature of our society, authority of this sort is highly concentrated. But the dam has broken. A total surveillance system for up to a modestly sized city, with realtime tracking and long term data storage, is well within the reach of anyone with $10000USD to spend on hardware. They can self-host it. The banality of this cannot be overstated. It's boring to do this now. It's not new ground. So much so that average people can monitor their homes with it, or know if their friends have gone missing with it.

To some extent, there is just no undoing this. Society will have fewer secrets and those secrets will be much more deliberate, and the only response that can work is to change your attitude.

kbenson · on June 28, 2019

> There is no person watching you. There is no intelligence judging you. There are a series of conditions in a deterministic system provoked by your actions.

I don't think it's creepy because there's a (theoretical) person watching me, I think it's creepy because they're cataloguing all my actions in a systemic was which pierces the veil of perceived privacy (mostly through anonymity).

> I dunno where you are but I'm in the US which is most definitely not "a market economy" without a whole hell of a lot of qualifiers.

I'm not sure how to respond to this without a specific criticism of how you think it's incorrect. That said, it's somewhat tangential to the point, even if it would be an interesting conversation.

> Roughly as many, I imagine, as folks who realized the shopkeeper could see them enter and leave.

I don't know. If every time I entered my local 7-eleven someone picked up a clipboard, flipped to a specific page, looked back at me, nodded to their self and then marked something on the page, I might decide to go somewhere else, at least most the time. If I knew the info was shared with all the other 7-elevens, and the local grocery chain, and some hardware stores, that makes me want to use all the places less.

> This is just flatly false. I don't know what you're thinking writing this, but it's clearly neglecting copyright and patents. For what it's worth, I think the later is a bad system an the former is in desperate need of reform to sharply limit it.

I said "this" to qualify what I was referring to (personal information) and distinguish it from other types of protected information, of the type you reference.

> To some extent, there is just no undoing this. Society will have fewer secrets and those secrets will be much more deliberate, and the only response that can work is to change your attitude.

I don't think that's the only response that can work. It's the only one that works completely, as deciding to not care is always a solution to caring, if you can pull it off.

The alternative is new laws. Are they perfect? No. Will they solve the problem adequately? Likely not. Do they have a chance of making a positive difference across the board for massive amounts of people by empowering them with regard to their own information? I dunno. Maybe? I think it's worth pushing for though. Otherwise, why do we have minimum wage and labor laws? At some point we could have thrown our hands up and said "screw it" about that stuff, but people pushed for it, and while they aren't perfect, I think we're all better off for them.

I don't believe there will be any perfect solution to this ever, or even a good or acceptable solution all that soon. I do think it's still worth raising my voice over, because I think there are some possible futures that are better than others with regard to privacy and personal information, and I think that's worth pushing towards.

literallycancer · on June 27, 2019

You use something that blocks scripts (like uMatrix) with an aggressive ruleset. On some sites you'll need to allow things to make them work. If they are loading trackers from the same servers that they load content from, you can't do much without wasting more time than you want. I'd say it breaks most of the tracking though.

More sites than you'd expect work without js or with first-party js only. It's annoying when you need to read a news site, because those are usually bloated garbage. Not a huge loss.

kbenson · on June 27, 2019

This was already with uBlock Origin. Also tried combinations of Ghostery and Privacy badger. All of it made very little difference for panopticlick, and that's probably a low-bar compared to what's common these days.

mirimir · on June 28, 2019

I don't care if every site that I browse using this VM knows that I'm Mirimir. I don't even try to hide that.

What matter is that my personas using other VMs, through other VPNs or Tor, don't get linked to my meatspace identity, to Mirimir, or to my other personas. And that's doable, I think.

kbenson · on June 28, 2019

Yes, and you go through quite a lot of effort to achieve that, given your other comment.

My main point is that the amount of effort you have to go through to achieve that is very high, and I wish it was considerable lower. There are technological changes that could help with this, and legal changes that could help with this.

I think a comfortable place would be if you visit the same online location using your main browser using one IP, and a private browsing instance of that same browser on another IP (through a VPN, proxy, or just new public lease), it would be nice if there was some expectation they didn't immediately have a high degree of certainty you were the same individual. For the general populace, this falls on its face.

Tor has quite a few mitigations to help here (e.g. simulated window/screen values), and Firefox has started to adopt some of them, but as mentioned here on HN frequently, Firefox sometimes has problems with CAPTCHAs and certain sites (I haven't had those problems, but I'm also not usually using it through a VPN), and I know Tor is sometimes blocked outright.

The point is that until most these protections (technological and hopefully some legal) are mainstream, completely protecting yourself is a double edged sword, since you also ostracize yourself from some sites and services. Tor is the equivalent of walking around in padded, baggy clothes and a ski-mask. Sometimes, like in the snow, it may seem fairly normal. Other times, like at the beach, it may preserve your privacy, but it's very uncomfortable and may cause people to avoid you, if not outright shun you and run you off. If everyone starts wearing masks and covering their hair, if you do the same you probably have a fairly high degree of anonymity and privacy through it.

In summary, I think Tor is a useful and necessary tool, but nowhere near sufficient for where I think we need to be generally.

mirimir · on June 28, 2019

> Yes, and you go through quite a lot of effort to achieve that, given your other comment.

That's true. However, it's mostly one-time effort. There are Linux and TrueOS workspace VMs, pfSense VMs as VPN gateways, and Whonix gateway and workspace VMs. All in VirtualBox.

There's ~no configuration required for the Whonix VMs. You just need to point the gateway VM to the pfSense VM that ends the desired nested VPN chain. And if there are multiple Whonix instances, rename the internal network that the gateway and workspace VMs share.

For the Linux and TrueOS workspace VMs, it's just like any OS install. You do have more machines to maintain, but mainly that's just keeping packages up to date. All of the devices are virtual, so you don't have driver issues.

Setting up the pfSense VMs is the hardest part. But once that's done, you can use them for years. pfSense is pretty good about preserving setup for OS upgrades. And there's a webGUI for changing VPN servers. But it's harder than using a custom VPN client.

So yeah, it's not so easy. However, someone could write an app that papered over most of the ugly parts. That even automated VM setup and management.

benreesman · on June 27, 2019

I assure you that a clean browser and IP will break any surveillance that I know about.

kbenson · on June 27, 2019

No, a clean browser and IP with the combination of what fonts I have installed, how my video card renders a canvas and WebGL instance (which may be affected not just by the video card you have, but the driver version used with it), my screen size, and a few other system level items that come through may or may or may not be enough to uniquely identify you. Along with linking to a prior profile if you screw up one time (or load a URL that has identifying information they can use), and you're busted.

So, sure, a clean browser and IP and never logging into a site you're previously visiting might be enough, but who does that, and doesn't that halfway defeat the purpose?

mirimir · on June 28, 2019

You gotta compartmentalize.

My meatspace identity uses a desktop that hits the Internet directly. It displays no interest in technical matters. Just banking, cards, shopping, general news, etc. It never accesses HN, or any of the other sites that Mirimir uses. Or that any of my other personas use.

Mirimir uses a VM, on a different host machine, and hits the Internet through three VPNs, in a nested chain. Some other personas use different VMs on the same host, connecting through different nested VPN chains. Some are Whonix instances, connecting via Tor, and reaching Tor through nested VPN chains.

So basically, each persona that I want isolated uses a different host machine and/or VM, a different browser, and a different IP address.

benreesman · on June 27, 2019

I appreciate the information-theoretic validity of your argument, but if you think that one of these firms cares enough about your buying preferences to burn enough compute to find that correlation then you either work for the CIA or are mistaken.

kbenson · on June 27, 2019

It doesn't take a lot of compute resources to have multiple profiles, and when evidence of a high assurance level (a referring URL that is known to designate a specific user of a major service) to link it with other profiles that also have that designation.

To me, that seems par for the course for any service that's generating profiles of browsing behavior and trying to make any sort of decisions based on it. It reduces cruft and duplicate profiles while also providing more accurate information. Why wouldn't it be done?

> the information-theoretic validity of your argument

The portion about canvas, WebGL and AudtioContext hashing is not theory at all, it's well known practice from years ago. Jest the other day here there was a story about some advertiser on Stack Overflow trying to use the audio hashing to tracking purposes.

Hell, if you get enough identifiable bits of entropy, you can probably assume weak to strong level matching using a bit-level Levenshtein distance that's low enough.

benreesman · on June 27, 2019

GitHub is always at your disposal. NV doesn’t sell the consumer cards to enterprises. So on AWS a multi-GPU box will cost you about 12 dollars an hour. If you can disambiguate, let’s just say 85% of profiles absent IP or cookies, well I think you just broke the academic SOTA and I’d love to make some calls.

Cheat sheet: you can’t.

byonge · on June 28, 2019

I sorta understand the sentiment where there are tools like :

https://amiunique.org/ https://browserprint.info/

whose results would have you believe that one's footprint is very unique. I'd be interested in hearing more about why this is hard to implement into an efficient process.

saagarjha · on June 27, 2019

> GitHub is always at your disposal. NV doesn’t sell the consumer cards to enterprises. So on AWS a multi-GPU box will cost you about 12 dollars an hour.

I don’t see how this is related to the claim, since it doesn’t solve the problem. But the advertising company that I let run code on my website will certainly do the job pretty well, I’d say.

benreesman · on June 28, 2019

I was pointing out that it’s a commercially applicable of a very strongly worded claim that I know would be expensive to test because I’m optimizing GPU intensive code at the moment. I don’t know where in this thread I generated so much ill will for trying to add knowledge to the conversation, but I’m not making shit up.

saagarjha · on June 28, 2019

There are tools that will supposedly do this to a high degree of accuracy. Are you saying that they are fake/don't work as well as they'd want us to believe?

byonge · on June 28, 2019

not much effort goes into de-duping customers? genuinely curious

I would've thought that would be a pretty useful exercise

amenod · on June 27, 2019

> You could either stop using these services or ...

Are you serious? Have you tried not using their services? Try blocking Google Analytics, Tag Manager, ReCaptcha, fonts, gstatic,... What you will see is that you can no longer access much of the Internet. Want to participate in StackOverflow? Good luck if you block Google.

My beef is not with them trying to find my data when I'm on their site(s). They are however everywhere, on almost every site I visit. Coupled with their (impressive) technical provess it is beyond creepy, and there is simply no way one can avoid them.

I don't know what the solution is or will be, but as far as I'm concerned, this should be illegal.

Freak_NL · on June 27, 2019

> Try blocking Google Analytics, Tag Manager

Blocking those two doesn't seem to break much, does it? I have uBlock Origin and/or Privacy Badger block them everywhere.

ReCaptcha on the other hand…

Just this week I needed it to complete the booking of an airline ticket and just now buying a high chair for my son. And today I've completed the blasted thing ten times in a row because of a game installer that was failing at a certain point (GTA V's Social Club thing); each attempt to figure out what was wrong meant completing the ReCaptcha again.

Fire hydrants, parking metres, pedestrian crossings, road signs, hills, chimneys, steps, cyclists, buses — that's what the internet looks like in 2019.

benreesman · on June 27, 2019

Unfortunately politically acceptable regulation only deters new ventures because it makes the costs of compliance too high.

The right vehicle for this is antitrust, but if you think you can sell that in this climate then I’ve got a great deal for you on the London Bridge.

dmitriid · on June 27, 2019

The costs of compliance are not too high. Compliance is actually ridiculously easy for new companies: they need to collect only the data they need. That is all there is

benreesman · on June 27, 2019

https://en.m.wikipedia.org/wiki/General_Data_Protection_Regu...

dmitriid · on June 27, 2019

Yes. Your point? It’s actually ridiculously easy to be compliant with GDPR.

Edit: That is, ridiculously easy for new companies. Incumbents have been hoarding data for too long and it was actually harder for existing companies to become compliant.

benreesman · on June 27, 2019

If you don’t think that lawyer fees scale linearly with regulation complexity you’re either an early Uber employee or mistaken.

When you’ve built a social consumer business in Europe that is profitable after compliance, send me a term sheet.

snlnspc · on June 27, 2019

I enjoyed reading what you said as a different perspective on the backend of ad technology vs privacy up until this comment thread.

I didn't build a profitable social consumer business in Europe after compliance, but I was part of a team that implemented compliance for a long existing company within the US due to them having clients and client's clients in Europe. They're profitable. Do you want my term sheet? Or are you weakly attempting to flex while complaining that people's basic right to privacy is preventing you from earning obscene amounts of money?

benreesman · on June 28, 2019

As I’ve mentioned I think elsewhere in the thread I left that business in no small part because it didn’t feel right to be in anymore. It was at a significant cost. I’m really lost on where in the thread I started to sound like a shill for business practices I (knowledgeably) don’t care for.

byonge · on June 28, 2019

What do you estimate the implementation costs of GDPR are? I've seen some research that put the numbers in the 10's of billions IIRC

It feels like a regulatory moat for the big players who can afford it. Sorta like a complex VAT policy.

dmitriid · on June 28, 2019

Those numbers are for existing companies who have been hoarding and selling user data with utter disregard to existing laws and user privacy.

If you do everything right from the start, the costs are minuscule.

dmitriid · on June 28, 2019

Why would you need lawyer fees? How is GDPR complex?

It literally is:

- you only store data you require to run your business

- you delete data if customer requested deletion

- you give the customer their data if they ask for it

If your profitable business is built upon selling customer data wholesale to third parties, then good riddance.

nord73 · on June 28, 2019

Google and Facebook et al stores and processes PII on non-customers, without informed consent given from users.

It's still early days. We'll see what will happen when the DPA's and the courts have fielded a few high profile cases.

amenod · on June 28, 2019

This! I hope it costs them dearly. I have never (willingly) given them consent to have my data, yet I know they have loads of it, just because other people I know are careless with data about me.

deckard1 · on June 27, 2019

> You could either stop using these services

No you can't. Facebook creates shadow profiles for every single person in the world. If any single one of your friends has WhatsApp, Facebook has your phone number. They have your phone number and the entire address book of your friend, who probably has friends in common. If two of your friends have WhatsApp and they both have your number...

You see where I'm going here? There are pictures of me on Facebook that I did not put there. From friends or friends of friends.

I'm not even scratching the surface of what Google knows with GPS and WiFi connections.

No one consented to any of this bullshit.

benreesman · on June 27, 2019

There’s a reasonable argument in there, but it applies to any world in which digital cameras are cheap.

This is in a sense the worst kind of argument: superficially correct but really meant to tap into a popular groundswell of sentiment.

The question isn’t “can FB use an off-the-shelf CNN to identify me personally” but rather:

“If it weren’t FB who would be doing it instead?”

and:

“Should cheap digital cameras be illegal?”

cyphar · on June 27, 2019

> The question is “If it weren’t FB who would be doing it instead?” [...] “Should cheap digital cameras be illegal?”

Those are a complete non-sequitur.

Facebook (and Google) analyse every single photo that goes through their system with state-of-the-art ML (it's so good that it almost beat humans at matching faces ~5 years ago). This is a scale of surveillance which the human race has never encountered before in our history[+], and is a serious problem that we (as a society) need to make a decision on. In many countries, car license plates are OCR'd and automatically tracked whenever they travel on almost any main public road. Facial recognition in public places and on public transport is becoming a prevalent problem. And wearing masks is illegal in many countries -- meaning there is no way of "opting out" of the pervasive surveillance in the physical world. None of these things were nearly as commonplace (or even technologically plausible) ~30 years ago.

Cheap digital cameras are a completely unrelated topic. And if such large-scale surveillance was made illegal then nobody would be doing it legally, and those doing it would be held accountable for the public health risk they pose. We don't let people build buildings with asbestos any more.

[+] The Stazi and KGB only really had filing cabinets for tracking people and physical surveillance measures. The Gestapo didn't even have that (the Third Reich had census data which was tabulated using IBM machines in order to track who was Jewish within the Third Reich).

benreesman · on June 28, 2019

I think you overestimate the degree to which SOTA computer vision is applied to a lot of images online, and I think bringing East Germany into it is pretty out of line.

darkpuma · on June 28, 2019

I think you're feigning offense to avoid addressing the substance of his comment.

rjf72 · on June 28, 2019

There's a very good reason to consider negative outcomes of the past in discussions such as this. Let's pretend companies like Google and Facebook are totally on the up and up; pretend the company that aims to facilitate a user tracking search engine for China that is doing things including literally blacklisting searches such as "human rights", is on the up and up.

The reason what these amazingly benevolent companies are doing and collecting matters is because the systems we build today are precisely what will power the dystopias of tomorrow. As the GP mentioned, Nazi Germany used census data to select and track their victims, aided by some primitive computational technology built for the Nazis by IBM. In spite of how primitive all of this technology was, it ended up being quite effective at enabling them to achieve their ends.

Now compare this to the systems we're building today. Genuinely bad people do, and will, manage to take power in any system. It's not a question of if, but when. And these systems that we're building will be at their disposal. It's the same reason that in politics if you're considering granting the government more power you shouldn't think about today, but about tomorrow. Not do I want "this" administration to have those powers, but do I want future administrations - whom I will vehemently disagree with, to have those powers?

losteric · on June 27, 2019

Most people here can avoid the impact of climate change - do you think we shouldn't talk about that either?

These are societal problems. It's good to care about people beyond yourself, and to talk about the professional ethical responsibilities of software engineers with regards to corporate mass-surveillance.

bad_user · on June 27, 2019

How about our friends and family? Should we configure a VPN for them too?

Btw the argument you just made applies to any form of surveillance or censorship. Just because your can still find functional VPN services for China, is China's great firewall OK?

And what happens when web services start blocking VPNs?

Netflix does it quite successfully. And I'm sure Cloudflare could provide such a service for free.

KirinDave · on June 28, 2019

Just so you know, that commercial VPN is almost certainly spying on you.

benreesman · on June 27, 2019

I’m not making a moral argument for the surveillance state, I wear Curve25519 on one arm and the word “citizenfour” on the other.

I agree that there is a vast and almost impossible to regulate overreach by these companies. Your argument is extremely compelling.

But when HN users complain about being spied on I smell a FAANG rejection letter.

heartbreak · on June 27, 2019

> But when HN users complain about being spied on I smell a FAANG rejection letter.

You’re projecting Ben.

saagarjha · on June 27, 2019

> when HN users complain about being spied on I smell a FAANG rejection letter

I work at a FAANG: here’s my complaint about being spied on.

benreesman · on June 28, 2019

So your comments should be at the top. You’re knowledgeable on the subject and even if we disagree that should be upweighted in a perfect world.

saagarjha · on June 28, 2019

stOneskull · on June 27, 2019

People care about others, not just themselves.

xg15 · on June 27, 2019

Unless the topic is affordable housing, that is.

benreesman · on June 27, 2019

I agree, but search “HN levels.fyi” to understand that we’re in the minority on that.

laughinghan · on June 27, 2019

If you think their argument is compelling, why are you insulting them?

(This is a genuine question. Many of your comments have added to the discussion and I've upvoted them. But I've also downvoted many that haven't.)

saagarjha · on June 27, 2019

> reeks of sour grapes that a mediocre employee at one of these places makes 3-20x what anyone makes (as a rank and file employee) anywhere else

This is not an argument and moreover not even true: there are companies that pay well and don’t collect reams of data on their users.

benreesman · on June 27, 2019

You didn’t address my argument and unless you’ve been on more comp committees than me then I would annotate that as sources needed.

laughinghan · on June 27, 2019

What argument? What possible position could "if you can't configure a VPN, you're probably mediocre" be an argument for?

saagarjha · on June 27, 2019

Like I said: it’s not an argument, it’s an attack. Plus I’m sure that there’d be many people here able to counter your claim regardless of the compensation number you drop.

13415 · on June 27, 2019

A VPN will not help you against advanced behavioral browser fingerprinting like in this new Captcha. Not only do they have lists of VPN servers anyway, if you inadvertently log into your Google account once from the VPN (e.g. by launching your browser from your normal account), then the VPN IP(s) will be forever associated with your account and normal IPs, and they already know from the Captcha data that you're one and the same person. All the VPN does is adding the information that you sometimes use VPN servers of company such-and-such.

heartbreak · on June 27, 2019

This is a ridiculous argument. Advanced technical competency can not be a prerequisite for maintaining personal privacy.

benreesman · on June 27, 2019

We’re on a site premised on entrepreneurship, and you’re pointing out what sounds like a big market gap. I angel invest now and then, if you have a plausible way to make two billion people care about something that we agree could be better my email is in my profile.

Even from the inside I didn’t see a way, but I’ve been wrong before.

FabHK · on June 27, 2019

Yes, looks like the industry cannot solve that problem alone, just like the electricity and chemical industries somehow didn't achieve clean air and water out of the goodness of their hearts. Another market gap. Or, wait, a case for government regulation.

laughinghan · on June 27, 2019

Your HN profile appears to be blank, actually. Is this you? https://github.com/benreesman

laughinghan · on June 27, 2019

It takes less time to lock my front door than to configure a VPN, but burglarizing an unlocked house still is and should be illegal.

__jal · on June 27, 2019

> or (as I suspect) you find them too valuable to dismiss entirely quarantine

You are wrong. I block the known IP blocks of the big surveillance shops and a lot of the small ones[1].

> sour grapes that a mediocre employee at one of these places makes 3-20x what anyone makes

Are you sincerely saying you believe people who are uneasy about surveillance are just jealous?

[1] Twitter is currently an exception, I was playing with something. But I'm going back to blocking them soon.

davidgerard · on June 27, 2019

> sour grapes

seriously?

benreesman · on June 27, 2019

search “HN levels.fyi”

dang · on June 27, 2019

I appreciate your comments in this thread. But could you please stop baiting people on this point? If there's one thing I've learned from running HN it's that the generalizations about the community that people come up with are invariably wrong. They're overgeneralized from a small sample of what the generalizer happened to notice—and since we're far more likely to notice what rubs us the wrong way, the results always have have sharp edges. In other words, people remember most the things they most dislike, then tar the whole with it. To borrow your phrase, the actual TLDR is less interesting.

benreesman · on June 28, 2019

Thanks for the mild rebuke dang, I think you do a great job meta-moderating this community.

I wish I had stayed out of this from the beginning, I see no merit in arguing about whether HN has some themes. I’ve been watching it daily for a long time as you can tell from the age of the account.

If you want to do something that would be both a good call as a mod and a favor to a longtime user, just whack this whole thread. I was trying to chime in with some knowledge but just wound up pissing everyone off.

dang · on June 28, 2019

I have to say I strongly disagree—I thought your contributions were excellent, and HN lucky to have you contributing on a topic that you know a ton about. If I contributed to your feeling otherwise then I wish I hadn't posted!

One thing I can offer from years here is: never underestimate the silent readership (I'd say silent majority but...associations). The vast majority of readers don't comment and most don't vote either. It doesn't mean they aren't following and getting a lot out of what you wrote. Usually it's only the most-provoked segment of the long tail that is motivated to respond. That's fine, it's the cycle of life on the internet—but it doesn't represent the whole community.

Please comment more.

luckylion · on June 27, 2019

> Again, you’re not likely part of that group, but seriously who hangs out on HN and can’t configure a VPN?

Recaptcha tracks users / devices, not IPs. A VPN won't help, it'll only lower your score. At that point: not allowing them to track you just means you can't use large parts of the web.

"You don't want that GPS tracker installed into your skull? Well, we won't force you, of course, but public transportation, government services and most grocery stores can only be used by GPS-skull-people"

jammygit · on June 28, 2019

As an aside, I read recently that gps is a system where your device reads signals sent from space and does not reveal your location. Very neat.

benreesman · on June 27, 2019

Wild speculative hyperbole hurts the case of people like you and I who care about doing something positive on the ground today.

luckylion · on June 28, 2019

Is it though? I'm somewhat lucky, because my government is generally technologically behind and loves literal paper trails, but yours isn't. Plenty of .gov sites use recaptcha. Sure, you can still visit those sites, it's just that, unless you pass a captcha test, they can't verify that you're actually a person (and not a Russian bot) and can't let you do certain things. If you want to use those government services, you need to allow Google to track you, or maybe they'll add a "sign in with Facebook" option so you have a choice.

With invisible captchas, you can't even sit down and solve a higher number of riddles to prove that you're really human and know what a fire hydrant looks like even though you look kinda strange. If Google doesn't believe that you are human, tough luck. Unless you have a personal connection or a solid Twitter following that an amplify your concerns, nobody at Google cares. Does your government care? It makes their life easier and normal citizens never really had problems with it.

DHL makes me solve a captcha to login and buy postage stamps. There probably are, or will be, public transportation companies that use recaptcha. It helps them to combat voter fraud (crime, abuse, election meddling, fake news, lots of things) if they know where (on the web, for now) you've been in the last 6 months.

You don't like the "implanting" part, because that's unrealistic? Just wait 20 years, and it may not be your head, but an RFID chip in your hand (yeah, those exist already). Until then, carry your gps tracker around and install their software on it, so it can collect data on your behavior to make sure that you're not a criminal.

bongobongo · on June 27, 2019

It is not "wild speculative hyperbole" not to give the benefit of the doubt to companies that have repeatedly demonstrated that they are not entitled to the benefit of the doubt.

benreesman · on June 27, 2019

GPS tracker installed in people’s skull sounds hyperbolic to me.

whenchamenia · on June 27, 2019

How about peoples pockets? Its hyperbole, but not a huge reach.

gfosco · on June 27, 2019

I think it's worth pointing out that the comment you replied to didn't mention money, advertising, or CTR. People are concerned about data collection for more reasons than that. You've seen these attempts and entire careers about it without "juicing" CTR, so perhaps that isn't the true intent.

benreesman · on June 27, 2019

I admit that I inferred the proposed intent for grabbing maximum personal data, but if you’re interested in anecdotes from the trenches: no one below senior director level gets a couple million in stock for any other reason than they pushed CTR by a few basis points. What I was trying to say is that seen through the lens of mechanism design no one is incentivized to query the like button table because there’s no upside in it.

jakobegger · on June 27, 2019

I'm not sure I understand correctly. Are you saying that all the personal user data is in reality not as valuable as everyone says it is? That is, all those megacorps are collecting terabytes of mostly useless data?

Then why is this data collected and archived in the first place?

benreesman · on June 27, 2019

I was never involved in those decisions but I suspect that when you’ve got a multi-dollar CPM and your biggest pain in the ass is pouring concrete and running power fast enough that a few PB of spinning disks are cheap enough that you hang onto it in case you ever find a way to make it useful.

ViViDboarder · on June 27, 2019

That sounds logical. It’s also exactly the reason many of us don’t want to give up our information to these companies. There is absolute uncertainty as to how it will be used in the future.

laughinghan · on June 28, 2019

Because it costs practically nothing. If the cost is zero and the expected value is greater than zero, then no matter how little value it has it's still rational to collect it.

The problem is that the individual bears a very small risk of something very bad happening: "consider the hypothetical case of a gay blogger in Moscow who opens a LiveJournal account in 2004, to keep a private diary. In 2007 LiveJournal is sold to a Russian company, and a few years later—to everyone's surpise—homophobia is elevated to state ideology. Now that blogger has to live with a dark pit of fear in his stomach."

https://idlewords.com/talks/haunted_by_data.htm

The individual of course gets no benefit from the small chance the company can monetize on this data trove. So even though chances are they aren't harmed at all by this data collection, arguably the expected value of the benefit/harm to the individual is negative (harmful). But that doesn't change the data collector's calculation, of course. That's why government regulation is necessary.

avarun · on June 27, 2019

Yes, that's exactly what he's saying.

It's mostly all collected because it's easier for them to collect it than not to collect it, and nobody is stopping them from collecting it.

EpicEng · on June 27, 2019

The fact that so much potentially sensitive data exists in a few repositories is in itself a bit foreboding. Who knows what companies will be able to glean from it one, five, or twenty years down the road?

My behavior on the web being tracked by corporations with little incentive to do right by me is worrisome.

benreesman · on June 27, 2019

I’m more concerned that they’re designing the next version of the Web right under our noses than that they know what kind of sneakers I’m 8% more likely to buy.

saagarjha · on June 27, 2019

However, I’m concerned that the data also allows them to see that I am 93% more likely to vote for a certain political candidate, 22% more likely to contract a chronic disease in the next ten years, and 16% more likely that I will have a friend that homosexual.

EpicEng · on June 27, 2019

I'm not thinking about ad delivery, I'm thinking about behavioral analysis. Knowing how a person thinks and acts can be a very useful weapon in the wrong hands, and FB and the like have done nothing to make me think their hands are the right ones (I don't think any are really.)

owldimoon · on June 27, 2019

I'm not sure where to add this comment, but I just wanted to briefly say that I appreciate your contributions to this topic. Both in terms of content and tone/delivery. These seem like constructive and valuable comments to me, so thanks!

ehsankia · on June 27, 2019

What other reasons do you think the original post was implying for Google to collect all this data?

benreesman · on June 27, 2019

I think the implication was that the leadership is hanging onto all that data because of an immediate fiduciary obligation. I suspect that it’s more in the nature of when you’re running a business in which a few hundred million QPS is slow that you archive in case it ever becomes useful.

bscphil · on June 27, 2019

Unless the point of your comment was to deny that Google is collecting this data at all (because, according to you, there's no financial incentive), I don't see the relevancy of your criticism. The complaint of the top level comment was that Google is collecting extremely personal data on us. Your response is that Google doesn't have an immediate financial incentive to do this. If you're not actually denying that Google collects this data, why does that matter? For most of us, the fact that our personal data has some financial value to a corporation is irrelevant to the fact that we don't want them to have it.

smsm42 · on June 27, 2019

That's the annoying part of it. They try to collect everything about me, down to my favorite color and the brand of tea I am drinking, and they can't even deliver a semi-relevant ad. Best they can do is to bombard me with shoe and riding classes ads for 6 month after I search for "weight of a horseshoe" and stuff like that. They kill the privacy, they make 99% of the sites unusable without an ad blocker, and at the end it doesn't even amount to them making relevant ads...

benreesman · on June 27, 2019

If I were you I’d be more worried that Google de facto controls whatever we’re calling HTTP these days than that they have a BigTable entry that ties a browser you once used to a preference for Earl Grey.

saagarjha · on June 27, 2019

This is a false dichotomy: I don’t see why I can’t be worried about both.

austhrow743 · on June 28, 2019

do you deviate from social norms in any way more significant than tea preference? Have you ever? No extreme political opinions? No fetishes? There's really nothing in everything big tech companies have about you that would be worse for you to have leaked?

smsm42 · on June 27, 2019

I am worried by both. In fact, I am worried by more than these two things about Google, but listing them all would probably take this discussion way off course.

dessant · on June 27, 2019

No offence, but your posts in this thread appear to be projections, and they derail the conversation.

The main topic we discuss is corporate surveillance. We are concerned about all the personal data that leaves our control. We are worried that evading this type of surveillance becomes increasingly difficult.

Some HN users may know how to mitigate these risks, but most people may not know how to defend themselves against corporate surveillance.

This is why me must speak up now, and not just for ourselves.

benreesman · on June 27, 2019

I am falling behind replying to all the comments that this has generated.

For the record I am inked all over with anti-equation group stuff: I agree that these companies are too big and powerful (and I would know).

I just don’t see a solution with the present judiciary. If anyone has a bright idea my email is in my profile.

I will thank you all in advance for not shooting the messenger.

laughinghan · on June 28, 2019

Your HN profile appears to be blank. I think I found your email on GitHub though?

For the record, many of your comments here have been thoughtful, and I've upvoted them. I've also downvoted many where instead of responding to other people's thoughtful comments, you just insult them instead. Those are also the ones that other people seem to be downvoting. I don't think anyone is shooting the messenger here.

benreesman · on June 28, 2019

I am not experienced at replying to several comments a minute, I’m sure I made some errors in judgement during that process, but I think this thread is as active as it is because people want to know how this shit works, not because I’m the apex troll of pushing people’s buttons. FB and Google are in a dubious market position here but they take a lot of flack on HN for how highly they pay and how hard the interview used to be.

jackpirate · on June 27, 2019

Can you provide any evidence that personal data doesn't improve CTR prediction for companies like Google/Facebook?

You state yourself that Google/Facebook publicly claim to advertisers that personal data improves CTR prediction. So I have a hard time believing that personal data isn't useful.

benreesman · on June 27, 2019

I’m already on a shaky limb being so candid about how the business actually works. If you want the opinion (albeit a little dated but still relevant) of someone who doesn’t give a fuck about who the truth pisses off I recommend a book called “Chaos Monkeys” written by a former YC (exited) founder.

luckylion · on June 27, 2019

> If your browsing history can juice CTR prediction then I’ve never seen it. I have seen careers premised on that idea, but I’ve never seen it work.

Isn't demographic targeting exactly that, based on your browsing history? Will showing an ad for a car wash have the same CTR for people that liked car products as for people that did not like car products? Or is your point that it still has to be a human that inputs "this is about car things, please show it to people that like car things" and it's not a magic AI that optimizes it automatically? And in that case: isn't that just a matter of time? Build the profile today, build the tech that uses it tomorrow?

c0vfefe · on June 28, 2019

Part of the point of objecting to big data surveillance is that we ultimately don't know how it's being used, despite what companies claim about its use.

Can I believe you? ...even if you're telling the truth, big corps can hide their most malicious practices from most of their own employees.

To me, it doesn't matter how e.g. Facebook actually uses my data today, because even if they're telling the truth they could change their policies tomorrow, or get hacked, or some third party (incl. the gov't) could get hacked, etc. It's better as a user to try and prevent such data from ever existing in the first place.

taneq · on June 28, 2019

> if your browsing history can juice CTR prediction then I’ve never seen it

That's great. So if my browsing history is useless then you won't mind not trying to snoop on it.

kjaftaedi · on June 27, 2019

Anectotally, I keep no browser history and do not feel my experience with captchas is different than a user who does.

benreesman · on June 27, 2019

I would contend that the reason for that is that none of the engineers involved get paid more if that experience is different.

anth_anm · on June 27, 2019

> source: I ran Facebook’s ads backend for years

Why would anyone ever trust a goddamn thing you have to say about their data?

Unless they pay your salary and are asking you to give your expertise on hoarding and abusing user data, obviously.

dang · on June 27, 2019

We've banned this account for repeatedly breaking the site guidelines and ignoring our many requests to stop.

If we allow users to harass and attack people who have genuine expertise for posting here, does that make HN better or worse? Obviously worse. Mob behaviors like this are incompatible with curiosity.

https://news.ycombinator.com/newsguidelines.html

benreesman · on June 27, 2019

Me spilling tea about the business is far more in the spirit of a whistleblower than a shill.

I have nothing to gain and everything to lose by shedding light on one of the most powerful entities in existence.

But TLDR it’s not as interesting as people like to think.

saagarjha · on June 27, 2019

You can gain internet points on a social website…

benreesman · on June 28, 2019

Check my join date, I have less than 500 points. If I was just like “foo is evil” every time foo came up up I’d have like 10^3 more.

bscphil · on June 27, 2019

> Since reCAPTCHA v3 scripts must be loaded on every page of a site, you must send Google your browsing history and detailed data about how you interact with sites in order to access basic services on the internet, such as paying your bills, or accessing healthcare services.

> If you'll refuse to transmit personal data to Google, websites will hinder or block your access.

I wonder how true this really is. 20% or so of web users have ad blockers, and most ad blockers block scripts like Google Analytics out of the box. It isn't hard to see that most of them will not make exceptions for a new Google tracking script. So any site that does any kind of testing at all is going to see that ~15% or so of their users drop off if they block users who don't have a reCaptcha v3 score. The only sane business decision in response to this is to go with some alternative.

(Of course, there will be some sites that continue to block users, it's just that they will mostly be the sites that already block users running ad blockers.)

judge2020 · on June 27, 2019

Even UBO doesn't block ReCaptcha by default, so I don't see Rv3 being added to easylist anytime soon.

AgentME · on June 27, 2019

It doesn't block it because it's generally not active on all pages of a site. The description of v3 sounds more like Google Analytics and will probably be treated similarly.

judge2020 · on June 28, 2019

I find a v3 block to be unlikely for the same reason v2 isn't in easylist - too much friction for the list users. Websites will likely break completely when performing actions if they don't receive any sort of verify token from the browser. It would probably be best to have a list for recaptcha v3 as a built-in optional filter so that users know they've enabled it and know why websites might be breaking.

bscphil · on June 28, 2019

On the other hand, the lists haven't removed ad / tracker blocking for sites that block ad block users. I still see "disable adblock to view this site" occasionally. I think leaving rules in place that result in blocked access to sites, but letting through Google tracking scripts on every page because it results in the same would be a huge misjudgment on the part of the list maintainers.

dao- · on June 27, 2019

I'm not sure there's a significant legal difference in the end. If someone could demonstrate that alternative browsers regularly get a lower score than Chrome, that seems like a pretty good antitrust case.

Or were you referring to the risk that individuals would sue Google for getting blocked from random, potentially essential websites?

bduerst · on June 27, 2019

Not GP but they most likely meant the second. The V2 prompt blocks people from accessing services, which could be construed as damages at scale.

You do bring up a good point about the V3 being potential antitrust issue, but that has always been a potential problem even with earlier versions of recaptcha. With V3, it's also deferring the liability to the webmaster. The action that the website takes with the score is up to them - in the end it's just a number.

miohtama · on June 27, 2019

From the service provider and devops perspective I find reCAPTCHA beautiful. It brings down malicious form fill, form spam, user creation and password brute forcing rates.

Also as a VPN user, I found out that migrating to more expensive, higher grade VPN, solved a lot of my problems.

In the end it is not privacy, not your VPN that matters from the service provider point of view. It matters that your IP address is spewing malicious garbage. I do not want to spend time sorting it out, as I can focus my activities to revenue generating tasks. Harming some cheap VPN users in the process is collateral damage, but I rather take it than build a form with a perfect attack mitigation and 10x cost.

I hope to see some alternative for reCAPTCHA that does not come with such a strong privacy oriented risks. hCAPTCHA https://www.hcaptcha.com/ seems to be interesting, also monetization point of view. But they are not yet well established company and I do not know what other risks their approach would bring.

OrgNet · on June 27, 2019

I don't even use a VPN and have lots of issues solving google's captcha...

miohtama · on June 27, 2019

Potential other causes

- Your ISP is a source of a lot of malicious traffic

- You have some browser extension or other adjustments that makes it harder to analyse you as a genuine web browser

For example, using a browser automation like Selenium testing triggers "hard" reCAPTCHA. Not sure if this because of some automated API that Selenium exposes, or just because your browser profile looks virgin (no cookies) without any prior reCAPTCHA solves.

OrgNet · on June 27, 2019

I use pretty standard extensions... uBlockO, decentral eyes, smart referrer... I just wish that companies would stop using Google's reCAPTCHA service.

Also my IP address rarely changes and I don't think that any malicious traffic is coming from it.

And I have Comcast, so I hope that they didn't blacklist all of us...

(I did talk bad about Google a few times though, maybe that's it)

saagarjha · on June 27, 2019

Those aren’t extensions that an average user would install.

bduerst · on June 27, 2019

Just Smart Referer alone is a likely culprit. Masking or having no referer is a prime attribute for low-level bots.

OrgNet · on June 28, 2019

oh... so i should not be able to use any websites because of the extensions I use?

dodobirdlord · on June 28, 2019

You should not be able to use any website that the host doesn't want you to use. That seems pretty straightforward. There's a strong correlation between profiles that look like yours and bots. Why should the web admin do free labor for you to put together a sufficiently nuanced bot-detection system to tell the difference, when the one they have is clearly good enough for them?

KirinDave · on June 28, 2019

Stop using smart referrer. It has no legitimate purposes. Referrer URLs are not the problem. You look like a bot and you going to get locked out of sites.

If you're actually concerned about that kind of data leakage, you want NoScript, full stop.

thekyle · on June 27, 2019

> Since reCAPTCHA v3 scripts must be loaded on every page of a site, you must send Google your browsing history and detailed data about how you interact with sites in order to access basic services on the internet, such as paying your bills, or accessing healthcare services.

I don't believe this is true. You only need to include the JavaScript on pages which actively use the reCAPTCHA score. For example, you might only include it on the login and user registration pages.

zymhan · on June 27, 2019

Did you read the article?

> To make this risk-score system work accurately, website administrators are supposed to embed reCaptcha v3 code on all of the pages of their website, not just on forms or log-in pages.

thekyle · on June 28, 2019

Google recommends that you include the code on multiple pages, however, it makes it clear in the official docs that this is absolutely NOT required for the reCAPTCHA v3 system to work.

So if the article stated that websites were required to put the code on multiple pages (as the comment I replied to did) then the article is factually incorrect.

luckylion · on June 27, 2019

Isn't the idea that they can decide whether it's a user or a bot based on what the user does in general, not just whether their browser executes JS on this page that you want to protect?

Running headless chrome is trivial, so just having it sit on the one page where you need to check it won't help much. Collecting more data on the user's action on your site will provide a much clearer picture, much like a video from somebody walking through a store will help you make a decision about whether he's trying to steal something than a single picture of him standing at the check out.

judge2020 · on June 27, 2019

The big "if" here is whether or not Google is actually factoring the user's activity into the score. For all we know, there could be a 80/20 split between "Google account activity" and "human-like behavior on website" when Google outputs a trust score.

stri8ed · on June 27, 2019

Is it that different from the way Google Analytics works?

frenchyatwork · on June 27, 2019

It is, in the sense that it's easy to disable Google Analytics by disabling tracking in Firefox, and there's no consequences. If a website uses reCAPTCHA, and you have tracking disabled, the website will break.

behringer · on June 27, 2019

Works for me. Assuming one uses something like Privacy Badger and it if it were programmed to block reCaptcha, these websites that require recaptcha will go the way of anti-adblocker popups. People will simply say no and hit the X and go to their competitors.

joaobeno · on June 27, 2019

Sure, my gov (Brazil) uses reCaptcha on the page where you can check your electoral status (For example: if you can vote, where, and if not, what is missing). Where can I find a competitor for that?

behringer · on June 27, 2019

Ask your political representative why they're relying on a foreign ad service to manage their government websites.

dessant · on June 27, 2019

You should expect a similar impact on your privacy.

The important difference is that unlike Google Analytics, reCAPTCHA v3 is inescapable. You cannot prevent the collection of your personal data, because then you would loose access to large portions of the web.

tinus_hn · on June 27, 2019

You can’t block recaptcha!

edraferi · on June 27, 2019

Why not? Is it always self-hosted?

blacksmith_tb · on June 27, 2019

I think they meant "you can't block reCAPTCHA and still access services behind it" - technically you could add a rule to uBlock Origin etc. to block it, but then you'd be unable to use those site/services.

erklik · on June 28, 2019

> Since reCAPTCHA v3 scripts must be loaded on every page of a site, you must send Google your browsing history and detailed data about how you interact with sites in order to access basic services on the internet, such as paying your bills, or accessing healthcare services.

From a technical pov, how does one access a user's browsing history from client-side javascript. Isn't that something the browser should protect? or do you mean more that since the reCAPTCHA gets loaded on each page, Google can track what that IP is visiting by where reCAPTCHA gets loaded?

kevingadd · on June 28, 2019

If the script is loaded in the host site's context (it would have to be), it knows your current location and can use DOM APIs to inspect your browsing history (on the current site, at least) and I believe on first page visit it will also be able to identify what website sent you there. It could also potentially register event listeners to watch what links you click.

_Codemonkeyism · on June 27, 2019

"reCAPTCHA v2 is superseded by v3 because it presents a broader opportunity for Google to collect data, and do so with reduced legal risk."

And if you use something to prevent tracking - in my case Brave - reCAPTCHA is a huge pain that often takes dozens of clicks to make it through - delayed by Google to wait out bots.

Some times I think reCAPTCHAs main goal is to bring back those opposing tracking back into the fold of Chrome with painful recaptchas.

dessant · on June 27, 2019

This comment has been detached, originally it was a reply to https://news.ycombinator.com/item?id=20295333.

modzu · on June 27, 2019

in other words this is a callout to all webmasters:

please consider not using recaptcha.

DyslexicAtheist · on June 28, 2019

the amount of sites asking me to identify fire-hydrants and traffic lights has dramatically increased since I turned on content-blocking that ships out of the box in firefox. Have been aggressively blocking already before (combo of: UBO, NandoDefender and steve black#s host file), but since turning on this inside FF ... there is no peace.

Anyone got some URL's that I can block all captcha attempts or does it mean I have to also sinkhole www.google.com[1] ?

( I don't have a problem not being able to access captcha enabled sites. )

[1] quick check tells me I would have to banish this endpoint which sucks because I'd have to parse the URL on every request and can't do it in DNS: https://www.google.com/recaptcha/api.js

bad_user · on June 27, 2019

If the v3 script is supposed to be installed on all pages of the website, in order to track the user's actions, I don't understand how that can be done without explicit user consent under GDPR.

snowwolf · on June 28, 2019

As long as the ONLY processing of the data is for fraud detection/prevention, then GDPR specifically allows it as a “Legitimate Interest”

Recital 47: “The processing of personal data strictly necessary for the purposes of preventing fraud also constitutes a legitimate interest of the data controller concerned…”

Recital 71: “decision-making based on … profiling should be allowed where expressly authorised by … law … including for fraud or tax evasion monitoring and prevention purposes”

bad_user · on June 28, 2019

First of all the "legitimate interest" part only works if the publisher can prove that the user data is only used for the stated purpose.

The fact that a third party server handles this is a problem. Because then the publisher has to have a data processing agreement in place with the third party.

This is what makes Google Analytics problematic too. The collection of analytics for improving the service can be a legitimate interest, however the data amendment for Google Analytics basically passes the blame on the publisher. I don't think many publishers read carefully Google's data processing amendment, otherwise they would drop usage of Google Analytics. Actually most publishers aren't even with GDPR for more serious reasons, like not anonymizing the user's IP or sharing data with Google for the purposes of ads targeting.

And there are many questions to be asked here.

Is that data private, for the use of the publisher in question, or is this a shared pool of knowledge between publishers?

If the later, then we have a problem, because even if there is a legitimate interest, it only applies to the publisher being visited. Can a user be blocked due to a profile that was built on another website? We are in murky waters.

---

Then there's always the question ... does the publisher really have a legitimate interest?

Claiming that you can have one under the law, doesn't mean you actually have it. There's a set of conditions that you have to comply with.

For example for the purposes of preventing fraud, at the very least you have to be able to show that fraud is possible. Just because you have a login form that's about managing the user's color preferences on the website doesn't mean that you can transmit the user's traffic to Google.

The requirements for legitimate interests are hard to comply with. And I have a hunch that in this case many websites won't comply.

superasn · on June 27, 2019

There are a lot of sites that are totally unusable on Firefox regardless how much you use ff.

I do all my mobile browsing on FF yet when I try to use some websites I always get this Recaptcha failed error(1) while it works flawlessly on chrome though I never use it often. Try it, maybe it will happen for you too.

Same happens on most sites which show you that "checking your browser" page via cloudflare too.

The web is very unusable unless you're using chrome because of such antics.

(1) https://cdn3.imggmi.com/uploads/2019/6/27/0dd96b25707ce6e236...

ulfw · on June 27, 2019

It's even worse when you're running a VPN (especially one of the major public ones). When I see reCAPTCHA I basically give up as sometimes I have to go through 6 or 7 full sets to be let into a site. It's the evil of the internet this.

oil25 · on June 27, 2019

reCAPTCHA on VPN is difficult, but on the Tor network, they are downright impossible. I've never been able to get past it, even after a few dozen painful attempts. That means Google services are entirely off-limits over Tor, even Search, which is a disgrace.

beefhash · on June 27, 2019

> That means Google services are entirely off-limits over Tor

If only it was Google services alone. CloudFlare loves serving up a ReCAPTCHA for Tor users before they can even passively read site contents. That hugely expands the damage done.

deftnerd · on June 27, 2019

Install the PrivacyPass Firefox or Chrome extension. It was developed by Cloudflare, Firefox, and Tor in partnership. It has you answer a ReCAPTCHA and using some crypto magic, generate a bunch of CAPTCHA bypass tokens that can't be traced to your specific computer.

https://support.cloudflare.com/hc/en-us/articles/11500199265...

https://blog.cloudflare.com/cloudflare-supports-privacy-pass...

https://blog.cloudflare.com/privacy-pass-the-math/

https://github.com/privacypass/challenge-bypass-extension

johndough · on June 27, 2019

Does not work with Tor.

The plugin requires "privacy passes". Those passes can be obtained by solving captchas, but when trying to do so, one is greeted with this message about being blocked: https://i.imgur.com/qXJfl6J.png

piyush_soni · on June 28, 2019

Slightly off-topic, but the users who use Tor regularly, how do you do that? For me, it has been terribly slow every time I tried to use it.

ronsor · on June 29, 2019

On Tor I get roughly 700 KB/s speeds, which isn't terrible for me

piyush_soni · on June 30, 2019

And what is that compared to any regular browser?

beardog · on June 27, 2019

Try rebuilding your Tor circuit when this happens.

https://tb-manual.torproject.org/managing-identities/

Sirened · on June 27, 2019

This sort of breaks tor though, doesn't it? Tor works really well if you stay on the same circuit for a while since it reduces the chances you have a compromised circuit. If you start getting recaptcha to block every exit node except those you control, you essentially have amplified your effective strength on the tor network.

AlexCoventry · on July 4, 2019

Tor is already broken for an adversary with that capability.

redwards510 · on June 27, 2019

This sounds pretty good, but you still have to pass a captcha in order to get a pass, and sometimes that is impossible (or at least I just give up because I lost interest after 20 puzzles).

If it was developed in conjunction with Tor, how come it doesn't come bundled with the Tor browser or Tails?