The Tech Industry Doesn't Understand Consent – Opt-Out Is Not Consent (soatok.blog)
126 points by todsacerdoti 10 months ago | 67 comments



At the very least, an action which is effectively instantaneous should never be opt-out. That is, if they can flip a switch and scan millions of blogs in a few hours, or days, or even weeks, you need to be able to opt out before that happens. It shouldn't be a race between you and the website to see if you can find the opt-out toggle before you lose control of your property.

A lesser example of this is when you sign up for a website, and are immediately opted in to their newsletter and various other spam email lists. The opt-in happened in ~1ms, but even if you opt out immediately after your first login, you'll still get added to their list by default.


I've always been amused by the fact that they'll bomb me with newsletters and product sales pitches within a minute of signing up for their service, but removing me from their lists may take up to 5-7 business days.


It made more sense when I understood that saying it may take up to 10 business days to opt me out was a statement recognizing the legal requirements rather than the technical requirements. I'm sure some companies just wait exactly 9 days, 23 hours, 59 minutes and 59 seconds to instantaneously opt me out. Malicious compliance, or as we used to call it, passive-aggressiveness.


We need a legally enforceable mechanism to make anyone think twice before using data without consent. For AI training models, the effect must be that if a person's data has been incorporated without consent, or they legally revoke assumed consent, the data must be removed. At present, AFAICS, that would mean voiding an entire model. That would be expensive, and so a stiff deterrent against abuse. Copyright has failed. But there are surely other tricks to play.


Some kind of Regulation of Data to Protect it. And maybe make it General?

https://gdpr-info.eu/issues/consent/

> The data subject must also be informed about his or her right to withdraw consent anytime. The withdrawal must be as easy as giving consent.

> Last but not least, consent must be unambiguous, which means it requires either a statement or a clear affirmative act. Consent cannot be implied and must always be given through an opt-in, a declaration or an active motion, so that there is no misunderstanding that the data subject has consented to the particular processing.

https://gdpr-info.eu/issues/fines-penalties/

> For especially severe violations, listed in Art. 83(5) GDPR, the fine framework can be up to 20 million euros, or in the case of an undertaking, up to 4 % of their total global turnover of the preceding fiscal year, whichever is higher. But even the catalogue of less severe violations in Art. 83(4) GDPR sets forth fines of up to 10 million euros, or, in the case of an undertaking, up to 2% of its entire global turnover of the preceding fiscal year, whichever is higher.
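
To make that fine framework concrete, here is a rough sketch of the cap calculation (the turnover figures are made-up examples, not real companies):

    # Sketch of the Art. 83 GDPR fine ceiling quoted above.
    def gdpr_fine_cap(global_turnover_eur, severe=True):
        flat_cap = 20_000_000 if severe else 10_000_000
        pct_cap = (0.04 if severe else 0.02) * global_turnover_eur
        return max(flat_cap, pct_cap)  # "whichever is higher"

    print(gdpr_fine_cap(50_000_000_000))  # 2.0 billion: the 4% cap dominates the 20m floor
    print(gdpr_fine_cap(200_000_000))     # 20 million: the flat cap dominates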


This would never work!

Seriously though, the hole I am trying to plug here is remedy.

Where is the individual remedy against someone who takes my data for their own profit? Sure I can get an official body to act for me, impose fines etc. But if, at near zero cost, I can force them to delete it - no exceptions - that's a powerful response. I'm probably mistaken but right now I don't see that to hand despite the fancy wording about data "belonging to the subject".


The issue is that every business operator relying on this would, overnight, have to shift massive amounts of capital into their legal departments to create a remediation pipeline, fundamentally rework their architecture so that training data is collected permission-first, and, most importantly, the courts would be instantly swamped with just cases dealing with this for years, while lobbying went on in parallel to cripple any such remedy.

In short, what you propose is the abolition of the click-through EULA, and a return to the exercise of contract law as originally intended. This kills the tech-enabled business sector, because they'd essentially have to de-scale or die under an avalanche of judicial remediation.

Which I want, to be quite frank. It'd be awesome.


It would definitely be a very good thing to restore real contract law. The "click-through EULA" was a fork in the road where our legal system went wrong and slid into a ditch, and things have been spiralling downwards ever since. Tech companies relying on unjust methods absolutely should go bust, and good riddance.

> instantly swamped with just cases dealing

That's interesting. It's almost tempting to see that as an argument against justice :) Right now in the UK we've potentially 4k - 5k pending cases for about £600,000 compensation each against the Post Office. It is already technically bankrupt on paper and will have to be dissolved and re-nationalised under the next government.

So in many places in the world, the backlog of cases that courts cannot deal with is the only thing keeping some large criminal enterprises alive. In other words the sheer scale of their injustice is what protects them. What did Stalin say, "A single death is a tragedy, a million deaths is a statistic"?


>That's interesting. It's almost tempting to see that as an argument against justice :)

The almighty pocket veto. If I never get the chance to get around to it, did it ever really happen?

Make no mistake; this is one of those sneaky things you can do to ensure that even if something you don't like is the case, it never really materializes substantively, or fast enough to matter.

It's one of my major issues with legislation that creates new remedies without taking into account the means through which that process will be executed. While an independent judiciary is crucial, there does need to be affordance for info propagation.


Why do you believe training an AI model shouldn’t be considered fair use, especially when they are predominantly trained on publicly available information? It’s a completely transformative work where the model contains no part of the original data.


Google algorithmic disgorgement


If you shop in a chain store, they generally ask for your email and phone number, usually in exchange for discounts. Giving them this information opts you into their email and sometimes text spam lists, meaning several messages a day, some of which might include relevant information about a sale or discount you're interested in.

The end result is that people stop looking at their email because it's mostly marketing spam and they miss genuinely important emails.


> A lesser example of this is when you sign up for a website

This is one of the big reasons why I do not create accounts or sign up for things online if I can possibly avoid it.


I think the tech industry understands this perfectly well. Large parts also understand that actually asking for and respecting consent will reduce their income stream some. The primary directive of the tech industry is to profit regardless of the cost or harm to others.


> regardless of the cost or harm to others

Interesting choice of words. As I understand it, and I do not have the judgement transcript, in the British Post Office case those were the words on which Judge Fraser acquitted the postmasters and found against the Post Office.

I think the key words may have been "above all else" with respect to the "interests of the Post Office".

You don't get to say that. Ever.

In law, at least in Britain as I can see it, you don't get to put reputation or profit or anything else "above all else" or "regardless of the cost or harm to others". YMMV elsewhere.


Case in point would be when Apple implemented a change to iOS requiring apps to ask if you want to be tracked. I believe almost everybody says no, which cost the industry a great deal of ad money.


In essence, this is equal to slavery - "someone pays me to flood you with SPAM - if you refuse that SPAM, I will starve to death; so please don't refuse the SPAM" We know that bees fly to the flowers by their own wish. Imagine how bees are sitting in their hive and out of nowhere thousands and thousands of flowers are coming to the hive and advertising "Here, be-be-be-beees, the best floral pollen for you! Take 3 for the price of 2!" If you were a bee - would you consider this normal?


The kicker is that a lot of this stuff isn't even profitable.


It is, in that it takes advantage of people's speculative nature.

"What if we had all this extra data on our customers?! Then we could do a lot more to make more money! So, it makes sense we ought to pay more for that data too!"

The data provider gets to profit based on this speculative "of course more information/more stuff is better!", while the data purchaser doesn't really have a chance to test whether their speculation is well-founded. (How precisely do you test what the causative effect of changes to your ad campaign is? If you make a change, and numbers go up, how do you separate that from numbers going up because of any of the other parallel factors? It's hard to be scientific with advertising.)

The person selling the tool always wins, even if the tool is nothing more than a placebo effect. The person buying the tool, on the other hand... well, good thing they have deep pockets, and don't really know how to spend their money well, and good thing too that there are always more suckers ready to take their place (as long as we keep the birth rate up)!


It's a good point that data providers can make a profit opening this up as a new revenue stream. We'll see how it ends up working out for the end users of that data, I guess.


That is interesting, because I think you're right, and so it raises the question, what for?

Is it just a kleptomaniac instinct to hoard data, despite the fact that it is really a liability [0] in most cases?

[0] https://www.schneier.com/essays/archives/2016/03/data_is_a_t...


I think it's a mix of:

1) True believers who think anything that can be done to get us further to AGI should be done.

2) Business types who think this could be the next internet and don't want to miss the train and who are willing to lose some money to get in early.


Both of those camps terrify me. :)


As a consumer, I agree with this take. Looking at it from the other side though, many businesses simply wouldn’t exist online with opt-in. To some extent, you need to understand that companies need to make a profit and we’ve developed a market where that’s not going to be direct payments by users.


> many businesses simply wouldn’t exist online with opt-in.

If a business can't exist without abusing people, perhaps it shouldn't exist.


Imagining an opt-in policy highlights how unethical these AI data-harvesting schemes really are. It's blindingly obvious that almost nobody would actively choose to donate their work to enrich AI companies without getting anything in return.


Guess I'm 'almost nobody'. I see zero issue with AI farming Stack Overflow, Twitter, Reddit, and the like - publicly accessible forums. The value to me was in the discussion. It's happened; I've extracted my value from that engagement. If an AI company can also extract value from that collection of discussions, it costs me nothing and I expect no compensation.

Though AI companies farming copyrighted work, on the other hand, that's a different story.


The article isn't talking about public discussion forums at all, it's talking about WordPress and its owner, Automattic. The article is a blog post on a WordPress-hosted blog. It then goes on to talk about consent in software in general.

Personally, I'm not really okay with what you're calling public forum harvesting either. I've put a lot of work into Stack Exchange answers and I am not okay with a for-profit company recycling and possibly outright regurgitating that work without attribution. (The latter would be a flagrant violation of CC BY-SA, of course.)


I understand that the author is mad and wants things to be opt-in. But I also think the author is smart enough to know that the tech industry understands consent just fine. It just doesn't care.

twitter and reddit will soon be bots talking to bots (if they aren't already) so the AI can train on that.


> Though AI companies farming copyrighted work, on the other hand, that's a different story.

Copyright also happens to be opt-out. You have to explicitly say “this is not copyrighted” for copyright to stop applying.

See your comment and my reply? Both copyrighted. Right now. As soon as we hit publish we started to own copyright. There is an EULA somewhere on HN that probably says we give HN implicit permission to host this content in perpetuity and can even make it available in APIs, show it to bots, etc. But that’s not the same as no copyright. If somebody who is not HN wants to screenshot this comment and publish it in a book, they in theory have to find us and ask for permission.


> Copyright also happens to be opt-out. You have to explicitly say “this is not copyrighted” for copyright to stop applying.

This isn't possible under US copyright law. You can say "this is not copyrighted" all you want, but it's still copyrighted. The closest you can get to voluntarily putting something in the public domain is to refuse to take enforcement actions against violations of your copyright.


Do you take issue with search engines indexing your comments, making them discoverable, displaying them in search results? Possibly near ads?


Search engines link back to the original sources, making them discoverable which is the “payment” for allowing it. That’s a very different use than an AI that doesn’t even know where the original content came from and provides nothing back to the original creator.


Inevitably there will be copyrighted images, audio, and text mixed in with random social updates and discussions. It should be on the LLM builder to seek active consent, rather than everyone else to be vigilant and/or sue to get their work out of the model's data.


> Guess I'm 'almost nobody'.

When you put it that way, yes, you obviously are. So is every other individual.


> I see zero issue with AI farming Stack Overflow,

> Though AI companies farming copyrighted work, on the other hand, that's a different story.

All posts on Stack Overflow are still the copyright of their respective posters. They are offered publicly under a Creative Commons license that requires attribution.


In the US, everything you write anywhere online is copyrighted by you, unless you sign a copyright assignment agreement. It's automatic any time you put an expression into a fixed form, and there is no way to revoke that copyright.


As I understand it, copyright has failed. Or rather, we are into an age of naked double standards where courts will enforce the copyright of big-tech against you for "stealing" a movie, but will not enforce your copyright against big-tech for "stealing" your data for its AI.

EDIT: some links [0,1]

[0] https://www.reuters.com/legal/litigation/us-judge-finds-flaw...

[1] https://www.hollywoodreporter.com/business/business-news/art...


Copyright has absolutely not "failed".

Copyright is still deeply important to prevent behemoths from just straight-up taking stuff individuals wrote and profiting from it with no consequences.

For instance, without copyright, traditional publishers could just take everything the authors they currently contract with have written, and every other current author, and publish it without paying the authors a cent.

ML training is a legal gray area right now, because it's a new thing, and we haven't had time to properly sit down and work out both what its effects are and how it should be treated legally. It is possible that this process, when it ends up happening, will be captured by corruption; it is possible that it won't. But using the current frustrations and anger about ML training as evidence that copyright has "failed" is a vast oversimplification that ignores the very real good that copyright does in our society.


Fair comments. Please allow me to rephrase that.

It's failing right now to protect millions or billions of people, because we've decided that it's a "legal gray area right now".

Maybe it should be, I don't know. I mean maybe it's time we said bye bye to copyright?

There could be flip sides. If the world decides that ML sidesteps copyright then I look forward to the entire corpus of LibGen, SciHub etc being legally released as open models and the overnight demise of Elsevier et al. (I once wrote a fiction about that [0])

My objection here is to seeing the clear wishes of the majority being trodden over roughshod.

[0] https://www.timeshighereducation.com/opinion/2048-informatio...


> I mean maybe it's time we said bye bye to copyright?

This is exactly the kind of oversimplified, baby-with-the-bathwater proposal I was talking about.

No, we should not "say bye bye to copyright". We should actually take the harder, more complex steps, requiring actual critical thinking and analysis, to fix the problem, rather than just pretending that a one-step grand gesture will be a magic bullet.


> We should actually take the harder, more complex steps, requiring actual critical thinking and analysis,

Those are fine words. We're all about critical thinking and analysis round here. But way I see it, folks already did some real hard critical thinking and their analysis was "bollocks to that!"

And the judges said, "sorry the law that applies to you doesn't apply when big money is involved". One rule for you, another rule for them.

So I'm kinda thinking we'll maybe have to get a little more critical than you might be comfortable with.

> to fix the problem

It's always a good idea to pause right there. What is the problem? I mean seriously... what exactly is the problem going on here? Because from where I see it, the problem is a massive power imbalance.

And it's a structural one. Because AI training compute and global crawling/scraping is expensive and in the hands of the few.

I don't think this problem would look the same if every kid was running AI training on a Raspberry Pi, and hoovering down JSTOR like Aaron Swartz. People would be getting arrested, no?


Well, yes. The problem you are identifying is primarily a structural power imbalance.

It is not a structural power imbalance that can be fixed by abolishing copyright. Indeed, abolishing copyright is vastly more likely to hugely increase the power imbalance.

You are looking at the problem too narrowly (identifying it as "a problem with copyright", rather than "a problem with the power structures in our society"; "AI training and compute...in the hands of the few" rather than "most of the money and resources in the hands of the few"), and thus coming to counterproductive conclusions about how we might solve it.

It's very satisfying to imagine taking a big hammer to a system we know to be corrupt and serving those without our interests at heart. But just smashing the system does not build a new one in its place. And until you address the power imbalances, any system built to replace one you smash—assuming you can manage to do the smashing, which is highly suspect—is nearly guaranteed to simply be designed to serve the desires of the powerful even more than the one we have now.


Some good thoughts, though maybe you underestimate my bead on the world, and perhaps overestimate my desire for "smashing". A more peaceful, and just, time when we simply take their toys away will come. That is certain. A question of "intellectual property" remains. In a post-exploitation world, would we still want or need it? Let's hope we keep living to see how it pans out. Respects.


Think of it this way. If Google makes 300 billion per year scraping 1 trillion webpages, how much money should it pay for each webpage it scraped?

There's a point in bulk scraping where the logistics of giving people real money for data make no sense. The payout would be too small to waste time thinking about, and the fees and costs of paying anyone in the world would be higher than the actual amount being paid!

I'm not saying it's morally right, I'm saying the only way to be commercially successful is to try to get away with it.
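
A quick back-of-the-envelope on those (entirely hypothetical) figures:

    # Revenue per scraped page, using the round numbers from the comment above.
    revenue_per_year = 300e9   # $300 billion per year
    pages_scraped = 1e12       # 1 trillion webpages
    per_page = revenue_per_year / pages_scraped
    print(f"${per_page:.2f} per page")   # $0.30 per page

    # An assumed fixed payment-processing fee of ~$0.30 per payout would already
    # consume the entire per-page amount, before any cost of finding and
    # verifying each page's owner.
    assumed_fee = 0.30
    print(per_page - assumed_fee <= 0)   # True: nothing left to actually pay out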


We're properly into the age of all-out "lawfare" now. I wonder what happened to the likes of Lawrence Lessig and Pamela Jones, and all those legal minds who used to weigh in on the side of ordinary decent people. We could use some easily deployed retaliatory weapons and countermeasures about now.


LegalEagle on YouTube speaks on cases from a layman's perspective.


You can really only laugh when companies like OpenAI say they're working on this problem, and their "working on this problem" turns out to be a tedious opt-out form that you need to fill out for each and every piece of work they may or may not have ingested, and no, they won't be retraining the model. It's obvious to anyone with a brain that they are acting in bad faith.


And no means no. It doesn't mean "sure you can neg me later".


Oops, missed that, we're updating your operating system right now.


Like sexual consent, they understand it perfectly well, rather, it just goes directly against their plans.


Personally, I feel it would be a much smaller problem if Tumblr had an internal AI thing going on. What users REALLY don't like is that they have confided a post to one website, and that website just shared that post with a third party, because it opens up infinite possibilities.

If Tumblr can take your post and give it to OpenAI, they can take your post and give it to anyone, and that's the problem. Because for users, what they post is "between them and Tumblr" and not anybody else.

I'll even say more. Artists don't care if you scrape their art to make AI generated art with it. Because when this happens, it's "between artists and scrapers" and not anybody else, so it's fine. What they do care about is when people post that AI generated art on the internet, or publish it professionally, or do basically anything with it.

In other words, there's a sense of privacy when there are only two parties involved, no matter what is going on, but the instant a third party gets into the system, the first party will freak out, because then you lose that sense of privacy.


You're saying consent isn't transitive.

That's why they always ask for (or more frequently just take) "a paid up, non-exclusive, irrevocable, worldwide license"


I think "consent" is not the right word for this, because it's never simply "consent" it's always boundaries and expectations. Consent implies there's always one yes/no answer for a process. In practice the process is always so complicated that it would require countless yes/no questions, to the point nobody sane would want to deal with it.

Just look at cookie banners and imagine if we had a consent popup for every single thing that needs to be downloaded to show a page: do you consent to download the HTML? Do you consent to download the CSS? Do you consent to download the Javascript? Do you consent to download the images? The user just wants to see the page. You have to make assumptions about how far that consent goes, so it's absolutely transitive; the problem is at what point it crosses the line.


So, please forgive the necessary parallel here.

Let's think about dating and imagine if we both had to consent to every single thing that needs to happen to get it on. Do you consent to dinner with me? Do you consent to coffee at yours? Do you consent to us getting our clothes off? Do you consent to ... you get the picture.

Thing is, we actually do this, but silently.

The "assumptions" you rightly speak of are the mutual trust mechanisms that allow two adults to make clear shortcuts without explicit words. We move at a certain speed. We check for feedback. We use clever signals. We base our trust model on the NSA definition that "Trust is the ability to do harm" in context where there is some symmetry. Sex is actually risky for both parties.

And no means no.

But in the digital world we're talking about a massive power asymmetry. When you're with a big corporation it moves from dinner to data-rape in a few milliseconds.

No means "yeah but harder".

Consent doesn't disappear because the process got complex. Human relations are more complex than any web page.

It's about whether we obey the rules and show mutual respect. Big-tech corporations absolutely do neither, and that's the problem here, not complexity.

What might level the game, is if when I visit your site with my web browser, you risk that I could do you some serious harm.


Unfortunately, it's not a human on the other side, it's a system, whether it's a system built based on business rules or a system built based on computer algorithms, everyone gets the same inflexible system. It's not logistically feasible to create a custom experience for each user based on how far they consent to every little thing, as the use cases are infinite: you may consent to download CSS on this site, but not on that site, or on that site, but not on pages that match this regex, etc. If you tried to do that, your product would immediately flop because most users would find it too annoying and complicated to use compared to any system that makes reasonable assumptions about consent.

I agree that a lot of modern software has no respect for humans, but that's not because the developers didn't want to implement a yes/no consent dialog or several of them; it's because they think the only boundary they can't cross is what the law says they can't, and everything else is fair game.


Sorry, late to reply after work today, but you really got me thinking, AR.

> everyone gets the same inflexible system. It's not logistically feasible to create a custom experience for each user based on how far they consent to every little thing,

Is it though? Maybe this is where AI is going to be a win - in dynamic protocols and bespoke interactions. What if we leave my "AI web-browsing agent" to talk to your "AI web server" to quite literally negotiate a whole bunch of preferences? Usual game theory applies. If your bot tries to defect or fuck mine over, or my bot is too greedy or defective, we both get a suboptimal outcome. Maybe I am happy to trade certain bits of data like location or budget but will withhold or lie about others, say age and gender. In response I might not get all the info I seek. By making the value of our transactional relations explicit some interesting things might happen. Maybe if your webserver agent is a bit too greedy mine will shop around to other sites etc. Respects.
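
A toy sketch of what that negotiation might look like (every field name and threshold here is invented, purely to illustrate the idea):

    # My agent offers a subset of data fields; the site's agent unlocks content
    # in proportion to how many of the fields it wanted it actually received.
    MY_PREFERENCES = {"location": True, "budget": True, "age": False, "gender": False}

    def my_agent_offer():
        # fields I'm willing to disclose; the rest are withheld (or lied about)
        return {field for field, willing in MY_PREFERENCES.items() if willing}

    def site_agent_counter(offered_fields):
        wanted = {"location", "budget", "age"}
        granted = offered_fields & wanted
        return len(granted) / len(wanted)   # crude stand-in for the site's payoff

    deal_quality = site_agent_counter(my_agent_offer())
    if deal_quality < 0.5:
        print("suboptimal deal for both sides: shop around for another site")
    else:
        print(f"proceed, with {deal_quality:.0%} of the content served")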


Then the AI would be the one making assumptions about your consent. When it gets it wrong and your boundaries are violated, everyone will just shrug and say "well but the AI said it was okay."

In my opinion, algorithms can not fix this. The developer needs to know clearly what boundaries they can't cross, and the only way for that to happen is if there are obvious penalties and deterrents. If Microsoft commits egregious privacy violations but it makes them more money than it costs them, why would they stop?



If I see one more popup with the only available responses being "Yes!" and "Maybe" I am going to go ballistic.



"Yes!" in Green

"Remind me Later!" in Blue

Tiny X in top right corner of the box that's as close to white as possible without actually being white

Honestly should be a prison-worthy offense to display a dialog like that


Amazing that this has 69 upvotes in under an hour and still isn't on the front page.

Algorithmic suppression much?


“It is difficult to get a man to understand something, when his salary depends on his not understanding it.”


Press X to not be violated


"It Is Difficult to Get a Man to Understand Something When His Salary Depends Upon His Not Understanding It"[a]

[a] https://quoteinvestigator.com/2017/11/30/salary/


I didn’t give any consent for any taxes either, but oh well..



