At the very least, an action which is effectively instantaneous should never be opt-out. That is, if they can flip a switch and scan millions of blogs in a few hours, or days, or even weeks, you need to be able to opt out before that happens. It shouldn't be a race between you and the website to see whether you can find the opt-out toggle before you lose control of your property.
A lesser example of this is when you sign up for a website and are immediately opted in to their newsletter and various other spam email lists. The opt-in happened in ~1ms, but even if you opt out immediately after your first login, you'll already have been added to their lists by default.
I've always been amused by the fact that they'll bomb me with newsletters and product sales pitches within a minute of signing up for their service, but removing me from their lists may take up to 5-7 business days.
It made more sense when I understood that saying it may take up to 10 business days to opt me out was a statement recognizing the legal requirements rather than the technical requirements. I'm sure some companies just wait exactly 9 days, 23 hours, 59 minutes and 59 seconds to instantaneously opt me out. Malicious compliance, or as we used to call it, passive-aggressiveness.
We need a legally enforceable mechanism to make anyone think twice
before using data without consent. With AI training models, this must
have the effect that if a person's data has been incorporated without
consent, or they legally revoke assumed consent, the data must be
removed. At present, AFAICS that would mean voiding an entire
model. That would be expensive and so be a stiff deterrent against
abuse. Copyright has failed. But there are surely other tricks to
play.
> The data subject must also be informed about his or her right to withdraw consent anytime. The withdrawal must be as easy as giving consent.
> Last but not least, consent must be unambiguous, which means it requires either a statement or a clear affirmative act. Consent cannot be implied and must always be given through an opt-in, a declaration or an active motion, so that there is no misunderstanding that the data subject has consented to the particular processing.
> For especially severe violations, listed in Art. 83(5) GDPR, the fine framework can be up to 20 million euros, or in the case of an undertaking, up to 4 % of their total global turnover of the preceding fiscal year, whichever is higher. But even the catalogue of less severe violations in Art. 83(4) GDPR sets forth fines of up to 10 million euros, or, in the case of an undertaking, up to 2% of its entire global turnover of the preceding fiscal year, whichever is higher.
Seriously though, the hole I am trying to plug here is remedy.
Where is the individual remedy against someone who takes my data for
their own profit? Sure I can get an official body to act for me,
impose fines etc. But if, at near zero cost, I can force them to
delete it - no exceptions - that's a powerful response. I'm probably
mistaken but right now I don't see that to hand despite the fancy
wording about data "belonging to the subject".
The issue is that every business operator relying on this would, overnight, have to shift massive amounts of capital into their legal departments to create a remediation pipeline and fundamentally shift their architecture to include permission-first sourcing of training information. Most importantly, the Courts would be instantly swamped with just cases dealing with this kind of claim for years, while lobbying went on in parallel to cripple any such remedy.
In short, what you propose is the abolition of the click-through EULA, and a return to the exercise of contract law as originally intended. This kills the tech-enabled business sector, because they'd essentially have to de-scale or die under an avalanche of judicial remediation.
It would definitely be a very good thing to restore real contract law.
The "click-through EULA" was a fork in the road where our legal system
went wrong and slid into a ditch, and things have been spiralling
downwards ever since. Tech companies relying on unjust methods
absolutely should go bust, and good riddance.
> instantly swamped with just cases dealing
That's interesting. It's almost tempting to see that as an argument
against justice :) Right now in the UK we've potentially 4k - 5k
pending cases for about £600,000 compensation each against the Post
Office. It is already technically bankrupt on paper and will have to
be dissolved and re-nationalised under the next government.
So in many places in the world, the backlog of cases that courts
cannot deal with is the only thing keeping some large criminal
enterprises alive. In other words the sheer scale of their
injustice is what protects them. What did Stalin say, "A single death
is a tragedy, a million deaths is a statistic"?
>That's interesting. It's almost tempting to see that as an argument against justice :)
The almighty pocket veto. If I never get the chance to get around to it, did it ever really happen?
Make no mistake; this is one of those sneaky things you can do to ensure that even if something you don't like is the case, it never really materializes substantively, or fast enough to matter.
It's one of my major issues with legislation that creates new remedies without taking into account the means through which that process will be executed. While an independent judiciary is crucial, there does need to be affordance for info propagation.
Why do you believe training an AI model shouldn’t be considered fair use, especially when models are predominantly trained on publicly available information? It’s a completely transformative work where the model contains no part of the original data.
If you shop at a chain store, they generally ask for your email and phone number, usually in exchange for discounts. Giving them this information opts you into their email (and sometimes text) spam lists, meaning several messages a day, some of which might include relevant information about a sale or discount you're interested in.
The end result is that people stop looking at their email because it's mostly marketing spam and they miss genuinely important emails.
I think the tech industry understands this perfectly well. Large parts also understand that actually asking for and respecting consent will reduce their income stream some. The primary directive of the tech industry is to profit regardless of the cost or harm to others.
Interesting choice of words. As I understand it, and I do not have the
judgement transcript, in the British Post Office case those were the
words on which Judge Fraser found for the postmasters and against the
Post Office.
I think the key words may have been "above all else" with respect to
the "interests of the Post Office".
You don't get to say that. Ever.
In law, at least in Britain as I can see it, you don't get to put
reputation or profit or anything else "above all else" or "regardless
of the cost or harm to others". YMMV elsewhere.
Case in point: when Apple implemented a change to iOS requiring apps to ask if you want to be tracked, I believe almost everybody said no, which cost the industry a great deal of ad money.
In essence, this is equal to slavery - "someone pays me to flood you with SPAM - if you refuse that SPAM, I will starve to death; so please don't refuse the SPAM"
We know that bees fly to the flowers by their own wish.
Imagine how bees are sitting in their hive and out of nowhere thousands and thousands of flowers are coming to the hive and advertising "Here, be-be-be-beees, the best floral pollen for you! Take 3 for the price of 2!"
If you were a bee - would you consider this normal?
It is, in that it takes advantage of people's speculative nature.
"What if we had all this extra data on our customers?! Then we could do a lot more to make more money! So, it makes sense we ought to pay more for that data too!"
The data provider gets to profit based on this speculative "of course more information/more stuff is better!", while the data purchaser doesn't really have a chance to test whether their speculation is well-founded. (How precisely do you test what the causative effect of a change to your ad campaign is? If you make a change and the numbers go up, how do you separate the increase caused by your change from increases caused by other parallel factors? It's hard to be scientific with advertising.)
The person selling the tool always wins, even if the tool is nothing more than a placebo. The person buying the tool, on the other hand... well, good thing they have deep pockets and don't really know how to spend their money well, and good thing too that there are always more suckers ready to take their place (as long as we keep the birth rate up)!
It's a good point that data providers can make a profit by opening this up as a new revenue stream. We'll see how it ends up working out for the end users of that data, I guess.
As a consumer, I agree with this take. Looking at it from the other side though, many businesses simply wouldn’t exist online with opt-in. To some extent, you need to understand that companies need to make a profit and we’ve developed a market where that’s not going to be direct payments by users.
Imagining an opt-in policy highlights how unethical these AI data-harvesting schemes really are. It's blindingly obvious that almost nobody would actively choose to donate their work to enrich AI companies without getting anything in return.
Guess I'm 'almost nobody'. I see zero issue with AI farming Stack Overflow, Twitter, Reddit, and the like - publicly accessible forums. The value to me was in the discussion. It's happened; I've extracted my value from that engagement. If an AI company can also extract value from that collection of discussions, it costs me nothing and I expect no compensation.
Though AI companies farming copyrighted work, on the other hand, that's a different story.
The article isn't talking about public discussion forums at all, it's talking about WordPress and its owner, Automattic. The article is a blog post on a WordPress-hosted blog. It then goes on to talk about consent in software in general.
Personally, I'm not really okay with what you're calling public forum harvesting either. I've put a lot of work into Stack Exchange answers and I am not okay with a for-profit company recycling and possibly outright regurgitating that work without attribution. (The latter would be a flagrant violation of CC BY-SA, of course.)
I understand that the author is mad and wants things to be opt-in. But I also think the author is smart enough to know that the tech industry understands consent just fine. It just doesn't care.
twitter and reddit will soon be bots talking to bots (if they aren't already) so the AI can train on that.
> Though AI companies farming copyrighted work on the other hand, that's a different story.
Copyright also happens to be opt-out. You have to explicitly say “this is not copyrighted” for copyright to stop applying.
See your comment and my reply? Both copyrighted. Right now. As soon as we hit publish we started to own copyright. There is an EULA somewhere on HN that probably says we give HN implicit permission to host this content in perpetuity and can even make it available in APIs, show it to bots, etc. But that’s not the same as no copyright. If somebody who is not HN wants to screenshot this comment and publish it in a book, they in theory have to find us and ask for permission.
> Copyright also happens to be opt-out. You have to explicitly say “this is not copyrighted” for copyright to stop applying.
This isn't possible under US copyright law. You can say "this is not copyrighted" all you want, but it's still copyrighted. The closest you can get to voluntarily putting something in the public domain is to refuse to take enforcement actions against violations of your copyright.
Search engines link back to the original sources, making them discoverable which is the “payment” for allowing it. That’s a very different use than an AI that doesn’t even know where the original content came from and provides nothing back to the original creator.
Inevitably there will be copyrighted images, audio, and text mixed in with random social updates and discussions. It should be on the LLM builder to seek active consent, rather than everyone else to be vigilant and/or sue to get their work out of the model's data.
> I see zero issue with AI farming Stack Overflow,
> Though AI companies farming copyrighted work on the other hand, that's a different story.
All posts on Stack Overflow are still the copyright of their respective posters. They are offered publicly under a Creative Commons license that requires attribution.
In the US, everything you write anywhere online is copyrighted by you, unless you sign a copyright assignment agreement. It's automatic any time you put an expression into a fixed form, and there is no way to revoke that copyright.
As I understand it copyright has failed. Or rather we are into an age
of naked double standards where courts will enforce the copyright of
big-tech against you for "stealing" a movie, but will not enforce your
copyright against big-tech for "stealing" your data for its AI.
Copyright is still deeply important to prevent behemoths from just straight-up taking stuff individuals wrote and profiting from it with no consequences.
For instance, without copyright, traditional publishers could just take everything the authors they currently contract with have written, and every other current author, and publish it without paying the authors a cent.
ML training is a legal gray area right now, because it's a new thing, and we haven't had time to properly sit down and both understand what its effects are, and how it should be treated legally. It is possible that this process, when it ends up happening, will be captured by corruption; it is possible that it won't. But using the current frustrations and anger about ML training as evidence that copyright has "failed" is a vast oversimplification that ignores the very real good that copyright does in our society.
It's failing right now to protect millions/billions of people,
because we've decided that it's "legal gray area right now".
Maybe it should be, I don't know. I mean maybe it's time we said bye
bye to copyright?
There could be flip sides. If the world decides that ML sidesteps
copyright then I look forward to the entire corpus of LibGen, SciHub
etc being legally released as open models and the overnight demise of
Elsevier et al. (I once wrote a fiction about that [0])
My objection here is to seeing the clear wishes of the majority being
trodden over roughshod.
> I mean maybe it's time we said bye bye to copyright?
This is exactly the kind of oversimplified, baby-with-the-bathwater proposal I was talking about.
No, we should not "say bye bye to copyright". We should actually take the harder, more complex steps, requiring actual critical thinking and analysis, to fix the problem, rather than just pretending that a one-step grand gesture will be a magic bullet.
> We should actually take the harder, more complex steps, requiring actual critical thinking and analysis,
Those are fine words. We're all about critical thinking and analysis
round here. But the way I see it, folks already did some real hard
critical thinking, and their analysis was "bollocks to that!"
And the judges said, "sorry the law that applies to you doesn't apply
when big money is involved". One rule for you, another rule for them.
So I'm kinda thinking we'll maybe have to get a little more critical
than you might be comfortable with.
> to fix the problem
It's always a good idea to pause right there. What is the problem? I
mean seriously... what exactly is the problem here? Because from
where I see it, the problem is a massive power imbalance.
And it's a structural one. Because AI training compute and global
crawling/scraping is expensive and in the hands of the few.
I don't think this problem would look the same if every kid was
running AI training on a Raspberry Pi, and hoovering down JSTOR like
Aaron Swartz. People would be getting arrested, no?
Well, yes. The problem you are identifying is primarily a structural power imbalance.
It is not a structural power imbalance that can be fixed by abolishing copyright. Indeed, abolishing copyright is vastly more likely to hugely increase the power imbalance.
You are looking at the problem too narrowly (identifying it as "a problem with copyright", rather than "a problem with the power structures in our society"; "AI training and compute...in the hands of the few" rather than "most of the money and resources in the hands of the few"), and thus coming to counterproductive conclusions about how we might solve it.
It's very satisfying to imagine taking a big hammer to a system we know to be corrupt and serving those without our interests at heart. But just smashing the system does not build a new one in its place. And until you address the power imbalances, any system built to replace one you smash—assuming you can manage to do the smashing, which is highly suspect—is nearly guaranteed to simply be designed to serve the desires of the powerful even more than the one we have now.
Some good thoughts, though maybe you underestimate my bead on the
world, and perhaps overestimate my desire for "smashing". A more
peaceful, and just, time when we simply take their toys away will
come. That is certain. A question of "intellectual property"
remains. In a post-exploitation world, would we still want or need it?
Let's hope we keep living to see how it pans out. Respects.
Think of it this way. If Google makes $300 billion per year scraping 1 trillion webpages, how much money should it pay for each webpage it scraped?
There's a point in bulk scraping where the logistics of giving people real money for data make no sense. The payout would be too small to be worth thinking about, and the fees and costs of paying anyone in the world would be higher than the actual amount being paid!
I'm not saying it's morally right, I'm saying the only way to be commercially successful is to try to get away with it.
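As a hedged back-of-envelope check, taking those two made-up figures at face value (the revenue and page counts are the hypothetical numbers above, not real ones):

    # Back-of-envelope division using the hypothetical figures above.
    annual_revenue = 300e9   # $300 billion per year (hypothetical)
    pages_scraped = 1e12     # 1 trillion webpages (hypothetical)

    per_page = annual_revenue / pages_scraped
    print(f"${per_page:.2f} per page per year")  # -> $0.30

Thirty cents a year is in the same ballpark as the fixed fee many card processors charge on a single transaction, so the cost of actually paying it out would be on the order of the payout itself.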
We're properly into the age of all-out "lawfare" now. I wonder what
happened to the likes of Lawrence Lessig and Pamela Jones, and all
those legal minds who used to weigh in on the side of ordinary decent
people. We could use some easily deployed retaliatory weapons and
countermeasures about now.
You can really only laugh when companies like OpenAI say they're working on this problem, and "working on this problem" turns out to be a tedious opt-out form that you need to fill out for each and every piece of work they may or may not have ingested, and no, they won't be retraining the model. It's obvious to anyone with a brain that they are acting in bad faith.
Personally, I feel it would be a much smaller problem if Tumblr had an internal AI thing going on. What users REALLY don't like is that they have confided a post to one website, and that website just shared that post with a third party, because it opens up infinite possibilities.
If Tumblr can take your post and give it to OpenAI, they can take your post and give it to anyone, and that's the problem. Because for users, what they post is "between them and Tumblr" and not anybody else.
I'll even say more. Artists don't care if you scrape their art to make AI generated art with it. Because when this happens, it's "between artists and scrapers" and not anybody else, so it's fine. What they do care about is when people post that AI generated art on the internet, or publish it professionally, or do basically anything with it.
In other words, there's a sense of privacy when there is only two parties involved, no matter what is going on, but the instant a third party gets into the system, the first party will freak out, because then you lose that sense of privacy.
I think "consent" is not the right word for this, because it's never simply "consent" it's always boundaries and expectations. Consent implies there's always one yes/no answer for a process. In practice the process is always so complicated that it would require countless yes/no questions, to the point nobody sane would want to deal with it.
Just look at cookie banners and imagine if we had a consent popup for every single thing that needs to be downloaded to show a page: do you consent to download the HTML? Do you consent to download the CSS? Do you consent to download the Javascript? Do you consent to download the images? The user just wants to see the page. You have to make assumptions about how far that consent goes, so it's absolutely transitive; the problem is at what point it crosses the line.
Let's think about dating and imagine if we both had to consent to
every single thing that needs to happen to get it on. Do you consent
to dinner with me? Do you consent to coffee at yours? Do you consent
to us getting our clothes off? Do you consent to ... you get the
picture.
Thing is, we actually do this, but silently.
The "assumptions" you rightly speak of are the mutual trust mechanisms
that allow two adults to make clear shortcuts without explicit words.
We move at a certain speed. We check for feedback. We use clever
signals. We base our trust model on the NSA definition that "Trust is
the ability to do harm", in a context where there is some symmetry. Sex
is actually risky for both parties.
And no means no.
But in the digital world we're talking about a massive power
asymmetry. When you're with a big corporation it moves from dinner to
data-rape in a few milliseconds.
No means "yeah but harder".
Consent doesn't disappear because the process got complex. Human
relations are more complex than any web page.
It's about whether we obey the rules and show mutual respect. Big-tech
corporations absolutely do neither, and that's the problem here, not
complexity.
What might level the game, is if when I visit your site with my web
browser, you risk that I could do you some serious harm.
Unfortunately, it's not a human on the other side; it's a system, whether one built on business rules or one built on computer algorithms, and everyone gets the same inflexible system. It's not logistically feasible to create a custom experience for each user based on how far they consent to every little thing, as the use cases are infinite: you may consent to download CSS on this site, but not on that site, or on that site, but not on pages that match this regex, etc. If you tried to do that, your product would immediately flop because most users would find it too annoying and complicated to use compared to any system that makes reasonable assumptions about consent.
I agree that much modern software has no respect for humans, but that's not because the developers didn't want to implement a yes/no consent dialog, or several of them; it's because they think the only boundary they can't cross is what the law says they can't, and everything else is fair game.
Sorry, late to reply after work today, but you really got me thinking.
> everyone gets the same inflexible system. It's not logistically feasible to create a custom experience for each user based on how far they consent to every little thing,
Is it though? Maybe this is where AI is going to be a win - in dynamic
protocols and bespoke interactions. What if we leave my "AI
web-browsing agent" to talk to your "AI web server" to quite literally
negotiate a whole bunch of preferences. Usual game theory applies.
If your bot tries to defect or fuck mine over, or my bot is too greedy
or defective, we both get a suboptimal outcome. Maybe I am happy to
trade certain bits of data like location or budget but will withhold
or lie about others, say age and gender. In response I might not get
all the info I seek. By making the value of our transactional
relations explicit some interesting things might happen. Maybe if your
web-server agent is a bit too greedy mine will shop around to other
sites etc.
respects.
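To make that concrete, here is a toy sketch of such a negotiation (purely illustrative; the field names and "value" numbers are invented, and no real agent protocol is implied):

    # Toy sketch: a browser-side agent reveals only the fields its user
    # allows, and the site-side agent grants a "service level" in return.
    # All field names and numbers are invented for illustration.

    browser_policy = {      # what my agent is willing to reveal
        "location": True,
        "budget": True,
        "age": False,
        "gender": False,
    }

    site_request = {        # fields the site asks for, with the relevance
        "location": 3,      # "points" it offers in exchange for each
        "budget": 2,
        "age": 2,
        "gender": 1,
    }

    def negotiate(policy, request):
        shared = [field for field in request if policy.get(field, False)]
        service_level = sum(request[field] for field in shared)
        return shared, service_level

    shared, level = negotiate(browser_policy, site_request)
    print(f"shared: {shared}, service level: {level}")
    # Withholding age and gender drops the service level from 8 to 5,
    # making the trade-off explicit, which is the point of the exercise.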
Then the AI would be the one making assumptions about your consent. When it gets it wrong and your boundaries are violated, everyone will just shrug and say "well but the AI said it was okay."
In my opinion, algorithms can not fix this. The developer needs to know clearly what boundaries they can't cross, and the only way for that to happen is if there are obvious penalties and deterrents. If Microsoft commits egregious privacy violations but it makes them more money than it costs them, why would they stop?