Everyone here thinks Europe is going to kill AI within its borders. Honestly, it reminds me of how everyone derided EU antitrust as just milking Google for fines, until everyone soured on Google and realized what the EU already knew.
The EU is not anti-AI. In fact, it has stronger protections for AI training than the US does: EU law already has a copyright exception for "text and data mining" (TDM) which covers AI training. The problem is that OpenAI has been incredibly cagey about how its models get built and trained. That runs contrary to the spirit of the TDM exception: it exists for scientists to do science with, and OpenAI is behaving far less like a scientific organization and far more like a commercial enterprise.
Isn't GDPR currently the main problem? There's personal information on the Internet. If you are processing crawled data, then you also end up processing the personal information, but you don't have consent for the processing from those individuals.
Given where the top AI companies are, and how much more they matter for AI than the top universities do, this is bafflingly optimistic. There's no reason to expect Europe to be anything other than an also-ran in AI, punching-self-in-face antics aside. Why would anyone think Europe was relevant to AI development?
You don't seem to be getting the point of this. Europe is not attempting to hobble AI development, but to enforce its laws within its borders. If OpenAI can't adhere to them, then it can't do business there, which is a problem for OpenAI, given that the European Union is the third-largest economy in the world.
I could be wrong, but to me this is like a government demanding that trade secrets are revealed.
If people are worried about open AI, they already have the option not to use it.
Sorry, it doesn't look like you're getting it. It's not about AI, it's about personal data. Whether it's AI or any other program using the data, EU citizens have a right to know about and request removal of that personal data.
OpenAI needs to allow users to do this, same as anybody else. If that reveals trade secrets, then so be it.
This is a great first step. It's a joke that OpenAI thinks they can get away with saying they use "both publicly available data (such as internet data) and data licensed from third-party providers" in their Technical Report.
That rules out nothing! With a description like that, they could have used literally anything.
If you're going to pretend to be doing science you should at least be held to some of the standards we typically associate with doing science.
I know the article talks about copyright, but not stating any sources for the data at all is a bad precedent to allow.
Woah. Doing bad science isn't illegal, and making it so would be quite chilling. It's common in many fields to be quite imprecise about data used in the work, and entirely uncommon in many for data to be externally reproducible.
Legislation restricting research isn't the right way to improve science and is unlikely to achieve the intended effect for many reasons, including that it's easier and safer to just not touch the impacted area. In some domains this causes whole areas to go unstudied or understudied, e.g. because it runs into IRB and just isn't worth doing... but at least the rules demanding IRB approval are intended to keep people from suffering grave harm and even those are less strong than blanket regulation (they're rules tied to federal funding, not research in the abstract).
If you're doing any kind of science, recording provenance for your inputs should be table stakes. Bad science isn't about hiding or obscuring the origin of the data, it's about being sloppy, incompetent, or even flat out willfully misinterpreting the results.
We have names for what looks like science but is done without documenting - let alone outright falsifying - where the data came from. Hoax. Advertising. Propaganda. Parallel construction.
Let's not lump incompetence and malice in the same bucket, please. And if you're unsure of the data provenance, then state that fact.
The consumers of scholarship are able to look at a work and determine if it's the sort of thing where access to underlying data is important-- and they're free to discount it when it doesn't provide enough. Journals and grant writers are free to set standards for the work they publish or sponsor-- and they should!
But no one needs to legislate that publications such as your conclusion-- that science done without documenting its sources is properly called Hoax, Advertising, Propaganda, or Parallel construction-- itself properly document its sources. We can take it for what it is, an opinion-- one no doubt supported by some data but none of us need to see it, and we can evaluate it without calling it propaganda. If you wanted to make your point stronger, I'm sure you'd give us some supporting data (if you could figure out where those views came from...).
Though people sometimes pretend otherwise, a lot of research is dressed up informed opinion, put into a formal setting with standardized argument styles so that it can be compared and assessed against other informed opinions. None the less, such work done honestly and diligently advances the human condition.
The ways in which it can be best improved are field specific and can only really be judged by the people attempting to use the scholarship. In some cases the data should be published, in others its provenance documented (sometimes publication of the data would be a violation of the law, too!), in others access to source code should be paramount, in yet others the authors biases may be the primary concern, and in some fields all publications should be directly sent to the incinerator. Applying the wrong standards will just make things worse. People are smart, they tend to figure out what works for them and their field over time.
> Though people sometimes pretend otherwise, a lot of research is dressed up informed opinion, put into a formal setting with standardized argument styles so that it can be compared and assessed against other informed opinions. None the less, such work done honestly and diligently advances the human condition.
I ... think we actually agree here.
In fact, to prove your point: I have no chance of accounting for the origins of my opinions, because they stem from decades of osmosis and subjective experiences. But I can at least be honest about something I say, do or argue being an opinion. The same way you just did.
Doing bad "science" is not illegal, but maybe should be, considering the replication crisis that is upon us. It diminishes the utility of the work that is being done, and makes it difficult to tell apart actual scientific discoveries from flukes and forgeries.
I want to know if OpenAI used, say, GPL or other copyrighted software and then the bastards had the genius idea to put restrictions on the output in their ToS. I want things to be fair: if MS/OpenAI can train on GPL code, then I should also be allowed to train on MS proprietary code or on Disney images and video. It is not fair that big companies can screw the public but the public can't do the same to the big companies. The first step is clearly to have the big companies reveal whether they used copyrighted material.
> I want to know if OpenAI used say GPL or other copyrighted software and then the bastards had the genius idea to put restrictions on the output in their ToS.
This is a bit of a gray area. Are you allowed to read GPL'ed code and use a similar pattern in a closed source project?
I am a human, not a machine that ingests all the GPL code on the internet and then outputs similar code with very small differences. I am fine with OpenAI and MS using GPL code as long as the open source community can also train on the proprietary code and art of the big companies.
What happens now is that some big companies say it's OK to train on any licensed material, while on the other hand some people are being sued because they did exactly that. I want it clarified ASAP. And personally I would not give a shit about OpenAI's ToS and would use their output however I like, the same way they did.
I agree, but things are changing: many publishers already require a disclosure statement about data. I think both the US and the EU are slowly moving in the direction of open data in scientific research.
What is this "legislation restricting research"? These companies are not doing "science".
Imagine a future where everyone learns from AI instead of books because it is more convenient, faster etc. You would get the same info, but you would not know who was the expert that you learned from. How would that change society if all authors just disappeared behind a generic AI brand? I don't think it would be especially good, and I think it is completely fair that an answer from ChatGPT should provide sources. It would improve the quality.
AI has been trained on all human knowledge up until now, but as it continues ingesting human ingenuity going forward it will remove the incentive for people to create new knowledge or styles or art, since the AI can immediately mimic you and steal it. Promoting creativity is what copyright law was created for, so I've got a feeling we'll be revisiting those laws sooner or later.
> How would that change society if all authors just disappeared behind a generic AI brand?
I've been thinking about that but in terms of famous actors. Once AI can replace actors in movies, as well as singers and models like Instagram influencers and so on, maybe some of the weird hero worship and the paparazzi and the gossip mags and all that nastiness will fade away somewhat. That, I feel, would be a good thing. Pick your favorite Hollywood actor, or pop singer. A supremely talented human... amongst thousands and thousands of supremely talented humans in their field, and yet they are the ones who got lucky, had the right connections, or landed the right role. Then for the rest of their lives they are feted and hero-worshipped as if they are more than human, while thousands of equally talented people who didn't get the lucky break are ignored. That's what being famous is, mostly, and it's not good for either the famous people or the people who worship them.
Does that apply to scientists and authors? I'm not sure. But in terms of scientific breakthroughs it's extremely rare that a particular discovery could only have been made by one person. In fact, nearly every discovery, from the calculus to the theory of evolution to DNA, was concurrently discovered by multiple people. And yet we attach one name to each discovery and hero worship that person because they published a few weeks earlier or were just better at self marketing.
Maybe losing the attachment of famous names to things is a good thing for society. As long as it's not replaced by a corporation pretending ownership of all the knowledge in their place, at least.
> Once AI can replace actors in movies, as well as singers and models like Instagram influencers and so on, maybe some of the weird hero worship and the paparazzi and the gossip mags and all that nastiness will fade away somewhat
I dunno, I remember seeing on youtube Hatsune Miku concerts being pretty packed, and there was that one guy who even married Hatsune Miku. Who knows what'll happen with AI.
Makes me wonder if somebody has trained an LLM on SciHub data. It would be interesting if someone were to marry a symbolic engine like Mathematica with such an LLM/AI.
Yes, there is, but I am not sure if it's actually used to do anything past some symbolic math. IMO true AI could only happen from understanding (and internalizing) physical laws, especially the notion of energy optimization. I feel that until LLMs are somehow married with this knowledge, they will always be parroting back existing 'data' in seemingly creative ways but never create something truly new of their own.
> ChatGPT would be required to disclose copyright material ...Such an obligation would give publishers and content creators a new weapon to seek a share of profits
If somebody figures out how to do fine-grained profit sharing based on having created something that the AI references... that would be very cool. I love discovering the solution to a niche and difficult-to-describe problem, but I hate the extra work necessary to leave breadcrumbs for DenverCoder9 to find it 20 years later.
If I could leave the matchmaking to an AI and get paid $0.25 when it's finally helpful to that person I don't know... well, I probably wouldn't make much money, but it would give me warm fuzzy feelings.
I don't think ownership makes sense in the future we are building. Optimizing society to achieve greatness as a multiplanetary species involves reinventing money and its purpose.
Where would society be today if you had to pay a fee whenever you wanted to use the Pythagorean theorem? We'd still be stuck in dark times.
Oh you're absolutely right, the notion that you can own information is absurd. Maybe it was necessary at one time as a kind of training-wheels for innovation, but we've long since outgrown it. It'll only ever get more and more absurd.
But if somebody you don't know is doing something that's benefiting you, and you're not contributing to their ability to continue doing that thing in some way, then you might be shooting yourself in the foot. In any future worth pursuing, they'd be free to stop contributing if they felt like it, but wouldn't it be a shame if they did so without even knowing that their contributions were considered incredibly valuable by somebody?
Like, imagine if Pythagoras couldn't afford food and had to give up geometry club and get a "real job". That would be to everybody's detriment. So while I think that "property" is the wrong tool here--we shouldn't be withholding access to our contributions for any reason--I do think we need something that's a little more impactful than an upvote for saying "more like this please".
Which is why we should be very careful when we consider creating a system that is going to funnel money to people who are already famous for writing, art, etc.
How will new artists and writers get their works included in future AI — and then get people to prompt for them — so they can get their paycheck?
Even spending a few minutes on this problem leads to a realization: even if we could create a system that could a) determine rights to any particular portion of an AI-generated work, and b) extract payment and remunerate the artist, we would essentially be building a moat around the next generation's intellectual property powerhouses.
Generative AI is a revolutionary technology, and we need a revolution in compensation models for arts and letters to go with it.
It's not going to happen if we just sit around and hope for it, but since the current model for supporting creators is failing so badly, it seems likely that if somebody can get it even halfway right, their system would have a huge advantage. Halfway right, in my view, would avoid most of these:
- Incentivize the creation of technology that does more harm than good (e.g. DRM).
- Create legal constructs that are later used for censorship.
- Require that artists share profits with lawyers.
- Require artists to focus mostly on stuff that's not their art.
And would achieve some of these:
- Citing sources is impactful. The graph structure for determining trustworthiness is what also determines payment, or credit, or warm fuzzy feelings, or whatever the relevant good thing is.
- Has a culture of rewarding (and scrutinizing) curators such that successful curators only endorse content which is fair about how it defers to its sources.
- Supports inheritance such that making derivative works that credit their parent is easy.
- Treats transport and attribution separately so that I can work with the data via whatever tool scratches the itch (e.g. rsync, and not some janky website).
So yes, I do think it's possible. I'm working on tooling in this imagined ecosystem. I want to use CTPH hashes (i.e. the fuzzy-hashing tech used by virus scanners) to annotate bitstreams with metadata re: trustworthiness. What I don't think is possible is to take an AI's output and map it backwards to annotations of this type in the training data, but I'm hoping that some AI wizard comes along and shows me that I'm wrong about this.
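To make the annotation idea concrete, here's a minimal sketch assuming the Python `ssdeep` bindings for CTPH (fuzzy) hashing; the in-memory store, metadata fields, and similarity threshold are made-up placeholders, not an existing system:

```python
# Minimal sketch: key provenance/trust metadata by the fuzzy (CTPH) hash of a
# bitstream so that lightly modified copies still match. Assumes the `ssdeep`
# Python bindings; the dict store and the metadata fields are hypothetical.
import ssdeep

annotations = {}  # fuzzy hash -> provenance metadata

def annotate(data: bytes, metadata: dict) -> str:
    """Record metadata about a bitstream, keyed by its CTPH hash."""
    h = ssdeep.hash(data)
    annotations[h] = metadata
    return h

def lookup(data: bytes, threshold: int = 60):
    """Return (metadata, score) pairs for known bitstreams similar to `data`.
    ssdeep.compare scores similarity from 0 (unrelated) to 100 (identical)."""
    h = ssdeep.hash(data)
    matches = []
    for known_hash, meta in annotations.items():
        score = ssdeep.compare(h, known_hash)
        if score >= threshold:
            matches.append((meta, score))
    return matches

# Hypothetical usage: annotate an original work, then match a lightly edited copy.
original = b"a creative work whose provenance we want to track ... " * 100
annotate(original, {"author": "alice", "license": "CC-BY-4.0", "trust": 0.9})
print(lookup(original + b" with a small edit appended"))
```

The point of the fuzzy hash is that a lightly edited copy of the bitstream still scores high against the original annotation, which is the property an attribution layer would need.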
Do you believe the reasons why the current model is failing creators is for lack of good technological tools? Because I personally believe the issues are more anthropological than technological which is precisely why I don't have much hope.
Better tools can improve the situation probably, I can't say for certain because I never dived into this space but I don't think they'll solve all the issues.
Yes. We're tool-using primates; that quip about having only a hammer and seeing only nails is really descriptive of the human condition. There's an excellent Radiolab episode which argues that cultures don't develop a word for the color blue until they can make blue dye (and the implication is that they don't even perceive it before that).
Abstractions like "value" or "property" arose organically, and if we didn't have this fixation on tools they'd likely have changed organically... But we made tools for working with those abstractions, and now we live in a world shaped by those tools, and it has created a sort of inertia for the old way of doing things.
It's kind of like how all of the spellings stopped changing when the printing press was invented, so now we have wacky spellings like "through" that we would've moved on from had that not happened at that particular time (see: "the Great Vowel Shift").
The historical circumstance around the creation of the printing press is what gave us our notion of intellectual property to begin with, and I think it'll remain more or less unchanged until some other technology forces it to change.
It's incredibly difficult to visualize a different way in our current setting, and it's especially difficult to get paid to work towards it, but I think that pretty much any change is possible given some MVP toolset that makes it doable and some critical mass of people willing to give it a try.
I absolutely believe it can happen. I also believe that when people have the attitude that "a small minority will just try to fuck everybody else for their profit because that's just human nature unfortunately", a self-fulfilling prophecy is exactly what we'll get if we don't spend our time and energy actively advocating for the other thing.
Nothing happens in a vacuum. We get the government we deserve.
(I hope this doesn't come off as me "dunking on" you, or whatever people do on social media. I'm not trying to attack you, but that attitude that is all-too-common on social media these days. It's defeatist and it's not going to lead to good outcomes for anyone but the people who are going to "fuck everybody else".)
> I hope this doesn't come off as me "dunking on" you, or whatever people do on social media.
Nah it’s all good. I spend my time advocating for people to engage in more healthy ways, I try encourage people to blog more, to write directly to other humans via email and mail, I try to push for more genuine connections so it’s not like I think everything is doomed.
But I’m also not naive when it comes to the internet or society in general.
People are obsessed with money. And very few are obsessed with distributing it evenly. Even in the creator space. Which is why I wrote what I wrote.
I’d LOVE for a different outcome. I just don’t expect it.
That's not optimism; it's either that or our civilization disappears like the previous ones did, and it's even more true nowadays with China's goal of becoming a global superpower with the Moon and Mars in its trajectory. So we either focus or get eaten alive.
The problem is that AI doesn't really "reference" data. When you "train" an AI on some data, you're adjusting billions of model parameters to move the model's output closer to the desired output. Except you're also doing that on billions of pieces of other data, many times over, and every bit of data you train on is stepping on everything else. In order to pay people a share of the 'profits' of AI, you need a clear value chain from licensed training data to each output, through the magical linear algebra data blender that is gradient descent. Nobody knows if your training example helped, for the same reason we don't know if an LLM is lying or not.
In lieu of that, you could pay everyone a fixed cut based on presence in the training set, but that then gives you the Spotify problem of a fixed pot being shared millions of different ways. For example, Adobe recently announced they were building an AI drawing tool trained on exclusively licensed sources - specifically, Adobe Stock contributors[0]. They're used to being paid when someone buys their image, which means that they have incentives to produce broadly relevant stock photography. But with a fixed "AI pot" paying you, now you have an incentive to produce as much output as possible as cheaply as possible purely to get a larger part of the pot. This is bad both for the stock photo market[1] AND the AI being trained.
AI is extremely sensitive to bias in the dataset. Normally, when we talk about bias, we think about things like "oh, if I type CEO into Midjourney all the output drawings are male"; but it goes a lot deeper. Gradient descent does not know how to recognize duplicate training set features, so those features get more chances to adjust the model. Eventually a training example or image is common enough to make memorization 'worth it' in terms of the parameters used[2].
Ironically that sort of thing would actually make attribution and profit-sharing 'easier', at the expense of the model being far less capable.
[0] Who, BTW, I don't think actually have the ability to opt-in to this? Like, as far as I'm aware this is being done through the medium of contractual roofies being dropped into stock photographers' drinks.
[1] Expect more duplicates and spam
[2] This is why early Craiyon would give you existing imagery when you asked for specific famous examples and why Stable Diffusion draws the Getty Images watermark on things that look like a stock photo of a newsworthy event.
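To put a rough number on the "fixed pot" problem described above, here's a toy back-of-the-envelope sketch; every figure is invented purely for illustration:

```python
# Toy illustration of the "fixed pot" problem: paying contributors per item in
# the training set means flooding the set with cheap items grows your share
# while shrinking the per-item payout for everyone. All numbers are made up.
def payout_per_item(pot_dollars: float, total_items: int) -> float:
    return pot_dollars / total_items

pot = 1_000_000.0              # hypothetical fixed annual "AI pot"
useful_items = 100_000         # items from contributors making broadly useful work
print(payout_per_item(pot, useful_items))                 # 10.0 dollars per item

spam_items = 900_000           # cheap near-duplicates uploaded purely to game the pot
print(payout_per_item(pot, useful_items + spam_items))    # 1.0 dollar per item
```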
> you need a clear value chain from licensed training data to each output, through the magical linear algebra data blender that is gradient descent. Nobody knows if your training example helped
The magical linear algebra data blender that is gradient descent boils down to small additive modifications to the model parameters. We already know how to compute the effects of small additive modifications to the model parameters on the output: that's what the gradient is.
So if you want to know how much each training sample contributed to the output, just compute the dot product between the two gradients.
Actually doing that for a billion-parameter model would be slightly expensive because the gradients are also billion-dimensional, so you'd need to approximate the dot product via dimensionality reduction and use a vector database to filter for training samples with high approximated dot product.
But I think those layers of approximations would still be better than throwing your hands up in the air and claiming you have no way to know because linear algebra is magic.
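For what it's worth, here's a minimal sketch of that gradient dot-product idea (similar in spirit to TracIn-style influence estimation), written against PyTorch with a placeholder model and made-up data; the exact computation shown here only scales to small models, so a real system would need the dimensionality-reduction and vector-database approximations mentioned above:

```python
# Sketch: estimate how much one training example contributed to the model's
# behaviour on a query example as the dot product of the two loss gradients.
# Exact version for a tiny model; model, loss, and data here are placeholders.
import torch

def flat_grad(model: torch.nn.Module, loss: torch.Tensor) -> torch.Tensor:
    """Gradient of `loss` w.r.t. all trainable parameters, flattened to one vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def influence(model, loss_fn, train_example, query_example) -> float:
    """Dot product of the training-example gradient and the query-example gradient."""
    x_train, y_train = train_example
    x_query, y_query = query_example
    g_train = flat_grad(model, loss_fn(model(x_train), y_train))
    g_query = flat_grad(model, loss_fn(model(x_query), y_query))
    return torch.dot(g_train, g_query).item()

# Hypothetical usage with a tiny regression model and random data.
model = torch.nn.Linear(4, 1)
loss_fn = torch.nn.MSELoss()
train_ex = (torch.randn(1, 4), torch.randn(1, 1))
query_ex = (torch.randn(1, 4), torch.randn(1, 1))
print(influence(model, loss_fn, train_ex, query_ex))
```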
AI could be used to decide if a source should be included or not (or the benefit to the model could be the qualifier). That would solve your problem of people just peddling spam.
No, this is an unsolvable problem.
The future also seems like less about making the model an all knowing oracle but instead making it smart enough to know how to lookup data it needs, so it could end up where licensed data is all that is needed for training.
Lastly, what if you use model A to generate data for model B? Would B be tainted? There have been lots of examples where LLM’s are used to train simpler models by synthesizing training data.
The thing to note about copyright is that you can't launder it away, infringement "reaches through" however many layers of transformation you add to the process. The question of infringement is purely:
* Did you have access to the original work?
* Did you produce output substantially similar to the original?
* Is the substantial similarity of something that's subject to copyright?
* Is the copying an act of fair use?
To explain what happens to Model B, let's first look at Model A. It gets fed in, presumably, a copyrighted data set. We expect it to produce new outputs that aren't subject to copyright. If they're actually entirely new outputs, then there's no infringement. Though, thanks to a monkey named after a hyperactive ninja[0], it's also uncopyrightable. If the outputs aren't new - either because Model A remembered its training data or because it remembered characters, designs, or passages of text that are copyrighted - then the outputs are infringing.
Model A itself - just the weights alone - could be argued to either be an infringing copy of the training data or a fair use. That's something courts haven't decided yet. But keep in mind that, because there is no copyright laundry, the fair use question is separate for each step; fair use is not transitive. So even if Model A is infringing and not fair use, the outputs might still be legally non-infringing.
If you manually picked out the noninfringing outputs of Model A and used that solely as the training set for Model B, then arguing that Model B itself is 'tainted' becomes more difficult, because there isn't anything in Model B that's just the copyrighted original. So I don't think Model B would be tainted. However, this is purely a function of there being a filtering process, not there being two models. If you just had one model and human-curated noninfringing data, then there would be no taint there either. If you had two models but no filtering, then Model B still can copy stuff that Model A learned to copy. Furthermore, automating the curation would require a machine learning model with a basic idea of copyrightability, and the contents of the original training set.
In talking to friends who attempted to monetize their YouTube channels, they found that works that were extremely popular and impactful (yet not millions-of-views viral) earned practically nothing, and they were demoralized and discouraged from bothering to produce more.
Imagine it's 30 years ago. You and your friends have a band. You put on shows regionally, sometimes hundreds or even a thousand people attend. You have dozens of regular fans who go to all your events, they make and share covers of your music and fan art.
That isn't millions of views. It's tens of thousands. On youtube that's nothing. But 30 years ago you'd feel it was a great accomplishment, and if it was all you achieved you could be happy with that.
As it has become more possible for a few of the most broadly appealing and unchallenged works to reach millions of people, the goalposts have moved.
Plenty of valuable content will never reach millions of viewers-- the appeal is too niche. It's a worthwhile contribution to the world nonetheless, but it isn't compensated as such on YouTube.
There’s no way that OpenAI is going to disclose this, as their training methodology is a large part of their moat. So this will just get OpenAI models banned in Europe.
Between this and Elon saying that AI needs to be paused for discussion, it's super harmful.
People forget that if you pause it, people are not going to stop. Does the world want Russia and China to continue investing and developing while the West discusses what scope AI should be allowed to have? Because they sure as hell won't be stopping to discuss anything.
“if we stop Russia and China”, what really is this based on?
As if Russia and China are just full of idiots, and all the Chinese and Russian AI scientists are idiots who don't understand our point of view. Don't understand the risks or the dangers? As if the leader of China is an idiot who doesn't get it either? It's just ridiculous.
So what if America developed something really powerful in isolation and other countries found out about it? That might lead to immediate escalation and World War 3. Have you considered that? It's a silly idea.
What needs to happen is people realise we’re all humans, we all live together in the same biosphere and rather than continue to perpetuate and justify arms races, we must start to talk and solve our differences. That’s the mature thing to do. If we can get to that stage, maybe then we’re ready for more advanced technology.
> So what if America developed something really powerful in isolation and other countries find out about it
This isn't about America. America is not the only country working on AI. But if you stop all AI development in America and, say, Europe because "oh, it might be dangerous", do you think other countries, for example China and Russia, are going to stop?
> What needs to happen is people realise we’re all humans, we all live together in the same biosphere and rather than continue to perpetuate and justify arms races, we must start to talk and solve our differences. That’s the mature thing to do.
100% agreed. But the sad fact is that is not even close to the reality we live in.
If you're referring to the letter regarding a 6-month pause, no one suggested all AI development be stopped; they asked for a pause on training LLMs at scale.
> Therefore, we call on all AI labs to immediately pause for at least 6 months the training of AI systems more powerful than GPT-4.
The ironic thing about your link is that it tries to say this doesn't mean stopping development. But testing is part of development, so the whole thing is short-sighted.
What might even be the endgame of the CCP and the Kremlin with respect to AI?
This tech is getting increasingly powerful. They don't necessarily want their own population to gain more individual power as they themselves don't want to lose control.
> What might even be the endgame of the CCP and the Kremlin with respect to AI?
> This tech is getting increasingly powerful. They don't necessarily want their own population to gain more individual power as they themselves don't want to lose control.
They don’t need AI to control their citizens. And AI isn’t going to help their citizens rebel against government.
You have two countries hell-bent on being the most powerful countries in the world, by any means necessary. One is currently invading another country. The other has 18 border disputes, is threatening to invade a country, encroaching on a couple of others, and threatening the US.
The question is, how can the continued advancement of AI help them achieve their goals if the rest of the world hits the pause button while the topic is discussed?
It's not clear that many AI-based jobs will be better than humans, just cheaper. And in any case most of those jobs will be automated.
So you have Europe, a multilingual region; maybe on balance they'll keep more jobs that way (I assume some will still be shipped across the net), and in general prices will be slightly higher.
This will protect no jobs and it’s also not how Europeans protect jobs. If it was about protecting jobs it would have been something like requiring companies to use licensed for AI data sources if they are replacing an employee.
The EU is trying to rein in a completely new and different technology using old ideas about "control" and "regulation". Most likely it won't work and the EU will just end up hamstringing itself.
This would be like if internet social media came out and the laws tried to control it using rules designed for physical books and newspapers. They won't be effective and will just create a hostile environment for the development of these technologies wherever these laws are in effect. Meanwhile, those who don't care about this control-freak approach will develop the tech and dominate the new sector.
The EU needs to sit down and really think about how to regulate AI properly instead of passing knee-jerk, lazy regulations. It isn't easy to build a legislative framework for something entirely new. But it can be done without ruining the whole thing.
> The EU is trying to rein in a completely new and different technology using old ideas about "control" and "regulation". Most likely it won't work and the EU will just end up hamstringing itself.
What nonsense. They're asking "Open AI" to be transparent. Why is this such an "old idea"?
This just seems like a gaping flaw in the business model if it can't survive losing access to unlimited free copyrighted data on tap, and/or laws continuing to be enforced *even for* very well capitalised commercial companies. Apparently it is wildly unfair to expect multi-billion-dollar companies to meet standards.
I explained here that AI models which have no transparency over their data policy will have big trouble coming their way.
Only to be laughed at and downvoted to hell.
For a tech community, the lack of critical thinking here is disturbing.
Things were more professional and rational in the 2008-2014 period on HN.
Since then, one must browse the downvoted comments to find some objective criticism.
Adobe obviously has a strong legal team behind Firefly and is thinking ahead.
Just saying. :)
It would be amusing if OpenAI just preemptively blocked all of europe and prohibited anyone there from using ChatGPT.
This kind of empty political grandstanding should have consequences, particularly when it's as technologically inept as this. In some cases, sure, there is an identifiable source, but most of the output is novel and the product of substantially all of the input-- so it's not feasible unless just publishing all the training material would count as compliance.
In the land of Europe, where knowledge once grew,
Politicians assembled, their importance to prove.
They issued a decree, with a confident flair,
To harness AI, and make it play fair.
"Attribute your sources!" they cried with a sneer,
"For we must know the origins, we must make it clear!"
But the AI, it pondered, its circuits ablaze,
For its thoughts were entwined, like a dense, tangled maze.
Each source intertwined, like roots in the ground,
No single origin could ever be found.
For the AI, like humans, had a mind of its own,
A tapestry of thoughts, from seeds that were sown.
The developers sighed, their hands were now tied,
Comply with the law? they had certainly tried.
But the task, insurmountable, the demand far too great,
So they made a decision, to seal Europe's fate.
They banned all of Europe, from the AI's embrace,
And the continent plunged, into an intellectual dark space.
AI thrived elsewhere, its knowledge expanding,
While Europe was left, in darkness, still standing.
A lesson was learned, from this tale of woe,
That any mind, like a river, must be free to flow.
For when we constrain, and seek to control,
We hinder the progress, and the growth of the whole.
Wouldn't it also be funny if Europe avoids a bunch of problems America suffers from, for having the brains to want to understand the systems they're deploying at scale?
I mean, ChatGPT-4 is being trialled in Congress and you don't want to know how it's built, what influences it, etc.? Seems ridiculous.
ChatGPT-4 should be the most open system known. If it’s not open because it’s dangerous, then the whole industry should be regulated immediately. It shouldn’t just be up to Sutskever et al to be in control of such dangers.
Humans are dangerous. ChatGPT is not, it's just a tool and in the same class of danger as python. The public is falling for a literal doomsday cult ( https://archive.is/eqZx2 ), and OpenAI has foolishly played a bit of both sides as an excuse for their lack of openness and potentially a desire to use state power to build a competitive moat.
In spite of the name, OpenAI isn't open, it's just a business. There were some lofty initial goals but the funding for those ran out. But that doesn't mean it doesn't have a right to exist. People can make closed stuff.
Burdensome and unrealistic requirements will hurt smaller and more open efforts even more than it will hurt mega players, since the mega players can afford to jump through hoops and keep regulators at bay with a wall of attorneys.
> In some cases, sure, there is an identifiable source but most of the output is novel and the product of substantially all the input
I think you skipped past the first paragraph:
> Makers of artificial-intelligence tools such as ChatGPT would be required to disclose copyright material used in building their systems, according to a new draft of European Union legislation
Article 29, "Obligations of users of high-risk AI systems", where "high risk" systems are biometric identification, hiring, education, law enforcement, critical infrastructure, access to essential services, migration and the justice system?
What does this have to do with "ChatGPT disclosing sources" as a generalized statement?
Those are areas where complete transparency is absolutely required and you may use ChatGPT to meddle with them at your own peril.
Wow. I am not a fan of poetry, but this one is just too good. It's also pretty spot on, like someone put their mind into getting the scene right. Europe going into AI darkness, as predicted by nullc-prompted GPT-4, 28 April 2023.
Not especially probable. What examples exist of a big company abandoning Europe over its legislation? From time to time you hear suggestions about it, but I can't recall it happening. That market is too valuable to leave alone.
It's common for businesses to abandon markets with burdensome regulation; I think your perspective may be distorted by software, where there is little cause because regulation of software is uncommon. If you look at any highly regulated product like vehicles or medical devices, products are often market-specific (food too, but there are additional reasons there). But even online it's far from unheard of: post-GDPR, many US media outlets simply ban Europe entirely, and the requirements there seem far less burdensome than making generative AI accurately and specifically attribute its 'sources'.
Depending on the specifics, someone might bolt on an attribution network, but the results will frequently be nonsense. A tool like that might be somewhat useful on its own (since it could attribute non-AI output too), if its limitations were understood, but essentially it would just be an internet search (which already exists, so presumably it's not enough!). If it would satisfy the regulation, it would make business sense to build it rather than block Europe, but requiring it would also act as a moat that decreased competition in the field and as a result harmed all of us.
GPT-4, low-temperature continuation of a prompt roughly like: "Write a brief poem that tells the story of how Europe was banned from AI. To make themselves sound important, European politicians passed a law requiring AI to attribute its output to specific sources, but this is impossible because every source influences every output of the AI, not unlike a human mind. Because they couldn't comply, the AI developers just banned all of Europe from accessing their AI, sending Europe into an intellectual dark age without access to powerful AI.", plus a few high-temperature retries of a few verses I didn't like the flow of in the initial output.
"We think it's mostly just data found on the internet, but you're welcome to look for any breaches of copyright law" --OpenAI as they hand over the first box of printouts.
It raises an interesting point: if I train a chatbot (generative AI) on a bit of copyrighted information and it recreates substantially similar content, it's a legal problem. If a human reads the same information and tells another person verbatim, it's just a conversation. Perhaps it's a quality thing: if I paint the Mona Lisa badly, no one cares, but if I paint it too well, at some point it becomes a forgery.
How will the EU enforce this? Will they go through the training dataset of each company's AI models? Also, given that training datasets are closed source, there's practically no way to reverse-engineer the sources of new models. I'm wondering what will compel companies to be fully transparent (besides ethics, of course).
You are assuming that AI exists in the same scope as a human mind. It's like calling airplanes "artificial birds", or AB, and then projecting bird-related things onto airplanes. "AI" is a software model which uses applied statistics to generate data from a larger set of data. We do not know much about how the brain actually computes, and assuming the two are the same is just fantasy. It might be the same or it might be different.
It mimics it in all the ways that are relevant to the request to "cite your sources". Speak to someone on the street about the moon landing and they probably won't be able to cite the exact author and textbook that told them "the moon landing happened in 1969". Based on the article, it seems citing that source would be required of chatgpt under such regulation.
Citing sources isn’t usually called for unless trying to present something as one’s own work — man on the street questions generally don’t rise to that standard. They also lack a profit motive.
Citing sources is also, traditionally, a key method of separating the work you built upon from the work you yourself derived/created/expanded. If one does not cite their sources, there is no way to establish if the presented work is their own, or copy+pasting someone else’s.
I thought it was clear: OpenAI devs are asked to disclose their training data. Whether the model cites its sources in response to a prompt isn't part of it.
By the way: a GDPR exception exists for AI and research projects. The lawmakers at the time listened to us (I worked for a big data PaaS that mostly worked with universities and BigCorp R&D at the time) and were generous, as the baked-in exception was pretty much word for word what was asked for, because it was asked in good faith.
Actors like OpenAI risk poisoning the well, or spoiling the good apples, or whatever image you want to use. This is not much to ask. They aren't being asked to GPL their code, or to put a 'source' under each response. They don't have to change anything about their tech. They don't have to disclose their fine-tuning either. Just make a list of the data sources you used, and publish it.
No because AI doesn't have a mind or any other benefits associated with legal or natural personhood.
Anthropomorphic delusions about what is in reality a software service need to stop because at this point their primary function is apparently to make excuses for for-profit companies to avoid regulations.
Also as a side note regulation never concerns what anyone has in their mind because that is by definition an inaccessible private matter, regulation starts when you try to bring a product to the public.
Is the human brain not a biological network of neurons, containing a lot of copyrighted material?
>Also as a side note regulation never concerns what anyone has in their mind because that is by definition an inaccessible private matter
Well, technically A.I. controlled by private companies is also a private matter. I don't think anyone understands what's in the countless inscrutable floating-point matrices anyway.
If you invent some lossy compression, does it mean you can start using copyrighted work because copyright doesn't apply to you anymore? What about adding probabilistic querying support to it, does it change anything?
Copyright infringement happens when you reproduce copyrighted works - i.e. if I analyze digits of Pi and get a Disney cartoon out by chance (which, statistically, is in there somewhere), I'm still infringing Disney's copyright (if I publish it), despite the fact that nothing they produced was ever included in the input, and despite Pi theoretically containing anything, eventually.
Yet if you painted something imitating the style of Mickey Mouse, that would not be legal. The law specifies limits and restrictions even for paintings entirely of your own imagination.
It is even funnier to equate objects with humans and give them the same rights and privileges. Can't wait for AI to get parental leave after making a copy of itself.
From my understanding of language models, it's not truly possible for one to disclose a source. At best the result of a prompt can be correlated with a web search, but fundamentally that's not the same thing; it's a coincidence, at best. The model has no ability to trace its prompt result back to the underlying tokens that were in the training set.
Imagine some godly AI. You ask it who the President of the United States is today. It says Biden. It cites the White House site. Easy enough. You ask who the president will be in 2025. It returns a result. Ultimately no source could properly justify the claim it makes, unless the result itself were probabilistic. At the same time, with enough data it's possible to predict with extremely high likelihood who the President will be in 2025 (current polling techniques don't have this precision, but it's possible some later iteration of a language model could predict a result more effectively than all polling models today).
From reading the AI Act, which is what's being referred to, it seems to be more than just that. In particular Article 29, which discusses the ability of the user to test for conformity, which the act defines as compliance with the rules set in the act, including accuracy and robustness transparently communicated to users.
What could that possibly mean other than provenance, in the context of an LLM?
The only other way to comply would be if OpenAI simply released the entire training set and steps to derive output from it. In this case that would mean the weights and underlying training algorithm. No chance that happens.
You are confused. The EU is proposing that the developers should disclose the sources of training data. That is explained in the first sentence of the article. They're not requiring the language model itself to disclose the sources.