Everyone here thinks Europe is going to kill AI within its borders. Honestly, it reminds me of how everyone derided EU antitrust as just milking Google for fines, until everyone soured on Google and realized what the EU already knew.
The EU is not anti-AI. In fact, it has stronger protections for AI training than the US does: EU law already has a copyright exception for "text and data mining" (TDM) which covers AI training. The problem is that OpenAI has been incredibly cagey about how its models get built and trained. That runs contrary to the spirit of the TDM exception: it exists for scientists to do science with, and OpenAI is behaving far less like a scientific organization and far more like a commercial enterprise.
Isn't GDPR currently the main problem? There's personal information on the Internet. If you are processing crawled data, then you also end up processing the personal information, but you don't have consent for the processing from those individuals.
Given where the top AI companies are, and how much more they matter for AI than the top universities do, this is bafflingly optimistic. There's no reason to expect Europe to be anything other than an also-ran in AI, punching-self-in-face antics aside. Why would anyone think Europe was relevant to AI development?
You don't seem to be getting the point of this. Europe is not attempting to hobble AI development, but to enforce its laws within its borders. If OpenAI can't adhere to them, then it can't do business there, which is a problem for OpenAI, given that the European Union is the third-largest economy in the world.
I could be wrong, but to me this is like a government demanding that trade secrets are revealed.
If people are worried about open AI, they already have the option not to use it.
Sorry, it doesn't look like you're getting it. It's not about AI, it's about personal data. Whether it's AI or any other program using the data, EU citizens have a right to know about and request removal of that personal data.
OpenAI needs to allow users to do this, same as anybody else. If that reveals trade secrets, then so be it.
This is a great first step. It's a joke that OpenAI thinks they can get away with saying they use "both publicly available data (such as internet data) and data licensed from third-party providers" in their Technical Report.
That rules out nothing! With a description like that, they could have used literally anything.
If you're going to pretend to be doing science you should at least be held to some of the standards we typically associate with doing science.
I know the article talks about copyright, but not stating any sources for the data at all is a bad precedent to allow.
Woah. Doing bad science isn't illegal, and making it so would be quite chilling. It's common in many fields to be quite imprecise about data used in the work, and entirely uncommon in many for data to be externally reproducible.
Legislation restricting research isn't the right way to improve science and is unlikely to achieve the intended effect for many reasons, including that it's easier and safer to just not touch the impacted area. In some domains this causes whole areas to go unstudied or understudied, e.g. because it runs into IRB and just isn't worth doing... but at least the rules demanding IRB approval are intended to keep people from suffering grave harm and even those are less strong than blanket regulation (they're rules tied to federal funding, not research in the abstract).
If you're doing any kind of science, recording provenance for your inputs should be table stakes. Bad science isn't about hiding or obscuring the origin of the data, it's about being sloppy, incompetent, or even flat out willfully misinterpreting the results.
We have names for what looks like science but is done without documenting - let alone outright falsifying - where the data came from. Hoax. Advertising. Propaganda. Parallel construction.
Let's not lump incompetence and malice in the same bucket, please. And if you're unsure of the data provenance, then state that fact.
The consumers of scholarship are able to look at a work and determine if it's the sort of thing where access to underlying data is important-- and they're free to discount it when it doesn't provide enough. Journals and grant writers are free to set standards for the work they publish or sponsor-- and they should!
But no one needs to legislate that publications such as your conclusion-- that science done without documenting its sources is properly called Hoax, Advertising, Propaganda, or Parallel construction-- itself properly document its sources. We can take it for what it is, an opinion-- one no doubt supported by some data but none of us need to see it, and we can evaluate it without calling it propaganda. If you wanted to make your point stronger, I'm sure you'd give us some supporting data (if you could figure out where those views came from...).
Though people sometimes pretend otherwise, a lot of research is dressed up informed opinion, put into a formal setting with standardized argument styles so that it can be compared and assessed against other informed opinions. None the less, such work done honestly and diligently advances the human condition.
The ways in which it can be best improved are field specific and can only really be judged by the people attempting to use the scholarship. In some cases the data should be published, in others its provenance documented (sometimes publication of the data would be a violation of the law, too!), in others access to source code should be paramount, in yet others the authors biases may be the primary concern, and in some fields all publications should be directly sent to the incinerator. Applying the wrong standards will just make things worse. People are smart, they tend to figure out what works for them and their field over time.
> Though people sometimes pretend otherwise, a lot of research is dressed up informed opinion, put into a formal setting with standardized argument styles so that it can be compared and assessed against other informed opinions. None the less, such work done honestly and diligently advances the human condition.
I ... think we actually agree here.
In fact, to prove your point: I have no chance of accounting for the origins of my opinions, because they stem from decades of osmosis and subjective experiences. But I can at least be honest about something I say, do or argue being an opinion. The same way you just did.
Doing bad "science" is not illegal, but maybe should be, considering the replication crisis that is upon us. It diminishes the utility of the work that is being done, and makes it difficult to tell apart actual scientific discoveries from flukes and forgeries.
I want to know if OpenAI used, say, GPL or other copyrighted software and then the bastards had the genius idea to put restrictions on the output in their ToS. I want things to be fair: if MS/OpenAI can train on GPL code, then I should also be allowed to train on MS proprietary code or on Disney images and video. It is not fair that big companies can screw the public but the public can't do the same to the big companies. The first step is clearly to have the big companies reveal whether they used copyrighted material.
> I want to know if OpenAI used say GPL or other copyrighted software and then the bastards had the genius idea to put restrictions on the output in their ToS.
This is a bit of a gray area. Are you allowed to read GPL'ed code and use a similar pattern in a closed source project?
I am a human, not a machine that ingests all the GPL code on the internet and then outputs similar code with very small differences. I am fine with OpenAI and MS using GPL code as long as the open source community can also train on the proprietary code and art of the big companies.
What happens now is that some big companies say it's OK to train on any licensed material, while on the other hand some people are being sued because they did exactly that. I want it clarified ASAP. And personally I would not give a shit about OpenAI's ToS and would use their output however I like, the same way they did.
I agree, but things are changing: many publishers already require a disclosure statement about data. I think both the US and the EU are slowly moving in the direction of open data in scientific research.
What is this "legislation restricting research"? These companies are not doing "science".
Imagine a future where everyone learns from AI instead of books because it is more convenient, faster etc. You would get the same info, but you would not know who was the expert that you learned from. How would that change society if all authors just disappeared behind a generic AI brand? I don't think it would be especially good, and I think it is completely fair that an answer from ChatGPT should provide sources. It would improve the quality.
AI has been trained on all human knowledge up until now, but as it continues ingesting human ingenuity going forward it will remove the incentive for people to create new knowledge or styles or art, since the AI can immediately mimic you and steal it. Promoting creativity is what copyright law was created for, so I've got a feeling we'll be revisiting those laws sooner or later.
> How would that change society if all authors just disappeared behind a generic AI brand?
I've been thinking about that but in terms of famous actors. Once AI can replace actors in movies, as well as singers and models like Instagram influencers and so on, maybe some of the weird hero worship and the paparazzi and the gossip mags and all that nastiness will fade away somewhat. That, I feel, would be a good thing. Pick your favorite Hollywood actor, or pop singer. A supremely talented human... amongst thousands and thousands of supremely talented humans in their field, and yet they are the ones who got lucky, had the right connections, or landed the right role. Then for the rest of their lives they are feted and hero-worshipped as if they are more than human, while thousands of equally talented people who didn't get the lucky break are ignored. That's what being famous is, mostly, and it's not good for either the famous people or the people who worship them.
Does that apply to scientists and authors? I'm not sure. But in terms of scientific breakthroughs it's extremely rare that a particular discovery could only have been made by one person. In fact, nearly every discovery, from the calculus to the theory of evolution to DNA, was concurrently discovered by multiple people. And yet we attach one name to each discovery and hero worship that person because they published a few weeks earlier or were just better at self marketing.
Maybe losing the attachment of famous names to things is a good thing for society. As long as it's not replaced by a corporation pretending ownership of all the knowledge in their place, at least.
> Once AI can replace actors in movies, as well as singers and models like Instagram influencers and so on, maybe some of the weird hero worship and the paparazzi and the gossip mags and all that nastiness will fade away somewhat
I dunno, I remember seeing on youtube Hatsune Miku concerts being pretty packed, and there was that one guy who even married Hatsune Miku. Who knows what'll happen with AI.
Makes me wonder if somebody has trained an LLM on SciHub data. It would be interesting if someone were to marry a symbolic engine like Mathematica with such an LLM/AI.
Yes, there is, but I am not sure if it's actually used to do anything past some symbolic math. IMO true AI could only happen from understanding (and internalizing) physical laws, especially the notion of energy optimization. I feel that until LLMs are somehow married with this knowledge, they will always be parroting back existing 'data' in seemingly creative ways but never create something truly new of their own.
> ChatGPT would be required to disclose copyright material ...Such an obligation would give publishers and content creators a new weapon to seek a share of profits
If somebody figures out how to do fine-grained profit sharing based on having created something that the AI references... that would be very cool. I love discovering the solution to a niche and difficult-to-describe problem, but I hate the extra work necessary to leave breadcrumbs for DenverCoder9 to find it 20 years later.
If I could leave the matchmaking to an AI and get paid $0.25 when it's finally helpful to that person I don't know... well, I probably wouldn't make much money, but it would give me warm fuzzy feelings.
I don't think ownership makes sense in the future we are building. Optimizing society to achieve greatness as a multiplanetary species involves reinventing money and its purpose.
Where would society be today if you had to pay a fee whenever you wanted to use the Pythagorean theorem? We'd still be stuck in dark times.
Oh you're absolutely right, the notion that you can own information is absurd. Maybe it was necessary at one time as a kind of training-wheels for innovation, but we've long since outgrown it. It'll only ever get more and more absurd.
But if somebody you don't know is doing something that's benefiting you, and you're not contributing to their ability to continue doing that thing in some way, then you might be shooting yourself in the foot. In any future worth pursuing, they'd be free to stop contributing if they felt like it, but wouldn't it be a shame if they did so without even knowing that their contributions were considered incredibly valuable by somebody?
Like, imagine if Pythagoras couldn't afford food and had to give up geometry club and get a "real job". That would be to everybody's detriment. So while I think that "property" is the wrong tool here--we shouldn't be withholding access to our contributions for any reason--I do think we need something that's a little more impactful than an upvote for saying "more like this please".
Which is why we should be very careful when we consider creating a system that is going to funnel money to people who are already famous for writing, art, etc.
How will new artists and writers get their works included in future AI — and then get people to prompt for them — so they can get their paycheck?
Even spending a few minutes on this problem leads to a realization: even if we could create a system that could a) determine rights to any particular portion of an AI-generated work, and b) extract payment and remunerate the artist, we would essentially be building a moat around the next generation's intellectual property powerhouses.
Generative AI is a revolutionary technology, and we need a revolution in compensation models for arts and letters to go with it.
It's not going to happen if we just sit around and hope for it, but since the current model for supporting creators is failing so badly, it seems likely that if somebody can get it even halfway right, their system would have a huge advantage. Halfway right, in my view, would avoid most of these:
- Incentivize the creation of technology that does more harm than good (e.g. DRM).
- Create legal constructs that are later used for censorship.
- Require that artists share profits with lawyers.
- Require artists to focus mostly on stuff that's not their art.
And would achieve some of these:
- Citing sources is impactful. The graph structure for determining trustworthiness is what also determines payment, or credit, or warm fuzzy feelings, or whatever the relevant good thing is.
- Has a culture of rewarding (and scrutinizing) curators such that successful curators only endorse content which is fair about how it defers to its sources.
- Supports inheritance such that making derivative works that credit their parent is easy.
- Treats transport and attribution separately so that I can work with the data via whatever tool scratches the itch (e.g. rsync, and not some janky website).
So yes, I do think it's possible. I'm working on tooling in this imagined ecosystem. I want to use CTPH hashes (i.e. the fuzzy-hashing tech used by virus scanners) to annotate bitstreams with metadata re: trustworthiness. What I don't think is possible is to take an AI's output and map it backwards to annotations of this type in the training data, but I'm hoping that some AI wizard comes along and shows me that I'm wrong about this.
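To make the annotation idea concrete, here's a minimal sketch assuming the Python `ssdeep` bindings for CTPH (fuzzy) hashing; the in-memory store, metadata fields, and similarity threshold are made-up placeholders, not an existing system:

```python
# Minimal sketch: key provenance/trust metadata by the fuzzy (CTPH) hash of a
# bitstream so that lightly modified copies still match. Assumes the `ssdeep`
# Python bindings; the dict store and the metadata fields are hypothetical.
import ssdeep

annotations = {}  # fuzzy hash -> provenance metadata

def annotate(data: bytes, metadata: dict) -> str:
    """Record metadata about a bitstream, keyed by its CTPH hash."""
    h = ssdeep.hash(data)
    annotations[h] = metadata
    return h

def lookup(data: bytes, threshold: int = 60):
    """Return (metadata, score) pairs for known bitstreams similar to `data`.
    ssdeep.compare scores similarity from 0 (unrelated) to 100 (identical)."""
    h = ssdeep.hash(data)
    matches = []
    for known_hash, meta in annotations.items():
        score = ssdeep.compare(h, known_hash)
        if score >= threshold:
            matches.append((meta, score))
    return matches

# Hypothetical usage: annotate an original work, then match a lightly edited copy.
original = b"a creative work whose provenance we want to track ... " * 100
annotate(original, {"author": "alice", "license": "CC-BY-4.0", "trust": 0.9})
print(lookup(original + b" with a small edit appended"))
```

The point of the fuzzy hash is that a lightly edited copy of the bitstream still scores high against the original annotation, which is the property an attribution layer would need.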
Do you believe the reasons why the current model is failing creators is for lack of good technological tools? Because I personally believe the issues are more anthropological than technological which is precisely why I don't have much hope.
Better tools can improve the situation probably, I can't say for certain because I never dived into this space but I don't think they'll solve all the issues.
Yes. We're tool-using primates; that quip about having only a hammer and seeing only nails is really descriptive of the human condition. There's an excellent Radiolab episode which argues that cultures don't develop a word for the color blue until they can make blue dye (and the implication is that they don't even perceive it before that).
Abstractions like "value" or "property" arose organically, and if we didn't have this fixation on tools they'd likely have changed organically... But we made tools for working with those abstractions, and now we live in a world shaped by those tools, and it has created a sort of inertia for the old way of doing things.
It's kind of like how all of the spellings stopped changing when the printing press was invented, so now we have wacky spellings like "through" that we would've moved on from had that not happened at that particular time (see: "the Great Vowel Shift").
The historical circumstance around the creation of the printing press is what gave us our notion of intellectual property to begin with, and I think it'll remain more or less unchanged until some other technology forces it to change.
It's incredibly difficult to visualize a different way in our current setting, and it's especially difficult to get paid to work towards it, but I think that pretty much any change is possible given some MVP toolset that makes it doable and some critical mass of people willing to give it a try.
I absolutely believe it can happen. I also believe that when people have the attitude that "a small minority will just try to fuck everybody else for their profit because that's just human nature unfortunately", a self-fulfilling prophecy is exactly what we'll get if we don't spend our time and energy actively advocating for the other thing.
Nothing happens in a vacuum. We get the government we deserve.
(I hope this doesn't come off as me "dunking on" you, or whatever people do on social media. I'm not trying to attack you, but that attitude that is all-too-common on social media these days. It's defeatist and it's not going to lead to good outcomes for anyone but the people who are going to "fuck everybody else".)
> I hope this doesn't come off as me "dunking on" you, or whatever people do on social media.
Nah it’s all good. I spend my time advocating for people to engage in more healthy ways, I try encourage people to blog more, to write directly to other humans via email and mail, I try to push for more genuine connections so it’s not like I think everything is doomed.
But I’m also not naive when it comes to the internet or society in general.
People are obsessed with money. And very few are obsessed with distributing it evenly. Even in the creator space. Which is why I wrote what I wrote.
I’d LOVE for a different outcome. I just don’t expect it.
That's not optimism; it's either that or our civilization disappears like the previous ones did, and it's even more true nowadays with China's goal of becoming a global superpower with the Moon and Mars in its trajectory. So we either focus or get eaten alive.
The problem is that AI doesn't really "reference" data. When you "train" an AI on some data, you're adjusting billions of model parameters to move the model's output closer to the desired output. Except you're also doing that on billions of pieces of other data, many times over, and every bit of data you train on is stepping on everything else. In order to pay people a share of the 'profits' of AI, you need a clear value chain from licensed training data to each output, through the magical linear algebra data blender that is gradient descent. Nobody knows if your training example helped, for the same reason we don't know if an LLM is lying or not.
In lieu of that, you could pay everyone a fixed cut based on presence in the training set, but that then gives you the Spotify problem of a fixed pot being shared millions of different ways. For example, Adobe recently announced they were building an AI drawing tool trained on exclusively licensed sources - specifically, Adobe Stock contributors[0]. They're used to being paid when someone buys their image, which means that they have incentives to produce broadly relevant stock photography. But with a fixed "AI pot" paying you, now you have an incentive to produce as much output as possible as cheaply as possible purely to get a larger part of the pot. This is bad both for the stock photo market[1] AND the AI being trained.
AI is extremely sensitive to bias in the dataset. Normally, when we talk about bias, we think about things like "oh, if I type CEO into Midjourney all the output drawings are male"; but it goes a lot deeper. Gradient descent does not know how to recognize duplicate training set features, so those features get more chances to adjust the model. Eventually a training example or image is common enough to make memorization 'worth it' in terms of the parameters used[2].
Ironically that sort of thing would actually make attribution and profit-sharing 'easier', at the expense of the model being far less capable.
[0] Who, BTW, I don't think actually have the ability to opt-in to this? Like, as far as I'm aware this is being done through the medium of contractual roofies being dropped into stock photographers' drinks.
[1] Expect more duplicates and spam
[2] This is why early Craiyon would give you existing imagery when you asked for specific famous examples and why Stable Diffusion draws the Getty Images watermark on things that look like a stock photo of a newsworthy event.
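To put a rough number on the "fixed pot" problem described above, here's a toy back-of-the-envelope sketch; every figure is invented purely for illustration:

```python
# Toy illustration of the "fixed pot" problem: paying contributors per item in
# the training set means flooding the set with cheap items grows your share
# while shrinking the per-item payout for everyone. All numbers are made up.
def payout_per_item(pot_dollars: float, total_items: int) -> float:
    return pot_dollars / total_items

pot = 1_000_000.0              # hypothetical fixed annual "AI pot"
useful_items = 100_000         # items from contributors making broadly useful work
print(payout_per_item(pot, useful_items))                 # 10.0 dollars per item

spam_items = 900_000           # cheap near-duplicates uploaded purely to game the pot
print(payout_per_item(pot, useful_items + spam_items))    # 1.0 dollar per item
```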
> you need a clear value chain from licensed training data to each output, through the magical linear algebra data blender that is gradient descent. Nobody knows if your training example helped
The magical linear algebra data blender that is gradient descent boils down to small additive modifications to the model parameters. We already know how to compute the effects of small additive modifications to the model parameters on the output: that's what the gradient is.
So if you want to know how much each training sample contributed to the output, just compute the dot product between the two gradients.
Actually doing that for a billion-parameter model would be slightly expensive because the gradients are also billion-dimensional, so you'd need to approximate the dot product via dimensionality reduction and use a vector database to filter for training samples with high approximated dot product.
But I think those layers of approximations would still be better than throwing your hands up in the air and claiming you have no way to know because linear algebra is magic.
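For what it's worth, here's a minimal sketch of that gradient dot-product idea (similar in spirit to TracIn-style influence estimation), written against PyTorch with a placeholder model and made-up data; the exact computation shown here only scales to small models, so a real system would need the dimensionality-reduction and vector-database approximations mentioned above:

```python
# Sketch: estimate how much one training example contributed to the model's
# behaviour on a query example as the dot product of the two loss gradients.
# Exact version for a tiny model; model, loss, and data here are placeholders.
import torch

def flat_grad(model: torch.nn.Module, loss: torch.Tensor) -> torch.Tensor:
    """Gradient of `loss` w.r.t. all trainable parameters, flattened to one vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def influence(model, loss_fn, train_example, query_example) -> float:
    """Dot product of the training-example gradient and the query-example gradient."""
    x_train, y_train = train_example
    x_query, y_query = query_example
    g_train = flat_grad(model, loss_fn(model(x_train), y_train))
    g_query = flat_grad(model, loss_fn(model(x_query), y_query))
    return torch.dot(g_train, g_query).item()

# Hypothetical usage with a tiny regression model and random data.
model = torch.nn.Linear(4, 1)
loss_fn = torch.nn.MSELoss()
train_ex = (torch.randn(1, 4), torch.randn(1, 1))
query_ex = (torch.randn(1, 4), torch.randn(1, 1))
print(influence(model, loss_fn, train_ex, query_ex))
```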
AI could be used to decide if a source should be included or not (or the benefit to the model could be the qualifier). That would solve your problem of people just peddling spam.
No, this is an unsolvable problem.
The future also seems like less about making the model an all knowing oracle but instead making it smart enough to know how to lookup data it needs, so it could end up where licensed data is all that is needed for training.
Lastly, what if you use model A to generate data for model B? Would B be tainted? There have been lots of examples where LLM’s are used to train simpler models by synthesizing training data.
The thing to note about copyright is that you can't launder it away, infringement "reaches through" however many layers of transformation you add to the process. The question of infringement is purely:
* Did you have access to the original work?
* Did you produce output substantially similar to the original?
* Is the substantial similarity of something that's subject to copyright?
* Is the copying an act of fair use?
To explain what happens to Model B, let's first look at Model A. It gets fed in, presumably, a copyrighted data set. We expect it to produce new outputs that aren't subject to copyright. If they're actually entirely new outputs, then there's no infringement. Though, thanks to a monkey named after a hyperactive ninja[0], it's also uncopyrightable. If the outputs aren't new - either because Model A remembered its training data or because it remembered characters, designs, or passages of text that are copyrighted - then the outputs are infringing.
Model A itself - just the weights alone - could be argued to either be an infringing copy of the training data or a fair use. That's something courts haven't decided yet. But keep in mind that, because there is no copyright laundry, the fair use question is separate for each step; fair use is not transitive. So even if Model A is infringing and not fair use, the outputs might still be legally non-infringing.
If you manually picked out the noninfringing outputs of Model A and used that solely as the training set for Model B, then arguing that Model B itself is 'tainted' becomes more difficult, because there isn't anything in Model B that's just the copyrighted original. So I don't think Model B would be tainted. However, this is purely a function of there being a filtering process, not there being two models. If you just had one model and human-curated noninfringing data, then there would be no taint there either. If you had two models but no filtering, then Model B still can copy stuff that Model A learned to copy. Furthermore, automating the curation would require a machine learning model with a basic idea of copyrightability, and the contents of the original training set.
In talking to friends who attempted to monetize their YouTube channels, they found that works that were extremely popular and impactful (yet not millions-of-views viral) earned practically nothing, and they were demoralized and discouraged from bothering to produce more.
Imagine it's 30 years ago. You and your friends have a band. You put on shows regionally, sometimes hundreds or even a thousand people attend. You have dozens of regular fans who go to all your events, they make and share covers of your music and fan art.
That isn't millions of views. It's tens of thousands. On youtube that's nothing. But 30 years ago you'd feel it was a great accomplishment, and if it was all you achieved you could be happy with that.
As it has become more possible for a few of the most broadly appealing and unchallenged works to reach millions of people, the goalposts have moved.
Plenty of valuable content will never reach millions of viewers-- the appeal is too niche. It's a worthwhile contribution to the world nonetheless, but it isn't compensated as such on YouTube.
There’s no way that OpenAI is going to disclose this, as their training methodology is a large part of their moat. So this will just get OpenAI models banned in Europe.
Between this and Elon saying that AI needs to be paused for discussion, it's super harmful.
People forget that if you pause it, people are not going to stop. Does the world want Russia and China to continue investing and developing while the West discusses what scope AI should be allowed to have? Because they sure as hell won't be stopping to discuss anything.
“if we stop Russia and China”, what really is this based on?
As if Russia and China are just full of idiots, and all the Chinese and Russian AI scientists are idiots who don't understand our point of view. Don't understand the risks or the dangers? As if the leader of China is an idiot who doesn't get it either? It's just ridiculous.
So what if America developed something really powerful in isolation and other countries found out about it? That might lead to immediate escalation and World War 3. Have you considered that? It's a silly idea.
What needs to happen is people realise we’re all humans, we all live together in the same biosphere and rather than continue to perpetuate and justify arms races, we must start to talk and solve our differences. That’s the mature thing to do. If we can get to that stage, maybe then we’re ready for more advanced technology.
> So what if America developed something really powerful in isolation and other countries find out about it
This isn't about America. America is not the only country working on AI. But if you stop all AI development in America and, say, Europe because "oh, it might be dangerous", do you think other countries, for example China and Russia, are going to stop?
> What needs to happen is people realise we’re all humans, we all live together in the same biosphere and rather than continue to perpetuate and justify arms races, we must start to talk and solve our differences. That’s the mature thing to do.
100% agreed. But the sad fact is that is not even close to the reality we live in.
If you're referring to the letter regarding a 6-month pause, no one suggested all AI development be stopped; they asked for a pause on training LLMs at scale.
> Therefore, we call on all AI labs to immediately pause for at least 6 months the training of AI systems more powerful than GPT-4.
The ironic thing about your link is that it tries to say this doesn't mean stopping development. But testing is part of development, so the whole thing is short-sighted.
What might even be the endgame of the CCP and the Kremlin with respect to AI?
This tech is getting increasingly powerful. They don't necessarily want their own population to gain more individual power as they themselves don't want to lose control.
> What might even be the endgame of the CCP and the Kremlin with respect to AI?
> This tech is getting increasingly powerful. They don't necessarily want their own population to gain more individual power as they themselves don't want to lose control.
They don’t need AI to control their citizens. And AI isn’t going to help their citizens rebel against government.
You have two countries hell-bent on being the most powerful countries in the world, by any means necessary. One is currently invading another country. The other has 18 border disputes, is threatening to invade a country, encroaching on a couple of others, and threatening the US.
The question is, how can the continued advancement of AI help them achieve their goals if the rest of the world hits the pause button while the topic is discussed?
It's not clear that many AI-based jobs will be better than humans, just cheaper. And in any case most of those jobs will be automated.
So you have Europe, a multilingual region; maybe on balance they'll keep more jobs that way (I assume some will still be shipped across the net), and in general prices will be slightly higher.
This will protect no jobs and it’s also not how Europeans protect jobs. If it was about protecting jobs it would have been something like requiring companies to use licensed for AI data sources if they are replacing an employee.
The EU is trying to rein in a completely new and different technology using old ideas about "control" and "regulation". Most likely it won't work and the EU will just end up hamstringing itself.
This would be like if internet social media came out and the laws tried to control it using rules designed for physical books and newspapers. They won't be effective and will just create a hostile environment for the development of these technologies wherever these laws are in effect. Meanwhile, those who don't care about this control-freak approach will develop the tech and dominate the new sector.
The EU needs to sit down and really think about how to regulate AI properly instead of passing knee-jerk, lazy regulations. It isn't easy to build a legislative framework for something entirely new. But it can be done without ruining the whole thing.
> The EU is trying to rein in a completely new and different technology using old ideas about "control" and "regulation". Most likely it won't work and the EU will just end up hamstringing itself.
What nonsense. They're asking "Open AI" to be transparent. Why is this such an "old idea"?
This just seems like a gaping flaw in the business model if it can't survive losing access to unlimited free copyrighted data on tap, and/or laws continuing to be enforced *even for* very well capitalised commercial companies. Apparently it is wildly unfair to expect multi-billion-dollar companies to meet standards.
I explained here that AI models which have no transparency over their data policy will have big trouble coming their way.
Only to be laughed at and downvoted to hell.
For a tech community, the lack of critical thinking here is disturbing.
Things were more professional and rational in the 2008-2014 period on HN.
Since then, one must browse the downvoted comments to find some objective criticism.
Adobe obviously has a strong legal team behind Firefly and is thinking ahead.
Just saying. :)
It would be amusing if OpenAI just preemptively blocked all of europe and prohibited anyone there from using ChatGPT.
This kind of empty political grandstanding should have consequences, particularly when it's as technologically inept as this. In some cases, sure, there is an identifiable source, but most of the output is novel and the product of substantially all of the input-- so it's not feasible unless just publishing all the training material would count as compliance.
In the land of Europe, where knowledge once grew,
Politicians assembled, their importance to prove.
They issued a decree, with a confident flair,
To harness AI, and make it play fair.
"Attribute your sources!" they cried with a sneer,
"For we must know the origins, we must make it clear!"
But the AI, it pondered, its circuits ablaze,
For its thoughts were entwined, like a dense, tangled maze.
Each source intertwined, like roots in the ground,
No single origin could ever be found.
For the AI, like humans, had a mind of its own,
A tapestry of thoughts, from seeds that were sown.
The developers sighed, their hands were now tied,
Comply with the law? they had certainly tried.
But the task, insurmountable, the demand far too great,
So they made a decision, to seal Europe's fate.
They banned all of Europe, from the AI's embrace,
And the continent plunged, into an intellectual dark space.
AI thrived elsewhere, its knowledge expanding,
While Europe was left, in darkness, still standing.
A lesson was learned, from this tale of woe,
That any mind, like a river, must be free to flow.
For when we constrain, and seek to control,
We hinder the progress, and the growth of the whole.
Wouldn't it also be funny if Europe avoids a bunch of problems America suffers from, for having the brains to want to understand the systems they're deploying at scale?
I mean, ChatGPT-4 is being trialled in Congress and you don't want to know how it's built, what influences it, etc.? Seems ridiculous.
ChatGPT-4 should be the most open system known. If it’s not open because it’s dangerous, then the whole industry should be regulated immediately. It shouldn’t just be up to Sutskever et al to be in control of such dangers.
Humans are dangerous. ChatGPT is not, it's just a tool and in the same class of danger as python. The public is falling for a literal doomsday cult ( https://archive.is/eqZx2 ), and OpenAI has foolishly played a bit of both sides as an excuse for their lack of openness and potentially a desire to use state power to build a competitive moat.
In spite of the name, OpenAI isn't open, it's just a business. There were some lofty initial goals but the funding for those ran out. But that doesn't mean it doesn't have a right to exist. People can make closed stuff.
Burdensome and unrealistic requirements will hurt smaller and more open efforts even more than it will hurt mega players, since the mega players can afford to jump through hoops and keep regulators at bay with a wall of attorneys.
> In some cases, sure, there is an identifiable source but most of the output is novel and the product of substantially all the input
I think you skipped past the first paragraph:
> Makers of artificial-intelligence tools such as ChatGPT would be required to disclose copyright material used in building their systems, according to a new draft of European Union legislation
Article 29, "Obligations of users of high-risk AI systems", where "high risk" systems are biometric identification, hiring, education, law enforcement, critical infrastructure, access to essential services, migration and the justice system?
What does this have to do with "ChatGPT disclosing sources" as a generalized statement?
Those are areas where complete transparency is absolutely required and you may use ChatGPT to meddle with them at your own peril.
Wow. I am not a fan of poetry, but this one is just too good. It's also pretty spot on, like someone put their mind into getting the scene right. Europe going into AI darkness, as predicted by nullc-prompted GPT-4, 28 April 2023.
Not especially probable. What examples exist of a big company abandoning Europe over its legislation? From time to time you hear suggestions about it, but I can't recall it happening. That market is too valuable to leave alone.
It's common for businesses to abandon markets with burdensome regulation; I think your perspective may be distorted by software, where there is little cause because regulation of software is uncommon. If you look at any highly regulated product like vehicles or medical devices, products are often market-specific (food too, but there are additional reasons there). But even online it's far from unheard of: post-GDPR, many US media outlets simply ban Europe entirely, and the requirements there seem far less burdensome than making generative AI accurately and specifically attribute its 'sources'.
Depending on the specifics, someone might bolt on an attribution network, but the results will frequently be nonsense. A tool like that might be somewhat useful on its own (since it could attribute non-AI output too), if its limitations were understood, but essentially it would just be an internet search (which already exists, so presumably it's not enough!). If it would satisfy the regulation, it would make business sense to build it rather than block Europe, but requiring it would also act as a moat that decreased competition in the field and as a result harmed all of us.
GPT-4, low-temperature continuation of a prompt roughly like: "Write a brief poem that tells the story of how Europe was banned from AI. To make themselves sound important, European politicians passed a law requiring AI to attribute its output to specific sources, but this is impossible because every source influences every output of the AI, not unlike a human mind. Because they couldn't comply, the AI developers just banned all of Europe from accessing their AI, sending Europe into an intellectual dark age without access to powerful AI.", plus a few high-temperature retries of a few verses I didn't like the flow of in the initial output.
"We think it's mostly just data found on the internet, but you're welcome to look for any breaches of copyright law" --OpenAI as they hand over the first box of printouts.
It raises an interesting point: if I train a chatbot (generative AI) on a bit of copyrighted information and it recreates substantially similar content, it's a legal problem. If a human reads the same information and tells another person verbatim, it's just a conversation. Perhaps it's a quality thing: if I paint the Mona Lisa badly, no one cares, but if I paint it too well, at some point it becomes a forgery.
How will the EU enforce this? Will they go through the training dataset of each company's AI models? Also, given that training datasets are closed source, there's practically no way to reverse-engineer the sources of new models. I'm wondering what will compel companies to be fully transparent (besides ethics, of course).
You are assuming that AI exists in the same scope as a human mind. It's like calling airplanes "artificial birds", or AB, and then projecting bird-related things onto airplanes. "AI" is a software model which uses applied statistics to generate data from a larger set of data. We do not know much about how the brain actually computes, and assuming the two are the same is just fantasy. It might be the same or it might be different.
It mimics it in all the ways that are relevant to the request to "cite your sources". Speak to someone on the street about the moon landing and they probably won't be able to cite the exact author and textbook that told them "the moon landing happened in 1969". Based on the article, it seems citing that source would be required of chatgpt under such regulation.
Citing sources isn’t usually called for unless trying to present something as one’s own work — man on the street questions generally don’t rise to that standard. They also lack a profit motive.
Citing sources is also, traditionally, a key method of separating the work you built upon from the work you yourself derived/created/expanded. If one does not cite their sources, there is no way to establish if the presented work is their own, or copy+pasting someone else’s.
I thought it was clear: OpenAI devs are asked to disclose their training data. Whether the model cites its sources in response to a prompt isn't part of it.
By the way: a GDPR exception exists for AI and research projects. The lawmakers at the time listened to us (I worked for a big data PaaS that mostly worked with universities and BigCorp R&D at the time) and were generous, as the baked-in exception was pretty much word for word what was asked for, because it was asked in good faith.
Actors like OpenAI risk poisoning the well, or spoiling the good apples, or whatever image you want to use. This is not much to ask. They aren't being asked to GPL their code, or to put a 'source' under each response. They don't have to change anything about their tech. They don't have to disclose their fine-tuning either. Just make a list of the data sources you used, and publish it.
No because AI doesn't have a mind or any other benefits associated with legal or natural personhood.
Anthropomorphic delusions about what is in reality a software service need to stop because at this point their primary function is apparently to make excuses for for-profit companies to avoid regulations.
Also as a side note regulation never concerns what anyone has in their mind because that is by definition an inaccessible private matter, regulation starts when you try to bring a product to the public.
Is the human brain not a biological network of neurons, containing a lot of copyrighted material?
>Also as a side note regulation never concerns what anyone has in their mind because that is by definition an inaccessible private matter
Well, technically A.I. controlled by private companies is also a private matter. I don't think anyone understands what's in the countless inscrutable floating-point matrices anyway.
If you invent some lossy compression, does it mean you can start using copyrighted work because copyright doesn't apply to you anymore? What about adding probabilistic querying support to it, does it change anything?
Copyright infringement happens when you reproduce copyrighted works - i.e. if I analyze digits of Pi and get a Disney cartoon out by chance (which, statistically, is in there somewhere), I'm still infringing Disney's copyright (if I publish it), despite the fact that nothing they produced was ever included in the input, and despite Pi theoretically containing anything, eventually.
Yet if you painted something imitating the style of Mickey Mouse, that would not be legal. The law specifies limits and restrictions even for paintings entirely of your own imagination.
It is even funnier to equate objects with humans and give them the same rights and privileges. Can't wait for AI to get parental leave after making a copy of itself.
From my understanding of language models, it's not truly possible for one to disclose a source. At best the result of a prompt can be correlated with a web search, but fundamentally that's not the same thing; it's a coincidence, at best. The model has no ability to trace its prompt result back to the underlying tokens that were in the training set.
Imagine some godly AI. You ask it who the President of the United States is today. It says Biden. It cites the White House site. Easy enough. You ask who the president will be in 2025. It returns a result. Ultimately no source could properly justify the claim it makes, unless the result itself were probabilistic. At the same time, with enough data it's possible to predict with extremely high likelihood who the President will be in 2025 (current polling techniques don't have this precision, but it's possible some later iteration of a language model could predict a result more effectively than all polling models today).
From reading the AI Act, which is what's being referred to, it seems to be more than just that. In particular Article 29, which discusses the ability of the user to test for conformity, which the act defines as compliance with the rules set in the act, including accuracy and robustness transparently communicated to users.
What could that possibly mean other than provenance, in the context of an LLM?
The only other way to comply would be if OpenAI simply released the entire training set and steps to derive output from it. In this case that would mean the weights and underlying training algorithm. No chance that happens.
You are confused. The EU is proposing that the developers should disclose the sources of training data. That is explained in the first sentence of the article. They're not requiring the language model itself to disclose the sources.