The headline is either sloppy or intentionally misleading: the Copyright Office is saying that the law surrounding whether AI generated works can be copyrighted was settled in 1965 (the answer being "yes if AI assisted a human creative process, no if not, and we have to decide on a case by case basis if there was enough human input to qualify"). This has been their stance all along, but now they've provided a bit more guidance on what counts as human input, which is helpful.
What this article doesn't talk about at all is the far more controversial AI copyright debate, the one most people will think of given the headline: whether training a model is fair use. That's the one everyone is actually concerned about, and they're definitely not claiming it was settled already.
The human input requirement makes sense. Otherwise, couldn't you brute-force generate billions of low-resolution images that cover a vast range of situations and then use them to attack anything similar enough to meet the substantial similarity condition? You could even plug a news feed into the generator.
Things like this often make me wish we had more 'common sense' laws and left the discretion of interpreting that notion to judges, juries, and the various systems of appeals and other courts we have, entirely with the expectation that laws would 'evolve' over time. This might sound radical, but it's actually just going back to how things used to be. Here is the First Amendment in its entirety:
---
"Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances."
---
The rest of the Bill of Rights looks similar. Nowadays it'd be thousands of pages long, trying to elaborate endlessly upon every single scenario. But the more important thing is that this idea of trying to codify every scenario still doesn't work, because you end up with a zillion loopholes in just about every single law, with some clever clown going 'ah hah! you didn't cover this!' So all you really do is end up with laws that are not only excessively fragile and subject to exploitation, but also completely indecipherable to just about anybody, certainly including the people voting on their passage.
See, while I agree that loopholes will abound (I completely acknowledge basically your whole last paragraph), I also want to point out the deeply insincere and frequently malicious people who appoint and confirm judges now. From both of our cravenly corrupt parties. If anything, since common sense is not merely uncommon but nearly extinct, we need even more fine-grained explicit laws, if of course we had a useful Congress to pass laws with any degree of seriousness, which we also don’t.
Basically (speaking from a US perspective) we’re doomed either way.
Many states assign judges through nonpartisan elections. I mean, I do more or less agree with you, but at the same time there are quite a lot of checks and balances: judges, appellate courts, constitutional review, and the final judge, juries. The entire government was intentionally designed to be mostly nonfunctional to prevent a tyranny of the majority (or minority). It's hard to really do things, and have them stick, unless there's pretty wide consensus.
I'd also add that simpler laws add transparency. If, e.g., a judge does something not in the apparent spirit of an older law, it's apparent to anybody with a mind. But when instead an offender (or judge) relies on a nuanced, if esoteric, interpretation of a loophole in section 274, subsection 74, paragraph 13a, nobody has any clue what's happening.
And if we're screwed either way, it'd be nice if people could at least visibly and explicitly see that they're being screwed.
I want to highlight that training the model is only one of the copyright questions in play; the other is how they are making and keeping direct long-term copies in giant training datasets.
Imagine what would happen if a regular person bought and then immediately resold thousands of books, CDs, and movies, taking just enough time to make a copy of each one and building out their own library/movie-theater for friends and coworkers. You think the powers-that-be would let you or me get away with that?
There is no (non-evil) reason to hold multijillion dollar corporations with professional legal advice on-tap to a lower standard than regular people.
You’re making the exact same kind of maximalist argument as people who argued that ripping a CD and letting your brother listen to the MP3s is exactly morally and criminally like shoplifting the disc. Or that recording a movie off TV was equal to stealing it. (That particular one was of course famously judged by the courts as “tough shit” to the copyright owners who sued over it.)
Yes, training does impart some fraction of an article into the weights. NYT famously “demonstrated” this by like typing whole paragraphs of their articles into GPT and having the model produce some of the following sentences. However, this substitutes for the article in zero ways, since if you need the article to summon the article…who cares?
We should admit that nothing about our copyright law intentionally weighed in about LLMs. It’s simply nuts to apply a law to a situation when its drafters had no idea of the positive or negative implications of such an application. It would be like applying a law that calculated prison time based on number of horses stolen to someone who stole a Honda Civic with 100hp and saying that clearly they should get 100 years because it is equal to 100 horses.
Now, I get that we have a useless legislative branch which, even if it actually passed intentionally applicable laws, would pass stupid ones, but I think making simplistic analogies like that does not help anyone other than the Luddites. Look, the cat is out of the bag, and even if, say, the US government effectively killed all Gen AI by forcing any training to be done with material you own the copyright of, countries like China (and criminals, who can simply use the tech in secret) will happily just pull ahead of us and economically demolish us - like any country would have done if a competing country had banned electricity 100 years ago.
We need something better than the ridiculously unfit 200-year-old paradigm for this.
> yes if AI assisted a human creative process, no if not
Fair enough but does that help settle the other question, which is whether weights are considered derivative works of any material used in the training?
There's not really much of a debate, just a bunch of clamoring and wishful thinking by rightsholders who don't understand copyright law insisting that precedent should be subordinate to mimetic outrage over LLMs.
In what way are ‘rightsholders’ expressing wishful thinking? I assume you are saying that there is no violation of those rights controlling various properties that have been used to train ‘AI’. You then mention precedent in a way that implies there are legal decisions that make it clear ‘AI’ training using copyrighted material does not violate the rights of those who own that material. Could you list or link to such a precedent?
To the best of my knowledge, there is no direct precedent from any federal circuit addressing this issue and certainly no USSC opinions dealing with the issue. Additionally, any analogies drawn from precedent focused on other areas of intellectual property law are easily distinguishable. This is truly fresh legal ground, and the next 10 years of jurisprudence will go a long way towards building the precedent that your comment implies already exists.
Just to be explicit, the above, while a legal opinion, IS NOT legal advice.
No amount of solidarity from support groups comprising clusters of likeminded folks on internet message boards who're opposed to settled law is a substitute for an act of Congress, which is what it will take to give the position of folks opposed to contemporary GenAI any legs.
Neither your comments to HN nor anyone else's strenuous assertions that there's anything to debate are going to change anything.
If you want to treat LLMs as a special case—which is what you want, since there is an entire history of jurisprudence that you have to contend with here—then you need to get Congress to write legislation that says so.
So if you collect things in an undisclosed database without archival rights, you are probably violating a bajillion copyrights, right?
The AI itself can be construed as a special kind of a database, given that it can be queried to reproduce at least part of its training dataset with precision...
Doesn’t it usually have to be queried with a bunch of the training material anyway? It’s a pretty foolish sounding argument that one is being harmed if it boils down to “when I type in the first half of my article into GPT it can sometimes complete a few more paragraphs of it”
(That’s what the NYT tried to argue/show. I’m not sure if that case is ongoing or settled.)
Copyright law stipulates the conditions in which content can be reproduced, not conditions in which it can be consumed.
Arguably the material has been learned and not copied. Maybe in some cases learned with an uncanny ability or photographic memory, but learned. (People with photographic memories also cannot reproduce content in an unlimited fashion).
While learned is probably not the best word to use as far as describing the legality goes, I also don’t think “copied” is the right word.
Let’s say that the model “is influenced by” the copyrighted material. That seems hard to argue against.
So, now that we aren’t using the word “learned”, why would we say that the way the models are influenced by the copyrighted works that appear in the “training set” (not to imply that “training” in the usual sense is happening) counts as a copyright violation?
Or, perhaps the claim is that the outputs of the model are violating copyright?
If the output is substantially similar to some particular copyrighted work that is in the “training set”, and could work as a substitute for that work, and if the output resembling the work is in part due to the influence that the work had on the model, I think in this case it would be a clear case of violation of copyright.
However, if it doesn’t have substantial similarity to any particular copyrighted work that influenced it, only similarity to the style common to many of the works that influenced it (even if all by the same author), my impression is that this would not constitute copyright infringement because styles are not protected by copyright, only individual works.
(Now, is this unjust, in the case of it copying the style of some particular author/artist? Idk, maybe? But my impression is that copyright doesn’t protect styles, and that it probably shouldn’t protect styles in general… so I guess maybe if we had a law making a special case forbidding the (deliberate?) copying of a person’s particular style via some kind of machine learning model? Idk.)
The argument is that the LLM itself is essentially like a complex lossy archive of its entire training set. It's like an mp3 of all of the songs on Spotify, in some sense (of course, using all text on the internet instead of all songs on Spotify). This is the sense in which it is considered to be a copy of all of this.
Very insightful. Consider there are many "recreational imitators" who mimic how specific (famous) people speak. They are not violating copyright, they are just imitating a way of speaking.
I don’t think this is a good argument because the way of speaking isn’t a copyright issue. I don’t think you have copyright on your specific way of speaking, only on specific recordings of you yourself speaking.
I think I'm saying the same thing. Style is not something that can be copyrighted but I hadn't much thought about it before. I'm not a lawyer. I guess trademarks are something, and design-patents perhaps that let you have IP over style. :-)
> "Where a human inputs their own copyrightable work and that work is perceptible in the output, they will be the author of at least that portion of the output," the guidelines said.
This policy is sensible. Most AI generated works should be uncopyrightable, except where a substantial human contribution is in the output.
Simply describing a picture and letting AI generate it shouldn't be enough for the same reason that dictating what you want to a painter isn't enough to earn you copyright over the resulting painting.
I would be wary about integrating too much AI output into works one wants to enforce copyright over without some level of documentation. The nightmare scenario is having your copyright stripped away because of evidence one used AI extensively.
> Simply describing a picture and letting AI generate it shouldn't be enough
Interesting take, and I've heard this many times. I'm curious to explore this further and see why you think that is, and where do you draw the line?
Is it the "low effort"? Is it the "automated" stuff? Is the process of setting it up, prompting it and choosing a result not enough "creative input"? If so, why?
Let's take a "real world" example as an analogy. Say I set up a camera on a tripod. I set it to take a picture every second, and leave it there. I come back an hour later and go through the pictures. I select one of the sunset that I like, and post it. Would I not have copyright on that picture? I wasn't there when it was taken. But I did set up the camera and select the end result. How is that different?
Taking it back to genAI, say I build/train/finetune my own model. Would it now have enough "effort" from me that I can use those generations? Is this an effort thing or is it more? Or is it just that someone else did the work?
What about random "art"? As in art based on random numbers. Say I write a script in python to use random math formulas to "draw" on a canvas. I let it run for a couple of hours, come back, look at the results and select one. Do I not get copyright on the resulting "art" because it was randomly generated by a script? Does it matter if the script was written by me? Would it be different if you download my script and generate the art yourself? Would you not have copyright?
I guess what I'm trying to say is "where do we draw the line?". It's not clear to me why people say "simply prompting and selecting isn't creative enough". This distinction wasn't there before. Plenty of "art" out there based on random processes + curation. Why the sudden change?
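To make the random "art" scenario above concrete, here is a minimal sketch of the kind of script I have in mind. Everything in it is made up for illustration: the formula pool, the `generate_candidate` and `generate_batch` names, and the ASCII shading are all assumptions, not any real tool. The script produces candidates from random parameters and leaves the only human step, picking one, to whoever runs it.

```python
import math
import random

# Hypothetical pool of "math formulas"; any function of (x, y, a, b)
# that stays within [-1, 1] would work here.
FORMULAS = [
    lambda x, y, a, b: math.sin(a * x) * math.cos(b * y),
    lambda x, y, a, b: math.sin(a * x * y + b),
    lambda x, y, a, b: math.cos(a * math.hypot(x, y)) * math.sin(b * x),
]

# Characters ordered from light to dark, standing in for pixel intensity.
SHADES = " .:-=+*#%@"


def generate_candidate(rng, width=48, height=24):
    """Draw one ASCII "canvas" from a randomly chosen formula and parameters."""
    formula = rng.choice(FORMULAS)
    a = rng.uniform(1.0, 8.0)
    b = rng.uniform(1.0, 8.0)
    rows = []
    for j in range(height):
        y = 2.0 * j / (height - 1) - 1.0  # map row index to [-1, 1]
        row = []
        for i in range(width):
            x = 2.0 * i / (width - 1) - 1.0  # map column index to [-1, 1]
            value = formula(x, y, a, b)
            # Rescale [-1, 1] to a shade index, clamped against rounding error.
            index = int((value + 1.0) / 2.0 * (len(SHADES) - 1))
            index = max(0, min(len(SHADES) - 1, index))
            row.append(SHADES[index])
        rows.append("".join(row))
    return rows


def generate_batch(seed, count):
    """Generate `count` candidates; the same seed reproduces the same batch."""
    rng = random.Random(seed)
    return [generate_candidate(rng) for _ in range(count)]


if __name__ == "__main__":
    # The script does all the drawing; the human only looks and chooses.
    for number, art in enumerate(generate_batch(seed=2025, count=3)):
        print(f"--- candidate {number} ---")
        print("\n".join(art))
```

Note that the human contribution here is writing the formulas and choosing a result. Anyone who downloads the script and uses the same seed gets exactly the same candidates, which is the "would you not have copyright?" question in a nutshell.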
Courts have ruled that compilation does not remove originality—the binary is still a transformation of an original, copyrighted work (the source code).
US copyright applications are not examined, in the sense that patents are.
Issued patents are presumptively valid. Registered copyrights are not.
Whether a copyright application is valid has to be determined by a court.
I'm pretty confident the copyright office was massively overthinking it in 1965 and knocked it out of the park far beyond the watered down and ignorant arguments we hear today. It's sad really.
When uploading books to kindle direct publishing you have to state that you own the copyright and publishing right.
So any book or story on Amazon that was generated substantially via prompting should now have to be removed based on this guidance from the copyright office.
Purely factual books are copyrightable. It is the collection and curation of those facts that is protected. You cannot just copy someone else's 100 Amazing Facts about The Rainforest verbatim; if you publish 100 Cool Truths about The Jungle and it has those same 100 facts, you'll get sued and they'll easily win.
The EU and the UK generally have something akin to "sweat of the brow", where collections of facts that took time to collate are copyrightable.
But in the US, Feist v Rural explicitly disavowed the sweat of the brow doctrine, and said that facts have no copyright value--a work requires a quantum of original creative spark to be copyrightable (it was discussed in the context of phone books--a phone book does still have some residual "thin copyright", but the listing of phone numbers is not copyrightable, and it is actually difficult to infringe on the thin copyright of a phone book). In the US, your example would easily be found to be not infringing, if the only similarity were reproducing the same 100 facts.
However, if it states the facts in exactly the same way, that could be considered infringing because of the creativity presumably involved in deciding how to state each fact.
"Elephants are enormous mammals, usually grey in color, with significant intelligence and social habits. They are native to portions of Africa and Asia, with different elephant species found in each region. They are famous for their strong and dexterous trunks, which can also be used to communicate something like a trumpet. Humans, especially in South Asia, have long admired elephants and used them for transportation and various kinds of work; African elephants are famous for having been ridden to war by the Carthaginians, to the dismay of their Roman opponents. Today elephants are significantly threatened by various human activities, both those intentionally directed at the elephants (like killing them for meat or for the ivory derived from their tusks) and those not deliberately meant to affect them (like deforestation). While our word for elephants comes from the Greek, it was probably borrowed by the ancient Greeks from another language family."
I just wrote this paragraph about elephants based on my own knowledge. There is no copyrightability of the substantive information here (e.g. if you learned something new about elephants, you can tell other people) but there is probably some copyrightability of the paragraph based on things like the creativity of my word choices. That distinction can sometimes create confusion in discussions about "facts", and I'm not positive that legal standards that are meant to clarify it have always given a clear and workable rule.