There also exists a working direct download link, which I can’t post because the Danish Rights Alliance is following me around DMCA’ing anything to that effect. They already struck my Twitter once [1].
But, BitTorrent is probably the only way a dataset like this can survive now. It’s a damn shame since OpenAI are the ones making money, and researchers are simply trying to replicate their scientific efforts.
By the way, you can use aria2c to download just the books3 part of that torrent [2]. Use aria2c --select-file=44 and pass the torrent URL. Takes about 30 minutes. Plus most people probably don’t have 800GB free.
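A minimal sketch of the invocation (the .torrent filename here is a placeholder; use the actual torrent from [2]):

  # list the files inside the torrent to confirm books3 is index 44
  aria2c --show-files the-pile.torrent

  # fetch only file 44 (books3), skipping the rest of the ~800GB
  aria2c --select-file=44 the-pile.torrent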
I'm surprised the issue seems to be training on copyrighted material; that seems perfectly legal. I'm more interested in these models' ability to violate copyright by reproducing it. How much of a book is Llama2 allowed to generate before it's an issue?
Bingo. This seems to be the central issue. In fact, it’s not even a copyright violation to distribute a model, since models aren’t copyrightable. They’re a collection of facts, like a phone book, produced by a purely automated process. It’s the output that counts: https://news.ycombinator.com/item?id=36691050
And even the output has been ruled not copyrightable in recent court proceedings.
Google didn't buy the books they incorporated into books.google.com either.
You can use books without buying them. e.g. libraries, or borrowing books from anyone even if they're not a "library".
Copyright is not about internal use, it's about copying and distribution.
This is not about what anyone thinks should be legal, but rather what is legal under the current law. The law was not designed for a digital era in which "use" can mean something other than a person consuming the content. That gap has not been meaningfully addressed, so these other uses are legal by default; that's how the law works.
I'm outside of my expertise here, but I think if the model is able to reproduce copyrighted text then distributing the model should be a copyright violation.
If you wanted to extract a specific text from pi, you'd have to find the location of the text within pi. Expressed as an integer, this location would amount to an encoding of the entire text and would probably be longer than the text itself. You could only find the location by explicitly searching for the exact text. The location address would effectively be a copy of the text.
On the other hand, the "address" of a text within a memorizing large language model would just be the prompt "give me the text of X".
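To put a number on that, here's a toy Python sketch (assuming mpmath is installed; the digit counts are only illustrative): for a randomly chosen k-digit string, the first match in pi's digits lands around position 10^k on average, so writing down the address takes about as many digits as the text itself.

  import random
  from mpmath import mp

  mp.dps = 100_000            # work with 100,000 decimal digits of pi
  digits = str(mp.pi)[2:]     # drop the leading "3."

  random.seed(42)
  for k in (2, 3, 4):
      target = "".join(random.choices("0123456789", k=k))
      pos = digits.find(target)
      if pos == -1:
          print(f"{k}-digit text {target!r}: not in the first {mp.dps} digits")
      else:
          print(f"{k}-digit text {target!r}: address {pos} ({len(str(pos))} digits long)")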
I have a feeling this will lead to the creation of a system that compares works, and the law will simply define a threshold to conclude whether something is a copyright violation or not.
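Purely as a sketch of what that system might look like (assuming Python; the 5-gram size and 0.2 cutoff are made-up values, not anything from case law), the comparison could be as simple as n-gram overlap:

  # Hypothetical threshold test: Jaccard similarity of word 5-grams.
  def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
      words = text.lower().split()
      return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

  def similarity(a: str, b: str) -> float:
      ga, gb = ngrams(a), ngrams(b)
      return len(ga & gb) / max(len(ga | gb), 1)

  def infringes(original: str, generated: str, threshold: float = 0.2) -> bool:
      return similarity(original, generated) >= threshold

A real system would obviously need something far more robust than exact word matches, but the point stands: the law would only have to pick the threshold.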
Imagine how outraged authors would have been if people read their books, read other authors' books, and then combined what they had learned to create new combinations of thoughts. Or if someone were to create a new painting based on things they had learned from studying a previous artist. Truly awful that technology is doing this thing that has never happened before in the history of humanity.
I think there is a lot more nuance between these two polarities. People are more important than machines, even when machines reproduce the exact steps of how people learn.
If the machine can reproduce the process of learning, there should not be a law to prevent such learning from happening, just as there wasn't a law to prevent the looms from operating while there were still weavers out there.
> For example, when an AI technology receives solely a prompt from a human and produces complex written, visual, or musical works in response, the “traditional elements of authorship” are determined and executed by the technology—not the human user. Based on the Office's understanding of the generative AI technologies currently available, users do not exercise ultimate creative control over how such systems interpret prompts and generate material. Instead, these prompts function more like instructions to a commissioned artist—they identify what the prompter wishes to have depicted, but the machine determines how those instructions are implemented in its output.
When I read comments like yours I wonder what’s more arrogant, shortsighted and asinine: thinking so highly of stupid stochastic machines, or diminishing the complexity of human brain interactions as if the brain were just a machine as well.
In both cases I’m just sorry for you.
Perhaps I'm arrogant, shortsighted and asinine. On the other hand, maybe it isn't so crazy to think that patterns of language do in some way represent thought, and that feeding lots of written language into a model designed to mimic how the brain synthesizes new ways of expressing things from its experience isn't a process completely different from what humans do when they study and then write.
That isn't to say the computer knows what it's doing, or that it won't produce a lot of nonsense. But even random Markov chain babbling will sometimes produce interesting sentences that make you stop and think. The meaning is in the mind of the reader.
If we think of LLMs as a large mixing pot that combines ideas from vast numbers of written sources, producing output that is often mundane but may occasionally spark a thought or an imaginative creative leap that humans haven't yet made, they become just another tool that humans can use to think and sometimes see things differently.
I've seen varieties of this idea many times, and it always seems pretty straw-mannish to me.
No one is asserting that computers themselves should have rights. But their users certainly do. If it is legal for me, a human person, to do something inside my own brain, why would it not also be legal for me to do a rough approximation of the same thing inside my computer?
The Luddite argument is understandable, for sure, but it does not serve progress to stall the development of AI with pseudo-copyright concerns that really just disguise the fact that such progress will displace people who currently benefit from the status quo.
How about the idea that these artists were given no choice in how their personal IP was used? Can I just steal your valuable work in the name of my own technical ‘progress?’
Boo hoo, you aren’t a victim when a computer scans your drawings. You don’t have the right to an image, and you don’t get to support intellectual “property” laws only when they benefit you.
And there you've just put your own words into my post. Nowhere did the idea of selling come into this (nor distribution of any kind).
If you publish, you implicitly allow people to purchase and read your book. The information and ideas learnt from that book are not subject to copyright.
In that case, people would probably have bought the book before reading.
I don't think authors should get royalties on GPT output, and I also don't think OpenAI needs special licensing deals in order to train, but I get that they should at least buy the books.
[1] https://torrentfreak.com/anti-piracy-group-takes-prominent-a...
[2] https://academictorrents.com/details/0d366035664fdf51cfbe9f7...