The Authors Whose Pirated Books Are Powering Generative AI (theatlantic.com)
51 points by LispSporks22 on Aug 19, 2023 | 52 comments



This is an analysis of the "Books3" dataset. The original download link was taken down [1] but it's still available from Academic Torrents [2].

[1] https://torrentfreak.com/anti-piracy-group-takes-prominent-a...

[2] https://academictorrents.com/details/0d366035664fdf51cfbe9f7...


There also exists a working direct download link which I can’t post because the Danish Rights Alliance is following me around DMCA’ing anything to that effect. They already struck my Twitter once.

But, BitTorrent is probably the only way a dataset like this can survive now. It’s a damn shame since OpenAI are the ones making money, and researchers are simply trying to replicate their scientific efforts.

Original books3 announcement thread: https://x.com/theshawwn/status/1320282149329784833?s=61&t=jQ...

An article on books3 from Gizmodo, which I quite like: https://gizmodo.com/anti-piracy-group-takes-ai-training-data...

By the way, you can use aria2c to download just the books3 part of that torrent: run aria2c --select-file=44 and pass the torrent URL. Takes about 30 minutes. Plus, most people probably don't have 800GB free.


> Plus most people probably don’t have 800GB free.

They should make space. Data sets in general are hugely slept on among hobbyists.

Even just what's publicly available, like Wikipedia's dumps: you can do so much fun stuff with them and a few days of compute time!
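
For a taste, here's a minimal Python sketch of streaming a compressed pages-articles dump without loading it into memory. The filename and the exact export namespace are assumptions; they vary by dump version:

    # Sketch: stream-parse a Wikipedia XML dump and count pages.
    import bz2
    import xml.etree.ElementTree as ET

    count = 0
    with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as f:
        for _, elem in ET.iterparse(f):
            # Tags carry the MediaWiki export namespace, e.g.
            # "{http://www.mediawiki.org/xml/export-0.10/}page"
            if elem.tag.endswith("page"):
                count += 1
                elem.clear()  # release the subtree so memory stays mostly flat
    print(count, "pages")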


I'm surprised the issue seems to be training on copyrighted material, that seems perfectly legal. I'm more interested in the ability of these models to violate copyright by reproducing it. How much of a book is Llama2 allowed to generate before it's an issue?


Bingo. This seems to be the central issue. In fact, it's not even copyright violation to distribute a model, since models aren't copyrightable. They're a collection of facts, like a phone book, produced by a purely automated process. It's the output that counts: https://news.ycombinator.com/item?id=36691050

And even the output has been ruled not copyrightable in recent court proceedings.


So you think training should be legal, but actually using the trained model to generate copyrighted text should be illegal?

Edit: As a secondary point, it's not like Meta bought all those books. They downloaded unlicensed copies.


Google didn't buy the books they incorporated into books.google.com either.

You can use books without buying them: libraries, for example, or borrowing books from anyone even if they're not a "library".

Copyright is not about internal use, it's about copying and distribution.

This is not about what anyone thinks should be legal, but about what is legal under current law. The law was not designed for a digital era in which "use" can mean something other than a person consuming the content, and that gap has never been meaningfully addressed, so these other uses are legal by default; that's how the law works.


> Google didn't buy the books they incorporated into books.google.com either.

How did they scan them in then?


They went to brick-and-mortar book piracy organizations, also known as libraries.


And if you had 'read' them from a library?


Getting a book from a library does not involve making a copy. So copyright is irrelevant.


...Uh, sure. Yeah. Totes.


I'm outside of my expertise here, but I think if the model is able to reproduce copyrighted text then distributing the model should be a copyright violation.


A random text generator can generate copyrighted material.


The model can produce much more than just the original text.

By your logic, distributing the digits of pi could be construed as copyright infringement.


Seems different if you can retrieve the text of a book by prompting the model with “Give me the text of X”. You can’t do that with the digits of pi.


> You can’t do that with the digits of pi

Of course you can; pi contains every finite combination of digits (assuming it's normal, which is conjectured but unproven).


If you wanted to extract a specific text from pi, you'd have to find the location of the text within pi. Expressed as an integer, this location would amount to an encoding of the entire text and would probably be longer than the text itself. You could only find the location by explicitly searching for the exact text. The location address would effectively be a copy of the text.

On the other hand, the "address" of a text within a memorizing large language model would just be the prompt "give me the text of X".
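
To put a rough number on that, here's a toy Python sketch (assuming mpmath is installed; the 3-digit "texts" are purely illustrative). The first occurrence of a d-digit string lands, on average, around position 10^d, so the offset takes about as many digits to write down as the string it retrieves:

    # Toy: the "address" of a string in pi is about as long as the string.
    from mpmath import mp

    mp.dps = 100_002                      # compute ~100k decimal digits
    digits = mp.nstr(mp.pi, 100_000)[2:]  # drop the leading "3."

    # First-occurrence offset of every 3-digit string: on the order of
    # 10**3, i.e. the "address" is itself roughly as long as the text.
    offsets = [digits.find("%03d" % n) for n in range(1000)]
    print(sum(offsets) / len(offsets))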


The argument is largely a philosophical (and therefore legal) one.

The ease with which these "retrieval" operations can be done is irrelevant.


This is suspected, but has not actually been proven.


I have a feeling this will lead to the creation of a system that compares works, with the law simply defining a similarity threshold above which something counts as copyright violation.
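
Such a system might be nothing more exotic than shingled n-gram overlap. Here's a minimal Python sketch; the shingle size and the threshold are made-up values for illustration, not anything a court has blessed:

    # Hypothetical comparison: word 5-gram Jaccard similarity.
    def shingles(text, n=5):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def similarity(a, b):
        sa, sb = shingles(a), shingles(b)
        if not sa or not sb:
            return 0.0
        return len(sa & sb) / len(sa | sb)

    THRESHOLD = 0.2  # whatever cutoff the law settles on
    a = "call me ishmael some years ago never mind how long precisely"
    b = "call me ishmael some years ago never mind how long exactly"
    print(similarity(a, b), similarity(a, b) > THRESHOLD)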


I believe this dataset came from https://news.ycombinator.com/user?id=shawn



Close! From me. :) That’s a former account.


Imagine how outraged authors would have been if people read their books, read other authors' books, and then combined what they had learned to create new combinations of thoughts. Or if someone were to create a new painting based on things they had learned from studying a previous artist. Truly awful that technology is doing this thing that has never happened before in the history of humanity.


I think there is a lot more nuance between these two polarities. People are more important than machines, even when machines reproduce the exact steps of how people learn.


> People are more important than machines

What does "important" mean here?

If the machine can reproduce the process of learning, there should not be a law to prevent such learning from happening, just as there was no law to stop the looms from operating while there were still weavers out there.


https://www.federalregister.gov/documents/2023/03/16/2023-05...

> For example, when an AI technology receives solely a prompt from a human and produces complex written, visual, or musical works in response, the “traditional elements of authorship” are determined and executed by the technology—not the human user. Based on the Office's understanding of the generative AI technologies currently available, users do not exercise ultimate creative control over how such systems interpret prompts and generate material. Instead, these prompts function more like instructions to a commissioned artist—they identify what the prompter wishes to have depicted, but the machine determines how those instructions are implemented in its output.


When I read comments like yours I wonder which is more arrogant, shortsighted, and asinine: thinking so highly of stupid stochastic machines, or diminishing the complexity of human brain interactions as if the brain were just a machine as well. In both cases I'm just sorry for you.


Perhaps I'm arrogant, shortsighted, and asinine. On the other hand, maybe it isn't so crazy to think that patterns of language do in some way represent thought, and that feeding lots of written language into a model designed to mimic the process of expressing ideas, the way the brain synthesizes new ways of expressing things based on its experience, isn't a process completely different from what humans are doing when they study and then write.

That isn't to say that the computer knows what it is doing, and it isn't to say that the computer won't produce a lot of nonsense. But even random Markov-chain babbling will sometimes produce interesting sentences that make you stop and think. The meaning is in the mind of the reader.

If we think of LLMs as a large mixing pot that combines ideas from vast numbers of written sources in what is often mundane, but may occasionally spark a thought or an imaginative creative leap that humans haven't yet made, they become just another tool that humans can use to think and sometimes see things differently.


Came here expecting to see the usual nonsense "b-but computers are exactly like people and deserve the same rights!" comment, was not disappointed.


I've seen several varieties of this idea come up many times, and it always seems pretty straw-mannish to me.

No one is asserting that computers themselves should have rights. But their users certainly do. If it is legal for me, a human person, to do something inside my own brain, why would it not also be legal for me to do a rough approximation of the same thing inside my computer?


The Luddite argument is understandable, for sure, but it does not serve progress to stall the development of AI with pseudo-copyright concerns that really just disguise the fear that such progress will displace people who currently benefit from the status quo.


Why do AI grifters love the word "luddite" so much?


How about the idea that these artists were given no choice in how their personal IP was used? Can I just steal your valuable work in the name of my own technical "progress"?


Boo hoo, you aren’t a victim when a computer scans your drawings. You don’t have the right to an image, and you don’t get to support intellectual “property” laws only when they benefit you.


Where else would you source large amounts of text but from human authors?


Asking those human authors for their consent. "It doesn't scale!" or similar reasons don't override someone not consenting.


> Asking those human authors for their consent.

They consented by virtue of publishing the work. If a human eye can read it, then consent has been given.


No, I'm pretty sure that publishing a work does not imply that you also consent to other people taking it and selling it as their own.


> selling it as their own.

And there you've just put your own words into my post. Nowhere did the idea of selling come into this (nor distribution of any kind).

If you publish, you implicitly allow people to purchase and read your book. The information and ideas learned from that book are not subject to copyright.


If you think OpenAI purchased all of this content, I've got a bridge to sell you in Brooklyn.


They are consenting to letting people understand the ideas in their work, and to people using those ideas, combined with other ideas, to produce different ideas.


That’s not what I was asking.


Maybe you don't.


[flagged]


New version of Godwin's law: any use of the word Luddite to "win" an argument about AI means the argument is over and you automatically lost.


In that case, people would probably have bought the book before reading.

I don't think authors should get royalties on GPT output, and I also don't think OpenAI needs special licensing deals in order to train, but I agree they should at least buy the books.


Libraries bring an author's dollars-per-reader down considerably.


It is debatable whether LLMs are really "creating" anything, but we can agree that they aren't advanced enough to "think".



I'm especially interested in how much damage ChatGPT has done to these authors.


'Powering'

It's like pissing in the ocean and thinking you caused a tsunami in Southeast Asia.



