The Authors Whose Pirated Books Are Powering Generative AI

mutant_glofish · on Aug 19, 2023

This is an analysis of the "Books3" dataset. The original download link was taken down [1] but it's still available from Academic Torrents [2].

[1] https://torrentfreak.com/anti-piracy-group-takes-prominent-a...

[2]: https://academictorrents.com/details/0d366035664fdf51cfbe9f7...

sillysaurusx · on Aug 20, 2023

There also exists a working direct download link which I can’t post because the Danish Rights Alliance is following me around DMCA’ing anything to that effect. They already struck my Twitter once.

But, BitTorrent is probably the only way a dataset like this can survive now. It’s a damn shame since OpenAI are the ones making money, and researchers are simply trying to replicate their scientific efforts.

Original books3 announcement thread: https://x.com/theshawwn/status/1320282149329784833?s=61&t=jQ...

An article on books3 from Gizmodo, which I quite like: https://gizmodo.com/anti-piracy-group-takes-ai-training-data...

By the way, you can use aria2c to download just the books3 part of that torrent. Use aria2c —-select-file=44 and pass the torrent url. Takes about 30 minutes. Plus most people probably don’t have 800GB free.

marginalia_nu · on Aug 20, 2023

> Plus most people probably don’t have 800GB free.

They should make space. Data sets in general are hugely slept on among hobbyists.

Even just what's publicly available, like Wikipedia's dumps; you can do so much fun stuff with them and a few days of compute time!

aimor · on Aug 20, 2023

I'm surprised the issue seems to be training on copyrighted material, that seems perfectly legal. I'm more interested in the ability of these models to violate copyright by reproducing it. How much of a book is Llama2 allowed to generate before it's an issue?

sillysaurusx · on Aug 20, 2023

Bingo. This seems to be the central issue. In fact, it’s not even copyright violation to distribute a model, since models aren’t copyrightable. They’re a collection of facts, like a phone book, produced by a purely automated process. It’s the outout that counts: https://news.ycombinator.com/item?id=36691050

And even the output has been ruled not copyrightable in recent court proceedings.

sp332 · on Aug 20, 2023

So you think training should be legal, but actually using the trained model to generate copyrighted text should be illegal?

Edit: As a secondary point, it's not like Meta bought all those books. They downloaded unlicensed copies.

harshreality · on Aug 20, 2023

Google didn't buy the books they incorporated into books.google.com either.

You can use books without buying them. e.g. libraries, or borrowing books from anyone even if they're not a "library".

Copyright is not about internal use, it's about copying and distribution.

This is not about what anyone thinks should be legal, but rather what is legal under the current law. The law was not designed for the digital era where "use" could be something other than a person consuming the content, and this has not been meaningfully addressed, therefore other uses are legal by default because that's how the law works.

sp332 · on Aug 20, 2023

Google didn't buy the books they incorporated into books.google.com either.

How did they scan them in then?

zarzavat · on Aug 20, 2023

They went to brick and mortar book piracy organizations, also known as libraries.

pixl97 · on Aug 20, 2023

And if you had 'read' them from a library?

sp332 · on Aug 20, 2023

Getting a book from a library does not involve making a copy. So copyright is irrelevant.

salawat · on Aug 20, 2023

...Uh, sure. Yeah. Totes.

aimor · on Aug 20, 2023

I'm outside of my expertise here, but I think if the model is able to reproduce copyrighted text then distributing the model should be a copyright violation.

cdot2 · on Aug 20, 2023

A random text generator can generate copyrighted material.

chii · on Aug 20, 2023

the model can produce much more than just the original text.

By your logic, distributing the digits of pi could be construed as copyright infringement otherwise.

canjobear · on Aug 20, 2023

Seems different if you can retrieve the text of a book by prompting the model with “Give me the text of X”. You can’t do that with the digits of pi.

chii · on Aug 20, 2023

> You can’t do that with the digits of pi

of course you can - pi contains all known combinations of digits.

canjobear · on Aug 21, 2023

If you wanted to extract a specific text from pi, you'd have to find the location of the text within pi. Expressed as an integer, this location would amount to an encoding of the entire text and would probably be longer than the text itself. You could only find the location by explicitly searching for the exact text. The location address would effectively be a copy of the text.

On the other hand, the "address" of a text within a memorizing large language model would just be the prompt "give me the text of X".

chii · on Aug 21, 2023

The argument is largely a philosophical (and therefore, legal) one.

The ease with which these "retrieval" operations can be done is irrelevant.

sp332 · on Aug 20, 2023

This is suspected, but has not actually been proven.

readyplayernull · on Aug 20, 2023

I have a feeling this will lead to the creation of a system that compares works and the law will simply define a given threshold to conclude whether it's copyright violation or not.

mmastrac · on Aug 19, 2023

I believe this dataset came from https://news.ycombinator.com/user?id=shawn

upwardbound · on Aug 20, 2023

I think you're correct: https://twitter.com/theshawwn/status/1320282149329784833?lan...

sillysaurusx · on Aug 20, 2023

Close! From me. :) That’s a former account.

troupe · on Aug 19, 2023

Imagine how outraged authors would have been if people read their books, read other author's books, and then combine what they had learned to create new combination of thoughts. Or if someone were to create a new painting based on things they had learned from studying a previous artist. Truly awful that technology is doing this thing that has never happened before in the history of humanity.

127 · on Aug 20, 2023

I think there is a lot more nuance between these two polarities. People are more important than machines, even when machines reproduce the exact steps of how people learn.

chii · on Aug 20, 2023

> People are more important than machines

what does 'important' mean here?

If the machine can reproduce the process of learning, there should not be a law to prevent such learning from happening. Just like there wasn't a law to prevent the looms from operating while there are weavers out there.

cegras · on Aug 20, 2023

https://www.federalregister.gov/documents/2023/03/16/2023-05...

> For example, when an AI technology receives solely a prompt from a human and produces complex written, visual, or musical works in response, the “traditional elements of authorship” are determined and executed by the technology—not the human user. Based on the Office's understanding of the generative AI technologies currently available, users do not exercise ultimate creative control over how such systems interpret prompts and generate material. Instead, these prompts function more like instructions to a commissioned artist—they identify what the prompter wishes to have depicted, but the machine determines how those instructions are implemented in its output.

camillomiller · on Aug 20, 2023

When I read comments like yours I wonder what’s more arrogant, shortsighted and asinine—- thinking so high of stupid stochastic machines or diminishing the complexity of human brain interactions as if the brain was just a machine as well. In both cases I’m just sorry for you.

troupe · on Aug 20, 2023

Perhaps I'm arrogant, shortsighted and asinine. On the other hand, maybe it isn't so crazy to think that patterns of language do in some way represent thought and feeding lots of written language into a language model that is designed to mimic the process of expressing ideas in the way that the brain synthesizes new ways of expressing things based on it's experience, isn't a process completely different than what humans are doing when they study and then write.

That isn't to say that the computer knows what it is doing and it isn't to say that the computer isn't going to produce a lot of nonsense. But even random markov chain babbling will sometimes produce interesting sentences that make you stop and think. The meaning is in the mind of the reader.

If we think of LLMs as a large mixing pot that combines ideas from vast numbers of written sources in what is often mundane, but may occasionally spark a thought or an imaginative creative leap that humans haven't yet made, they become just another tool that humans can use to think and sometimes see things differently.

bakugo · on Aug 19, 2023

Came here expecting to see the usual nonsense "b-but computers are exactly like people and deserve the same rights!" comment, was not disappointed.

drabiega · on Aug 20, 2023

I've seen several varieties of this idea several times, and it always seems pretty straw-mannish to me.

No one is asserting that computers themselves should have rights. But their users certainly do. If it is legal for me, a human person, to do something inside my own brain, why would it not also be legal for me to do a rough approximation of the same thing inside my computer?

chii · on Aug 20, 2023

The luddite argument is understandable for sure, but it is not in the advantage of progress to stall the development of ai with pseudo copyright concerns that are really just disguising the idea that such progress will displace existing people who currently derive benefit from the status quo.

bakugo · on Aug 20, 2023

Why do AI grifters love the word "luddite" so much?

mslt · on Aug 20, 2023

How about the idea that these artists were given no choice in how their personal IP was used? Can I just steal your valuable work in the name of my own technical ‘progress?’

alphanullmeric · on Aug 20, 2023

Boo hoo, you aren’t a victim when a computer scans your drawings. You don’t have the right to an image, and you don’t get to support intellectual “property” laws only when they benefit you.

throwaway5959 · on Aug 20, 2023

Where else would you source large amounts of text but from human authors?

notamy · on Aug 20, 2023

Asking those human authors for their consent. "It doesn't scale!" or similar reasons don't override someone not consenting.

chii · on Aug 20, 2023

> Asking those human authors for their consent.

they consented by virtue of publishing the work. If a human eye can read it, then consent has been given.

bakugo · on Aug 20, 2023

No, I'm pretty sure that publishing a work does not imply that you also consent to other people taking it and selling it as their own.

chii · on Aug 20, 2023

> selling it as their own.

and there you've just put your own words into my post. Nowhere did the idea of selling comes into this (nor distribution of any kind).

If you publish, you implicitly allow people to purchase and read your book. The information and ideas learnt off that book is not subject to copyright.

mslt · on Aug 20, 2023

If you think open AI purchased all of this content I’ve got a bridge to sell you in Brooklyn

troupe · on Aug 20, 2023

They are consenting to letting people understand the ideas in their work and using those ideas combined with other ideas to produce different ideas.

throwaway5959 · on Aug 20, 2023

That’s not what I was asking.

rideontime · on Aug 20, 2023

Maybe you don't.

gaganyaan · on Aug 20, 2023

[flagged]

DangitBobby · on Aug 20, 2023

New version of Godwin's law: any use of the word Luddite to "win" an argument about AI means the argument is over and you automatically lost.

wodenokoto · on Aug 20, 2023

In that case, people would probably have bought the book before reading.

I don't think authors should get royalties on GPT output, and I also don't think OpenAI needs special licensing deals in order to train, but I get that they should at least buy the books.

jejones3141 · on Aug 20, 2023

Libraries bring author $$$/reader down considerably.

manuelabeledo · on Aug 20, 2023

It is debatable that LLMs are really "creating" anything, but we can agree that they aren't advanced enough to "think".

sillysaurusx · on Aug 20, 2023

https://archive.is/OgTZF

beej71 · on Aug 20, 2023

I'm especially interested in how much damage ChatGPT has done these authors.

kromem · on Aug 19, 2023

'Powering'

It's like pissing in the ocean and thinking you caused a tsunami in Southeast Asia.