Microsoft, OpenAI sued for ChatGPT 'privacy violations' (theregister.com)
194 points by Nadeus on June 29, 2023 | 225 comments



>For the 16 plaintiffs, the complaint indicates that they used ChatGPT, as well as other internet services like Reddit, and expected that their digital interactions would not be incorporated into an AI model.

I don't expect this lawsuit to lead anywhere. But if it does, I hope it leads to some clear laws regarding data privacy and how a TOS is binding. The recent ruling regarding web scraping makes the case against OpenAI a lot weaker. [1] Scraping publicly available data is legal. People didn't need to consent to having their data used; there was an implicit assumption the moment the data was published to the public, like on Reddit or YouTube.

I keep seeing this idea reoccur in the suit:

>Plaintiff ... is concerned that Defendants have taken her skills and expertise, as reflected in [their] online contributions, and incorporated it into Products that could someday result in [their] professional obsolescence ...

Anyone is able to file a suit. I wish people would stop assuming that a news report automatically means it has merit.

1. https://www.natlawreview.com/article/hiq-and-linkedin-reach-...


> But if it does, I hope it leads to some clear laws regarding data privacy and how TOS is binding.

One of the "I wonder where this will go" things with the reddit and twitter exoduses to activity pub based systems is that it is trivial for something to federate with it and slurp data without any TOS interposed.

The TOSes for these systems are typically based on what can be pushed to them - not what can be read (possibly multiple federations downstream).


Article titles which specify the plaintiff's claimed amount are a good indicator of poor journalism.

You can usually disregard such articles as you can expect biased/incomplete reporting.

Lawsuit claim amounts have zero bearing on reality. They must be specified in any complaint, but lawyers just always specify massive amounts without justification.

Any reporting on this amount indicates ignorance of the system or intentional dishonesty.


Also note that the damages typically can't be adjusted up, only down.


Regardless of access rights to the data, I've yet to read a compelling argument why LLMs are even derivative works. You can't identify your Reddit comment in a ChatGPT conversation. How is it any different than a human learning English by reading Reddit? That human wouldn't be violating copyright every time they said a phrase that was repeated by hundreds of Redditors.

My favorite LLM analogy so far is the "lossy jpeg of the web." Within that metaphor, I don't see how anyone can claim copyright on the basis of a pixel they contributed that doesn't even show up in the lossy jpeg. They can't point to it.
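
To make the metaphor concrete, here's a toy sketch of the same idea with an actual lossy JPEG (assuming Pillow is installed; the file name and pixel coordinates are made up):

    import io
    from PIL import Image

    # Compress an image aggressively, then check whether one contributed
    # pixel survives with its exact value ("photo.png" is a stand-in).
    original = Image.open("photo.png").convert("RGB")

    buffer = io.BytesIO()
    original.save(buffer, format="JPEG", quality=5)  # very lossy
    buffer.seek(0)
    compressed = Image.open(buffer)

    x, y = 10, 10
    print("original:  ", original.getpixel((x, y)))
    print("compressed:", compressed.getpixel((x, y)))  # almost always differs

The contributed pixel influenced the compressed file, but you can't point to it anywhere in the output.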


I've been thinking of the output as fanfiction/fan art. It shares many of the same complications regarding the ownership of ideas, commercial intent of writing, competition, and copyright. Fanfiction is generally a protected form of expression, but requires the work to be "transformative". Unlike with parodies and criticisms, fanfiction can be much harder to distinguish from original work. From that perspective, a large amount of the output of LLMs is so generic that it's not possible to attribute it to one person. It's like trying to find the original author of "Once upon a time".

https://theinnisherald.com/the-other-once-upon-a-times-a-his...


Fanfiction isn't as protected as many people think it is.

https://en.wikipedia.org/wiki/Legal_issues_with_fan_fiction

Fanfiction and fan art also tend to run afoul of the infrequently (but occasionally) litigated part of copyright - copyright of fictional characters.

https://en.wikipedia.org/wiki/Copyright_protection_for_ficti...

I came across this with the Eleanor lawsuits - https://www.caranddriver.com/news/a42233053/shelby-estate-wi... - and while I believe that in that instance Eleanor falls on the "this shouldn't have been copyrightable" side (it took a bit to get there), the question is "what protects the representation of Darth Vader?"

In general it tends to be ignored and tacitly encouraged... but it isn't protected.


It's more like a mirror-house of human thought. It can create countless arrangements and even execute tasks.


> Plaintiff ... is concerned that Defendants have taken her skills and expertise, as reflected in [their] online contributions, and incorporated it into Products that could someday result in [their] professional obsolescence ...

It's been a bit surreal seeing modern-day Luddites come out of the woodwork, basically coming up with any ethical/legal argument they can as a thinly veiled way of saying "I don't want to be automated!"

Not commenting on whether or not they are right per se, but it's weird seeing history repeat itself.


I don't think it's a matter of right or wrong - these are people who are behaving completely rationally given their context.

(I should caveat that I think if they get what they want, we all lose in a big way. Not that I think this is going anywhere.)

We're coming up on the outer bounds of our systems of incentives. Capitalism, as a system, is designed to solve for scarcity, both in terms of resources and in terms of skill and effort. Unfortunately, one of the core mechanisms it operates on is that it's all-or-nothing. You MUST find a scarcity to solve or you divorce yourself from the flow of capital (and starve / become homeless as a result).

Thus, artificial scarcity. It's easy to spot in places like manufacturing (planned obsolescence), IP (drug/software/etc. patents), and so forth. I think this is just the rest of humanity both catching on and being caught up with. Two years ago, everyone thought they had a moat by virtue of being human. That's no longer a given.

One hopes that we'll collectively notice the rot in the foundation before the house falls over (and, critically, figure out how to act on it. We have a real problem with collective action these days that may well put us all in the ground).


As far as I remember Luddites were smart and not against all technology, they were just protecting their jobs. And they were ultimately right.

Why? Except for the longshoremen in the US getting compensation and an early retirement due to the introduction of containers, I know of exactly 0 (ZERO!) mass professional reconversions after a technological revolution.

Look at deindustrialization in the US, UK, Western Europe.

When this happens, the affected people are basically thrown in the trash heap for the rest of their lives.

Frequently their kids and grandkids, too.


Stables became gas stations. Nintendo used to be a toymaker.

Businesses change and adapt. Workers too — but people often don’t like change, so many choose to stay behind. Should we cater to them?

I used to do a lot of work which is now mostly automated. Things like sysadmin work, spinning up instances and configuring them manually, maintaining them. I reconverted and learned Terraform, AWS, etc. when they became popular.

Should I have gotten help from the government to instead stick to old style sysadmin work?


> Should I have gotten help from the government to instead stick to old style sysadmin work?

I don't think anyone beyond a few marginal voices are calling for a ban on job automation. What they seem to prefer is that, if they are to be automated out of a job, they should be compensated for their copyrighted works having been used in the process of doing so.

Regardless, at the very least people who are being automated should get some government support. Not everyone can easily retrain.


Suppose you're a weaver. It's hard, fiddly work, and you have to get your timing and your tension just right to make quality material. Now, there are mechanised looms that can do the job faster (though the quality's not great: they could still do with some improvement, in your opinion). From this efficiency gain, who should reap the profits?

Suppose you're a farmer. You've been working on your tractors for decades, and have even showed the nice folk at John Deere how you do it. Now they've built your improvements into the mass-produced models, and they say you can't work on your tractors any more. Who should reap the profits?

Suppose you're a writer. You've spent a long time reading and writing, producing essays and articles and books and poems and plays, honing your craft. You've got quite a few choice phrases and figures of speech in your back pocket, for when you want to give a particular impression. Now, there is a great big statistical model that can vomit your coinages (mixed in with others') all over the page, about any topic, in mere minutes. Who should reap the profits?

Suppose you're a visual artist. You enjoy spending your time making depictions of fantasy scenes: you have a vivid imagination, and so you can make a living illustrating book covers and the like. You put your portfolio online, because why not? It doesn't hurt you, it makes others happy, and maybe it gets you an extra gig or two, now and then. Except now, there's a great big latent diffusion model. Plug in “Trending on Artstation by Greg Rutkowski”, and it will spit out detailed fantasy scenes, photorealistic people, the works. Nothing particularly novel, but there was so much creativity and diversity in your artwork that few have the eye to notice the machine's subtle unoriginality. Who should reap the profits?


I've answered this before. The container revolution split some of the resulting profits with those whose livelihoods were destroyed, the longshoremen.

"You build a dam that destroys 10000 homes, who should reap the profits?"


It's a good answer, but it raises further questions:

• Should we be destroying people's homes to build dams without their consent?

• In general, are people being compensated when these things happen to them? i.e., while it might be nice, does this actually happen?

The Luddites (the real ones, not the mythological bastardisation of them) continue to be sympathetic characters.


> • In general, are people being compensated when these things happen to them? i.e., while it might be nice, does this actually happen?

The famous: "it depends" :-)

AI most likely falls under: "they should be", IMHO.


I don't think we should cater to Luddites, but (and it's a big but) if we automate enough jobs out of existence it's essentially undeniable that we will need systemic changes to avoid becoming a completely dystopian society.


But as the corollary to that, I know of zero successfully stopped technological revolutions. You can't put the genie back in the bottle, and there is no way to stop progress, aside from a one-world authoritarian government that forcibly stops as much of it as they can. But even that would only be marginally effective. Progress would eventually resume.


Yes, you do know of revolutions that were stopped, and it worked for centuries.

Tokugawa Japan, Qing China, many other places including in Europe for centuries.

That's too extreme.

My point is that we're reaching a point where people need to be compensated. We can't just destroy their lives, collect all the money in 2 bank accounts and call it a day.


Bingo.

That's the real flaw in Luddite thinking -- the idea that you can destroy the machines.


In this case I think it's a little different. People are saying that they don't want to have their own productive or creative output used to undermine their own standard of living. That's not the same as simply not wanting to have your job automated away by someone else's business innovation.


To make ChatGPT analogous to coal mining automation, it would have to be able to automate the thing it is doing without learning from sources online.

To make coal mining automation analogous to ChatGPT, the machinery would have had to use something the coal miner did to learn how to automate their work. I'm imagining a camera looking at all the coal miner's work, after which the machine can immediately do it, but better.

I agree it is a tad different, but just as someone's coal mining is in the public domain for anyone in the tunnel to see, likewise anything you write unprotected online is in the public domain and fair game, I think?


The lawsuit is far more nuanced than you're letting on. There are several aspects that come into play-

* Was it published publicly? This is basically defined in the courts as "if you make an unauthenticated web request, does the data return?" (see the sketch after this list). This is where scraping comes in- if you make the data available without authentication you can't enforce your TOS, because you can't validate that people actually even accepted the TOS to begin with.

* Is the data able to be copyrighted? This is where things are interesting- facts cannot be copyrighted, which is why a lot of scrapers are able to reuse data (things like weather, sports scores, even "for hire" notices can be considered factual).

* If it would typically be considered covered by copyright, does fair use come into play?

* Are there any other laws that come into play? For example, GDPR, CCPA, or other privacy laws can still add restrictions to how data is collected and used (this is complicated by the various jurisdictions as well)

* Was the work done with the data transformative enough to allow it to bypass copyright protections? This goes back to when Google was scanning books. Because they were making a search engine, not a library, their search tool was considered transformative enough to allow them to continue.

It's not enough to say "because it's on the internet, it's fair game for everyone to use". This is a really complicated area where things are evolving rapidly, and there's a lot of intersecting law (and case law) that comes into play.
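
To make the first bullet concrete, the "public" test is roughly this (a minimal sketch; the URL is hypothetical):

    import requests

    # The courts' working definition of "published publicly": does an
    # unauthenticated request return the data?
    response = requests.get("https://example.com/some-page")

    if response.ok:
        print("data returned without authentication -> public in this sense")
    else:
        print(f"gated or blocked (HTTP {response.status_code})")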


I agree that there is additional nuance, but so far public data scraping has very clearly been ruled as legal. It's possible that at the time of scraping, copyrighted data was incorporated into the training data because it hadn't been taken down by the host platform yet. But in my opinion, the core idea proposed by the suit, that private data was used intentionally, is not true. The GPT-4 browsing plugin is equivalent to web scraping.

And another complication is that OpenAI is not exposing any static data. A response is generated only after prompting. I'd argue that LLMs are closer to calculators than databases in function. The amount of new information that can be added is also limited; it is not a continuous learning/training architecture.

I do hope this leads to more clear laws regarding data privacy, but I can't imagine the allegations of "intercepting communications", violating CFAA, or violating unfair competition law will hold.


My point is that you have to treat the method for collecting the data and the usage of the data as separate legal questions. Scraping is legal. What you do with the data that you scrape, though, is a whole other question.

To put it another way, it's legal for me to go to the library and borrow a DVD or a book of poems. That doesn't give me the right to publish the poems again under my own name. Whether I find the poems from scraping, borrowing the book from a library, or even just reading them off of a wall, I don't get ownership rights to that data.

The same logic applies to a lot of other laws around data. If you collect data on individuals there are a bunch of laws that come up around it, and many of them don't really concern themselves with how you got the data so much as how you use it. The fact that it was scraped doesn't grant any special legal rights.


What you describe misrepresents how LLMs/neural networks and the underlying math work; your analogy does not apply. There's no static data in the networks. The output of LLMs is much closer to parodies and fanfiction. In that case, you very clearly own the copyright to the new work you make.


That's weird, since my comment literally said nothing about LLMs. I was simply pointing out that making scraping legal doesn't invalidate any of the other data laws that were out there, and gave one example.

You keep making the claim that because it was scraped people can do whatever they want, as scraping is legal. That is the only thing I'm arguing against, because that is a gross misinterpretation of how the case that made scraping legal was decided. LLMs aren't relevant to that point (which is exactly what I keep saying- the method of collection doesn't magically change the legality of it).

That being said, you're still wrong. The US Copyright Office has said that the outputs of LLMs are the outputs of algorithms and are not creative works. Therefore you can't "own the copyright to the new work you make", because the work itself can't be copyrighted at all. No one can own the output of an LLM.

Also, just because it seems you want to be wrong on every level, it is absolutely possible for a neural network to repeat data from its training set. This is an extremely well-known problem in the field.
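
If you want to check for that kind of regurgitation yourself, a naive sketch is to look for long shared word n-grams between a model's output and a source text (the strings and n-gram length below are toys):

    # Naive check for verbatim regurgitation: long word n-grams shared
    # between a model's output and a source text.
    def ngrams(text, n=6):
        words = text.split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    source = "the quick brown fox jumps over the lazy dog and runs into the woods"
    output = "she wrote that the quick brown fox jumps over the lazy dog and left"

    shared = ngrams(output) & ngrams(source)
    print(shared)  # non-empty -> verbatim overlap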


I see your perspective better now. The LinkedIn case was specifically regarding the CFAA and is relevant to the original suit against OpenAI and web scraping, but I now see you weren't discussing that. The copyright limit you mention relates to completely automated generations; it's not as clear when a human uses the tool. The UK assigns the copyright to the user/custodian of the AI. Neural network models can repeat data, but it requires a certain frequency, and still relies on a probabilistic output. The complication comes from the fact that there is no "copying" when training a model. Fundamentally, I think we disagree on how data use laws apply in this situation. I appreciate you discussing this with me; it did help clear up some misunderstandings I had.

https://www.bloomberglaw.com/external/document/XDDQ1PNK00000...


Even if they were exposing static data, how would that be different than a search engine? Google has been scraping the web for two decades, indexing even explicitly copyrighted content, and then making money by selling ads next to snippets from that content. If you're going to make the case that an LLM is violating copyright, then surely you must also assert that Google is too, because it's the same concept, but Google is actually surfacing exact text from the copyrighted material.


By putting something on a public-facing website, it's generally agreed that (absent a robots.txt to the contrary) you intend it to appear in web search results, and you're granting your visitors a limited, semi-transferable, revocable license to request, download, and view your site.

That doesn't mean you grant a license to produce derivative works other than search indexes. Legally, it's different. (Germany codifies these as separate "moral rights": Urheberpersönlichkeitsrecht.)
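
For what it's worth, opting out of specific crawlers is just a few lines of robots.txt. The user-agent tokens below are only examples; which crawlers exist and honor them is up to each operator:

    # robots.txt -- allow ordinary search indexing, refuse specific crawlers.
    # The crawler tokens are illustrative; each operator documents its own.
    User-agent: CCBot
    Disallow: /

    User-agent: *
    Allow: /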


These things are just not going to go anywhere, big reason being AI is part of the technological race. If AI research gets constrained in the US, progress will happen in China. Since that can't happen, this won't go anywhere.


I tend to agree with you, but I also recognize I could be unrealistically optimistic. This is the legal system we're talking about. I wouldn't expect every court case to be decided fairly, nor would I expect any new laws and regulations to necessarily be sensible. Frankly my biggest worry at this point is that regulatory capture from the first mover AI companies will stop me from purchasing more than one GPU.

I'm not too worried about copyright issues because regardless of whatever happens with upcoming case law and legislation, any regulation against the input data will be totally unenforceable. It's nearly impossible to detect whether or not an LLM was trained on some corpus of data (although maybe there is some "trap street" equivalent that could retroactively catch an LLM trained on data it wasn't allowed to read). And even if the weights of a model are found to be in violation of some copyright, it's still not enforceable to forbid them, because they're just a bag of numbers that can be torrented and used to surreptitiously power all sorts of black boxes. That's why I'm much more worried about legislative restrictions on hardware purchases.
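
The "trap street" could be as simple as planting a unique canary string in your published text and later checking whether a model completes it. A rough sketch, where `complete` stands in for any prompt-to-text call rather than a particular vendor's API:

    import uuid

    # Plant a unique canary in published text; later, see whether a model
    # will complete it.
    canary_prefix = "My blog's canary phrase is"
    canary_secret = str(uuid.uuid4())  # embed "<prefix> <secret>" in a post

    def model_saw_my_text(complete):
        # complete: any callable mapping a prompt string to a completion
        return canary_secret[:8] in complete(canary_prefix)

    # If the model reproduces the secret, it very likely trained on the post.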


> I hope it leads to some clear laws regarding data privacy and how TOS is binding

I hope it leads to more people realizing that a TOS doesn't override their individual rights and that the legal system works to support them.


One individual right is the right to sign away other rights in exchange for products and services.


There are limits to that -- to signing away rights. In the US you can't sign yourself into slavery. You can't sell the right to have someone kill you.

There's sort of an exception for military service, but even soldiers have access to military courts.


Can you point to where that "right" is codified in law?


Common law of contracts dictates that you can commit to performing certain services in exchange for the counter-party performing certain services. For example, you provide money, viewing data, and permission to run DRM and proprietary code on your property (e.g. set-top boxes or smart TVs) to Netflix in exchange for obtaining access to their library of TV shows and movies.

It's codified in the fact that saying you'll do something means you're socially obligated to do it, and legally obligated if you receive something in return.


You still haven't said where it's legal that all rights can be signed away. I know for a fact that you can't waive tenant rights when signing a lease, for example. We also don't allow people to sign over so many rights that they're considered slaves, as slavery has been made illegal. I also can't sign away my right to not be sexually harassed- if a company makes me sign something saying that they can sexually harass me, they will still end up losing in court. The US has also limited the ability of NDAs to cover discussions about labor practices, so there's another right we can't sign away.

It seems to me there are a ton of counterexamples to this "right" you speak of. So many that it doesn't seem like it really exists.


It is open knowledge that ~0% of people read any TOS. While ignorance is no defense for breaking laws or rules, ~0% is compelling evidence in and of itself that the process is completely broken.


> People didn't need consent to having their data be used, there was an implicit assumption the moment the data was published to the public, like on reddit or youtube.

The same argument could be used to defend ubiquitous face recognition in the street, though (“when going out in the street, there's an implicit assumption that your presence in this place was public”), but I'd really like it if we could not have that…

There's a case to be made that corporations gathering data and training artificial intelligence don't need to have the same rights as people: when I go out in the street or publish something on Reddit, I'm implicitly allowing other people to read my comments, but not corporations to monetize them. (GDPR and the like already make this kind of distinction for personal information, by the way, so we could totally extend it to any kind of online activity.)


It becomes harder and harder to pretend that this level of data scraping and disregard for consumer privacy is acceptable when things like GDPR exist.

Just because I posted something on reddit because I thought it was funny, doesn't implicitly give permission to anybody to take that post and profit from it. You're doing a disservice to consumers by acting like it's their fault for being exploited.


The fundamental issue in that situation isn't about profit, it's about the definition of what is considered publicly accessible and what consent that implies.

I disagree with you on whether it should count as being exploited. I don't see fanfiction writers or professional impersonators as inherently exploitative. I understand that some people would disagree because there is a difference in scale. But using technology to mimic and, in some sense, replace human effort is the reason it is useful.

I believe this will shift how and why people value organic media. The standard of what makes content "good" will rise in the long term. When Stable Diffusion first came out, I compared the generated art to elevator music. I feel the same way about the output of LLMs. I might feel differently in a few years if models keep improving at the rate they currently have been, but that's not likely.

I agree that people should have more control over how their data is used, and I'd love to see this suit lead to stricter laws.


I mean, it ingested all of the content from my blog. Without my permission. It's not a major part of their corpus of data, but still -- I wasn't asked and I don't really care to donate work to large corporations like that.

So the technology is cool, but I'm firmly of the stance that they cut corners and trampled people's rights to get a product out the door. I wouldn't be entirely unhappy if this iteration of these products were sued into the ground and forced to start over on this stuff The Right Way.


One thing I've been thinking about: it's only a matter of time before your friends load an AI assistant on their phone, and it devours every text message you have ever sent to that person, every photo you've shared together, every record of an in-person meeting. This makes me really uncomfortable.


That's what has bothered me for years now in the context of contacts on smartphones. Maybe I'm making a mistake when thinking about it, but - if I refuse to share my contacts with, let's say, Instagram, but all of my friends share their contact lists, which include me, does it really matter whether I decline to share or not?

Another part which bothers me is that I have lots of different personalities online. On most sites I use different usernames, and I wonder if there will someday be an AI which can match all the different online profiles to a single person, even if different usernames are being used, etc.
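
The building blocks for that kind of matching are already mundane: classic stylometry is character n-gram TF-IDF plus cosine similarity. A sketch with scikit-learn and toy stand-in texts:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Character n-gram "fingerprints" of writing style.
    known = ["posts written under username A ...", "more posts by A ..."]
    unknown = ["a post written under username B ..."]

    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
    matrix = vec.fit_transform(known + unknown)

    # How similar is the unknown account's writing to each known sample?
    print(cosine_similarity(matrix[-1], matrix[:-1]))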


We're already on the way: https://www.rewind.ai/

Not on the phone yet, but on a Mac which could include iMessages.


I wanna do that locally with an LLM: fine-tune it on my entire sent email history and have it generate auto-responses to most of my emails. :D
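
Something like this, as a rough sketch (assuming a local mbox export of sent mail and a small Hugging Face model; every path and hyperparameter here is a placeholder):

    import mailbox
    from datasets import Dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    # Collect plain-text bodies from the sent-mail export.
    texts = []
    for msg in mailbox.mbox("sent.mbox"):
        body = msg.get_payload(decode=True)  # None for multipart messages
        if body:
            texts.append(body.decode("utf-8", errors="ignore"))

    tok = AutoTokenizer.from_pretrained("gpt2")
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ds = Dataset.from_dict({"text": texts}).map(
        lambda b: tok(b["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"])

    Trainer(
        model=model,
        args=TrainingArguments(output_dir="email-model", num_train_epochs=1,
                               per_device_train_batch_size=2),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    ).train()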


* cuts to AT&T in the background hastily dumping texts into ChatGPT *


Every email you send to a Gmail-backed account is this.


Anyone who reads your blog is "ingesting content" from it. That is presumably the purpose of your blog in the first place. Whether that content is used to train a human mind or an artificial one is probably not up to you as the author.


This type of comment can be seen every single time a thread about LLMs, OpenAI, or some such comes up.

And it adds nothing. I'm sorry but saying "Whether that content is used to train a human mind or an artificial one is probably not up to you" may be worse than saying nothing at all.

First because there's plenty of doubt about whether it's up to the authors of content (IP laws, fair use, intent of the use, and many things I don't know about), while the comment gives no laws as an example or frame of reference.

And second because it's comparing a human mind that we know exist, to an artificial one, which implies:

1. An LLM is an artificial mind, or close to one, whatever that is (again, not defined).

2. If they were to exist, they would be both equivalent and treated the same as a human one.

The number of jumps in a couple of sentences, added to the uncertainty of how copyright would/will work, multiplied by the number of times I/we read that type of comment, is getting tiresome. And it's lowering the signal-to-noise ratio.


I think you’ve missed the point. Copyright laws prevent others from copying your work without permission. (Hence the name.) Copyright laws say nothing about who can read your work.

If you want to prevent a web spider from scraping your blog, use a captcha or robots.txt. Copyright law doesn’t apply to this scenario.


I disagree, and though the GP maybe didn't have this sentiment, my personal view is that intellectual property is a bunch of crap and just because there are laws around it in our capitalist society doesn't mean that the laws are moral/just/ethical/good. IP is constantly ingested and transformed which is exactly what LLMs are doing. The fact that ChatGPT can't even accurately reproduce data from its training (it gets basic facts/dates/quotes wrong all the time) really reinforces that it's not infringing on anyone's IP.

If you're tired of responding to these comments then stop. It's the internet, everyone is at different places in exploring topics and having discussions. Don't poo-poo on someone else's journey and instead move on with your day. There is no required reading (other than TFA) on hacker news.


You don’t get to make information publicly available but not publicly available. If you want your blog to be restricted, put it behind a login.


Yes I do. I own the work I create, even if it's publicly available. I do get to decide what happens with it.


> I do get to decide what happens with it.

No. Both legally and practically, you absolutely do not.

The only thing copyright law gives you is an exclusive right to sell it for a limited period of time, as a whole in its original form or similar -- and to transfer that right.

Regardless of your desires, anyone can reuse it under the conditions of fair use. They can copy parts of it for parody purposes. If they're not selling anything or taking away from your sales*, they can reproduce it verbatim for private purposes. And even if they are selling something, they can summarize it, quote from it, rephrase it, and so forth.

And you don't actually get to decide any of that.

* Edit: added "or..."


So you’re saying I’m right except in some narrowly carved-out situations. And I agree with you.


Nope. You said:

> I wasn't asked and I don't really care to donate work to large corporations like that... I do get to decide what happens with it.

And I said:

> No. Both legally and practically, you absolutely do not.

You think you get to decide whether large corporations can train on your work. I'm saying the law suggests you very much don't get to decide that.


Read the comments you're replying to. I didn't comment on the legality of ChatGPT training on my content, I said I didn't like it. Regardless, the act of posting content publicly does not mean I give up my copyright claim. Yes, there are fair use situations. Training ChatGPT might be one of them, but I'm not seeing a lot of concrete information one way or the other, and I am seeing arguments that ChatGPT could be considered a derivative work, which would place OpenAI in violation of my copyright.

Send some links if you see some definitive case law sorting this stuff out.


You are claiming that piracy is legal.


Anyone can read your blog and then post their own blog post using knowledge they learned while reading yours. ChatGPT "learned" from your blog the same way.


Since the way GPT "learns" is not materially similar to how a human learns, I don't see why this talking point is particularly relevant. Nothing stops the courts from distinguishing between an AI and a human with regard to what may be permissible.


I agree, it seems like all the arguments that the use of data by AI should have no more restrictions than the use of data by humans hinge on the implicit (or sometimes explicit) assumption that human learning and machine learning are identical. While there are parallels, there also seem to be significant differences not only in how the learning is done, but also in outcomes for the person whose data is being used. And since a major purpose of IP, copyright, etc. is at least ostensibly to protect the creators of information from negative outcomes, I don't think the outcomes can be ignored when comparing human learning to ML.


Anthropomorphizing that it "learned" is disingenuous, and I expect better from the HN crowd.

If ChatGPT regurgitates verbatim or nearly verbatim, something it slurped up from OP's blog, is that not plagiarism? Where do you draw the line? Where would a reasonable person draw the line?


A human is both capable of reciting things from memory in an infringing manner, and of learning from experiences to create something new. Maybe we should tape people's mouths shut if they dare to violate copyright by reciting a copyrighted book word for word, or put them in a straitjacket if they recreate a copyrighted painting from memory.


Actually I fear that people that say this are doing worse than anthropomorphizing.

Often, rather than claiming human aspects for the machine, they are going further and claiming machine aspects for the human.

Using mechanistic analogies for explaining the human body or mind isn't new, but as machines become better and better at imitating humans, those analogies become more seductive.

That's my rant; the danger with 'AI' isn't so much that humans are enslaved by machines, but that we enslave each other -- or dehumanize each other -- with machines.


Like with everything in law, "intent" is paramount. Obviously it's not the trainer's, nor the end-user's, goal to reproduce training set data verbatim; quite the contrary, overfitting as such is undesirable.


Intent only goes so far. If I continually but unintentionally reproduce copyrighted works verbatim, I could still face consequences, particularly if I did not show due diligence in preventing it from happening in the first place.


But ChatGPT doesn’t spit out verbatim from the blog.


Computers aren't people. Software isn't humans.


There is a difference between learning from your work and copying your work.

You are entitled to control its distribution and use. You are not entitled to control its influence and effects.


I think you've made up an irrelevant argument. The work has been incorporated into a commercial product, intentionally, under the control of someone else. Software isn't humans that pay taxes, appear in court, have rights, etc.


No, the work has not been. The impression that the work leaves on a neural network has been though.

AIs are not massive repositories of harvested data. The models are relatively small (<20GB).
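
Some back-of-envelope arithmetic on that point (the numbers are illustrative assumptions, not anyone's actual figures):

    # Back-of-envelope: weights vs. training data.
    params = 7e9                      # a 7B-parameter model (assumed)
    model_bytes = params * 2          # fp16 -> ~14 GB of weights
    corpus_bytes = 1e15               # ~1 PB of scraped text (assumed)

    print(f"model: {model_bytes / 1e9:.0f} GB")
    print(f"~{corpus_bytes / model_bytes:.0f} bytes of data per byte of weights")

At that ratio, the weights simply cannot hold a verbatim copy of the corpus.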


A resized, smaller, or encoded version of an image is still subject to copyright. Calling an encoding an 'impression' is deceitful.


Not always.

https://www.pinsentmasons.com/out-law/news/google-thumbnails...

> A US court ruled this week that Google's creation and display of thumbnail images does not infringe copyright. It also said that Google was not responsible for the copyright violations of other sites which it frames and links to.


Part of this ruling is about how the images are used -- Fair use -- not just that they were stored in a particular way. If Google was using the smaller versions of the images (thumbnails) in other ways, it could have been infringing.

> The Court said that Google did claim fair use, and that whether or not use was fair depended on four factors: the purpose and character of the use, including whether such use is of a commercial nature or is for non-profit educational purposes; the nature of the copyrighted work; the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and the effect of the use upon the potential market for or value of the copyrighted work.


My take on copyrights and AI models...

Taking copyrighted material and using it to train a model is not a copyright infringement - it is sufficiently transformative and has a different use than the original images.

Note that AI models can be used for different things. A model trained to identify objects in an image has never had uproar about the output of "squirrel" showing up in the output text.

The model also, as a purely mathematical transformation of the original source material, does not get a copyright. If it needs to be protected, trade secrets are the tools to use to protect it. A model is no more copyright-worthy than taking an image and applying `gray = .299 red + .587 green + .114 blue` to it.
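
For instance, that luminance formula applied to a whole (stand-in) image with NumPy; it's deterministic math, with no creative expression added:

    import numpy as np

    # The luminance formula above, applied to a toy 4x4 RGB image.
    rgb = np.random.randint(0, 256, size=(4, 4, 3))
    gray = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    print(gray.round().astype(np.uint8))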

The output of a model is ineligible for copyright protection (in the US - and most other places).

The output of a model may fall into being a derivative work of the original content used to train the model.

It is up to the human, who has agency in asking the model to generate certain output, to be responsible for verifying that it does not infringe upon other works if it is published.

Note that the responsibility of the human publishing the work is not anything new with an AI model. It is the same responsibility as if they were to copy something from Stack Overflow or commission a random person on Fiverr... it's just that we've overlooked those for a long time - but it is similarly quite possible for the material on those sources to be copyrighted by and/or licensed to some other entity, and the human doing the copying into the final product is responsible for any copyright infringements.

Saying "I copied this from Stack Overflow" or "I found this on the web" as a defense is just as good as "Copilot generated this for me" or "Stable diffusion generated this when I asked for a mouse wearing red pants" and represents a similar dereliction on part of the person publishing this content.


It's none of those things; these models train on petabytes of data. They store relationships of objects to each other, not the objects themselves.


Actually, people have been successfully sued for plagiarizing other works because they had internalized them and accidentally regurgitated them. So the fact that content runs through a human brain doesn’t necessarily cleanse it of copyright concerns.


There is no "actually", because you are still addressing distribution. It wouldn't be hard to have another AI that analyzes outputs for copyright infringement and culls them as necessary.

Would that satisfy you?


To some extent. Others can ingest your work, quote it, talk about it, criticize it, summarize it, etc.


If I read your blog and used its data along with my own knowledge to create a course, would that be plagiarism or copyright violation?


>You don’t get to make information publicly available. But not publicly available.

But we do? Open sourcing something with caveats is common. This code is public BUT not for commercial use. This code is public BUT you must display attribution etc.

Sure, blog posts are unlicensed (as far as I know), but the idea of something publicly available being held to restrictions is nothing new.


Do you allow commercial employees to read the code and incorporate knowledge obtained from the code into their brains?


This is a fantastic point. I can legally go pick up any strictly copyrighted book at a store and read parts of it for free, which I will then have learnt and have in my brain to share with anyone else. If I happen to have a superintelligent brain, I can potentially gain a lot more and make a lot more inferences from this one outing, and consequently add a lot of value for others I share my info with.

But telling me it is illegal to share what I learnt because the original source is copyrighted... doesn't sit right with me.


Copyright just doesn't protect such cases. There's a funny exaggeration that is very illustrative: copyright protects the bugs in the code, i.e. the specific way in which the code was written. Reading it and getting inspired was never meant to break copyright.

What protects particular solutions is patents. For example if someone were to obtain a patent for computing GCD of large integers the usual fast way, well then everyone else would have to use a different solution.

This analogy to someone reading a book, perhaps peppered with lots of legalese to the point of being hardly recognizable, will definitely be used in courts at some point. And I can't see how it wouldn't stand as a valid argument.


If you go read a book, memorize it, write it down later in a substantively similar form, and share it freely or sell it — yes, you might get into copyright trouble. It has happened before and it is at best a tricky gray area.

If you pick up a book and learn a fact, then yeah, you’re allowed to share that fact.

It’s weird that this topic keeps devolving into a form of “so what, it’s illegal for me to learn things?” Because: no, it’s not. And: You and a piece of software are treated differently under the law. You have a different set of rights than ChatGPT.


Everything about ChatGPT seems to be gray areas and mights, which is probably why we are where we are.


> You have a different set of rights than ChatGPT.

Gods, no. Where did you get that from?


Are you a human being? A citizen of some country? If so you definitely have a different set of rights than ChatGPT.

Those differences might not matter in this specific case, but the case can easily be made that they ought to.


I don't think ChatGPT has any rights yet... And a person using it has the exact same rights as someone not using it.


?

I don't understand your point. Do you think it makes any difference whether I use my laptop, or a pen, or ChatGPT to violate copyright?


Show me where ChatGPT's brain is and your comparison will become relevant.


I mean in the floating point / quantized numbers and the connections that make the model? I'm not sure I follow, the analogy to the human brain has always been obvious, it's even in the name (artificial neural network) ...


The analogy is just that: an analogy, and a very imperfect, misleading one. The working of the brain may have motivated early research, but GPT (as instantiated in hardware) does not operate or learn in a way similar to a human brain.


Yes, it's completely unfeasible to make a license to control that.

On the other hand, it's completely feasible to make a license that stops someone from training their model with some piece of info, is it not?


Why is it that people keep on flogging dead horses?


That's not how copyright works.


Another day, another person on HN showing us how they don't understand the difference between Public Domain and Open Source or Copyleft etc.

And regardless -- the problem is that expectations of how content can be consumed are now fundamentally violated by the automation of content ingestion. People put stuff up on the Internet with the expectation of its consumption by human minds, which have inherent limitations on the speed and scale at which they can learn from and reproduce things, and those humans are also legally liable, socially/ethically obligated, etc.

Now we have machines which skirt the limits of legality, and are able to do so on massive scale and without responsibility to society as a whole.

Different game now.


> People put stuff up on the Internet with the expectation of its consumption by human minds

Then people obviously aren’t aware that bots have been indexing web pages and showing summarized information without going to the web page for three decades.


I think it's a bit intellectually dishonest to claim an equivalence between content indexing for search engines and machine learning for LLMs. They might share an underlying harvesting technique, but their uses -- indexing for information accessibility vs automatic content production are qualitatively different.

Further, almost every site has had an e.g. robots.txt which has permitted content harvesting only for certain accepted purposes for a couple decades now. So clearly people already had a sense of how they wanted their content harvested and for what purposes.


How is it not content production when I search for something on Google and get a box with similar questions that summarizes the answer?

So you’re okay with Google making money off of your content. But not OpenAI?


Your blog which you posted online for anyone to download and read?

Don't get me wrong, this is a grey area where copyright laws and general consensus haven't caught up with new technology. But if you voluntarily stick something up online with the intent that anyone can read it, it seems a bit mean to then say "wait no you can't do that" if someone finds a way to materially profit off it.


You sent your content to them in response to their HTTP requests. That sure looks like affirmative consent to me.


You’re right! Just like Disney+ did when I watched Star Wars the other day. I’m excited to know Disney has consented to me posting Star Wars in its entirety free online.


Can you make ChatGPT produce the content of your blog post "in its entirety?" You can share the URL to a ChatGPT conversation, so it should be easy to prove the copyright violation by replying to me with two links: one to your blog post, and one to the ChatGPT conversation containing an unauthorized copy of it.


If you put your content on a billboard, what expectation should you have that you can control who reads it?


That’s the economics of (non-symbolic) AI. To work, it needs humans to create stuff for free.

Putting it more bluntly, it is somewhere between a parasite and a slave driver.


It doesn't require humans to work for free — while that's been a common default MO ever since everyone looked at Google making a search index and thought to themselves "if they're doing it, surely so can we", there are data sets made by paying people.


There are such datasets, and AI companies absolutely pay to have data curated. But I suspect it would be just unimaginably expensive to create a dataset from scratch with enough tokens to feed a model with hundreds of billions of parameters, all the while paying every participant fairly.
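
Rough numbers, to give a sense of scale (every figure below is an assumption):

    # Rough cost of a fully paid-for training corpus.
    params = 175e9
    tokens = 20 * params              # Chinchilla-style ~20 tokens/param
    words = tokens * 0.75             # ~0.75 words per token, rule of thumb
    usd_per_word = 0.10               # a modest freelance writing rate

    print(f"${words * usd_per_word:,.0f}")  # ~$262,500,000,000

Hundreds of billions of dollars, before any curation or quality control.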


"fair" is somewhat undefined, as the fair-looking number for being paid for effort can be very different to the fair-looking number for being paid for the resale value of the end product on an open market.

I wonder what an LLM trained on Google code and internal documents would look like?


Hard to understand how this is a crime, or how they came up with 3 billion dollars of damages.

Seems like if it's legal for a person to do it should be legal for software to do for the most part.


I can personally memorize and recite copyrighted works all I want, but when ChatGPT does it then it’s in a commercial context and they’re liable to be sued for infringement.

If you ask ChatGPT the rules for D&D, the private sourcebooks are all in there.


> I can personally memorize and recite copyrighted works all I want,

Whoever told you that is lying to you. You are not legally allowed to personally memorize and recite copyrighted works all you want, any more than you're allowed to personally memorize, write down copyrighted works, and distribute them as much as you want.

All piracy is a process of computer-assisted remembering and reciting.


Last I checked I can legally enter any bookstore with copyrighted books, pick up a book, and read it. And then tell anyone what I read.

I can't go write and commercialize what I learnt directly, but I'm not breaking the law by quickly seeing how some book I didn't buy ends so I can talk about it at a party - and then everyone knows how it ends, which might affect whether they want to buy said book and might upset the author. But, tough shit, what I did was legal. I can even use the ending as one input among dozens of inspirations for my own book, where the end result is transformative enough that the sources are unrecognizable. And if I had learnt the endings of a dozen books without buying them, I didn't break any laws, even though I am now commercializing something new that was inspired by them all.


Maybe it would be useful to define what "tell anyone what I read" means. Because if you mean one-to-a-few in a room, then most likely yes. If you use any type of broadcasting, then most definitely no. Try reading a script from a recent movie out loud on Twitch/YouTube/radio/TV and see whether it gets DMCA'd or not. Same for books and songs, I guess... or not? Not sure.


I just mean I can socialize and talk about it without the police telling me that's illegal because I didn't pay to be allowed to talk about said plot points.


Well, you could memorize and recite copyrighted works all you want, as long as you're doing it in an empty room without anyone listening.


Would you say reading a book to my kids before bed is illegal?


Sorry, I was being a little flip. There's more to it than that, of course. Is the performance sufficiently transformative, is it educational, is it non-profit, etc.


Not how copyright works.

Being non-commercial is not an automatic fair use exception. Being commercial does not preclude fair use. And rule concepts are not copyrightable, only the specific expression. Rules may have other IP protection, including patents.


> Rules may have other IP protection, including patents.

That's not even true in the US anymore. You'd have to convert those rules into some sort of device, or argue that the game is a business method.


Isn't that because a performance is different from the creation of a permanent copy? If you published an article and included a significant chunk of the copyrighted work, you'd be liable too unless it fell under fair use. Doesn't matter if it was you or ChatGPT. Commercial use would be one consideration, but not the only one, for both of you.

The rules of games cannot be copyrighted either. The artistic elements can be trademarked, but if ChatGPT merely explains the rules to you in different ways, that isn't infringement either.


> and recite copyrighted works all I want

...wait, isn't that false? legitimately asking.

or is it because it was done by a corporation that makes it illegal?

I'm thinking of how restaurants don't sing Happy Birthday, fair use restrictions, etc.


Like most things, it depends.

If I recite them to myself, in my home, it's fine. If I do it at a gathering at my house where we're playing D&D, fine. If I do it as a performance, in front of a crowd, or as a recording, now I'm no longer fine. Context matters in copyright cases. Not to mention, to claim fair use, you do have to claim you violated copyright. Fair use is just an allowed violation.

As to Happy Birthday, that's actually ok for them to do now. The person/group that held the copyright to Happy Birthday was found to have not actually held it in the first place. Happy Birthday is actually an older song called "Good Morning to All". Swap "Good Morning" with "Happy Birthday" and "children" with "dear [PERSON]" and you have the lyrics. This was not deemed a substantive change. And since the copyright on "Good Morning to All" has lapsed, Happy Birthday is in the public domain.


Yes, I was overly broad and there are restrictions on saying/copying memorized material.


I don’t get your point. Whether you use copyrighted material in commercial context or not always matters. That’s one of the most important aspects of different open source licenses.


This is not true for copyright law (the 4-factor test[0]) or for OSI licenses (they almost universally place no restrictions on commercial use). The only exception that comes to mind right now is the Creative Commons NC, which is generally recognized as being unsuitable for software[1].

[0]: https://fairuse.stanford.edu/overview/fair-use/four-factors/ [1]: https://creativecommons.org/faq/#can-i-apply-a-creative-comm...


And CC-NC isn't considered an open source license by the FSF or OSI anyway. And IMO the NC clause is pretty much impossible to define for non-trivial use, as Creative Commons itself has basically conceded. Not sure non-derivatives is a lot better, especially given that remixing was one of the original drivers behind CC, but it's at least less controversial.


Thanks you’re right. I was thinking about the license changes Elastic made to stop cloud providers from redistributing their products as a managed service.


No OSI-approved open source license prohibits the commercial use of software. In fact, the Open Source Definition expressly forbids discriminating on the basis of how the software will be used.


A license does not redefine copyright law.

I can give you a rock that I own, which I hope we all agree is not copyrightable, and ask you to sign a license that you will keep it indoors. If you put it in your yard, you are breaking the license and potentially liable. This has nothing to do with copyright.


Has this been decided by the courts?


> including personal information obtained without consent

Obtained from (checks notes) public internet forums

> For the 16 plaintiffs, the complaint indicates that they used ChatGPT, as well as other internet services like Reddit, and expected that their digital interactions would not be incorporated into an AI model.

You've got to be incredibly naive if you think public Reddit data isn't used to train ML models, not least by Reddit themselves


Or maybe when you started posting on reddit, LLMs hadn't been invented yet. This is true for 99.9% of the people who post on Reddit.


People have been training ML models on data scraped from Reddit since at least 2015 [1], back when there were less than a million users

[1] https://www.kaggle.com/datasets/ehallmar/reddit-comment-scor...


LLMs were invented at least five years ago (BERT) though you could make the case for a few years earlier. My guess is the majority of Reddit users are new since then, not 0.1%?


Your guess is that the majority of Reddit users have joined since 2018? 1) I do not think that is correct, 2) the mere existence of LLMs isn't public awareness about how LLMs are trained, and 3) you know exactly what I'm saying and that 99.9% might be slight hyperbole.


1: Reddit has ~1.6B monthly active users, compared to 0.3B in 2018. [1] So 2x user growth seems more likely to me than not.

2: You're the one who went with "invented" ;)

3: I know you're exaggerating, but I think you think you're exaggerating much less than you actually are.

[1] https://www.bankmycell.com/blog/number-of-reddit-users/


> Your guess is that the majority of Reddit users have joined since 2018?

It's not really important to the debate around unlicensed use of copyrighted works to train AI models, but it wouldn't surprise me at all if the majority of Reddit users have joined since 2018. It's tough to get reliable active user counts, but they seem to have risen substantially over the past five years.

It also wouldn't surprise me if the majority of Reddit users were indeed from prior to 2018, but at the very least > 2018 would be a very substantial minority.


My account(s) are 17 years old on reddit.


Yes? Mine is nearly that old. But we are very clearly the minority!


Like operating motor vehicles, carrying guns in some US states, suing people and companies, submitting content to Wikipedia, writing children's books, and writing and voting on laws?

Surely, there is some pretty large subset of things where "if it's legal for a person to do it should be legal for software" does not hold up?

So how about the default is "not allowed"


Hard to understand how someone can read the word 'sued' and think it has anything to do with criminal law.


Scraping is a bit of a legal gray area though. If you were to go scrape 300 billion words from the Internet, you probably would be committing a crime somewhere. Especially if you then reproduced some of those words verbatim for paying customers as ChatGPT does...

I am sure OpenAI thought all this through, so I can only assume they said "fuck it let's pull an Uber and do this anyway." We are in for lots of interesting legal headlines


> Seems like if it's legal for a person to do it should be legal for software to do for the most part.

If you're going to make a claim this strong, you should expand on it. Should software be able to have custody of children? Should it be able to kill in self-defense? Should it be able to make 14th amendment claims? Exactly what part of the case (other than the damage claim) is hard to understand?


> Seems like if it's legal for a person to do it should be legal for software to do for the most part.

It's legal for me to look out of the window and watch my neighbor go to the supermarket.

It's _not_ legal for me to build an automated surveillance system that tracks everybody on the street 24×7 and stores everything into a large database.


I'd say this is more like if someone automated taking pictures of every flyer and missing pet poster people put up on a lightpole and saved it to a database.

There's more deliberate action when you post something on a public online forum than just existing in a place outside of your house. Especially considering you've always had the option to use reddit anonymously anyway.


>use reddit anon....

Read, yes - post, no.

And - you can no longer create an account that is not tied to an email...


OpenAI didn't have access to every poster's email when they crawled reddit. If you're making posts or have an account name that are easily tied back to your personal identity, that's on you. But you could make an account with any random username you wanted, that keeps you anonymous as far as OpenAI is concerned.


My point was only that 17 years ago - and for more than a decade after - reddit required no email address as a requisite to create an account... so it was truly anon... then they tied all (new) accounts to emails, which makes it a trivial click for surveillance to ID your reddit account...


FTA:

> The lawsuit is seeking class-action certification and damages of $3 billion – though that figure is presumably a placeholder. Any actual damages would be determined if the plaintiffs prevail, based on the findings of the court.


They're fishing.


Likely hoping for whatever settlement they can squeeze out of OpenAI as the first such suit against them...

They picked 3B hoping to get several million...


If it genuinely makes them redundant and unemployable, a few million each seems "fair" in certain ways.

But that is a moral point, not a legal one; IANAL and can't say anything valuable about the legal merits.

Ideally AI makes us all redundant and the money stops mattering anything like as much, similar to how owning land stopped mattering anything like as much when the industrial revolution happened.

Regardless, I think this is a policy question rather than a legal question, even if this fight happens to be in a court.


Fishing expedition. Will probably get thrown out because no particular injury can be articulated. OpenAI scraped HN as well, and I don't consider my HN posts private because anyone can come here and read them, including artificial intelligences.


If we dissect this case, it seems to revolve around two central questions: what constitutes 'public' data and to what extent can AI models leverage such data without infringing upon individual privacy. This lawsuit may well set a significant precedent in defining the boundaries of AI ethics and data privacy.


When this happened to Stable Diffusion, it was easy for me to consider it a necessary evil to progress humanity.

When this happens to closedAI, it just seems like a profit grab.

Not that it changes the legality of it. Just optics.

Wonder if that matters in court.


First they came for the graphics artists, but I did not speak out because I was not a graphics artist.

Then they came for the writers, but I did not speak out because I was not a writer.

Then they came for me, and there was no one left to speak for me... well, except ChatGPT.


As a language model I cannot speak for you. But I can help you express your thoughts and views. I can generate words and sentences in many ways.


I did not speak out because copyright is farcical nonsense that fetishizes the profit motive at the expense of humanity.


That's why copyright violation should be brutally cracked down on when the copyrights of Microsoft are violated, and lawsuits against Microsoft for intentional and widespread copyright violation should be laughed off. Because capitalism is bad.

edit: corporate LLMs have pulled the "one death is a tragedy, ten thousand deaths are a statistic" ploy off fully. If you want people to question whether you're even violating copyright, make sure you violate all of them at the same time. They'll just decide that you're an act of god and not covered under earthly laws.


>edit: corporate LLMs have pulled the "one death is a tragedy, ten thousand deaths are a statistic" ploy off fully. If you want people to question whether you're even violating copyright, make sure you violate all of them at the same time. They'll just decide that you're an act of god and not covered under earthly laws.

I don't think this is relevant. If OpenAI had trained a model on just one copyright holder's content, it would likely not be different legally, even if the model would perform much worse.


Primacy of capital is literally the culprit of the inequality you're complaining about, and the reason you cannot win short of reorganizing society.


It seems people prefer power distributed by capital, rather than military might or factionalism/leaders/politics.

Not that all capital is distributed by merit; plenty of people used military might or factionalism/leaders/politics to obtain disproportionate amounts of capital.

But if you are against the last two happening, I don't see what you expect a reorganization of society to accomplish, since you are going to get a power structure of factionalism/leaders/politics taking priority. (Sorry bud, no an-com utopia ever existed; they all had factionalism/leaders/politics, thus defeating the entire purpose of removing class.)

I think most of us think we can capture/retain power easier with money, than having to climb up inter-party politics.


Capital primacy is maintained by the capitalist state, i.e. the monopoly on violence. This is literally military might.

I don't necessarily disagree with your later points. I do, however, disagree with giving up.


>Capital primacy is maintained by the capitalist state, i.e. the monopoly on violence. This is literally military might.

At least it's equitable (based on value of output); ofc there are legacy issues as well.

Some demagogue could sway the masses and take it all if not for capital. That demagogue could be Trump or Stalin.

Know the consequences of what you are advocating for.


> At least it's equitable (based on value of output); ofc there are legacy issues as well.

It's not. By definition, it's based on control of capital. That's why it's called capitalism. In other words, those aren't "legacy" issues; they are literally the system as designed.


That is too simple; people can earn their own capital as well.

Since utopia is impossible, it's a choice between:

>Capitalism, where people can typically pull off the American dream in their lifetime.

or

>Let politics determine how much material things you get

The latter seems especially scary if you are familiar with history


Your logic is just profoundly short-sighted.

Capitalism follows a very simple algorithm. In a capitalist economy, capital always accumulates, with all exceptions being precisely that: exceptions. Are you defending the exceptions or the rules?

Realize there was a very long and quite recent time when capitalism was impossible. By your logic, we should reinstate the divine right of kings.


It’s okay, I’m sure everything is going to be fine when Microsoft and ChatGPT hot-mic your next doctor's appointment.

https://news.ycombinator.com/item?id=36498294


Assuming Google, Amazon, etc haven't already been doing exactly that.


That says it's using GPT-4, but it's not clear whether it has anything to do with feeding back into ChatGPT.

> Nuance has strict data agreements with its customers, so patient data is fully encrypted and runs in HIPAA-compliant environments

Additionally, Epic seems to already be storing these clinical notes in databases, and Nuance, which Microsoft owns, has technically already been a 'hot mic' in these same doctors' offices for some time. The new offering is an AI draft-note generator.

I'm personally skeptical that this model's output would suddenly fall under different rules than the other voice-to-text AI model output.


Discord rolled out a ChatGPT based bot that can be used in (and thus can read) all private conversations. Not surprised there are issues with it.


A tangential question...but does anyone know what software is used to generate legal documents that look like the PDF linked in the article? I’ve played with LaTeX templates a bit, but I seriously doubt law firms are futzing around with LaTeX for documents as complex as this. They must have some software that produces this formatting.


`pdfinfo` on the file says:

    Creator:         Acrobat PDFMaker 23 for Word
    Producer:        Adobe PDF Library 23.3.247; modified using iText® 7.1.6 ©2000-2019 iText Group NV (Administrative Office of the United States Courts; licensed version)
So it was likely made in Word and exported to PDF. (One can in any case guess from the "look" of the paragraphs that they're not using anything like Knuth–Plass line-breaking, which rules out things like *TeX and InDesign.)


Yep it's Word exported to pdf. Source: Am attorney, do this all the time. You write it up in Word, save as pdf. Then upload it to the court website, which (in federal court, at least) puts the case number in blue text at the top for the officially-filed version.

The 1-28 pleading numbers on the side are annoying. They're specific to courts in California and a few other jurisdictions, and the rules of court require them. But many other courts don't have them, and they only help to cite specific lines within pages; eg "Complaint 5:4-9" means "Complaint at page 5, at lines 4 to 9". It's occasionally useful for court filings like this, but more useful for court/deposition transcripts of testimony to show precisely where a witness said something.

Related: I tried building an RNN to generate legal pleadings back around 2018/19 and gathered a bunch of docs like this from courts across the country as training data. Processing text with those pleading numbers was a pain, so I built a CNN to classify whether a document had pleading numbers or not, which affected downstream processing. Probably the wrong approach in a bunch of ways, but I was just learning.
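
For illustration, a minimal sketch of that kind of page classifier in PyTorch -- purely hypothetical, not the actual model described above, and assuming pages have already been rendered to fixed-size grayscale images:

    # Hypothetical sketch: binary classifier for "does this rendered page
    # have pleading numbers (1-28) down the left margin?"
    import torch
    import torch.nn as nn

    class PleadingNumberClassifier(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1),  # 1-channel grayscale input
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
            self.head = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(32, 1),  # single logit: pleading numbers present or not
            )

        def forward(self, x):
            return self.head(self.features(x))

    # Usage: a batch of four 256x256 page renders
    logits = PleadingNumberClassifier()(torch.randn(4, 1, 256, 256))
    probs = torch.sigmoid(logits)  # probability each page has pleading numbers

You'd train it with nn.BCEWithLogitsLoss on labeled page images, then route documents to different text-extraction paths based on the prediction.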


Oh cool, then that settles it. Thank you!


In my sample size of one, an attorney I talked to said that Microsoft Word was the most important software he and his colleagues used. So my guess is they're just really good with Word.


Thanks! That surprises me but maybe it shouldn’t. I figured it was some purpose-built software for attorneys.


Word has pretty good revision tracking and support for footnotes, which are probably the main things lawyers use more than most average people do. And remember that lawyers communicate a lot with clients, etc. too, so there would be a lot of friction associated with a non-standard tool.

When I worked on an expert witness report for a big law firm we just used Word.


Yep, that pdf can be made using MS Word.


Lots of lawyers use WordPerfect. Version 5.1. There's plenty of evidence online if you don't believe me, I wouldn't believe me if this was the first time I heard it.


Interesting, I’ll do some searching on this. Thanks!


Also tangential, but would it be ironic if portions of the legal documents were written by ChatGPT?


If they did, hopefully they didn't allow it to cite non-existent cases: https://www.reuters.com/legal/new-york-lawyers-sanctioned-us...


What I noticed is that the privacy setting that should prevent OpenAI from using my data for training purposes has already been deleted twice, and I had to set it again. No idea what that means, or whether the data I entered before I noticed the setting was gone is now owned by OpenAI. Anyway, it is obvious that privacy is no priority to them. Also, it's known that YC companies are informally told they should not worry about privacy while scaling up. OpenAI is not a YC company, but its culture is definitely derived from it.


As I understand it, that setting is an opt-out cookie, so it must be set again in every new browser session.

Seems to be a blatant violation of the GDPR. So I assume they’ll be fined for it sooner or later and forced to clean up the training data anyway.


How is that a GDPR violation?

GDPR doesn’t prevent opt outs of this kind of thing.


In the sense that consent requires active opt-in. The passive “opt-in” by failing to set the cookie doesn’t count as consent.

So if they’re claiming they have the right to process data on the legal basis of consent, and they claim the absence of that cookie constitutes that consent, then they have no legal basis, and are thus in violation of the law.


I am not a lawyer, just a sysadmin, but with that said, the linked pdf of the complaint is absolutely fascinating to me. It's worth it (to me) for the list of resources it cites.


Do we think this is related to media platforms seemingly walling themselves off? Requiring accounts to view content, removing API access. It seems if they can silo data off and make it difficult to access at a large scale, then they are the gatekeeper of the data and can control usage and pricing.


No, this always happens with platforms once they feel they have attracted enough users: for instance, it happened with Twitter in 2013; see also what happened to XMPP after Google and Facebook adopted it, or Reddit going closed source in 2017...


We talk to a lot of companies, and many want to start using generative AI but are afraid of litigation. As long as it is not clear which data a given model was trained on, and whether that data was explicitly and permissively licensed by its owners, you cannot be sure what can happen.

We are actually working on a tool to create billion-scale, free-to-use Creative Commons image datasets and prepare them for training models like Stable Diffusion. There is a blogpost about it here: https://blog.ml6.eu/ai-image-generation-without-copyright-in...


Rather than there being lawsuit after lawsuit of this sort, we wrote an op-ed this morning that says there should be a simple, compulsory licensing fee that AI companies pay to the public -- something we called the AI Dividend: https://www.politico.com/news/magazine/2023/06/29/ai-pay-ame...


The order of magnitude of suggested pricing is really interesting: $0.001/word is significantly more expensive than, say, OpenAI's pricing of GPT-3.5-turbo ($0.002/1k tokens, ~750 words, so ~$0.000003/word, assuming I got my zeros correct). So this would increase the cost of running GPT-3 by about 300x.
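
A quick back-of-the-envelope check of that arithmetic in Python, using the ~750 words per 1k tokens rule of thumb (the exact multiple depends on rounding):

    # Compare the proposed $0.001/word fee to GPT-3.5-turbo inference pricing.
    openai_per_1k_tokens = 0.002           # $ per 1k tokens (gpt-3.5-turbo, mid-2023)
    words_per_1k_tokens = 750              # rough rule of thumb: ~0.75 words per token
    openai_per_word = openai_per_1k_tokens / words_per_1k_tokens

    fee_per_word = 0.001                   # proposed licensing fee

    print(f"inference cost: ${openai_per_word:.7f}/word")            # ~$0.0000027/word
    print(f"fee multiple:   {fee_per_word / openai_per_word:.0f}x")  # ~375x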

In terms of implementation, I wonder about a few things:

Do models trained on more data have to pay more? LLaMA was trained on 1.5T tokens, the original GPT-3 was trained on ~300B tokens. And this is only partially related to model quality, LLaMA 13B and LLaMA 65B were trained on the same data, but the 65B model is better. What's the incentive to ever use the 13B model, if the licensing cost is 100x-1000x the model inference cost?

Who defines a word? Each model uses a different tokenizer. I'm personally amused by the idea of a government-mandated tokenizer.
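
To make that concrete, here's a small sketch using OpenAI's tiktoken library (assuming it's installed via `pip install tiktoken`); different encodings produce different token counts for the exact same text:

    # "Per word" billing needs a canonical definition of a word, but
    # every model family tokenizes text differently.
    import tiktoken

    text = "Who defines a word? Each model uses a different tokenizer."
    for name in ["r50k_base", "cl100k_base"]:  # GPT-3 vs GPT-3.5/4 encodings
        enc = tiktoken.get_encoding(name)
        print(name, len(enc.encode(text)), "tokens")

    print("whitespace-separated words:", len(text.split()))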

What about generations that never see human eyes? As an NLP researcher, I've generated millions of tokens for training and automatic evaluation purposes -- are those subject to licensing as well?


Yeah, the idea is that it's much more expensive than current OpenAI pricing but much less expensive than what even a low-end marketing copywriter would charge per word. Its side effect would be to push such tools towards more valuable uses.

The idea is to keep it simple, so it wouldn't be based upon the specifics of training, just whether or not it used public data. Anything else would require companies to divulge trade secrets and that won't fly. And words are defined here as, well, words -- English words. There'd be a separate fee per pixel/voxel, and then a catchall for non-language/non-image models.


1. How would this not make tools like Github Copilot exorbitantly expensive? Why should I have to pay a tax to everyone else in the United States to use something that was disproportionately trained on my own data?

2. Given that the internet is global, is every country supposed to make their own versions of this? Will I have to pay the EU tax to use models that might have been trained on data that Europeans posted online?


To your first question, it would incentivize training of models on one's own data exclusively -- companies could train something like Copilot on their own code, for instance. To your second question, there's no way to have an international policy like this, so yes, each jurisdiction would do it independently -- just as they do with thousands of other similar things.


I don't think a model trained on a single company's data would be nearly as helpful as a model trained on all publicly licensed code on the internet. But suppose it were...

What if I'm not a massive corporation with millions of lines of code to train on and I want to pay for an AI coding assistant? Doesn't this make it effectively illegal for me to purchase such a product for a reasonable price when big companies will presumably be able to use it without paying the tax?

Another situation - let's say you're a company that contributes heavily to open source, but also accepts external contributions. Could Facebook train a model on the React codebase, for example, without having to pay the AI tax?

Another situation - suppose I start an LLM coding assistant and sell it to my friend. Presumably I don't have to pay the tax as a "low revenue" company. Then I get acquired or get some huge seed round and suddenly my customers have to pay the AI tax. Doesn't this just nuke all my customers?

Anyway, as a software engineer, I personally want people to use my code for whatever they want to use it for, without having to pay me for it. I indicate that by using an MIT license. Why throw that precedent out the window?


The policy would exempt all except big companies from the fees. So if you set up your own, you don't pay. And the effect of the revenue threshold creating an advantage for small businesses is commonplace in policy across the board -- SMBs don't have many of the same costs and obligations as larger companies.

And this would not prevent you from explicitly licensing your code or writing to let people train on it. But what it would do is say that if someone didn't explicitly license it, then it is covered under the policy.


Also, regarding international policy: good luck getting Chinese citizens to pay the US AI tax. Effectively you'd be nerfing anyone under US jurisdiction.


Not really -- it's the same as selling any service into the US. Yes, people cheat on, say, sales tax, just like Amazon did in the early years, but eventually, once big enough, companies end up having to adhere to the policy.


Can't wait for the deluge of AI generated content dumped en masse on the internet purely to harvest "AI Dividends".


The dividend isn't paid to those who generate content; the fee is charged for generated content -- so generating content (using, say, ChatGPT) means you're paying into the AI Dividend fund, not receiving money from it.


Does every business in the 21st century need to be some form of low-level scam in order to make headway and grow enough to satisfy VCs or investors?


Yes, that’s where the disruption comes from. In all seriousness, that’s the advantage left in an efficient market.

Look at all the sharing-economy players: it boils down to offloading the risk, labor, and debt but keeping the margin.


It does seem like it. Playing by the rules limits growth. Stealing, cheating, lying, and manipulating are the endless money cheat, particularly in societies where most people abide by the rules. Once they hit it big, they find willing politicians to adjust laws in their favor. Rinse and repeat.


Only 3? They should go for the whole 10, and settle for 1.

Now that the gates are open, we'll probably be entering the "free money" cycle soon.


I wonder if we'll see a license for content that forbids its use for training of language models.


Maybe as a result OpenAI will have to publish how they trained their models and exactly what data was used.


Why not sue for 30 billion instead, if you are going to go full stupid on the price?


Interesting, for once this doesn't have anything to do with the GDPR. It's by 16 (US) individuals, filing the complaint in SF.


California is the only state that has active data privacy laws. That said, I don't think there are any financial transactions involved here; it's just public data scraping. I wonder if the company can even be held liable for the output of these LLMs. There's no direct hosting of any static data.

https://iapp.org/resources/article/us-state-privacy-legislat...

https://leginfo.legislature.ca.gov/faces/codes_displayText.x...


It had to happen eventually. There is so much money to go after. This is a case of lawyers creating their own income stream.


If anything major comes out of this, it's probably EVEN MORE prompts and popups asking for permission to use your data. Even with GDPR, data collection and sales never stopped; it just made things more annoying by transforming every webpage into a granular terms-of-service form to continue doing the same.

It isn't even turned off by default. Many sites just give you an "I accept" button, or even if you want to manage the preferences, the "accept all choices" button is where "confirm my choices" should be.

Bigger companies will just append this to their TOS and push it down the customer's throat. That is, if MS doesn't settle out of court and the case doesn't get thrown out together with any major opposition to the data mining.


Well, it was too good to be true. Reminds me of the early days of music sharing and Napster.


Which was never legal in the first place, but it was great because it liberated music and content for the masses. It was the necessary precursor to what is now Spotify and the like: instant access to millions of songs. The music industry didn't like that (Napster & co) because they wanted purchases, and to get paid every time the music they owned was played.


Napster was people sharing files. That's a crime! But now it's corporations so it should be legal.


Well, I think the main catalyst here is that corporations need to pay creators, etc... Napster cut out the middleman, hehe.


Wow, I really don’t get it. If I were to memorize billions of pages worth of people’s private messages and medical records, then recited them live on the Internet, would that be a crime?


What exactly is the difference between that and downloading billions of pages worth of people's private messages and medical records, and putting them in a torrent? If there is a difference, I should be able to make a disability discrimination case under the ADA and erase that difference, because I don't have the memory to do that without the aid of a prosthetic (i.e. my laptop.)


Yeah, unless you had permission from the authors to do so.


Yes, it would be.

Thankfully AI doesn't work by memorization.



