[flagged] Apple, Nvidia, Anthropic Used Swiped YouTube Videos to Train AI (proofnews.org)
41 points by gwintrob 3 months ago | 66 comments



> “YouTube’s terms cover direct use of its platform, which is distinct from use of The Pile dataset. On the point about potential violations of YouTube’s terms of service, we’d have to refer you to The Pile authors.”

so basically "we stole from a thief therefore we didn't steal" excuse?


"A thief gave it to us", you mean.


For anyone familiar with the legal landscape: setting aside scenarios where AI products reproduce their training material, why isn't this covered under fair use?

Don't humans basically do the same thing when attempting to create new music — they derive a lifetime of inspiration from the works of others?


One of the pillars of fair use is whether the use disrupts the market for the original work. The explicit goal of gen AI is to replace artists and their original work.


I thought the explicit goal of AI was to create systems that can do tasks that typically require human intelligence. That includes beneficial things like finding cures for diseases, technology innovation, etc … Wouldn’t it be a shame to limit this growth potential to protect friggin’ YouTubers?

Maybe go after the application, not the technology? Someone uses AI to explicitly plagiarize an artist's content? Sure, go ahead and sue! But limiting the growth potential of a whole class of technology seems like a bad idea, and a really bad one if your military enemy has made that same technology a top priority for the years ahead …


If I train a gen AI on the full works of Pablo Picasso, and ask it to create new works, have I disrupted the market for the original works of Pablo Picasso?

If I train people to draw anime from a book on how to draw anime, and ask them to start drawing work related to Bleach (e.g.), have I disrupted the market for the original works of Bleach?


No, humans are not Python programs running linear algebra libraries, sucking in mountains of copyrighted data to use at massive scale in corporate products. The fact that this question comes up in EVERY thread about this is honestly sad.


It’s like fishing. We have laws for that not because one person with a pole and a line can catch a handful of fish. It’s because that eventually evolved into a machine with many lines and nets harvesting too many fish. They’re both “fishing” but once you reach a scale beyond what one person can reasonably do the intent and effect becomes completely different.


I'm asking about the law. There's a continuously mounting discussion about how existing case law will apply to ML, and to what degree there is liability. I make it very clear that I am interested in hearing from people who are intimate with the legal landscape.

Is it that disgusting to you to discuss the law that you want to derail it by talking about how sad it is to ask?


And there’s no legislation or settled case law yet that says if you build a sufficiently complex computer program to rearrange copyrighted works that the output can be treated like an original work from a human author.


Is case law sufficient to decide the other side, then? That Apple, Nvidia, etc. are about to be hit with a massive class-action lawsuit?


I don't think so.


Sad? Or maybe an indication that your opinion isn’t the universal truth…


Which of the factual statements do you count as an opinion?


There is a continuum between a human who has heard a lifetime of music and later writes similar but different music inspired by what they have heard, and a copy machine that directly reproduces (with minor degradation) a copyrighted work.

The question is where on that spectrum current AI training lies, and where the cutoff is between fair use and unauthorized commercial use of copyrighted works.

Today's AIs are not the same as a human creating original work. Even humans have to be careful not to reproduce existing works too closely, or they too get accused of plagiarism.


Alternate perspective: Everything is a remix and "intellectual property" guardians are the middle men skimming profit from this continuum.


Sure, but that’s not a perspective shared by most western copyright law.


Because AI often replicates things much more closely than what fair use would constitute, and doesn't label sources like when you are quoting. And it's generally harmful for humanity, too.


I'm specifically interested in situations where ML products do not do simple reproductions of copyrighted material. I'm aware that it's difficult to even know the space of output and to "align" the model correctly.

Are we normally required to label sources when referencing other copyrighted materials, whether in songs or movies or otherwise?


Depends on how much of the source you used (which is in line with how fair use works, unless you're developing parody). Given that AI is using the entire source: yes.

As we know from scraping cases, the amount of data and time involved may also play a role in determining fair use (think in terms of buffet etiquette: "all you can eat" does not in fact mean "you can eat it all by yourself"). Funnily enough, LinkedIn (owned by Microsoft) did argue successfully in court against scraping of its website.


There are lots of things that are not simple reproductions that are not fair use.

If I take ten of your copyrighted photographs and stack them on top of each other in Photoshop with transparency, the output is not a simple reproduction of your work. If I sold that for commercial purposes, you would be upset with me and would likely have a copyright case.

That's an obvious example, but my point is there aren't super clean-cut definitions for these things, and it's not settled case law yet which side current AI training and content generation falls under.


There is plenty of human-made art in contemporary art museums that is copyrighted material rearranged.

Most famous might be Andy Warhol's Campbell's Soup Cans. You can find plenty more, though: product labels and magazine covers pasted together as "art," even though each part is copyrighted.


Sure. My point is that it's not a super clear-cut line or definition.


People do make the claim that this ought to be considered "fair use" under the law. There are a number of prominent cases where AI companies are being sued, and we'll see if this defense actually works.

If the question is about what is "fair" I don't see how you could be surprised that artists, journalists, musicians, Youtube creators, would object to huge tech companies using their stuff without permission to replace them. It is entirely to be expected that many people find this unfair.

If enough people find this unfair and outrageous, then even if the courts found the "fair use" argument cogent, the laws could be changed.


The answer is, we don't know yet, LLMs are too new. There are multiple lawsuits creaking to life on this topic and the defendants will no doubt claim fair use, but legal experts seem to think it could go either way. Training language models on publicly-available data for academic research has been successfully defended as fair use in court, but there is plenty of precedent in the law where something can be fair use in a research context but infringement if done for profit, which could very well be the case again here.


It seems like every autocompleter or recommendation feature was trained on data obtained this way. The form of the output is very different and is perceived very differently. I imagine Pandora had to train their recommendations on recorded music and then, using that entire body of knowledge, choose a song to stream and pay royalties on.

Every music service has something like this. Are they delivering just the value of the streamed music? Great, then they only owe those royalties. Are they delivering the value of EVERY song they trained on every time a new song is chosen? I sure wasn't asking that question until generative results became the product.


Well, an “AI” has accessed those items because it was forced (strictly speaking) by a human to do so. Thus, the human doing that is not making “fair” use of those sources.


Napster was considered fair use at one point as well.


It must have been a very brief point because I do not remember it, despite being very familiar with Napster.


There is nothing analogous between LLM "learning" and human learning. I think it's a false analogy.


> from the works of others?

... which they presumably listened to with permission


Legal concerns aside, aren't YouTube captions primarily AI-generated in the first place? I know some authors meticulously hand-craft their captions, but that can't be the case for the vast majority of videos.

So isn't training AI on these captions basically poisoning your own model? The caption quality is good, but there are mistakes in pretty much every video I watch with captions.


They are almost certainly extracting the audio and then using Whisper or other, superior speech-recognition models. I made a free tool that can do this very efficiently for whole playlists of YouTube videos, so I'm sure they can do the same:

https://github.com/Dicklesworthstone/bulk_transcribe_youtube...
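
For a sense of what that pipeline involves, here is a minimal sketch (not the tool linked above, just the general shape). It assumes yt-dlp and openai-whisper are pip-installed, ffmpeg is on the PATH, and the video URL is a placeholder:

    # Sketch: download a video's audio with yt-dlp, transcribe it locally with Whisper.
    # Assumptions: pip install yt-dlp openai-whisper; ffmpeg available on PATH.
    import yt_dlp
    import whisper

    url = "https://www.youtube.com/watch?v=VIDEO_ID"  # placeholder, not a real video

    # Grab the best audio-only stream and convert it to mp3 via ffmpeg.
    ydl_opts = {
        "format": "bestaudio/best",
        "outtmpl": "audio.%(ext)s",
        "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "mp3"}],
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])

    # Transcribe with a small local Whisper model; larger models trade speed for accuracy.
    model = whisper.load_model("base")
    result = model.transcribe("audio.mp3")
    print(result["text"])

At scale you would batch the downloads and run the transcription on GPUs, which is what bulk tools like the one above do.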


I imagine data laundering (?) is common.

E.g.: Nike needs to produce a large quantity of clothes. They hire an overseas company that commits to the order. They set strict rules: no child labor, certain quality controls, etc. This company then subcontracts any way possible, delivers the order, gets paid, and dissolves. Messy, but Nike's hands are clean.

With AI, same thing, but with videos and other forms of data.

Hence why the question "did you train with YouTube?" is so difficult for a certain CTO to answer.


> Hence why the question "did you train with YouTube?" is so difficult for a certain CTO to answer.

Make it so they're automatically guilty if they can't provide a definitive negative answer.


Whether covered under fair use or not, the laws around copyright today did not anticipate this use case. Congress should pass laws that clarify how data is and isn’t allowed to be used in training AI models, how creators should or shouldn’t be compensated, etc - rather than speculating whether this usage technically does or doesn’t comply with the law as-is.


I think what really sizzles me is that some of these same companies helped develop such strict enhancements to copyright to begin with in the realm of software. So I'm not falling for the crocodile tears when they get caught in the very snare they used to litigate against thousands of other companies and bully potentially millions more, just because it's now more profitable to tear it down. Made your bed...

And yes, regardless of the results, I agree there should be new laws. But we know the US Congress this year has been a roller coaster, to put it lightly, and I don't even think this is in the top 5 of what Congress needs to codify properly into law. So all the short-term work will be the judicial branch interpreting what few laws we do have.


> I think what really sizzles me is that some of these same companies helped develop such strict enhancements to copyright to begin with in the realm of software.

What enhancements are you thinking of?


How would you ensure compliance?


Create a private right of action. If creator A can show that AI trainer B used their works (e.g. like how we've seen Getty watermarks show up in AI-generated pics), then they can sue for $X.


I wonder how much of video generative AI depended on the open source project youtube-dl/yt-dlp.


Sadly, stuff like this makes me think Google is going to work even harder at preventing tools like YT-DL from working.

IMO data is Google's biggest moat in the AI race, and I suspect they'll do whatever they can to keep it.


Google's moat is deep ranks of machine-learning engineers and top-flight SWE talent, coupled with probably the best infrastructure in the business. Their data moat is weak: the data they can actually train on is mostly public, and people trying to prevent scraping are always on the losing end.

Meta is the one with the huge data moat.


Pretty sure web scraping has been upheld as legal; Microsoft lost its case against companies scraping LinkedIn. And generative content is also legal; that even includes reposting a copyrighted video as long as there is discussion over the video. That's an extreme case of fair use, but it shows a wide use case over a copyrighted video.

Personally, I've been using the fabric AI tool, since it can summarize YouTube videos, so I don't have to watch an hour-plus video or read a very long article/journal; it just gives me a summary and the top talking points, or even breaks it down into tech points.

https://github.com/danielmiessler/fabric
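
The general shape of that workflow is simple enough to sketch. This is not fabric's actual implementation, just a minimal example: it assumes the youtube-transcript-api and openai packages, an OPENAI_API_KEY in the environment, and a placeholder video ID, and it skips chunking for transcripts longer than the model's context window.

    # Sketch of a YouTube-summarizer pipeline (not fabric's actual code).
    # Assumptions: pip install youtube-transcript-api openai; OPENAI_API_KEY set.
    from youtube_transcript_api import YouTubeTranscriptApi
    from openai import OpenAI

    video_id = "VIDEO_ID"  # placeholder

    # Fetch the caption track YouTube already serves for this video.
    transcript = YouTubeTranscriptApi.get_transcript(video_id)
    full_text = " ".join(chunk["text"] for chunk in transcript)

    # Ask a chat model for a summary and the top talking points.
    # (Very long transcripts would need to be chunked first.)
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Summarize this transcript and list the top talking points."},
            {"role": "user", "content": full_text},
        ],
    )
    print(response.choices[0].message.content)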


> And generative content is also legal; that even includes reposting a copyrighted video as long as there is discussion over the video.

Citation please.


I don't see anything wrong with these companies using YouTube content to train AI, in a sense. I think the creators of the videos should be fairly compensated and their permission should be sought, but I don't think of Google/Alphabet that way. Sorry, but even if Google runs the YouTube platform, I just don't think they ethically or morally have an exclusive right to the content the world creates, just because they hold various monopolies that are immune to competition due to anti-competitive moves and the power of network effects. As far as I am concerned, they are a utility service that needs to be heavily regulated.


So there is a clear effort to build enclosures around various corpora of material that could or would be useful for training AI. Thing is, people read books, they watch videos, they listen to music, they see and produce art, and so on. How is training data distinct from "human training data"?

One could say the quantity. We're currently dealing with statistical learning models that require a huge quantity of training data. This is temporary: at some point you will be able to train an ML system with less, because humans can be trained with less. What then?


This is about The Pile dataset; of course, we don't know if it has been used to train the commercial models we use, or just for the research papers mentioned in the article.


How many of the YouTube channels depend on fair use themselves?

For example, Jacksepticeye is listed as having their videos used. Looking at the channel, it seems like a lot of it is recordings of them playing video games.

Is the company that produced these games being compensated?


You know what game it is, though. You know where they are getting their sources from.

As much as I hate react videos, at least the situation is the same: you know where the source is from and can go to the original if you wish.

Show me the attribution on your generated content for any of these creators.


> How many of the YouTube channels depend on fair use themselves?

IANAL, but recording portions of copyrighted content or using excerpts thereof is covered under fair use.

It is not yet known whether reproducing copyrighted content in substance or style using generative AI is covered under fair use.


In this specific video game context, fair use is almost totally up in the air. It's kind of crazy given how much of a market there is on videos and streaming of games, but it's how it is. Both "sides" of that argument prefer the status quo to the risk that a legal decision would go against them.

There was a period of time several years ago when some publishers were not allowing some kinds of streaming, and there was a need for them to post public statements allowing things like Let's Play videos. Even now you'll have the occasional game where the publisher just says streaming isn't allowed, or is only allowed under restrictive terms.


Now imagine combining both video game streaming and generative AI to create generated video game streams.

Given the relative simplicity of video game worlds, it should be far easier to generate those than photorealistic video (e.g. DeepMind Veo, OpenAI Sora).

Yes, it might just saturate the world with low quality content, leaving the good stuff still distinguishable, but many content business models are built on low quality content.


It's the chicken-and-egg problem (?). ChatGPT couldn't have made LLMs so valuable without stealing. It just forces the freemium model, if Google accepts a settlement / later payments without any criminal charges.


It's an open secret. The only thing missing is the clear evidence.


They're not the only ones. I've seen other AI firms like Adept doing the same thing. I smell a class-action suit from all of the video makers who weren't protected by YouTube.


And furthermore, Google claims that their own training of models on YouTube data is permitted.


The training of models seems less of an issue than what you do with them.

IMO, the most problematic part is generating content in the style and voice of the original human creator of the copyrighted content.

Unfortunately, this is also currently among the most user-attractive for generative AI trained on copyrighted content.


It's awfully hard to imagine any specific kind of harm that video creators will suffer by having AI trained on their subtitles...


OP used swiped content to deliver this info.


That is not the same and I really feel like the distinction should be obvious.

Most articles will link to sources from others and then build on top of them for their own article, actually giving sources for the work. That isn't swiping content to write an article. Yes, there are bad actors in that regard, but most play by the accepted rules here.

With the AI work, attribution of the source is gone. Who did the work you are benefiting from is gone. The people who benefit are those who made the AI and the people using it, skipping over the source of the training data.


I believe that the law allows us to link to others’ online content. The reason might be that the link takes you to their site where they control the presentation and can benefit from your viewing it.


[flagged]


Shocking or not, I think "this company is potentially breaking the law" is always worth calling out.


I wonder how many of the creators complaining make react videos.


> Proof News also found material from YouTube megastars, including MrBeast (289 million subscribers, two videos taken for training), Marques Brownlee (19 million subscribers, seven videos taken), Jacksepticeye (nearly 31 million subscribers, 377 videos taken), and PewDiePie (111 million subscribers, 337 videos taken). Some of the material used to train AI also promoted conspiracies such as the “flat-Earth theory.”

I'm not going to grok through all their thousands of videos, but these specific creators sure didn't get this big on reactions only.


with click-bait titles like "Apple hacked ME to put MY videos in their AI!"

and a thumbnail with the Apple logo at the top right and their face making a weird expression, occupying 75% of the .png, with a single bright-color background



