Ask HN: Is there any license that is designed to exclude LLMs?
35 points by urlwolf 6 days ago | 18 comments
I don't want my content to be harvested by LLMs; they strip attribution, among other things. Otherwise, I'd like to stick as close as possible to the open source licenses (say, MIT). Is there such a license out there? If not, is anyone working on such a thing?

So far, what we have learned is that robots.txt doesn't work; major sites are using login-only access with 2FA to have any hope of keeping their content away from LLMs. I imagine a license would be one thing, but actually implementing/enforcing it might be a whole other can of worms!






The LLMs' training data is already mostly All Rights Reserved content, which is more restrictive than whatever license you could come up with, and if that doesn't stop anyone, then you sure as hell won't stand a chance either.

Your best bet to fight back is either to try to poison your data, or to train your own models on their data.


If machine learning is found to be fair use, the license you choose does not matter - in the same way Google Books can scan books and make them searchable without a specific license to do so.

If machine learning is not found to be fair use, and your concern is the removal of attribution, then MIT license should be fine.

> So far what we have learned is that robots.txt doesn't work;

The companies training models that I'm aware of[0][1][2] all respect robots.txt for their crawling. I can't guarantee that all of them do, but the fact that smaller players are likely to use CommonCrawl (which also follows robots.txt[3]) means it should catch the vast majority of cases, and I'd recommend it if you don't want your work trained on (a sample robots.txt is sketched after the links below).

> major sites are using login-only access with 2FA to have any hope to keep their content away from LLMs

I suspect it's more that users with accounts are more valuable than lurkers, and framing forced sign-up as protecting user data from LLMs is a convenient excuse.

[0]: https://platform.openai.com/docs/bots

[1]: https://support.anthropic.com/en/articles/8896518-does-anthr...

[2]: https://blog.google/technology/ai/an-update-on-web-publisher...

[3]: https://commoncrawl.org/faq
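
For reference, a minimal robots.txt along these lines should cover the crawlers linked above. The product tokens (GPTBot, ClaudeBot, Google-Extended, CCBot) are taken from those docs and may change over time, so treat this as a sketch and check the linked pages for the current names:

    # Disallow known AI-training crawlers site-wide
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /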


You don't have a choice. Any content you put online will be harvested by LLMs regardless of your intent, or any license you post to the contrary. That's already the norm and it isn't going to change any time soon.

hehehheh's comment is your best option: poison your content when possible. It's still going to be consumed, but at least you can make the LLMs choke on it. The second-best option is to never post content to the free internet, but even that's just a temporary measure; all accessible data (including private data) will be assimilated eventually. Expecting a license to work in a post-LLM world is just naive.


Yes, for the most part. While it's academically possible to attempt to control this through legal means, in practice it's unlikely to have much impact, because LLM creators operate much like the web crawlers behind search engines. Obsessing over access control, or bikeshedding about it, is probably an ineffective and wasteful use of webops/webadmin time and energy: deploying well-intentioned "defenses" will likely end up producing false positives that block ordinary users and cost time and effort to support, headaches that don't contribute any value. Perhaps it's possible to notice the more honest LLM creators by their user agent headers, but it's also entirely possible that a nontrivial fraction of them spoof headers, run as batch jobs from AWS, and cache content offline, so they don't necessarily check for updates as often as search engines would when building a training corpus.
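
For illustration, here's a minimal sketch of what that user-agent filtering might look like, assuming a Flask app; the token list is illustrative and incomplete, and, as noted above, anything that spoofs its User-Agent sails right past it:

    # Sketch: reject requests whose User-Agent matches known AI-crawler tokens.
    # Only catches crawlers that identify themselves honestly.
    from flask import Flask, abort, request

    app = Flask(__name__)

    # Illustrative, not exhaustive; token names change over time.
    AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "CCBot")

    @app.before_request
    def reject_ai_crawlers():
        ua = request.headers.get("User-Agent", "")
        if any(token in ua for token in AI_CRAWLER_TOKENS):
            abort(403)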

As bad as it sounds, this is definitely the best advice, unless you actually have the funds and determination to bring legal action over the license breach, I assume?

I doubt a private citizen would have the resources to stand against these companies at the moment. The situation could get better in the future if some big company puts up the resources to fight in court and wins; that precedent could be of great help in similar cases.

A class action may be one way.

Yes, but whether that's an option depends on which country you're residing in.

I mean, if you have those kinds of funds, you probably also already have lawyers on retainer and lawsuits are already SOP. I don't know how effective that would be under the current legal climate.

The best license, then, would be an LLM poisoning attack.

I'm aware of Glaze and Nightshade for poisoning image training sets, but is there anything for a code repository or blog post?

Any licence that requires attribution should be enough in principle, e.g. CC BY 4.0 or Apache 2.0.

Thanks. I suspect the LLM companies would ignore this, and of course getting into a legal battle is beyond the means of most content producers. So perhaps the license is not the solution, and we need to create a complete world outside HTTP...

By their current interpretation of copyright law (which hasn't yet been successfully challenged), the license is almost completely irrelevant, because they believe it is fair use. You need to first establish that what they are doing is copyright infringement before you can apply any license terms.

I suspect that the BadGuys(TM) will indeed ignore any sort of licence, as they have done with content that I created long ago.

However, for a laugh, I just made all the textual content on my key site explicitly CC BY 4.0. Most of my code is already Apache 2.0 and the data CC0.


In other words: Licenses are only as useful as your ability to enforce them in court.

And yes, the companies are fully aware of this, and that's why they do it: they know you won't dare sue them.


If you care about it being an OSI-approved license (or about purists arguing that it's not really "open source"), then note that any restriction on who or what can use the software violates the FSF's "freedom zero": https://www.gnu.org/philosophy/free-sw.en.html#four-freedoms

> but actually implementing/enforcing them might be a whole other can of worms!

Are you assuming that out-lawyering Google, OpenAI, etc. is only a can of worms?

A license is only as good as your legal wherewithal to enforce it. Good luck.



