I don't want my content to be harvested by LLMs; among other things, they strip attribution. Otherwise, I'd like to stick as close as possible to a standard open source license (say, MIT). Is there such a license out there? If not, is anyone working on such a thing?
So far what we have learned is that robots.txt doesn't work; major sites are using login-only access with 2FA to have any hope of keeping their content away from LLMs. I imagine the license would be one thing, but actually implementing/enforcing it might be a whole other can of worms!
Your best bet to fight back is either to poison your data or to train your own models on their data.