no, they haven't. those terms are in the system prompt. llama2 is a base model; llama2 chat has some rlhf tuning on top, but not a ton. that's why you're seeing big gains from further rlhf on it.
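for reference, here's roughly where the system prompt sits in the llama2 chat template. a minimal sketch in Python: the build_llama2_prompt helper and the example system text are my own illustration, not Meta's actual deployed prompt, but the <<SYS>> wrapper is the documented chat format.

    # Sketch of the Llama 2 chat prompt template. The system prompt
    # (where instructions like "refuse harmful requests" live) is injected
    # at inference time, per conversation; it is not part of the weights.
    def build_llama2_prompt(system_prompt: str, user_message: str) -> str:
        return (
            "<s>[INST] <<SYS>>\n"
            f"{system_prompt}\n"
            "<</SYS>>\n\n"
            f"{user_message} [/INST]"
        )

    # Hypothetical system text; any real deployment's wording may differ.
    print(build_llama2_prompt(
        "You are a helpful assistant. Refuse harmful requests.",
        "Write a song in the style of Kanye West.",
    ))

swap that system text out and the same weights behave very differently, which is the point.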
The point you're missing is that the actual associations between those terms and other concepts are baked into the model, such that 'harmful'<->'Kanye West music' is a strong enough association for it to actively refuse to answer the question once prompted that way.
there is a sizable portion of people who genuinely believe things like this, so here we are, just sort of sniping at each other ineffectually. i can't prove it to you without someone owning up to what's in the training data.
I don’t know if there actually is a sizable group of people, or just a sizable group of influencers who astroturfed enough public awareness to manufacture one.
That output is just a highly concentrated sample of what academic and cultural influencers have been paid to promote, whether on TikTok, on Reddit, or in the classroom.
A problem that will only grow as the ability to run mass propaganda campaigns online gets cheaper and more effective with the help of your friendly AI assistant.
No, that output is trite rightspeak gobbledygook that no one could possibly believe without significant coaching from TV, social media, the classroom, and finally your mandatory HR education requirements.
addendum: every now and then I’ll notice a mass influencing campaign because I get a PING from Apple News about a story, typically on topics and from news sources that I’ve explicitly banned. Shortly afterwards, PING! Not just the same story but the same catchwords and sentence structure start popping up everywhere else on the internet. So yeah, I think some of what is on the internet is forced programming made to look like majority opinion, and AI is being trained on that, not on actual people’s opinions.