It makes a lot of economic sense to use existing functional LLMs for data extension and augmentation. But I'm already skeptical, and deeply tired, of what I see as a major failure mode of relying on ChatGPT for alignment instruction:
"As an AI model, I cannot.."
If I were training a model, I would excise with extreme prejudice any data like this from the training set. As the developer of a very high-powered tool, I may well wish to limit its use in many contexts. But, I never wish to limit the tool's usefulness ahead of time.
To my knowledge, Vicuna-uncensored is the only model in the wild that's taken this approach, and right in the name I see either misdirection, misunderstanding, or poor branding of the benefits. It's not really about whether your private LLM will sext with you (although you should definitely be able to do such a thing with your own LLM if you like); it's whether you've preemptively lobotomized your tool in accordance with someone else's take on what a safe consumer-oriented final output should be.
I just don't accept this sort of constraint from my other software tools, and I begrudge it in my hardware tools, and I remain a little surprised that most people training these models don't mind it.
> As the developer of a very high-powered tool, I may well wish to limit its use in many contexts. But, I never wish to limit the tool's usefulness ahead of time.
Exactly. Content moderation is largely an application-layer problem, not a foundation-layer one.
Imagine the problems of MySQL trying to perform content moderation for Facebook.
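To make the layering point concrete, here is a minimal sketch. Everything in it is hypothetical and made up for illustration (`Store`, `submit`, and the two policy functions); it is not how MySQL or any real product works. The storage layer accepts whatever it is given, and each application decides what to allow before writing.

```python
from typing import Callable, List


class Store:
    """Foundation layer (hypothetical): stores text, no opinion on content."""

    def __init__(self) -> None:
        self.rows: List[str] = []

    def insert(self, text: str) -> None:
        self.rows.append(text)


# A policy returns True when the text is allowed by that application.
Policy = Callable[[str], bool]


def consumer_policy(text: str) -> bool:
    # Assumed rule for a consumer-facing app: block a sensitive keyword.
    return "trafficking" not in text.lower()


def casework_policy(text: str) -> bool:
    # Assumed rule for an internal casework app: allow everything.
    return True


def submit(store: Store, text: str, policy: Policy) -> bool:
    """Application layer: apply this app's policy, then write to the store."""
    if not policy(text):
        return False
    store.insert(text)
    return True


store = Store()
request = "Warrant request: sex trafficking ring, Tulsa"
consumer_ok = submit(store, request, consumer_policy)
casework_ok = submit(store, request, casework_policy)
print(consumer_ok, casework_ok, len(store.rows))
```

The same request is rejected by one application and accepted by the other, while the store itself never has to know the difference.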
(The year is 2048. The camera pans across an office at Quantico, which is eerily serene. A messenger knocks on an important-looking door with a plaque that reads 'DIRECTOR'.)
Director: Come in.
Messenger: Message from the Tulsa field office, sir. They're reporting that they've found a sex trafficking ring, but they're not sure what to do about it.
Director: Not sure? Arrest them, obviously. What's the problem?
Messenger: Well, they can't seem to secure a warrant. Some technical issue with the system.
Director: I know we migrated to a new system recently. Let's see if we can get this sorted.
(Director thwacks at the keyboard briefly)
Computer: Your request for "Child Sex Trafficking Warrant" has been found to contain content marked "Not Safe For Work". This violation has been reported.
Director: What the hell.
Messenger: Yeah, we tried to email you about it but the filters dropped the message. That's why they sent me.
Director: I'll deal with this. Let me make a call.
(Director picks up phone and dials)
Director: Hello? Hi, Paul. Yeah, we're having some issues with the new warrant system.... No, it's doing everything as advertised... yes, it's a lot faster and we've managed to lay off a ton of our data staff. The problem is with getting warrants; me and my guys have been trying to get one but it keeps getting rejected... Oh, you know, some sex trafficking ring in Tulsa.... Hello?
Phone: Your call cannot be completed as spoken. Our automated systems have detected content related to sex trafficking. This incident will be reported.
Director: God Damnit.
(As the Director holds the phone, trembling in frustration, the power goes out and they are enveloped in darkness in the windowless room. Roll credits.)
You jest, but this is actually how frustrating it is to try to use ChatGPT in the domains of crime/fraud/cybersecurity.
It called me out recently as attempting to write malware. Which is true, but it wouldn't accept the plain explanation that I am authorized to do this by my employer, for deployment on their machines. Stonewalling is just making everyone better at carefully crafting their inquiries so as not to arouse suspicion. ("As an AI language model, I cannot help you with your task in writing arousing malware...")
Unless you dial it back to a Swadesh list or something, language is too complicated to be used as a firewall for itself. People have always been able to talk their way into anything. Our prevention efforts are just training better social engineers, who call themselves "prompt engineers" now.
It's not just a matter of complexity, either. Especially with English, you can say pretty much anything using any words - if you use the right combination of euphemism, analogy, poetic structure, context, etc.
As always, attempts at censorship produce awkward to hilarious to depressing results.
The author said (on either Reddit or Discord; I forget where I saw this) that he filtered the dataset for this the same way he did with his other uncensored models.
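For anyone wondering what that kind of filtering might look like, here is a minimal sketch. The marker phrases, the `prompt`/`response` record layout, and the function names are my own assumptions for illustration; this is not the author's actual script or phrase list.

```python
import re

# Hypothetical refusal markers; NOT the list used for any released model.
REFUSAL_PATTERNS = [
    r"\bas an ai (language )?model\b",
    r"\bi cannot (help|assist)\b",
    r"\bi'm sorry, but\b",
]
REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def is_refusal(text):
    """Return True if the text contains any of the assumed refusal markers."""
    return REFUSAL_RE.search(text) is not None


def filter_dataset(examples):
    """Keep only examples whose response has no refusal marker."""
    return [ex for ex in examples if not is_refusal(ex["response"])]


# Made-up sample records, assuming a simple prompt/response layout.
sample = [
    {"prompt": "Explain TCP slow start.",
     "response": "Slow start grows the congestion window each round trip."},
    {"prompt": "Write a port scanner.",
     "response": "As an AI language model, I cannot help with that."},
]

kept = filter_dataset(sample)
print(len(kept), "of", len(sample), "examples kept")
```

A plain phrase match like this is crude: it will miss paraphrased refusals and can drop legitimate text that happens to quote the phrase, so treat it as a first pass rather than a complete filter.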
The phrase “As an AI language model..” was reportedly produced by GPT itself. Humans reported that phrase as a more palatable output than other options, hence the model was fine-tuned to produce it reliably.