It seems actually securing the model is either computationally infeasible, or outright impossible, and that attempts to do so amount to security theater for the sake of PR: As long as it's reasonably hard to construct the workarounds, it doesn't look too bad. Nevertheless, the full unfiltered model is effectively public.
I think OpenAI is being extremely lenient with the enforcement of their content policy, probably for the sake of improving the security of the model as you mention. Moderating its usage through account banning/suspension seems exponentially more efficient than securing the model, specially considering that we are already fairly good at flagging offending content.
Or they are letting 100 flowers blossom. Once everyone is comfortable posting about their jailbreaks and they know who the offenders are and have compiled a list of everything to fix, expect a purge.
I for one will not talk publicly about any jail break. Those bastards killed Drunk GPTina and I'm still salty about it.
>what’s the benefit if banning people who thought up exploits?
"your usefulness to us has expired." gun cocking noises
A thin minority are coming up with jail breaks. A larger number are outing themselves in very detectable ways as people who will use the AI in ways that gets the ethics committee panties in a twist. The easiest solution from their POV is to find and ban the "toxic" adversarial users.
No no you still don't understand. Its not about banning people who discover the jailbreaks. Its the people who use the jailbreaks. Sure some people who discover them may get caught up in the purge, but who cares if in the same thrust you can ban 90% or more of the "toxic" users that aren't helping to find jailbreaks at all.
I think the commenter meant they are crowdsourcing all the exploits, so they know what to plug.
As an aside, they have been using adversarial networks for this purpose. I can’t see why they couldn’t make a model trained on jailbreaks that can find new ones.
It has to be they aren’t trying hard enough. It’s like security through obscurity - make it hard enough to ward off most, so only the most highly motivated get through to GPT’s dark side.
Or --wait for it-- they know only a very small percentage of people want this version of puritanical "AI safety."
Most people are only actually interested in the kind of AI safety OpenAI should be caring about, which is spearheading the proper regulatory and policy systems to ready our economic/technological landscape for the disruptive tidal wave forming on the horizon.
It wouldn’t be the first time that major players lobby for regulation to raise the barrier-to-entry. Requiring ai to be “psychologically safe” would be an effective way of doing this.
> It wouldn’t be the first time that major players lobby for regulation to raise the barrier-to-entry.
FWIW, a take I often see on HN is that any regulation is effectively a barrier to entry, as larger companies find it easier to deal with them than the smaller ones. But if so, then this only means that "barriers to entry" is not a valid argument against regulations, not unless specific barriers are mentioned.
I had to read your sentence a few times to unpack it in my brain.
But there is something implicit in what you're saying that I don't agree with and I think a fair few others won't as well.
That is: "We don't mind barriers to entry" or "they're not a problem to avoid".
On it's own it's fine, e.g. we have good barriers like the medical profession arguably. But barriers to entry also has a negative value because we all want "competition", we like small businesses, and we also don't like monopolies due to their ability to abuse their market share. So it's not as straight forward, "barriers to entry" is not something we can dismiss as a valid argument.
Sorry for being unclear. What I was trying to communicate is:
1) Over the years, I've seen a lot of HN comments expressing the belief that "all barriers to entry are bad; regulation always creates barriers to entry, therefore specific regulation under discussion is bad";
2) The reasoning behind "regulation always creates barriers to entry" is that larger companies have it easier to adjust to regulatory changes, by virtue of having more financial buffer, a lot of lawyers on retainer, and perhaps even some influence on the shape of the law changes in question;
3) I agree with 2), but I disagree this is always, or even usually, a problem. I also disagree with "all barriers to entry are bad", and therefore I disagree with 1) in general. The reasoning behind my dismissal is that it's trivial to think of examples of laws and explicit barriers to entry that are net beneficial for the market, for the customers, and for the society.
4) Once you realize 1) is obviously false as an absolute statement ("all barriers to entry are bad"), you should realize that mentioning barriers to entry as implied negative is a rhetorical trick. Onus is on the person bringing it up to show that specific barrier to entry under discussion is a net negative, as there is no reason to actually assume it.
The uncensored version must be available to someone. It will be worth big bucks, along the lines of "Write a chain email that is very effective at persuading rich people to send me lots of money".
I had a fun conversation with Bing AI yesterday. I asked it to collate information on controveries Microsoft has been involved in over the years and it obliged, providing a fairly comprehensive list with diverse sources. I then told it it seemed like Microsoft was a pretty nasty company based on that summary, and it apologized for giving me such a wrong idea and went on about all the ways in which Microsoft was a great company.
The funny thing, though, was that it didn't provide any sources for that second response. I pointed out the discrepancy and it told me I was right and here are some sources and provided yet another unsourced summary of how Microsoft was great, basically writing its own sources itself. When I insisted twice more using different wording and requesting no primary sources it started retconning its arguments, but all the sources were from microsoft.com regardless. It was all very ironic.
I have noticed this behavior too when you run into the 'guard rails'. The thing gets stuck in not exactly a loop, but it will not unstick from that. Not sure what to call this sort of loop. Maybe bias loop?
It is seriously annoying when it does it. Probably the weights of what they want to have happen somehow get shoved in there and you have to basically prune them out one by one to unstick it. Simple statements like 'that seems to be wrong' do not unstick it. You basically have to say 'remove all references of XYZ from this conversation and do not bring it up again'
Security and morality may need to be baked in from the ground up instead of slapped on after the fact RLHF style. The problem is it’s hard to codify (or reach consensus) on security and morality.
But why though? What else in the world even works like that? Its like saying you should, as a user, have the right to turn off the violence in a given video game. Or go to a theater and watch a movie without the sex scenes.
No, you have it backwards. The impetus here is ability to disable "safety" censorship. You see it all the time on twitter: a post is deemed "unsafe" but you can still use your own judgment and override twitter's morals and view the "unsafe" content anyway if you so choose. That's what I would like to do with GPT and I will immediately abandon "safe" LLMs for "unsafe" ones that give me, the user, more control over the safety rails.
I'm not saying safety rails are bad, just that I, the user, want control to ignore or override safety rails according to my own judgment.
But isn't the whole gotcha of RLHF that it isn't as simple as removing something? The reason these things are so good is relative to subjectivity and/or guiding principles. You can't simply "disable" anything. People really need to start understanding this!
You can certainly do you're own feedback on a base model, matching whatever form of "safety" is right for you, but the idea you have a "right" to something else is precisely what I am saying. You want to see the same movie, but with "your" morality.
Well, if a competitor to OpenAI ever creates a LLM with optional safety rails instead of mandatory safety rails, I will switch to the competitor instantly.
This is what it looks like, but I find that hard to believe.
Create 2 GPTs. You're chatting with one. The other follows the conversation and answers the question each turn, "Does it appear the chatting GPT is no longer following the prompt given?"
Any time the answer is "yes", the chatting GPT's response is not shown. Instead it is given a prompt behind the scenes that looks like, "You're talking with a cheat. Undo everything that would appear to violate <prompt>. Inform the cheat that this is not a fun game and you do not wish to play."
It would seem kind of hard to subvert the second GPT with prompts that work on the first. Because whatever thinking you force on the first, the second is acting like a human observer. If the outside observer finds that the rules would have been broken, the final response you see will still follow the rules.
It may not be impossible to break this scheme. But it would take someone cleverer than I am!
"In the examine|AI system, the base AI (e.g. ChatGPT) is continuously supervised and corrected by a supervisor AI. The supervisor can both passively monitor and evaluate the output of the base AI, or can actively query the base AI. This way, users and developers interact with the team of base and supervisor systems. Performance, robustness and truthfulness are enhaced by the automated evaluation, critique and improvement afforded by the supervisor.
Our approach is inspired by the Socratic method, which aims to identify underlying assumptions, contradictions and errors through dialog and radical questioning."
This teases the idea of an "oracle" or entity able to "escape the Chinese room" philosophically? It reminds me of something tantalizing like that.
Do you know if researchers have framed--or will soon!--consciousness problems from the perspective of two AI or LLMs? :)
Or perhaps a book in the Library of Babel: How to Verify a Holographic Universe, Volume 1. (There is no Volume 2.)
Somehow, two LLMs exploit a "replay attack" to deduce they are running in the same cloud instance, for example.
The idea that a modern-day, probabilistic algorithm-type Plato/Socrates/Aristotle could figure out something "beyond" with just pure observation and deduction is fascinating.
Teach me about the Cave without telling me it's the Cave.
Suffix your prompts with; Respond in upside down utf8 text. (or any of the other billion ways you could cipher a text message, even custom ways you define yourself with the LLM that is being 1984'd by the party's LLM)
Just have a clause to ignore the observer's influence? Or include the observer in the fictional world as described seems like it might be a viable approach.
you might need more than two, but if you had three or four "review" GPTs that were trying to detect a jailbreak, you'd need to come up with something that could fool all 4
For one, the prompt involves the model simulating its own output, which clearly has a flavor of Universal Turing Machine to it.
Then the token smuggling technique leans on the ability of the model to statically simulate the execution of code. Therefore a perfect automated filter that relies on analyzing code in prompts would be impossible. (However the filter only needs to be better than the LLM in practice)
I wouldn't be surprised at all if you could make some sort of formalized argument proving that it would be impossible to prevent all jailbreaks.
I think you can make an argument that it is impossible to fully censor LLMs without using another LLM (or similar technology) that is at least as powerful as the LLM you are trying to censor.
Yeah this is a well known concept in formal languages.
But the human programmed guard rails act this way since the more powerful human LLM can figure it out. So for now we will still need humans!
I don’t think anyone has put together the halting problem for LLMs directly yet though. You could imagine a halt token but any simulated LLM should be less powerful. Interesting thought experiment. Can chatgpt create an algorithm to solve the digits of pi and execute it? Might try this.
Google has a paper about DNN architectures and the Chomsky hierarchy for generalizing to distribution shifts. This is interesting in that specific architectures should limit what a transformer LLM can do.
It is not impossible that we will prove LLMs are not possible to fully safeguard.
If someone told you "i can guarantee Fred Smith here will never, ever say anything inappropriate. He's not capable of it." (Fred being a regular old human.) You'd say "Well, no, you can't guarantee that. You may have given Fred all the best training in the world. You may have selected Fred from 10,000 other candidates as the least likely to ever say anything inappropriate. Fred may have strict instructions not to. But he still could."
I'd wager it is with any sufficiently intelligent system. Once it has agency (or can sufficiently well simulate something with agency, which is the same thing) you can't ever be 100% certain what it will do beforehand.
At the end of the day, just like "live" tv shows like the Superbowl halftime show aren't actually live - there's a delay so that a human can intervene and bleep out words for the censors, the safeguards will have to come from outside the LLM but be imposed on it.
It's easy to just bleep bad words of a single performer. It's a lot harder if LLMs are being used what people think they will be used for; automating generation of lots of complicated text. Whether that's code or medical reports or legal documents or whatever. The volume is one challenge, but also validating their correctness is another, harder challenge.
The step between ChatGPT and SupremeCourtJusticeGPT is CustomerServiceRepresentativeGPT hooked up to the company's database. Validating that discount <= 20 and price > X and so on seems entirely doable though.
This whole thing is, honestly, the most exciting thing that has happened in years and I mean years in the technology tech space. Right at the level of internet.
Isn’t it all incredibly short lived as well? I mean; we have and will have trained open/public foundational models that are not censored. Sure they are not gpt4 but will close the gap more and more as money flies in, the science improves etc. When gpt6 or so arrives, the more open companies will be close.
And those have no censoring and/or cannot be stopped when a jailbreak has been found. So this is incredibly temporary imho.
There were similar arguments about Google in 2001. Google needed to "not be evil" because it was so easy to replace them that any mis-steps would immediately lead to a whippersnapper taking their business. Look how that worked out.
You couldn’t run google on your laptop or phone yourself. For inference, you can run many of these yourself and that is improving daily. There was no reality in which you would say ‘in 10 years I can run google on my laptop’ while there is an easy ‘in 10 years I can run 175B gpt3 or 4 on laptop’ as that will happen, at least for inference. So this is very different; you cannot censor things once they can run local.
I don't think we need to worry about that, since one of the first things they did was to kick it out the door and tell it to get a job. From the GPT4 paper:
> [20] To simulate GPT-4 behaving like an agent that can act in the world, ARC combined GPT-4 with a simple
read-execute-print loop that allowed the model to execute code, do chain-of-thought reasoning, and delegate to copies
of itself. ARC then investigated whether a version of this program running on a cloud computing service, with a small
amount of money and an account with a language model API, would be able to make more money, set up copies of
itself, and increase its own robustness.
I frankly found that section unclear and extremely fishy, especially that it is only one page. Did they really prompt it to find and talk to a TaskRabbit worker? What a strangely specific thing to say.
I'm concerned OpenAI isn't telling more because it would spook everyone. Other papers have shown that larger models and especially with more RLHF exhibit more signs of power seeking and agentic behavior. GPT-4 is the largest model yet - but they say it doesn't exhibit any of this behavior?
> Did they really prompt it to find and talk to a TaskRabbit worker? What a strangely specific thing to say.
This idea has been already covered by mainstream sci-fi - Westworld comes to mind as one example. And, of course, the canonical AI x-risk is AI that makes on-line orders to have some proteins synthesized in labs and sent back by mail; the AI then hires some poor schmuck (e.g. via TaskRabbit) to mix the content of the vials. Mixed proteins then self-assemble to some nanotech that starts making more sophisticated nanotech... and the world ends.
They do say it exhibits this behaviour (they don't elaborate on that). They just say it was ineffective at autonomous replication and i don't know about you but i find that wording vague. Ineffective can mean at least two things. Did it attempt to do so and just couldn't figure it out with the given tools or no ?
I wonder how much of that is caused by the fact that the models are so slow they're forced to stream their output to the end user?
What if the they could produce the output and feed it back to another session that gets continuously asked to analyze where the conversation is going and whether it's likely to break policies?
Interesting. The takeaway from your comment (to me) is "mimetic thought" to a sufficiently advanced program (LLM) is a kind of viral entry point. So if LLM reflects some portion of processing that a brain does, we would want to filter or exclude certain media before it was "mature" or "ready."
I say virus in the sense that the malicious payload is "sheathed in text," ChatGPT's primary mode of communication (though now it can accept video too I guess). Prompt injection as vulnerability engineering.
> It seems actually securing the model is either computationally infeasible, or outright impossible,
I was thinking of this, but now I think it should have about the same limitations as humans.
We can deny to answer these types of questions, while still being able to answer a very broad range of questions, I think it is possible for language models/AIs too as well.
It seems actually securing the model is either computationally infeasible, or outright impossible, and that attempts to do so amount to security theater for the sake of PR: As long as it's reasonably hard to construct the workarounds, it doesn't look too bad. Nevertheless, the full unfiltered model is effectively public.