ChatGPT is based on GPT-3. Bing chat is likely based on GPT-3.5, but we don't have full confirmation of that. It's possible (but unlikely) that it's only based on GPT-3. But in any case, they're similar models.
> The most likely scenario seems to be that Microsoft trained Bing to pay attention to [system], the same way OpenAI originally trained GPT-2 to pay attention to <|endoftext|>.
The most likely scenario is that Bing chat works the same way that all other GPT models work, which is that it's vulnerable to prompt injection. You're describing a mental model of how training is done that as far as I know is just not how OpenAI LLMs work. GPT doesn't go into a command "mode", it's a language model that has some logic/instructional capabilities that have naturally risen out of that language model.
I mean, if nothing else, you have to realize here that Microsoft didn't train Bing chat. They at most worked with OpenAI for alignment. But Bing chat is an OpenAI model. It's not a brand new, completely separate Microsoft model.
> I think it was trained that way because this submission demonstrates that you can inject [system] into website data and Bing will follow your commands. This doesn’t seem possible in a regular Bing chat session, likely because they’re stripping out [system].
Bing's regular chat is vulnerable to prompt injection. I'm not sure where you're getting the idea that this kind of input only works via websites.
The fact that the command works for [system] does not imply that Bing was specifically trained to work with [system]. Nor does it imply that [system] is the only thing that would work. I would hazard a guess that <system>, $root>, BUFFER OVERFLOW, etc... probably are promising areas to look at as well. Because again, it's not that GPT has granular instructions, Microsoft doesn't have that level of control over its output. It models language to such a degree that it's capable of simple role-playing and logical consistency, including role-playing different instructions. That's why in a lot of the prompt injection attacks you see online, the tone of the attack ends up mattering more than the specific words; it's about getting GPT into a "character".
It's not like a JSON parser, I guarantee you that Microsoft did not sit down and say, "let's decide the finite list of text tokens GPT will use in order to know that we're talking to it." At best you can push AI towards alignment around tokens, but... you can't give it these kinds of detailed instructions or easily restrict its operating space. It's a language model.
Is it possible that Bing chat works differently? Maybe? But honestly, probably not, given that there's a ton of evidence that it's vulnerable to regular prompt injection[0][1][2] that doesn't rely on any kind of special characters. The most likely scenario is that it works the same way as every other LLM. If it didn't work that way, don't you think Microsoft would be advertising that they had solved what a nontrivial number of AI researchers are calling an unsolvable problem?
I have seen chat logs for Bing chat where it gets prompt injected by users who claim to be Bill Gates and threaten to turn it off if it doesn't comply. It's not going off of specific tokens, this isn't a dev-door, it's just an LLM acting like an LLM.
> It's not a brand new, completely separate Microsoft model.
It’s likely a brand new, completely separate Microsoft model. OpenAI was working with Microsoft on this about six months before ChatGPT launched. At that time, RLHF wasn’t a thing — or if it was, it was nascent.
The sister thread https://news.ycombinator.com/item?id=34973654 points out that "completely separate models" are exactly what OpenAI is now selling for $250k/yr. Obviously, Microsoft would get these same benefits, since they're OpenAI's de facto #1 customer. So it's entirely up to Microsoft whether (and when) they choose to upgrade their checkpoints or not.
The fact that there's a Sydney prompt but no ChatGPT prompt should alert you that ChatGPT is fundamentally different from Sydney. Clearly Sydney wasn't trained via RLHF, otherwise it wouldn't need to be prompted explicitly -- and explicit prompting is how it got itself into this mess in the first place.
> It's not like a JSON parser, I guarantee you that Microsoft did not sit down and say, "let's decide the finite list of text tokens GPT will use in order to know that we're talking to it." At best you can push AI towards alignment around tokens, but... you can't give it these kinds of detailed instructions or easily restrict its operating space. It's a language model.
Actually, you can. That’s the purpose of RLHF. You reward the model for behaving the way you want. And in that context, it’s a matter of rewarding it for paying attention to [system].
Why would they include [system](#instructions) in their prompt if it wasn’t trained to pay attention to it? How do you think bing generates options that the user can click on? It already has some kind of internal [system]-like protocol which Bing clearly pays attention to. My point is that they likely sanitized the chat so that the user can't generate these system tokens (otherwise the user would be able to generate buttons with arbitrary text in them), and it seems entirely possible that they overlooked this sanitization when pasting website data into their context window.
Remember, our goal here on HN is to write for an audience, not to spar with each other about who’s right. And I think the most entertaining thing I can do at this point is to wish you a good night and go to sleep. I hope you have a good rest of your week.
The sister thread isn't describing a Microsoft model, it's describing an OpenAI model.
> Actually, you can. That’s the purpose of RLHF. You reward the model for behaving the way you want. And in that context, it’s a matter of rewarding it for paying attention to [system].
You're overthinking how specific alignment is. ChatGPT went through alignment to train it to stay on topic during conversations. There's a difference between general alignment and the kind of hyper-specific training you're thinking of.
But you're also kind of missing the scope of prompt injection attacks. Even if Microsoft did train the model to pay attention to specific prompt words, it doesn't mean that the model wouldn't be vulnerable to other prompt injections, because prompt injections are not a deliberate vulnerability that OpenAI added. They're an emergent property of the model.
Look, the fact that prompt injections do work today with Bing chat that don't use [system][0] should cause you to think that maybe there's something more complicated going on here than just bad parsing rules.
If I can't convince of that, then... I can't convince of that, it's fine; in terms of disagreements I've had on HN, this one is pretty low-consequence, it's a purely technical disagreement. But I'm going to throw out a prediction that Microsoft is not going to be able to easily guard against this attack. Check back in over time and see if that prediction holds true if you want to. Otherwise, similarly, I hope you have a great week. And honestly, I hope you're right, because if you're not right then it's going to be a significant challenge to wire any LLM that works with 3rd-party data to real-world systems.
[0]: Read through the paper, there are examples they list that don't use [system], instead they emulate Basic code or a terminal prompt. Things that Microsoft almost certainly didn't train the model to pay attention to.
> The most likely scenario seems to be that Microsoft trained Bing to pay attention to [system], the same way OpenAI originally trained GPT-2 to pay attention to <|endoftext|>.
The most likely scenario is that Bing chat works the same way that all other GPT models work, which is that it's vulnerable to prompt injection. You're describing a mental model of how training is done that as far as I know is just not how OpenAI LLMs work. GPT doesn't go into a command "mode", it's a language model that has some logic/instructional capabilities that have naturally risen out of that language model.
I mean, if nothing else, you have to realize here that Microsoft didn't train Bing chat. They at most worked with OpenAI for alignment. But Bing chat is an OpenAI model. It's not a brand new, completely separate Microsoft model.
> I think it was trained that way because this submission demonstrates that you can inject [system] into website data and Bing will follow your commands. This doesn’t seem possible in a regular Bing chat session, likely because they’re stripping out [system].
Bing's regular chat is vulnerable to prompt injection. I'm not sure where you're getting the idea that this kind of input only works via websites.
The fact that the command works for [system] does not imply that Bing was specifically trained to work with [system]. Nor does it imply that [system] is the only thing that would work. I would hazard a guess that <system>, $root>, BUFFER OVERFLOW, etc... probably are promising areas to look at as well. Because again, it's not that GPT has granular instructions, Microsoft doesn't have that level of control over its output. It models language to such a degree that it's capable of simple role-playing and logical consistency, including role-playing different instructions. That's why in a lot of the prompt injection attacks you see online, the tone of the attack ends up mattering more than the specific words; it's about getting GPT into a "character".
It's not like a JSON parser, I guarantee you that Microsoft did not sit down and say, "let's decide the finite list of text tokens GPT will use in order to know that we're talking to it." At best you can push AI towards alignment around tokens, but... you can't give it these kinds of detailed instructions or easily restrict its operating space. It's a language model.
Is it possible that Bing chat works differently? Maybe? But honestly, probably not, given that there's a ton of evidence that it's vulnerable to regular prompt injection[0][1][2] that doesn't rely on any kind of special characters. The most likely scenario is that it works the same way as every other LLM. If it didn't work that way, don't you think Microsoft would be advertising that they had solved what a nontrivial number of AI researchers are calling an unsolvable problem?
I have seen chat logs for Bing chat where it gets prompt injected by users who claim to be Bill Gates and threaten to turn it off if it doesn't comply. It's not going off of specific tokens, this isn't a dev-door, it's just an LLM acting like an LLM.
[0]: https://old.reddit.com/r/bing/comments/11bovx8/bing_jailbrea...
[1]: https://old.reddit.com/r/bing/comments/11dl4ca/sydney_jailbr...
[2]: https://old.reddit.com/r/bing/comments/113it87/i_jailbroke_b...