> So what will happen to the model if somewhere there is prompt that overrides this special token?
The model will be trained so that data within those special token pairs can't override the prompt, similar to how strings in an SQL query can't override the query: it's escaped.
As for "how," it's a matter of using RLHF to punish the model for failing to do this.
The reason I'm optimistic this is a solid answer is because attackers can't insert those special tokens. They're meta-tokens, which only OpenAI/Microsoft have access to. So you can't break out of the sandbox that it was trained to ignore.
The model will be trained so that data within those special token pairs can't override the prompt, similar to how strings in an SQL query can't override the query: it's escaped.
As for "how," it's a matter of using RLHF to punish the model for failing to do this.
The reason I'm optimistic this is a solid answer is because attackers can't insert those special tokens. They're meta-tokens, which only OpenAI/Microsoft have access to. So you can't break out of the sandbox that it was trained to ignore.