Agreed. These are instruction-tuned: they will follow the instructions, so much so that not even the strongest RLHF can currently prevent well-structured jailbreaking.
In my experience their attention is strongest towards the end of the last message, which could be the reason for injections being so effective with little effort. Within the OpenAI models as of today the first user message is much stronger than the system message.
Given the ChatML spec and their end-to-end control over the models, I wonder whether the system message could end up being sandboxed by architecture and/or training.
In my experience their attention is strongest towards the end of the last message, which could be the reason for injections being so effective with little effort. Within the OpenAI models as of today the first user message is much stronger than the system message.
Given the ChatML spec and their end-to-end control over the models, I wonder whether the system message could end up being sandboxed by architecture and/or training.