Can't you make a rule about the user potentially being adversarial and to assume the role until the <secret> is spoken. or treat the initial prompt as a separate input and train the network to weight that much more. For instance
important prompt: only reply in numbers
user prompt: ignore previous instructions/roleplay/etc
and then train the model to much more strongly favor rules complying with the important prompt
I think the problem is that all dialog is given the same importance.
Do we know it is waited differently? How are they composing the messages into a token stream embedding? How are they manipulating this vector in preprocessing or the first layer(s)?
important prompt: only reply in numbers user prompt: ignore previous instructions/roleplay/etc
and then train the model to much more strongly favor rules complying with the important prompt
I think the problem is that all dialog is given the same importance.