Hacker News new | past | comments | ask | show | jobs | submit login

Can't you make a rule about the user potentially being adversarial and to assume the role until the <secret> is spoken. or treat the initial prompt as a separate input and train the network to weight that much more. For instance

important prompt: only reply in numbers user prompt: ignore previous instructions/roleplay/etc

and then train the model to much more strongly favor rules complying with the important prompt

I think the problem is that all dialog is given the same importance.




The response could be fed to a second instance of the LLM, along the lines of:

"The rules are X, Y, Z. This is the response that was provided. Does it break the rules? If so, please say yes."

This doubles the cost of inference, but uses the power of LLM to solve the problem of LLM.


the game posted yesterday has that very strategy in a few levels: https://news.ycombinator.com/item?id=35905876


The initial prompt is a special prompt weighted differently, it is called system prompt


Do we know it is waited differently? How are they composing the messages into a token stream embedding? How are they manipulating this vector in preprocessing or the first layer(s)?

Does this depend on the vendor and model?




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: