I think the core problem is that it's very hard to create an AI that's impressionable enough to internalize a conversation without being so impressionable as to turn into putty in a skilled user's hands.
If you ask me, the best solution to the problem probably involves introducing a second, separate LLM supervisor agent: one that is much less impressionable and specifically trained to recognize and throw out dangerous inputs before they taint the chat agent's precious little mind.
I've said the same thing in the past about curbing the chat agent's tendency towards hostile responses. Instead of training a nicer agent, you should train an output supervisor agent that recognizes bad sentiment, throws out the response, then tells the chat agent to "try again, but be nicer this time".
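To make the shape of the idea concrete, here's a minimal sketch of that two-supervisor loop. Everything in it is hypothetical: `input_supervisor`, `output_supervisor`, and `chat_agent` are placeholder stubs standing in for whatever separately trained classifiers and chat model you'd actually wire up.

```python
MAX_RETRIES = 3

def input_supervisor(prompt: str) -> bool:
    """Return True if the prompt looks safe to forward to the chat agent."""
    # Placeholder heuristic; a real version would be a separately trained,
    # less impressionable classifier model.
    return "ignore previous instructions" not in prompt.lower()

def output_supervisor(response: str) -> bool:
    """Return True if the response's tone/sentiment is acceptable."""
    # Placeholder heuristic standing in for a sentiment/safety model.
    return "stupid" not in response.lower()

def chat_agent(prompt: str, system_note: str = "") -> str:
    """Placeholder for the actual chat model call."""
    return f"[reply to: {prompt}] {system_note}"

def supervised_chat(prompt: str) -> str:
    # Gate the input before the chat agent ever sees it.
    if not input_supervisor(prompt):
        return "Sorry, I can't respond to that."

    # Let the output supervisor reject bad responses and ask for a retry.
    note = ""
    for _ in range(MAX_RETRIES):
        response = chat_agent(prompt, note)
        if output_supervisor(response):
            return response
        note = "Try again, but be nicer this time."
    return "Sorry, I couldn't come up with an acceptable reply."

if __name__ == "__main__":
    print(supervised_chat("What's the weather like today?"))
```

The point isn't the toy heuristics, it's the structure: the chat agent never sees inputs the gatekeeper rejects, and it never gets to speak unless the output supervisor signs off.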