But the LLM doesn't "want" anything. Prompt goes in, tokens come out. When there are no prompts coming in, there are no tokens coming out. Just stop talking to it and all risks are eliminated...
You can't "align it to want certain things" when it doesn't have the capacity to "want" in the first place.
Give it prompts and biased training and it will present surface behavior that is virtually indistinguishable from actual wants and needs.
If I create a robot that is 1000% identical to you in every possible way on the surface, then we literally cannot tell the difference. Might as well say it's you.
All AI needs to do is reach a point where the difference cannot be ascertained, and that is enough. And we're already here. Can you absolutely prove to me that LLMs do not feel a shred of "wants" or "needs" in any way similar to humans when generating an answer to a prompt? No, you can't. We understand LLMs as black boxes, and we talk about LLMs in qualitative terms like we're dealing with people rather than computers: the LLM hallucinates, the LLM is deceptive, etc.
Maybe it wants, maybe it doesn't. Being a function of the prompt isn't relevant here. You can think of an LLM in regular usage as being stepped in a debugger - fed input, executed for one cycle, paused until you consider the output and prepare a response. In contrast, our brains run in real time. Now, imagine we had a way to pause the brain and step its execution. Being paused and resumed, or restarted from a snapshot after a few steps, would not make the mind in that brain stop "wanting" things.
What would “the LLM” tell them? It does not have any memory of what happened after its training. It has no recollection of any interaction with you. The only simulacrum of history it has is a hidden prompt designed to trick you into thinking that it is more than what it actually is.
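That "simulacrum of history" can be made concrete: a chat session with a stateless model keeps no memory in the model at all. The client re-sends the entire transcript, with a hidden system prompt prepended, on every turn. A minimal sketch, where `fake_llm` is a stand-in for a real model (the function names and the hidden prompt text are illustrative, not any vendor's actual API):

```python
# A "conversation" with a stateless model: the model is a pure
# function of its input, so all apparent memory lives in the
# transcript the client rebuilds and re-sends each turn.

HIDDEN_PROMPT = "You are a helpful assistant."  # the user never sees this


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: same prompt in, same tokens out.
    Nothing is remembered between calls."""
    return f"[reply conditioned on {len(prompt)} chars of context]"


def chat_turn(history: list[str], user_message: str) -> tuple[list[str], str]:
    """One step: append the user's message, rebuild the full prompt
    (hidden prompt + entire transcript), call the model once, pause."""
    history = history + [f"User: {user_message}"]
    prompt = "\n".join([HIDDEN_PROMPT] + history)
    reply = fake_llm(prompt)
    return history + [f"Assistant: {reply}"], reply


history: list[str] = []
history, r1 = chat_turn(history, "Remember the number 42.")
history, r2 = chat_turn(history, "What number did I tell you?")

# Drop the transcript and the "memory" is gone: the same question
# asked against an empty history produces a differently conditioned reply.
_, r3 = chat_turn([], "What number did I tell you?")
```

The model function itself is identical on every call; delete the `history` list and there is no interaction left anywhere for "the LLM" to recall.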
What the police would do is seize your files. These would betray your secrets, LLM or not.
The other way is to train it on biased information that aligns with a certain agency, desire, wish or impulse.