Odd how many of those instructions are almost always ignored (e.g., "don't apologize," "don't explain code without being asked"). What is even the point of these system prompts if they're so weak?
It's common for neural networks to struggle with negative prompting. Typically it works better to phrase expectations positively, e.g. “be brief” might work better than “do not write long replies”.
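As a rough sketch of what that looks like in practice (the prompt wording and model id here are my own illustrative assumptions, not Anthropic's actual system prompt text), the only thing that changes between the two styles is how the `system` string is phrased:

```python
# Illustrative sketch only: prompt wording and model id are assumptions,
# not Anthropic's actual system prompt text.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Negatively phrased instruction: tells the model what NOT to do.
negative_system = "Do not write long replies. Do not apologize."

# Positively phrased instruction: states the desired behavior directly.
positive_system = "Be brief. State refusals plainly and move on."

def ask(system_prompt: str, question: str) -> str:
    """Send the same question under a given system prompt and return the reply text."""
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed model id; substitute any available one
        max_tokens=300,
        system=system_prompt,
        messages=[{"role": "user", "content": question}],
    )
    return message.content[0].text

# Running the same question under both phrasings makes the difference easy to eyeball.
if __name__ == "__main__":
    q = "Can you explain what a mutex is?"
    print("negative phrasing:\n", ask(negative_system, q), "\n")
    print("positive phrasing:\n", ask(positive_system, q))
```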
But surely Anthropic knows better than almost anyone on the planet what does and doesn't work well to shape Claude's responses. I'm curious why they're choosing to write these prompts at all.
I’ve previously noticed that Claude is far less apologetic and more assertive when refusing requests compared to other AIs. I think the answer is simply that they’re fine with nudging the behavior in that direction rather than guaranteeing it. The section on pretending not to recognize faces implies they’d take a much more extensive approach if they really wanted to make something never happen.
It lowers the probability. It's well known that LLMs are only imperfectly reliable at following instructions -- part of the reason "agent" projects haven't succeeded so far.