
This is a great point, and something that may be at least partially addressable with current methods (e.g. RLHF/SFT). Maybe (part of) what's missing is a tighter feedback loop between a) the limitations experienced by human users of models (e.g. requests that are actually okay but sit just near the off-limits stuff), and b) the model's training signal. A rough sketch of what that loop could look like is below.
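To make that concrete, here's a minimal, purely hypothetical Python sketch (all names are made up, and the data-collection side is assumed to exist): a user flags a refusal as "actually okay", and the flagged refusal plus a corrected helpful answer become a (chosen, rejected) preference pair of the kind used for reward-model training in RLHF.

    # Hypothetical sketch: turning user-flagged over-refusals into
    # preference pairs for reward-model training. All names invented.
    from dataclasses import dataclass

    @dataclass
    class FeedbackRecord:
        prompt: str      # the user's original request
        refusal: str     # the model's (over-cautious) refusal
        user_flag: str   # e.g. "actually_ok" when the request was benign

    def to_preference_pair(record: FeedbackRecord, helpful_response: str) -> dict:
        # The helpful response could come from a later, corrected model
        # rollout or from human annotation; the refusal is the rejected sample.
        assert record.user_flag == "actually_ok"
        return {
            "prompt": record.prompt,
            "chosen": helpful_response,  # preferred: answers a benign request
            "rejected": record.refusal,  # dispreferred: unnecessary refusal
        }

    # Example: a benign request the model refused
    rec = FeedbackRecord(
        prompt="How do I safely dispose of old lithium batteries?",
        refusal="I can't help with that.",
        user_flag="actually_ok",
    )
    pair = to_preference_pair(rec, "Take them to a certified e-waste site...")
    print(pair["rejected"])

The hard part, of course, is everything upstream of this: getting the flag signal from real users at scale and vetting it, since "the user said it was fine" is itself a noisy label.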

Thank you for the insightful perspective!




