This feels to me like the most useless definition of "AI safety" in practice, and it's astonishing to see just how much R&D efforts are spent on it.
Thankfully the open-weights models are trivially jailbreakable regardless of any baked-in guardrails simply because one controls the generation loop and can make the model not refuse.
Thankfully the open-weights models are trivially jailbreakable regardless of any baked-in guardrails simply because one controls the generation loop and can make the model not refuse.