Test it how? What makes it fail? The ability to tell people how to make a bomb? Being able to say what (few) good things Hitler accomplished for Germany? Giving medical advice? Where’s the line?
One thing the law explicitly requires is full shutdown capability. A model that can copy itself onto other machines can't be fully shut down, so it should be tested on whether it can autonomously hack computers on the internet and propagate itself. In fact, Anthropic tested this. See https://metr.org/ for more.