But there's the rub: testing can't establish that a particular AI system will actually make errors less often than humans. You can program an AI system to handle known failure modes and test for those in simulation. But complex systems tend to have hidden failure modes that no one anticipated, and by definition you can't test how the AI will handle those. An experienced human, on the other hand, can often work out the correct course of action from first principles.
For example, see US Airways Flight 1549. Airbus had never anticipated a double engine failure in those exact circumstances, and the written checklist assumed more altitude and time than the crew had, so they skipped steps and improvised a new procedure on the spot. Would an AI have handled the emergency as well? Doubtful.