I think the hope/dream here is to make end-to-end tests less flaky. It would be great to have navigation and assertion commands that are robust against simple changes in the app that aren't relevant to the test case.
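Concretely, I picture something like the sketch below. The Playwright imports and the final assertion are real API; `aiStep` is a made-up placeholder for whatever AI driver would translate intent into actual clicks:

```typescript
import { test, expect, type Page } from '@playwright/test';

// Hypothetical: resolve a natural-language instruction into page actions.
// A real implementation would call a model and map its plan onto the page;
// this stub only marks where that would happen.
async function aiStep(page: Page, instruction: string): Promise<void> {
  throw new Error(`aiStep not implemented: "${instruction}"`);
}

test('user can check out', async ({ page }) => {
  await page.goto('https://example.com/shop');

  // Intent-level commands: no hard-coded CSS selectors, so renamed
  // classes or reordered DOM nodes wouldn't break the test.
  await aiStep(page, 'add the first product to the cart');
  await aiStep(page, 'go to the cart and start checkout');

  // A conventional, semantic assertion as a reality check: anchoring on
  // user-visible text keeps the test grounded even with an AI driver.
  await expect(page.getByRole('heading', { name: /checkout/i })).toBeVisible();
});
```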
No. Both requirements, "to interact" and "based on what it looks like", demand an unshakable grounding in reality, which current models clearly lack.
They will inevitably hallucinate interactions and observations and therefore decrease reliability. Worse, they will inject a pervasive sense of doubt into any test they touch.
Yes, you are correct that it rests entirely on the reputation of the AI.
This discussion leads to an interesting question: what is quality?
Quality is determined by perception. If we can agree that an AI acts like a user and it can use your website, we can assume that a real user can use your website, and therefore it has "quality".
For more, read "Zen and the Art of Motorcycle Maintenance".