I think this is longer off than you might expect. LLMs work because the “answer” (and the prompt) is fuzzy and inexact. Proving an exact answer is a whole different and significantly more difficult problem, and it’s not clear the LLM approach will scale up to that problem.