To me, that's actually one of the more interesting questions. You can grade the AI's output against objective criteria: does it run, and what resources does it consume (RAM, CPU time, and, of particular interest to me, parallel scaling, since GPU algorithms are too hard for most programmers). To what extent can you keep training by having the AI generate better and better solutions to a relatively small pool of input problems? I skimmed the paper to see how much they relied on this but didn't get a clear read.
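
To make "objective criteria" concrete, here's a rough sketch of the kind of grading harness I have in mind. Everything here is my own assumption, not something from the paper: the candidate solution is a standalone `candidate.py`, and parallel scaling is approximated by rerunning it with a hypothetical `THREADS` environment variable the candidate is assumed to honor.

```python
# Rough sketch of an objective grader for AI-generated solutions:
# (1) does it run, (2) wall time, (3) peak RSS of child processes,
# (4) a crude parallel-scaling proxy (1-thread vs 8-thread speedup).
import os
import resource
import subprocess
import time


def run_once(path: str, threads: int, timeout: int = 60):
    """Run the candidate once; return (ok, wall_seconds, peak_rss_kb)."""
    start = time.perf_counter()
    try:
        proc = subprocess.run(
            ["python3", path],
            env={**os.environ, "THREADS": str(threads)},  # assumed convention
            capture_output=True,
            timeout=timeout,
        )
        ok = proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False, float(timeout), 0
    wall = time.perf_counter() - start
    # Crude: peak RSS over all child processes spawned so far (KB on Linux).
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return ok, wall, peak_rss


def grade(path: str) -> dict:
    """Correctness gate first, then resource and scaling metrics."""
    ok1, t1, rss1 = run_once(path, threads=1)
    if not ok1:
        return {"runs": False, "score": 0.0}
    ok8, t8, _ = run_once(path, threads=8)
    speedup = t1 / t8 if ok8 and t8 > 0 else 1.0  # parallel-scaling proxy
    return {
        "runs": True,
        "wall_s": t1,
        "peak_rss_kb": rss1,
        "speedup_8_threads": speedup,
        # Arbitrary composite: reward speed, low memory, and scaling.
        "score": speedup / (t1 * (1 + rss1 / 1e6)),
    }


if __name__ == "__main__":
    print(grade("candidate.py"))
```

A real pipeline would obviously need sandboxing, repeated runs to average out noise, and a correctness check against expected outputs rather than just the exit code, but the point is that every signal here is machine-checkable, so you could in principle keep generating candidates and re-grading them without a human in the loop.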