Why not just generate completely random stuff and ask it to find things in that?

We have run that test: generate random strings (not with an LLM) to use as the names of values, then ask the LLM to do math (algebra) using those strings. It tests logic, and the problems are 100% not in the data set. GPT-2 was around 50% accurate; now we're up around 90%.
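For anyone who wants to try something similar, here is a rough sketch of that kind of test in Python. It is not the actual harness: the helper names (`random_name`, `make_problem`, `is_correct`), the prompt wording, the number ranges, and the "last integer in the reply" scoring rule are all assumptions made for illustration.

```python
import random
import re
import string


def random_name(rng, length=8):
    """Return a random lowercase string to use as a variable name."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def make_problem(rng):
    """Build one linear-equation prompt and its known integer answer.

    Hypothetical problem shape: coeff * unknown + offset = total.
    """
    # Draw three distinct random names (collisions are unlikely but possible).
    names = []
    while len(names) < 3:
        candidate = random_name(rng)
        if candidate not in names:
            names.append(candidate)
    unknown, coeff_name, offset_name = names

    # Pick the solution first so the answer is always an exact integer.
    coeff = rng.randint(2, 9)
    solution = rng.randint(-20, 20)
    offset = rng.randint(-50, 50)
    total = coeff * solution + offset

    prompt = (
        f"Let {coeff_name} = {coeff} and {offset_name} = {offset}. "
        f"If {coeff_name} * {unknown} + {offset_name} = {total}, "
        f"what is {unknown}? Answer with a single integer."
    )
    return prompt, solution


def is_correct(reply, answer):
    """Score a reply by comparing its last integer to the known answer."""
    numbers = re.findall(r"-?\d+", reply)
    return bool(numbers) and int(numbers[-1]) == answer


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the problem set is reproducible
    for _ in range(3):
        prompt, answer = make_problem(rng)
        print(prompt, "->", answer)
```

To get an accuracy number you would send each prompt to the model and average `is_correct` over the replies. The model call is left out on purpose, since it depends on whatever API you are using.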

