It's trivial to test: use ChatGPT yourself and ask it to solve the same problem several times in fresh sessions, then paste in all the attempts and ask for a combined result.
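This is easy to script too. Here's a minimal sketch using the OpenAI Python client; the model name, prompt wording, and attempt count are placeholders, not anything prescribed above:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    QUESTION = "..."   # the problem you want solved
    N_ATTEMPTS = 4

    def ask(prompt: str) -> str:
        """One fresh 'session': a single-message chat completion."""
        resp = client.chat.completions.create(
            model="gpt-4",  # placeholder; any chat model works
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    # Independent attempts, each in a fresh context.
    attempts = [ask(QUESTION) for _ in range(N_ATTEMPTS)]

    # One final call that sees all attempts and merges them.
    merge_prompt = (
        f"Question:\n{QUESTION}\n\n"
        + "\n\n".join(f"Attempt {i + 1}:\n{a}" for i, a in enumerate(attempts))
        + "\n\nCombine these attempts into a single best answer."
    )
    print(ask(merge_prompt))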
The main issue is context length: with 4 attempts you have to fit the original question, the four draft answers, and the final answer into one window. That's 6 roughly equal-sized chunks of text. With GPT-4's 8K limit, that's only about 1,300 tokens per chunk, or roughly 900 words. That's not a lot!
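The back-of-the-envelope arithmetic, assuming a rough ~0.7 words-per-token average for English text:

    CONTEXT_LIMIT = 8_000   # GPT-4's original window, rounded down for overhead
    CHUNKS = 6              # question + 4 attempts + final answer
    tokens_per_chunk = CONTEXT_LIMIT // CHUNKS       # ~1,333 tokens
    words_per_chunk = int(tokens_per_chunk * 0.7)    # ~930 words
    print(f"{tokens_per_chunk} tokens, ~{words_per_chunk} words per chunk")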
The LLMs with longer context windows aren't as intelligent: they tend to miss details and follow instructions less accurately.
Right now this is just a gimmick that demonstrates that more intelligence can be squeezed out of even existing LLMs...