It's trivial to test: use ChatGPT yourself and ask it to solve the same problem several times in fresh sessions, then paste in all the attempts and ask for a combined result.
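This is easy to script too. Here's a minimal sketch using the OpenAI Python client; the model name, prompt wording, and attempt count are placeholders, not anything prescribed above:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    QUESTION = "..."   # the problem you want solved
    N_ATTEMPTS = 4

    def ask(prompt: str) -> str:
        """One fresh 'session': a single-message chat completion."""
        resp = client.chat.completions.create(
            model="gpt-4",  # placeholder; any chat model works
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    # Independent attempts, each in a fresh context.
    attempts = [ask(QUESTION) for _ in range(N_ATTEMPTS)]

    # One final call that sees all attempts and merges them.
    merge_prompt = (
        f"Question:\n{QUESTION}\n\n"
        + "\n\n".join(f"Attempt {i + 1}:\n{a}" for i, a in enumerate(attempts))
        + "\n\nCombine these attempts into a single best answer."
    )
    print(ask(merge_prompt))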
The main issue is context length: with 4 attempts you have to fit the original question, the four draft answers, and the final answer into one window. That's 6 roughly equal-sized chunks of text. With GPT-4's 8K limit, that's only about 1,300 tokens per chunk, or roughly 900 words. That's not a lot!
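The back-of-the-envelope arithmetic, assuming a rough ~0.7 words-per-token average for English text:

    CONTEXT_LIMIT = 8_000   # GPT-4's original window, rounded down for overhead
    CHUNKS = 6              # question + 4 attempts + final answer
    tokens_per_chunk = CONTEXT_LIMIT // CHUNKS       # ~1,333 tokens
    words_per_chunk = int(tokens_per_chunk * 0.7)    # ~930 words
    print(f"{tokens_per_chunk} tokens, ~{words_per_chunk} words per chunk")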
The LLMs with longer context windows aren't as intelligent: they tend to miss details and follow instructions less accurately.
Right now this is just a gimmick that demonstrates that more intelligence can be squeezed out of even existing LLMs...