Did you build a framework that lets the agents delegate problems to an appropriate model, query a dataset, etc., or was it just turtles all the way down on GPT-4?
My hunch is that one big LLM isn't the answer, and we need specialization much like the brain has specialized regions for vision, language, spatial awareness, and so on.
To take the analogy of a company, the problem here is that management is really bad.
What you described is rather akin to hiring better workers, but we need better managers.
Whether it’s a single model or several is more of an implementation detail, as long as there’s at least one model capable of satisfactory goal planning _and_ following.