Maybe the key to a good universal LLM is having multiple fine-tuned models for various domains. The user thinks they're querying a single model, but really some mechanism selects the best model for their query out of, say, 300 different possibilities.
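Here's a minimal sketch of what that selection mechanism might look like, with a toy keyword matcher standing in for a real router (every model name and function here is invented for illustration):

```python
# Hypothetical router: pick a domain-tuned model per query.
# MODELS and classify_domain are illustrative stand-ins, not a real API.

MODELS = {
    "physics": "llm-physics-tuned",
    "art": "llm-art-tuned",
    "general": "llm-generalist",  # fallback when no domain fits
}

def classify_domain(query: str) -> str:
    """Toy keyword match; a real router might use embeddings or a small
    classifier trained over the ~300 domains."""
    q = query.lower()
    if any(w in q for w in ("quantum", "physics", "relativity")):
        return "physics"
    if any(w in q for w in ("art", "painting", "sculpture")):
        return "art"
    return "general"

def route(query: str) -> str:
    """Return the model that should handle this query."""
    return MODELS[classify_domain(query)]

print(route("What is quantum entanglement?"))  # llm-physics-tuned
```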
This also helps distribute traffic as a side effect.
I guess the problem is how the conversation would flow. If the user switches topics from, say, art to quantum physics, and then asks a question that spans both quantum physics and art, I'm not sure what the algorithm should do.
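One possible behavior, sketched under the same invented setup: score every domain, and if zero domains match or more than one does, hand the query to a generalist model instead of guessing:

```python
# Hypothetical fallback routing: ambiguous queries go to a generalist.
# score_domains is an invented stand-in for real classifier scores.

def score_domains(query: str) -> dict[str, int]:
    """Toy scoring by keyword hits; a real system might use per-domain
    classifier probabilities or embedding similarity."""
    q = query.lower()
    return {
        "physics": sum(w in q for w in ("quantum", "physics", "relativity")),
        "art": sum(w in q for w in ("art", "painting", "sculpture")),
    }

def route_with_fallback(query: str) -> str:
    hits = [d for d, s in score_domains(query).items() if s > 0]
    if len(hits) == 1:
        return f"llm-{hits[0]}-tuned"
    # Zero hits (off-topic) or multiple hits (question spans subjects):
    # don't guess, hand it to the generalist.
    return "llm-generalist"

print(route_with_fallback("How does quantum physics influence modern art?"))
# -> llm-generalist (both domains match, so neither owns the question)
```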
Take two users: one is talking about physics, the other about art. Two different models serve them, so the load is divided across the two models. Load balancing comes for free, with the split falling along subject lines. Of course, this assumes each model owns its own set of GPUs.
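Under that assumption, the hardware split is just a static mapping from model to GPU pool, something like this (device IDs invented for illustration):

```python
# Hypothetical static GPU assignment: each domain model owns its devices,
# so traffic for different subjects never contends for the same hardware.
GPU_POOLS = {
    "llm-physics-tuned": [0, 1, 2, 3],
    "llm-art-tuned": [4, 5, 6, 7],
    "llm-generalist": [8, 9, 10, 11],
}

def gpus_for(model: str) -> list[int]:
    return GPU_POOLS[model]
```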