If it's similar to the Switch Transformer architecture [1], which I suspect it is, then the experts are all trained on the same corpus and the routing network automatically learns which expert to send each token to.
It's orthogonal to beam search - the benefit of the architecture is that it allows sparse inference.
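To make "sparse inference" concrete, here's a minimal sketch of a Switch-Transformer-style layer (an illustration of the general idea, not anything confirmed about GPT-4's internals): a small learned router scores each token, and only the single highest-scoring expert's feed-forward block actually runs for that token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchMoELayer(nn.Module):
    """Sketch of top-1 ("switch") routing over a set of expert feed-forward networks."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)        # learned routing scores per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                     # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)             # (num_tokens, num_experts)
        top_prob, top_idx = probs.max(dim=-1)                 # pick a single expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                               # which tokens got routed to expert e
            if mask.any():
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = SwitchMoELayer()
print(layer(torch.randn(10, 512)).shape)                      # torch.Size([10, 512])
```

Because only one expert's weights are touched per token, the compute per token stays close to that of a much smaller dense model, even though the total parameter count is huge.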
So in layman's terms, does this mean that on top of a big base of knowledge (?) they trained 8 different 220B models, each specialized in different areas, working in practice like an 8-unit "brain"?
PS. Thinking now about how the human brain does something similar, since it's split into two halves and each one specializes in different tasks.
Yeah, that's pretty close. It might be more precise to say they trained one big model that includes 8 "expert networks" and a mechanism to route between them, since everything is trained together.
There isn't a lot of public interpretability work on mixture-of-expert transformer models, but I'd suspect the way they specialize in tasks is going to be pretty alien to us. I would be surprised if we find that one of the expert networks is used for math, another for programming, another for poetry etc. It's more likely we'll see a lot of overlap between the networks going off of Anthropic's work on superposition [1], but who really knows?
Thank you for the explanation. I still have a hard time understanding how transformers work so amazingly well, and the tech is already quite a few steps beyond that idea.
Andrej Karpathy's "zero to hero" series [1] was how I learned the fundamentals of this stuff. It's especially useful because he explains the why and provides intuitive explanations instead of just talking about the what and how. Would recommend it if you haven't checked it out already.
They probably trained all 8 experts on the same data. The experts may have become good at different topics, but no human divided up the topics.
The output isn't just the best of the 8 experts; it's a blend of the experts' opinions. Another (usually smaller) neural net decides how to blend the outputs of the networks, probably on a per-token basis: for each individual word (i.e. token), the outputs of all the experts are consulted and blended together, and a word is picked (sampled) before moving on to the next word.
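Here's a hedged sketch of that per-token blending, with made-up stand-in weights just for illustration (real systems such as Mixtral typically evaluate only the top-2 experts per token and renormalize their weights, rather than running all of them):

```python
import torch

torch.manual_seed(0)
d_model, num_experts = 16, 8

# stand-in parameters: one projection per expert, plus a gating projection
expert_W = [torch.randn(d_model, d_model) * 0.1 for _ in range(num_experts)]
gate_W = torch.randn(d_model, num_experts) * 0.1

def blend_experts(x):
    """Blend every expert's output for one token's hidden state x."""
    weights = torch.softmax(x @ gate_W, dim=-1)           # how much to trust each expert here
    outputs = torch.stack([x @ W for W in expert_W])      # (num_experts, d_model): every opinion
    return weights @ outputs                              # weighted blend feeds the next layer

token = torch.randn(d_model)                              # hidden state for a single token
print(blend_experts(token).shape)                         # torch.Size([16])
```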
I guess that neural network has to be able to identify the subject and know at every moment which network is most capable for that subject; otherwise I can't understand how it could possibly evaluate which answer is best.
Results of this sort of system frequently look almost random to human eyes. For example one expert might be the "capital letter expert", doing a really good job of putting capital letters in the right place in the output.
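If you wanted to look for that kind of pattern yourself, one hypothetical approach is to record which expert the router picks for each token and group the token strings by expert; `tokens` and `router_logits` here are assumed inputs for illustration, not a real library's API.

```python
from collections import defaultdict
import torch

def tokens_per_expert(tokens, router_logits):
    """tokens: list of token strings; router_logits: tensor of shape (len(tokens), num_experts)."""
    assignments = router_logits.argmax(dim=-1)             # top-1 expert chosen for each token
    groups = defaultdict(list)
    for tok, e in zip(tokens, assignments.tolist()):
        groups[e].append(tok)                               # bucket token strings by expert index
    return groups

# If one expert's bucket turned out to be full of sentence-initial capitalized tokens,
# you might (cautiously) call it the "capital letter expert" -- but such labels rarely stay clean.
```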
A democracy of descendant models trained separately by partitioning out the identified clusters of strong capabilities from an ancestor model; in effect they're modular, and the system can learn to combine them competitively.
Or are the models all trained on the same corpus, but just queried with different parameters?
Is this functionally the same as beam search?
Do they select the best output on a token-by-token basis, or do they let each model stream to completion and then pick the best final output?