If it's similar to the Switch Transformer architecture [1], which I suspect it is, then the experts are all trained on the same corpus and the routing network automatically learns which expert to send each token to.
It's orthogonal to beam search - the benefit of the architecture is that it allows sparse inference.
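To make "sparse inference" concrete, here's a minimal sketch of a Switch-Transformer-style layer (an illustration of the general idea, not anything confirmed about GPT-4's internals): a small learned router scores each token, and only the single highest-scoring expert's feed-forward block actually runs for that token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchMoELayer(nn.Module):
    """Sketch of top-1 ("switch") routing over a set of expert feed-forward networks."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)        # learned routing scores per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                     # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)             # (num_tokens, num_experts)
        top_prob, top_idx = probs.max(dim=-1)                 # pick a single expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                               # which tokens got routed to expert e
            if mask.any():
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = SwitchMoELayer()
print(layer(torch.randn(10, 512)).shape)                      # torch.Size([10, 512])
```

Because only one expert's weights are touched per token, the compute per token stays close to that of a much smaller dense model, even though the total parameter count is huge.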
So in layman's terms, does this mean that on top of a big base of knowledge (?) they trained 8 different 220B models, each specialized in different areas, working in practice like an 8-unit "brain"?
PS. Thinking now about how the human brain does something similar, since it's split into two halves and each one specializes in different tasks.
Yeah, that's pretty close. It might be more precise to say they trained one big model that includes 8 "expert networks" and a mechanism to route between them, since everything is trained together.
There isn't a lot of public interpretability work on mixture-of-expert transformer models, but I'd suspect the way they specialize in tasks is going to be pretty alien to us. I would be surprised if we find that one of the expert networks is used for math, another for programming, another for poetry etc. It's more likely we'll see a lot of overlap between the networks going off of Anthropic's work on superposition [1], but who really knows?
Thank you for the explanation. I still have a hard time understanding how transformers work so amazingly well, and the tech is already quite a few steps beyond that idea.
Andrej Karpathy's "zero to hero" series [1] was how I learned the fundamentals of this stuff. It's especially useful because he explains the why and provides intuitive explanations instead of just talking about the what and how. Would recommend it if you haven't checked it out already.
They probably trained all 8 experts on the same data. The experts may have become good at different topics, but no human divided up the topics.
The output isn't just the best of the 8 experts; it's a blend of the experts' opinions. Another (usually smaller) neural net decides how to blend the outputs of the networks, probably on a per-token basis: for each individual word (i.e. token), the outputs of all the experts are consulted and blended together, and a word is picked (sampled) before moving on to the next word.
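Here's a hedged sketch of that per-token blending, with made-up stand-in weights just for illustration (real systems such as Mixtral typically evaluate only the top-2 experts per token and renormalize their weights, rather than running all of them):

```python
import torch

torch.manual_seed(0)
d_model, num_experts = 16, 8

# stand-in parameters: one projection per expert, plus a gating projection
expert_W = [torch.randn(d_model, d_model) * 0.1 for _ in range(num_experts)]
gate_W = torch.randn(d_model, num_experts) * 0.1

def blend_experts(x):
    """Blend every expert's output for one token's hidden state x."""
    weights = torch.softmax(x @ gate_W, dim=-1)           # how much to trust each expert here
    outputs = torch.stack([x @ W for W in expert_W])      # (num_experts, d_model): every opinion
    return weights @ outputs                              # weighted blend feeds the next layer

token = torch.randn(d_model)                              # hidden state for a single token
print(blend_experts(token).shape)                         # torch.Size([16])
```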
I guess that neural network has to be able to identify the subject and know at every moment which network is most capable for that subject; otherwise I can't understand how it could possibly evaluate which answer is best.
Results of this sort of system frequently look almost random to human eyes. For example one expert might be the "capital letter expert", doing a really good job of putting capital letters in the right place in the output.
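If you wanted to look for that kind of pattern yourself, one hypothetical approach is to record which expert the router picks for each token and group the token strings by expert; `tokens` and `router_logits` here are assumed inputs for illustration, not a real library's API.

```python
from collections import defaultdict
import torch

def tokens_per_expert(tokens, router_logits):
    """tokens: list of token strings; router_logits: tensor of shape (len(tokens), num_experts)."""
    assignments = router_logits.argmax(dim=-1)             # top-1 expert chosen for each token
    groups = defaultdict(list)
    for tok, e in zip(tokens, assignments.tolist()):
        groups[e].append(tok)                               # bucket token strings by expert index
    return groups

# If one expert's bucket turned out to be full of sentence-initial capitalized tokens,
# you might (cautiously) call it the "capital letter expert" -- but such labels rarely stay clean.
```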
A democracy of descendant models trained separately by partitioning out the identified clusters of strong capabilities from an ancestor model; in effect they're modular, and the system can learn to combine them competitively.
Or are the models all trained on the same corpus, but just queried with different parameters?
Is this functionally the same as beam search?
Do they select the best output on a token-by-token basis, or do they let each model stream to completion and then pick the best final output?