
That doesn't clarify it for me. Within each layer, the same parameters are applied to every token. Yes, attention performs a differentiable lookup, as MoE does, but routing is about more than a differentiable lookup: it selects which parameters get applied, not which state gets read.
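To make the distinction concrete, here is a rough sketch in plain Python, not anyone's actual implementation. The names (`attention`, `moe_top1`, `router`, `experts`) and the toy 2x2 matrices are made up for illustration, and the MoE part is simplified to top-1 selection with no gate weighting. In `attention`, every token passes through the same `Wq`, `Wk`, `Wv` and the soft lookup is over token states; in `moe_top1`, the router decides which expert's weights are used at all.

```python
import math

def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(tokens, Wq, Wk, Wv):
    # The same Wq, Wk, Wv are applied to every token.
    # The soft lookup mixes *states* (the value vectors of the tokens).
    keys = [matvec(Wk, t) for t in tokens]
    values = [matvec(Wv, t) for t in tokens]
    out = []
    for t in tokens:
        q = matvec(Wq, t)
        weights = softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
        dim = len(values[0])
        out.append([sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)])
    return out

def moe_top1(tokens, router, experts):
    # The router picks which expert's *parameters* are applied to each token.
    out, chosen = [], []
    for t in tokens:
        logits = matvec(router, t)
        idx = max(range(len(logits)), key=logits.__getitem__)
        chosen.append(idx)
        out.append(matvec(experts[idx], t))
    return out, chosen

# Toy inputs (hypothetical values, chosen only to keep the arithmetic obvious).
tokens = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    [[2.0, 0.0], [0.0, 2.0]],    # expert 0 doubles its input
    [[-1.0, 0.0], [0.0, -1.0]],  # expert 1 negates its input
]
router = identity  # toy router: expert index = argmax of the token vector

attn_out = attention(tokens, identity, identity, identity)
moe_out, chosen = moe_top1(tokens, router, experts)
```

In the attention call, both tokens see identical weights and differ only in how they mix each other's values. In the MoE call, `chosen` records that the two tokens were sent through different weight matrices, which is the sense of routing I mean.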


