
I wonder how far you can push the "deeper and thinner" part. At some point your entire FFN fits into your L2 cache, and you're bound to get some performance jumps.
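To make that concrete, here is a back-of-the-envelope sketch in Python. The 4x FFN expansion, fp16 weights, and the 64 MB L2 figure are assumptions for illustration only; real cache sizes and layer shapes vary a lot by chip and model.

```python
# Rough estimate: at what hidden size do a single FFN layer's weights fit in L2?
# Assumes fp16 weights, a 4*d_model expansion, and a hypothetical 64 MB L2 cache.

def ffn_bytes(d_model: int, expansion: int = 4, bytes_per_param: int = 2) -> int:
    """Parameter memory of one transformer FFN block (up- and down-projection)."""
    d_ff = expansion * d_model
    return 2 * d_model * d_ff * bytes_per_param

L2_CACHE_BYTES = 64 * 1024 * 1024  # assumed 64 MB of L2; varies widely by hardware

for d_model in (512, 1024, 2048, 4096):
    size = ffn_bytes(d_model)
    fits = "fits" if size <= L2_CACHE_BYTES else "does not fit"
    print(f"d_model={d_model}: FFN ~{size / 2**20:.1f} MiB -> {fits} in assumed L2")
```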



Other research from Meta FAIR actually suggests pruning the deeper layers if you want to improve performance while maintaining accuracy [1]. So there must be a cutoff point for smaller networks below which this approach still works; otherwise the results are contradictory. Or we could drastically improve these new models even further.

[1] https://arxiv.org/html/2403.17887v1
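Purely to illustrate the mechanical side of that depth pruning, here is a minimal PyTorch sketch. The paper in [1] chooses which contiguous span of layers to remove via a layer-similarity metric and then "heals" the pruned model with light fine-tuning; none of that is reproduced here.

```python
import torch
from torch import nn

def drop_last_layers(layers: nn.ModuleList, n_drop: int) -> nn.ModuleList:
    """Return a new ModuleList with the last n_drop blocks removed."""
    assert 0 <= n_drop < len(layers)
    return nn.ModuleList(list(layers)[: len(layers) - n_drop])

# Toy example: a 12-block "model" pruned down to 8 blocks.
# The Linear layers are just stand-ins for transformer blocks.
blocks = nn.ModuleList(nn.Linear(64, 64) for _ in range(12))
pruned = drop_last_layers(blocks, n_drop=4)
print(len(blocks), "->", len(pruned))  # 12 -> 8
```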


That reminds me of the findings of Google's paper on EfficientT5 (https://arxiv.org/abs/2109.10686). They refer to this strategy as "DeepNarrow".





