
Wouldn't attention networks use less computation for the same number of weights? Feed-forward networks have higher connectivity, no?
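
To make the question concrete, here's a rough back-of-the-envelope sketch (not from the thread) comparing parameter counts and per-token FLOPs for a single-head self-attention block versus a standard Transformer feed-forward block. The model width d, sequence length n, and 4x FFN expansion are assumed for illustration; the formulas are the usual textbook counts.

```python
# Rough parameter and per-token FLOP counts: single-head self-attention
# vs. a standard Transformer feed-forward block. d, n, and the 4x FFN
# expansion factor are illustrative assumptions, not numbers from this thread.

def attention_costs(d: int, n: int) -> tuple[int, int]:
    params = 4 * d * d                   # W_Q, W_K, W_V, W_O projections
    flops_per_token = 2 * 4 * d * d      # the four projections (multiply-add = 2 FLOPs)
    flops_per_token += 2 * 2 * n * d     # QK^T scores plus attention-weighted sum of V
    return params, flops_per_token

def ffn_costs(d: int, expansion: int = 4) -> tuple[int, int]:
    hidden = expansion * d
    params = 2 * d * hidden              # up-projection and down-projection
    flops_per_token = 2 * params         # one multiply-add per weight
    return params, flops_per_token

if __name__ == "__main__":
    d, n = 512, 1024
    print("attention:", attention_costs(d, n))  # fewer weights, plus an O(n*d) term
    print("ffn:      ", ffn_costs(d))           # compute scales directly with weight count
```

This doesn't settle the question either way, since the answer depends on how n compares to d, but it shows the tradeoff: a feed-forward layer's compute is proportional to its weight count, while attention adds a score/mixing term that grows with sequence length on top of its (smaller) projection weights.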

If a network is the same size, less computationally demanding, and gives you a 3% improvement, that seems extremely worthwhile, especially given the (mostly) diminishing returns of just adding more weights/layers.




It is less general, though, so this is interesting because it shows you can learn a lot without adding much structure.




