
There are rarely any design "decisions". Typically, you throw many things at the wall, and whatever sticks becomes the paper. The Transformer paper has approximately zero "design decisions" apart from the attention block. I can imagine they just tried out various combinations, kept adding projections, and went with what worked best.
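For reference, the "projections" being talked about are the learned linear maps wrapped around the attention block itself. A minimal sketch of a single attention head in PyTorch (class and variable names are mine, not from the paper):

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class SingleHeadAttention(nn.Module):
      """One attention head: learned Q/K/V and output projections
      surrounding scaled dot-product attention."""
      def __init__(self, d_model: int, d_head: int):
          super().__init__()
          self.q_proj = nn.Linear(d_model, d_head)
          self.k_proj = nn.Linear(d_model, d_head)
          self.v_proj = nn.Linear(d_model, d_head)
          self.out_proj = nn.Linear(d_head, d_model)
          self.scale = d_head ** -0.5

      def forward(self, x):  # x: (batch, seq, d_model)
          q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
          scores = q @ k.transpose(-2, -1) * self.scale  # (batch, seq, seq)
          weights = F.softmax(scores, dim=-1)            # attention weights
          return self.out_proj(weights @ v)              # back to d_model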



Attention itself was the key idea of that paper and, as you sort of acknowledge, was definitely not just throwing things at the wall. It was the culmination of a long line of work gradually progressing toward fully dynamic routing via attention, and it was motivated, if not by deep theory, then at least by deep intuition from linguistics. The other details of transformers are perhaps somewhat arbitrary, but they made sense to everyone at the time. There was no claim that those other details were optimal - just that they were one way of surrounding the attention mechanism with computing machinery that worked.


I hear people say that a lot, but is that really how people at the leading edge of research do this? Those I know who are coming up with new stuff, and not just new applications of old architectures, are either building loosely on animal models or designing around a traditional algorithm, leaving some room for training to exploit complex interactions the traditional algorithm can't.


Most advances in NLP with transformers over the last 2 years have been random trial and error and just throwing more compute at transformers.

Some models like RWKV (which isn’t a transformer model) explain the design decisions in their paper, but generally that’s not the case.

Nobody knows why these things work, and nobody seems to be looking into it.

We’re all just trying to figure out how to improve the outcome atm.


There have been huge advances in the mathematics of neural networks from Greg Yang (formerly of Microsoft). This allowed predictable transfer of training hyperparameters from smaller versions of GPT-4, where they could actually be tuned, to the final large model.

https://www.microsoft.com/en-us/research/uploads/prod/2021/1...

He has proofs and theorems about the frontier of maximal feature learning before things devolve into the equivalent of kernel methods, and more: a whole body of breakthrough math making deep links with random matrix theory.
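The practical upshot of that line of work (µTransfer) is that, if layer widths are scaled with the right per-layer multipliers, hyperparameters tuned on a narrow model carry over to a wide one. A rough illustrative sketch of the width-scaling idea for per-layer Adam learning rates, as I understand it (the constants and layer categories here are simplified, not the paper's exact parametrization):

  # Tune the learning rate at a small "base" width, then rescale
  # per-layer learning rates when the model is widened. The precise
  # rules come from Yang et al.'s Tensor Programs papers; this is
  # only a caricature of the 1/width scaling for hidden matrices.

  def mup_adam_lr(base_lr: float, base_width: int, width: int, layer_kind: str) -> float:
      """Per-layer Adam learning rate after widening from base_width to width.

      layer_kind: 'hidden'    -- weight matrices between hidden layers
                  'embedding' -- embedding-like parameters and biases (kept at base_lr)
      """
      if layer_kind == "hidden":
          # Hidden-layer LRs shrink roughly like 1/width, so a value tuned
          # at base_width still behaves sensibly at the larger width.
          return base_lr * base_width / width
      return base_lr  # embedding-like parameters keep the tuned value

  # e.g. an LR tuned at width 256, transferred to a width-4096 model:
  print(mup_adam_lr(3e-4, base_width=256, width=4096, layer_kind="hidden"))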



