Care to share some of the topics things are moving towards?
I understand diffusion, GANs and Mamba are in vogue these days, but those are different logical architectures. I am unsure where the next level of ML physical architecture research is heading.
I think at this rate, everything is moving towards Transformer-based models (text/audio/image/video). As Sora has shown, there isn't really anything the Transformer can't do: it can generate both photorealistic images and video. Its ability to fit ANY given distribution is beyond compare; it is the most powerful neural network we have ever designed, and nothing else is even close.
GANs, on the contrary, are not hot any more in industry. Diffusion models have achieved high fidelity in image generation, and it is hard to see how GANs can make a comeback. They are faster, but in terms of quality, image generation is done; the wow factor is no more.
This might be a hot take, but I think architectural changes are going to die down in industry; the Transformer is the new MOS transistor. With billions of dollars being pumped into making it run faster AND cheaper, alternative architectures are going to have a hard time competing.
There is no question in my mind that the transformer architecture will not stop evolving. Already now, we are stretching the definition by calling current transformers that; the 2017 transformer had an encoder block which is nearly always absent nowadays, and the positional encoding and multi-head attention have been substantially modified.
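For a concrete example of that drift in the positional encoding: the 2017 paper added sinusoidal position vectors to the input embeddings once, while many current decoder-only models instead rotate the query/key vectors inside every attention layer (rotary embeddings). A rough sketch of both, simplified and not any model's reference implementation:

    # 2017-style: sinusoidal vectors added to the token embeddings at the input.
    # Rotary-style: the query/key vectors are rotated inside each attention layer,
    # so relative position falls out of the q.k dot product.
    import torch

    def sinusoidal_2017(seq_len, d):
        pos = torch.arange(seq_len).unsqueeze(1).float()       # (seq, 1)
        i = torch.arange(0, d, 2).float()                       # (d/2,)
        angles = pos / 10000 ** (i / d)                         # (seq, d/2)
        pe = torch.zeros(seq_len, d)
        pe[:, 0::2] = torch.sin(angles)
        pe[:, 1::2] = torch.cos(angles)
        return pe                                               # added once, at the input

    def rotary(q, base=10000):                                  # q: (seq, d)
        seq_len, d = q.shape
        pos = torch.arange(seq_len).unsqueeze(1).float()
        i = torch.arange(0, d, 2).float()
        theta = pos / base ** (i / d)                           # (seq, d/2)
        q1, q2 = q[:, 0::2], q[:, 1::2]
        out = torch.empty_like(q)
        out[:, 0::2] = q1 * torch.cos(theta) - q2 * torch.sin(theta)
        out[:, 1::2] = q1 * torch.sin(theta) + q2 * torch.cos(theta)
        return out                                              # applied in every layer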
VRAM costs and latency constraints will drive architectural changes, which Mamba hints at: we will have less quadratic scaling in the architectures that transformers evolve into, and the attention mechanism will likely look more and more akin to database retrieval (with far more evolved querying mechanisms than are typically seen in relational databases). One day, the notion of a maximum context size will be archaeological. Breaking the sequentiality of predicting only the next token would also improve throughput, which could require changes. I expect mixture-of-experts to evolve into new forms of sparsity as well. More easily quantizable architectures may also emerge.
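To make the scaling point concrete, a toy sketch (simplified, closer in spirit to linear attention than to Mamba's actual selective state-space algorithm; shapes and the decay constant are made up for illustration): full attention materializes a T x T score matrix, while a recurrent update carries a fixed-size state forward.

    import torch

    T, d = 1024, 64
    q, k, v = (torch.randn(T, d) for _ in range(3))

    # Quadratic path: every token attends to every other token.
    scores = (q @ k.T) / d ** 0.5                   # (T, T): the quadratic part
    attn_out = torch.softmax(scores, dim=-1) @ v    # (T, d)

    # Linear-time path: fold keys/values into a fixed-size state as we go.
    state = torch.zeros(d, d)
    decay = 0.9                                     # made-up scalar decay, for illustration
    outs = []
    for t in range(T):
        state = decay * state + k[t].unsqueeze(1) @ v[t].unsqueeze(0)   # (d, d)
        outs.append(q[t] @ state)                   # read against the state, O(d^2) per step
    linear_out = torch.stack(outs)                  # (T, d), never materializes (T, T)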
The original transformer is an encoder-decoder model, and the decoder part is what leads to the first GPT model. Except that in the original proposal you need to feed the encoder states into the decoder's attention module, it is basically the same as a decoder-only model. I would argue the decoder-only model is even simpler in that regard.
When it comes to the core attention mechanism, it is surprisingly stable compared to other techniques in neural networks. There is the QKV projection, then dot-product attention, then two layers of FFN. Arguably the most influential change regarding attention itself is multi-query/grouped-query attention, but even that is, IMO, a reasonably small change.
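A minimal sketch of that stable core, if it helps (hidden sizes, head count, and the pre-norm layout are illustrative choices, not taken from any specific model):

    # QKV projection, scaled dot-product attention with a causal mask
    # (as in decoder-only models), then a two-layer FFN, each with a residual.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyTransformerBlock(nn.Module):
        def __init__(self, d_model=256, n_heads=4):
            super().__init__()
            self.n_heads, self.d_head = n_heads, d_model // n_heads
            self.qkv = nn.Linear(d_model, 3 * d_model)     # the QKV projection
            self.proj = nn.Linear(d_model, d_model)
            self.ffn = nn.Sequential(                      # the two-layer FFN
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

        def forward(self, x):                              # x: (batch, seq, d_model)
            b, t, d = x.shape
            q, k, v = self.qkv(self.ln1(x)).chunk(3, dim=-1)
            # split into heads: (batch, n_heads, seq, d_head)
            q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                       for z in (q, k, v))
            scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
            causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
            scores = scores.masked_fill(causal, float("-inf"))
            att = F.softmax(scores, dim=-1) @ v            # dot-product attention
            att = att.transpose(1, 2).reshape(b, t, d)
            x = x + self.proj(att)                         # residual
            x = x + self.ffn(self.ln2(x))                  # residual
            return x

Multi-query/grouped-query attention only changes how many key/value heads are shared across query heads; the surrounding structure stays the same.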
If you look back at convolutional NNs, their shapes and operators changed every six months back in the day.
At the same time, the original transformer is still a useful architecture today, even in production; some BERT models must still be hanging around.
Not that I am saying it didn't change at all, but the core has stayed very stable across countless revisions. If you have read the original transformer paper, you already understand 80% of what a LLaMA model does; the same can't be said for other models, is what I meant.
The Pathways model/architecture isn't exactly where things are moving towards; LLMs are.
https://blog.google/technology/ai/introducing-pathways-next-...