
1. Attention is quadratic in context length; RNNs with gating (LSTM, GRU, etc.) are linear, as are all these new architectures. Early RNNs used gating to fight vanishing/exploding gradients; these new ideas use theory from dynamical systems that guarantees stability, so the gating can focus on memory rather than solving two problems at once.
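To make the cost difference concrete, here is a minimal sketch (plain NumPy, my own illustration rather than code from any of these papers): full attention materializes an L×L score matrix, while a gated linear recurrence walks the sequence once with a constant-size state.

    import numpy as np

    def attention(q, k, v):
        # q, k, v: (L, d). The (L, L) score matrix is the quadratic part.
        scores = q @ k.T / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)
        return weights @ v

    def gated_recurrence(x, a, b):
        # x, a, b: (L, d); h_t = a_t * h_{t-1} + b_t * x_t
        # One pass over the sequence: O(L) time, O(d) state.
        h = np.zeros(x.shape[-1])
        outputs = []
        for x_t, a_t, b_t in zip(x, a, b):
            h = a_t * h + b_t * x_t
            outputs.append(h)
        return np.stack(outputs)

    L, d = 8, 4
    rng = np.random.default_rng(0)
    x = rng.normal(size=(L, d))
    print(attention(x, x, x).shape)   # (8, 4), but computed via an 8x8 score matrix
    print(gated_recurrence(x, np.full((L, d), 0.9), np.full((L, d), 0.1)).shape)  # (8, 4), no LxL matrix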

2. The models released in the couple of weeks running up to NeurIPS 2023 (Mamba and Based) added two things: evaluation on multi-query associative recall (MQAR) and data-dependence in the gating/selection, inspired by multi-head attention. It turned out these were the main missing ingredients compared to earlier state-space-style architectures (Hyena and before), and they made the new models as good as attention on associative-recall tasks, and potentially even slightly better than attention on other, non-lookup tasks. Of course, the crucial detail in Mamba is the efficient, hardware-aware CUDA implementation of the scan; without it the architecture may not make much sense for tasks where transformers are already appropriate.
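As a rough sketch of the "data-dependent gating/selection" idea (hypothetical projection matrices, not the actual Mamba/Based code, which fuses this into a hardware-aware kernel): the decay and write gates are computed from the current token, so the state can choose what to keep or overwrite, which is what associative recall requires, while the update is still a single linear scan.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 4
    W_a = 0.1 * rng.normal(size=(d, d))   # hypothetical projection for the decay gate
    W_b = 0.1 * rng.normal(size=(d, d))   # hypothetical projection for the write gate

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def selective_recurrence(xs):
        # Gates depend on the token itself (the "selection"), yet the cost stays O(L).
        h = np.zeros(d)
        outputs = []
        for x_t in xs:
            a_t = sigmoid(x_t @ W_a)   # data-dependent decay: how much old state to keep
            b_t = sigmoid(x_t @ W_b)   # data-dependent write: how much of x_t to store
            h = a_t * h + b_t * x_t
            outputs.append(h)
        return np.stack(outputs)

    xs = rng.normal(size=(16, d))
    print(selective_recurrence(xs).shape)   # (16, 4)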

3. If one does not have to worry too much about context length, a lot of new domains open up: DNA-sequence analysis is a sequential task with very long-range dependencies; images, videos, or higher-dimensional data can be analyzed as streams of tokens (scan the pixels the way an old CRT monitor does). The early dreams of AI included a single, continuously evolving learning trajectory of an agent interacting with an environment, so maybe such dreams will be easier to realize with these effectively infinite-context-length models.
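A toy example of the CRT-style framing (again my own illustration): flattening a 2D image row by row gives a 1D token stream that a long-context sequence model could consume like text or DNA, and the lengths add up fast.

    import numpy as np

    image = np.arange(12).reshape(3, 4)   # a tiny 3x4 "image"
    tokens = image.reshape(-1)            # row-major scan: left-to-right, top-to-bottom
    print(tokens)                         # length H*W
    # For scale: a single 1920x1080 frame is already ~2.07 million pixel tokens,
    # which is why the quadratic cost of attention bites hard under this framing.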

Bonus: you didn't ask for it, but as of today the downstream applications of these models to important/practical tasks are largely untested/untuned compared to the rather mature applications of attention, so there may be a little delay before people figure out all the tricks for using large pre-trained models of these types. The analogy to old RNNs helps to a degree, but people have super-specialized in attention and transformers over the last five years, so there is a lot of momentum in favor of transformers.


Can you cite the "Based" paper referenced here?


This blog post about "Based" came out just before NeurIPS: https://hazyresearch.stanford.edu/blog/2023-12-11-zoology2-b...


And this is the zoology paper: https://arxiv.org/abs/2312.04927


