In case people are wondering why Mamba is exciting:
There's this idea in AI right now that "scaling" models to be bigger and train on more data always makes them better. This has led to a science of "scaling laws" which study just how much bigger models need to be and how much data we need to train them on to make them a certain amount better. The relationship between model size, training data size, and performance turns out to be quite predictable.
Transformers are great because they can continue scaling and giving us better performance – unlike, we think, RNNs. Probably the most exciting thing about Mamba is the claim that it can be a bit smaller, and train on a bit less data, and still provide better performance than the equivalent Transformer, especially at longer sequence lengths.
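For concreteness, that "quite predictable" relationship is usually written as a parametric loss curve. Here's a minimal sketch of the Chinchilla-style form from Hoffmann et al. 2022; the constants below are roughly the published fits, but treat them as illustrative defaults rather than gospel:

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Parametric scaling law L(N, D) = E + A / N**alpha + B / D**beta.

    Constants are roughly the fits reported by Hoffmann et al. 2022;
    treat them as illustrative, not authoritative.
    """
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Doubling parameters and data gives a predictable (if diminishing) drop in loss.
print(chinchilla_loss(7e9, 1.4e12))    # ~7B params trained on ~1.4T tokens
print(chinchilla_loss(14e9, 2.8e12))
```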
I believe there is a lot of herding going on, driven by the influence of people who had compute to play around with, rather than by deeply insightful or principled exploration of architectures.
> Bringing these components together, we are able to build pure CNN architectures without any attention-like operations that are as robust as, or even more robust than, Transformers.
“RNN-mode inference” is also extremely exciting because you can precompute the hidden state for any prompt prefix (e.g. a long system prompt, or statically retrieved context), and subsequent generations pay the same cost regardless of the prefix length.
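A toy sketch of what that prefix caching buys you for a generic linear state-space recurrence; `ssm_step` and all the shapes here are made up for illustration, not taken from any Mamba implementation:

```python
import numpy as np

# Hypothetical single-layer linear state-space recurrence:
#   h_t = A_bar @ h_{t-1} + B_bar @ x_t,   y_t = C @ h_t
# Real Mamba layers are more involved; this just shows why prefix length
# doesn't affect per-token generation cost in "RNN mode".

def ssm_step(h, x, A_bar, B_bar, C):
    h = A_bar @ h + B_bar @ x
    return h, C @ h

d_state, d_in = 16, 4
rng = np.random.default_rng(0)
A_bar = 0.9 * np.eye(d_state)                 # toy stable transition
B_bar = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(1, d_state))

# 1) Precompute the hidden state for a long, fixed prefix (e.g. a system prompt).
prefix = rng.normal(size=(1000, d_in))
h = np.zeros(d_state)
for x in prefix:
    h, _ = ssm_step(h, x, A_bar, B_bar, C)    # done once, offline

# 2) Every continuation starts from the cached fixed-size state `h`,
#    so per-token cost is O(1) in the prefix length.
for x in rng.normal(size=(10, d_in)):
    h, y = ssm_step(h, x, A_bar, B_bar, C)
```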
But this also means that the amount of information that can be retained is constant irrespective of the prefix length. This might be a problem if the prefix is composed of essentially incompressible data.
Indeed: https://arxiv.org/pdf/2402.01032.pdf
Perhaps future iterations of SSMs will accommodate dynamically sized (but still non-linearly-growing) hidden states / memories!
Some prior comments said those architectures lack something like the memory of a transformer, and that this weakness is what keeps people using transformers. If true, I'd also like to see tests across various domains with equivalent transformer and Mamba designs to see whether that difference affects anything. From there, we'd have a better idea of whether a Mamba-176B is worth the money.
1. The Mamba co-author was also the FlashAttention lead author.
2. The secret ingredient that makes SSMs viable for deep learning is HiPPO theory. If you start with random initialization you're not going to get results. What you need is "optimal online function approximation" using Legendre polynomials, a Fourier basis, etc., in matrix form. The Mamba story starts with Legendre Memory Units.
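For the curious, the HiPPO-LegS matrix has a simple closed form. A sketch of the standard construction (mirroring the S4 reference code), so you can see it's anything but a random init:

```python
import numpy as np

def make_hippo_legs(N: int) -> np.ndarray:
    """HiPPO-LegS state matrix (negated), the standard non-random init for A
    in S4-style SSMs: it corresponds to online approximation of the input
    history with Legendre polynomials rather than an arbitrary dynamical system."""
    p = np.sqrt(1 + 2 * np.arange(N))        # sqrt(2n + 1)
    A = p[:, None] * p[None, :]              # sqrt(2n+1) * sqrt(2k+1)
    A = np.tril(A) - np.diag(np.arange(N))   # keep lower triangle; diagonal becomes n + 1
    return -A

print(make_hippo_legs(4))
```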
Invariably someone comments, "How do we know that it scales?" We don't. But the lead author has backing and a new startup at cartesia.ai. Could be the next Mistral.
The architecture is completely public. I would be surprised if certain other players (including but not limited to Mistral AI) are not training models yet. We'll hear soon enough if this is viable. Maybe not for official release candidates, but at least for internal testing.
Fantastic blog post, thank you for this. I am not even familiar with transformers, yet the explanation is crystal clear to me, and the included references and context are a treasure trove. The explanation of FlashAttention is the best I have seen, and that is not even the focus of the article.
One question I have on selectivity: footnote 4 says "the continuous A is constant, while our discretization parameter ∆ is input-dependent." What is the effect of varying the discretization instead of the (main, as I understand it) state A? My gut says it simplifies training and provides stability, but I feel A carries most of the behavior of the model, so it should have more wiggle room throughout training.
Thank you for the kind words! I think it’s mostly to reduce complexity during training. Here’s an excerpt from page 9 of the Mamba paper:
“We remark that while the A parameter could also be selective, it ultimately affects the model only through its interaction with ∆ via Ā = exp(∆A) (the discretization (4)). Thus selectivity in ∆ is enough to ensure selectivity in (A, B), and is the main source of improvement. We hypothesize that making A selective in addition to (or instead of) ∆ would have similar performance, and leave it out for simplicity.”
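To make the quoted point concrete, here's a minimal sketch of the zero-order-hold discretization for a diagonal A with a per-token ∆; the names and shapes are illustrative, not the paper's actual kernel:

```python
import numpy as np

def discretize(A_diag, B, delta):
    """Zero-order-hold discretization for a diagonal continuous-time A:
    A_bar = exp(delta * A), B_bar = (A_bar - I) A^{-1} B (elementwise here).
    A is fixed; only the per-token step size `delta` is input-dependent."""
    A_bar = np.exp(delta * A_diag)
    B_bar = (A_bar - 1.0) / A_diag * B
    return A_bar, B_bar

A_diag = -np.arange(1.0, 5.0)   # fixed, stable (negative) diagonal A
B = np.ones(4)

# Small delta: A_bar ~ 1, so the state is kept and the token barely written.
print(discretize(A_diag, B, delta=0.01))
# Large delta: A_bar ~ 0, so the state is mostly reset and the token written in.
print(discretize(A_diag, B, delta=5.0))
```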
When I read the paper, I thought the idea was that changing ∆ lets the model learn different things over different time scales; as you quoted, it's "the main source of improvement".
I don't have an LLM background, just controls, so I might be wrong.
The issue with attention is essentially that it relates every token of the input sequence to every other token. The need to do that somehow makes intuitive sense, no matter how much one understands about the internals of a transformer. The naive way to do it boils down to matrix multiplications, and a lot more people understand the performance issues those imply.
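For concreteness, a sketch of that naive all-pairs computation (single head, plain numpy); the (seq_len × seq_len) score matrix is exactly where the quadratic cost lives:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Single-head attention: every token attends to every other token.

    Q, K, V: (seq_len, d). The (seq_len, seq_len) score matrix is what makes
    both compute and memory scale quadratically in sequence length.
    """
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # (seq_len, d)

rng = np.random.default_rng(0)
seq_len, d = 1024, 64
Q, K, V = (rng.normal(size=(seq_len, d)) for _ in range(3))
out = naive_attention(Q, K, V)   # materializes a 1024 x 1024 score matrix
```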
If you can understand attention and transformers, how can you not understand that population numbers can rise, reach a peak, fall, and then level out (all without any genocidal actions)?
How can you claim that it is "absurdism" to imagine something that can be seen in data across the plant and animal kingdom?
If I'm not mistaken the largest mamba model right now is 2.8B and undertrained with low quality data (the Pile only). The main problem is that it's new and unproven.
Should become very interesting once someone with both data and significant financial backing takes the plunge and trains something of notable size. Perhaps Llama-3 might already end up being that attempt, as we seem to be heavily into diminishing returns for transformers.
There is one trained on 600B tokens from SlimPajama [1], but that's fairly tiny compared to other recent releases (ex. stablelm-3b [2] trained on 4T tokens).
> low quality data (the Pile only)
The Pile is pretty good quality wise. It's mostly the size (300B tokens) that's limiting.
Eh quality is subjective. There are good parts, like Books3 and arxiv, but a large part of it is common crawl which has just about anything people put up on the internet, random IRC chat logs, HN and Reddit shitposts, Youtube subtitles which are in broken English half the time, and of course the Enron corporate email dump to make every model sound like an HR middle manager.
This was really helpful, but only discusses linear operations, which obviously can’t be the whole story. From the paper it seems like the discretization is the only nonlinear step—in particular the selection mechanism is just a linear transformation. Is that right? How important is the particular form of the nonlinearity?
EDIT: from looking at the paper, it seems like even though the core state space model/selection mechanism is linear (except for discretization?), they incorporate a nonlinearity in the full “mamba block”, which is stacked up with residual connections and layer norm just like in a transformer. They describe this as combining a linear attention and an MLP into a single step, rather than alternating attention and MLP as in a transformer.
Yes you're spot on, the nonlinearities come from the full Mamba blocks, which I left out of this post for simplicity/to focus on the bigger ideas the paper introduced. You can see it marked by the "X" on the right-most part of Figure 3 in the Mamba paper: https://arxiv.org/abs/2312.00752
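If it helps anyone, here's a structural sketch of where those nonlinearities sit in one block. The toy conv and scan are stand-ins for the real fused kernels, so read it as a diagram in code rather than a faithful implementation:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def toy_causal_conv1d(x, w):
    """Causal 1-D conv over the sequence (one shared kernel across channels,
    for brevity; the real block uses a depthwise conv)."""
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j, wj in enumerate(w):
            if t - j >= 0:
                out[t] += wj * x[t - j]
    return out

def toy_selective_scan(x, A_diag, delta):
    """Toy diagonal selective scan: h_t = exp(delta_t * A) * h_{t-1} + delta_t * x_t.
    Linear in the state h; all the selectivity enters through delta."""
    h = np.zeros(x.shape[1])
    y = np.zeros_like(x)
    for t in range(x.shape[0]):
        h = np.exp(delta[t] * A_diag) * h + delta[t] * x[t]
        y[t] = h
    return y

def mamba_block(u, W_x, W_z, conv_w, W_delta, A_diag, W_out):
    """Structural sketch of one Mamba block (cf. Figure 3 of the paper).
    The scan itself is linear; the nonlinearities are the SiLU activations
    and the multiplicative gate (the "X" in the figure)."""
    x = u @ W_x                                   # linear projection, main branch
    z = u @ W_z                                   # linear projection, gate branch
    x = silu(toy_causal_conv1d(x, conv_w))        # local mixing + nonlinearity
    delta = np.log1p(np.exp(x @ W_delta))         # input-dependent step size (softplus)
    y = toy_selective_scan(x, A_diag, delta)      # linear recurrence
    y = y * silu(z)                               # gating nonlinearity
    return y @ W_out                              # output projection (residual + norm live outside)

rng = np.random.default_rng(0)
L, d_model, d_inner = 32, 16, 32
out = mamba_block(
    rng.normal(size=(L, d_model)),
    W_x=rng.normal(size=(d_model, d_inner)) * 0.1,
    W_z=rng.normal(size=(d_model, d_inner)) * 0.1,
    conv_w=rng.normal(size=4) * 0.1,
    W_delta=rng.normal(size=(d_inner, d_inner)) * 0.1,
    A_diag=-np.arange(1, d_inner + 1, dtype=float),
    W_out=rng.normal(size=(d_inner, d_model)) * 0.1,
)   # shape (L, d_model)
```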
From what I can tell all the large players in the space are continuing developing on transformers right? Is it just that Mamba is too new, or is the architecture fundamentally not usable for some reason?
Too new is definitely one thing. Someone is going to have to gamble on actually paying for a serious pretraining run with this architecture before we know how it really stacks up against transformers.
There are some papers suggesting that transformers are better than SSMs in fundamental ways (e.g. SSMs cannot do arbitrary key-based recall from their context: https://arxiv.org/abs/2402.01032). This means switching over is not a no-brainer.
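The probes those papers use are easy to sketch: scatter key-value pairs through the context and only reveal the queried key at the end. A fixed-size recurrent state has to decide what to keep before seeing the query, while attention can look the key up directly. Something like this toy generator:

```python
import random

def make_recall_example(n_pairs: int = 32, seed: int = 0):
    """Toy associative-recall probe: a context of key->value pairs, then a query.

    A model reading this left-to-right with a fixed-size state must compress
    all pairs before knowing which key will be asked about; attention can
    instead look the key up directly at query time.
    """
    rng = random.Random(seed)
    keys = [f"k{i}" for i in range(n_pairs)]
    values = [f"v{rng.randrange(1000)}" for _ in range(n_pairs)]
    query = rng.choice(keys)
    context = " ".join(f"{k}:{v}" for k, v in zip(keys, values))
    answer = values[keys.index(query)]
    return f"{context} | what is {query}?", answer

prompt, answer = make_recall_example()
```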
Another element is that Mamba required a very custom implementation, down to custom fused kernels, which I expect would need to be reimplemented in DeepSpeed or an equivalent library for a larger training run spanning thousands of GPUs.
It's a reasonably easy bet that Together is doing or will do a serious pretraining run with Mamba, where if that's a success other players might start considering it more.
Exactly this. Except, there is zero chance they just looked at mamba and went "meh, too new for us". People are definitely trying stuff. It takes a lot of fiddling around with a brand new model architecture to get something working well. OpenAI aren't going to give a running commentary on the state of all the things they are looking into.
Not sure how much detail you need but generally there exist implicit and explicit integrators for numerically solving (integrating) ODE. The implicit ones, like the one used here, tend to be more stable. The ideas behind SSM come from control theory ideas that used integrators with stability guarantees so that the rest of the neural network can focus on other aspects of the problem.
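A scalar toy comparison, in case it helps build intuition: explicit Euler versus the bilinear (Tustin) rule S4 used versus the zero-order hold Mamba uses. This is just a sketch, not anyone's actual code:

```python
import numpy as np

a, dt = -50.0, 0.1          # stiff continuous-time mode dh/dt = a*h, large step size

a_euler    = 1 + dt * a                               # explicit Euler: |1 + dt*a| = 4 > 1, diverges
a_bilinear = (1 + dt * a / 2) / (1 - dt * a / 2)      # bilinear/Tustin: magnitude < 1 for any a < 0
a_zoh      = np.exp(dt * a)                           # zero-order hold: always stable for a < 0

h = {"euler": 1.0, "bilinear": 1.0, "zoh": 1.0}
for _ in range(20):
    h["euler"]    *= a_euler
    h["bilinear"] *= a_bilinear
    h["zoh"]      *= a_zoh
print(h)   # Euler blows up; the other two decay toward zero
```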
Very annoying namespace conflict, since a package called "mamba" (a faster reimplementation of the Python conda package manager) already existed for a while before this architecture was even dreamed up.
Beyond that, I'll care about an alternative to transformers when it shows superior performance in an open-source 7B-34B model compared to transformer competitors. So far that hasn't happened.
> Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
That's the issue - I keep hearing that it is beyond a small research group's budget to meaningfully train such a large model. You don't just need GPU time, you also need data. And just using the dregs of the internet doesn't cut it.
I use the former and have been experimenting with the latter. Fortunately, the contexts are separate enough that they never come up in the same sentence.
> Importantly, these recurrent and convolutional forms, which I like to call “RNN mode” and “CNN mode,” are mathematically equivalent. This allows S4 to shape-shift depending on what you need it to do, with no difference in its outputs.
Is this really true? Because it seems to ignore hardware and data type precision entirely. I mean, computing the same mathematical quantity in different ways with floating point often leads to different results.
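For what it's worth, "mathematically equivalent" holds in exact arithmetic, and in floating point the two orderings differ only at rounding-error level. A quick toy check with an already-discretized diagonal SSM (nothing to do with the real Mamba kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 8, 64                              # state size, sequence length
A = np.diag(rng.uniform(0.5, 0.99, N))    # toy stable (already-discretized) A
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
u = rng.normal(size=L)

# "RNN mode": step the recurrence token by token.
h = np.zeros((N, 1))
y_rnn = []
for t in range(L):
    h = A @ h + B * u[t]
    y_rnn.append((C @ h).item())
y_rnn = np.array(y_rnn)

# "CNN mode": build the kernel K = (CB, CAB, CA^2B, ...) and convolve.
K = np.array([(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(L)])
y_cnn = np.array([np.dot(K[:t + 1][::-1], u[:t + 1]) for t in range(L)])

print(np.max(np.abs(y_rnn - y_cnn)))   # prints a tiny rounding-level difference, not exact zero
```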
I'm fairly confident I could actually understand the terminology used in discussing machine learning models if it were presented in a way that laid out the first principles a bit better, instead of diving directly into high-level abstract equations and symbols.
I'd like a way to learn this stuff as a computer engineer, in the same spirit as "big scary math symbols are just for-loops"
Ironically, you can probably just ask a Transformer model to explain it to you.
I'm the same as you: I have no problem grasping complex concepts, I just always struggled with the mathematical notation. I did pass linear algebra in university, but was glad I could go back to programming after that. Even then, I mostly passed linear algebra because I wrote functions that solve linear algebra equations until I fully grasped the concept.
I've found that GPT-4 is very good at taking a math-notation-rich document and just describing it in terms a math-notation-averse engineer would understand.
I was a data engineer for about 6-7 years at various companies, always working together with data scientists who insist that `x_` or `_phi` are proper variable names. Man am I glad to be working with engineers now.
Also, just try really hard. Repeat. It's a new language for explaining concepts you likely already know. You don't remember Spanish by looking at the translations once.
That's a heuristic that's usually true. You can definitely understand convolution or attention better with a "big scary math symbols are just for-loops" explanation, but there are also things like dopri5 or elliptic curve crypto where we just have to accept that Weird Math Shit is happening and the symbols are inevitable. It looks to me like Mamba has dragged a part of LLM research into the latter camp.
MoE (Mixture of Experts) is an effective way to scale transformers. Gemini 1.5 is already doing up to 1 million tokens of context. I have not seen any large-scale Mamba model, so I'm not aware of its shortcomings, but I am sure there are tradeoffs.
It should be possible to combine Mamba with MoE. I wonder what that would look like... a billion-token context?
MoE lets you scale model size up without a proportional increase in compute, which hopefully leads to more intelligent models.
It is, however, independent of context size: the ability to process a lot of tokens / text.
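For anyone unfamiliar, the MoE trick is to route each token through only a few of many expert MLPs, so parameter count grows while per-token compute stays roughly fixed. A minimal top-k routing sketch; nothing here is Mamba-specific:

```python
import numpy as np

def moe_layer(x, router_W, expert_W1, expert_W2, top_k=2):
    """Minimal top-k mixture-of-experts MLP for a single token vector x.

    Only `top_k` of the experts run per token, so total parameters can grow
    with the number of experts while per-token compute stays roughly fixed.
    """
    logits = x @ router_W                         # (n_experts,)
    top = np.argsort(logits)[-top_k:]             # indices of the chosen experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                          # softmax over the chosen experts
    out = np.zeros_like(x)
    for g, e in zip(gates, top):
        hidden = np.maximum(x @ expert_W1[e], 0)  # expert MLP (ReLU)
        out += g * (hidden @ expert_W2[e])
    return out

d, d_ff, n_experts = 16, 64, 8
rng = np.random.default_rng(0)
y = moe_layer(
    rng.normal(size=d),
    router_W=rng.normal(size=(d, n_experts)) * 0.1,
    expert_W1=rng.normal(size=(n_experts, d, d_ff)) * 0.1,
    expert_W2=rng.normal(size=(n_experts, d_ff, d)) * 0.1,
)
```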
For more info, see the scaling laws plot in Figure 4 of the Mamba paper: https://arxiv.org/abs/2312.00752