In case people are wondering why Mamba is exciting:
There's this idea in AI right now that "scaling" models to be bigger and train on more data always makes them better. This has led to a science of "scaling laws" which study just how much bigger models need to be and how much data we need to train them on to make them a certain amount better. The relationship between model size, training data size, and performance turns out to be quite predictable.
Transformers are great because they can continue scaling and giving us better performance – unlike, we think, RNNs. Probably the most exciting thing about Mamba is the claim that it can be a bit smaller, and train on a bit less data, and still provide better performance than the equivalent Transformer, especially at longer sequence lengths.
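For concreteness, that "quite predictable" relationship is usually written as a parametric loss curve. Here's a minimal sketch of the Chinchilla-style form from Hoffmann et al. 2022; the constants below are roughly the published fits, but treat them as illustrative defaults rather than gospel:

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Parametric scaling law L(N, D) = E + A / N**alpha + B / D**beta.

    Constants are roughly the fits reported by Hoffmann et al. 2022;
    treat them as illustrative, not authoritative.
    """
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Doubling parameters and data gives a predictable (if diminishing) drop in loss.
print(chinchilla_loss(7e9, 1.4e12))    # ~7B params trained on ~1.4T tokens
print(chinchilla_loss(14e9, 2.8e12))
```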
I believe there is a lot of herding going on, driven by the influence of people who had compute to play around with, rather than by deeply insightful or principled exploration of architectures.
> Bringing these components together, we are able to build pure CNN architectures without any attention-like operations that are as robust as, or even more robust than, Transformers.
“RNN-mode inference” is also extremely exciting because you can precompute the hidden state for any prompt prefix (e.g. a long system prompt, or statically retrieved context), and subsequent generations pay the same cost regardless of the prefix length.
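A toy sketch of what that prefix caching buys you for a generic linear state-space recurrence; `ssm_step` and all the shapes here are made up for illustration, not taken from any Mamba implementation:

```python
import numpy as np

# Hypothetical single-layer linear state-space recurrence:
#   h_t = A_bar @ h_{t-1} + B_bar @ x_t,   y_t = C @ h_t
# Real Mamba layers are more involved; this just shows why prefix length
# doesn't affect per-token generation cost in "RNN mode".

def ssm_step(h, x, A_bar, B_bar, C):
    h = A_bar @ h + B_bar @ x
    return h, C @ h

d_state, d_in = 16, 4
rng = np.random.default_rng(0)
A_bar = 0.9 * np.eye(d_state)                 # toy stable transition
B_bar = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(1, d_state))

# 1) Precompute the hidden state for a long, fixed prefix (e.g. a system prompt).
prefix = rng.normal(size=(1000, d_in))
h = np.zeros(d_state)
for x in prefix:
    h, _ = ssm_step(h, x, A_bar, B_bar, C)    # done once, offline

# 2) Every continuation starts from the cached fixed-size state `h`,
#    so per-token cost is O(1) in the prefix length.
for x in rng.normal(size=(10, d_in)):
    h, y = ssm_step(h, x, A_bar, B_bar, C)
```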
But this also means that the amount of information that can be retained is constant irrespective of the prefix length. This might be a problem if the prefix is composed of essentially incompressible data.
Indeed: https://arxiv.org/pdf/2402.01032.pdf
Perhaps future iterations of SSMs will accommodate dynamically sized (but still non-linearly-growing) hidden states / memories!
Some prior comments said those architectures lack something like the memory of a transformer, and that this weakness is what keeps people using transformers. If true, I'd also like to see tests across various domains with equivalent transformer and Mamba designs to see whether that difference affects anything. From there, we'd have a better idea of whether a Mamba-176B is worth the money.
1. The Mamba co-author was also the FlashAttention lead author.
2. The secret ingredient that makes SSMs viable for deep learning is HiPPO theory. If you start with random initialization you're not going to get results. What you need is "optimal online function approximation" using Legendre polynomials, a Fourier basis, etc., in matrix form. The Mamba story starts with Legendre Memory Units.
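For the curious, the HiPPO-LegS matrix has a simple closed form. A sketch of the standard construction (mirroring the S4 reference code), so you can see it's anything but a random init:

```python
import numpy as np

def make_hippo_legs(N: int) -> np.ndarray:
    """HiPPO-LegS state matrix (negated), the standard non-random init for A
    in S4-style SSMs: it corresponds to online approximation of the input
    history with Legendre polynomials rather than an arbitrary dynamical system."""
    p = np.sqrt(1 + 2 * np.arange(N))        # sqrt(2n + 1)
    A = p[:, None] * p[None, :]              # sqrt(2n+1) * sqrt(2k+1)
    A = np.tril(A) - np.diag(np.arange(N))   # keep lower triangle; diagonal becomes n + 1
    return -A

print(make_hippo_legs(4))
```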
Invariably someone comments, "How do we know that it scales?" We don't. But the lead author has backing and a new startup at cartesia.ai. Could be the next Mistral.
The architecture is completely public. I would be surprised if certain other players (including but not limited to Mistral AI) are not training models yet. We'll hear soon enough if this is viable. Maybe not for official release candidates, but at least for internal testing.
Fantastic blog post, thank you for this. I am not even familiar with transformers, yet the explanation is crystal clear to me, and the included references and context are a treasure trove. The explanation of FlashAttention is the best I have seen, and that is not even the focus of the article.
One question I have on selectivity: footnote 4 says "the continuous A is constant, while our discretization parameter ∆ is input-dependent." What is the effect of varying the discretization instead of the (main, as I understand it) state A? My gut says it simplifies training and provides stability, but I feel A carries most of the behavior of the model, so it should have more wiggle room throughout training.
Thank you for the kind words! I think it’s mostly to reduce complexity during training. Here’s an excerpt from page 9 of the Mamba paper:
“We remark that while the A parameter could also be selective, it ultimately affects the model only through its interaction with ∆ via Ā = exp(∆A) (the discretization (4)). Thus selectivity in ∆ is enough to ensure selectivity in (A, B), and is the main source of improvement. We hypothesize that making A selective in addition to (or instead of) ∆ would have similar performance, and leave it out for simplicity.”
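To make the quoted point concrete, here's a minimal sketch of the zero-order-hold discretization for a diagonal A with a per-token ∆; the names and shapes are illustrative, not the paper's actual kernel:

```python
import numpy as np

def discretize(A_diag, B, delta):
    """Zero-order-hold discretization for a diagonal continuous-time A:
    A_bar = exp(delta * A), B_bar = (A_bar - I) A^{-1} B (elementwise here).
    A is fixed; only the per-token step size `delta` is input-dependent."""
    A_bar = np.exp(delta * A_diag)
    B_bar = (A_bar - 1.0) / A_diag * B
    return A_bar, B_bar

A_diag = -np.arange(1.0, 5.0)   # fixed, stable (negative) diagonal A
B = np.ones(4)

# Small delta: A_bar ~ 1, so the state is kept and the token barely written.
print(discretize(A_diag, B, delta=0.01))
# Large delta: A_bar ~ 0, so the state is mostly reset and the token written in.
print(discretize(A_diag, B, delta=5.0))
```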
When I read the paper, I thought the idea was that changing ∆ lets the model learn different things over different time scales; as you quoted, it's "the main source of improvement".
I don't have an LLM background, just controls, so I might be wrong.
The issue with attention is essentially that it relates every token of the input sequence to every other token. The need to do that somehow makes intuitive sense, no matter how much one understands about the internals of a transformer. The naive way to do it boils down to matrix multiplications, and a lot more people understand the performance issues those imply.
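For concreteness, a sketch of that naive all-pairs computation (single head, plain numpy); the (seq_len × seq_len) score matrix is exactly where the quadratic cost lives:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Single-head attention: every token attends to every other token.

    Q, K, V: (seq_len, d). The (seq_len, seq_len) score matrix is what makes
    both compute and memory scale quadratically in sequence length.
    """
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # (seq_len, d)

rng = np.random.default_rng(0)
seq_len, d = 1024, 64
Q, K, V = (rng.normal(size=(seq_len, d)) for _ in range(3))
out = naive_attention(Q, K, V)   # materializes a 1024 x 1024 score matrix
```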
If you can understand attention and transformers, how can you not understand that population numbers can rise, reach a peak, fall, and then level out (all without any genocidal actions)?
How can you claim that it is "absurdism" to imagine something that can be seen in data across the plant and animal kingdom?
If I'm not mistaken the largest mamba model right now is 2.8B and undertrained with low quality data (the Pile only). The main problem is that it's new and unproven.
Should become very interesting once someone with both data and significant financial backing takes the plunge and trains something of notable size. Perhaps Llama-3 might already end up being that attempt, as we seem to be heavily into diminishing returns for transformers.
There is one trained on 600B tokens from SlimPajama [1], but that's fairly tiny compared to other recent releases (ex. stablelm-3b [2] trained on 4T tokens).
> low quality data (the Pile only)
The Pile is pretty good quality wise. It's mostly the size (300B tokens) that's limiting.
Eh quality is subjective. There are good parts, like Books3 and arxiv, but a large part of it is common crawl which has just about anything people put up on the internet, random IRC chat logs, HN and Reddit shitposts, Youtube subtitles which are in broken English half the time, and of course the Enron corporate email dump to make every model sound like an HR middle manager.
This was really helpful, but only discusses linear operations, which obviously can’t be the whole story. From the paper it seems like the discretization is the only nonlinear step—in particular the selection mechanism is just a linear transformation. Is that right? How important is the particular form of the nonlinearity?
EDIT: from looking at the paper, it seems like even though the core state space model/selection mechanism is linear (except for discretization?), they incorporate a nonlinearity in the full “mamba block”, which is stacked up with residual connections and layer norm just like in a transformer. They describe this as combining a linear attention and an MLP into a single step, rather than alternating attention and MLP as in a transformer.
Yes you're spot on, the nonlinearities come from the full Mamba blocks, which I left out of this post for simplicity/to focus on the bigger ideas the paper introduced. You can see it marked by the "X" on the right-most part of Figure 3 in the Mamba paper: https://arxiv.org/abs/2312.00752
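If it helps anyone, here's a structural sketch of where those nonlinearities sit in one block. The toy conv and scan are stand-ins for the real fused kernels, so read it as a diagram in code rather than a faithful implementation:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def toy_causal_conv1d(x, w):
    """Causal 1-D conv over the sequence (one shared kernel across channels,
    for brevity; the real block uses a depthwise conv)."""
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j, wj in enumerate(w):
            if t - j >= 0:
                out[t] += wj * x[t - j]
    return out

def toy_selective_scan(x, A_diag, delta):
    """Toy diagonal selective scan: h_t = exp(delta_t * A) * h_{t-1} + delta_t * x_t.
    Linear in the state h; all the selectivity enters through delta."""
    h = np.zeros(x.shape[1])
    y = np.zeros_like(x)
    for t in range(x.shape[0]):
        h = np.exp(delta[t] * A_diag) * h + delta[t] * x[t]
        y[t] = h
    return y

def mamba_block(u, W_x, W_z, conv_w, W_delta, A_diag, W_out):
    """Structural sketch of one Mamba block (cf. Figure 3 of the paper).
    The scan itself is linear; the nonlinearities are the SiLU activations
    and the multiplicative gate (the "X" in the figure)."""
    x = u @ W_x                                   # linear projection, main branch
    z = u @ W_z                                   # linear projection, gate branch
    x = silu(toy_causal_conv1d(x, conv_w))        # local mixing + nonlinearity
    delta = np.log1p(np.exp(x @ W_delta))         # input-dependent step size (softplus)
    y = toy_selective_scan(x, A_diag, delta)      # linear recurrence
    y = y * silu(z)                               # gating nonlinearity
    return y @ W_out                              # output projection (residual + norm live outside)

rng = np.random.default_rng(0)
L, d_model, d_inner = 32, 16, 32
out = mamba_block(
    rng.normal(size=(L, d_model)),
    W_x=rng.normal(size=(d_model, d_inner)) * 0.1,
    W_z=rng.normal(size=(d_model, d_inner)) * 0.1,
    conv_w=rng.normal(size=4) * 0.1,
    W_delta=rng.normal(size=(d_inner, d_inner)) * 0.1,
    A_diag=-np.arange(1, d_inner + 1, dtype=float),
    W_out=rng.normal(size=(d_inner, d_model)) * 0.1,
)   # shape (L, d_model)
```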
From what I can tell all the large players in the space are continuing developing on transformers right? Is it just that Mamba is too new, or is the architecture fundamentally not usable for some reason?
Too new is definitely one thing. Someone is going to have to gamble on actually paying for a serious pretraining run with this architecture before we know how it really stacks up against transformers.
There are some papers suggesting that transformers are better than SSMs in fundamental ways (e.g. SSMs cannot do arbitrary key-based recall from their context: https://arxiv.org/abs/2402.01032). This means switching over is not a no-brainer.
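The probes those papers use are easy to sketch: scatter key-value pairs through the context and only reveal the queried key at the end. A fixed-size recurrent state has to decide what to keep before seeing the query, while attention can look the key up directly. Something like this toy generator:

```python
import random

def make_recall_example(n_pairs: int = 32, seed: int = 0):
    """Toy associative-recall probe: a context of key->value pairs, then a query.

    A model reading this left-to-right with a fixed-size state must compress
    all pairs before knowing which key will be asked about; attention can
    instead look the key up directly at query time.
    """
    rng = random.Random(seed)
    keys = [f"k{i}" for i in range(n_pairs)]
    values = [f"v{rng.randrange(1000)}" for _ in range(n_pairs)]
    query = rng.choice(keys)
    context = " ".join(f"{k}:{v}" for k, v in zip(keys, values))
    answer = values[keys.index(query)]
    return f"{context} | what is {query}?", answer

prompt, answer = make_recall_example()
```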
Another element is that Mamba required a very custom implementation, down to custom fused kernels, which I expect would need to be reimplemented in DeepSpeed or an equivalent library for a larger training run spanning thousands of GPUs.
It's a reasonably easy bet that Together is doing or will do a serious pretraining run with Mamba, where if that's a success other players might start considering it more.
Exactly this. Except, there is zero chance they just looked at mamba and went "meh, too new for us". People are definitely trying stuff. It takes a lot of fiddling around with a brand new model architecture to get something working well. OpenAI aren't going to give a running commentary on the state of all the things they are looking into.
Not sure how much detail you need but generally there exist implicit and explicit integrators for numerically solving (integrating) ODE. The implicit ones, like the one used here, tend to be more stable. The ideas behind SSM come from control theory ideas that used integrators with stability guarantees so that the rest of the neural network can focus on other aspects of the problem.
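A scalar toy comparison, in case it helps build intuition: explicit Euler versus the bilinear (Tustin) rule S4 used versus the zero-order hold Mamba uses. This is just a sketch, not anyone's actual code:

```python
import numpy as np

a, dt = -50.0, 0.1          # stiff continuous-time mode dh/dt = a*h, large step size

a_euler    = 1 + dt * a                               # explicit Euler: |1 + dt*a| = 4 > 1, diverges
a_bilinear = (1 + dt * a / 2) / (1 - dt * a / 2)      # bilinear/Tustin: magnitude < 1 for any a < 0
a_zoh      = np.exp(dt * a)                           # zero-order hold: always stable for a < 0

h = {"euler": 1.0, "bilinear": 1.0, "zoh": 1.0}
for _ in range(20):
    h["euler"]    *= a_euler
    h["bilinear"] *= a_bilinear
    h["zoh"]      *= a_zoh
print(h)   # Euler blows up; the other two decay toward zero
```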
Very annoying namespace conflict, since a package called "mamba" (a faster reimplementation of the Python conda package manager) already existed for a while before this architecture was even dreamed up.
Beyond that, I'll care about an alternative to transformers when it shows superior performance in an open-source 7B-34B model compared to transformer competitors. So far that hasn't happened.
> Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
That's the issue - I keep hearing that it is beyond a small research group's budget to meaningfully train such a large model. You don't just need GPU time, you also need data. And just using the dregs of the internet doesn't cut it.
I use the former and have been experimenting with the latter. Fortunately, the contexts are separate enough that they never come up in the same sentence.
> Importantly, these recurrent and convolutional forms, which I like to call “RNN mode” and “CNN mode,” are mathematically equivalent. This allows S4 to shape-shift depending on what you need it to do, with no difference in its outputs.
Is this really true? Because it seems to ignore hardware and data type precision entirely. I mean, computing the same mathematical quantity in different ways with floating point often leads to different results.
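For what it's worth, "mathematically equivalent" holds in exact arithmetic, and in floating point the two orderings differ only at rounding-error level. A quick toy check with an already-discretized diagonal SSM (nothing to do with the real Mamba kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 8, 64                              # state size, sequence length
A = np.diag(rng.uniform(0.5, 0.99, N))    # toy stable (already-discretized) A
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
u = rng.normal(size=L)

# "RNN mode": step the recurrence token by token.
h = np.zeros((N, 1))
y_rnn = []
for t in range(L):
    h = A @ h + B * u[t]
    y_rnn.append((C @ h).item())
y_rnn = np.array(y_rnn)

# "CNN mode": build the kernel K = (CB, CAB, CA^2B, ...) and convolve.
K = np.array([(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(L)])
y_cnn = np.array([np.dot(K[:t + 1][::-1], u[:t + 1]) for t in range(L)])

print(np.max(np.abs(y_rnn - y_cnn)))   # prints a tiny rounding-level difference, not exact zero
```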
I'm fairly confident I could actually understand the terminology used in discussing machine learning models if it were presented in a way that laid out the first principles a bit better, instead of diving directly into high-level abstract equations and symbols.
I'd like a way to learn this stuff as a computer engineer, in the same spirit as "big scary math symbols are just for-loops"
Ironically, you can probably just ask a Transformer model to explain it to you.
I'm the same as you: I have no problem grasping complex concepts, I just always struggled with the mathematical notation. I did pass linear algebra in university, but was glad I could go back to programming after that. Even then, I mostly passed linear algebra because I wrote functions that solve linear algebra equations until I fully grasped the concept.
I've found that GPT-4 is very good at taking a math-notation-rich document and just describing it in terms a math-notation-averse engineer would understand.
I was a data engineer for about 6-7 years at various companies, always working together with data scientists who insist that `x_` or `_phi` are proper variable names. Man am I glad to be working with engineers now.
Also, just try really hard. Repeat. It's a new language for explaining concepts you likely already know. You don't remember Spanish by looking at the translations once.
That's a heuristic that's usually true. You can definitely understand convolution or attention better with a "big scary math symbols are just for-loops" explanation, but there are also things like dopri5 or elliptic curve crypto where we just have to accept that Weird Math Shit is happening and the symbols are inevitable. It looks to me like Mamba has dragged a part of LLM research into the latter camp.
MoE (Mixture of Experts) is an effective way to scale transformers. Gemini 1.5 is already doing up to 1 million tokens of context. I have not seen any large-scale Mamba model, so I'm not aware of its shortcomings, but I am sure there are tradeoffs.
It should be possible to combine Mamba with MoE. I wonder what that would look like... a billion-token context?
MoE lets you scale model size up without a proportional increase in compute, which hopefully leads to more intelligent models.
It is, however, independent of context size: the ability to process a lot of tokens / text.
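For anyone unfamiliar, the MoE trick is to route each token through only a few of many expert MLPs, so parameter count grows while per-token compute stays roughly fixed. A minimal top-k routing sketch; nothing here is Mamba-specific:

```python
import numpy as np

def moe_layer(x, router_W, expert_W1, expert_W2, top_k=2):
    """Minimal top-k mixture-of-experts MLP for a single token vector x.

    Only `top_k` of the experts run per token, so total parameters can grow
    with the number of experts while per-token compute stays roughly fixed.
    """
    logits = x @ router_W                         # (n_experts,)
    top = np.argsort(logits)[-top_k:]             # indices of the chosen experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                          # softmax over the chosen experts
    out = np.zeros_like(x)
    for g, e in zip(gates, top):
        hidden = np.maximum(x @ expert_W1[e], 0)  # expert MLP (ReLU)
        out += g * (hidden @ expert_W2[e])
    return out

d, d_ff, n_experts = 16, 64, 8
rng = np.random.default_rng(0)
y = moe_layer(
    rng.normal(size=d),
    router_W=rng.normal(size=(d, n_experts)) * 0.1,
    expert_W1=rng.normal(size=(n_experts, d, d_ff)) * 0.1,
    expert_W2=rng.normal(size=(n_experts, d_ff, d)) * 0.1,
)
```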
For more info, see the scaling laws plot in Figure 4 of the Mamba paper: https://arxiv.org/abs/2312.00752