MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts (arxiv.org)
129 points by jonbaer 8 months ago | 39 comments



I very much appreciate that the authors not only published their code (https://github.com/llm-random/llm-random) but also included the dataset they used (available on Hugging Face - https://huggingface.co/datasets/c4), as well as the training process and hyperparameters, so others can replicate and build on their work. The only thing really missing is the weights, which would be nice to have on Hugging Face as well.


It's very confusing to me that you are praising the authors of a published scientific paper for almost making their work reproducible.


If we had proper data version control, wherein the git commit hash was tied directly to the output data hash and hosted on IPFS (and the make system checked IPFS for cached outputs the way it checks local files), then it would be absolutely reproducible.

And the wonderful thing is, every person that used git clone on this repo and ran it would be serving the NN weights.
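
Roughly, the cache step could look something like this (just a sketch: the git and ipfs CLI calls are real, but the REGISTRY mapping and train_and_save are hypothetical names for illustration):

    import subprocess
    from pathlib import Path

    # Hypothetical mapping from git commit hash -> IPFS CID of the trained weights
    # (in practice this index could live in the repo itself).
    REGISTRY = {"<commit-hash>": "<cid-of-weights>"}

    def fetch_or_build(artifact: Path) -> Path:
        # Return the artifact, fetching it from IPFS if someone already built it
        # at this exact commit; otherwise build (train) locally and publish it.
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
        cid = REGISTRY.get(commit)
        if cid is not None:
            # content-addressed download of the cached output
            subprocess.run(["ipfs", "get", cid, "-o", str(artifact)], check=True)
        else:
            train_and_save(artifact)  # hypothetical training entry point
            # publish the result so everyone who cloned the repo can serve it
            subprocess.run(["ipfs", "add", str(artifact)], check=True)
        return artifact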

But alas, this unfortunately hasn't been done yet.


That's not what confusing means.


Feigned confusion


The weights aren't needed to make it reproducible. The code and training data are needed. Hopefully, if you used those, you'd ultimately reach the same result.


Even back when this was standard, that was not entirely the case.

There is a whole other world between "released code" and "getting the results as seen in the paper".

Unfortunately, the reproducibility crisis is very much alive and well! :'( Much more to go into, but it is a deep rabbit hole, indeedy. :'((((


I guess I'm saying that if there are reproducibility problems without the weights, then there's still a reproducibility problem with them. A paper whose weights magically work, while training on the same data and algorithm doesn't, is a paper that isn't reproducible.

IMO, having the weights available sometimes just papers over a deeper issue.


Training, especially on large GPU clusters, is inherently non-deterministic, even if all seeds are fixed.

This boils down to framework implementations, timing issues, and the extra cost of trying to ensure determinism (without guarantees).
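
For illustration, here are roughly the knobs you have to turn in PyTorch just to get close (a sketch, not a guarantee): even with all of this, multi-GPU reduction order and kernel selection can still differ between runs and clusters.

    import os
    import random
    import numpy as np
    import torch

    def make_training_as_deterministic_as_possible(seed: int = 0) -> None:
        # Fix every seed we control; this still does not guarantee
        # bitwise-identical training across machines or cluster sizes.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)                  # seeds CPU and CUDA RNGs
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False   # no autotuned kernel selection
        os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # deterministic cuBLAS GEMMs
        torch.use_deterministic_algorithms(True) # error out on non-deterministic ops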


Random initialization would keep you from producing the exact same results.


Yes, but there's a difference between exact results and reproducible results. I should get similar performance, otherwise there is an issue.


It's a sad world where our standards are that low. But they are that low for good reasons.


If anything, CS papers are far more reproducible than most papers. Maybe that is sad, but I think most scientists and researchers are trying their best.


I understand where you're coming from, but what they provided DOES make their work reproducible. You can use the data, source code, and recipe to train the model and get the weights.

It would be nice if they provided the weights so the model would be USABLE without the effort or knowledge required to train it.

We (I think) would all like to see more _truly_ open models (not just the source code) that enable collaboration in the community.


Only if they also include the random seed they used for the initial weights; otherwise you may be able to reproduce similar performance, but you're unlikely to obtain the same weights.


But that's a lot like saying my recipe for muffins isn't reproducible because it doesn't say exactly which batch from which field my flour comes from. I mean, of course you won't get the same muffins, but if your muffins taste just as good, it's still a win.


If this work is valuable, the random seed shouldn't affect the outcome thaaat much.


Interesting, if expected, core idea. The MoE models in Table 1 have a drastically larger total parameter count than the baseline transformer, which in turn is super tiny for the purposes of NLP (25M parameters). I suppose inference speed is similar for the large MoE and the simple Mamba model, so for some applications that extra parameter count is OK, but I don't know how these performance benefits would scale once the model reaches practical sizes, at which point the 32x larger MoE may be unrealistic for training purposes. I'd be very interested to see a real-life application of traditional Mamba scaled to 70B or more parameters, or any attempts at meaningful parallelization of the original (or the MoE) model.


I'd put my money on OpenAI having tried this internally.


It is a reasonable bet to assume that OpenAI had some powerful models along the lines of Chris Re's group's work well before Mamba came out. They hired people with the right background, and the CUDA optimization certainly would not be a problem for OpenAI. My main question is whether it makes sense to scale up an already huge model 32x during training compared to other ideas for increasing capacity at scale.


This is cool. And interesting, in that MoE + Transformers gets you a little bit of something for free on the training side, and a medium amount of something for free on the inference side. Seeing that MoE still gives those benefits with state space architectures is useful information; according to the paper, it's about twice as efficient in training time compared to non-MoE Mamba.

I don't understand Mamba's architecture well enough to predict much about it, but I do think we'll see significant exploration of MoE architectural ideas this year, and that will be cool!


I thought training time/compute scales linearly with number of experts?


I'll struggle through the Mamba papers, but does anyone have a good post or article that gives an intuition about what is fundamentally different about Mamba compared to transformers?


I struggled learning about Mamba's architecture but realized it's because I had some gaps in knowledge. In no particular order, they were:

- a refresher on differential equations

- Legendre polynomials

- state space models; you need to grok the essence of

x' = Ax + Bu

y = Cx

- discretization of S4

- HiPPO matrix

- GPU architecture (SRAM, HBM)

Basically, the transformer is an architecture that uses attention. Mamba is the same kind of architecture, but it replaces attention with S4 - an S4 modified to overcome its shortcomings, allowing it to act like a CNN during training and an RNN during inference.
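
To make the x' = Ax + Bu, y = Cx part concrete, here is a toy discretized state space model in NumPy written in its RNN form (the CNN/training view unrolls the same recurrence into a convolution kernel). This is just an illustration, not the S4/Mamba code:

    import numpy as np

    def ssm_scan(A, B, C, u):
        # Discretized linear state space model run as a recurrence:
        #   x[k] = A @ x[k-1] + B * u[k]
        #   y[k] = C @ x[k]
        # A: (N, N), B: (N, 1), C: (1, N), u: (T,) scalar input sequence.
        N = A.shape[0]
        x = np.zeros((N, 1))
        ys = []
        for u_k in u:
            x = A @ x + B * u_k        # state update
            ys.append((C @ x).item())  # readout
        return np.array(ys)

    # Because A, B, C are constant here (as in S4), the same outputs can also be
    # computed as a 1-D convolution of u with the kernel [C@B, C@A@B, C@A@A@B, ...].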

I found this video very helpful: https://www.youtube.com/watch?v=8Q_tqwpTpVU

His other videos are really good too.


I was about to post that video too. Highly recommended.


This is an article about "S4", which uses "Structured State Spaces":

https://srush.github.io/annotated-s4/

The matrices that make up the state space (A, B, and C) are constant in S4. This allows some of the math operations to be represented as a convolution (which can be parallelized).

The difference between S4 and Mamba is that these matrices are input-dependent in Mamba. Plus they add in some CUDA stuff ("parallel scan") to make it faster to compute on a GPU even if these matrices are not constant.
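
A toy sketch of that difference (not the actual Mamba code): once B and C are computed from the current input, the fixed convolution kernel above no longer exists, which is why a parallel scan is needed to keep training parallel.

    import numpy as np

    def selective_ssm_step(x, u_k, A, W_B, W_C):
        # One step of a toy "selective" SSM: B and C are produced from the
        # current input u_k instead of being fixed constants. (In Mamba it is
        # roughly B, C, and the step size that become input-dependent.)
        B_k = W_B * u_k          # input-dependent input projection (toy version)
        C_k = W_C * u_k          # input-dependent readout (toy version)
        x = A @ x + B_k          # state update now depends on the current token
        y = (C_k.T @ x).item()
        return x, y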

Yannic Kilcher's video on Mamba might also be a good resource: https://youtu.be/9dSkvxS2EB0


Would recommend Sasha Rush's lecture https://www.youtube.com/watch?v=dKJEpOtVgXc; it provides the intuition for state space models via linear RNNs.


We went over it in our Friday paper club before the holidays which helped me gain an intuition.

https://blog.oxen.ai/mamba-linear-time-sequence-modeling-wit...

I'm still not convinced on Mamba's performance on Natural Language tasks, but maybe it's just because they haven't trained a large enough model on enough data yet.


Is this a group I can join? Is it like a book club, but for reading ML papers?


Yes it is! We meet every Friday at 10am PST and pick an Arxiv Paper to go over as a group.

Feel free to join here: https://lu.ma/oxenbookclub




I'd recommend starting with the appendix of the original HiPPO paper.


I've spent a few hours reading the Mamba paper, examining the code, and watching Yannic's YouTube dive into the paper. If you took discrete-time signal processing in school, you will have what you need to work through the math.

Here is my intuition about Mamba. It's a linear discrete-time signal filter where the filter blocks are conditioned on the input at each time step, in addition to receiving the input as a normal filter would. Think of the predecessor to Mamba as a discrete-time filter without this conditioning. That's the first innovation.

The other innovation is some lovely algebra allowing them to reorganize things so they can be computed much more efficiently on current GPU hardware, with respect to slow and fast memory.

State space models are very simple conceptually. It’s kind of amazing that Mamba does so well on long sequence modeling because the architecture is ridiculously simple. But when you consider that the state matrices are conditioned on the input, it all makes sense. Somehow, they learn how to adapt the “filter” based on the input. That’s where the intelligence lies.


I wonder whether there is a way to use existing model weights (e.g. from open LLMs) to populate other types of models with a set of weights closer to the final state than where they would have started from scratch. I am very much speaking from a position of ignorance, but being mechanistically derived.....


Isn't this transfer learning?


"The MOE architecture uses 20 times the parameters, is this comparison fair? Can it be compared with a single model that also uses 20 times the parameters?"


It has more parameters, but not all of them are used for any given token during inference. They compared models that use equal numbers of active parameters.
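
A minimal toy example of that distinction, assuming a simple top-1 (switch-style) router rather than whatever routing the paper uses: total parameters grow with the number of experts, but each token only touches one expert's weights.

    import torch
    import torch.nn as nn

    class TinyMoE(nn.Module):
        # Toy MoE layer: total parameter count grows with num_experts, but each
        # token is routed to a single expert, so the compute (and "active"
        # parameters) per token stays roughly constant.
        def __init__(self, d_model: int, num_experts: int):
            super().__init__()
            self.router = nn.Linear(d_model, num_experts)
            self.experts = nn.ModuleList(
                [nn.Linear(d_model, d_model) for _ in range(num_experts)]
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
            expert_idx = self.router(x).argmax(dim=-1)        # top-1 routing
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                mask = expert_idx == i
                if mask.any():                                # only chosen experts run
                    out[mask] = expert(x[mask])
            return out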


MoE-Mamba is great, but I prefer Sicko-MoEd



