I very much appreciate that the authors not only published their code (https://github.com/llm-random/llm-random) but also included the dataset they used (available on Hugging Face: https://huggingface.co/datasets/c4), along with the training process and hyperparameters, so others can replicate and build on their work. The only thing really missing is the weights, which would be nice to have on Hugging Face as well.
If we had proper data version control, wherein the git commit hash was tied directly to the output data hash and hosted on IPFS (and the make system checked IPFS for the cache the way it checks local files), then it would be absolutely reproducible.
And the wonderful thing is, every person that used git clone on this repo and ran it would be serving the NN weights.
But alas, this unfortunately hasn't been done yet.
The weights aren't needed to make it reproducible. The code and training data are. Hopefully, if you used those, you'd ultimately reach the same result.
I guess I'm saying that if there are reproducibility problems without the weights, then there's still a reproducibility problem with them. A paper whose published weights magically work, while training on the same data with the same algorithm doesn't, is a paper that isn't reproducible.
IMO, having the weights available sometimes just papers over a deeper issue.
If anything CS papers are far more reproducible than most papers. Maybe that is sad, but I think most scientists and researchers are trying their best.
I understand where you're coming from but what they provided DOES make their work reproducible. You can use the data, source code, and recipe to train the model and get the weights.
It would be nice if they provided the weights so it could be USABLE without the effort or knowledge required.
We (I think) would all like to see more _truly_ open models (not just the source code) that enable collaboration in the community.
Only if they also include the random seed they used for the initial weights, otherwise you may be able to reproduce similar performance but will not likely obtain their same weights.
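To make the seed point concrete, here's a minimal sketch of what "publishing the seed" buys you: pin every source of randomness and the initial weights become replayable. The seed value and the toy init are invented for illustration; a real training run would also need to seed the DL framework and enable deterministic kernels, since GPU ops can still vary.

```python
# Toy sketch: a published seed makes the *initialization* replayable.
import random

SEED = 1234  # hypothetical value -- this is what would need publishing

random.seed(SEED)
# A real run would also do e.g. np.random.seed(SEED) and
# torch.manual_seed(SEED), plus deterministic-algorithm flags.
init = [random.gauss(0.0, 0.02) for _ in range(4)]  # toy "initial weights"

# Replaying with the same seed reproduces the same initialization.
random.seed(SEED)
replay = [random.gauss(0.0, 0.02) for _ in range(4)]
```
Without the seed you'd get a different `init` each run, which is exactly why you'd expect similar performance but not identical weights.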
But that's a lot like saying that my recipe for muffins isn't reproducible because it doesn't say exactly which batch of which field my flour comes from. I mean, of course you won't get the same muffins, but if your muffins taste just as good it's still a win.
Interesting and expected core idea. The MoE models in table 1 have drastically larger total parameter count than the baseline transformer, which in turn is super tiny for the purpose of NLP (25M parameters). I suppose inference is of similar speed for the large MoE and the simple mamba model, so for some applications that extra parameter count is OK, but I don’t know how these performance benefits would scale when the model reaches practical sizes at which point the 32x larger MoE may be unrealistic for training purposes. I’d be very interested to see a real life application of traditional mamba scaling to the 70B or more parameters, or any attempts to get meaningful parallelizing of the original (or the MoE) model.
It is a reasonable bet to assume that OpenAI had some powerful models along the lines of Chris Re’s group’s work well before Mamba came out. They hired people with the right background, and certainly the CUDA optimization would not be a problem for OpenAI. My main question is whether it makes sense to scale up an already huge model 32x during training compared to other ideas for increasing capacity at scale.
This is cool. And interesting, in that MoE+Transformers gets you a little bit of something for free on the training side, and a medium amount of something for free on the inference side -- seeing that MoE still gives those benefits with state space architectures is useful information; according to the paper, it's about twice as efficient in training time as non-MoE Mamba.
I don't understand Mamba's architecture well enough to predict much about it, but I do think we'll see significant exploration of MoE architectural ideas this year, and that will be cool!
I'll struggle through the mamba papers, but does anyone have a good post or article that gives an intuition about what is fundamentally different about Mamba compared to transformers?
I struggled learning about Mamba's architecture but realized it's because I had some gaps in knowledge. In no particular order, they were:
- a refresher on differential equations
- Legendre polynomials
- state space models; you need to grok the essence of
    x' = Ax + Bu
    y = Cx
- discretization of S4
- the HiPPO matrix
- GPU architecture (SRAM, HBM)
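The state-space equations in that list can be sketched as a discrete recurrence. This is a toy with scalar stand-ins for the A, B, C matrices (values invented for illustration, not the actual S4 discretization):

```python
# Toy discrete state-space model:
#   x[k+1] = A x[k] + B u[k]   (state update)
#   y[k]   = C x[k+1]          (readout)
# Scalars stand in for the matrices; values are illustrative only.
A, B, C = 0.9, 1.0, 0.5

def ssm_run(u_seq):
    """Run the recurrence over an input sequence, collecting outputs."""
    x, ys = 0.0, []
    for u in u_seq:
        x = A * x + B * u
        ys.append(C * x)
    return ys

# Feeding an impulse shows the "filter" behavior: the response
# decays geometrically with A at each step.
impulse_response = ssm_run([1.0, 0.0, 0.0])
```
This recurrent form is what runs at inference time; the training-time trick (below in the thread) is that with constant A, B, C the same computation can be written as a convolution.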
Basically, the transformer is an architecture that uses attention. Mamba is the same architecture with attention replaced by S4 - but an S4 modified to overcome its shortcomings, allowing it to act like a CNN during training and an RNN during inference.
The matrices that make up the state space (A, B and C) are constant in S4. This allowed them to represent some of the math operations as a convolution (which can be parallelized).
The difference between S4 and Mamba is that these matrices are input-dependent in Mamba. Plus they add in some CUDA stuff ("parallel scan") to make it faster to compute on a GPU even if these matrices are not constant.
I'm still not convinced on Mamba's performance on Natural Language tasks, but maybe it's just because they haven't trained a large enough model on enough data yet.
I’ve spent a few hours reading the Mamba paper, examining the code, and watching Yannic Kilcher’s YouTube dive into the paper. If you took discrete-time signal processing in school, you will have what you need to work through the math.
Here is my intuition about Mamba. It’s a linear discrete time signal filter where the filter blocks are conditioned on the input at each time step as well as receiving the input as a normal filter would. Think of the predecessor to Mamba as a discrete time filter without this conditioning. That’s the first innovation.
The other innovation is some lovely algebra allowing them to reorganize things so they can be computed much more efficiently on current GPU hardware, with respect to slow and fast memory.
State space models are very simple conceptually. It’s kind of amazing that Mamba does so well on long sequence modeling because the architecture is ridiculously simple. But when you consider that the state matrices are conditioned on the input, it all makes sense. Somehow, they learn how to adapt the “filter” based on the input. That’s where the intelligence lies.
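The "filter conditioned on the input" intuition above can be sketched in a few lines. Note the gating form here (a sigmoid of the input controlling the decay) is invented for illustration; Mamba's actual parameterization of the input-dependent matrices is different:

```python
import math

# Toy "selective" SSM: the decay coefficient now depends on the
# input u at each step, loosely mimicking input-dependent matrices.
B, C = 1.0, 0.5

def a_of(u):
    # Hypothetical gating: input-dependent decay in (0, 1).
    return 1.0 / (1.0 + math.exp(-u))

def selective_ssm(u_seq):
    x, ys = 0.0, []
    for u in u_seq:
        x = a_of(u) * x + B * u  # the "filter" changes every step
        ys.append(C * x)
    return ys

ys = selective_ssm([1.0, -1.0])
```
Because the coefficient varies per step, the learned model can decide, from the input itself, how much past state to keep or forget - which is a reasonable way to picture where the "intelligence" in the filter comes from.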
I wonder whether there is a way to use existing model weights (e.g. from open LLMs) to populate other types of models with a set of weights closer to the final state than where they would have started from scratch. I am very much speaking from a position of ignorance, but being mechanistically derived.....
"The MOE architecture uses 20 times the parameters, is this comparison fair? Can it be compared with a single model that also uses 20 times the parameters?"