
Are we going to hit bullseye?



This only cuts compute by “up to” 50%, and only during inference. The quadratic dependence on context size remains, as do the enormous memory requirements. For something to be considered a bullseye in this space, it has to offer nonlinear improvements on both of these axes, and/or be much faster to train. Until that happens, people, including Google, will continue to train bog-standard MoE and dense transformers. Radical experimentation at scale is too expensive even for megacorps at this point.
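To make the quadratic point concrete, here's a rough back-of-the-envelope sketch (the model dimensions are purely illustrative, not from any particular model):

    # Attention forms a seq_len x seq_len score matrix per head, so compute
    # grows quadratically with context length. FlashAttention-style kernels
    # avoid materializing the scores, but the FLOP count stays quadratic.
    def attention_cost(seq_len, d_model=4096, n_heads=32, bytes_per_elem=2):
        d_head = d_model // n_heads
        # QK^T and softmax(A)V are each ~seq_len^2 * d_head MACs per head
        flops = 2 * n_heads * (2 * seq_len * seq_len * d_head)
        # naively materialized score matrix, per layer
        score_bytes = n_heads * seq_len * seq_len * bytes_per_elem
        return flops, score_bytes

    for n in (2_000, 8_000, 32_000, 128_000):
        flops, mem = attention_cost(n)
        print(f"{n:>7} tokens: ~{flops/1e12:.1f} TFLOPs, "
              f"~{mem/1e9:.1f} GB of scores per layer")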


This creates opportunities for smaller companies to innovate/experiment and offer solutions (or become acquisition targets) where tight inference compute requirements make or break the experience but a larger training cost is less of a concern (such as embedded or local runtime use cases).


Before those opportunities are available to you, someone would need to spend a few million dollars and train a competitive model with this, and then release it under a license that allows commercial use. This is out of reach for the vast majority of smaller companies. These models only excel at large parameter counts, even for narrow problems. This is especially true in the case of MoE, which is a way to push the overall parameter count even larger without lighting up the whole thing for every token.
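To illustrate the "not lighting up the whole thing" part, here's a toy top-k routing sketch (every name and dimension below is made up for illustration, not taken from any real model):

    import numpy as np

    # Total parameter count grows with the number of experts, but each token
    # only runs through top_k of them.
    rng = np.random.default_rng(0)
    d_model, d_ff, n_experts, top_k = 64, 256, 8, 2

    router = rng.normal(size=(d_model, n_experts))
    experts = [(rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
               for _ in range(n_experts)]

    def moe_forward(x):                      # x: (d_model,) for a single token
        logits = x @ router                  # one routing score per expert
        chosen = np.argsort(logits)[-top_k:] # indices of the top_k experts
        weights = np.exp(logits[chosen])
        weights /= weights.sum()
        out = np.zeros(d_model)
        for w, i in zip(weights, chosen):    # only top_k expert FFNs execute
            w_in, w_out = experts[i]
            out += w * (np.maximum(x @ w_in, 0) @ w_out)
        return out

    total = n_experts * d_model * d_ff * 2
    active = top_k * d_model * d_ff * 2
    print(f"expert params: {total:,} total, {active:,} active per token")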


Yeah, all attempts at reducing complexity from quadratic to linear have failed; only Mamba still has a chance, but it hasn't been tested on large models and only provides a speedup for 2000+ tokens. That was to be expected: small sequences have very small memory requirements for transformers, whereas recurrent architectures use the same hidden-state size regardless of length. So when the recurrent hidden size > sequence length, the old transformer is faster.
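The shape of that argument in a few lines (the constants are placeholders chosen so the crossover lands near the ~2k tokens mentioned above; the real crossover depends on hidden sizes and kernel implementations):

    D_MODEL = 4096                       # illustrative model width
    RECURRENT_STEP_COST = 8_000_000      # placeholder fixed per-token cost

    def attention_step_cost(seq_len):
        # attend over every cached token: work ~ seq_len * d_model
        return seq_len * D_MODEL

    for n in (500, 1_000, 2_000, 4_000, 8_000):
        attn = attention_step_cost(n)
        cheaper = "recurrent" if RECURRENT_STEP_COST < attn else "attention"
        print(f"{n:>5} tokens: attention ~{attn:,} vs recurrent "
              f"~{RECURRENT_STEP_COST:,} per step -> {cheaper} is cheaper")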


It's more subtle than that IMO. They haven't necessarily "failed" - they just don't have the "superpowers" that the metrics used to evaluate such systems are aimed at. E.g. no linear method devised so far (that I know of, at least) can do very high-recall point retrieval in long context _and_ effective in-context learning simultaneously. You get one or the other, but not both.

But as far as the metrics go, high-recall retrieval in long context is easier for the researcher to demonstrate and for the observer to comprehend - a typical needle/haystack setting is trivial to put together. It is also something that (unlike in-context learning) humans are usually very bad at, so it's perceived as a "superpower" or "magic". In this case, e.g. Mamba being more human-like due to its selective forgetfulness is currently playing against it.

But whether it's "better" per se will depend on the task. It's just that we do not know how to evaluate most of the tasks yet, so people keep trying to find the proverbial keys under the lamp post, measuring what they can to make progress and thereby keep their efforts lavishly funded.
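For anyone who hasn't seen one: a minimal needle/haystack probe really is just a few lines. The filler, the needle, and the build_haystack helper below are made up for illustration; real evals sweep context length and needle depth:

    FILLER = "The quick brown fox jumps over the lazy dog. "
    NEEDLE = "The secret access code for the vault is 4-8-15-16-23-42. "

    def build_haystack(n_filler_sentences=2000, needle_depth=0.5):
        sentences = [FILLER] * n_filler_sentences
        insert_at = int(needle_depth * len(sentences))  # where the needle is buried
        sentences.insert(insert_at, NEEDLE)
        question = "\n\nWhat is the secret access code for the vault?"
        return "".join(sentences) + question

    prompt = build_haystack()
    print(f"{len(prompt):,} characters, needle at ~50% depth")
    # Send `prompt` to the model under test and check whether the answer
    # contains "4-8-15-16-23-42".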



