
Which means Codestral Mamba and DeepSeek both lead four benchmarks. Kinda takes the air out of the announcement a bit.



It should be corrected, but the interesting aspect of this release is the architecture. Staying competitive while only needing linear inference time and supporting a 256k context is pretty neat.
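Roughly what "linear inference time" buys you: a state-space model carries a fixed-size state forward, so each new token costs O(1) work and a whole sequence costs O(n), versus O(n^2) for full attention. A toy sketch of the recurrence (time-invariant here; Mamba's actual selective scan is more involved):

    import numpy as np

    def ssm_scan(xs, A, B, C):
        # Fixed-size state updated once per token: O(n) total work
        # and constant memory, regardless of context length.
        state = np.zeros(A.shape[0])
        outputs = []
        for x in xs:
            state = A @ state + B * x   # O(1) work per token
            outputs.append(C @ state)
        return outputs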


THIS. People don't realize the importance of Mamba competing on par with transformers.


Linear attention is terrible for chatbot-style request-response applications, but if you're giving the model the prompt and then letting it scan the codebase and fill in the middle, linear attention should work pretty decently. The performance benefit should also have a much bigger impact, since you're reprocessing the same code over and over again.
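For context, fill-in-the-middle just means the model sees the code before and after the insertion point and generates the span between them. A generic sketch of the prompt layout (the token names are illustrative, not necessarily Codestral's actual special tokens):

    def build_fim_prompt(prefix: str, suffix: str) -> str:
        # The model completes the span between prefix and suffix;
        # with linear-time scanning, re-reading a large surrounding
        # codebase on every request stays cheap.
        return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"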


They're in roughly the same class, but they have totally different architectures.

DeepSeek uses a 4k sliding window, compared to Codestral Mamba's 256k+ tokens.
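The difference is easiest to see in the attention mask: with a 4k sliding window, token i can only attend to the previous 4k tokens, while an SSM carries a compressed state across the full context. A toy mask for illustration (hypothetical helper, not either model's actual code):

    import numpy as np

    def sliding_window_mask(seq_len: int, window: int = 4096):
        # Token i may attend only to tokens in (i - window, i];
        # anything older falls out of the window entirely.
        i = np.arange(seq_len)[:, None]
        j = np.arange(seq_len)[None, :]
        return (j <= i) & (j > i - window)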



