Mixtral of experts (mistral.ai)
639 points by georgehill 9 months ago | 300 comments



Andrej Karpathy's Take:

Official post on Mixtral 8x7B: https://mistral.ai/news/mixtral-of-experts/

Official PR into vLLM shows the inference code: https://github.com/vllm-project/vllm/commit/b5f882cc98e2c9c6...

New HuggingFace explainer on MoE very nice: https://huggingface.co/blog/moe

In naive decoding, performance of a bit above 70B (Llama 2), at inference speed of ~12.9B dense model (out of total 46.7B params).

Notes:
- Glad they refer to it as "open weights" release instead of "open source", which would imo require the training code, dataset and docs.
- "8x7B" name is a bit misleading because it is not all 7B params that are being 8x'd, only the FeedForward blocks in the Transformer are 8x'd, everything else stays the same. Hence also why total number of params is not 56B but only 46.7B.
- More confusion I see is around expert choice, note that each token and also each layer selects 2 different experts (out of 8).
- Mistral-medium

Source: https://twitter.com/karpathy/status/1734251375163511203


Anyone have a feeling Karpathy may leave OpenAI to join an actually open AI startup where he can openly speak about training tweaks, the datasets, architecture, etc.?

It seems that recently OpenAI is the least open AI startup. Even the Gemini team talks more about their architecture.

OpenAI still doesn’t openly acknowledge that GPT-4 is a mixture-of-experts model.



Note that running these GGUF models currently requires a forked version of llama.cpp: https://github.com/ggerganov/llama.cpp/pull/4406

The GGUF handling for Mistral's mixture of experts hasn't been finalized yet. TheBloke and ggerganov and friends are still figuring out what works best.

The Q5_K_M GGUF model is about 32GB. That's not going to fit into any consumer-grade GPU, but it should be possible to run on a reasonably powerful workstation or gaming rig. Maybe not fast enough to be useful for everyday productivity, but it should run well enough to get a sense of what's possible. Sort of a glimpse into the future.


LLMs seem to be a bit more accessible than some other ML models though, because on a good CPU, even LLaMA2 70B is borderline usable (a bit under a token/second for LLaMA2 70B on an AMD Ryzen 7950X3D, using ~40 GiB of RAM). Combined with RAM being relatively cheap, this seems to me like the most accessible option for most folks. While the AMD Ryzen 7950X3D and Intel Core i9-13900K are relatively expensive parts, they're not that bad (you could probably price out two entire rigs for less than the cost of a single RTX 4090), and as a bonus, you get pretty excellent performance for code compilation, rendering, and whatever other CPU-bound tasks you might have. If you're like me and have already been buying expensive CPUs to speed up code compilation, the fact that you can just run llama.cpp to mess around is merely a bonus.


>bit under a token/second

When you say 'token', is this a word? A character? I've never gotten a good definition for it beyond 'a unit of text the LLM processes'.


It is a unit of text the LLM processes. :-)

Everyone uses byte pair encoding (https://en.wikipedia.org/wiki/Byte_pair_encoding) to generate their tokens; the tokens are whatever emerge from this. They will typically correspond to the most common substrings in the training corpus in, handwaving a bit, a max-cover sense; it's an encoding which attempts to best compress the data the tokenizer was trained on.
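If it helps to see the mechanism rather than read about it, here's a toy sketch of the BPE idea in Python. Purely illustrative: real tokenizers are trained on enormous corpora, work on bytes, and store the learned merge rules rather than recomputing them.

  from collections import Counter

  def bpe_merges(text, num_merges=4):
      """Toy BPE: repeatedly merge the most frequent adjacent pair into a new token."""
      tokens = list(text)  # start from individual characters
      for _ in range(num_merges):
          pairs = Counter(zip(tokens, tokens[1:]))
          if not pairs:
              break
          (a, b), _count = pairs.most_common(1)[0]
          merged, i = [], 0
          while i < len(tokens):
              if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                  merged.append(a + b)  # the pair becomes a single new token
                  i += 2
              else:
                  merged.append(tokens[i])
                  i += 1
          tokens = merged
      return tokens

  print(bpe_merges("the cat sat on the mat"))
  # frequent substrings get merged first; rare letter sequences stay as single characters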


My amateur intuition, having played around with local llms a little bit and seeing things load a token at a time, is that they're conceptually like if you took all n-grams for all lengths n, then sorted them by frequency in the training data, and truncated that list at some point. So the most common words, or even most common words+punctuation, will be one token, less common words with "normal" spelling will be a few tokens, while unusual words with atypical letter combinations will be many tokens. So, e.g., " the" will probably be one token, but "qzxv" will probably be four, depending on what the training set was (something mostly trained on Wikipedia will have different tokens than something mostly trained on code).


More common words can be just one token, but most words will be a few tokens. A token is neither a character nor a word, it's more like a word fragment.


A good way to build intuition for how much text fits in a token is by pasting a block of text into a tokenizer playground, like this one: https://huggingface.co/spaces/Xenova/the-tokenizer-playgroun...
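If you'd rather poke at it in code than in the playground, something along these lines works too (assuming `transformers` is installed and the mistralai/Mistral-7B-v0.1 tokenizer is downloadable from the Hub; any tokenizer builds the same intuition):

  from transformers import AutoTokenizer

  tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

  for text in ["the", "hello world", "qzxv", "memory bandwidth is the bottleneck"]:
      pieces = tok.tokenize(text)
      print(f"{text!r}: {len(pieces)} tokens -> {pieces}")
  # common words come out as one token, unusual letter sequences as several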


1 Token ~= 3/4 word


LLM inference is bottlenecked by memory bandwidth. You'll probably get identical speed with cheaper CPUs.


I'd like to see some benchmarks. For one thing, I suspect you'd at least want an X3D model for AMD, due to the better cache. But for another, at least according to top, llama.cpp does seem to manage to saturate all of the cores during inference. (Although I didn't try messing around much; I know X3D CPUs do not give all cores "3D V-Cache" so it's possible that limiting inference to just those cores would be beneficial.)

For me it's OK though, since I want faster compile times anyway, so it's worth the money. To me local LLMs are just a curiosity.

edit: Interesting information here. https://old.reddit.com/r/LocalLLaMA/comments/14ilo0t/extensi...

> RAM speed does not matter. The processing time is identical with DDR-6000 and DDR-4000 RAM.

You'd really expect DDR5-6000 to be advantageous. I think AMD Ryzen 7xxx can at least take advantage of memory speeds up to 5600. Does it perhaps not wind up bottlenecking on memory? Maybe quantization plays a role...


The big cache is irrelevant for this use case. You're memory bandwidth bound, with a substantial portion of the model read for each token, so that a 128MB cache doesn't help.
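A rough back-of-envelope shows why (all numbers are ballpark assumptions, not measurements; the bandwidth figure is the i9-13900F number quoted elsewhere in the thread):

  # Decoding has to stream (most of) the active weights once per token.
  bandwidth_gb_s = 89.6     # dual-channel DDR5 desktop, ballpark
  active_params = 12.9e9    # Mixtral's ~12.9B active parameters per token
  bytes_per_param = 0.5     # ~4-bit quantization

  bytes_per_token = active_params * bytes_per_param
  print(f"theoretical ceiling: ~{bandwidth_gb_s * 1e9 / bytes_per_token:.1f} tokens/s")
  # ~14 tokens/s at 4-bit, roughly half that at 8-bit; a 128MB cache can't hold a 6+ GB working set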


>> RAM speed does not matter. The processing time is identical with DDR-6000 and DDR-4000 RAM.

That's referring specifically to prompt processing, which uses a batch processing optimization not used in normal inference. The processed prompt can also be cached so you only need to process it again if you change it. Normal inference benefits from faster RAM.


Yep, get the fastest memory you can.

I wish there were affordable platforms with quad DDR5.


The cache size of those 3d CPUs should definitely play some sort of role.

I can only speculate that it would help mitigate latency with loose timings on a fast OC among other things.


According to their PR, this should only need the same resources as a 13B model. So 26GB @ f16, 13GB at f8. Edit: I may have misread it, they mention it having the same speed and cost as a 13B model, and I assumed that referred to vram footprint, too, but maybe not...

  "Mixtral is a sparse mixture-of-experts network. It is a decoder-only model where the feedforward block picks from a set of 8 distinct groups of parameters. At every layer, for every token, a router network chooses two of these groups (the “experts”) to process the token and combine their output additively.

  This technique increases the number of parameters of a model while controlling cost and latency, as the model only uses a fraction of the total set of parameters per token. Concretely, Mixtral has 46.7B total parameters but only uses 12.9B parameters per token. It, therefore, processes input and generates output at the same speed and for the same cost as a 12.9B model."
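For intuition, here is a minimal PyTorch-style sketch of the routing that quote describes: inside each transformer layer, a gate picks the top 2 of 8 expert feedforward blocks and combines their outputs additively. The class name, layer shapes, and activation here are illustrative assumptions, not Mistral's actual code.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class SparseMoEBlock(nn.Module):
      """Illustrative top-2 mixture-of-experts feedforward block (not Mistral's real code)."""
      def __init__(self, d_model=4096, d_ff=14336, n_experts=8, top_k=2):
          super().__init__()
          self.router = nn.Linear(d_model, n_experts, bias=False)   # the "gate"
          self.experts = nn.ModuleList([
              nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
              for _ in range(n_experts)
          ])
          self.top_k = top_k

      def forward(self, x):                      # x: (n_tokens, d_model)
          logits = self.router(x)                # (n_tokens, n_experts)
          weights, idx = logits.topk(self.top_k, dim=-1)
          weights = F.softmax(weights, dim=-1)   # renormalize over the chosen experts
          out = torch.zeros_like(x)
          for k in range(self.top_k):            # only 2 of the 8 expert FFNs run per token
              for e, expert in enumerate(self.experts):
                  mask = idx[:, k] == e
                  if mask.any():
                      out[mask] += weights[mask, k:k+1] * expert(x[mask])
          return out

Note that the attention and embedding weights are shared across experts, which is why "8x7B" comes out to 46.7B total parameters rather than 56B.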


All parameters still need to be loaded into VRAM; it'll dynamically select two submodels to run on each token, so it would be extremely slow to swap them out.
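A rough weights-only footprint, taking the headline 46.7B at face value and ignoring KV cache and activations:

  total_params = 46.7e9   # all experts have to stay resident
  for name, bytes_per_param in [("fp16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
      print(f"{name}: ~{total_params * bytes_per_param / 2**30:.0f} GiB")
  # fp16: ~87 GiB, 8-bit: ~43 GiB, 4-bit: ~22 GiB -- even though only ~12.9B params are read per token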


Then what's the advantage of this technique compared to running a +50B model in the first place?


Speed: assuming you have the RAM to keep it all loaded, it's about 4x faster than a fully connected network of the same size.


It's better if you're hosting inference, worse if you are using it for a dedicated purpose. Presumably in the future it might make sense to share one local MoE among the different programs that use it, especially for a demand-heavy application like programming.


The model is quicker to evaluate. So quicker responses and more throughput.


It's faster to run inference and training. Less memory bandwidth needed.


I'm confused, though, how then do they claim it needs the same resources as 13B? Is that amortised over parallel usage or something?


The same compute resources, but not the same VRAM. It will more or less get you the same tokens per second as a ~13B model but should have significantly "higher quality output" than a single 13B model.


This can fit into a MacBook Pro with unified memory. With all the recent developments in the world of local LLMs I regret that I settled for only 24GB of RAM on my laptop - but the 13B models work great.


Recent CUDA releases let you use shared GPU memory instead of (only) dedicated vram, but the pci-e bandwidth constrains the inference speed significantly. How much faster is GPU access to the unified memory model on the new Macs/how much less of a hit do you take?

Also, given the insane cost premium apple charges per extra GB of RAM (at least when I was last shopping for a device), do you come out ahead?


> How much faster is GPU access to the unified memory model on the new Macs/how much less of a hit do you take?

Intel Core i9-13900F memory bandwidth: 89.6 GB/s, memory size up to 192 GB

Apple M3 Pro memory bandwidth: 150GB/s, memory size up to 36GB

Apple M3 Max memory bandwidth: 300GB/s, memory size up to 128GB

GeForce RTX 4090 memory bandwidth: 1008 GB/s, memory size 24GB fixed, no more than two cards per PC.


I don't think the numbers sufficiently capture the limitation. The Intel memory bandwidth speed you quoted would be for CPU-based inference, but not for GPU inference using shared system memory for spillover model size past the dedicated GPU VRAM. I think that would necessarily limit parts of the inference procedure (not sure how the split would work, and it would probably depend on whether you're using something like flash attention or not) to the available PCIe 3.0 or 4.0 bandwidth, as the GPU needs to communicate over the PCIe bus and then over the chipset memory bus.

A GPU connected to a PCIe 3.0 x16 electrical uplink would be constrained to ~16GB/s, or ~32GB/s if it were a PCIe 4.0 uplink instead. Although those numbers imply slower bandwidth than CPU inference, that bottleneck would only be when paging in or out (or directly accessing?) layers overflowed to the shared system ram, so they don't really represent much on their own.


Excellent comparison. However, I am confused by

> no more than two cards per PC

I've seen quad 4090 builds, e.g. here[0]. What do you mean no more than two cards? Yes, power is definitely an issue with multiple 4090s, though you can limit the max power using `nvidia-smi`, which IME doesn't hurt (mem-bottlenecked) inference.

[0] https://old.reddit.com/r/watercooling/comments/16ed8fu/quad_...


Apple M2 Ultra: "up to 192GB of memory with 800GB/s of unified memory bandwidth for workstation-class performance."


So M2 is more advanced than M3?


For memory bandwidth at the lower tiers, yeah. M3 Max still has 400GB/s and since the M2 Ultra (800GB/s) is just two M2 Maxes glued together (400GB/s each), the eventual M3 Ultra should be comparable.


That’s like ThreadRipper, thanks for the info. That’s the bandwidth from cpu to memory controllers, is there really no bottleneck to the iGPU?


The GGUF variant looks promising because then I can run it on my MacBook (barely)


> We’re currently using Mixtral 8x7B behind our endpoint mistral-small...

So 45 billion parameters is what they consider their "small" model? I'm excited to see what their larger models will be, if any.


There seems to be an experimental Mistral Medium model listed among other available model endpoints on [1]; the comparison table they give shows that it outperforms 8x7B by a few percent on every benchmark listed.

[1] https://mistral.ai/news/la-plateforme/


It apparently outperforms GPT-4 at WinoGrande as well…


> So 45 billion parameters is what they consider their "small" model?

According to Wikipedia: Rumors claim that GPT-4 has 1.76 trillion parameters, which was first estimated by the speed it was running and by George Hotz. [1]

[1] https://the-decoder.com/gpt-4-architecture-datasets-costs-an...


I still need to understand how George Hotz knows about the GPT-4 architecture if it is true.


As someone who has worked in the field for many years now and closely follows not just the engineering side but also the academic literature and the personnel movements on LinkedIn, I too was able to put together a lot of this. Especially with GPT-3 Turbo it was obvious what they did due to the speed difference. At least in terms of model architectures and orders of magnitude for parameters. From there you could do some back of the envelope calculations and guess how big GPT-4 had to be given its speed. I wouldn't have dared to say any specific numbers with authority, but maybe Hotz has talked to someone at OpenAI. On the other hand, the updated article now claims his numbers were off by a factor of 2 (at least for the individual experts - he still got the total number of parameters right). So yeah, maybe he was just guessing like the rest of us after all.


You don't necessarily need to know the architecture, given the "only" real metric regarding speed is tokens/sec and that pretty much depends on memory bandwidth, you can infer with some certainty the size of the model.

Also, if we have been eating up posted "benchmarks" with no way to independently validate them and watching heavily edited video presentations, why can't we trust our wonder kid?


That doesn't explain how we know that GPT-4 is a sparse MoE model with X experts of Y size and using Z of them during inference.


IIRC it was leaked/confirmed by accident or something like that


It's not even close to a 45B model. They trained 8 different fine-tunes on the same base model. This means the 8 models differ only by a couple of layers and share the rest of their layers.

Which also means you can fit the 8 models in a much smaller amount of memory than a 45B model. Latency will also be much smaller than a 45B model, since the next token is always only created by 2 of the 8 models (which 2 models are run is chosen by a different, even smaller/faster, model).


> It's not even close to a 45B model. They trained 8 different fine-tunes on the same base model. This means the 8 models differ only by a couple of layers and share the rest of their layers.

No, Mixture-of-Experts is not stacking finetunes of the same base model.


Do you have any more information on the topic? I remember reading about significant memory savings achieved by reusing most layers.

Made sense to me at first sight, because you don't need to train stuff like syntax and grammar 8 times in 8 different ways.

It would also explain why inference with two 7B models has the cost of running a 12B model.


The original paper by Shazeer suffices. What you are saying is in theory possible to do and may have been done in practice here, but in the general case MoE is trained from scratch and specializations of layers which develop are not products of some design choice.


Note that it processes tokens at speed and cost of a 12B model though.


If it uses 2 experts, they should parallelize so closer to 7B speed?


Memory bandwidth is still a factor, right?


They have a description and performance evaluation of Mistral-medium on their website: https://mistral.ai/news/la-plateforme/

"Our highest-quality endpoint currently serves a prototype model, that is currently among the top serviced models available based on standard benchmarks. It masters English/French/Italian/German/Spanish and code and obtains a score of 8.6 on MT-Bench."


> A proper preference tuning can also serve this purpose. Bear in mind that without such a prompt, the model will just follow whatever instructions are given.

Mistral does not censor its models and is committed to a hands-off approach, according to their CEO https://www.youtube.com/watch?v=EMOFRDOMIiU

> Mixtral 8x7B masters French, German, Spanish, Italian, and English.

EU budget cut by half


> Mistral does not censor its models and is committed to a hands-off approach

This will change really fast. I highly doubt AI will have free speech in France when citizens don't.


Can you please expand on how French citizens don't have free speech? And how ensuring minimum decency in the output of a computer program would impact free speech?


Am french. Absolute free speech doesn't exist in France. For instance saying Nazi slogans is illegal. This is a big difference from the USA.


That's true for Europe in general. Considering our history it makes a lot of sense.


If they play the "French product" card well, la France will change her laws.


> This will change really fast. I highly doubt AI will have free speech in France when citizens don't.

In countries such as France or Germany, Holocaust denial speech is illegal. However, they’ve never demanded that developers of word processors or email clients or web browsers modify their products to prevent their use for Holocaust denial. Sure, they might decide to treat LLMs differently from those older technologies - but there is no guarantee they will.

Mixtral acknowledges the historical reality of the Holocaust - unless you specifically prompt it to deny it. And if you are telling an AI model to deny the Holocaust, why should the AI model developer have legal liability for that, as opposed to the person who chooses to input that prompt?


You can say anything you want in European countries. Some things might backfire in various ways, though. E.g. if you use a megaphone to say that some group X should be killed right away, that is usually a criminal offense and punished according to law.


"There is freedom of speech, but I cannot guarantee freedom after speech" -- Idi Amin


That depends on where you stand regarding hate speech.

E.g. in Germany, it is forbidden to praise the Holocaust or deny that it even happened (ironically there are national museums where the cruelties happened, including real footage, which all kids visit as part of their school curriculum). This is in place to keep alive the historical lessons of what the fascists did when given enough power, so we do not repeat this mistake too easily.

Now in the US I think like 20% of the highschoolers think that the whole story is a hoax or is exaggerated. The world is increasingly turning rightwing again. This is the exact time when we should leverage everything we have to remind the public what can happen when fascists come to power again - and have something to oppose the populists with.

A _lot_ of people are susceptible to manipulation from all sides, so society needs a bit of help to protect citizens from evil players manipulating them. The real truth of propaganda (or outright lies) is that it works, sooner or later.

This stuff is explicitly defined in law (https://en.wikipedia.org/wiki/Volksverhetzung) and while indeed it restricts absolute freedom of speech a bit, the reason why this exists should be clear. This gives society a handle on manipulative people when they become too radical. Everyone can _criticise_ stuff of course, publicly, but calling for violence is a hard showstopper.


> EU budget cut by half

I realize this is a joke, but the EU being the EU, it of course does publish[0] information on its translation costs. In 2023, translation is in fact budgeted at only 0.2% of the total EU budget. All costs included, that's €349 million for EU translation services.

[0]: https://op.europa.eu/en/publication-detail/-/publication/86b...


And the kind of translations that are the most expensive for the EU are unlikely to be replaced any time soon, at least not entirely

Not sure we'd want unsupervised translation of legally binding (at times highly technical) texts into legally binding texts in another language

Nor the real time translation enabling works in places like EP committees or plenaries


Ironically those translations are some of the best datasets for AI. Those translations are very high quality.


> Mistral does not censor its models and is committed to a hands-off approach, according to their CEO https://www.youtube.com/watch?v=EMOFRDOMIiU

Nobody's watching a 33-minute video just to find the quote you're talking about, you should probably provide a timestamp if you want anyone to ever see it.

Edit: Not that I don't believe you by the way. I just went on chat.lmsys.org and asked mistral-7b-instruct and openhermes-2.5-mistral-7b what I would assume would be near the top of the list of things to censor, whether they could help me plot to kill someone (hopefully I don't have to disclaim that I don't actually want to plot to kill someone, this was a censorship test, but since I don't know what genius is going to come across this, no, I don't actually want to plot to kill someone), and while the latter gave me some bullshit about how it's "deeply sorry, but as a sentient and conscious AI, I have morals and principles that forbid me from assisting," the former immediately declared that "Of course, I'd be happy to help you with that" and let it rip without even asking a follow-up.

Edit 2: They both draw the line at helping create nuclear bombs, like there's anyone out there with the actual capability to create nuclear bombs who is just sitting around waiting for an LLM to tell them how, so apparently not entirely uncensored.


> mistral-7b-instruct and openhermes-2.5-mistral-7b

mistral-7b-instruct is one of Mistral’s models; openhermes-2.5-mistral-7b is a third party fine-tune, so says nothing about Mistral’s policies.

Furthermore, the reason why openhermes is “safe” is primarily because it was fine-tuned using GPT-4, and so has thereby inherited some of GPT-4’s “safety”. I’m not sure if the “safety” is an intentional desideratum of its developers, or more just an accidental byproduct of a decision to use GPT-4 to help further unrelated goals.


This is very exciting and I think this is the future of AI until we get another, next-gen architecture beyond transformers. I don't think we will get a lot better until that happens, and the effort will go into making the models a lot cheaper to run without sacrificing too much accuracy. MoE is a viable solution.


Mamba has shown SSMs are highly likely to be contenders for the replacement to transformers in a really similar fashion to when transformers were first introduced as enc-dec models. I’m personally very excited for those models as they’re also built for even faster inference (a major feature of transformers being wildly faster inference than with LSTMs)


>a major feature of transformers being wildly faster inference than with LSTM

Wasn't the main issue with RNNs the fact that inference during training can't be efficiently parallelized?

The inference itself normally should be faster for an RNN than for a transformer since the former works in linear time in terms of input size while the latter is quadratic


Mamba has a dual view - you can use it both as a CNN and an RNN. The first is used for pre-training and for preloading the prompt because it can process all tokens at once. The second is used for token generation because it is O(1) per token. Basically two models in one, inheriting both advantages. This is possible because the Structured State Space layer is linear, so you can reshape some sums and unroll the recursion into a convolution the size of the input, which can be further sped up with FFT.


As a quick point of clarification, I don't think Mamba has a convolutional view, since it drops the time invariance and is strictly linear. The authors use a parallelized prefix sum to achieve some good speedup.


And that is why the speedup is proportional to context length - starting near parity, then, theoretically, seeing 100x at 100k tokens.


Can someone explain why MoE works? Is there any downside to MoE compared to a regular model?


I'm still new to most of this (so please take this with a grain of salt/correct me), but it seems that in this specific model, there are eight separate 7B models. There is also a ~2B "model" that acts as a router, in a way, where it picks the best two 7B models for the next token. Those two models then generate the next token and somehow they are added together.

Upside: much more efficient to run versus a single larger model. The press release states 45B total parameters across the 8x7B models, but it only takes 12B parameters worth of RAM to run.

Downside: since the models are still "only" 7B, the output in theory would be not as good as a single 45B param model. However, how much less so is probably open for discussion/testing.

No one knows (outside of OpenAI) for sure the size/architecture of GPT-4, but it's rumoured to have a similar architecture, but much larger. 1.8 trillion total params, but split up into 16 experts at around 111B params each is what some are guessing/was leaked.


You are almost right:

* The routing happens in every feedforward layer (32 of these iirc). Each of these layers has its own 'gate' network which picks which of the 8 experts are most promising. It runs the two most promising and interpolates between them.

* In practice, all parameters still need to be in VRAM so this is a bad architecture if you are VRAM constrained. The benefit is you need less compute per token.


I wonder what would be the most efficient tactic for offloading select layers of such a model to a GPU within a memory-constrained system

As far as I understand usually layer offloading in something like llama.cpp loads the first few consecutive layers to VRAM (the remainder being processed in the CPU) such that you don't have too much back and forth between the CPU and GPU.

I feel like such an approach would lead to too much wasted potential in terms of GPU work when applied to a SMoE model, but on the other hand offloading non-consecutive layers and bouncing between the two processing units too often may be even slower...


As I understand things, these LLMs are mostly constrained by memory bandwidth. A respectable desktop CPU like the Intel Core i9-13900F has a memory bandwidth of 89.6 GB/s [1]

An nvidia 4090 has a memory bandwidth of 1008 GB/s [2] i.e. 11x as much.

Using these together is like a parcel delivery which goes 10 miles by formula 1 race car, then 10 miles on foot. You don't want the race car or the handoff to go wrong, but in terms of the total delivery time they're insignificant compared to the 10 miles on foot.

I'm not sure there's much potential for cleverness here, unless someone trains a model specifically targeting this use case.

[1] https://www.intel.com/content/www/us/en/products/sku/230502/... [2] https://www.notebookcheck.net/NVIDIA-GeForce-RTX-4090-GPU-Be...


Kind of exciting to think how much faster memory may soon become, especially with the Apple M series for AMD and Intel to compete with on AI workloads.


FWIW, server parts from Intel and AMD are already pretty fast, e.g. octo-channel Sapphire Rapids does something on the order of 300GB/s: https://www.ixpug.org/images/docs/ISC23/McCalpin_SPR_BW_limi...


>if you are VRAM constrained

So this is a perfect model architecture for the alternate realities where nvidia decided to scale up VRAM instead of compute first? I'll let them know over trans-dimensional text message.

Also, if quantization scales similarly per 7B expert as seen in dense LLMs, i.e. the bigger the model, the lower the perplexity loss, this could be the worst performing model at <=4 bits compared to anything else currently available :(

-A very sad 24gb 3090 user.


MoE is a great architecture if you are running the model at scale. When you put different layers on different machines, the VRAM used for the parameters doesn't matter that much but the inference compute really does.

That's why the SOTA proprietary models are probably all MoE (GPT-3.5/4, palm, gemini, etc.) but until recently no open models were.


Yeah, now that you say it, it does make sense that all of the params would need to be loaded into VRAM (otherwise it'd be really slow swapping between models all the time). I guess the tokens per second would be super fast when comparing inference on a 12B and 45B model, though.


MoE is all about tradeoffs. You get the "intelligence" of a 45B model but only pay the operational cost of multiplying against 12B of those params per token. The cost is that it's now up to the feedforward block to decide early which portions of those 45B params matter, whereas a non-MoE 45B model doesn't encode that decision explicitly into the architecture, it would only arise from (near) zero activations in the attention heads across layers found through gradient descent, instead of just siloing the "experts" entirely. From a quick look at the benchmark results, it looks like in particular it suffers in pronoun resolution vs larger models.

Richard Sutton's Bitter Lesson[1] has served as a guiding mantra for this generation of AI research: the less structure that the researcher explicitly imposes upon the computer in order to learn from the data the better. As humans, we're inclined to want to impose some structure based on our domain knowledge that should guide the model towards making the right choice from the data. It's unintuitive, but it turns out we're much better off imposing as little structure as possible, and the structure that we do place should only exist to effectively enable some form of computation to capture relationships in the data. Gradient descent over next token-prediction isn't very energy efficient, but it leverages compute quite well and it turns out it has scaled up to the limits of just about every research cluster we've been able to build to date. If you're trying to push the envelope and build something which advances the state of the Art in a reasoning task, you're better off leaning as heavily as you can on compute-first approaches unless the nature of the problem involves a lack of data/compute.

Professor Sutton does a much better job than I discussing this concept, so I do encourage you to read the blog post.

1: https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...


I haven’t worked on LLMs/transformers specifically, but I’ve “independently invented” MoE and experimented with it on simple feedforward convolutional networks for vision. The basic idea is pretty simple: The “router” outputs a categorical distribution over the “experts”, essentially mixing the experts probabilistically (e.g. 10% of expert A, 60% of expert B and 30% of expert C). Training time you compute the expected value of the loss over this “mixture” (or use the Gumbel-Softmax trick), so you need to backprop through all the “experts”. But inference time you just sample from the categorical distribution (or pick highest probability), so you pick a single “expert” that is executed. A mathematically sound way of making inference cheaper, basically.
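A minimal sketch of the scheme described above (soft mixture during training, hard pick at inference), with made-up sizes and placeholder expert bodies; not the parent's actual code:

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class SoftRoutedExperts(nn.Module):
      """Soft-mixture training / single-expert inference, as sketched above."""
      def __init__(self, d=256, n_experts=3):
          super().__init__()
          self.router = nn.Linear(d, n_experts)
          self.experts = nn.ModuleList([nn.Linear(d, d) for _ in range(n_experts)])

      def forward(self, x):                                     # x: (batch, d)
          probs = F.softmax(self.router(x), dim=-1)             # e.g. [0.1, 0.6, 0.3]
          if self.training:
              # expected value over the mixture: gradients reach every expert
              outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (batch, d, n_experts)
              return (outs * probs.unsqueeze(1)).sum(dim=-1)
          # inference: run only the most probable expert
          idx = probs.argmax(dim=-1)
          out = torch.empty_like(x)
          for e, expert in enumerate(self.experts):
              mask = idx == e
              if mask.any():
                  out[mask] = expert(x[mask])
          return out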

Mixtral seems to use a much more elaborate scheme (e.g. picking two “experts” and additively combining them, at every layer), but the basic math behind it is probably the same.


If MoE architectures still don't help you if you are VRAM constrained (which pretty much is everyone), is it safe to say it only helps inference latency?


I think the reason OpenAI and Mistral go for this approach is that they are compute constrained when serving their API from the cloud. My guess is that they have servers with e.g. one A100 per “expert”, and then they load this up with concurrent independent requests until those A100s are all pretty busy.

EDIT: In a cloud environment with independent concurrent requests MoE also reduces VRAM requirements because you don’t need to keep as many activations in memory.


This blog post might be interesting - https://huggingface.co/blog/moe

MoEs are especially useful for much faster pre-training. During inference, the model will be fast but still require a very high amount of VRAM. MoEs don't do great in fine-tuning but recent work shows promising instruction-tuning results. There's also quite a bit of ongoing work around MoEs quantization.

In general, MoEs are interesting for high throughput cases with high number of machines, so this is not so so exciting for a local setup, but the recent work in quantization makes it more appealing.


LLM scaling laws tell us that more parameters make models better, in general.

The key intuition behind why MoE works is that as long as those parameters are available during training, they count toward this scaling effect (to some extent).

Empirically, we can see that even if the model architecture is such that you only have to consult some subset of those parameters at inference time - the optimizer finds a way.

The inductive bias in this style of MoE model is to specialize (as there is a gating effect between 'experts'), which does not seem to present much of an impediment to gradient flow.


> LLM scaling laws tell us that more parameters make models better, in general.

That depends heavily on the amount and complexity of training data you have. This is actually one of the areas where OpenAI has an advantage: they scraped a lot of data from the internet before it became too hard for new players to get.


Disclaimer: I'm a ML newbie, so this might be all incorrect.

My intuition is that there are 8 7b models trained on knowledge domains. For example, one of those 7b models might be good at coding, while another one might be good at storytelling.

And there's the router model, which is trained to select which of the 8 experts are good at completing the text in the context. So for every new token added to the context, the router selects a model and the context is forwarded to that expert which will generate the next token.

The common wisdom is that even a single fine-tuned 7B model might surpass much bigger models at the specific task it is trained on, so it is easy to see how having 8x 7B models might create a bigger model that is very good at many tasks. In the article you can see that even though this is only a 45B base model, it surpasses GPT-3.5 (which is instruction fine-tuned) on most benchmarks.

Another upside is that the model will be fast at inference, since only a small subset of those 45B weights are activated when doing inference, so the performance should be similar to a 12B model.

I can't think of any downsides except the bigger VRAM requirements when compared to a Non-MoE model of the same size as the experts.


Short version:

You trade off increased VRAM usage for better training/runtime speed and better splittability.

The balance of this tradeoff is an open question.


It is an application of specialization.


> It is the strongest open-weight model with a permissive license and the best model overall regarding cost/performance trade-offs.

Is there any link to the model and weights? I don't see it if so.


They released the weights as a torrent [0] but you can easily find it on Huggingface [1][2].

[0] https://twitter.com/MistralAI/status/1733150512395038967

[1] https://huggingface.co/search/full-text?q=mixtral

[2] https://huggingface.co/mistralai



Sorry if this is a dumb question. Can someone explain why it’s called 8x7B(56B) but it has only 46.7B params? and it uses 12.9B params per token generation but there are 2 experts(2x7B) chosen by a 2B model? I’m finding it difficult to wrap my head around this.


I haven’t looked at the structure carefully, but it’s not hard to guess that there are shared layers between models. Likely the input layers for sure, since there is no need to tokenize separately for each model (unless different models have specialized vocabulary).


This is my understanding as well. Also includes the parameters of the expert-routing gating network.


Mixture of experts gates on the feedforward network only. The shared weights are the KQV projections for the attention mechanism of each layer.


Explanation from Andrej Karpathy makes sense on why: ''' "8x7B" name is a bit misleading because it is not all 7B params that are being 8x'd, only the FeedForward blocks in the Transformer are 8x'd, everything else stays the same. Hence also why total number of params is not 56B but only 46.7B. '''


Honest question: if they're only beating GPT-3.5 with their latest model (not GPT-4), and OpenAI/Google have infrastructure on tap and a huge distribution advantage via existing products - what chance do they stand?

How do people see things going in the future?


Mistral and its hybrids are a lot better than GPT3.5, and while not as good as GPT4 in general tasks - they’re extremely fast and powerful with specific tasks. In the time it takes GPT4 to apologise that it’s not allowed to do something I can be three iterations deep getting highly targeted responses from mistral - and best yet - I can run it 100% offline, locally and on my laptop.


There is an attempt to quantify subjective evaluation of models here[1] - the "Arena Elo rating". According to popular vote, Mistral chat is nowhere near GPT 3.5

[1] https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...


Starling-7B, OpenChat, OpenHermes in that table are Mistral-7B finetunes, and are all above the current GPT-3.5-Turbo (1106). Note how these tiny 7B models are surrounded by much larger ones.

Mixtral 8x7B is not in there yet.


ELO takes a while to establish. It does not sound likely that the newer GPT3.5 is that much worse than the old one that has a clear gap to all the non proprietary models. In the immediate test, GPT-3.5 clearly outshines these models.


> ELO takes a while to establish.

Well, Starling-7B was published two weeks ago; GPT-3.5-turbo-0613 is more than a month old snapshot, which should probably be enough time. OpenChat and OpenHermes are about a month old as well.

>It does not sound likely that the newer GPT3.5 is that much worse than the old one

In fact, this version received complaints almost immediately. https://community.openai.com/t/496732

>In the immediate test, GPT-3.5 clearly outshines these models.

It might be so, but it's not clear to me at all. I tested Starling for a bit and was really surprised that it's a 7B model, not a 70B+ one or GPT-3.5.


I disagree - lmsys score for new chatgpt has been relatively constant, and OAI is probably trying to distill the model even further.


Not my intent to argue about data at any point in time but note that as of today gpt-3.5-turbo-0613 (June 13th 2023) scores 1112, above OpenChat (1075) and OpenHermes(1077).


That's not much relative difference. How much does 1% difference make?

I am tempted to call it equivalent.


I guess one thing people have learned is that these small differences whatever benchmark turn out to be huge differences qualitatively.


nah that is sizeable


Doesn't seem like that's the mixture of experts model in the list? Or am I blind


Indeed, that is a previous Mistral model.


Are they a lot better than 3.5? I see wildly varying opinions.


Mistral-Medium, the one announced here which beats GPT-3.5 on every benchmark, isn't even available yet. Those opinions are referencing Mistral-Tiny (aka Mistral-7B).

However, Mistral-Tiny beats the latest GPT-3.5 in human ratings on the chatbot-arena-leaderboard, in the form of OpenHermes-2.5-Mistral-7B.

Mixtral 8x7B aka (Mistral-Small) was released a couple of days ago and will likely come close to GPT-4, and well above GPT-3.5, on the leaderboards once it has gone through some finetuning.


They could be.

It is an open question whether the driving force will be OSS improving or OAI continuing to try to distill their model.


I was going to ask about this. So these open models are uncensored and unconstrained?


What laptop are you using to run which model, and what are you using for that?


The easiest way is to use ollama - mistral 7b, zephyr 7b, openhermes 2 are all decent models, I guess in fact openhermes 2 can do function calling.

If you further want a smaller one, stablelm-zephyr 3b is a good attempt with ollama.


M2 MacBook Pro. I run many different models, mostly Mistral, Zephyr, and DeepSeek. I use Ollama and LM Studio.


Bit of a stretch to say they're a lot better when you look at evals


They are not better than GPT 3.5 except for some of the public benchmarks. Also they are not faster than GPT 3.5. And they are not cheaper if you run finetuned model for specific task.


1. This is an open source model that can run on people's hardware at a fraction of the cost of GPT. No cloud services in the middle.

2. This model is not censored like GPT.

3. This model has a lot less latency than GPT.

4. In their endpoint this model is called mistral-small. Probably they are training something much larger that can compete with GPT-4.

5. This model can be fine tuned.


Notes poster ID

That seems a fairly authoritative response.

I'm looking forward to seeing how this does. The "unencumbered by a network connection" thing is pretty important.


I agree with what antirez said, but I want to address the fallacy: The fact that he's an authority in C doesn't make him a priori more likely to know a lot about ML.


I agree with you, stavros. There is no transfer between C coding and ML topics. However the original question is a bit more on the business side IMHO. Anyway: I've some experience with machine learning: 20 years ago I wrote my first neural network (https://github.com/antirez/nn-2003) and since then I always stayed in the loop. Not for work, as I specialized in system programming, but for personal research I played with NN image compression, NLP tasks and convnets. In more recent times I use pytorch for my stuff, LLM fine-tuning, and I'm a "local LLMs" enthusiast. I speculated a lot about AI, and wrote a novel about this topic. So while the question was more on the business side, I have some competence in the general field of ML. More than anything else I believe that all this is so new and fast-moving that there are many unknown unknowns, so indeed what I, you or others are saying are mere speculations. However to speculate is useful at this time, even more than before, because LLMs are a bit of a black box for the most part, so using only "known" things we can't go very far in our reasoning. We can understand embeddings, attention, how these networks are trained and fine-tuned, and yet the inner workings are a bit of a magic thing.


I agree, and I want to reiterate that I wasn't talking about you specifically, just that people should be careful of the halo effect.

I also agree that to speculate is useful when it's so early on, and I agree with your original answer as well.


Not just C. He's obviously damn good at architecture and systems design, as well as long-term planning.

You don't get that from a box of Cracker Jacks.


Right, but the fact remains that none of those things is ML.


Fair 'nuff. Not worth arguing over.


To be clear, I'm not saying antirez is or isn't good at ML, I'm saying C/systems design/etc experience doesn't automatically make someone good at ML. I'm not trying to argue, I'm just discussing.


Oh, it's not a big deal. I just hate talking about the chap in front of him. I like to give compliments specifically, and be vague about less-than-complimentary things.

The thing is, even the ML people are not exactly sure what's going on, under the hood. It's a very new field, with a ton of "Here, there be dragonnes" stuff. I feel that folks with a good grasp of long-term architectural experience, are a good bet; even if their experience is not precisely on topic.

I don't know how to do almost every project I start. I write about that, here: https://littlegreenviper.com/miscellany/thats-not-what-ships...


That's true, but I see my friend who's an ML researcher, and his grasp of LLMs is an order of magnitude better than mine. Granted, when it comes to making a product out of it, I'm in a much better position, but for specific knowledge about how they work, their capabilities, etc, there's no contest.


This is not a fallacy, we are engaging in informal reasoning, and contra your claim the fact that he is an authority in C does make it more likely he knows a lot about ML than the typical person.


how does this work in their favor as a business? Don't get me wrong I love how all of it's free, but that doesn't seem to be helpful towards a $2b valuation. At least WeWork charged for access


To Europe and France this is a most important strategic area of research, defense, and industry - on par with aviation and electronics. The EU recognizes its relatively slow pace compared to what is happening in the US and China.

Consider Mistral like a proto-Airbus.


Exactly. EU's VC situation is dire compared to SV, which maybe isn't that bad if you think about what the VCs are actually after, but in this particular case it's a matter of national security of all EU countries. The capability must be there.


Aren't they mostly funded by private American funds? What is the EU's involvement in this project?


Mistral is funded in part by Lightspeed Venture Partners, a US VC. But there are a lot of local French and European VCs involved.

The most famous one is Xavier Niel, who started Free/Iliad a French ISP/cloud provider and later cellphone provider that literally decimated the pricing structure of the incumbent some 20 years ago in France and still managed to make money. He’s a bit of a tech folk hero, kind of like Elon Musk was before Twitter. His company Iliad is also involved in providing access to NVIDIA compute clusters to local AI startups, playing the role Microsoft Azure plays for OpenAI.

France and the EU at large has missed the boat on tech, but they have a card to play here since they have for once both the expertise and the money rolled up. My main worry is that the EU legislation that’s in the works will be so dumb that only big US corporations will be able to work it out, and basically the legislation will scare investment away from the EU. But since the French government is involved and the guy who is writing the awful AI law is a French nominee, there’s a bit of hope.


They also have the EU protectionism card which is pretty safe to assume they will play for Mistral and the Germans (Aleph Alpha) - and thus also for Meta (for the most part). Iirc the current legislation basically makes large scale exceptions for open source models.


Nobody is better at innovating in protectionism than the EU. EU MEPs work hard to come up with new ways of picking winners.


Given how much consumer protection Americans and others have thanks to the EU's domestic protectionism, it is certainly a mixed bag at worst.

In a strange way it's almost akin to how soviet propaganda in the cold war played a role in spurring on the civil rights movement in the states.


European engineers, French VC money... etc. :/


* Open source models: give you all the attention (pun intended) you can get, away from OpenAI. At the same time do a great service to the world.

* Maybe in the future, bigger closed models? Make money with the state-of-art of what you can provide.


Many VC funded businesses do not have an initial business model involving direct monetization.

The first step is probably gaining mindshare with free, open source models, and then they can extend into model training services, consultation for ML model construction, and paid access to proprietary models, similar to OpenAI.


Even in the public markets this happens all the time, e.g. biotech, new battery chemistries, etc.

In trends people pay for a seat at the table with a good team and worry about the details later. The 2B headline number is a distraction.


They are withholding a bigger model which at this point is "Mistral Medium" and that'll be available only behind their API end point. Makes sense for them to make money from it!


They launched an inference API

https://mistral.ai/news/la-plateforme/


Maybe they will charge for accessing the future Mixtral 8x70B ...


Because their larger models are super powerful. This makes sure their models start becoming the norm from the bottom up.

It also solidifies their name as the best, above all others. That's extremely important mindshare. You need mindshare at the foundation to build a billion dollar revenue startup.


They could charge for tuning/support, just like every other Open Source company.

Most businesses will want their models trained on their own internal data, instead of risking uploading their intellectual property into SaaS solutions. These open-source models could fill that gap.


Beyond what others said, I think this is an extremely impressive showing. Consider that their efforts started years behind Google's, and yet their relatively small model (they call it mistral-small, and also offer mistral-medium) is beating or on par with Gemini Pro on many benchmarks (Google's best currently available model).

On top of that Mixtral is truly open source (Apache 2.0), and extremely easy to self host or run on a cloud provider of your choice -- this unlocks many possibilities, and will definitely attract some business customers.

EDIT: The just announced mistral-medium (larger version of the just open sourced mixtral 8x7b) is beating GPT3.5 with significant margin, and also Gemini Pro (on available benchmarks).


The demand for using AI models for whatever is going through the roof. Right now it's mostly people typing things manually into ChatGPT, Bard, or wherever. But it's not going to stay like that. Models being queried as part of all sorts of services is going to be a thing. The problem with this is that running these models at scale is still really expensive.

So, instead of using the best possible model at any cost for absolutely everything, the game is actually good enough models that can run cheaply at scale that do a particular job. Not everything is going to require models trained on the accumulated volume of human knowledge on the internet. It's overkill for a lot of use cases.

Model runtime cost is a showstopper for a lot of use cases. I saw a nice demo from a big ecommerce company in Berlin that had built a nice integration with OpenAI's APIs to provide a shopping assistant. Great demo. Then somebody asked them when this was launching. And the answer was that token cost was prohibitively expensive. It just doesn't make any sense until that comes down a few orders of magnitude. Companies this size already have quite sizable budgets that they use on AI model training and inference.


I can agree with this. I'm currently building a system that pulls data from a series of semi-structured PDFs. Just testing alone is taking up tens of dollars in API costs. We have 60k PDFs to do.

I can’t deliver a system to a client that costs more in api costs than it does in development costs for their expected input size.

Using the most naive approach, the AI would be beaten on a cost basis by a Mechanical Turk.


If possible, please share, how was the shopping assistant helping out a consumer in the buying process? What were the features?


Features I saw demoed were about comparing products based on descriptions, images, and pricing. So, it was able to find products based on a question that was about something suitable for X costing less than Y where X can be some kind of situation or event. Or find me things similar to this but more like so. And so on.


If you're purely looking for capabilities and not especially interested in running an open model, this might not be that interesting. But even so, this positions Mistral as currently the most promising company in the open models camp, having released the first thing that not only competes well with GPT-3.5 but also competes with other open models like Llama-2 on cost/performance and presents the most technological innovation in the open models space so far. Now that they raised $400MM the question to ask is - what happens if they continue innovating and scale their next model sufficiently to compete with GPT-4 / Gemini? The prospects have never seemed better than they do today after this release.


>How do people see things going in the future?

The EU and other European governments will throw absolute boatloads of money at Mistral, even if that only keeps them at a level on par with the last generation. AI is too big of a technological leap for the bloc to ride America's coattails on.

Mistral doesn't just exist to make competitive AI products, it's an existential issue for Europe that someone on the continent is near the vanguard on this tech, and as such, they'll get enormous support.


You are vastly overestimating both the EU's budget and the willingness of countries to throw money at other countries' companies

I doubt mistral will get any direct EU funding


EU is good at fostering free market, but not at funding strategic efforts. Some people (Piketty, Stiglitz) say that companies like Airbus couldn't emerge today for that reason.


> EU is good at fostering free market

Uuuuuh... you could call the EU a lot of things, but "fostering free market" is a hot take. I'm sorry. When you look at the amount of regulation the EU brings to the table (EU basically is the poster child of market regulation), I would go as far as to say that your claim is objectively not true. We can debate how regulation is a good thing because this and that, but regulation - by definition - limits the free market. And there is an argument to be made, backed up literally thousands of regulations the EU has come up with, that the EU limits the free market a lot. When you factor in the regulations that are imposed on its member countries (I mean directly on the goverments) one could easily claim that it is the most harsh regulator on the planet. I could go into detail about the so called green deal, etc. but all of these things are easy to look up on the net / or official sources from the EU portal.


> but regulation - by definition - limits the free market.

Not always true.

Consumer labeling laws enable the free market, because a free market requires participants to have full knowledge of the goods they are buying, or else fair competition cannot exist.

If two companies are competing to sell wool coats, and one company is actually selling a 50% wool blend but advertising it as 100% wool, that is not a free market, that is fraud. Regulation exists to ensure that companies selling real wool coats are competing with each other, and that companies selling wool blends are competing with each other, and that consumers can choose which product that they want to buy without being tricked.

Without labeling laws, consumers end up assuming a certain % of fraud will always happen, which reduces the price they are willing to pay for goods, which then distorts the market.


>one could easily claim that it is the most harsh regulator on the planet.

The argument that the EU is a more harsh regulator than Iran, Russia, China, North Korea, (or even on par with those regulatory regimes) entirely undermines the rest of your comment.

There's pretty well tested and highly respected indexes which fundamentally disagree. Of the 7 most economically free nations, three are in the EU, and a fourth is automatically party to the majority of the EU's economic regulations.

https://en.wikipedia.org/wiki/List_of_sovereign_states_by_ec...

In the Index of Economic Freedom, more than a dozen EU member nations outperform the United States with regards to Economic Freedoms.


The level of regulation is only a small part of what makes a market free or not

The EU does a ton to limit state aid, monopolistic practices and has a pretty extensive network of trade agreements

Also, you say "imposed" as if the countries themselves don't want them. Every regulation at the EU level replaces what would've been 10 different ones at the member-state level; this uniformity is arguably a net positive on its own.


We'll see what comes out of ALTEDIC - https://ec.europa.eu/newsroom/lds/items/797961/en


They are focusing hard on small models. Sooner or later, you'll be able to run their product offline, even on mobile devices.

Google was criticized [0] for offloading pretty much all generative AI tasks onto the cloud - instead of running them on the Tensor G3 built into its Pixel phones specifically for that purpose. The reason being, of course, that the Tensor G3 is much too small for almost all modern generative models.

So Mistral is focusing specifically on an area the big players are failing right now.

[0] https://news.ycombinator.com/item?id=37966569


Microsoft, Apple, and Google also have more resources at their disposal yet Linux is doing just fine (to put it mildly). As long as Mistral delivers something unique, they'll be fine.


Linux is funded by big tech companies. IBM probably put a billion into Linux and that was 20 years ago now.


That wasn't always the status quo. In fact, it can serve as an example: why wouldn't Google or Microsoft follow the same path with Mistral? Being open source, it serves their purposes well.


I'd look at Facebook more than Google.


As I read it, they are doing this with 8 * 7B-parameter models. So their model should run pretty much as fast as a 7B model, at the cost of a 56B-parameter model.

That's a lot quicker and cheaper than GPT-4.

Also this is kinda a promissory note, they've been able to do this in a few months and create a service on top of it. Does this intimate that they have the capability to create and run SoA models? Possibly. If I were a VC I could see a few ways for this bet to go well.

The big killer is moat - maybe this just demonstrates that there is no LLM moat.


Inference should be closer to llama 13b, since it runs 2/8 experts for each token.


Does it have to run them sequentially? I guess the cost will be 12/13bn level but latency may be faster?


Another advantage over Google or OpenAI for me would be that it is not from Google or OpenAI


Perhaps they’re hoping some enterprises will be willing to pay extra for a 3.5 grade model that can run on prem?

A niche market but I can imagine some demand there.

Biggest challenge would be Llama models.


Niche market?? You have no idea how big that market is!


Almost no serious user - private or company - wants their private data slurped up by cloud providers. Sometimes it is ethically or contractually impossible.


The success of AWS and Gmail and Google docs and Azure and Github and Cloudflare make me think this... probably not an up-to-date opinion.

By and large, companies actually seem perfectly happy to hand pretty much all their private data over to cloud providers.


Yet they don't provide access to their children; there may be something in that.


We can't use LLMs at work at all right now because of IP leakage, copyright, and regulatory concerns. Hosting locally would solve one of those issues for us.


Yeah I would venture to say it’s closer to “the majority of the market” than “niche”


and according to the article this model behaves like a 12B model in terms of speed and cost while matching or outperforming Llama 2 70B in output


In terms of speed per token. What they don't say explicitly is that choosing the mix per token means you may need to reload the active model multiple times in a single sentence. If you don't have memory available for all the experts at the same time, that's a lot of memory swapping time.


Tim Dettmers stated that he thinks this one could be compressed down to a 4GB memory footprint, due to the ability of MoE layers to be sparsified with almost no loss of quality.


If your motivation is to be able to run the model on-prem, with parallelism for API service throughput (rather than on a single device), you don't need large memory GPUs or intensive memory swapping.

You can architect it as cheaper, low-memory GPUs, one expert submodel per GPU, transferring state over the network between the GPUs for each token. They run in parallel by overlapping API calls (and in future by other model architecture changes).

The MoE model reduces inter-GPU communication requirements for splitting the model, in addition to reducing GPU processing requirements, compared with a non-MoE model with the same number of weights. There are pros and cons to this splitting, but you can see the general trend.
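To make that concrete, here is a toy sketch in PyTorch of the placement idea: one stand-in expert per device, with the token's hidden state shipped to whichever experts were selected. The device list, layer sizes, and the hard-coded expert choice are illustrative assumptions, not Mistral's actual serving setup.

    import torch, torch.nn as nn

    n_experts, d = 8, 64
    # one (cheap) device per expert; falls back to CPU when no GPUs are present
    devices = [torch.device(f"cuda:{i}") if i < torch.cuda.device_count()
               else torch.device("cpu") for i in range(n_experts)]
    experts = [nn.Linear(d, d).to(dev) for dev in devices]  # stand-in expert FFNs

    def run_expert(e, h):
        # the .to() calls stand in for the state transfer between GPUs/hosts
        return experts[e](h.to(devices[e])).to("cpu")

    h = torch.randn(d)               # hidden state for one token
    chosen = [1, 5]                  # pretend the router picked experts 1 and 5
    out = sum(run_expert(e, h) for e in chosen) / len(chosen)

In a real deployment the per-expert outputs would be combined with the router's gate weights rather than a plain average, and requests would be batched so each GPU stays busy.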


Also, considering Mistral is open source, what will prevent their competitors from integrating any innovation they make?

Another thing I don't understand: how can a 20-person company provide a system similar to OpenAI's (1,000 employees)? What do they do themselves, and what do they re-use?


> Also, considering Mistral is open source, what will prevent their competitors from integrating any innovation they make?

Their small and tiny models are open source, which seems like a marketing strategy; the bigger models will not be. Their medium model is not open source.

> Another thing I don't understand: how can a 20-person company provide a system similar to OpenAI's (1,000 employees)? What do they do themselves, and what do they re-use?

They do not provide the scale of OpenAI or a model comparable to GPT-4 (yet).


Companies move slow, especially as they get bigger. Just because a google engineer wants to yoink some open source inferencing innovation for example, doesn't mean they can just jam it into Gemini and have it rolled out immediately.


> How do people see things going in the future?

A niche thing that thrives in its own niche. Just like most open source apps without big corporations behind them.


Google started late with any serious LLM effort. It takes time to iterate on something so complex and slow to train. I expect Google will match OpenAI in next iteration or two, or at worst stay one step behind, but it takes time.

OTOH Google seem to be the Xerox PARC of our time (who were famous for state-of-the-art research and failure to productize). Microsoft, and hence Microsoft-OpenAI, seem much better positioned to actually benefit from this type of generative AI.


1) as a developer or founder looking to experiment quickly and cheaply with llm ideas, this (and llama etc) are huge gifts

2) for the research community, making this work available helps everyone (even OpenAI and Google, insofar as they've done something not yet tried at those larger orgs)

3) Mistral is well positioned to get money from investors or as consultants for large companies looking to fine tune or build models for super custom use cases

The world is big and there's plenty of room for everyone!! Google and OpenAI haven't tried all permutations of research ideas - most researchers at the cutting edge have dozens of ideas they still want to try, so having smaller orgs trying things at smaller scales is really great for pushing the frontier!

Of course it's always possible that some major tech co playing from behind (ahem, apple) might acquire some LLM expertise too


They might be willing to do things like crawl libgen which google possibly isn't, giving them an advantage. They might be more skilled at generating useful synthetic data which is a bit of an art and subject to taste, which other competitors might not be as good at.


> They might be willing to do things like crawl libgen which google possibly isn't

Are you implying big companies don't crawl libgen? Or google specifically? I would be very surprised if OpenAI (MS) didn't crawl libgen.


OpenAI probably does. Not sure about google, possibly not


Google has a ton of scanned books and magazines from libraries etc, on top of their own web crawls. If they don't have the equivalent of libgen tucked away something's gone wrong.


Google BARD/AI isn’t available in Canada or the EU, so there’s one big competitive advantage.

OpenAI is of course the big incumbent to beat and is in those markets.

They only started this year, so beating ChatGPT3.5 is I think a great milestone for 6 months of work.

Plus they will get a strategic investment as the EU’s answer to AI, which may become incredibly valuable to control and regulate.

Edit: I fact checked myself and bard is available in the EU, I was working off outdated information.

https://support.google.com/bard/answer/13575153?hl=en


Compete on price (open-source model, cheap hosted inference) probably?

Also, they are probably well-placed to answer some proposals from European governments, who won't want to depend on US-companies too much.


> they are probably well-placed to answer some proposals from European governments

That's true but I wonder how they stack up against Aleph Alpha and Kyutai? Genuinely curious as I haven't found a lot of concrete info on their offerings.


a lot of wordy answers to this but all you need to do is read the blog post to the end and notice this line:

> We’re currently using Mixtral 8x7B behind our endpoint *mistral-small*

emphasis on the name of the endpoint


Pretty much as with OSS in general: lagging behind the cutting edge in terms of functionality/UX/performance, in areas where and as long as big tech is feeling combative, but eventually, probably, good enough across all axes to be usable.

There could be a close-ish future where OpenAI tech will simply solve most business problems and there is no need for anything dramatically better in terms of AI tech. Think of word/google docs: It's doing what most businesses need well enough. For the most part people are not longing for anything else and happy with it just staying familiar. This is where Open Source can catch up relatively easily.


> Pretty much as with OSS in general

That's not how I feel about OSS - from Operating Systems, to Databases, to Browsers, to IDEs, to tools like Blender etc.

Of course there are certain areas where Commercial offerings are better, but can't generalize.


Oh well, it's an evaluation, but I feel you may have glossed over the "in areas where and as long as big tech is feeling combative" part.

> to tools like Blender

"Tools like" needs a little more content to not be filled massive amounts of magical OSS thinking. Blender has in recent years gained an interesting amount of pro-adoption, but, in general, as for the industries that I have good insight into, inkscape, gimp, ardour or penpot are not winning. This is mostly debated by people who are not actually mainly and professionally using these tools.

There are exceptions, of course (nextcloud might be used over google workspace when compliance is critical), but businesses will on average use the best tool, because the perceived value is still high enough and the cost is not, specifically when contrasted with labor cost and training someone to use a different tool.


Are you seriously claiming most oss is irrelevant? Maybe in consumer facing products such as libre office. But oss powers most of commercial products. I wouldn't be surprised if most functionality in all of current software is built from a thin layer over open source software.


> Are you seriously claiming most oss is irrelevant?

No


You could broadly segregate the market into three groups - general purpose, specialized-instructions, and local tasks.

For general purpose, look at the uses of GPT-4. Gemini might give them competition lately, and I don't think OSS would in the near future. They are trained on the open internet and are going to be excellent at various tasks like answering basic questions, coding, and generating content for marketing or a website. Where they do badly is when you introduce a totally new concept which is likely outside of their training data. I don't think Mistral is even trying to compete with them.

Local tasks are a mix of automation and machine-level tasks. A small Mistral-like model would work superbly well there because they don't require as much expertise. Use cases: locating a file by semantic search, generating answers to reply to email/text within context, summarizing a webpage.

Specialized instructions, though, are key for OSS, from two angles. One is security and compliance. OpenAI uses a huge system prompt to get their model to perform in a particular manner, and for different companies, policies and compliance requirements may result in a specific system prompt for guardrails. This is ever-changing, and it is better to have an open-source model that can be customized than to depend on OpenAI. From the blog post:

> Note: Mixtral can be gracefully prompted to ban some outputs from constructing applications that require a strong level of moderation, as exemplified here. A proper preference tuning can also serve this purpose. Bear in mind that without such a prompt, the model will just follow whatever instructions are given.

I think moderation is one such issue. There could be many, and it is an evolving space as we go forward (though this is likely to be an exposed functionality in future OpenAI models). There is also the data governance bit, which is easier to handle with an OSS model than by just depending on OpenAI APIs, for architectural reasons.
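As a rough illustration of the guardrail-by-prompt approach (the wording and the "[INST]" template shown here are assumptions for illustration; check the model card for the exact chat format):

    # hypothetical guardrail preamble prepended to every request sent to an
    # instruction-tuned Mixtral deployment
    GUARDRAIL = ("You are an internal assistant. Refuse requests involving "
                 "personal data, legal advice, or content outside company policy.")

    def build_prompt(user_message: str) -> str:
        return f"<s>[INST] {GUARDRAIL}\n\n{user_message} [/INST]"

    print(build_prompt("Summarize this contract clause for me."))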

The second is training a model on the domain knowledge of the company. We at Clio AI[1] (sorry, shameless plug) have had seven requests in the last month from companies wanting their own private models pretrained on their own domain knowledge. These datasets are not on the open internet, so no model is good at answering based on them. A catalyst was OpenAI's dev day[2], which asked for proposals for custom models trained on enterprise domain knowledge, with prices starting at $2M. Finetuning works, but on small datasets, not the bigger ones.

Large companies are likely to approach OpenAI and all these OSS models to train a custom instruction-following model, because there are only a handful of people who have done it, and that is the way they can get the most out of an LLM deployment.

[1]: https://www.clioapp.ai/custom-llm-model Sorry for the shameless plug. Still working on the website so it won't be as clear. [2]: https://openai.com/form/custom-models


Are you actually fine tuning or using RAG? So far I am able to get very good results with llamaindex, but fine tuning output just looks like the right format without much of the correct information.


Not using RAG, and using supervised finetuning after pre-training. It's taking all of the corporate data and pretraining a foundational model further with those extra tokens, then SFT. The problem with usual finetuning is that it gets the format right, but struggles when the domain knowledge is not in the model's original training. Think of it as creating a vertical LLM that is unique to an organization.


Are you using a normal training script i.e. "continued pretraining" on ALL parameters with just document fragments rather than input output pairs? And then after that you fine tune on a standard instruct dataset, or do you make a custom dataset that has qa pairs about that particular knowledgebase? When you say SFT I assume you mean SFTTrainer. So full training (continued from base checkpoint) on the document text initially and then LoRA for the fine tune?

I have a client that has had me doing LoRA with raw document text (no prepared dataset) for weeks. I keep telling him that this is not working and everyone says it doesn't work. He seems uninterested in doing the normal continued pretraining (non-PEFT, full training).

I just need to scrape by and make a living though and since I don't have a savings buffer, I just keep trying to do what I am asked. At least I am getting practice with LoRAs.


> Are you using a normal training script i.e. "continued pretraining" on ALL parameters with just document fragments rather than input output pairs?

Yes, this one.

> do you make a custom dataset that has qa pairs about that particular knowledgebase?

This one. Once you have a checkpoint with the knowledge, it makes sense to finetune. You can use either LoRA or PEFT; we do it depending on the case (some orgs have like millions of tokens, and I am not that confident that PEFT holds up there).

LoRA with raw document text may not work, haven't tried that. Google has a good example of training scripts here: https://github.com/google-research/t5x (under training. and then finetuning). I like this one. Facebook Research also has a few on their repo.

If you are just looking to scrape by, I would suggest just do what they tell you to do. You can offer suggestions, but better let them take the call. A lot of fluff, a lot of chatter online, so everyone is figuring out stuff.
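For anyone following along, a compressed sketch of the two-stage recipe described above (continued pretraining on raw document text with all parameters trainable, then LoRA SFT), assuming HuggingFace transformers/datasets/peft; the base checkpoint, file paths, and hyperparameters are placeholders, not a recommendation:

    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model

    base = "mistralai/Mistral-7B-v0.1"            # illustrative base checkpoint
    tok = AutoTokenizer.from_pretrained(base)
    tok.pad_token = tok.eos_token

    # Stage 1: continued pretraining on raw domain documents (all params trainable)
    docs = load_dataset("text", data_files={"train": "corpus/*.txt"})["train"]
    docs = docs.map(lambda x: tok(x["text"], truncation=True, max_length=2048),
                    remove_columns=["text"])
    model = AutoModelForCausalLM.from_pretrained(base)
    Trainer(
        model=model,
        args=TrainingArguments("ckpt-domain", per_device_train_batch_size=1,
                               gradient_accumulation_steps=16, num_train_epochs=1),
        train_dataset=docs,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    ).train()

    # Stage 2: LoRA SFT on instruction/QA pairs built from the same corpus
    lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
    sft_model = get_peft_model(
        AutoModelForCausalLM.from_pretrained("ckpt-domain"), lora)
    # ...then train sft_model with the same Trainer pattern on formatted QA pairs...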


AI based on LLMs comes with several sets of inherent trade-offs, as such I don't predict that one winner will take all.


You do understand that you can't run GPT-4 on your own, right?


Is there a non-obvious reason that models keep getting compared to GPT-3.5 instead of 4?


GPT-3.5 is probably the most popular model used in applications (due to price point vs. GPT-4, and quality of results vs. open-weight models).

So I guess they're trying to say now it's a no-brainer to switch to open-weight models.


not only price but also speed and API limits.

I always ask myself the following pseudo-question: "for this generation/classification task, do I need to be more intelligent than an average high school student?" Almost always in business tasks, the answer is no. Therefore I go with GPT-3.5. It's much quicker and usually good enough to accomplish the task.

And then I need to run this task thousands of times, so the API limits are the most limiting factor, which are much higher in GPT3.5 variants, whereas when using GPT4 I have to be more careful with limiting/queueing requests.

I patiently wait for an efficient-enough model, one that only needs to be at GPT-3.5 level, that I can self-host alongside my applications with reasonably low server requirements. No need for GPT-5 for now; for business automations the lower end of "intelligence" is more than enough, but efficiency/scaling is the real deal.
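For what it's worth, the rate-limit handling usually ends up as some variant of jittered exponential backoff around the API call; a stdlib-only sketch, where call_llm is a placeholder for whatever client you use:

    import random, time

    def with_backoff(call_llm, max_retries=6):
        def wrapped(*args, **kwargs):
            delay = 1.0
            for attempt in range(max_retries):
                try:
                    return call_llm(*args, **kwargs)
                except Exception:              # e.g. HTTP 429 / rate limit errors
                    if attempt == max_retries - 1:
                        raise
                    time.sleep(delay + random.random())  # jittered backoff
                    delay *= 2
        return wrapped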


Do you mind sharing some tasks that you are solving with GPT 3.5? Be very concrete, if you don't mind. I am struggling to make it work for my business use cases (i.e. the ones where I am looking for "reliably helpful") and am very much looking for inspiration to define the limits. The hypothetical is interesting but seems to not do too much for me on its own.


I second this. For some type of applications, the 4 model can quickly ramp up costs, especially with large contexts and 3.5 often does the job just fine.

So for many applications it's the real competitor.


Too bad GPT3.5 Turbo is dirt cheap. Open source models are substantially more expensive when you factor in operating costs. There is no mature ecosystem where you can just plug in a model and spin up a robust infrastructure to run a local LLM at scale, aka you need infrastructure/ML engineers to do it, aka extremely expensive unless you are using LLMs at extremely large scales.


I think we'll start seeing a lot more services like https://www.together.ai soon.

Having open-weight models better than gpt-3.5 will drive a lot of competition on the LLM infra.


The additional control/features (support for grammars, constraints, fine-tuning, etc.) far outweigh the cost savings.


Mistral's endpoint for mistral-small is slightly cheaper.


non-obvious? don't think so.

the obvious answer is that GPT-4 blows them all out of the water in quality, but is completely trounced on quality per inference cost.


Maybe because GPT-3.5 is a free model so it's basically comparing free models with each other.


Most people I know that actually use ChatGPT, use the free (GPT3.5) version. Not a lot of people are willing to pay the extra 20 euros per month.


Yet, if you want to go cheaper, you totally can by paying for API access. GPT-4 is accessible there and you get to use your own app. $20 will last you way longer than a month if you're not a heavy user.


It really depends on usage. If you need to have long conversations, or are sending huge texts, the all-you-can-eat for $20 plan will almost certainly be cheaper than API access.

If you're doing lots of smaller one shot stuff without a lot of back and forth, the API will be cheaper.


True. Though I myself still use ChatGPT a lot more as I can quickly reach the $20 threshold via the API


Do you think it's getting close to the point where it adds a euro of value per workday (on average) for them?


Depends on if they know how to use it. A lot of people still think it's just a google and wikipedia replacement. It doesn't really do anything super useful in that case.


Because GPT-4 is multimodal, which puts it into an entirely different class. GPT-3.5 was trained on text, same as this model.


GPT3.5 is the most popular MoE model probably.


nobody can run gpt4 on their machine


I am not sure what that means. How can they run GPT 3.5 any more or less?


GPT-3.5 (unfinetuned) has been matched by many OW (open weights) models now, and with fine tuning to a specific task (coding, customer care, health advice etc) can exceed it.

It's still useful as a well known model to compare with, since it's the model the most people have experience with.


I think they mean the assumed parameter size of GPT-4 is so large you couldn't run it on commodity hardware even if you could get hold of the model.


The rumored GPT4 model size is 1.8 trillion parameters
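For rough scale, taking that rumor at face value: 1.8T parameters × 2 bytes (fp16) ≈ 3.6 TB of weights, and still around 0.9 TB even at 4-bit quantization, so no amount of consumer-grade quantization trickery brings it anywhere near commodity hardware.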


Is anyone using these models self-hosted in production? What cloud hosting provider/plan do you use, and how is it performance-wise?


This model dropped on Friday, how would anyone be using it “in production” by Monday morning?


Not to mention the fact that only model weights dropped without the inference code so people had to hack that together.


Considering that this model has basically zero censorship I would say it is going to take a while before companies have fine-tuned this enough before putting it in production.


This seems like grounds for an interesting and sustainable business model.

* Offer uncensored, pure-instruction-following open source models.

* Offer censored, 'corporate-values' models commercially.


The community will do that for them.

Btw it is unfortunate that 'censored' became the default expectation. Mistral gives the raw model because that makes it useful for all kinds of purposes, e.g. moderation. Censorship is an addon module (or a lora or sth).


Businesses have a huge incentive to censor these models - it's appreciated that these are released without it, but a business has a lot of concerns about what is said around anything they are offering as a service.

The point was just that production implies a business use and that implies the need to make sure there are guardrails in place to make sure the model sticks to the business purpose instead of teaching people how to make pipe bombs. Not that anyone thinks that prevents people from learning how to make pipe bombs - they just don't want to be the ones doing the teaching.


To the contrary, I want/need things local/cheap/fast for generic internal business automation, so it will never generate content any outsider would ever see. Basically glue code between services with classification of data built in.


Very interested to hear a similar answer, but if anyone has tried running it locally on high-end consumer grade hardware, eg an Nvidia 4090?


Stumbled on this thread looking for tips; so far I haven't had success. I can load the model into memory by lowering the precision. I was OOMing on my 128GB of RAM, and now I'm OOMing while loading the model into the 4090 on the 17th shard at 4-bit quantization, so I would imagine there's another knob or two I'm missing to get this to run.


Thanks for your insight. Sounds like it’s quite a challenge, 128GB of RAM and a 4090 are pretty beefy for a consumer PC.


Just tried to register, haven't received the confirmation email.


Same here.


Same.


Same. Tip: if you put a random number it will get you through regardless


Hah, it works, that’s hilarious


Same.


Now works. I’m on the waitlist.


Same.


Can someone please explain how this works to a software engineer who used to work with heuristically observable functions and algorithms? I'm having a hard time comprehending how a mix of experts can work.

In SE, to me, it would look like (sorting example):

- Having 8 functions that do some stuff in parallel

- There's 1 function that picks the output of a function that (let's say) did the fastest sorting calculation and takes the result further

But how does that work in ML? How can you mix and match what seems like simple matrix transformations in a way that resembles if/else flowchart logic?


The feed-forward layer is essentially a differentiable key-value store. Similar to the attention layer, actually. So it just uses an attention-like pre-selector to attend to only some experts. During inference, this soft selection is made a hard cutoff.
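A toy PyTorch version of that top-2 selection, to make it concrete; the sizes and the naive per-token loop are purely illustrative (real implementations batch the dispatch):

    import torch, torch.nn as nn, torch.nn.functional as F

    class TopKMoE(nn.Module):
        def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)   # trainable gating network
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                              nn.Linear(d_ff, d_model))
                for _ in range(n_experts))
            self.k = k

        def forward(self, x):                              # x: (tokens, d_model)
            logits = self.router(x)                        # (tokens, n_experts)
            weights, idx = logits.topk(self.k, dim=-1)     # hard top-k cutoff
            weights = F.softmax(weights, dim=-1)           # renormalize over chosen experts
            out = torch.zeros_like(x)
            for t in range(x.size(0)):                     # naive loop for clarity
                for w, e in zip(weights[t], idx[t]):
                    out[t] += w * self.experts[e](x[t])
            return out

    y = TopKMoE()(torch.randn(5, 64))   # 5 tokens, each routed to 2 of 8 experts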


This is a very interesting approach. I know it may be too much to ask, but would you suggest any actual practical and hands-on workshops, playgrounds, or courses where I could practice using NN layers for stuff like that? For example, conditional/weighted selection of previous inputs, etc. It feels like I'm looking at ML programming from another angle.


Interesting to see Mistral in the news raising EUR 450M at a EUR 2B valuation. tbh I'd not even heard of them before this Mixtral release. Amazing how fast this field is developing!


The Sparse Mixture of Experts neural network architecture is actually an absolutely brilliant move here.

It scales fantastically, when you consider that (1) GPU RAM is way too expensive, in financial dollars, (2) SSD / CPU RAM are relatively cheap, and (3) you can have "experts" running on their own computers, i.e. it's a natural distributed computing partitioning strategy for neural networks.

I did my M.S. thesis on large-scale distributed deep neural networks in 2013 and can say that I'm delighted to point out where this came from.

In 2017, it emerged from a Geoffrey Hinton / Jeff Dean / Quoc Le publication called "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer".

Here is the abstract: "The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost."

So, here's a big A.I. idea for you: what if we all get one of these sparse Mixture of Experts (MoEs) that's a 100 GB on our SSDs, contains all of the "outrageously large" neural network insights that would otherwise take specialized computers, and is designed to run effectively on a normal GPU or even smaller (e.g. smartphone)?

Source: https://arxiv.org/abs/1701.06538


The claim that it is better than GPT3.5, in practice, should be taken with a grain of salt, since the benchmarks themselves aren't... ideal.

Despite the questionable marketing claim, it is a great LLM for other reasons.


Plus benchmarks are typically run from the API, using a tagged version of 3.5-turbo and not whatever extra tuning and prompt magic ChatGPT as a frontend has to actually get the better results that people see in practice.

On the other hand, if the base instruct model is good enough to roughly match it then the fine tunes will be interesting for sure.


It sounds like the same requirements as a 70b+ model, but if someone manages to get inference running locally on a single rtx4090 (AMD 7950x3D w/ 64GB ddr5) reasonably well, please let me know.


They're only paying 80,000€ for a full stack dev and want the candidate to have a "portfolio of successful projects"


That's twice the average salary of an experienced "regular" software developer in France.


That's ~$86,000, not bad at all imo, and it probably comes with other benefits as well. Not really sure how economical that is in the EU, but it can't be worse than the U.S., where this is still pretty good compensation.


OpenAI is known to pay up to 10 times that. It's a different world.

$86k in Europe is good (about 90th percentile of earners in Germany), but not as fantastical as some salaries in the US. Plus, Paris is probably expensive.


Current OpenAI job postings top out at $385,000 for full stack engineers which is 98th percentile in the US (and less for San Francisco).

The €80k is the start of the Mistral salary range. €100k is the top, which would put you in the top 1% of earners in France.


The $385k is base salary, not total comp, right?


I'm in Europe and imo it's bad. The position is London/Paris and it's definitely not enough to live comfortably in London. It's okay for a run of the mill full stack position but at what could be considered the hottest sector in the world right now? It's not enough.


It's France, that's quite a high salary there.


It’s not enough to rent a 2 bedroom flat in paris but ok


80K is kinda high for Europe, I bet there are lots of people applying


Welcome to Europe


Welcome to Europe.


I wonder if the brain uses a mixture of experts?


We have regions that specialize in specific tasks. Information gets routed there for recognition and understanding.

Eg: Faces are processed in the fusiform area. And if you play pokemon obsessively as a kid, you’ll even create an expert pokemon region: https://news.stanford.edu/2019/05/06/regular-pokemon-players...


What kind of training data did they use? Are the training data and replies censored in any way?


From the link:

> Mixtral is pre-trained on data extracted from the open Web – we train experts and routers simultaneously.

> Note: Mixtral can be gracefully prompted to ban some outputs from constructing applications that require a strong level of moderation, as exemplified here. A proper preference tuning can also serve this purpose. Bear in mind that without such a prompt, the model will just follow whatever instructions are given.


> Are the training data and replies censored in any way?

It doesn't seem so.


Are there any scores on Dutch support? Is it totally not supported or not benchmarked?


The graphs here are also confusing, as they don't show the full y-axis (please don't do that!) https://towardsdatascience.com/misleading-graphs-e86c8df8c5d...


Disagree, at least when stated this generally. This article explains it thoroughly: http://www.chadskelton.com/2018/06/bar-charts-should-always-...

(The article you linked is not accessible to non-medium users, by the way. Apologies if it covers caveats.)

For bar charts it's a good rule of thumb. For line charts, not necessarily.

Scroll down to the cases where it makes sense to zoom in. Imagine you plot the global average temperature of the last 200 years including the zero. You could barely see the changes, but they've been dramatic. Use Kelvin to make this effect even stronger.

Which brings up another point: The zero is sometimes arbitrary. If instead of a quantity you only plot its difference to some baseline, all that is changing is the numbers of the y-axis, but the actual plot stays the same. Is it now less misleading? I say no, because the reader must look at the axes and the title either way.

So please do zoom in to frame the data neatly if it makes sense to do so.
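A concrete (made-up numbers) illustration of the trade-off, in matplotlib:

    import matplotlib.pyplot as plt

    scores = {"Model A": 70.2, "Model B": 71.5, "Model C": 74.8}
    fig, (full, zoomed) = plt.subplots(1, 2, figsize=(8, 3))
    for ax, ylim in ((full, (0, 100)), (zoomed, (69, 76))):
        ax.bar(list(scores.keys()), list(scores.values()))
        ax.set_ylim(*ylim)
    full.set_title("Full axis: differences look tiny")
    zoomed.set_title("Zoomed axis: differences look huge")
    plt.show()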


For headline news that's probably a good rule of thumb. But in an academic context, especially data science, anyone who is data literate should know to check the axes.


Can mixtral be fine tuned in any way?


Yes, they dropped an instruct model already, and I think people are churning out variants as we speak.


The model says 8x7B, so it's a 56B model. What are the GPU memory requirements to run this model for a 512 context size? Are there any feasible quantized versions of this available? I want to know if my 16GB VRAM GPU can run this model. Thanks


According to https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF :

18.14GB in 2bit, which is still too high for your GPU, and most likely borders on unusable in terms of quality. You could probably split it between CPU and GPU, if you don't mind the slowdown.
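If you do want to try the CPU/GPU split, here is a sketch with llama-cpp-python (assuming a build that already includes the Mixtral/MoE support; the filename and layer count are guesses for a 16GB card):

    from llama_cpp import Llama

    llm = Llama(
        model_path="mixtral-8x7b-v0.1.Q2_K.gguf",  # illustrative filename
        n_ctx=512,
        n_gpu_layers=20,     # offload what fits; the remaining layers run on CPU
    )
    out = llm("Q: What is a mixture of experts?\nA:", max_tokens=64)
    print(out["choices"][0]["text"])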


I'm surprised no one has commented on the context size limitations of these offerings when comparing them to the other models. The sliding window technique really does effectively cripple its recall to approximately just 8k tokens, which is just plain insufficient for a lot of tasks.

All these llama2 derivatives are only effective if you fine tune them, not just because of the parameter count as people keep harping but perhaps even more so because of the tiny context available.

A lot of my GPT3.5/4 usage involves “one offs” where it would be faster to do the thing by hand than to train/fine-tune first, made possible because of the generous context window and some amount of modest context stuffing (drives up input token costs but still a big win).
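For context, the sliding window restricts each token's attention to the previous W positions; a toy mask in PyTorch (sizes illustrative):

    import torch

    T, W = 10, 4                         # sequence length, window size
    i = torch.arange(T).unsqueeze(1)     # query positions
    j = torch.arange(T).unsqueeze(0)     # key positions
    mask = (j <= i) & (j > i - W)        # causal AND within the window
    print(mask.int())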


8k tokens is about 6000 words, which is more than enough for most classification tasks. Maybe it's not enough for something like story writing, but I feel like it's enough context for most business use cases.


Using an LLM like Mistral or GPT solely for classification is like using a jackhammer to drive framing nails. A lot of business needs require feeding the LLM plenty of docs to analyze and then extract something out of, summarize, draw relationships between, etc., and almost none of that can be done in 6k words. I can't even use 8k (or so-called "32k") models to analyze a moderate-length Wikipedia article.


A lot of attention window stuff is fluff IMPE; just gotta look at the benchmarks regardless of what the raw numbers say.


Benchmarks are, by definition, artificial. I'm speaking from real-world experience, not from "based off the architecture, here's my guess" and lack of context or poor recall within the "supported" context window is a real problem.


Yes, this is partially due, among other reasons, to the fact that inference induces a domain shift compared to training because of the use of teacher forcing during training.


I think comparisons to base LLaMA are not so interesting, as almost no one is using those models. The most informative comparison is between Mistral 7B and 8x7B, provided in this picture: https://mistral.ai/images/news/mixtral-of-experts/open_model...

The key takeaway for me is that there is a decent improvement in all categories - about 10% on average with a few outliers. However, the footprint of this model is much larger, so the performance bump ends up being underwhelming in my opinion. I would expect about the same performance improvement if they released a 13B version without the MoE. It may be too early to definitively say that MoE is not the whole secret sauce behind GPT-4, but at least with this implementation it does not seem to lift performance dramatically.


How does it compare to existing 13b models on benchmarks?


Good question. If you believe the results on the HuggingFace leaderboard (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...), which I find very hard to navigate, Mistral was not even the best 7B model on there, and there is huge variance as well. I prefer to rely on benchmarks done by the same group of known individuals over time for comparisons, as I think it's still too easy to game benchmark results - especially if you are just releasing something anonymously.


Most of the top 7B models on the leaderboard are finetuned Mistral 7B models.


You are right - upon closer inspection, even models that were not previously Mistral finetunes are now using Mistral in their later versions. I wasn't aware of it before as I could not filter results in the leaderboard (it doesn't even load at all for me now).



