It's 2.7B, not 1.1B. In my experience it goes off the rails and starts generating nonsense after a few paragraphs, though I haven't dug much into tweaking the KV cache params to see if that's controllable. It also needs a fair bit of prompt massaging to get it to do what you want. So no, not GPT-3.5, but it's comfortably better than anything else in its size class.
Probably similar token rates out of the box, although I haven't done a straight comparison. Where they'll differ is in the sorts of questions they're good at. Llama2 was trained (broadly speaking) for knowledge, Phi-2 for reasoning. And bear in mind that you can quantise Phi-2 down too; the starting point is f16.
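If you want to try that, llama.cpp ships a quantize tool and it's a one-liner once you have the f16 GGUF. Something like this (file names here are just placeholders):

    # hypothetical file names; Q4_K_M is a common quality/size sweet spot
    ./quantize phi-2-f16.gguf phi-2-Q4_K_M.gguf Q4_K_M

A 4-bit quant like that roughly quarters the memory footprint versus f16 for a modest quality hit.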
Key-value cache in the attention layers. There was a paper a little while back (the attention-sinks / StreamingLLM one, IIRC) about how retaining the first N tokens across an extended context helped an LLM stay coherent for longer, and it turns out you can replicate it with the right CLI arguments to llama.cpp.
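Concretely it's the --keep flag on llama.cpp's main example: when the context window fills up and the cache gets shifted, the first N prompt tokens are retained. A sketch, with the model path and numbers as placeholders:

    # keep the first 64 prompt tokens as "attention sinks" when the context shifts
    ./main -m phi-2-Q4_K_M.gguf -c 2048 --keep 64 -p "You are a helpful assistant. ..."

Keeping those initial tokens around is what stops generation degenerating once you slide past the original window.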
Close is a subjective term. Vicuna-33B, from over half a year ago, gets within 22 Elo of 3.5 on the arena leaderboard, but in practice refusals drag down the ratings of 3.5 and other RLHF'd models quite a lot, and on raw capability they're not even close.
The Elo scoring system is named after a Hungarian man called Arpad Elo (a simplification of his original, more Hungarian name). My phone keeps helpfully miscorrecting it to "ELO", probably because it prefers Jeff Lynne's Electric Light Orchestra. Anyway,
That makes sense, thank you. There seems to be some inflation: Mistral is supposed to be GPT-3.5-level and Mixtral supposedly nearer GPT-4, but yeah, those claims look suspicious in practice, even though Mistral is very good.
Well, in some things it can be, to an extent, yes. You can almost certainly get a Mistral 7B fine-tuned for a specific task (e.g. coding), and it will likely be about as good as 3.5 at that specific task (not a super high bar in objective terms). In all the other areas it may lose performance relative to its original self, but for some applications that's fine. As for GPT-4, it's about 120 Elo points [0] above Mixtral, and that's even the distilled Turbo version. Not even close imo, especially when Mixtral is far less censored.
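To put numbers on those gaps: under the Elo model, the expected score across a rating gap of d points is 1/(1 + 10^(-d/400)), so:

    d = 120  ->  1/(1 + 10^(-0.300)) ≈ 0.67   (wins roughly 2 in 3 head-to-heads)
    d = 22   ->  1/(1 + 10^(-0.055)) ≈ 0.53   (barely better than a coin flip)

120 points is a gap you actually feel; the 22-point gap mentioned upthread is close to noise.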
Both 3.5 and 4 have changed drastically over the past year with continued fine-tuning, quantization, etc., so what some people consider their level isn't exactly a fixed point either.
[0] The leaderboard I'm referencing; it has its biases, but it's the most generally indicative thing available right now: https://chat.lmsys.org
In my experience it's great for its size, but clearly worse than mistral:7b-instruct-v0.2. Currently mixtral:8x7b-instruct-v0.1 is the lowest-inference-cost model at a similar performance level to GPT-3.5.
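(Those look like Ollama tags; if anyone wants to reproduce the comparison themselves, something like this should work, assuming you have ollama installed:)

    # pulls the model on first run, then drops into an interactive prompt
    ollama run mistral:7b-instruct-v0.2
    ollama run mixtral:8x7b-instruct-v0.1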