
Is it really close to GPT-3.5 at 2.7B?



It's 2.7B, not 1.1. In my experience it goes off the rails and starts generating nonsense after a few paragraphs, but I haven't dug too much into tweaking the kv cache params to see if that's controllable. It also needs a fair bit of prompt massaging to get it to do what you want. So no, not GPT-3.5, but it's comfortably better than anything else in its size class.


How does it compare to 7B LLaMA quantized to run on a Raspberry Pi?


Probably similar token rates out of the box, although I haven't done a straight comparison. Where they'll differ is in the sorts of questions they're good at: Llama 2 was trained (broadly speaking) for knowledge, Phi-2 for reasoning. And bear in mind that you can quantise Phi-2 down too; the starting point is f16.
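
For reference, a minimal sketch of loading Phi-2 at that f16 starting point with Hugging Face transformers; the model id, prompt format, and generation settings below are placeholders rather than anything from the comment above:

    # Minimal sketch: load Phi-2 at its native f16 precision with transformers.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "microsoft/phi-2"  # assumed Hugging Face repo id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # the f16 starting point mentioned above
        device_map="auto",
        trust_remote_code=True,     # may be needed on older transformers versions
    )

    prompt = "Instruct: Explain why the sky is blue.\nOutput:"  # illustrative prompt format
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=100)
    print(tokenizer.decode(out[0], skip_special_tokens=True))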


If you can run a quantized 7B, nothing beats Mistral and its fine-tunes, like OpenHermes 2.5.
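
If you want to try one of those locally, here's a rough sketch using llama-cpp-python; the GGUF filename and Q4_K_M quant level are assumptions, so substitute whatever build you actually download:

    # Sketch: run a quantized Mistral fine-tune (OpenHermes 2.5) with llama-cpp-python.
    from llama_cpp import Llama

    llm = Llama(
        model_path="openhermes-2.5-mistral-7b.Q4_K_M.gguf",  # hypothetical local file
        n_ctx=4096,    # context window
        n_threads=4,   # tune for your CPU, e.g. a Raspberry Pi has few cores
    )

    resp = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarise Hamlet in two sentences."}],
        max_tokens=128,
    )
    print(resp["choices"][0]["message"]["content"])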


This sounds much more realistic, thanks!


What are kv cache params?


Key-value cache in the attention layers. There was a paper a little while back about how maintaining the first N tokens across an extended context helped an LLM keep sane for longer, and it turns out you can replicate it with the right CLI arguments to llama.cpp.
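
That sounds like the "attention sinks" idea; in llama.cpp I believe the relevant knob is --keep (hold the first N prompt tokens through context shifts). A rough sketch of the eviction policy itself, purely illustrative rather than llama.cpp's actual code (a real implementation also has to fix up position ids / RoPE):

    # Illustrative "attention sink" cache eviction: when the KV cache overflows,
    # keep the first n_sink entries plus the most recent window, drop the middle.
    import torch

    def evict_kv(cache: torch.Tensor, n_sink: int = 4, window: int = 2048) -> torch.Tensor:
        """cache: [seq_len, ...] per-layer key or value tensor."""
        seq_len = cache.shape[0]
        if seq_len <= n_sink + window:
            return cache  # still fits, nothing to drop
        return torch.cat([cache[:n_sink], cache[-window:]], dim=0)

    fake_cache = torch.randn(5000, 8)    # toy cache: 5000 positions, hidden size 8
    print(evict_kv(fake_cache).shape)    # torch.Size([2052, 8])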


Close is a subjective term. Vicuna-33B from over half a year ago gets within 22 ELO of 3.5 on the arena leaderboard, but in practice refusals drag down the ratings of 3.5 and other RLHF'd models quite a bit, and they're not actually close.

You can try Phi-2 with WASM here, though I mostly just get gibberish out of it: https://huggingface.co/spaces/radames/Candle-phi1-phi2-wasm-...

Mixtral is the only model that properly matches 3.5 at the moment.


The Elo scoring system is named after a Hungarian man called Arpad Elo (a simplification of his original, more Hungarian name). My phone helpfully miscorrects it to "ELO", probably because it prefers Jeff Lynne's Electric Light Orchestra. Anyway,

Elo is a proper name, not an acronym!


TIL, interesting. I always figured it must be some kind of abbreviation.


That makes sense, thank you. There does seem to be some inflation: Mistral is supposed to be GPT-3.5-level, and Mixtral is supposed to be nearer GPT-4, but those claims look suspicious in practice, even though Mistral is very good.


Well, in some things it can be, to an extent. You can almost certainly get a Mistral 7B fine-tuned for a specific task (e.g. coding) that will likely be about as good as 3.5 at that specific task (not a super high bar in objective terms). In other areas it may suffer relative to its original self, but for some applications that's fine. As for GPT-4, it's about 120 Elo points [0] above Mixtral, and that's even the distilled Turbo version. Not even close, imo, especially when Mixtral is far less censored.
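
For a sense of what those gaps mean, the standard Elo expected-score formula converts a rating difference into a rough head-to-head win rate; this is back-of-the-envelope only and ignores ties and whatever adjustments the arena itself applies:

    # Back-of-the-envelope: convert an Elo rating gap into an expected win rate.
    def expected_score(rating_gap: float) -> float:
        """Probability that the higher-rated model wins a pairwise comparison."""
        return 1.0 / (1.0 + 10 ** (-rating_gap / 400))

    for gap in (22, 32, 120):
        print(f"{gap:>4} Elo gap -> {expected_score(gap):.1%} expected win rate")
    # roughly 53% at 22, 55% at 32, and 67% at 120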

Both 3.5 and 4 have changed drastically over the past year with continued fine-tuning, quantization, etc., so what some people consider their level is not exactly a fixed point either.

[0] The leaderboard I'm referencing; it has its biases, but it's the most generally indicative thing available right now: https://chat.lmsys.org


Mixtral is only 32 Elo points ahead of the best 7B model on that leaderboard, although I suspect that might be understating the difference.


Though admittedly I haven't played with Phi-2 much, smaller models are hurt much more by quantization. I'd try 8 bits or so, at least.
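
If you do go down to 8 bits with transformers, the load looks roughly like this via bitsandbytes; the model id is assumed and bitsandbytes needs a CUDA GPU:

    # Sketch: load Phi-2 with 8-bit weights via bitsandbytes instead of full f16.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "microsoft/phi-2"                        # assumed repo id
    bnb_config = BitsAndBytesConfig(load_in_8bit=True)  # try 8-bit before anything lower

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )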


In my experience it's great for its size, but obviously worse than mistral:7b-instruct-v0.2. Currently mixtral:8x7b-instruct-v0.1 is the lowest-inference-cost model at a performance level similar to GPT-3.5.





