It's 2.7B, not 1.1B. In my experience it goes off the rails and starts generating nonsense after a few paragraphs, though I haven't dug much into tweaking the KV cache params to see if that's controllable. It also needs a fair bit of prompt massaging to get it to do what you want. So no, not GPT-3.5, but it's comfortably better than anything else in its size class.
Probably similar token rates out of the box, although I haven't done a straight comparison. Where they'll differ is in the sorts of questions they're good at. Llama2 was trained (broadly speaking) for knowledge, Phi-2 for reasoning. And bear in mind that you can quantise Phi-2 down too; the starting point is f16.
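If you want to try that, llama.cpp ships a quantize tool and it's a one-liner once you have the f16 GGUF. Something like this (file names here are just placeholders):

    # hypothetical file names; Q4_K_M is a common quality/size sweet spot
    ./quantize phi-2-f16.gguf phi-2-Q4_K_M.gguf Q4_K_M

A 4-bit quant like that roughly quarters the memory footprint versus f16 for a modest quality hit.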
Key-value cache in the attention layers. There was a paper a little while back (the attention-sinks / StreamingLLM one, IIRC) about how retaining the first N tokens across an extended context helped an LLM stay coherent for longer, and it turns out you can replicate it with the right CLI arguments to llama.cpp.
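Concretely it's the --keep flag on llama.cpp's main example: when the context window fills up and the cache gets shifted, the first N prompt tokens are retained. A sketch, with the model path and numbers as placeholders:

    # keep the first 64 prompt tokens as "attention sinks" when the context shifts
    ./main -m phi-2-Q4_K_M.gguf -c 2048 --keep 64 -p "You are a helpful assistant. ..."

Keeping those initial tokens around is what stops generation degenerating once you slide past the original window.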
Close is a subjective term. Vicuna-33B, from over half a year ago, gets within 22 Elo of 3.5 on the arena leaderboard, but in practice refusals drag down the ratings of 3.5 and other RLHF'd models quite a lot, and on raw capability they're not even close.
The Elo scoring system is named after a Hungarian man called Arpad Elo (a simplification of his original, more Hungarian name). My phone keeps helpfully miscorrecting it to "ELO", probably because it prefers Jeff Lynne's Electric Light Orchestra. Anyway,
That makes sense, thank you. There seems to be some inflation: Mistral is supposed to be GPT-3.5-level and Mixtral supposedly nearer GPT-4, but yeah, those claims look suspicious in practice, even though Mistral is very good.
Well, in some things it can be, to an extent, yes. You can almost certainly get a Mistral 7B fine-tuned for a specific task (e.g. coding), and it will likely be about as good as 3.5 at that specific task (not a super high bar in objective terms). In all the other areas it may lose performance relative to its original self, but for some applications that's fine. As for GPT-4, it's about 120 Elo points [0] above Mixtral, and that's even the distilled Turbo version. Not even close imo, especially when Mixtral is far less censored.
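To put numbers on those gaps: under the Elo model, the expected score across a rating gap of d points is 1/(1 + 10^(-d/400)), so:

    d = 120  ->  1/(1 + 10^(-0.300)) ≈ 0.67   (wins roughly 2 in 3 head-to-heads)
    d = 22   ->  1/(1 + 10^(-0.055)) ≈ 0.53   (barely better than a coin flip)

120 points is a gap you actually feel; the 22-point gap mentioned upthread is close to noise.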
Both 3.5 and 4 have changed drastically over the past year with continued fine-tuning, quantization, etc., so what some people consider their level isn't exactly a fixed point either.
[0] The leaderboard I'm referencing; it has its biases, but it's the most generally indicative thing available right now: https://chat.lmsys.org
In my experience it's great for its size, but clearly worse than mistral:7b-instruct-v0.2. Currently mixtral:8x7b-instruct-v0.1 is the lowest-inference-cost model at a similar performance level to GPT-3.5.
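(Those look like Ollama tags; if anyone wants to reproduce the comparison themselves, something like this should work, assuming you have ollama installed:)

    # pulls the model on first run, then drops into an interactive prompt
    ollama run mistral:7b-instruct-v0.2
    ollama run mixtral:8x7b-instruct-v0.1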