It's 2.7B, not 1.1B. In my experience it goes off the rails and starts generating nonsense after a few paragraphs, but I haven't dug much into tweaking the KV cache params to see if that's controllable. It also needs a fair bit of prompt massaging to get it to do what you want. So no, not GPT-3.5, but it's comfortably better than anything else in its size class.
Probably similar token rates out of the box, although I haven't done a straight comparison. Where they'll differ is in the sorts of questions they're good at. Llama 2 was trained (broadly speaking) for knowledge, Phi-2 for reasoning. And bear in mind that you can quantise Phi-2 down too. The starting point is f16.
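For a rough sense of what quantising down from f16 buys you, here's some back-of-envelope arithmetic. It only counts the weights, and it assumes a flat number of bits per weight; real quantised formats also store scale factors, so actual files come out somewhat larger. The function name is just for illustration.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate bytes needed to store the weights alone (no KV cache, no overhead)."""
    return n_params * bits_per_weight / 8


N_PARAMS = 2.7e9  # Phi-2 parameter count, as stated above

for label, bits in [("f16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gb = weight_bytes(N_PARAMS, bits) / 1e9
    print(f"{label}: ~{gb:.2f} GB")
# f16: ~5.40 GB
# 8-bit: ~2.70 GB
# 4-bit: ~1.35 GB
```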
Key-value cache in the attention layers. There was a paper a little while back about how keeping the first N tokens in the cache across an extended context helped an LLM stay sane for longer, and it turns out you can replicate it with the right CLI arguments to llama.cpp.
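To make the idea concrete, here's a toy sketch of that eviction policy: when the cache overflows, keep the first N entries plus the most recent ones and drop the middle. This is an illustration only, not llama.cpp's code; `evict`, `n_keep` and `max_len` are made-up names, and the list of integers stands in for per-token key/value entries. It also ignores positional encodings, which a real implementation would have to deal with.

```python
def evict(cache: list, n_keep: int, max_len: int) -> list:
    """Trim `cache` to at most `max_len` entries, keeping the first
    `n_keep` entries and filling the rest with the most recent ones."""
    if not 0 <= n_keep <= max_len:
        raise ValueError("need 0 <= n_keep <= max_len")
    if len(cache) <= max_len:
        return cache  # still fits, nothing to drop
    n_recent = max_len - n_keep
    # Slice from an explicit start index so n_recent == 0 yields an empty tail.
    return cache[:n_keep] + cache[len(cache) - n_recent:]


# Token positions 0..11 stand in for cached key/value entries.
positions = list(range(12))
print(evict(positions, n_keep=4, max_len=8))
# [0, 1, 2, 3, 8, 9, 10, 11]
```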