
It's 2.7B, not 1.1. In my experience it goes off the rails and starts generating nonsense after a few paragraphs, but I haven't dug too much into tweaking the kv cache params to see if that's controllable. It also needs a fair bit of prompt massaging to get it to do what you want. So no, not GPT3.5, but it's comfortably better than anything else in its size class.

How does it compare to 7B LLaMA quantized to run on a Raspberry Pi?

Probably similar token rates out of the box, although I haven't done a straight comparison. Where they'll differ is in the sorts of questions they're good at. Llama2 was trained (broadly speaking) for knowledge, Phi-2 for reasoning. And bear in mind that you can quantise Phi-2 down too. The starting point is f16.
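For a rough sense of what quantising down from f16 buys you, here's a back-of-envelope sketch. It only counts the weights, assumes nominal bit widths (real quantised formats carry some extra per-block overhead), and ignores the KV cache and activations, so treat the numbers as a lower bound rather than a measurement.

```python
# Rough weight-memory estimate for a 2.7B-parameter model at different precisions.
# Assumption: nominal bits per weight, no per-block scale overhead, weights only.

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate bytes needed to store the weights alone."""
    return n_params * bits_per_weight / 8

PHI2_PARAMS = 2.7e9  # parameter count quoted in the thread

for label, bits in [("f16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gb = weight_bytes(PHI2_PARAMS, bits) / 1e9
    print(f"{label}: ~{gb:.2f} GB")
```

So by this estimate the f16 starting point is about 5.4 GB of weights, and a nominal 4-bit quantisation is about a quarter of that.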

If you can run quantized 7B, nothing beats Mistral and its fine-tunes, like OpenHermes 2.5.

This sounds much more realistic, thanks!

What are KV cache params?

Key-value cache in the attention layers. There was a paper a little while back about how maintaining the first N tokens across an extended context helped an LLM keep sane for longer, and it turns out you can replicate it with the right CLI arguments to llama.cpp.
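To make the "keep the first N tokens" idea concrete, here's a minimal sketch of the eviction policy as I understand it. This is not llama.cpp's code: `evict`, `n_keep`, and `max_len` are hypothetical names, and the cache is modelled as a plain list with one entry per token. A real implementation would also have to deal with position encodings after dropping entries, which this skips.

```python
# Sketch of a KV-cache eviction policy that pins the first n_keep entries
# and otherwise keeps only the most recent ones. Names are illustrative.

def evict(cache: list, n_keep: int, max_len: int) -> list:
    """Trim cache to max_len entries: the first n_keep plus the newest rest."""
    if n_keep > max_len:
        raise ValueError("n_keep must not exceed max_len")
    if len(cache) <= max_len:
        return cache  # still fits, nothing to drop
    n_recent = max_len - n_keep
    # Keep the pinned prefix, drop the middle, keep the most recent tail.
    return cache[:n_keep] + cache[len(cache) - n_recent:]

# Ten cached tokens, room for six, first two pinned.
print(evict(list(range(10)), n_keep=2, max_len=6))  # [0, 1, 6, 7, 8, 9]
```

The contrast is with a plain sliding window, which would eventually throw away tokens 0 and 1 as well.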

