It's 2.7B, not 1.1. In my experience it goes off the rails and starts generating...

realPtolemy · 2024-01-06T11:36:56.000000Z

How does it compare to 7B LLaMA quantized to run on a Raspberry Pi?

regularfry · 2024-01-06T13:02:26.000000Z

Probably similar token rates out of the box, although I haven't done a straight comparison. Where they'll differ is in the sorts of questions they're good at. Llama 2 was trained (broadly speaking) for knowledge, Phi-2 for reasoning. And bear in mind that you can quantise Phi-2 down too. The starting point is f16.
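To put rough numbers on "quantise down from f16", here is a back-of-the-envelope sketch. It assumes the 2.7B parameter count mentioned upthread and counts weight storage only; real quantised formats also store per-block scales, and the KV cache and activations come on top, so treat these as lower bounds. The function name is just for illustration.

```python
def weight_bytes(n_params: int, bits_per_weight: int) -> int:
    """Approximate bytes needed to store the weights alone."""
    return n_params * bits_per_weight // 8

PHI2_PARAMS = 2_700_000_000  # assumption: 2.7B, as stated upthread

for bits in (16, 8, 4):
    gb = weight_bytes(PHI2_PARAMS, bits) / 1e9
    print(f"{bits:>2}-bit weights: ~{gb:.2f} GB")
```

By this estimate f16 needs about 5.4 GB for weights and a 4-bit quantisation about a quarter of that, which is the difference between not fitting and fitting on a small board.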

jasonjmcghee · 2024-01-07T04:19:43.000000Z

If you can run quantized 7B, nothing beats Mistral and its fine-tunes, like OpenHermes 2.5.

stavros · 2024-01-06T10:20:30.000000Z

This sounds much more realistic, thanks!

eurekin · 2024-01-06T19:03:11.000000Z

What are kv cache params?

regularfry · 2024-01-07T00:13:45.000000Z

Key-value cache in the attention layers. There was a paper a little while back about how maintaining the first N tokens across an extended context helped an LLM stay sane for longer, and it turns out you can replicate it with the right CLI arguments to llama.cpp.
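As a toy illustration of the idea (not llama.cpp's actual code), the eviction policy can be sketched like this: when the cache overflows, keep the first `n_keep` entries plus the most recent ones and drop the middle. The function name and parameters are hypothetical, and the cache is modelled as a plain list with one entry per token position.

```python
def evict_kv(cache, n_keep, max_len):
    """Trim a per-token cache to max_len entries.

    Keeps the first n_keep entries (the initial tokens) and fills the
    remaining slots with the most recent entries, dropping the middle.
    """
    if not 0 <= n_keep <= max_len:
        raise ValueError("n_keep must be between 0 and max_len")
    if len(cache) <= max_len:
        return list(cache)  # nothing to evict yet
    n_recent = max_len - n_keep
    return list(cache[:n_keep]) + list(cache[len(cache) - n_recent:])

# Ten cached positions, room for six, first two pinned.
print(evict_kv(list(range(10)), n_keep=2, max_len=6))  # [0, 1, 6, 7, 8, 9]
```

A real implementation would also have to deal with positional encodings for the surviving entries, which this sketch ignores.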