
I run it on a Linux system via the CLI using Ollama; it's very easy to set up.

https://ollama.ai/library/mistral

curl https://ollama.ai/install.sh | sh

ollama run mistral:text
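Once it's running, you can also hit the local REST API directly (it listens on port 11434 by default); a minimal sketch with curl, the prompt text is just an example:

curl http://localhost:11434/api/generate -d '{"model": "mistral:text", "prompt": "Why is the sky blue?", "stream": false}'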




I want to like Ollama, but I wish it didn't obfuscate the actual directives (the full prompt) that it sends to the underlying model. Ollama uses a templating system in its Modelfiles to translate user input into the format a specific model expects ([INST], etc.), but it can be hard to tell whether it's working as expected because the rendered prompt won't show up in the logs at all.

Other than that it's a great project - very easy to get started with and it has a solid API implementation. I've got it running on both a Win 10 + WSL2 Docker setup and an M1 Mac.
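For reference, the templating lives in the Modelfile; a rough sketch of what a Mistral-style template looks like (from memory - the actual template that ships with the model may differ, and the model name below is just an example):

FROM mistral
TEMPLATE """[INST] {{ .System }} {{ .Prompt }} [/INST]"""

ollama create mistral-custom -f Modelfile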


You can bypass the templating with raw mode by setting the request parameter `raw` to true.

https://github.com/jmorganca/ollama/blob/main/docs/api.md
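Roughly, a raw-mode request looks like this; you supply the instruction tags yourself and Ollama skips the template (sketch, untested):

curl http://localhost:11434/api/generate -d '{"model": "mistral", "prompt": "[INST] Why is the sky blue? [/INST]", "raw": true, "stream": false}'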


Yeah, I guess I could compare the output at 0.0 temperature using the Modelfile against the output in raw mode, using my best guess at how the Modelfile constructs the raw prompt it passes to the model.

I'd push a PR to the repo itself but I have zero experience with Go...
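Sketching that comparison: temperature can be pinned to 0 via the `options` object, once with the normal templated prompt and once in raw mode with a hand-written [INST] wrapper (which is just my guess at what the Modelfile emits):

curl http://localhost:11434/api/generate -d '{"model": "mistral", "prompt": "Why is the sky blue?", "options": {"temperature": 0}, "stream": false}'

curl http://localhost:11434/api/generate -d '{"model": "mistral", "prompt": "[INST] Why is the sky blue? [/INST]", "raw": true, "options": {"temperature": 0}, "stream": false}'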


Yeah, I was surprised Ollama was not mentioned, as it’s by far the easiest to get started with. If only it had real grammar support, I’d never have to use another library again (it does have a JSON mode that generally works, though).
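The JSON mode mentioned is a per-request flag rather than full grammar support; a sketch:

curl http://localhost:11434/api/generate -d '{"model": "mistral", "prompt": "List three colors as JSON.", "format": "json", "stream": false}'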


What is grammar support? I've seen that mentioned several times now. Does it let you restrict the output to a given template or format, or am I totally wrong there?


Yes. Here's a quick rundown of grammars in llama.cpp (the link is to the docs of faraday.dev, which runs llama.cpp under the hood):

https://docs.faraday.dev/character-creation/grammars
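As a concrete illustration, a llama.cpp grammar is a small GBNF file that constrains decoding; a toy sketch (the file name and model path are made up):

# write a toy grammar that only allows "yes" or "no"
echo 'root ::= "yes" | "no"' > yesno.gbnf

# ask llama.cpp to enforce it while sampling
./main -m mistral-7b-instruct.Q4_K_M.gguf --grammar-file yesno.gbnf -p "Is the sky blue? Answer:"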


Ollama is great. I discovered it today while looking for a way to serve LLMs locally for my terminal command generator tool (cmdh: https://github.com/pgibler/cmdh) and was able to get it up and running and implement support for it very easily.


Huh, I've fought with a few of these things on my laptop: no NVIDIA GPU, limited RAM, etc.

This actually worked as advertised.


What's the performance like (quality and speed wise)?


Yesterday I tried Mixtral 8x7B running on the CPU. With an 11th-gen Intel chip and 64 GB of DDR4 at 3200 MHz, I got around 2-4 tokens/second with a small context; it gets progressively slower as the context grows.

You would get a much better experience with Apple silicon and lots of RAM.


Can confirm. My M3 Max gets about 22 t/s, putting the bottleneck BKAC (between keyboard and chair).


That's a 10x speed increase. What's the secret behind the Apple M3? Faster-clocked RAM? Dedicated AI hardware?


Unified memory and optimizations in llama.cpp (which Ollama wraps).


Is that using the GPU?


It's configurable. There are details in the repo, but llama.cpp makes use of Metal.
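Concretely, the number of layers offloaded to the GPU is tunable: in llama.cpp via `-ngl`, and in Ollama via the `num_gpu` option (a sketch; the model path and layer count are illustrative):

# llama.cpp: offload 33 layers to the GPU (Metal on Apple silicon)
./main -m mistral-7b-instruct.Q4_K_M.gguf -ngl 33 -p "hello"

# Ollama: same idea via the num_gpu option
curl http://localhost:11434/api/generate -d '{"model": "mistral", "prompt": "hello", "options": {"num_gpu": 33}, "stream": false}'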


Mistral 7B is serviceable for short contexts. If you have a longer conversation, token generation can start to lag a lot.



