Hacker News new | past | comments | ask | show | jobs | submit login
Llama.cpp guide – Running LLMs locally on any hardware, from scratch (steelph0enix.github.io)
368 points by zarekr 34 days ago | hide | past | favorite | 87 comments



Neat to see more folks writing blogs on their experiences. This however does seem like it's an over-complicated method of building llama.cpp.

Assuming you want to do this iteratively (at least for the first time) should only need to run:

  ccmake .
And toggle the parameters your hardware supports or that you want (e.g. if CUDA if you're using Nvidia, Metal if you're using Apple etc..), and press 'c' (configure) then 'g' (generate), then:

  cmake --build . -j $(expr $(nproc) / 2)

Done.

If you want to move the binaries into your PATH, you could then optionally run cmake install.


Actually I think even this makes it look scarier than it is if you're on an M-series Apple.

In that case, the steps to building llama.cpp are:

1. Clone the repo.

2. Run `make`.

To start chatting with a model all you need is to:

1. Download the model you want in gguf format that will fit into your hardware (probably the hardest step, but readily available on HuggingFace)

2. Run `./llama-server -m model.gguf`.

3. Visit localhost:8080


On a Mac, if all you want is to just use it directly, it is also readily available from Homebrew.


Wow, i did not know about ccmake. I'll check it out and edit the post if it's really that easy to use, thanks.


Yeah the mingw method on windows is a ludicrous thing to even think about, and llama.cpp still has that as the suggested option in the readme for some weird reason. Endless sourcing of paths that never works quite right. I literally couldn't get it to work when I first tried it last year.

Meanwhile Cmake is like two lines and somehow it's the backup fallback option? I don't get it. And well on linux it's literally just one line with make.


Building anything on windows without cmake is just... I don't know why anyone would use anything else. I used to spend hours wrestling with build failures, but after setting up cmake, it just works with everything.


>Yeah the mingw method on windows is a ludicrous thing to even think about

Building most stuff on Windows is ludicrous, that's not something uncommon. I've chosen MSYS there as it's the easiest and least paninful way of installing deps for Vulkan build.


You can get a release binary from https://github.com/ggerganov/llama.cpp/releases too.


Does it autoupdate? I get it from github so I just have to pull and build again every time I want to update it.


Yes, but that's not building it for your system, that's a relatively generic build.


First time I heard about Llama.cpp I got it to run on my computer. Now, my computer: a Dell laptop from 2013 with 8Gb RAM and an i5 processor, no dedicated graphic card. Since I wasn't using a MGLRU enabled kernel, It took a looong time to start but wasn't OOM-killed. Considering my amount of RAM was just the minimum required, I tried one of the smallest available models.

Impressively, it worked. It was slow to spit out tokens, at a rate around a word each 1 to 5 seconds and it was able to correctly answer "What was the biggest planet in the solar system", but it quickly hallucinated talking about moons that it called "Jupterians", while I expected it to talk about Galilean Moons.

Nevertheless, LLM's really impressed me and as soon as I get my hands on better hardware I'll try to run other bigger models locally in the hope that I'll finally have a personal "oracle" able to quickly answers most questions I throw at it and help me writing code and other fun things. Of course, I'll have to check its answers before using them, but current state seems impressive enough for me, specially QwQ.

Is Any one running smaller experiments and can talk about your results? Is it already possible to have something like an open source co-pilot running locally?


You might also try https://github.com/Mozilla-Ocho/llamafile , which may have better CPU-only performance than ollama. It does require you to grab .gguf files yourself (unless you use one of their prebuilts in which case it comes with the binary!), but with that done it's really easy to use and has decent performance.

For reference, this is how I run it:

  $ cat ~/.config/systemd/user/llamafile@.service
  [Unit]
  Description=llamafile with arbitrary model
  After=network.target
  
  [Service]
  Type=simple
  WorkingDirectory=%h/llms/
  ExecStart=sh -c "%h/.local/bin/llamafile -m %h/llamafile-models/%i.gguf --server --host '::' --port 8081 --nobrowser --log-disable"
  
  [Install]
  WantedBy=default.target
And then

  systemctl --user start llamafile@whatevermodel
but you can just run that ExecStart command directly and it works.


Be careful running this on work machines – it will get flagged by Crowdstrike Falcon and probably other EDR tools. In my case the first time I tried it, I just saw “Killed” and then got a DM from SecOps within two minutes.


the irony, preventing and killing something that is actually useful, while we let crowdcrap hum along consuming tons of memory and bottlenecking IO so it can do snakeoil things...


Are they specifically flagging LLMs, or do they not like Cosmopolitan Libc / APE?


Nah nothing to do with LLMs, it’s just because the method of Llamafile is very similar to malware - basically zip up an executable, concatenate it with some stuff, throw it in /tmp and execute it with a randomly generated high entropy name.

(That said, after I explained it to SecOps they did tell me I would need to “consult legal” if I wanted to use a local LLM, but I’ll give them the benefit of the doubt there…)


Is that `--host` listening on non-local addresses? Might be good to default to local-only.


Good call out; in my context yes I do want it listening for use by other machines in its subnets and deliberately set that option (including using the IPv6 form), but most people are probably better off binding to loopback. Thanks


Open Web UI [1] with Ollama and models like the smaller Llama, Qwen, or Granite series can work pretty well even with CPU or a small GPU. Don't expect them to contain facts (IMO not a good approach even for the largest models) but they can be very effective for data extraction and conversational UI.

1. http://openwebui.com


Hey, author of the blog post here. Check out avante.nvim if you're already a vim/nvim user, I'm using it as assistant plugin with llama-server and it works great.

Small models, like Llama 3.2, Qwen and SmolLM are really good right now, compared to few years ago


You can use Ollama for serving a model locally, and Continue to use it in VSCode.

https://ollama.com/blog/continue-code-assistant


Relevant telemetry information. I didn’t like how they went from opt-in to opt-out earlier this year.

https://docs.continue.dev/telemetry


Is autocomplete working well?


you can do that with llama-server too


What you describe is very similar to my own experience first running llama.cpp on my desktop computer. It was slow and inaccurate, but that's beside the point. What impressed me was that I could write a question in English, and it would understand the question, and respond in English with an internally coherent and grammatically correct answer. This is coming from a desktop, not a rack full of servers in some hyperscaler's datacenter. This was like meeting a talking dog! The fact that what it says is unreliable is completely beside the point.

I think you still need to calibrate your expectations for what you can get from consumer grade hardware without a powerful GPU. I wouldn't look to a local LLM as a useful store of factual knowledge about the world. The amount of stuff that it knows is going to be hampered by the smaller size. That doesn't mean it can't be useful, it may be very helpful for specialized domains, like coding.

I hope and expect that over the next several years, hardware that's capable of running more powerful models will become cheaper and more widely available. But for now, the practical applications of local models that don't require a powerful GPU are fairly limited. If you really want to talk to an LLM that has a sophisticated understanding of the world, you're better off using Claude or Gemeni or ChatGPT.


Not sure about copilot, but I recently became aware of llmware:

https://llmware-ai.github.io/llmware/

https://github.com/llmware-ai/llmware

There's also Simon Wilson's llm cli:

https://llm.datasette.io/en/stable/

https://github.com/simonw/llm

Both might help getting started with some LLM experiments.


Llama.cpp is one of those projects that I want to install, but I always just wind up installing kobold.cpp because it's simply miles better with UX.


Llama.cpp forms the base for both Ollama and Kobold.cpp and probably a bunch of others I'm not familiar with. It's less a question of whether you want to use llama.cpp or one of the others and more of a question of whether you benefit from using one of the wrappers.

I can imagine some use cases where you'd really want to use llama.cpp directly, and there are of course always people who will argue that all wrappers are bad wrappers, but for myself I like the combination of ease of use and flexibility offered by Ollama. I wrap it in Open WebUI for a GUI, but I also have some apps that reach out to Ollama directly.


What is the advantage of ollama/open webui let's say, vs llama-server? I have been using llama.cpp since when it came out, I am used to the syntax etc and I do not have problems building it (probably because I use macos which it supports better), so am I missing something from not using ollama?


Llama.cpp may have gotten better, but when I first started using them:

* Ollama provides a very convenient way to download and manage models, which llama.cpp didn't at the time (maybe it does now?).

* Last I checked, with llama.cpp server you pick a model on server startup. Ollama instead allows specifying the model in the API request, and it handles switching out which one is loaded into vram automatically.

* The Modelfile abstraction is a more helpful way to keep track of different settings. When I used llama.cpp directly I had to figure out a way to track a bunch of model parameters as bash flags. Modelfiles + being able to specify the model in the request is a great synergy, allowing clients to not have to think about parameters at all, just which Modelfile to use.

I'm leaving off some other reasons why I switched which I know have gotten better (like Ollama having great Docker support, which wasn't always true for llama.cpp), but some of these may also have improved with time. A glance over the docs suggests that you still can't specify a model at runtime in the request, though, which if true is probably the single biggest improvement that Ollama offers. I'm constantly switching between tiny models that fit on my GPU and large models that use RAM, and being able to download a new model with ollama pull and immediately start using it in Open WebUI is a huge plus.


I just use ollama. It works on my mac and windows machine and it's super simple to install + run most open models. And you can have another tool just shell out to it if you want more than than the CLI.


Llama cpp is more of a backend software. Most front end software like kobold/open webui uses it


I found it only took me ~20 minutes to get Open-WebUI and Ollama going on my machine locally. I don't really know what is happening under the hood but from 0 to chat interface was definitely not too hard.


If anyone on macOS wants to use llama.cpp with ease, check out https://recurse.chat/. Supports importing ChatGPT history & continue chats offline using llama.cpp. Built this so I can use local AI as a daily driver.


“koboldcpp forked from ggerganov/llama.cpp”


I'd say avoid pulling in all the python and containers required and just download the gguf from huggingface website directly in a browser rather than doing is programmatically. That sidesteps a lot of this project's complexity since nothing about llama.cpp requires those heavy deps or abstractions.


Yeah, I should have mentioned that in the post because I know that not everyone likes to tinker as much as I do. I'll add a note about this site in a revision soon.

However, I still find it useful to know how to do that manually.


I tried building and using llama.cpp multiple times, and after a while, I got so frustrated with the frequently broken build process that I switched to ollama with the following script:

  #!/bin/sh
  export OLLAMA_MODELS="/mnt/ai-models/ollama/"
  
  printf 'Starting the server now.\n'
  ollama serve >/dev/null 2>&1 &
  serverPid="$!"
  
  printf 'Starting the client (might take a moment (~3min) after a fresh boot).\n'
  ollama run llama3.2 2>/dev/null

  printf 'Stopping the server now.\n'
  kill "$serverPid"
And it just works :-)


this was pretty much spot-on to my experience and track. the ridicule of people choosing to use ollama over llamacpp is so tired.

i had already burned an evening trying to debug and fix issues getting nowhere fast, until i pulled ollama and had it working with just two commands. it was a shock. (granted, there is/was a crippling performance problem with sky/kabylake chips but mitigated if you had any kind of mid-tier GPU and tweaked a couple settings)

anyone who tries to contribute to the general knowledge base of deploying llamacpp (like TFA) is doing heaven's work.


I have spent unreasonable amounts of time building llama.cpp for my hardware setup (AMD GPU) on both Windows and Linux. That was one of the main reasons of writing that blog post for me. Lmao.


Seeing a lot of Ollama vs running llama.cpp direct talk here. I agree that setting up llama.cpp with CUDA isn't always the easiest. But there is a cost to running all inference over HTTPS. Local in-program inference will be faster. Perhaps that doesn't matter in some cases but it's worth noting.

I find that running PyTorch is easier to get up and running. For quantization, AWQ models work and it's just a "pip install" away.


FYI, if you're on Ubuntu 24.04, it's easy to build llama.cpp with AMD ROCm GPU acceleration. Debian enabled support for a wider variety of hardware than is available in the official AMD packages, so this should work for nearly all discrete AMD GPUs from Vega onward (with the exception of MI300, because Ubuntu 24.04 shipped with ROCm 5):

    sudo apt -y install git wget hipcc libhipblas-dev librocblas-dev cmake build-essential
    # add yourself to the video and render groups
    sudo usermod -aG video,render $USER
    # reboot to apply the group changes

    # download a model
    wget --continue -O dolphin-2.2.1-mistral-7b.Q5_K_M.gguf \
        https://huggingface.co/TheBloke/dolphin-2.2.1-mistral-7B-GGUF/resolve/main/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf?download=true

    # build llama.cpp
    git clone https://github.com/ggerganov/llama.cpp.git
    cd llama.cpp
    git checkout b3267
    HIPCXX=clang++-17 cmake -S. -Bbuild \
        -DGGML_HIPBLAS=ON \
        -DCMAKE_HIP_ARCHITECTURES="gfx803;gfx900;gfx906;gfx908;gfx90a;gfx1010;gfx1030;gfx1100;gfx1101;gfx1102" \
        -DCMAKE_BUILD_TYPE=Release
    make -j8 -C build

    # run llama.cpp
    build/bin/llama-cli -ngl 32 --color -c 2048 \
        --temp 0.7 --repeat_penalty 1.1 -n -1 \
        -m ../dolphin-2.2.1-mistral-7b.Q5_K_M.gguf \
        --prompt "Once upon a time"
I think this will also work on Rembrandt, Renoir, and Cezanne integrated GPUs with Linux 6.10 or newer, so you might be able to install the HWE kernel to get it working on that hardware.

With that said, users with CDNA 2 or RDNA 3 GPUs should probably use the official AMD ROCm packages instead of the built-in Ubuntu packages, as there are performance improvements for those architectures in newer versions of rocBLAS.


What are the limitations on which LLMs (specific transformer variants etc) llama.cpp can run? Does it require the input mode/weights to be in some self-describing format like ONNX that support different model architectures as long as they are built out of specific module/layer types, or does it more narrowly only support transformer models parameterized by depth, width, etc?


This was nice. I took the road less traveled and tried building on Windows and AMD.

Spoiler: Vulkan with MSYS2 was indeed the easiest to get up and running.

I actually tried w64devkit first and it worked properly for llama-server, but there were inexplicable plug-in problems with llama-bench.

Edit: I tried w64devkit before I read this write-up and I was left wondering what to try next, so the timing was perfect.


Somewhat related - on several occasions I've come across the claim that _"Ollama is just a llama.cpp wrapper"_, which is inaccurate and completely misses the point. I am sharing my response here to avoid repeating myself repeatedly.

With llama.cpp running on a machine, how do you connect your LLM clients to it and request a model gets loaded with a given set of parameters and templates?

... you can't, because llama.cpp is the inference engine - and it's bundled llama-cpp-server binary only provides relatively basic server functionality - it's really more of demo/example or MVP.

Llama.cpp is all configured at the time you run the binary and manually provide it command line args for the one specific model and configuration you start it with.

Ollama provides a server and client for interfacing and packaging models, such as:

  - Hot loading models (e.g. when you request a model from your client Ollama will load it on demand).
  - Automatic model parallelisation.
  - Automatic model concurrency.
  - Automatic memory calculations for layer and GPU/CPU placement.
  - Layered model configuration (basically docker images for models).
  - Templating and distribution of model parameters, templates in a container image.
  - Near feature complete OpenAI compatible API as well as it's native native API that supports more advanced features such as model hot loading, context management, etc...
  - Native libraries for common languages.
  - Official container images for hosting.
  - Provides a client/server model for running remote or local inference servers with either Ollama or openai compatible clients.
  - Support for both an official and self hosted model and template repositories.
  - Support for multi-modal / Vision LLMs - something that llama.cpp is not focusing on providing currently.
  - Support for serving safetensors models, as well as running and creating models directly from their Huggingface model ID.
In addition to the llama.cpp engine, Ollama are working on adding additional model backends (e.g. things like exl2, awq, etc...).

Ollama is not "better" or "worse" than llama.cpp because it's an entirely different tool.


I think what you just said actually reinforces the point that ollama is a llama.cpp wrapper. I don't say that to disparage ollama, in fact I love ollama. It is an impressive piece of software. If x uses y under the hood, then we say "x is a y wrapper"


I mean.... is Debian just a libc6 wrapper? Is Firefox just a JavaScript wrapper?

Given Ollama currently has llama.cpp, mllama and safetensors backends, there's far more Ollama code and functionality than code that calls llama.cpp


why do you say "just" a wrapper? it is not a bad thing to be a wrapper, it is just a descriptive term

the amount of code does not dictate whether or not it is a wrapper.


The biggest frustration with Ollama is that it's very opinionated about the way it stores models for usage. If all you use is Ollama, that doesn't matter much, but it's frustrating when that GGUF needs to be shared with other things.


Ollama is so easy, what's the benefit to Llama.cpp?


If you're satisfied with ollama, I don't think there's a point to go "lower" and use llama.cpp directly.

Unless you want to tinker.


I set up llama.cop last week on my M3. Was fairly simple via homebrew. However, I get tags like <|imstart|> in the output constantly. Is there a way to filter them out with llama-server? Seems like a major usability issue if you want to use llama.cpp by itself (with the web interface).

ollama didn’t have the issue, but it’s less configurable.


I just gave this a shot on my laptop and it works reasonably well considering it has no discrete GPU.

One thing I’m unsure of is how to pick a model. I downloaded the 7B one from Huggingface, but how is anyone supposed to know what these models are for, or if they’re any good?


Read the README of the model. You'll probably find some benchmark metrics there that can tell you more-or-less how "good" the model is, but keep in mind that it's not that hard to artifically boost those scores, so don't reject every model that isn't at the top of the benchmark.

I've listed some good starting models at the end of the post. Usually, most LLM models like Qwen or Llama are general-purpose, some are fine-tuned for specific stuff, like CodeQwen for programming.


I use ChatGPT and Claude daily, but I can't see a use case for why would I use LLM outside of these services.

What do you use Llama.cpp for?

I get you can ask it a question in natural language and it will spit out sort of an answer, but what would you do with it, what do you ask it?


You can run a model with substantially similar capabilities to Claude or ChatGPT locally, with absolute data privacy guaranteed. Whereas with Claude or ChatGPT, all you can do is trust and hope they won’t use your data against you at some point in the future. If you’re more technically minded, you can hack on the model itself, the sampling method, etc., and have a level of fine-grained control over the technology that isn’t possible with a cloud model.


> You can run a model with substantially similar capabilities to Claude or ChatGPT locally

I am all for local models, but this is massively overselling what they are capable of on common consumer hardware (32GB RAM).

If you are interested in what your hardware can pull off, find the top-ranking ~30b models on lmarena.ai and initiate a direct chat with them on the same site. Pose your common questions and see if they are answered to your satisfaction.


Two points: 1) I actually think that smaller models are substantially similar to frontier models. Of course the latter are more capable, but they’re more similar than different (which I think the ELO scores on lmarena.ai suggests).

2) You can run much larger models on Apple Silicon with surprisingly decent speed.


I use llama.cpp mostly for working with code that i can't share with any online provider. Simple NDA stuff. Some refactors are easier to do via LLM than manually. It's a decent debugging duck too.


Do you know any tutorial that could help me set something like this up?


re Temperature config option: I've found it useful for trying to generate something akin to a sampling-based confidence score for chat completions (e.g., set the temperature a bit high, run the model a few times and calculate the distribution of responses). Otherwise haven't figured out a good way to get confidence scores in llama.cpp (Been tracking this git request to get log_probs https://github.com/ggerganov/llama.cpp/issues/6423)


Can someone tell me what the advantages are of doing this over using, e.g., the ChatGPT web interface? Is it just a privacy thing?


Privacy is a big one, but avoiding censorship and reducing costs are the other ones I’ve seen.

Not so sure about the reducing costs argument anymore though, you'd have to use LLMs a ton to make buying brand new GPUs worth it (models are pretty reasonably priced these days).


I never understand these guardrails. The whole point of llms (imo) is for quick access to knowledge. If I want to better understand reverse shell or kernel hooking, why not tell me? But instead, “sorry, I ain’t telling you because you will do harm” lol


Key insight: the guardrails aren't there to protect you from harmful knowledge; they're there to protect the company from all those wackos on the Internet who love to feign offense at anything that can get them a retweet, and journalists who amplify their outrage into storms big enough to depress company stock - or, in worst cases, attract attention of politicians.


There are also plausibly some guardrails resulting from oversight by three letter agencies.

I don't take everything Marc Andreessen said in his recent interview with Joe Rogan at face value, but I don't dismiss any of it either.


Privacy, available offline, software that lasts as long as the hardware can run it.


Yeah, that's me. nCapture a snapshot of it, from time to time — so if it ever goes offline (or off the rails: requires a subscription, begins to serve up ads), you have the last "good" one locally.

I have a snapshot of Wikipedia as well (well, not the whole of Wikipedia, but 90GB worth).


Which Wikipedia snapshot do you grab? I keep meaning to do this, but whenever I skim the Wikipedia downloads pages, they offer hundreds of different flavors without any immediate documentation as to what differentiates the products.


You can use Kiwix: https://kiwix.org/en/


wikipedia_en_all_maxi

I guess that means English ... and maxi? As I say, was something around 90GB or so.


Was hoping you had more insight than "maxi sounds good" which is the also the best I have.


Privacy, freedom, huge selection of models, no censorship, higher flexibility, and it's free as in beer.


Ability to have a stable model version with stable weights until the end of time


For work I routinely need to do translations of confidential documents. Sending those to some web service in a state that doesn't even have basic data protection guarantees is not an option.

Putting them into a local LLM is rather efficient, however.


This is a way to run open source models locally. You need the right hardware but it is a very efficient way to experiment with the newest models, fine tuning etc. ChatGPT uses massive model which are not practical to run on your own hardware. Privacy is also an issue for many people, particularly enterprise.


That said though, there's currently no practical way to reach ChatGPT's performance locally, but I have hope that will change eventually. Of course, even today you can reach quite good performance locally, and for a lot of people it's more than good enough for their needs.


I did a blog post about my preference for offline [1]. LLM's would fall under the same criteria for me. Maybe not so much the distraction-free aspect of being offline, but as a guard against the ephemeral aspect of online.

I'm less concerned about privacy for whatever reason.

[1] https://engineersneedart.com/blog/offlineadvocate/offlineadv...


You can chug through a big text corpus at little cost.


you get to find out all the steps!

meaning you learn more


Yeah, agreed. If you think artificial intelligence is going to be an important technology in the coming years, and you want to get a better understanding of how it works, it's useful to be able to run something that you have full control over. Especially since you become very aware what the shortcomings are, and you appreciate the efforts that go into running the big online models.


Yeah but I think of you've got a GPU you should probably think about using vllm. Last I tried using llama.cpp (which granted was several months ago) the ux was atrocious -- vllm basically gives you an openai api with no fuss. That's saying something as generally speaking I loathe Python.


You can also just download LMStudio for free, works out of the box.


There are many open source alternatives to LMstudio that work just as good.


[flagged]


If the original author (steelph0enix) is reading this, i just want to counter, the post of nisten and say: I liked your blogpost, i thought it was as concise as it needed to be - i have made notes from it and linked back to it from my notes, Thank you.


oh, hey, thanks!

i wonder what was there, i like spicy comments...


>Get off the f*king adderall,

Jikes




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: