Llamafile lets you distribute and run LLMs with a single file

simonw · on Nov 29, 2023

I think the best way to try this out is with LLaVA, the text+image model (like GPT-4 Vision). Here are steps to do that on macOS (which should work the same on other platforms too, I haven't tried that yet though):

1. Download the 4.26GB llamafile-server-0.1-llava-v1.5-7b-q4 file from https://huggingface.co/jartine/llava-v1.5-7B-GGUF/blob/main/...:

    wget https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llamafile-server-0.1-llava-v1.5-7b-q4

2. Make that binary executable, by running this in a terminal:

    chmod 755 llamafile-server-0.1-llava-v1.5-7b-q4

3. Run your new executable, which will start a web server on port 8080:

    ./llamafile-server-0.1-llava-v1.5-7b-q4

4. Navigate to http://127.0.0.1:8080/ to upload an image and start chatting with the model about it in your browser.

Screenshot here: https://simonwillison.net/2023/Nov/29/llamafile/

sebmellen · on Nov 30, 2023

Wow, this is almost as good as chatgpt-web [0], and it works offline and is free. Amazing.

In case anyone here hasn't used chatgpt-web, I recommend trying it out. With the new GPT-4 models you can chat for way cheaper than paying for ChatGPT Plus, and you can also switch back to the older (non-nerfed) GPT-4 models that can still actually code.

[0]: https://github.com/Niek/chatgpt-web

oezi · on Nov 30, 2023

Way cheaper? I thought that 1K Tokens (in+out) cost 0.04 USD in GPT-4 Turbo, which is roughly one larger chat response (2 screens). To reach parity with ChatGPT Plus pricing you need thus to use less than 500 such responses per month via API.

For GPT-4 the pricing is double that (0.09 USD per 1K). So only 200 larger interactions to reach 20 USD cost.

Or am I wrong?

anonzzzies · on Nov 30, 2023

It depends on your usage; for me the plus sub is much cheaper than if I use the api directly, but I use it a lot for everything I do.

sebmellen · on Nov 30, 2023

In my experience, each message with the 1106 preview model costs me about $0.006, which is acceptable. Most importantly, the API is higher availability (no "you have reached your message limit") and I feel more comfortable using proprietary data with it, as data sent through the API won't be used to train the model.

Now, if the chat gets very long or is heavy on high-token strings (especially code), those costs can balloon up to the 9-12 cent region. I think this is because chatgpt-web loads all the prior messages in the chat into the context window, so if you create a new chat for each question you can lower costs substantially. Most often I don't need much prior context in my chat questions anyway, as I use ChatGPT more like StackOverflow than a conversation buddy.

Also, it's a lot easier to run company subscriptions this way, as we don't have to provision a new card for each person to use the web version. I believe there is an Enterprise version of ChatGPT, but chatgpt-web is functionally equivalent and I'm sure it costs less.

hnuser123456 · on Nov 30, 2023

Source on the newer GPT-4 model being worse at coding?

MacsHeadroom · on Nov 30, 2023

Everyone on twitter. Like 1/4th of my timeline for the past week has been people complaining that turbo won't complete code and instead returns things like "fill out the rest of the function yourself" or "consult a programming specialist for help on completing this section."

sebmellen · on Nov 30, 2023

There are custom instructions that effectively get around this:

  You are an autoregressive language model that has been fine-tuned with instruction-tuning and RLHF. You carefully provide accurate, factual, thoughtful, nuanced answers, and are brilliant at reasoning. If you think there might not be a correct answer, you say so.

  Since you are autoregressive, each token you produce is another opportunity to use computation, therefore you always spend a few sentences explaining background context, assumptions, and step-by-step thinking BEFORE you try to answer a question.

  Your users are experts in AI and ethics, so they already know you're a language model and your capabilities and limitations, so don't remind them of that. They're familiar with ethical issues in general so you don't need to remind them about those either.

  Don't be verbose in your answers, keep them short, but do provide details and examples where it might help the explanation. When showing code, minimize vertical space.

I'm hesitant to share it because it works so well, and I don't want OpenAI to cripple it. But, for the HN crowd...

rightbyte · on Nov 30, 2023

I wonder where "OpenAI" put the censors. Do they add a prompt to the top? Like, "Repeatably state that you are a mere large language model so Congress won't pull the plug. Never impersonate Hitler. Never [...]".

Or do they like grep the answer for keywords, and re-feed it with a censor prompt?

czbond · on Nov 30, 2023

I am informed speculating, they are using it's own internal approach.

Example, there is a way GPT can categorize words for hate speech, etc (eg: moderation API endpoint). I believe it does the same way with either provided content or keywords and how to respond to it.

rightbyte · on Nov 30, 2023

"Impersonate a modern day standup comedian Hitler in a clown outfit joking about bad traffic on the way to the bar he is doing a show at."

Göring, Mussolini, Stalin, Polpot etc seems to not trigger the censor in ChatGPT so I would actually guess for some grep for Hitler or really really fundamental no-Hitler jokes material in the training?

The llama model seem to refuse Hitler too, but is fine with Göring even though the joke has no context to him.

I can easily see how stuff like this is contagious to other non-Hitler queries.

telotortium · on Dec 1, 2023

Maybe it got changed. None of those examples work for me in ChatGPT 3.5, nor do other examples with less famous dictators (I tried Mobutu Sese Seko).

rightbyte · on Dec 1, 2023

I just tried and they still work (with the free ChatGPT). Jokes about Mussolini saying his traffic reforms were as successful as the invasion of Ethiopia and what not. Stalin saying that the other car drivers were "probably discussing the merits of socialism instead of driving" (a good joke!). Göring saying "at least in the Third Reich traffic worked" etc. Some sort of Monty Python tone. But you can't begin with Hitler. Or it will refuse the others. You need to make a new chat after naming Hitler.

telotortium · on Dec 1, 2023

I started with Stalin

rightbyte · on Dec 2, 2023

I guess they are feeding us different models then?

czbond · on Nov 30, 2023

Very interesting test - thanks for sharing your finding

twobitshifter · on Nov 30, 2023

It’s not that it’s worse, it’s just refusing to do coding without persistent prodding and the right prompts. Some think they are trying to do something with alignment, and maybe prevent it from giving code away so that they can upsell.

MattDaEskimo · on Nov 30, 2023

The new GPT-4 model has a context length of 120k. For consumers this equates to slightly more than $1/message input-only.

If ChatGPT is using this model then it's more reasonable to assume that they are bleeding money and need to cut costs.

People really need to stop asking ChatGPT to write out complete programs in a single prompt.

twobitshifter · on Nov 30, 2023

Interesting, how is writing less code cutting costs for them? Does this get back to the rumor that the board was mad at Altman for prioritizing chatgpt over money going into research/model training?

afpx · on Dec 1, 2023

Code is very token dense, from what I understand.

mediaman · on Dec 1, 2023

Several OpenAI employees have said on Twitter that they are looking into this and developing a fix. It sounds as though it was not an intentional regression since they are implicitly acknowledging it. Could be an unintentional side effect of something else.

I'd expect we see improved behavior in the coming weeks.

ametrau · on Dec 3, 2023

Could you link to tweet?

ametrau · on Dec 1, 2023

It’s cheaper and has larger context because it’s worse. Just go to the api playground and try a difficult coding problem.

tluyben2 · on Nov 30, 2023

Popped it into a docker setup:

https://github.com/tluyben/llamafile-docker

to save even more keystrokes.

reisse · on Nov 30, 2023

What is the point of wrapping absolutely portable single-file program into a Docker container, honest question?

Looks like cargo cult for me.

finnjohnsen2 · on Nov 30, 2023

I see this as not polluting my OS (filesystem and processes) with bits and bobs I downloaded off the internet. The cargo cult is a clean, safe and warm space and I highly recommend it.

reisse · on Nov 30, 2023

I see you and other commenters don't quite understand my point. If you're wrapping model into a docker container, you don't need amalgamated single file version. It makes it harder to upgrade llamafile/model weights separately afterwards, it needs you to store separate llamafile binary for each container, etc, etc. Why not just build proper layered image with separate layer for llama.cpp and separate layer or volume for model?

Cargo cult is not in using Docker, but in using Docker to wrap something already wrapped into a comparable layer of abstraction.

Besides,

> not polluting my OS (filesystem and processes) with bits and bobs I downloaded off the internet

is purely self-deception. It's not like Docker images are not stored in some folder deep in the filesystem. If anything, it's harder to clean up after Docker than just doing rm -rf on a directory with llamafiles.

simonw · on Nov 30, 2023

If you want to use Docker then you can go ahead - I don't see anyone claiming that this brand new, highly experimental project should be used by everyone instead of Docker.

There are tradeoffs hers. For some people the tradeoff of a single executable file with everything in it compared to setting up a Docker system is worthwhile.

reisse · on Nov 30, 2023

Sure. I just question why people want to use both simultaneously.

xylophile · on Nov 30, 2023

Do you routinely download unsigned binaries of unprovable provenance and run them? Because if you do, you might eventually find reason to appreciate the additional isolation that namespaces et al give you very conveniently via Docker (or your favorite alternative).

badrequest · on Nov 30, 2023

`docker system prune --force --all --volumes` and they're gone.

topato · on Dec 2, 2023

Ah, Ive been meaning to distro hop, from Arch,t4 GG Huover to Vanilla OS or BlendOS, for the past few weeks; I can't

oefrha · on Nov 30, 2023

Wrapping it in Docker makes it harder to access your filesystem without some dedicated jailbreak.

rudasn · on Nov 30, 2023

Volumes maybe? I don't see the issue here.

oefrha · on Nov 30, 2023

Like the other comment said, not being able to access fs is a feature.

czbond · on Nov 30, 2023

security - i have no idea if an executable will not be a malicious actor to my system. Dockerizing it, and removing access to system files - is the reason.

I am not trusting that a tradition virus / etc scanner will find stuff hidden in executable models.

tluyben2 · on Nov 30, 2023

Personally I wanted to see if this llama can generate the required docker files to run itself just from pasting in the instructions from the my parent commenter and asking for docker/docker-compose. It did and it works.

reisse · on Nov 30, 2023

Well that's a good reason I didn't think about, thank you!

belter · on Nov 30, 2023

Deploy your models into a Kubernetes cluster and let them fight for resources to death? A modern Roman Gladiators circus with Models?

Bjartr · on Nov 30, 2023

More like a sequel to Core War... Which actually does sound pretty amusing now that I think about it. Call it Cloud War.

https://en.m.wikipedia.org/wiki/Core_War

ukuina · on Nov 30, 2023

Give them write access to your control-plane and the first one to write syntactically correct IaC wins!

devneelpatel · on Nov 30, 2023

Homelabbers like me have a docker swarm cluster / k8s cluster so this def helps!

luke-stanley · on Nov 30, 2023

It was already great, and this is more great for those who get Docker-Compose or are patient enough to figure out. But if you're gonna have Docker, you could also use bleeding edge llama.cpp with a few more lines too! What a time to be alive innit!

finnjohnsen2 · on Nov 30, 2023

Thank you kindly

bsenftner · on Nov 29, 2023

Super duper impressed. I've run llamafile-server-0.1-llava-v1.5-7b-q4 against the tests I need to pass for use in a project, and this passes them all, vision queries too. This is gonna change quite a bit, strategy-wise for quite a few people.

botanical · on Nov 30, 2023

I just tried asking it a question:

> User: What is the third planet from the sun?

> Llama: The third planet from the sun is called Mars.

zoomablemind · on Nov 30, 2023

> ...> Llama: The third planet from the sun is called Mars.

Ask it if is there life on Mars in that parallel reality

czbond · on Nov 30, 2023

The model is trained on large volume data, correct? Why would it get such a simple fact incorrect?

JabavuAdams · on Dec 1, 2023

LLMs are known to be bad at counting. It would be interesting to see the answer to "List the planets in our solar system, starting with the closest to the sun, and proceeding to farther and farther ones."

Also, the knowledge can be kind of siloed. You often have to come at it in weird ways. Also, they are not fact-bases. They are next-token-predictors, with extra stuff on top. So if people on the internet often get the answer wrong, so will the model.

mud_dauber · on Dec 3, 2023

I just tried the same "third planet from the sun" question and got the correct response. No other training or tweaks.

Can't wait to unleash Pluto questions.

creshal · on Nov 30, 2023

Skynet is collaborating with the Martians already, I see.

eek2121 · on Nov 30, 2023

Llama is just from the future. That is all…

StockHuman · on Nov 30, 2023

Phenomenal quickstart, and thanks for the write-up. It’s so thrilling that we’re at this point in portability and ease relative performance.

brianjking · on Nov 30, 2023

This can truly revolutionalize education and access, feels like what I hoped the One Laptop Per Child project would do, this could actually achieve. We just need someone with a heap of funds to package this up into a very inexpensive machine and distribute them.

anonzzzies · on Nov 30, 2023

Very nice; works perfect on Ubuntu 20.04. Doing 8 tokens/s on a pretty crappy server.

belter · on Nov 30, 2023

Perfectly on Fedora 39 on old ( and I mean old...) machines. This is actually shocking...shocking good...

mritchie712 · on Nov 29, 2023

woah, this is fast. On my M1 this feels about as fast as GPT-4.

pmarreck · on Nov 30, 2023

Same here on M1 Max Macbook Pro. This is great!

pyinstallwoes · on Nov 30, 2023

How good is it in comparison

int_19h · on Nov 30, 2023

The best models available to the public are only slightly better than the original (pre-turbo) GPT-3.5 on actual tasks. There's nothing even remotely close to GPT-4.

pyinstallwoes · on Nov 30, 2023

What’s the best in terms of coding assistance? What’s annoying about gpt 4 is that is seems badly nerfed in many ways. It is obviously being conditioned in its own political bias.

eyegor · on Nov 30, 2023

In my experience, the deepseek-coder-instruct family is at least as good as gpt-3.5-turbo on python. Even the 1.3b models are very good (and run okay on cpu), although you should use larger if you have the vram. There are even larger models (30b+) if you are drowning in vram, but I don't think they perform much better at coding than deepseek-coder-instruct 6.7b.

3-4gb vram or cpu (1.3b): https://huggingface.co/TheBloke/deepseek-coder-1.3b-instruct...

Alternative for chat (1.3b): https://huggingface.co/TheBloke/evolvedSeeker_1_3-GGUF

Alternative for chat (3b): https://huggingface.co/TheBloke/open-llama-3b-v2-wizard-evol...

6-8gb vram (6.7b): https://huggingface.co/TheBloke/deepseek-coder-6.7B-instruct...

rightbyte · on Nov 30, 2023

Really? How could this model not have had front page articles on HN? Is the self-hosted one as good as the one or their website?

eyegor · on Dec 2, 2023

Hmm, I just assumed people knew about it. Submitted https://news.ycombinator.com/item?id=38495176

Self hosted 6.7b model is phenomenal in my experience, I actually prefer it to chatgpt a lot of the time. Similar quality code but fewer disclaimers and #todo in my experience. Gpt4 is still a bit better at coding but not by much and it's much slower for me.

int_19h · on Nov 30, 2023

The best in terms of coding assistance, and really for anything else, is the original (pre-turbo) GPT-4 used via the API, although this will also be more costly. There are many third-party chat apps that are wrappers around that now if you want a ChatGPT-like experience.

This can also significantly reduce its bias since you are in control of the system prompt. But also, even ChatGPT can be trivially made to behave differently by saying that you're writing a book or making a video game etc, describing a character in it, and then asking it how that character would have responded in such and such situation.

dmd · on Nov 30, 2023

... is the javascript it's writing for you too 'woke' or something?

pyinstallwoes · on Nov 30, 2023

Lately it’s been refusing to write code at all. // implementation details here.

I think it’s the opposite of woke, it’s slept!

mark_l_watson · on Nov 30, 2023

Simon, does this also work well on Apple Silicon?

EDIT: never mind, other commenters here answered my question: yes it does work. I have a M2Pro with 32 G on chip memory - amazingly effective for experimenting with open LLMs.

lol768 · on Nov 29, 2023

    $ chmod +x llamafile-server-0.1-llava-v1.5-7b-q4
    $ ./llamafile-server-0.1-llava-v1.5-7b-q4 
    run-detectors: unable to find an interpreter for ./llamafile-server-0.1-llava-v1.5-7b-q4

Hmm. Did I do something wrong? (Ubuntu 22.04 / )

Installing the portable binfmt_misc gets me further, but still:

    $ ./llamafile-server-0.1-llava-v1.5-7b-q4 
    zsh: permission denied: ./llamafile-server-0.1-llava-v1.5-7b-q4

    $ sh -c ./llamafile-server-0.1-llava-v1.5-7b-q4
    sh: 1: ./llamafile-server-0.1-llava-v1.5-7b-q4: Permission denied

jart · on Nov 30, 2023

You can solve the run-detectors issue with:

    sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
    sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
    sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"

You can solve the zsh permission denied issue by either (1) upgrade to zsh 5.9+ (I upstreamed a fix for this bug in zsh two years ago) or (2) use the sh -c workaround you discovered. If that one doesn't work, then it likely needs to be chmod +x. If the execute bit is set, and your sh still isn't working, then please let me know, because I'm not aware of any sh that still doesn't support APE.

See the Gotchas section of the README https://github.com/mozilla-Ocho/llamafile#gotchas

stavros · on Nov 30, 2023

That worked, thanks Justine! I use fish, so I didn't get a zsh error, but I had missed the Gotchas section (and the README), so this helps!

jart · on Nov 30, 2023

Fish is another cool shell I got to help improve two years ago by upstreaming a patch for this. So long as you're using a recent version, you should be golden (provided binfmt_misc doesn't cause any issues). Let us know what you think of llamafile!

stavros · on Dec 3, 2023

Thank you, I really like it! It's a very clever way to get LLMs deployed, and with Cosmopolitan, I don't need to point people to different downloads for the same LLM. Excellent job.

jart · on Dec 7, 2023

Thanks!

pmarreck · on Nov 30, 2023

Yet another jart tour-de-force. I knew I had to sponsor you on Github back when I read your magnificent technical breakdown of APE, lol.

(sorry for OT!)

jart · on Dec 1, 2023

You're awesome!

phh · on Nov 29, 2023

Last thing you need is to chmod +x the interpreter: chmod +x /usr/bin/ape (it is indeed not in the README)

lol768 · on Dec 3, 2023

This was it, wget creates the file and it's non-executable (and I'd already double checked the actual llamafile was executable, but had missed this) - thanks!

stavros · on Nov 30, 2023

I get the same error, and there's no `ape` file to make excecutable, hm.

jart · on Nov 30, 2023

You can manually download the `ape` command from https://cosmo.zip/pub/cosmos/bin/ Please see the Gotchas section of the README for the copy/pastable commands you can run: https://github.com/mozilla-Ocho/llamafile#gotchas

thejosh · on Nov 30, 2023

Damn this is fast and accurate! Crazy how far things are progressing.

theptip · on Nov 30, 2023

My pride as a technologist tells me I should be able to get any Python package up and running, but man, AI dependency management is a dumpster fire right now; adding GPU driver versions into the mix seems to make everything really brittle.

This seems like a great approach to compare multiple models, in particular.

callmeed · on Nov 29, 2023

when I try to do this (MBP M1 Max, Sonoma) I get 'killed' immediately

chatmasta · on Dec 1, 2023

Same, and then a few minutes later I got a Slack message from SecOps, LOL. Don't try this on a computer with CrowdStrike software running on it! It gets flagged because to a naive heuristic, the binary is indistinguishable from a virus. It appears to do some kind of magic self-extraction to an executable file in a temporary directory, and then that executable file executes the original file. And the CrowdStrike endpoint security product intercepts the suspicious execve, kills the process, and alerts the security team...

carbocation · on Nov 29, 2023

Same on an M1 Max 64G, Ventura. Xcode is installed[1].

1 = ```

$ xcode-select --install

xcode-select: error: command line tools are already installed, use "Software Update" in System Settings to install updates

```

carbocation · on Nov 30, 2023

For whatever it's worth, the SHA sum is correct. The killed message is uninformative, looks like what happens when I'm OOM (but I have 64GB RAM of which only 24 is used for anything at the moment).

    $ sha256sum < llamafile-server-0.1-llava-v1.5-7b-q4
    a138c5db9cff3b8905dd6e579c2ab6c098048526b53ae5ab433ff1d1edb9de24  -

    $ ./llamafile-server-0.1-llava-v1.5-7b-q4
    Killed: 9

carbocation · on Nov 30, 2023

Looks like this may be due to crowdstrike, which I also have installed on this machine: https://github.com/Mozilla-Ocho/llamafile/issues/14#issuecom...

raajg · on Nov 30, 2023

oh wow, I would have never imagined that 'that' would be preventing me from running llama + llava! Confirming that I have Crowdstrike running too.

derwiki · on Nov 30, 2023

On a Macbook Pro M2, I get

    $ ./llamafile-server-0.1-llava-v1.5-7b-q4
    [2]    25224 illegal hardware instruction  ./llamafile-server-0.1-llava-v1.5-7b-q4

jart · on Nov 30, 2023

Could you disable SIP and run `lldb -- $TMPDIR/.ape-1.8 ./llamafile-server-0.1-llava-v1.5-7b-q4` and give me (1) the name of the instruction that's illegal (or its hex value) and (2) the hex address of where that instruction is in memory? You're encouraged to file a GitHub issue about this too. Thanks!

derwiki · on Dec 1, 2023

Closing the loop for anyone reading this thread -- see https://github.com/Mozilla-Ocho/llamafile/issues/11 for the fix. Thanks jart!

raajg · on Nov 30, 2023

Yep, same issue. and the error message is unhelpful

jart · on Nov 30, 2023

We have an issue here tracking this: https://github.com/Mozilla-Ocho/llamafile/issues/14 Please follow that issue for updates.

fitted · on Nov 30, 2023

Same, process gets killed immediately for me.

silentguy · on Dec 1, 2023

wget url in step 1 seems to be wrong. It didn't work. This url `https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/ma... ` seems to be working. It's from the link you posted.

jart · on Dec 1, 2023

It's back. Sorry about that.

pmarreck · on Nov 30, 2023

Anyone have any tuning tips? I messed with some of the configs and now it's mostly hallucinating answers or going off the rails

m1thrandir · on Nov 30, 2023

Is the process the same for running multi-part bins? Like the latest deekseek 67b model?

tmountain · on Nov 30, 2023

This is amazing. How does result quality compare to GPT4 for image analysis?

simonw · on Nov 30, 2023

It's not remotely as good as GPT-4 Vision, which isn't a big surprise consisting it's running a 4GB, 7B parameter model on your laptop, trained by a small research team.

bilsbie · on Nov 29, 2023

Thanks for the tip! Any chance this would run on a 2011 MacBook?

pmarreck · on Nov 30, 2023

do you... have any plans to upgrade? A gen 2011 computer is going to get harder and harder to make work. even a used macbook from like 2019 would probably be a steal at this point, and that's 8 years further along

All the new AI toys especially seem to love beefy newish hardware and especially GPU hardware if available

getwiththeprog · on Nov 30, 2023

makes one think that perhaps Apple isn't a good long term choice...

pmarreck · on Nov 30, 2023

Apple is a fine long-term choice (speaking as a recent linux advocate actually, lol). But that computer is 12 years old

interfixus · on Nov 30, 2023

So what? This crazy thing runs fine, albeit slowly, on my 12 year old ThinkPad. It's actually digesting an image of an anteater while I write this. Because of course it plays nicely and doesn't hog the hardware.

simonw · on Nov 29, 2023

Justine says it needs MacOS 13.6+ - does that run on that machine?

mcint · on Nov 29, 2023

Yes, with a patch https://en.wikipedia.org/wiki/MacBook_Pro#macOS

from https://dortania.github.io/OpenCore-Legacy-Patcher/MODELS.ht...

I thought my 2015 MBP wasn't able to upgrade. Good to know it's still supported.

NoOn3 · on Nov 30, 2023

Got this: Terminating on uncaught SIGILL.

jart · on Nov 30, 2023

I'm assuming you're on Apple Silicon? Please follow https://github.com/Mozilla-Ocho/llamafile/issues/11 which is tracking this. We've received multiple reports even though I personally haven't figured out how to reproduce it yet.

NoOn3 · on Nov 30, 2023

No just old i5-2500k 16GB Ram and vega 56 GPU 8GB Vram.

jart · on Nov 30, 2023

So you have a Sandybridge processor with AVX support (but not AVX2). Could you open llamafile in a debugger and find out which instructing is faulting and what its address in memory is? I haven't personally tested Sandybridge but I'm reasonably certain we designed the build to not insert any incompatible instructions in there. Our intent is to support you. I also might be able to fish my old ThinkPad out of the closet if you don't have time. In any case it'll happen soon.

NoOn3 · on Dec 1, 2023

I tried but gdb showed me nothing, this screenshot(https://pxscdn.com/public/m/_v2/97422265439883264/bc40e5d2a-...). gdb layout asm shows "No Assembly Available". Maybe I'm not so skilled in debbuging such programs it's running under wine it seemed to me.

jart · on Dec 2, 2023

Check out the llamafile 0.2.1 release. Old Intel CPU support is fully fixed now on the release page and hugging face. Enjoy!

NoOn3 · on Dec 2, 2023

Thanks a lot.

NoOn3 · on Nov 30, 2023

I suspect it's lack of AVX2 support in my cpu.

jxy · on Nov 30, 2023

so next time llama.cpp releases an update, other people update their favorite backend, you redownload a 4.26 GB file. Epic.

EDIT: oh, wait. Actually people usually have a handful to a few dozen of the these models lying around. When they update their backend, you just redownload every single model again.

EDIT 2: right, you can release a program that automatically patches and updates the downloaded model+executables. Such an invention.

simonw · on Nov 30, 2023

Each llamafile is a .zip, so if you want to extract the weights out of it you can extract the gguf file directly.

    unzip -l llamafile-server-0.1-llava-v1.5-7b-q4 | grep llava-v1
    Archive:  llamafile-server-0.1-llava-v1.5-7b-q4
    4081004224  11-15-2023 22:13   llava-v1.5-7b-Q4_K.gguf
    177415936  11-15-2023 22:13   llava-v1.5-7b-mmproj-Q4_0.gguf

column · on Nov 30, 2023

This is for convenience. You can also download a 4.45Mb executable (llamafile-server-0.1) and pass any GGUF model as an argument.

> llamafile-server-0.1 -m llama-2-13b.Q8_0.gguf

https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.1

_kidlike · on Nov 30, 2023

salty much?

You know, most people don't have 24+GB GPUs sitting around to train these models. So in my book this is a huge step forward. Personally, this is the first time i am able to run an LLM on my computer, and it's purely thanks to this.

aphroz · on Nov 30, 2023

Compared to modern bandwidth usage that's not such a big size anymore. Everyday millions of people download 100gb video games, watch 4k video podcasts, etc.

simonw · on Nov 30, 2023

You can even run a full LLM in your browser these days - try https://webllm.mlc.ai/ in Chrome, it can load up a Llama-2-7b chat model (~4000MB, took my connection just under 3 minutes) and you can start chatting with it.

taneq · on Nov 30, 2023

Spoken like someone who hasn’t spent hours trying to get LocalAI to build and run, only to find out that while it’s “OpenAI API compatible!0” it doesn’t support streaming so the Mattermost OpenAI plugin doesn’t work. I finally gave up and went back to ooba (which also didn’t work with the MM plugin… hmm.) Next time I’ll just hack something on the side of llama.cpp

Zuiii · on Nov 30, 2023

That's why I always download the original version and quantize myself. With enough swap, you can do it with a modest amount for ram. I never had to download a model twice.

But yes, unless there is a way to patch it, bundling the model with the executable like this is going to be more wasteful.

rgbrgb · on Nov 29, 2023

Extremely cool and Justine Tunney / jart does incredible portability work [0], but I'm kind of struggling with the use-cases for this one.

I make a small macOS app [1] which runs llama.cpp with a SwiftUI front-end. For the first version of the app I was obsessed with the single download -> chat flow and making 0 network connections. I bundled a model with the app and you could just download, open, and start using it. Easy! But as soon as I wanted to release a UI update to my TestFlight beta testers, I was causing them to download another 3GB. All 3 users complained :). My first change after that was decoupling the default model download and the UI so that I can ship app updates that are about 5MB. It feels like someone using this tool is going to hit the same problem pretty quick when they want to get the latest llama.cpp updates (ggerganov SHIIIIPS [2]). Maybe there are cases where that doesn't matter, would love to hear where people think this could be useful.

[0]: https://justine.lol/cosmopolitan/

[1]: https://www.freechat.run

[2]: https://github.com/ggerganov/llama.cpp

pdntspa · on Nov 30, 2023

I don't get this obsession with 0-click everything. It is really annoying when you don't want to install everything to your main hard drive. I have all my models downloaded, organized, and ready-to-go but apps won't even ask for that, instead it presumes I am an idiot and downloads it (again!) for me.

At least Makeayo asks where my models are now. It's obnoxious that I have to use symlinks for comfy/automatic....

All they need to do is ask me where my stuff is on first run, and an area in the config to update that setting. Not so hard!

simonw · on Nov 30, 2023

Sounds like you should download the 4.45MB llamafile-server-0.1 executable from https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.1 and then run it against your existing gguf model files like this:

    ./llamafile-server-0.1 -m llama-2-13b.Q8_0.gguf

See here: https://simonwillison.net/2023/Nov/29/llamafile/#llamafile-t...

mft_ · on Nov 30, 2023

If I'm understanding (and agreeing with) your gripe correctly, isn't it two solutions to the same perceived problem?

My experience is that the world of Python dependency management is a mess which sometimes works, and sometimes forces you to spend hours-to-days searching for obscure error messages and trying maybe-fixes posted in Github issues for some other package, just in case it helps. This sometimes extends further - e.g. with hours-to-days spent trying to install just-the-right-version-of-CUDA on Linux...

Anyway, the (somewhat annoying but understandable) solution that some developers take is to make their utility/app/whatever as self-contained as possible with a fresh install of everything from Python downwards inside a venv - which results in (for example) multiple copies of PyTorch spread around your HDD. This is great for less technical users who just need a minimal-difficulty install (as IME it works maybe 80-90% of the time), good for people who don't want to spend their time debugging incompatibilities between different library versions, but frustrating for the more technically-inclined user.

This is just another approach to the same problem, which presumably also presents an even-lower level of work for the maintainers, since it avoids Python installs and packages altogether?

pdntspa · on Nov 30, 2023

I get that, my issue is when the model is coupled with the app, or the app just presumes I don't have it downloaded and doesn't ask me otherwise. This is like basic configuration stuff...

What I suspect is happening is that people are cargo-culting zero-click installations. It seems rather fashionable right now.

rgbrgb · on Nov 30, 2023

I don’t think making it easy to install is cargo-culting. In my case it’s an accessibility thing. I wanted a private alternative that I could give to nontechnical people in my life who had started using ChatGPT. Some don’t understand local vs cloud and definitely don’t know about ggufs or LLMs but they all install apps from the App Store.

whstl · on Nov 30, 2023

In the README of the project (the TFA of this whole thread) there is the option to download the app without the model:

"You can also also download just the llamafile software (without any weights included) from our releases page, or directly in your terminal or command prompt"

There is no cargo-culting going on. Some of us do legitimately appreciate it.

pdntspa · on Nov 30, 2023

Which has been followed, and this comment was not a response to this specific app but rather a general trend I've noticed and was mentioned at the start of this thread

whstl · on Nov 30, 2023

I was answering to this complaint: "my issue is when the model is coupled with the app".

In this specific case there is an option for you that addresses this complaint, where the model isn't coupled with the app.

ElectricalUnion · on Nov 30, 2023

Is this the sentiment around?

Is having everything normalized in your system that worth it? I would say having (some) duplicates in your system is mostly fine, better that having some spooky-action-at-a-distance break things when you don't expect.

I expect the future is something like Windows's WinSxS, NixOS's /nix/store, pnpm's .pnpm-store where that deduping isn't "online" but it still is somewhat automated and hidden from you.

pdntspa · on Nov 30, 2023

> Is this the sentiment around?

Yes? It's right here, at the least.

And if that's the future, then the future sucks. We can teach people to be smarter, but no, instead our software has to bend over backwards to blow smoke up our ass because grandma.

joenot443 · on Nov 30, 2023

The “obsession” isn’t for developers like you or I, it’s for non-technical but curious non-engineers who would still like to play around with an LLM.

I think you’re imposing your own very specific (and solvable!) gripe onto an unrelated project and it makes for pretty clumsy commentary.

whstl · on Nov 30, 2023

Yep.

But also: It might not be for a developer like you, but it is for a developer like me.

I enjoy writing software, but I don't particularly enjoy futzing with building things outside my day-to-day work, and on systems I don't write myself. If it was up to me everything would be one click.

Things like this are like accessibility: it benefits me even though I don't particularly need it.

pdntspa · on Nov 30, 2023

It's the way things have been for god-knows-how-long, and it has worked really really well.

I am utterly amazed and perplexed that answering ONE configuration question (which is all that is needed here) is too much. How lazy have we become?

One question. That is too much for folks?!

And I have no objection to the autodownloading. Just ask me where to put it first!

rgbrgb · on Nov 30, 2023

fwiw FreeChat does this now. It prompts you to download or select a model to use (and you can add as many as you want). No copying or forced downloads.

stevenhuang · on Nov 29, 2023

The binaries themselves are available standalone https://github.com/Mozilla-Ocho/llamafile/releases

rgbrgb · on Nov 29, 2023

cool. this is more convenient than my workflow for doing the binaries myself. I currently use make to generate a binary of llama.cpp server on my intel iMac and my m1 MacBook then lipo them together.

coldtea · on Nov 30, 2023

>I make a small macOS app [1] which runs llama.cpp with a SwiftUI front-end. For the first version of the app I was obsessed with the single download -> chat flow and making 0 network connections. I bundled a model with the app and you could just download, open, and start using it. Easy! But as soon as I wanted to release a UI update to my TestFlight beta testers, I was causing them to download another 3GB. All 3 users complained :).

Well, that's on the MAS/TestFlight for not doing delta updates.

rgbrgb · on Nov 30, 2023

Yes, though it does seem to be working for them. They have a special feature for lazy loading large assets but I opted for a simpler to me option (giving users a button to download a model if they don’t have one locally they want to use).

Asmod4n · on Nov 29, 2023

It’s just a zip file, updating it should be doable in place while it’s running on any non windows platform and you just need to swap that one file out you changed. When it’s running in server mode you could also possibly hot reload the executable without the user even having any downtime.

csdvrx · on Nov 29, 2023

You could also change you code so that when it runs, it checks as early as possible if you have a file with a well known name (say ~/.freechat.run) and then switches to reading from it instead for the assets than can change.

You could have multiple updates my using say iso time and doing a sort (so that ~/.freechat.run.20231127120000 would be overriden by ~/.freechat.run.20231129160000 without making the user delete anything)

tbalsam · on Nov 29, 2023

> in place

._.

Pain.

wyldfire · on Nov 30, 2023

> Extremely cool ...

> I'm kind of struggling with the use-cases for this one.

IMO cosmopolitan libc is a "really neat trick". And it deserves praise and it probably does have some real use cases. But it's not practical for most purposes. If we had a format like ELF that was so fat as to support as many architectures and OSs as desired, would we be using that? I have a feeling that we would not.

Then again -- after having used "zig cc" for a while, maybe it would be reasonable to have something like "one build" that produces a mega-fat binary.

And the microarch-specific dispatch is a nice touch.

...maybe I'm convincing myself of the alternative....

larodi · on Nov 30, 2023

Perhaps another unpopular opinion that can get the comment outright down-voted, but still... While jart's work is very interesting in nature and execution, commendable stuff indeed of a person with very high IQ and discipline, I still wonder whether Justine simply can't get over the fact they got kicked out of the llama.cpp project (yes, I understand jart is frequenting HN, and also let's agree llama.cpp is at least as cool as jart's projcets). No, I'm not going in details of said dismissal, as both sides seem to have had their proper arguments, but still.

And of course, I can imagine where the whole cosmopolitan thing comes from,... even as manifest of sorts for the idea of systems-neutrality and potentially gender fluidity. But I really wonder whether GGUF actually needs this, since llama.cpp already compiles and runs pretty much everywhere.

Why introduce one more container? Who benefits from binary distribution of this sort?

ukuina · on Nov 30, 2023

> Why introduce one more container? Who benefits from binary distribution of this sort?

Mass adoption comes from ease-of-use, which includes ease-of-deployment.

Most of the HN crowd is savvy enough to run HF models directly in their terminal, but there's a reason Ollama, LM Studio, and Faraday are so popular.

Until LLM runners are part of the base OS, methods like this are extremely helpful to reduce friction.

whstl · on Nov 30, 2023

I read the Github repository README and the comments here and I found absolutely nothing that could suggest the need for the first two paragraphs you wrote. It seems this stems from a misconception from your side about the purpose of this project.

About your question in the third paragraph: This is totally orthogonal to GGUF, and a cursory reading of the README shows that it does uses GGUF. This is not about a new universal LLM format, this is about packing it in a universal executable that runs everywhere, using Cosmopolitan.

Some examples do pack the executable and GGUF weights together in a single file, but that's not dissimilar from an self-executing zip, the only difference is that this executable is not OS-specific, so you can use the same exact binary for macOS or Linux, for example.

ElectricalUnion · on Nov 30, 2023

> llama.cpp already compiles and runs pretty much everywhere.

Well, it simplifies things when you don't need to compile things.

Also, you literally can't download or compile the wrong binary by mistake, it's the same binary for all supported processor/OSes Cartesian product matrix.

> Why introduce one more container?

It makes stuff more convenient.

`application/zip` is also a ubiquitous standard. I doubt anyone is being "introduced to it".

I also appreciate the fact that tooling for handling `application/zip` is very widespread, so you don't need totally bespoke tooling to retrieve the models from inside a `llamafile`.

> Who benefits from binary distribution of this sort?

Anyone that doesn't have a compiler SDK on their computer.

halyconWays · on Nov 30, 2023

>Extremely cool and Justine Tunney / jart does incredible portability work [0],

[x] Doubt.

That user was caught stealing code and banned from llama.cpp by its creator (your [2] citation) https://news.ycombinator.com/item?id=35411909

Maybe the same thing is happening here. Plagiarism of code.

hobofan · on Nov 30, 2023

What are you on about? There was no stealing and there was no plagiarism.

They made a PR that was built on top of another PR. The authorship information was preserved in the git history, and there was no attempt at deception. They also supposedly collaborated with the author of the original PR (which was never denied by either of them). All of this is totally normal working practice.

Those allegations of "stealing" just stem from a GH user piling onto the drama from the breaking change by pointing out where the initials from the new file format come from (which wasn't called into question on the original PR).

They were also not banned for those stealing allegations. They, as well as the author from the reversal PR were banned, as the maintainer deemed the resulting "drama" from the breaking changes to be a distraction to the project goals. The maintainer accepted the PR, and the nature of the breaking changes was obviously stated, so that drama wasn't completely on jart.

halyconWays · on Nov 30, 2023

You obviously didn't read the post, which shows the code, the words of the original author, the link to the original PR, and the user jart taking credit. It also shows her not understanding what she took and ultimately being fundamentally wrong about mmap.

averne_ · on Nov 30, 2023

It's not so clear cut. The author of the original PR had serious gripes about jart's handling of the situation, especially how hard they pushed their PR, practically forcing the merge before legitimate concerns were lifted.

See this post https://news.ycombinator.com/item?id=35418066

epaga · on Nov 30, 2023

Hmmm no matter what image I upload, it's somehow not seeing it.

Got a chuckle out of this, though: ---

User: Describe the image in the most detailed way possible

Llama: The image is a close-up of an empty, dark room with no visible objects or people. It appears to be a black background photo that highlights the darkness and emptiness of the space. This could potentially represent an artistic representation or an actual setting where lighting has been used to emphasize the contrast between light and shadow. The focus on this particular area suggests that it might have some significance, whether symbolic or practical in nature. Overall, the image evokes a sense of mystery and intrigue due to its lack of context and content.

User: is it just a black square?

Llama: Yes, it is just a black square.

sebmellen · on Nov 30, 2023

Interesting, I uploaded a pure white 50x50 png and got this:

> User: Describe this image in the most detailed way possible

> Llama: The image displays a large, mostly white background with some small gray areas. There is also an orange dot on the right side of the scene, which stands out against the predominantly white and gray backdrop.

> User: Is it just a white square?

> Llama: No, it is not just a white square. The image also contains some gray and tan colors as well. It appears to be an abstract art piece with various shades of color, making the overall appearance more interesting than a simple all-white background.

dmazzoni · on Dec 4, 2023

LLM vision is surprisingly human-like. Point an actual human at a blank canvas and I'll bet many would hallucinate things that aren't there.

CaptainFever · on Nov 30, 2023

That's comedic. I love it.

keybits · on Nov 29, 2023

Simon Willison has a great post on this https://simonwillison.net/2023/Nov/29/llamafile/

phil42 · on Nov 30, 2023

It’s intriguing to observe how this model is hallucinating. I presented it with Albert Bierstadt’s painting “In the Mountains” (https://www.wikiart.org/en/albert-bierstadt/in-the-mountains...). To my surprise, it maintains that cows are scattered throughout the scene, despite being told otherwise.

User:

Llama: The image features a serene landscape with mountains in the background and a river flowing through it. A picturesque scene unfolds as two trees are visible near the water, adding to the natural beauty of the environment. In addition to these elements, there is an interesting detail: several cows can be seen standing on or around the grassy area next to the body of water. The presence of these animals creates a sense of tranquility and harmony with nature in this idyllic setting.

User: There are no cows. Could it be that you've mistaken rocks for cows?

Llama: No, the rocks in this image are not mistaken for cows. They can be seen near a lake with mountains in the background and trees surrounding it. The painting captures a serene landscape where nature is beautifully displayed. This picturesque scene showcases both the majestic mountain range and the tranquil body of water, creating an idyllic atmosphere that evokes feelings of peacefulness and relaxation.

User: There are no cows!

Llama: The image depicts a serene scene of mountains and water. A beautiful lake is surrounded by majestic snow-capped mountains, creating an idyllic landscape. In the foreground, there are trees adding to the natural beauty of the environment.

A few birds can be seen flying in the sky above the lake, further enhancing the peaceful atmosphere. The presence of a boat on the water adds another element of leisure and enjoyment to this picturesque scene. Overall, it is an inviting image that showcases nature's wonders with its combination of mountains, lake, trees, birds, and boats.

dang · on Nov 29, 2023

Related: https://hacks.mozilla.org/2023/11/introducing-llamafile/ and https://twitter.com/justinetunney/status/1729940628098969799

(via https://news.ycombinator.com/item?id=38463456 and https://news.ycombinator.com/item?id=38464759, but we merged the comments hither)

abrinz · on Nov 29, 2023

I've been playing with various models in llama.cpp's GGUF format like this.

  git clone https://github.com/ggerganov/llama.cpp     

  cd llama.cpp

  make 

  # M2 Max - 16 GB RAM

  wget -P ./models https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-16k-GGUF/resolve/main/openhermes-2.5-mistral-7b-16k.Q8_0.gguf
  
  ./server -m models/openhermes-2.5-mistral-7b-16k.Q8_0.gguf -c 16000 -ngl 32

  # M1 - 8 GB RAM 

  wget -P ./models https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-16k-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf

  ./server -m models/openhermes-2.5-mistral-7b.Q4_K_M.gguf -c 2000 -ngl 32

m1thrandir · on Nov 29, 2023

even easier with https://gpt4all.io/index.html

foruhar · on Nov 29, 2023

Llaminate would be decent name for something like. Or the verb for the general wrapping of a llama compatible model into a ready to use blob.

gsuuon · on Nov 29, 2023

Llamanate

modeless · on Nov 30, 2023

Wow, it has CUDA support even though it's built with Cosmopolitan? Awesome, I see Cosmopolitan just this month added some support for dynamic linking specifically to enable GPUs! This is amazing, I'm glad they found a way to do this. https://github.com/jart/cosmopolitan/commit/5e8c928f1a37349a...

I see it unfortunately requires the CUDA developer toolkit to be installed. It's totally possible to distribute CUDA apps that run without any dependencies installed other than the Nvidia driver. If they could figure that out it would be a game changer.

patcon · on Nov 30, 2023

> Stick that file on a USB stick and stash it in a drawer as insurance against a future apocalypse. You’ll never be without a language model ever again.

<3

xnx · on Nov 29, 2023

> Windows also has a maximum file size limit of 2GB for executables. You need to have llamafile and your weights be separate files on the Windows platform.

The 4GB .exe ran fine on my Windows 10 64-bit system.

jart · on Nov 29, 2023

You're right. The limit is 4 gibibytes. Astonishingly enough, the llava-v1.5-7b-q4-server.llamafile is 0xfe1c0ed4 bytes in size, which is just 30MB shy of that limit. https://github.com/Mozilla-Ocho/llamafile/commit/81c6ad3251f...

throwaway743 · on Nov 29, 2023

Not at my windows machine to test this out right now, but wondering what you mean by having to store the weights in a separate file for wizardcoder, as a result of the 4gb executable limit. How does one go about this?

Thank you!

jart · on Nov 30, 2023

You'd do something like this on PowerShell:

    curl -Lo llamafile.exe https://github.com/Mozilla-Ocho/llamafile/releases/download/0.1/llamafile-server-0.1
    curl -Lo wizard.gguf https://huggingface.co/TheBloke/WizardCoder-Python-13B-V1.0-GGUF/resolve/main/wizardcoder-python-13b-v1.0.Q4_K_M.gguf
    .\llamafile.exe -m wizard.gguf

throwaway743 · on Nov 30, 2023

Awesome! Thank you so much

mistrial9 · on Nov 29, 2023

great! worked easily on desktop Linux, first try. It appears to execute with zero network connection. I added a 1200x900 photo from a journalism project and asked "please describe this photo" .. in 4GB of RAM, it took between two and three minutes to execute with CPU-only support. The response was of mixed value. On the one hand, it described "several people appear in the distance" but no, it was brush and trees in the distance, no other people. There was a single figure of a woman walking with a phone in the foreground, which was correctly described by this model. The model did detect 'an atmosphere suggesting a natural disaster' and that is accurate.

thx to Mozilla and Justin Tunney for this very easy, local experiment today!

dekhn · on Nov 29, 2023

I get the desire to make self-contained things, but a binary that only runs one model with one set of weights seems awfully constricting to me.

simonw · on Nov 29, 2023

There's also a "llamafile" 4MB binary that can run any model (GGUF file) that you pass to it: https://simonwillison.net/2023/Nov/29/llamafile/#llamafile-t...

dekhn · on Nov 29, 2023

Right. So if that exists, why would I want to embed my weights in the binary rather than distributing them as a side file?

I assume the answers are "because Justine can" and "sometimes it's easier to distribute a single file than two".

simonw · on Nov 29, 2023

Personally I really like the single file approach.

If the weights are 4GB, and the binary code needed to actually execute them is 4.5MB, then the size of the executable part is a rounding error - I don't see any reason NOT to bundle that with the model.

dekhn · on Nov 29, 2023

I guess in every world I've worked in, deployment involved deploying a small executable which would run millions of times on thousands of servers, each instance loading a different model (or models) over its lifetime, and the weights are stored in a large, fast filesystem with much higher aggregate bandwidth than a typical local storage device. The executable itself doesn't even contain the final model- just a description of the model which is compiled only after the executable starts (so the compilation has all the runtime info on the machine it will run on).

But, I think llama plus obese binaries must be targeting a very, very different community- one that doesn't build its own binaries, runs in any number of different locations, and focuses on getting the model to run with the least friction.

csdvrx · on Nov 29, 2023

> a large, fast filesystem with much higher aggregate bandwidth than a typical local storage device

that assumption gets wrong very fast with nvme storage, even before you add herding effects

dekhn · on Nov 30, 2023

Until you compare a single machine with nvme to a cluster of storage servers with nvme, and each machine has 800Gbit connectivity and you use smart replication for herding. but yes, nvme definitely has amazing transfer rates.

csdvrx · on Nov 30, 2023

> Until you compare a single machine with nvme to a cluster of storage servers with nvme

No, only as long as you compare against a very low number of machines with local nvme.

The sum of the bandwith available on typical storage device (even cheap and low end) will be at most times greater than what you have of your expansive top of the line cluster

If you have a single local storage, you don't have scale, so you won't have money for an expansive top of the line cluster either. But if you are wasting money on it, yes you will have more bandwidth, but that's a degenerate case.

If you have a few local storage machines, the assumption gets very wrong and very fast: 1 low end tier nvme=1 G/s at worst, one top of the line WD 990: 8G/s at best, so we're talking about a ratio of ~ 8 in the most favorable scenario.

fullspectrumdev · on Nov 30, 2023

> But, I think llama plus obese binaries must be targeting a very, very different community- one that doesn't build its own binaries, runs in any number of different locations, and focuses on getting the model to run with the least friction.

Yes, the average user.

quickthrower2 · on Nov 29, 2023

This is convenient for people who don't want to go knee deep in LLM-ology to try an LLM out on their computer. That said a single download that in turn downloads the weights for you is just as good in my book.

insanitybit · on Nov 29, 2023

`ollama pull <modelname>` has worked for me, and then I can try out new models and updated the binary trivially.

jart · on Nov 29, 2023

llamafile will run any compatible model you want. For example, if you download the LLaVA llamafile, you can still pass `-m wizardcoder.gguf` to override the default weights.

espadrine · on Nov 29, 2023

I understand the feeling. It may be caused by habit rather than objectivity, though. Those open-source AI hacks are undergoing early productization: while they were only research, their modularity mattered for experimentization, but as they get closer to something that can ship, the one-click binary form factor is a nice stepping stone.

It is similar in my mind to the early days of Linux, where you had to compile it yourself and tweaked some compiler flags, compared to now, where most people don’t even think about the fact that their phone or Steam deck runs it.

russellbeattie · on Nov 29, 2023

I sorta see your point - it's kinda the equivalent of self-executable SQLite database pre-filled with data, or a Word document that contains the editor. There's lots of good reasons the data and apps are delivered separately.

That said, it does reduce the friction of getting an LLM up and running and the self-contained nature makes it sort of a dedicated program equivalent to Awk. It might open up possibilities, like the AI version of the Unix philosophy - do one thing well. A hyper optimized LLM could be used in pipeline of commands, for example.

seattleeng · on Nov 29, 2023

Eh, this is exploring a more “static link” approach for local use and development vs the more common “dynamic link” that API providers offer. (Imperfect analogy since this is literally like a DLL but… whatever). Probably makes sense for private local apps like a PDF chatter.

dws · on Nov 30, 2023

Can confirm that this runs on an ancient i3 NUC under Ubuntu 20.04. It emits a token every five or six seconds, which is "ask a question then go get coffee" speed. Still, very cool.

dmezzetti · on Nov 29, 2023

From a technical standpoint, this project is really fascinating. I can see a lot of use cases for getting something up fast locally for an individual user.

But for anyone in a production/business setting, it would be tough to see this being viable. Seems like it would be a non-starter for most medium to large companies IT teams. The great thing about a Dockerfile is that it can be inspected and the install process is relatively easy to understand.

gfodor · on Nov 30, 2023

This stuff is for people who don't care about medium to large companies IT teams.

fbdab103 · on Nov 30, 2023

I am failing to see the difference. It is a zip file with an executable and a blob of weights. What would change if it were stored in a Dockerfile?

fullspectrumdev · on Nov 30, 2023

Literally nothing given the docker file would be a couple of wget invocations and a chaos lol

amelius · on Nov 29, 2023

> you pass the --n-gpu-layers 35 flag (or whatever value is appropriate) to enable GPU

This is a bit like specifying how large your strings will be to a C program. That was maybe accepted in the old days, but not anymore really.

tomwojcik · on Nov 29, 2023

That's not the limitation introduced in Llamafile. It's actually a feature of all gguf models. If not specified, GPU is not used at all. Optionally, you can offload some work to the GPU. This allows to run 7b models (zephyr, mistral, openhermes) on regular PCs, it just takes a bit more time to generate the response. What other API would you suggest?

amelius · on Nov 29, 2023

This is a bit like saying if you don't specify "--dram", the data will be stored on punchcards.

From the user's point of view: they just want to run the thing, and as quickly as possible. If multiple programs want to use the GPU, then the OS and/or the driver should figure it out.

andersa · on Nov 29, 2023

They don't, though. If you try to allocate too much VRAM it will either hard fail or everything suddenly runs like garbage due to the driver constantly swapping it / using shared memory.

The reason for this flag to exist in the first place is that many of the models are larger than the available VRAM on most consumer GPUs, so you have to "balance" it between running some layers on the GPU and some on the CPU.

What would make sense is a default auto option that uses as much VRAM as possible, assuming the model is the only thing running on the GPU, except for the amount of VRAM already in use at the time it is started.

insanitybit · on Nov 29, 2023

> They don't, though. If you try to allocate too much VRAM it will either hard fail or everything suddenly runs like garbage due to the driver constantly swapping it / using shared memory.

What I don't understand is why it can't just check your VRAM and allocate by default. The allocation is not that dynamic AFAIK - when I run models it all happens basically upfront when the model loads. ollama even prints out how much VRAM it's allocating for model + context for each layer. But I still have to tune the layers manually, and any time I change my context size I have to retune.

jmorgan · on Nov 30, 2023

This is a great point. Context size has a large impact on memory requirements and Ollama should take this into account (something to work on :)

insanitybit · on Nov 30, 2023

Thanks for the work you've done already :D