Accessing Llama 2 from the command-line with the LLM-replicate plugin (simonwillison.net)
200 points by simonw on July 18, 2023 | 43 comments



I'm so confused about running these models locally only. When reading about the llm tool, I thought, ok, this helps organize all the pieces on my machine. But then it uses a replicate API key, so clearly it requires a network connection. Is this just to download the models? I feel like we need a new license or packaging model that clearly states whether the computation is happening locally or remotely. It's very important to me and often hard to know until I'm a long way into the installation process.


For those getting started, the easiest one click installer I've used is Nomic.ai's gpt4all: https://gpt4all.io/

This runs with a simple GUI on Windows/Mac/Linux, leverages a fork of llama.cpp on the backend, and supports GPU acceleration as well as LLaMA, Falcon, MPT, and GPT-J models. It also has API/CLI bindings.

I just saw a slick new tool, https://ollama.ai/, that lets you run llama2-7b with a single `ollama run llama2` command. It has a very simple 1-click installer for Apple Silicon Macs (you need to build from source for anything else atm). It looks like it only supports llamas OOTB, but it also seems to use llama.cpp (via a Go adapter) on the backend. It seemed to be CPU-only on my MBA, but I didn't poke around too much and it's brand new, so we'll see.

Anyone on HN should probably be looking at https://github.com/ggerganov/llama.cpp and https://github.com/ggerganov/ggml directly. If you have a high-end Nvidia consumer card (3090/4090), I'd highly recommend looking into https://github.com/turboderp/exllama

For those generally confused, the r/LocalLLaMA wiki is a good place to start: https://www.reddit.com/r/LocalLLaMA/wiki/guide/

I've also been porting my own notes into a single location that tracks models, evals, and has guides focused on local models: https://llm-tracker.info/


The gold standard of local-only model inference for LLaMA, Alpaca, and friends is llama.cpp (https://github.com/ggerganov/llama.cpp). No dependencies, no GPU needed; just point it to a model snapshot that you download separately over BitTorrent. Simple CLI tools that are (somewhat) usable from shell scripts.
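
The basic flow is something like this (the model path is just an example - point it at whichever quantized GGML file you grabbed, and check `./main --help` since the flags move around):

    # build - no dependencies beyond a C/C++ toolchain
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp && make

    # run it against a quantized model file you downloaded separately
    ./main -m ./models/7B/ggml-model-q4_0.bin \
      -p "Explain quantization in one sentence." -n 128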

Hoping they add support for llama 2 soon!


The real gold standard is https://github.com/oobabooga/text-generation-webui

Which includes the llama.cpp backend, and a lot more.

Unfortunately, despite claiming to be the "Automatic1111" of text generation, it doesn't support any of the prompt engineering capabilities (e.g. negative prompts, prompt weights, prompt blending, etc.) available in Automatic1111, despite the fact that they're not difficult to implement - https://gist.github.com/Hellisotherpeople/45c619ee22aac6865c...

Luckily for Ooga Booga, no one else supports them either. Why that is, I have no explanation, except that the NLP community doesn't know jack about prompt engineering, which is Kafkaesque.


> Hoping they add support for llama 2 soon!

The 7B and 13B models use the same architecture, so llama.cpp already supports it. (Source: running 13b-chat myself using llama.cpp GPU offload.) It's only 70B that has additional extensions that llama.cpp would need to implement.


Using their default CLI tools from a shell script is sadly a little bit tricky.

I opened a feature request a while back suggesting they add a --json mode to make that easier; it hasn't gained much traction though: https://github.com/ggerganov/llama.cpp/issues/1739
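
In the meantime the least-bad approach I've found is capturing stdout and stripping the echoed prompt myself, which is pretty fragile - roughly this (paths and flags are illustrative):

    # generation goes to stdout, most logging to stderr (at least in current builds)
    prompt="Three names for a pet pelican"
    result=$(./main -m ./models/7B/ggml-model-q4_0.bin -p "$prompt" -n 64 2>/dev/null)

    # the prompt is usually echoed back at the start of the output, so strip it
    echo "${result#"$prompt"}"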


Llama 2 7B and 13B GGML models are up and work with existing llama.cpp - no changes needed! The 70B does require a llama.cpp change, but I'm sure it won't take long.


Are the system requirements for these known yet?


My LLM tool can be used for both. That's what the plugins are for.

It can talk to OpenAI, PaLM 2, and Llama / other models on Replicate via API, using API keys - the OpenAI support is in the default LLM tool, PaLM 2 uses https://github.com/simonw/llm-palm, and Replicate uses https://github.com/simonw/llm-replicate

It can run local models on your own machine using these two plugins: https://github.com/simonw/llm-gpt4all and https://github.com/simonw/llm-mpt30b
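
The local path looks roughly like this - the exact model IDs come from whatever "llm models list" reports once a plugin is installed:

    pip install llm
    llm install llm-gpt4all

    # see which models the plugin registered (they download on first use)
    llm models list

    # then run one by name - the ID below is just an example from that list
    llm -m orca-mini-3b "Ten fun names for a pet pelican"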


Do you have any thoughts on how I can make this more obvious?

It's covered by the documentation for the individual plugins, but I want to make it as easy as possible for people to understand what's going on when they first start using the tool.


It's laziness on my part. I read only your blog post. Had I clicked through to the tool, it says so clearly at the top. My apologies.

I'm very grateful for all the work and writing you are doing about LLMs.

Regarding your note about JSON mode with llama.cpp, I'm writing a wrapper for it in my katarismo project. It's basically the stdout suggestion from that comment, but it's working really well for me when I use it with PocketBase.

https://gitlab.p.katarismo.com/katarismo/backend


Perhaps something as simple as stating it was first built around OpenAI models and later expanded to local via plugins?

I've been meaning to ask you: have you seen/used MS Guidance[0] 'language' at all? I don't know if it's the right abstraction to interface as a plugin with what you've got in the llm CLI, but there's a lot about Guidance that seems incredibly useful for local inference (token healing and acceleration especially).

[0] https://github.com/microsoft/guidance


Yeah, I looked at Guidance and I have to admit I don't fully get it - my main problem is that I can't look at one of their Handlebars templates and figure out exactly what LLM prompts it's going to fire and in what order they will be sent.

I'm much happier with a very thin wrapper where I can explicitly see exactly what prompts are processed when, and where prompts are assembled using very simple string manipulation.

I'm thinking I may pull the OpenAI stuff out of LLM core and make that a plugin as well - that way it will be VERY obvious when you install the tool that you get to pick which LLMs you're going to work with.


I’m in the same boat.

With text-to-image models I pick out software of my choice and drop a downloaded model into an input directory, then launch a GUI.

What is it about LLMs that makes actually being able to run them locally so complex in comparison?

I’m genuinely curious as a layman.


Nothing in particular. Setting up a scientific Python + PyTorch stack is difficult if you're unfamiliar with the Python packaging ecosystem.

If you're not on a "happy path" of "Ubuntu, Nvidia, and Anaconda," then lots of things can go wrong if you're configuring from scratch. If you want to run these models efficiently, hardware acceleration is a must, but managing the intersection of {GPU, operating system, architecture, Python version, Python virtualenv location} is tricky.

That's even before you deal with hardware-specific implementation details!

- Running Nvidia? Double-check your CUDA library version, kernel module version, (optionally) cuDNN, and your PyTorch version

- Running AMD? Double-check your ROCm library version and AMD drivers, and make sure you use the PyTorch build provided by AMD with ROCm support

- On Apple machines? Double-check that your M1 hardware actually has proper hardware support, then download and install a custom PyTorch distribution linked with M1 support. Make sure your numpy build has been properly linked against Accelerate.framework, or else your BLAS calls will run on the CPU rather than the undocumented AMX coprocessor. If you want to run on the ANE, you'll additionally need a working Xcode toolchain and a version of the CoreML model compiler that can read your serialized PyTorch model files properly. (A quick check covering the cases above is sketched below.)
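
A quick way to see which of those backends your PyTorch install can actually use (assuming PyTorch itself is already importable; the MPS check needs a reasonably recent version):

    python -c "import torch; print('torch', torch.__version__)"
    # Nvidia CUDA - the ROCm builds of PyTorch also report through this
    python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
    # Apple Silicon GPU (Metal Performance Shaders backend)
    python -c "import torch; print('MPS available:', torch.backends.mps.is_available())"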

I think the pain of getting things working makes it easier to just throw up one's hands and pay someone else to run your model for you.


It is wild to me how hard it is to run these things GPU-accelerated on an Apple M1/M2.

The hardware support should be amazing for this, given that the CPU and GPU share the same RAM.

I mostly end up running the CPU versions and grumbling about how slow they are.


> The hardware support should be amazing for this

I mean, caveat emptor; CoreML has been barebones for years, and Apple isn't exactly known for a huge commitment to third-party APIs. The writing was on the wall with how Metal was rolled out and how fast OpenCL got dropped, so it honestly doesn't surprise me at all at this point. Even the current Apple Silicon support in llama.cpp is fudged with NEON and Metal shaders instead of Apple's "Neural Engine".


GPU acceleration is pretty easy with llama.cpp: you just run make with an extra flag and then pass an argument or two at runtime.
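
For example, on Apple Silicon it's roughly this (double-check the README for the current flag names - they've been changing):

    # build with Metal support
    LLAMA_METAL=1 make

    # -ngl offloads layers to the GPU; with Metal, 1 is enough to turn it on
    ./main -m ./models/13B/ggml-model-q4_0.bin -ngl 1 -p "Hello"

    # on Nvidia it's a different build flag, e.g.
    # LLAMA_CUBLAS=1 make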


Adrien Brault on Twitter gave me this recipe, which worked perfectly: https://gist.github.com/adrienbrault/b76631c56c736def9bc1bc2...


Therein lies the problem.

I want to write instructions for people to use my software that don't expect them to know how to run make, or even to have a C compiler installed.


Then you more or less need a GUI like the one OpenAI built for ChatGPT, so you control the whole environment. Even setting up LLM via Homebrew required me to do the whole install twice because of some arcane error.


I think I can fix that by shipping a bottle release - "brew install simonw/llm/llm" currently attempts to compile Pydantic's Rust extensions, which means the install runs incredibly slowly.

I built a bottle for it which installs much faster, but you currently have to download the file from https://static.simonwillison.net/static/2023/llm--0.5.arm64_... - I've not yet figured out how to get that to install when the user runs "brew install simonw/llm/llm" - issue here: https://github.com/simonw/llm/issues/102


In general the default LLM models are too big to run on a consumer GPU without quantization.

That means additional steps or alternative downloads are needed.

Also, generative image models are about six months further along than generative text models, because Stable Diffusion came out before LLaMA (LLaMA being the first "useful" smaller large language model).


We know that GPT-4 is the king at this time, so only researchers, developers, and enthusiasts set up local LLMs, and they don't need an easier way to do it. For image generation, local Stable Diffusion is the king of certain big genres, and many people, including PC newbies, try it, so there are many guides.


The best place to start is the localllama subreddit.

The install guide describes text-generation-webui, a generic frontend for multiple ways of running models locally, and llama.cpp, a way of running llama-derived models locally on the CPU from the command line.

https://www.reddit.com/r/LocalLLaMA/wiki/guide/

The bundled models that oobabooga offers during install are probably not what you want. To find good models, the LocalLLaMA wiki has a handy models page. GPTQ models run only on GPU, while GGML models run on CPU or GPU through llama.cpp (directly or via the textgen webui).

https://www.reddit.com/r/LocalLLaMA/wiki/models/


Simon's article didn't show local usage.

Use one of the one-click installers linked in the README of

https://github.com/oobabooga/text-generation-webui

and you're set.

Note that if you have the hardware necessary to run the biggest available model, llama2-70b (for example two RTX 3090s with a total of 48GB of VRAM), there is currently a small bug (with a fix) documented at https://github.com/oobabooga/text-generation-webui/issues/32...


Idk, to me this reads pretty clearly as remote (for this model in particular):

> My LLM tool provides command-line access to a wide variety of language models, both via web APIs and self-hosted on your own machine ... The brand new llm-replicate plugin provides CLI access to models hosted on Replicate, and this morning a16z-infra released a16z-infra/llama13b-v2-chat which provides Replicate API access to the new Llama 2 13B chat model.

I'm also unsure what this would have to do with the license - when you install any other software, do you expect the license to tell you whether processing happens locally or in the cloud?


More about my LLM tool (and Python library) here: https://llm.datasette.io/

Here's the full implementation of that llm-replicate plugin: https://github.com/simonw/llm-replicate/blob/0.2/llm_replica...

If you want to write a plugin for some other LLM I have a detailed tutorial here: https://llm.datasette.io/en/stable/plugins/tutorial-model-pl... - plus a bunch of examples linked from here: https://github.com/simonw/llm-plugins


Can you or anyone else comment on how Replicate's per-second pricing ends up comparing to OpenAI's per-token pricing when using Llama 2?


My hunch is that OpenAI is a lot cheaper. I've spent $0.26 on 115 seconds of compute with Llama 2 on Replicate so far, which is only a dozen test prompts.


It is insanely more expensive on Replicate, and they don't have the 70B model yet, which will make it even more prohibitive.


Looks like it's here now: https://replicate.com/replicate/llama70b-v2-chat

As for pricing, that model's page says: "Predictions run on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 17 seconds."

And the pricing page (https://replicate.com/pricing) says Nvidia A100 (80GB) GPU hardware costs $0.0032 per second.

So Llama 2 70B would "typically" cost under 17 x 0.0032 = $0.0544 per run.


Thank you for checking that.


I am not familiar with Replicate, but based on their website, they charge per GPU type. I didn't see the GPU type set in the example. Is it baked in as part of the "a16z-infra/llama13b-v2-chat" model?


There's info about that here: https://replicate.com/a16z-infra/llama13b-v2-chat

> Run time and cost

> Predictions run on Nvidia A100 (40GB) GPU hardware. Predictions typically complete within 9 seconds.


I feel like Simon's been on a tear with these LLM postings. Simon, I'm really enjoying you swashbuckling through this and then documenting your travels.


Does Simon sleep? He's unstoppable!


I kind of wonder the opposite.

> created: October 29, 2007

> karma: 40256

Is he not potentially in the category of "starved for online HackerNews attention"?

Why does he feel the need to maintain a blog and libraries, etc. etc. and then submit it and let us all know about it?


Technology is fun! Sharing with others is fun! Why do any of us do anything?


Because we know we learn from others, so we offer something back. Constructive reciprocity.

Not dissimilar from sharing code.


This would be a solid point if he kept putting out nonsense or useless-but-catchy things, but he keeps putting out pretty concise, useful things. Useful tech things are pretty popular here, for good reason.


This works too now for the 70b model:

    llm replicate add \
      replicate/llama70b-v2-chat \
      --chat --alias llama70b
Then:

    llm -m llama70b "Invent an absurd ice cream sundae"


I just downloaded llama2 - what more do I need to access it from the command line than the code and the weights?



