Run ChatGPT-like LLMs on your laptop in 3 lines of code (github.com/amaiya)
150 points by amaiya on Sept 6, 2023 | 33 comments
I've been playing around with https://github.com/imartinez/privateGPT and wanted to create a simple Python package that made it easier to run ChatGPT-like LLMs on your own machine, use them with non-public data, and integrate them into practical GPU-accelerated applications.

This resulted in a Python package I call OnPrem.LLM.

In the documentation, there are examples for how to use it for information extraction, text generation, retrieval-augmented generation (i.e., chatting with documents on your computer), and text-to-code generation: https://amaiya.github.io/onprem/
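
To give a flavor of the interface, here's a minimal sketch of the three-line usage (the model URL below is just a placeholder for any supported GGML model on Hugging Face, and `n_gpu_layers` is the optional GPU-offloading knob; see the docs for exact arguments and defaults):

    from onprem import LLM
    llm = LLM('<URL to a GGML model on Hugging Face>', n_gpu_layers=43)  # n_gpu_layers is optional
    llm.prompt('List three cute names for a cat.')

For chatting with your own documents, there's an additional ingest() step for loading them before you prompt; the docs linked above cover that workflow.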

Enjoy!




To save you some time: if you have a MacBook Pro M1/M2 with 32GB of RAM (as I presume a lot of HN folks do), you can comfortably run the `34B` models on CPU or GPU.

And... if you'd like a more hands-on approach, here is a manual way to get LLaMA running locally:

    - https://github.com/ggerganov/llama.cpp 
    - follow instructions to build it (note the `METAL` flag)
    - https://huggingface.co/models?sort=trending&search=gguf
    - pick any `gguf` model that tickles your fancy, download instructions will be there
and a little command like this will get it running swimmingly:

   ./main -m ./models/<file>.gguf --color --keep -1 -n -1 -ngl 32 --repeat_penalty 1.1 -i -ins
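   # flag legend: -m = model file; -ngl 32 = offload 32 layers to the GPU;
   # -n -1 = no limit on generated tokens; --keep -1 = keep the entire initial prompt in context;
   # -i -ins = interactive instruct mode; --repeat_penalty 1.1 = discourage repetition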
Enjoy the next hours of digging through flags and the wonderful pit of time ahead of you.

NOTE: I'm new at this stuff, feedback welcome.


With an M1 or M2 Max with 64 GB of RAM or more, you can run the original 65B LLaMA model from Facebook using llama.cpp.

Here is the initial output of running LLaMA 65B, in a gist:

https://gist.github.com/zitterbewegung/4787e42617aa0be6019c3...


> GGUF

GGUF is a new format introduced by the llama.cpp team on August 21st, 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. The key benefit of GGUF is that it is an extensible, future-proof format that stores more information about the model as metadata.


You can do the first step a lot faster with nix: `nix shell github:ggerganov/llama.cpp -c llama`


Note that the OP repo doesn't yet support GGUF format.



Yes it does. Or do you mean the OP's github repo?


Yeah I was referring to OP, oops

>We currently support models in GGML format. However, the GGML format has now been superseded by GGUF.

>Future versions of OnPrem.LLM will use the newer GGUF format.


Love how simple of an interface this has. Local LLM tooling can be super daunting, but reducing it to a simple ingest() and then prompt() is really neat.

By chance, have you checked out Ollama (https://github.com/jmorganca/ollama) as a way to run the models like Llama 2 under the hood?

One of the goals of the project is to make it easy to download and run GPU-accelerated models, ideally with everything pre-compiled so it's easy to get up and running. It has an API that can be used by tools like this – would love to know if it would be helpful (or not!)

There's a LangChain model integration for it and a PrivateGPT example as well that might be a good pointer on using the LangChain integration: https://github.com/jmorganca/ollama/tree/main/examples/priva.... There's also a LangChain PR open to add support for generating embeddings, although there's a bit more work to do to support the major embedding models.
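
If it helps, the LangChain side of it is roughly this (model name and port are the defaults; assumes you've already run `ollama pull llama2` and the Ollama server is running):

    from langchain.llms import Ollama
    llm = Ollama(base_url='http://localhost:11434', model='llama2')
    print(llm('Why is the sky blue?'))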

Best of luck with the project!


I learned about ollama here on HN, and have found that to be super easy. Worth a look to compare with this one if you are looking to run LLMs locally.


link?


https://ollama.ai/

There's also plenty of other local LLM type tools like GPT4All, LMStudio, Simon Wilsons LLM, privateGPT, and about a million other setups.


Just yesterday, I found out that Simon Wilson's name is Simon Willison.



What's the difference with OP?


We have come far; not too long ago it was a 'sudoku solver in 5 lines of bash'!

Lol, but for real: today, for the first time, I found myself browsing new laptops looking for high VRAM because of LLMs.


Related: say I've written code that uses the OpenAI API, plus code that handles streaming, retries, and function calls. Now I want to switch it to a local, non-API-based model such as Llama 2 without changing too much code.

Is there a library that offers a layer on top of local models that simulates the OpenAI API?



Thanks, this was hard to find


Ollama.ai is pretty good too; any differences from this one?

Seems like all of these open-source wrappers, just like the closed-source ones, are in a race to the bottom.


I work on Ollama. It's a good question since there are quite a few tools emerging in this space.

The focus for Ollama is to make downloading and serving a model easy – there's an included `ollama` CLI but it's all powered by a REST API. Hopefully, it's a way to support really cool applications of LLMs like OP's onprem tool.

OP's tool is more focused on ingesting and analyzing data. There seems to be quite a bit of interesting opportunity in that kind of LLM application – e.g. analyzing not only local docs but also data in a remote data store.
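
As a rough sketch, calling that REST API directly from Python looks something like this (default port, llama2 already pulled; responses stream back as JSON lines):

    import json, requests
    # stream a completion from a locally running Ollama server
    resp = requests.post('http://localhost:11434/api/generate',
                         json={'model': 'llama2', 'prompt': 'Why is the sky blue?'},
                         stream=True)
    for line in resp.iter_lines():
        if line:
            print(json.loads(line).get('response', ''), end='', flush=True)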


I've built a couple projects that use Ollama. Thanks for making such a cool tool!


What is the most effective local LLM?


Totally depends on your use case. There are models optimized for stories, code, knowledge...


How big are these models that are downloaded? Is 7B 7 gigabytes of data?


If the model is stored as 32-bit floats, it will be ~4 bytes per parameter, so roughly 4x the parameter count in bytes.

You can see the actual size from hugging face. For example, https://huggingface.co/WizardLM/WizardLM-7B-V1.0/tree/main

Size with quantization, https://github.com/ggerganov/llama.cpp#quantization


It's actually very straightforward.

7B = 7 billion parameters.

If 1 parameter takes 1 byte, then a 7B model would be about 7GB in size.

Usually 1 parameter takes 4 bytes though (a 32-bit float), so it would be about 28GB.

You can also use 16-bit floats, or 8-bit and even 4-bit quantized weights, to shrink that.
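
A quick back-of-the-envelope sketch (ignoring the small overhead that quantization formats add for scales and metadata):

    # approximate size of a 7B-parameter model at different precisions
    params = 7e9
    for bits in (32, 16, 8, 4):
        print(f'{bits:>2}-bit: ~{params * bits / 8 / 1e9:.1f} GB')
    # -> 32-bit: ~28.0 GB, 16-bit: ~14.0 GB, 8-bit: ~7.0 GB, 4-bit: ~3.5 GB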


wow, this looks approachable. will have to try it tomorrow


Wonderful. I love the advent of open source LLMs, and love the turnkey nature of this product.

What sold me on ChatGPT was its efficacy combined with its ease of use. As the owner of a consultancy, I find time to do technical exploration to be more and more scarce - stuff like this that makes it super easy for me to run an LLM is most welcome.


It's a little bit ironic that the package is called "onprem" but the second line imports an external model from huggingface...

    from onprem import LLM
    url = 'https://huggingface.co/TheBloke/CodeUp-Llama.....'
    llm = LLM(url, n_gpu_layers=43) # see below for GPU information
Anyway looks like a great little project, nice work!


I'm not the OP, but I believe it is correct. It says 'run' an LLM, meaning you still have to get the model from somewhere.

Training your own from nothing is a monumental task; I don't think many of us can realistically do it from scratch.


Code that runs on-prem doesn't magically show up on your server; it still has to be fetched from somewhere.


It's still running on-prem if the execution happens on compute hardware that you control. If importing code from somewhere else disqualified something from being on-prem, then no application would ever be on-prem - when's the last time a meaningful application had no dependencies!?



