Run ChatGPT-like LLMs on your laptop in 3 lines of code (github.com/amaiya)
150 points by amaiya on Sept 6, 2023 | 33 comments
I've been playing around with https://github.com/imartinez/privateGPT and wanted to create a simple Python package that made it easier to run ChatGPT-like LLMs on your own machine, use them with non-public data, and integrate them into practical GPU-accelerated applications.

This resulted in a Python package I call OnPrem.LLM.

In the documentation, there are examples for how to use it for information extraction, text generation, retrieval-augmented generation (i.e., chatting with documents on your computer), and text-to-code generation: https://amaiya.github.io/onprem/
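
To give a flavor of the interface, here's a minimal sketch of the three-line usage (the model URL below is just a placeholder for any supported GGML model on Hugging Face, and `n_gpu_layers` is the optional GPU-offloading knob; see the docs for exact arguments and defaults):

    from onprem import LLM
    llm = LLM('<URL to a GGML model on Hugging Face>', n_gpu_layers=43)  # n_gpu_layers is optional
    llm.prompt('List three cute names for a cat.')

For chatting with your own documents, there's an additional ingest() step for loading them before you prompt; the docs linked above cover that workflow.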

Enjoy!




To save you some time: if you have a MacBook Pro M1/M2 with 32GB of RAM (as I presume a lot of HN folks do), you can comfortably run the `34B` models on CPU or GPU.

And... if you'd like a more hands-on approach, here is a manual way to get LLaMA running locally:

    - https://github.com/ggerganov/llama.cpp 
    - follow instructions to build it (note the `METAL` flag)
    - https://huggingface.co/models?sort=trending&search=gguf
    - pick any `gguf` model that tickles your fancy, download instructions will be there
and a little command like this will get it running swimmingly:

   ./main -m ./models/<file>.gguf --color --keep -1 -n -1 -ngl 32 --repeat_penalty 1.1 -i -ins
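   # flag legend: -m = model file; -ngl 32 = offload 32 layers to the GPU;
   # -n -1 = no limit on generated tokens; --keep -1 = keep the entire initial prompt in context;
   # -i -ins = interactive instruct mode; --repeat_penalty 1.1 = discourage repetition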
Enjoy the next hours of digging through flags and the wonderful pit of time ahead of you.

NOTE: I'm new at this stuff, feedback welcome.


With an M1 or M2 Max with 64 GB of RAM or more, you can run the original 65B LLaMA model from Facebook using llama.cpp.

Here is the initial output of running LLaMA 65B, in a gist:

https://gist.github.com/zitterbewegung/4787e42617aa0be6019c3...


> GGUF

GGUF is a new format introduced by the llama.cpp team on August 21st, 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. The key benefit of GGUF is that it is an extensible, future-proof format that stores more information about the model as metadata.


You can do the first step a lot faster with nix: `nix shell github:ggerganov/llama.cpp -c llama`


Note that the OP repo doesn't yet support GGUF format.



Yes it does. Or do you mean the OP's github repo?


Yeah I was referring to OP, oops

>We currently support models in GGML format. However, the GGML format has now been superseded by GGUF.

>Future versions of OnPrem.LLM will use the newer GGUF format.


Love how simple of an interface this has. Local LLM tooling can be super daunting, but reducing it to a simple ingest() and then prompt() is really neat.

By chance, have you checked out Ollama (https://github.com/jmorganca/ollama) as a way to run the models like Llama 2 under the hood?

One of the goals of the project is to make it easy to download and run GPU-accelerated models, ideally with everything pre-compiled so it's easy to get up and running. It has an API that can be used by tools like this – would love to know if it would be helpful (or not!)

There's a LangChain model integration for it and a PrivateGPT example as well that might be a good pointer on using the LangChain integration: https://github.com/jmorganca/ollama/tree/main/examples/priva.... There's also a LangChain PR open to add support for generating embeddings, although there's a bit more work to do to support the major embedding models.
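
If it helps, the LangChain side of it is roughly this (model name and port are the defaults; assumes you've already run `ollama pull llama2` and the Ollama server is running):

    from langchain.llms import Ollama
    llm = Ollama(base_url='http://localhost:11434', model='llama2')
    print(llm('Why is the sky blue?'))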

Best of luck with the project!


I learned about ollama here on HN, and have found that to be super easy. Worth a look to compare with this one if you are looking to run LLMs locally.


link?


https://ollama.ai/

There's also plenty of other local LLM type tools like GPT4All, LMStudio, Simon Wilsons LLM, privateGPT, and about a million other setups.


Just yesterday, I found out that Simon Wilson's name is Simon Willison.



What's the difference with OP?


We have come far; not too long ago it was a 'sudoku solver in 5 lines of bash'!

Lol, but for real: today, for the first time, I found myself browsing new laptops looking for high VRAM because of LLMs.


Related: say I've written code that uses the OpenAI API, plus code that handles streaming, retries, and function calls. Now I want to switch it to a local, non-API-based model such as Llama 2 without changing too much code.

Is there a library that offers a layer on top of local models that simulates the OpenAI API?



Thanks, this was hard to find


Ollama.ai is pretty good too; any differences from this one?

Seems like all of these open-source wrappers, just like the closed-source ones, are in a race to the bottom.


I work on Ollama. It's a good question since there are quite a few tools emerging in this space.

The focus for Ollama is to make downloading and serving a model easy – there's an included `ollama` CLI but it's all powered by a REST API. Hopefully, it's a way to support really cool applications of LLMs like OP's onprem tool.

OP's tool is more focused on ingesting and analyzing data. There seems to be quite a bit of interesting opportunity in that kind of LLM application – e.g. analyzing not only local docs but also data in a remote data store.
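
As a rough sketch, calling that REST API directly from Python looks something like this (default port, llama2 already pulled; responses stream back as JSON lines):

    import json, requests
    # stream a completion from a locally running Ollama server
    resp = requests.post('http://localhost:11434/api/generate',
                         json={'model': 'llama2', 'prompt': 'Why is the sky blue?'},
                         stream=True)
    for line in resp.iter_lines():
        if line:
            print(json.loads(line).get('response', ''), end='', flush=True)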


I've built a couple projects that use Ollama. Thanks for making such a cool tool!


What is the most effective local LLM?


Totally depends on your use case. There are models optimized for stories, code, knowledge...


How big are these models that are downloaded? Is 7B 7 gigabytes of data?


If the model is stored as 32-bit floats, it will be ~4 bytes per parameter, so roughly 4x the parameter count in bytes.

You can see the actual size from hugging face. For example, https://huggingface.co/WizardLM/WizardLM-7B-V1.0/tree/main

Size with quantization, https://github.com/ggerganov/llama.cpp#quantization


It's actually very straightforward.

7B = 7 billion parameters.

If 1 parameter takes 1 byte, then a 7B model would be about 7GB in size.

Usually 1 parameter takes 4 bytes though (a 32-bit float), so it would be about 28GB.

You can also use 16-bit floats, or 8-bit and even 4-bit quantized weights, to shrink that.
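
A quick back-of-the-envelope sketch (ignoring the small overhead that quantization formats add for scales and metadata):

    # approximate size of a 7B-parameter model at different precisions
    params = 7e9
    for bits in (32, 16, 8, 4):
        print(f'{bits:>2}-bit: ~{params * bits / 8 / 1e9:.1f} GB')
    # -> 32-bit: ~28.0 GB, 16-bit: ~14.0 GB, 8-bit: ~7.0 GB, 4-bit: ~3.5 GB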


wow, this looks approachable. will have to try it tomorrow


Wonderful. I love the advent of open source LLMs, and love the turnkey nature of this product.

What sold me on ChatGPT was its efficacy combined with its ease of use. As the owner of a consultancy, I find time to do technical exploration to be more and more scarce - stuff like this that makes it super easy for me to run an LLM is most welcome.


It's a little bit ironic that the package is called "onprem" but the second line imports an external model from huggingface...

    from onprem import LLM
    url = 'https://huggingface.co/TheBloke/CodeUp-Llama.....'
    llm = LLM(url, n_gpu_layers=43) # see below for GPU information
Anyway looks like a great little project, nice work!


I'm not the OP, but I believe it is correct. It says 'run' an LLM, meaning you still have to get the model from somewhere.

Training your own from nothing is a monumental task; I don't think many of us can realistically do it from scratch.


Code that runs on-prem doesn't magically show up on your server; it still has to be fetched from somewhere.


It's still running on-prem if the execution happens on compute hardware that you control. If importing code from somewhere else disqualified something from being on-prem, then no application would ever be on-prem - when's the last time a meaningful application had no dependencies!?



