
3 billion parameters. Does that mean I will be able to run it on an 8GB consumer GPU?



Probably not out of the box, but if some of the local deep learning wizards get a quantized version working well and optimize it a bit, definitely.
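One route (just a sketch, assuming the repo's custom model code works with the transformers/bitsandbytes int8 path) would be an 8-bit load:

    # Hypothetical 8-bit load of replit-code-v1-3b; needs bitsandbytes and accelerate,
    # and assumes the model's custom code is compatible with int8 weight loading.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "replit/replit-code-v1-3b",
        trust_remote_code=True,   # the repo ships its own model code
        load_in_8bit=True,        # quantize weights to int8 at load time
        device_map="auto",        # place layers on the available GPU
    )

int8 roughly halves the fp16 footprint, so the weights alone should land well under 4GB.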


Means that once it's incorporated into llama.cpp, you can run it on your laptop.


Hopefully on phones too


No, I could only get 2.7B to run on 8GB VRAM, unfortunately.


it is 2.7B
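Weights-only back-of-the-envelope math (my own rough numbers, ignoring activations and CUDA context) says a 2.7B-parameter model in fp16 should just about fit in 8GB:

    # Rough VRAM estimate for a 2.7B-parameter model, fp16 weights only.
    params = 2.7e9
    bytes_per_param = 2                      # fp16 / bf16
    print(params * bytes_per_param / 2**30)  # ~5.0 GiB, plus activations and CUDA overhead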




their pytorch_model.bin is 10.4GB


I just loaded this on my laptop's RTX 3070 GPU by following the instructions here: https://huggingface.co/replit/replit-code-v1-3b

I don't know how I can test the model, but it seems loading worked. When I run `nvidia-smi` in another terminal, I see `5188MiB / 8192MiB` in the memory-usage column.
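For anyone else trying it, the load boils down to something like this (a sketch based on my reading of the model card; I'm assuming the bfloat16 cast is what keeps it around 5GB instead of the 10.4GB fp32 checkpoint size):

    # Sketch: load replit-code-v1-3b and move it to the GPU in bfloat16.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "replit/replit-code-v1-3b"
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    model.to(device="cuda:0", dtype=torch.bfloat16)  # half precision halves the fp32 footprint
    model.eval()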


You can load it but you can't run inference? What's the issue?


No issue, I'm simply unfamiliar with Python machine learning APIs.

I managed to run inference locally by installing the requirements and running app.py from the demo: https://huggingface.co/spaces/replit/replit-code-v1-3b-demo/...

It is very fast on my RTX 3070; VRAM usage goes up to ~6.3GB during inference.
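If you want to skip the demo app, a minimal generation call (continuing from the loading sketch above; the prompt and sampling settings are arbitrary placeholders) looks roughly like this:

    # Generate a short completion; model.generate runs without gradients by default.
    inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to("cuda:0")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.2,
        top_p=0.95,
    )
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))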



