I'm similarly skeptical, but that said I'm running 30B parameter LLMs on my 32GB M1 MacBook Pro every day now. The trick is quantising them down to 4 (or even 3) bits, which massively reduces the memory requirements. Have a look at [1]
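For a rough sense of why the quantisation makes this feasible, here's a back-of-envelope sketch (weights only - it ignores the KV cache, activations and per-block scale overhead, so real usage runs a bit higher):

    // Approximate weight memory for a ~30B-parameter model at various precisions.
    #include <cstdio>

    int main() {
        const double params = 30e9;                       // ~30B weights
        const double levels[] = {16.0, 8.0, 4.0, 3.0};    // fp16 and common quant levels
        for (double bits : levels) {
            double gib = params * bits / 8.0 / (1 << 30);
            std::printf("%2.0f-bit: ~%.0f GiB\n", bits, gib);
        }
    }

At 4 bits the weights of a 30B model come in around 14 GiB (vs ~56 GiB at fp16), which is why it fits - tightly - alongside everything else in 32GB.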
The devs working on llama.cpp have been discussing ways to further reduce the memory requirements by mmapping the large weights files (I thought LLMs mutated the weights as they run inference, but they clearly know more than me about the internals), bringing it within reach of phone memory.
So, iPhones are not as far off the computational capacity to run these models as you'd think. Memory (and to a greater extent, battery and cooling) are the limiting factors. iPads even less so, given they run M1 chips and have much larger batteries & much more RAM
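For anyone wondering what the mmap idea looks like, here's a minimal POSIX sketch (not llama.cpp's actual code, just the general technique): because inference only reads the weights, the file can be mapped read-only, the kernel pages it in on demand, and clean pages can be dropped under memory pressure instead of swapped.

    // Minimal sketch: map a weights file read-only instead of read()ing it into
    // malloc'd memory. Pages stay file-backed, are loaded lazily, and can be
    // shared between processes using the same model file.
    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc < 2) { std::fprintf(stderr, "usage: %s <weights-file>\n", argv[0]); return 1; }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
        // PROT_READ + MAP_SHARED: read-only, file-backed pages.
        void *weights = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (weights == MAP_FAILED) { perror("mmap"); return 1; }
        std::printf("mapped %lld bytes at %p\n", (long long)st.st_size, weights);
        // ... tensor views would point directly into this mapping ...
        munmap(weights, st.st_size);
        close(fd);
        return 0;
    }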
Off-topic, but for what purpose are you running LLMs locally (especially every day)? My understanding was that the prompting required to make them work at all was too great.
A little bit of research, a little bit of actually useful tasks - I'm interested in summarisation, which Alpaca is decent at (even compared to the existing summarisation-specific models I've tried)
My other motivation is making sure I understand what offline LLMs can do... while I use GPT-3 and 4 extensively, I don't want to send something over the wire if I don't have to (e.g. if I can summarise e-mails locally, I'd rather do that than send them to OpenAI).
It's also surprisingly good at defining things if I'm somewhere with no internet connectivity and want to look something up (although obviously that's not really what it's good at & hallucination risks abound)
On alpaca, I've found "Below is an instruction that describes a task. Write a response that appropriately completes the request. Summarise the following text: " or "Give me a 5 word summary of the following: " to work fairly well using the 30B weights.
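If it helps, this is roughly how I use those prompts - a hypothetical little helper (names made up, nothing Alpaca-specific beyond the prompt text) that wraps a document before handing the string to whatever local runner you use:

    // Hypothetical helper: wrap a document in one of the instruction-style
    // prompts quoted above.
    #include <string>

    std::string make_summary_prompt(const std::string &text, int max_words = 0) {
        std::string prompt =
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request. ";
        if (max_words > 0)
            prompt += "Give me a " + std::to_string(max_words) + " word summary of the following: ";
        else
            prompt += "Summarise the following text: ";
        return prompt + text;
    }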
It's certainly nowhere close to the quality of OpenAI summarisation, just better than what I previously had locally (e.g. when summarising a family history project with transcripts of old letters, gpt-3.5-turbo was able to accurately read between the lines of an original poem, which I found amazing).
I half wonder if the change in spelling from US -> UK makes a difference...
I'd run a test on that, but I've just broken my Alpaca setup for longer prompts (I switched to mainline llama.cpp, which required a model conversion & some code changes, and it's no longer allocating enough memory)
Slightly off topic, but are you running into limits with 32GB of RAM that the 64GB model would meaningfully help with? Do you wish you had one of the larger-RAM models?
I've been pretty happy with 32GB, but the 30B models do push near the limits. I don't see a big difference in quality between 65B (running on a 64GB x86 host) and 30B on the M1 (although that may be down to the 4-bit quantisation, so take that with a grain of salt). I'm just glad that I have it on an M1... I have a 3080 in my PC, but when I got that I was thinking more of Stable Diffusion and YOLO tasks than LLMs, and it just doesn't have the VRAM for LLMs.
Alpaca seems like it could be significantly improved with better training (some of the old training data was truncated), so I think there's a decent amount of improvement to be had at the current model size.
In the future though... what would really be a meaningful change is a larger context size - the 8k token context of GPT-4 was a big improvement for my uses... I would guess a future local LLM with a larger context would exceed 32GB, but that's speculation beyond my expertise; I don't know how context size and network size scale.
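One part of it is easy to estimate, though: the weights don't grow with context length, but the key/value cache does, linearly. A rough sketch using LLaMA-30B-ish dimensions (60 layers, 6656 hidden dim - treat the numbers as approximate, and it ignores the attention compute cost, which grows faster):

    // Rough KV-cache size vs. context length for a 30B-class model.
    // cache bytes ~= 2 (K and V) * n_layers * n_ctx * n_embd * bytes_per_element
    #include <cstdio>

    int main() {
        const double n_layers = 60, n_embd = 6656, bytes = 2;  // f16 cache entries
        const int ctx_sizes[] = {2048, 4096, 8192, 32768};
        for (int n_ctx : ctx_sizes) {
            double gib = 2.0 * n_layers * n_ctx * n_embd * bytes / (1 << 30);
            std::printf("ctx %6d: KV cache ~%.1f GiB\n", n_ctx, gib);
        }
    }

So an 8k context on a 30B model would add something like 12 GiB of cache on top of the quantised weights, which does start to squeeze 32GB.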
If it were a PC I'd say go for 64GB, but it's hard to recommend that given how much Apple charge for RAM upgrades. On my next upgrade (2+ years' time, hopefully) I'll likely opt for 64GB+ though
Yeah, it is expensive. My other strong consideration is battery life, since DRAM is always running; going from 32 to 64 would be a hit to battery life regardless of workload, but hard to say exactly how big of a hit.
I'm curious, which configuration of the M1 MBP do you have?
I went for the 16" with the M1 Max w/32 GPU cores and 1TB SSD (500GB free; I offload most large files to my NAS/iCloud). On the added power usage, my understanding is that it's less of a concern due to using LPDDR5?
The only drawbacks I've found with the M1 Max model are that the added weight from the bigger heatsink makes it a hair heavier than I'd like when picking it up one-handed at the front while open... and that in the winter the case is cold no matter what you're running - I used to love that my Intel MBP acted as a mini leg warmer :-)
[1] https://arxiv.org/abs/2210.17323